
Figure out validation and well-formedness
Closed, Resolved · Public

Description

The function model assumed that we would have two levels of checking for ZObjects: well-formed and valid.

https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Function_model

It seems that this is not quite sufficient. In this task we try to figure out which quality-control conditions may be relevant, where they should be checked, and what to call them.

Well-formedness is a simple local test which is the same for every ZObject. It means that the ZObject can be serialized to a JSON representation that fulfills the syntax given here:

https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Function_model#Syntax

That doesn't mean that we need to actually serialize it and then check the grammar of the serialization. We can also do that check on the ZObject directly, by checking the keys, values, etc.
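As a rough illustration (not the actual implementation), such a structural check could look like this in Python, assuming the simplified rules from the linked syntax section (every ZObject is a string or an object with ZnKm/Kn keys and a Z1K1); arrays are omitted for brevity:

```python
import re

# Global (ZnKm) or local (Kn) key references, e.g. "Z1K1" or "K2".
KEY_PATTERN = re.compile(r'^(Z[1-9][0-9]*)?K[1-9][0-9]*$')


def is_well_formed(zobject):
    """Purely structural check: no look-ups into the wiki are needed."""
    if isinstance(zobject, str):
        # Any string is well-formed (a Z6/String or a reference).
        return True
    if isinstance(zobject, dict):
        if 'Z1K1' not in zobject:
            # Every record must declare its type.
            return False
        return all(
            KEY_PATTERN.match(key) and is_well_formed(value)
            for key, value in zobject.items()
        )
    # Anything else (numbers, booleans, null, ...) is not well-formed
    # under this simplified reading of the syntax.
    return False
```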

Valid: every ZObject has a Z1K1/type which points to a Z4/Type, which in turn has a Z4K3/validator that points to a Z8/Function. A ZObject is valid if the application of that function to the ZObject returns an empty list (i.e. returns no errors).

This is a potentially very heavy operation that requires plenty of look-ups. Also, it assumes that the core functions and types are all fine, or else we can't even run functions. That is why we need some inalienable truths, as defined by T260314 - these guarantee that we can actually run the validator.
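For illustration, the validity check itself could be sketched like this, where `resolve` (look up a ZObject by reference) and `call_function` (invoke a Z8/Function via the evaluator) are hypothetical helpers, not existing APIs:

```python
def is_valid(zobject):
    """A ZObject is valid if its type's validator reports no errors."""
    ztype = resolve(zobject['Z1K1'])      # look up the Z4/Type
    validator = resolve(ztype['Z4K3'])    # look up its Z8/Function validator
    errors = call_function(validator, zobject)
    return len(errors) == 0               # empty error list means valid
```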

So we need a term for this level of checking: the object conforms to the inalienable truths defined in T260314.

Strawman proposal: call this level "conformant" or "conforming".

Further possible levels of checking:

  • all references in a ZObject can be resolved (let's call this "linked")
  • Z1K1 has a value of type Z4, which in turn has a Z4K3 of type Z8 (which is a necessary precondition for validation); let's call this "checkable"

All valid ZObjects must be both checkable and linked.
All checkable ZObjects must be conforming.
All conforming ZObjects must be well-formed.
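These implications could be pictured as a ladder of checks. The following sketch treats them as a single ordering for simplicity (even though "linked" is strictly a separate dimension from "checkable"), and the individual predicates (`is_conforming`, `is_linked`, etc.) are assumed to exist elsewhere:

```python
def checking_level(zobject):
    """Return the highest level of checking that the object passes,
    reflecting the implications above: valid implies checkable and
    linked, checkable implies conforming, conforming implies
    well-formed."""
    if not is_well_formed(zobject):
        return 'ill-formed'
    if not is_conforming(zobject):
        return 'well-formed'
    if not (is_linked(zobject) and is_checkable(zobject)):
        return 'conforming'
    if not is_valid(zobject):
        return 'checkable and linked'
    return 'valid'
```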

For Z8/Functions and Z14/Implementations there is additionally the notion of Z20/Testers, and there should probably be a term for a Z14/Implementation that passes all Z20/Testers of the given Z8/Function it implements. Let's call that "tested". That is only relevant for Implementations. Suggestion: all tested Implementations must be valid (or, stated differently, only valid Implementations may be tested).
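Sketching this with hypothetical helpers `get_testers` and `run_tester` (neither is an existing API):

```python
def is_tested(implementation, function):
    """An implementation is 'tested' if it passes every Z20/Tester
    attached to the Z8/Function it implements."""
    return all(
        run_tester(tester, implementation)
        for tester in get_testers(function)
    )
```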

There are (at least) two major questions:

  1. when a ZObject is being edited (or created), what condition does the ZObject need to fulfil in order to be stored by the wiki?
  2. when a ZObject is being used in an evaluation, what conditions does the ZObject need to fulfil?

Let's quickly note down two features of each of the levels of checking described above: how fast and how self-contained is each level?

  • Well-formedness: fast and self-contained. Changes to other ZObjects in the wiki cannot impact the well-formedness of a given ZObject.
  • Conforming: fast and self-contained. Changes to other ZObjects in the wiki cannot impact the conformance of a given ZObject. This is only true because changes to other objects that would break the inalienable truths are not allowed.
  • Linked: fast and not self-contained. Referenced ZObjects may be deleted, which would turn a linked ZObject into a non-linked one, and creating ZObjects may turn a non-linked ZObject into a linked one.
  • Checkable: variable speed, but likely fast-ish, and not self-contained, as this depends on the value of Z1K1, which will often be a reference to a Z4/Type stored in the wiki. That Z4/Type may be deleted or may itself have an unlinked validator.
  • Valid: variable speed, potentially slow-ish, and not self-contained. The validator may be changed, rendering perfectly valid objects invalid and the other way around.
  • Tested: variable speed, potentially very slow, and not self-contained. The testers may change or be removed, causing previously tested implementations to fail the tests, and the other way around.

So, regarding the questions above:

  1. a ZObject being stored must be well-formed and conforming. It should probably be linked at the time of storing. It should be checkable at the time of storing. Should it be valid? (I would suggest yes, until the validation checks start taking so long that this becomes unfeasible, and then we revisit the issue). Do implementations need to be tested?

This seems to recommend that we need to have special pages that list unlinked, uncheckable, and invalid ZObjects, so that these can be maintained.

  2. it seems that a ZObject should only be evaluated if it is valid (otherwise implementations become super-defensive and hard to read and write). Now, doing the validation is itself a function call, so that is kind of a self-referential definition in that case.

We probably need some kind of pragmatic resolution of that knot. Something like: "yes, ZObjects must always be valid when being evaluated, but if the function call is the validation itself then this condition is waived". Given that the validation is done by the evaluation engine itself, it should be possible to resolve that issue inside the evaluation engine, without it calling itself in an infinite loop.
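One possible shape for that cut, sketched with made-up helper names (`arguments_of`, `run`, and the `is_validator_call` flag are illustrative only): the engine validates arguments before evaluating a call, but falls back to the cheaper, self-contained checks when the call being evaluated is itself the validation:

```python
def evaluate(function_call, is_validator_call=False):
    """Evaluate a function call after checking its arguments.

    When the call being evaluated *is* a validation, only the cheap,
    self-contained checks are applied, so that validation does not
    recursively demand validation of itself forever.
    """
    for argument in arguments_of(function_call):
        if is_validator_call:
            # Cheaper gate for the validator's own inputs.
            if not (is_well_formed(argument) and is_conforming(argument)):
                raise ValueError('argument fails basic checks')
        else:
            # is_valid() internally calls evaluate(validator_call,
            # is_validator_call=True), so this does not loop forever.
            if not is_valid(argument):
                raise ValueError('argument is not valid')
    return run(function_call)
```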

Thoughts?

Event Timeline

The Function Model syntax requires a Z1K1/type with a value that evaluates to a Z4/Type. Including this within the "well formed" definition seems to be at odds with the proposition that "Changes to other ZObjects in the wiki cannot impact the well-formedness of a given ZObject." Where the referenced Z4/Type requires evaluation (e.g. for a Z8/Function or a Z10/List), its value depends on the function it references and upon that function's dependencies etc. I suggest that an object could be called "well formed" so long as its Z1K1 is "well formed", meaning that it could be dependent on a function that cannot, in fact, be evaluated (including one that has never existed but excluding any that cannot exist).

Thinking out loud:

I've been a bit stuck on whether the validation requirement would impede programming-aware newcomers. I think if there is a default validator, whose only requirement is that the prior three conditions (well formed, conforming, linked) are met, then it doesn't impede them. I suspect that might then lead to more validation-type behavior being implemented directly in the function implementations (as noted for people writing code that's ultra defensive), if I'm understanding this correctly (and I may not be understanding it correctly). I forget, were we thinking to have a default validator? I guess I should verify: when we ask "Should it be valid?" in the Description for this task, is that mostly looking for satisfiability against all four conditions or is there some other unstated fifth condition?

Of course the "correct" thing to do would be to require a validator as a precondition as suggested.

So I wonder if part of the answer is to have a system default validator that's basically a no-op / returns an empty list (and that can be baked into the JSON schema for all functions, as opposed to being magical, depending on what we want to encourage?), and which can be replaced with a custom validator, maybe with some strong UI hint.

Of course, the other way around might be possible, and it might encourage use of the system in a way that's less prone to breakage later; it might also allow for some types of function discovery, plus maybe some performance gains in some contexts.

I don't know the answer, but I think that's the main stuff I've been struggling with. It may be that some recent code changes are addressing this, though, too.

Otherwise: the work lists idea is good. And the requirement that a function be "valid" before it gets executed seems reasonable as a design principle. It seems like the main question is about the validator function and whether to require it to be user supplied and, if not, whether to have a default validator (which basically returns an empty list if the prior conditions of well formed, conforming, and linked are met). I can see how it might influence certain behaviors in the system.
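For concreteness, the system default validator described above would be little more than this (name and signature invented for illustration):

```python
def default_validator(zobject):
    """System default validator: reports no errors for any input, so any
    object that is already well-formed, conforming and linked counts as
    valid under it."""
    return []
```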

Let's consider lazy evaluation.

I suppose I should be able to create an object that cannot be fully validated because its Z4K2/keys cannot be fully expanded/evaluated? So how deeply should we evaluate an instance of a Z4/Type, for validation? My instinct is, "not very deeply", but my instincts are certifiably unreliable.

To take a simple example, if there is a validation requiring a string to have length 2, we can say an object is valid if it happens to reference a function whose Z8K1/return type is a string with length 2, and we can say it is invalid if the Z8K1/return type is not a string or is a string with length <2 or >2. But if the return type is just Z6/String, we don't know whether it's valid or not, unless we evaluate it. In such a case, I think we should return an error (so it's not valid). Then the contributor can

  1. change the referenced function's return type (having saved the "invalid" object) or
  2. wrap the referenced function in a function with an appropriate return type.
This comment was removed by GrounderUK.
DVrandecic triaged this task as Medium priority. Jan 6 2021, 5:51 PM

The Function Model syntax requires a Z1K1/type with a value that evaluates to a Z4/Type. Including this within the "well formed" definition seems to be at odds with the proposition that "Changes to other ZObjects in the wiki cannot impact the well-formedness of a given ZObject." Where the referenced Z4/Type requires evaluation (e.g. for a Z8/Function or a Z10/List), its value depends on the function it references and upon that function's dependencies etc. I suggest that an object could be called "well formed" so long as its Z1K1 is "well formed", meaning that it could be dependent on a function that cannot, in fact, be evaluated (including one that has never existed but excluding any that cannot exist).

You are right. Well-formedness should not require a valid Z4 on the Z1K1 for exactly the reasons you describe. I fixed that as suggested. https://meta.wikimedia.org/w/index.php?title=Abstract_Wikipedia/Function_model&diff=next&oldid=21100248&diffmode=source

Thinking out loud:

I've been a bit stuck on whether the validation requirement would impede programming-aware newcomers. I think if there is a default validator, whose only requirement is that the prior three conditions (well formed, conforming, linked) are met, then it doesn't impede them. I suspect that might then lead to more validation-type behavior being implemented directly in the function implementations (as noted for people writing code that's ultra defensive), if I'm understanding this correctly (and I may not be understanding it correctly). I forget, were we thinking to have a default validator? I guess I should verify: when we ask "Should it be valid?" in the Description for this task, is that mostly looking for satisfiability against all four conditions or is there some other unstated fifth condition?

Of course the "correct" thing to do would be to require a validator as a precondition as suggested.

So I wonder if part of the answer is to have a system default validator that's basically a no-op / returns an empty list (and that can be baked into the JSON schema for all functions, as opposed to being magical, depending on what we want to encourage?), and which can be replaced with a custom validator, maybe with some strong UI hint.

Of course, the other way around might be possible, and it might encourage use of the system in a way that's less prone to breakage later; it might also allow for some types of function discovery, plus maybe some performance gains in some contexts.

I don't know the answer, but I think that's the main stuff I've been struggling with. It may be that some recent code changes are addressing this, though, too.

Otherwise: the work lists idea is good. And the requirement that a function be "valid" before it gets executed seems reasonable as a design principle. It seems like the main question is about the validator function and whether to require it to be user supplied and, if not, whether to have a default validator (which basically returns an empty list if the prior conditions of well formed, conforming, and linked are met). I can see how it might influence certain behaviors in the system.

Yes to all. Those are great thoughts.

So, agreed, we should require validation. But in order to not impede newcomers, it should be possible to save an object even if certain levels are not validated.

I would suggest that:

  • in the normal case, when the contributor tries to save, we run a full validation.
  • we will revisit that decision once validation starts to take consistently longer than 2 seconds for any type.
  • if the validation passes, we store the object.
  • if the validation does not pass, we return the error message.
  • now the contributor can redo the change.
  • if the object is at least conforming, they can also check a checkbox that allows them to force storage of the object.

This should allow objects with some errors to be stored, so that other contributors can work on them. Also, as stated above, we would have special pages that list all unlinked, uncheckable, and invalid objects.
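Put as pseudocode (a sketch only; `validate`, `store` and `is_conforming` are placeholders, not existing functions), the proposed flow is roughly:

```python
def try_to_store(zobject, force_store=False):
    """Sketch of the proposed publish/store flow."""
    errors = validate(zobject)            # full validation on save
    if not errors:
        store(zobject)
        return []
    if force_store and is_conforming(zobject):
        # The contributor ticked the box to store despite errors; the
        # object will then show up on the maintenance special pages.
        store(zobject)
    return errors                         # shown to the contributor
```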

Let's consider lazy evaluation.

I suppose I should be able to create an object that cannot be fully validated because its Z4K2/keys cannot be fully expanded/evaluated? So how deeply should we evaluate an instance of a Z4/Type, for validation? My instinct is, "not very deeply", but my instincts are certifiably unreliable.

To take a simple example, if there is a validation requiring a string to have length 2, we can say an object is valid if it happens to reference a function whose Z8K1/return type is a string with length 2, and we can say it is invalid if the Z8K1/return type is not a string or is a string with length <2 or >2. But if the return type is just Z6/String, we don't know whether it's valid or not, unless we evaluate it. In such a case, I think we should return an error (so it's not valid). Then the contributor can

  1. change the referenced function's return type (having saved the "invalid" object) or
  2. wrap the referenced function in a function with an appropriate return type.

In this case we would create a new Type, "String with length two". Now we can plug in any Function whose return type is "String with length two" into any slot that requires a "String with length two". We don't need to evaluate the Function: because its return type is "String with length two", we know it will return a "String with length two". We have nominal type checking here.

We would not allow a Function that declares it returns Z6/String in this place and hope that it has length 2. We would insist that the Function return "String with length two".

(Note that in a real setting we would probably have a Function called "String of specific length" that takes a positive integer as an input and returns a Type. We would still require nominal equality - that is why we have the identity as a key on Types, so that we can use this identity to compare two Types for equality.)
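As a sketch of what nominal type checking by identity could look like (using Z8K1 for the declared return type, as in the string-length example above, and Z4K1 for the Type's identity; both key choices and the `resolve` helper are illustrative assumptions, not confirmed API):

```python
def same_type(type_a, type_b):
    """Nominal equality: two Types are the same iff their identities match."""
    return resolve(type_a)['Z4K1'] == resolve(type_b)['Z4K1']


def slot_accepts(function, required_type):
    """A Function fits a slot only if its declared return type is nominally
    equal to the slot's required type; we never evaluate the Function to
    find out what it 'really' returns."""
    return same_type(resolve(function)['Z8K1'], required_type)
```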

I hope that makes sense.

I think that makes sense. For my first option, I should have stated that the change to the referenced function’s return type would be from the more general to the more specific (e.g. from Z6/String to Z?/2-string).

My point, though, was that I seem to have a problem with the workflow. It is the referenced function that needs to change, not the one I’m editing now. I can’t save my “correct” function, because the referenced function’s incorrect return type makes my function “invalid” (for now). So do I save it with this error (my first option)? Or do I pretend to coerce the Z6 to a 2-string (my second option)? I don’t like either option, because they both assume I’m reliable, but at least the error with the first option will clear itself if I make the correct change to the other function. So I’d rather save a “correct” object with an error than avoid the error by making the object less correct (for the time being). Under your latest proposal, I would get a single error initially and then choose to force the save anyway, so we’re good.

DVrandecic raised the priority of this task from Medium to High. Mar 17 2021, 4:38 PM

The following tasks will implement the changes suggested here:

  • block publishing / storing if validation fails (this already happens)
  • display error messages from validation (this happens now)
  • T273124 - call evaluator for validation
  • T273125 - migrate hard-coded validators
  • T278316 - provide validation as a service / API
  • T278318 - possibility to call validation on an unstored object
  • T278319 - possibility to call validation on a stored object
  • T278320 - override the blocking on storing an object even if it doesn't validate
  • T278321 - special page with all invalid pages
  • T278325 - store invalidity