
Normalize ZObjects
Open, Needs TriagePublic

Description

The canonical view developed in T259932 works well for storage and gives humans a reasonably compact view of ZObjects. For machine processing, a more uniform representation is more convenient.

The normalized representation of a ZObject is a tree where all leaves are ZObjects of the type Z6/String or Z9/Reference. No other types appear on the leaves, but the inner nodes could be of any type.

This also means that no escaping of strings is needed, as all strings are explicitly typed.
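The leaf rule can be checked mechanically. The following is a minimal sketch in Python, not the project's implementation. It assumes that a Z6/String carries its value under a key Z6K1 and a Z9/Reference under Z9K1, and that the Z1K1 of such a leaf object is the plain type identifier; the function name leaves_are_typed is hypothetical.

```python
def leaves_are_typed(obj):
    """Return True if every leaf of the tree is a Z6/String or Z9/Reference.

    Assumption: leaves look like {"Z1K1": "Z6", "Z6K1": "..."} or
    {"Z1K1": "Z9", "Z9K1": "..."}; these key names are not given in the task.
    """
    if not isinstance(obj, dict):
        # Bare strings and JSON arrays do not occur in the normalized form.
        return False
    ztype = obj.get("Z1K1")
    if ztype == "Z6":
        return isinstance(obj.get("Z6K1"), str)
    if ztype == "Z9":
        return isinstance(obj.get("Z9K1"), str)
    # Inner node of any type: every value must itself be a normalized subtree.
    return bool(obj) and all(leaves_are_typed(value) for value in obj.values())
```

Under these assumptions, a string whose text happens to look like an identifier, such as {"Z1K1": "Z6", "Z6K1": "Z4"}, is unambiguous because its type is stated explicitly, which is why no escaping is needed.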

All Z10/Lists are represented by ZObjects of type Z10/List and not by arrays in the JSON representation. All objects of type Z10/List have either both the Z10K1 and Z10K2 keys or neither (in which case the object is the empty list).
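To illustrate the list shape, here is a small Python sketch that converts between a JSON array and nested Z10/List objects. It assumes that Z10K1 holds the first element and Z10K2 holds the rest of the list, which the task does not state explicitly. The function names are hypothetical, and the items themselves are left untouched so that only the list structure is shown.

```python
def array_to_z10(items):
    """Turn a JSON array into nested Z10/List objects.

    Assumption: Z10K1 is the head and Z10K2 is the tail of the list.
    """
    result = {"Z1K1": "Z10"}  # empty list: neither Z10K1 nor Z10K2
    for item in reversed(items):
        result = {"Z1K1": "Z10", "Z10K1": item, "Z10K2": result}
    return result


def z10_to_array(zlist):
    """Inverse direction; rejects a list node that has only one of the two keys."""
    items = []
    while "Z10K1" in zlist or "Z10K2" in zlist:
        if not ("Z10K1" in zlist and "Z10K2" in zlist):
            raise ValueError("Z10/List needs both Z10K1 and Z10K2, or neither")
        items.append(zlist["Z10K1"])
        zlist = zlist["Z10K2"]
    return items
```

For example, array_to_z10(["a", "b"]) yields a Z10 whose Z10K1 is "a" and whose Z10K2 is another Z10 holding "b", followed by the empty list.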

So the following steps need to be done by the normalizer:

  • trim keys
  • sort keys
  • ensure that all strings are represented as explicit ZObjects of type Z6/String
  • ensure that all references are represented as explicit ZObjects of type Z9/Reference
  • ensure that lists are represented by ZObjects of type Z10/List
  • ensure that all ZObjects of type Z10/List either have both keys or none
  • ensure that all ZObjects are trees with all leaves being either of type Z6/String or Z9/Reference (that should be a given if all previous steps are fulfilled)
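
The steps above could be sketched roughly as follows in Python. This is an illustration under several assumptions rather than the actual normalizer: a bare string shaped like Z123 or Z123K4 is treated as a reference and any other string as a Z6/String; the value keys Z6K1 and Z9K1 are assumed; keys are sorted in plain lexicographic order; and a Z10/List with only one of its two keys is rejected instead of repaired. All function names are hypothetical.

```python
import re

# Assumption: anything shaped like "Z123" or "Z123K4" is a reference.
ZID = re.compile(r"Z[1-9]\d*(K[1-9]\d*)?")


def _sorted(obj):
    """Return a copy of the dict with keys in lexicographic order (assumed order)."""
    return dict(sorted(obj.items()))


def _type_id(value):
    """Type identifier named by a Z1K1 value, plain or already a Z9 object."""
    return value.get("Z9K1") if isinstance(value, dict) else value


def _make_list(items):
    """Nested Z10/List objects; the empty list has neither Z10K1 nor Z10K2."""
    result = {"Z1K1": normalize("Z10")}
    for item in reversed(items):
        result = _sorted(
            {"Z1K1": normalize("Z10"), "Z10K1": normalize(item), "Z10K2": result}
        )
    return result


def normalize(obj):
    if isinstance(obj, str):
        if ZID.fullmatch(obj):
            return {"Z1K1": "Z9", "Z9K1": obj}  # explicit reference
        return {"Z1K1": "Z6", "Z6K1": obj}  # explicit string
    if isinstance(obj, list):
        return _make_list(obj)
    if not isinstance(obj, dict):
        raise TypeError(f"unsupported JSON value: {obj!r}")
    trimmed = {key.strip(): value for key, value in obj.items()}  # trim keys
    ztype = _type_id(trimmed.get("Z1K1"))
    if ztype in ("Z6", "Z9"):
        return _sorted(trimmed)  # already an explicit leaf, keep its value as is
    if ztype == "Z10" and ("Z10K1" in trimmed) != ("Z10K2" in trimmed):
        raise ValueError("Z10/List needs both Z10K1 and Z10K2, or neither")
    return {key: normalize(trimmed[key]) for key in sorted(trimmed)}
```

With this sketch, normalize("hello") gives a Z6 object, normalize("Z4") gives a Z9 object, and applying normalize to an already normalized object returns it unchanged, so running it on every input to an evaluation is safe.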

Normalization must be performed on any input to any evaluation. It will considerably simplify writing function implementations.

Event Timeline

DVrandecic moved this task from Phase δ to Phase γ on the Abstract Wikipedia board.

Does normalisation include the suppression of all unnecessary whitespace? It does not matter if the underlying storage uses a compact form that is not easily readable by a human, given that the wiki editor could just as well expand it with proper indentation for easier editing, and recompact it afterwards.

Also, will we be able to insert comments in JSON-like objects, or will there be a way to keep the canonicalized and compacted form of the object separate from its source, to preserve some comments (and with which syntax: like HTML/XML with <!-- ... -->, C/C++/Java/JS with /* ... */ or // ... \n, SQL with -- ... \n, Lua with --[[ ... ]], sh/ksh/bash with # ... \n, Pascal with (* ... *), or others)?
Maybe that JSON form could just be stored in a cache, while the actual pages could use one of several other languages, the conversion being performed by a "Parser function", a new kind of Function, just like there are Renderers; and as there will be a REPL, it will certainly not use that JSON syntax, but could as well use a Ture-like or Python-like syntax.

In my opinion, canonicalisation of the input can be done with a function as well. And to improve the speed of the function evaluator, you'll need to support a cache of results, so this cache could also be used to store the canonicalized version of objects: there's no requirement for objects to use exclusively the JSON-like format with cryptic keys, and there are certainly better ways to represent it in a much simpler structure (e.g. we don't need many implicit or required keys such as Z1K1).

As well, the JSON data is also fully representable as a list of RDF triples (whose processing at large scale is very efficient and can be easily distributed and parallelized; for RDF, you would just need to create "reference" objects)