@cscott hijacked this task and rewrote the description for a slightly different use case:
During edits, we don't ship the data-parsoid attributes for a given parse to the client (to save bandwidth and time). Instead we store it persistently server-side, and then recombine it with the client's edited HTML when the edit is converted back to wikitext.
It has been suggested that our storage infrastructure would be a lot simpler if, instead of persistent storage (with ACID guarantees, replication, etc) we could use a simple cache, and recreate the missing data-parsoid on-the-fly if the relevant entry happened to expire from the cache (probably rare).
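To make the cache-with-fallback idea concrete, here is a minimal sketch. Everything in it is hypothetical: `DataParsoidCache`, `regenerate`, and the string parse IDs are illustrative names, not Parsoid or RESTBase APIs. The point is only that a cache miss is harmless if, and only if, `regenerate` returns exactly what was originally stored.

```python
from typing import Callable, Dict


class DataParsoidCache:
    """Toy in-memory stand-in for a non-persistent cache keyed by parse ID.

    `regenerate` is a hypothetical callback that re-runs the parse for a
    given parse ID and returns its data-parsoid blob.
    """

    def __init__(self, regenerate: Callable[[str], str]) -> None:
        self._entries: Dict[str, str] = {}
        self._regenerate = regenerate
        self.misses = 0

    def put(self, parse_id: str, data_parsoid: str) -> None:
        self._entries[parse_id] = data_parsoid

    def evict(self, parse_id: str) -> None:
        # Simulates an entry expiring from the cache.
        self._entries.pop(parse_id, None)

    def get(self, parse_id: str) -> str:
        cached = self._entries.get(parse_id)
        if cached is not None:
            return cached
        # Cache miss: only safe if the re-parse is fully deterministic.
        self.misses += 1
        regenerated = self._regenerate(parse_id)
        self._entries[parse_id] = regenerated
        return regenerated
```

If `regenerate` can return something different from the original blob, the recombination with the client's edited HTML silently goes wrong, which is why the rest of this task is about determinism.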
Unfortunately, that requires a 100% deterministic parse, so that we are guaranteed to be able to recreate exactly the same data-parsoid that we'd originally supplied. This is an "epic" scale task. The basic idea is:
- Basic Parsoid execution should be deterministic (almost the case today, but see T206222)
- Maintain a unique "timestamp of parse" ID for every parse (RESTBase does this already)
- Use this to deterministically fetch appropriate templates/media etc based on the revision active at that timestamp
  - The Memento Extension does this (but neither Parsoid nor the PHP parser currently do)
  - Some interesting issues w/r/t administratively-deleted content between original parse and present (not blockers, but sometimes regenerating an old parse should deliberately fail)
- Use this to deterministically execute all parser functions and scribunto code
  - Time/date functions easy, just use the passed timestamp
  - "Random" numbers can be replaced with hash(timestamp, secret seed)
  - "Current # of articles" and similar are (much?) harder
  - Probably lots of hard corner cases where Lua senses the environment (network requests, etc)
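A rough sketch of two of the easier pieces above: picking the revision that was active at the parse timestamp, and replacing "random" numbers with a keyed hash of the timestamp. The function names, the `(timestamp, revision id)` history format, and the `call_index` parameter are all assumptions made for illustration. `call_index` is added so that several random calls within one parse don't all return the same value. HMAC-SHA256 is just one reasonable choice for hash(timestamp, secret seed).

```python
import bisect
import hashlib
import hmac
from typing import List, Optional, Tuple


def revision_at(history: List[Tuple[int, str]], parse_ts: int) -> Optional[str]:
    """Return the revision id active at parse_ts, or None if none existed yet.

    `history` is a list of (timestamp, revision id) pairs sorted by
    ascending timestamp (hypothetical format).
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect.bisect_right(timestamps, parse_ts)
    return history[idx - 1][1] if idx > 0 else None


def deterministic_random(parse_ts: int, secret_seed: bytes,
                         call_index: int, upper: int) -> int:
    """Pseudo-random integer in [0, upper) derived from the parse timestamp.

    The same (parse_ts, secret_seed, call_index) always yields the same
    value, so a re-parse reproduces it. The modulo introduces a slight
    bias, which is ignored in this sketch.
    """
    msg = f"{parse_ts}:{call_index}".encode("ascii")
    digest = hmac.new(secret_seed, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") % upper
```

Note that `revision_at` says nothing about deleted revisions. A real lookup would need a way to signal "this revision existed at that time but may no longer be shown", so that regenerating the old parse can deliberately fail.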
Determinism is probably impossible to guarantee on 100% of our content, but we could probably do the easy 80% without too much pain and pass around a flag saying whether the parse encountered any possible nondeterminism (i.e., called a blacklisted parser function, invoked a Scribunto module, called {{NUMBEROFARTICLES}}, would have used a revision that has since been deleted, etc).
Then we'd have to decide what to do in the remaining 20% of cases. This is perhaps mitigated by the fact that it only matters for 20% of the cache misses, and that the "hard 20% of MediaWiki features" are probably not used by anything close to 20% of WM articles. Options include:
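The "pass around a flag" part could look something like the sketch below. The names (`ParseResult`, the `record_*` helpers, and the contents of the blacklist) are hypothetical. Recording the reasons rather than a bare boolean is a small extension that makes it easier to see later why a parse could not be regenerated.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical blacklist; the real set would need to be worked out per feature.
NONDETERMINISTIC_FUNCTIONS = frozenset({"NUMBEROFARTICLES"})


@dataclass
class ParseResult:
    html: str = ""
    # Every reason the parse might not be reproducible later.
    nondeterminism_reasons: List[str] = field(default_factory=list)

    @property
    def deterministic(self) -> bool:
        return not self.nondeterminism_reasons


def record_parser_function(result: ParseResult, name: str) -> None:
    if name.upper() in NONDETERMINISTIC_FUNCTIONS:
        result.nondeterminism_reasons.append(f"parser function: {name}")


def record_scribunto_invoke(result: ParseResult, module: str) -> None:
    # Conservative: any Lua module is treated as possibly nondeterministic.
    result.nondeterminism_reasons.append(f"scribunto module: {module}")


def record_deleted_revision(result: ParseResult, title: str) -> None:
    result.nondeterminism_reasons.append(f"revision since deleted: {title}")
```

On a cache miss, the caller would then regenerate only when `deterministic` was true for the original parse, and otherwise fall through to whichever fallback is chosen.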
- Perhaps it's not too bad if we just say "the edit fails" in the hard cases.
- We might be able to bound the nondeterminism to a DOM tree (aka balanced templates) and only fail the edit if it involved that subtree.
- Maybe VE tries to replay its transaction log against a newly-parsed version (akin to an "edit conflict"), and that's then successful "most" of the time.
- In the worst case, we end up having to fall back to persistent storage for the "20%", and we haven't simplified our cache infrastructure at all. We may have reduced the amount of expensive reliable persistent storage, however.
Current planning is that we should integrate appropriate persistent editing-session storage into the existing PHP ParserCache mechanism (or continue to use RESTBase), and not attempt this "epic scale" determinism task. But we can continue to discuss the idea here and collect subtasks to scope the work that would be required.
Old task description by @GWicke:
In RESTBase we'd like to avoid storing template updates if none of the content actually changed. I started to look into using simple string equality to avoid storing a template update, but noticed that basically every re-render differs from the previous render in its about attribute values and citation links. It seems that about id assignment changes depending on async execution order. Even if there was an actual change in page rendering, these random about attributes make the RESTBase compression ratios significantly worse than they could be.
So, my requests are:
- make sure that two subsequent parses using the same input data (article, templates etc) always result in the same HTML string
- minimize the differences introduced by re-renders with different templates; for example, it would be great to keep id & about attribute changes as local as possible (don't re-number all following ids if one element was added); see also T87556.
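One possible way to get both properties, sketched under assumptions: derive each about id from the transclusion's own source text plus an occurrence counter, instead of from a global counter that is bumped in async completion order. This is not how Parsoid assigns ids today. `assign_about_ids` and the `#mwt-` prefix format are made up for the example.

```python
import hashlib
from collections import Counter
from typing import List


def assign_about_ids(transclusions: List[str]) -> List[str]:
    """Assign an about id to each transclusion, given in source order.

    The id depends only on the transclusion's wikitext and on how many
    identical transclusions precede it, never on async execution order.
    """
    seen: Counter = Counter()
    ids = []
    for src in transclusions:
        digest = hashlib.sha1(src.encode("utf-8")).hexdigest()[:8]
        ids.append(f"#mwt-{digest}-{seen[src]}")
        seen[src] += 1
    return ids
```

With this scheme, inserting a new template in the middle of a page leaves the ids of all unrelated templates untouched. Only later copies of the same wikitext get renumbered. The trade-off is longer ids and a small chance of hash collisions with an 8-character digest.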