
[EPIC] Make Parsoid HTML output completely deterministic
Open, Medium, Public

Description

@cscott hijacked this task and rewrote the description for a slightly different use case:

During edits, we don't ship the data-parsoid attributes for a given parse to the client (to save bandwidth and time). Instead we store it persistently server-side, and then recombine it with the client's edited HTML when the edit is converted back to wikitext.

It has been suggested that our storage infrastructure would be a lot simpler if, instead of persistent storage (with ACID guarantees, replication, etc) we could use a simple cache, and recreate the missing data-parsoid on-the-fly if the relevant entry happened to expire from the cache (probably rare).

Unfortunately, that requires a 100% deterministic parse, so that we are guaranteed to be able to recreate exactly the same data-parsoid that we'd originally supplied. This is an "epic" scale task. The basic idea is:

  • Basic Parsoid execution should be deterministic (almost the case today, but see T206222)
  • Maintain a unique "timestamp of parse" ID for every parse (RESTBase does this already)
  • Use this to deterministically fetch appropriate templates/media etc based on the revision active at that timestamp
    • The Memento Extension does this (but neither Parsoid nor the PHP parser currently do)
    • Some interesting issues w/r/t administratively-deleted content between original parse and present (not blockers, but sometimes regenerating an old parse should deliberately fail)
  • Use this to deterministically execute all parser functions and scribunto code
    • Time/date functions easy, just use the passed timestamp
    • "Random" numbers can be replaced with hash(timestamp, secret seed)
    • "Current # of articles" and similar are (much?) harder
    • Probably lots of hard corner cases where Lua senses the environment (network requests, etc)
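Two of the easier items in the list above (fetching the revision that was active at the parse timestamp, and replacing "random" numbers with a hash of the timestamp and a secret seed) can be sketched roughly as follows. This is only an illustration under stated assumptions: `revision_at`, `deterministic_random`, the `call_index` parameter, and the in-memory revision list are all hypothetical, and neither Parsoid nor the PHP parser is claimed to work this way. Timestamps are assumed to be 14-digit `YYYYMMDDHHMMSS` strings, which sort lexicographically.

```python
import bisect
import hashlib
import hmac
from typing import List, Optional, Tuple


def revision_at(revisions: List[Tuple[str, int]], parse_timestamp: str) -> Optional[int]:
    """Return the id of the newest revision at or before parse_timestamp.

    `revisions` is a hypothetical list of (timestamp, rev_id) pairs sorted by
    timestamp ascending. Returns None if no revision existed yet.
    """
    timestamps = [ts for ts, _ in revisions]
    pos = bisect.bisect_right(timestamps, parse_timestamp)
    if pos == 0:
        return None
    return revisions[pos - 1][1]


def deterministic_random(parse_timestamp: str, secret_seed: bytes,
                         call_index: int, upper: int) -> int:
    """Return a repeatable pseudo-random integer in [0, upper).

    `call_index` is an assumption added here so that successive calls within
    one parse yield different values; the task only mentions
    hash(timestamp, secret seed).
    """
    message = f"{parse_timestamp}:{call_index}".encode("utf-8")
    digest = hmac.new(secret_seed, message, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") % upper
```

Re-running either function with the same parse timestamp gives the same answer, which is the property a regenerated parse would need. The deleted-revision case mentioned above would need an extra check that deliberately fails the lookup.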

Determinism is probably impossible to guarantee on 100% of our content, but we could probably do the easy 80% without too much pain and pass around a flag saying whether the parse encountered any possible nondeterminism (i.e., called a blacklisted parser function, invoked a scribunto module, called {{NUMBEROFARTICLES}}, would have used a revision that has since been deleted, etc.).
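A minimal sketch of such a flag, assuming a hypothetical `ParseResult` object that is threaded through the parse and a hypothetical deny list of parser function names (the real list is not specified in this task):

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical deny list, for illustration only.
NONDETERMINISTIC_FUNCTIONS = {"NUMBEROFARTICLES", "#invoke"}


@dataclass
class ParseResult:
    html: str = ""
    # Reasons why this parse may not be reproducible; empty means "safe".
    nondeterminism: List[str] = field(default_factory=list)

    @property
    def is_deterministic(self) -> bool:
        return not self.nondeterminism


def note_parser_function(result: ParseResult, name: str) -> None:
    """Record a parser function call if it is on the deny list."""
    if name in NONDETERMINISTIC_FUNCTIONS:
        result.nondeterminism.append(name)
```

Recording the reasons rather than a bare boolean would make it easier to measure how often each source of nondeterminism is actually hit.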

Then we'd have to decide what to do in those 20% of cases. This is perhaps mitigated by the fact that it is only 20% of the cache misses, and that the "hard 20% of mediawiki features" are probably not used by anything close to 20% of WM articles. Options include:

  • Perhaps it's not too bad if we just say "the edit fails" in the hard cases.
  • We might be able to bound the nondeterminism to a DOM tree (aka balanced templates) and only fail the edit if it involved that subtree.
  • Maybe VE tries to replay its transaction log against a newly-parsed version (akin to an "edit conflict"), and that's then successful "most" of the time.
  • In the worst case, we end up having to fall back to persistent storage for the "20%", and we haven't simplified our cache infrastructure at all. We may have reduced the amount of expensive reliable persistent storage, however.

Current planning is that we should integrate appropriate persistent editing-session storage into the existing PHP ParserCache mechanism (or continue to use RESTBase), and not attempt this "epic scale" determinism task. But we can continue to discuss the idea here and collect subtasks to scope the work that would be required.


Old task description by @GWicke:
In RESTBase we'd like to avoid storing template updates if none of the content actually changed. I started to look into using simple string equality to avoid storing a template update, but noticed that basically every re-render differs from the previous render in its about attribute values and citation links. It seems that about id assignment changes depending on async execution order. Even if there was an actual change in page rendering, these random about attributes make the RESTBase compression ratios significantly worse than they could be.

So, my requests are:

  • make sure that two subsequent parses using the same input data (article, templates etc) always result in the same HTML string
  • minimize the differences introduced by re-renders with different templates; for example, it would be great to keep id & about attribute changes as local as possible (don't re-number all following ids if one element was added); see also T87556.

Event Timeline

GWicke raised the priority of this task from to Needs Triage.
GWicke updated the task description.
GWicke added projects: Parsoid, RESTBase.
GWicke added subscribers: GWicke, mobrovac.
GWicke set Security to None.
GWicke edited projects, added Parsoid; removed RESTBase.
GWicke added a subscriber: RESTBase.

Quick comment. This should not be very difficult. A very crude implementation could just do a final pass on the DOM and reassign about ids deterministically. But, there are probably better ways than that.
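A very rough sketch of that crude final pass, with two assumptions: it works on the serialized HTML string rather than on the DOM, and about values have the form `#mwt<number>`. The function name `renumber_about_ids` is made up for illustration.

```python
import re

# Assumed shape of about attributes in the serialized output.
ABOUT_RE = re.compile(r'about="#mwt(\d+)"')


def renumber_about_ids(html: str) -> str:
    """Renumber about ids in order of first appearance in the document."""
    mapping = {}

    def replace(match: re.Match) -> str:
        old = match.group(1)
        if old not in mapping:
            mapping[old] = str(len(mapping) + 1)
        return f'about="#mwt{mapping[old]}"'

    return ABOUT_RE.sub(replace, html)
```

Elements that shared an about id before the pass still share one afterwards, so grouping is preserved while the numbering no longer depends on async execution order.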

Summary of IRC discussion:

  • There are two different issues here. Cite ref ids being non-deterministic is a bug. I'll investigate that.
  • As for about ids, this is a little bit more complex. If we assign about ids before kicking off subpipelines (or maybe even pre-assigned in the tokenizer), and use hierarchical ids for nested about ids, as long as we keep about ids assigned in top-level order, about ids should be deterministic. This task is up for grabs right now. If no one has taken it while I am investigating the cite ref id issue, I'll pick it up.
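The hierarchical id idea from the second bullet could look something like the sketch below. The `AboutId` class, the dotted id format, and the simulated "completion order" are all assumptions for illustration; this is not Parsoid's pipeline code.

```python
from typing import Dict, List, Tuple


class AboutId:
    """An about id whose nested ids are numbered relative to their parent."""

    def __init__(self, path: Tuple[int, ...] = ()):
        self.path = path
        self._next_child = 0

    def new_child(self) -> "AboutId":
        self._next_child += 1
        return AboutId(self.path + (self._next_child,))

    def __str__(self) -> str:
        return "#mwt" + ".".join(str(n) for n in self.path)


def assign_ids(templates: Dict[str, int],
               completion_order: List[str]) -> Dict[str, Tuple[str, List[str]]]:
    """templates maps a template name to how many nested ids it needs."""
    root = AboutId()
    # Top-level ids are handed out in source order, before any subpipeline runs.
    top = {name: root.new_child() for name in templates}
    result = {}
    # Subpipelines may finish in any order; nested ids depend only on the parent.
    for name in completion_order:
        nested = [str(top[name].new_child()) for _ in range(templates[name])]
        result[name] = (str(top[name]), nested)
    return result
```

Because each subpipeline only draws ids from its own pre-assigned parent, finishing order no longer affects the final numbering.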

For us the most pressing issue is the non-determinism in cite links. The about attributes are relatively straightforward to work around by ignoring their value in comparisons.
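The workaround of ignoring about values in comparisons might be sketched like this. The helper names are hypothetical, and the regular expression assumes double-quoted attributes in the serialized HTML.

```python
import re

# Matches about="..." and captures everything except the value.
ABOUT_VALUE_RE = re.compile(r'(\babout=")[^"]*(")')


def strip_about_values(html: str) -> str:
    """Blank out every about attribute value."""
    return ABOUT_VALUE_RE.sub(r"\1\2", html)


def same_render(old_html: str, new_html: str) -> bool:
    """Compare two renders while ignoring about attribute values."""
    return strip_about_values(old_html) == strip_about_values(new_html)
```

Note that this comparison also ignores changes in which elements share an about id, so it is looser than true equality.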

marcoil subscribed.

I'm investigating the Cite refs part of the issue. It seems that the changes are in the ids for the <reference> entries, which in turn change the reflinks and the data-mw.body.id fields.

An example found by Subbu was [[en:Minneapolis]], which changed

  • <a href="#cite_note-GR3-3">[3]</a> to <a href="#cite_note-GR3-269">[3]</a>
  • "body":{"id":"mw-reference-text-cite_note-GR3-3"} to "body":{"id":"mw-reference-text-cite_note-GR3-269"}

Change 199643 had a related patch set uploaded (by Marcoil):
T93715: Ensure reference index is reset at the end of document

https://gerrit.wikimedia.org/r/199643
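To illustrate the kind of bug the patch title describes (this is not the patch itself, and the class and function names are made up), a reference counter that outlives a single document produces different note ids on each render unless it is reset when the document ends:

```python
from typing import List


class ReferenceCounter:
    """Hypothetical counter shared across parses, as in the buggy case."""

    def __init__(self) -> None:
        self.index = 0

    def next_note_id(self, name: str) -> str:
        self.index += 1
        return f"cite_note-{name}-{self.index}"

    def reset(self) -> None:
        self.index = 0


def render_document(ref_names: List[str], counter: ReferenceCounter) -> List[str]:
    """Assign note ids, then reset the counter at the end of the document."""
    try:
        return [counter.next_note_id(name) for name in ref_names]
    finally:
        counter.reset()
```

With the reset in place, two renders of the same document through the same counter yield identical ids; without it, the second render would continue counting from where the first one stopped.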

GWicke renamed this task from Make HTML output deterministic to Make HTML output as deterministic as possible. (Mar 30 2015, 7:10 PM)
GWicke updated the task description.
GWicke renamed this task from Make HTML output as deterministic as possible to Make HTML output as deterministic / stable as possible. (Mar 31 2015, 6:46 PM)
Arlolra lowered the priority of this task from High to Medium. (Jan 20 2017, 2:22 AM)

This was brought up again as a desideratum due to caching/storage concerns. Since we don't actually provide data-parsoid to VE, we currently need to *guarantee* persistent storage of a matched set of Parsoid HTML and the data-parsoid for that HTML for the entire duration of an editing session, to be certain that the html2wt phase can get back the appropriate data-parsoid.

If the parse was entirely deterministic, we could use non-persistent storage (cache) and simply regenerate the appropriate data-parsoid as needed if it had expired from the cache.
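As a sketch of that flow, assuming a plain dict stands in for the cache and a hypothetical `reparse` callable reproduces the original parse from its revision id and parse timestamp:

```python
from typing import Callable, Dict, Tuple

# (revision id, parse timestamp); the key shape is an assumption.
CacheKey = Tuple[int, str]


def get_data_parsoid(cache: Dict[CacheKey, dict], key: CacheKey,
                     reparse: Callable[[int, str], dict]) -> dict:
    """Return cached data-parsoid, regenerating it on a cache miss.

    This is only safe if reparse(rev_id, timestamp) is fully deterministic.
    """
    entry = cache.get(key)
    if entry is None:
        entry = reparse(*key)
        cache[key] = entry
    return entry
```

If the regenerated data-parsoid differed in any way from what was originally paired with the HTML sent to the client, html2wt would be working from mismatched data, which is why the guarantee has to be exact.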

cscott renamed this task from Make HTML output as deterministic / stable as possible to [EPIC] Make Parsoid HTML output completely deterministic. (Oct 4 2018, 2:44 PM)
cscott updated the task description.
cscott updated the task description.
cscott added a subscriber: GWicke.