[EPIC] Make Parsoid HTML output completely deterministic
Open, Medium, Public

Assigned To

None

Authored By

	• GWicke
	Mar 24 2015, 5:09 AM

Description

@cscott hijacked this task and rewrote the description for a slightly different use case:

During edits, we don't ship the data-parsoid attributes for a given parse to the client (to save bandwidth and time). Instead we store them persistently server-side, and then recombine them with the client's edited HTML when the edit is converted back to wikitext.

It has been suggested that our storage infrastructure would be a lot simpler if, instead of persistent storage (with ACID guarantees, replication, etc) we could use a simple cache, and recreate the missing data-parsoid on-the-fly if the relevant entry happened to expire from the cache (probably rare).

Unfortunately, that requires a 100% deterministic parse, so that we are guaranteed to be able to recreate exactly the same data-parsoid that we'd originally supplied. This is an "epic" scale task. The basic idea is:

- Basic Parsoid execution should be deterministic (almost the case today, but see T206222).
- Maintain a unique "timestamp of parse" ID for every parse (RESTBase does this already).
- Use this to deterministically fetch appropriate templates/media etc. based on the revision active at that timestamp.
  - The Memento Extension does this (but neither Parsoid nor the PHP parser currently does).
  - There are some interesting issues with respect to content that was administratively deleted between the original parse and the present (not blockers, but sometimes regenerating an old parse should deliberately fail).
- Use this to deterministically execute all parser functions and Scribunto code.
  - Time/date functions are easy: just use the passed timestamp.
  - "Random" numbers can be replaced with hash(timestamp, secret seed).
  - "Current # of articles" and similar are (much?) harder.
  - There are probably lots of hard corner cases where Lua senses the environment (network requests, etc.).
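
As a rough illustration of the "hash(timestamp, secret seed)" item above, here is a minimal Python sketch. The function name, the `call_index` parameter (added so that several "random" calls within one parse can differ), and the use of HMAC-SHA256 are assumptions made for the example, not anything Parsoid or the PHP parser is known to implement.

```python
import hashlib
import hmac


def deterministic_random(parse_timestamp: str, secret_seed: bytes,
                         call_index: int, upper: int) -> int:
    """Return an integer in [0, upper) that depends only on the inputs.

    Hypothetical helper: re-running a parse with the same timestamp and
    seed reproduces the same sequence of "random" values.
    """
    # call_index is an assumption: it distinguishes the 1st, 2nd, ... call
    # made during a single parse.
    message = f"{parse_timestamp}:{call_index}".encode("utf-8")
    digest = hmac.new(secret_seed, message, hashlib.sha256).digest()
    # Reducing with % introduces a slight modulo bias, acceptable for a sketch.
    return int.from_bytes(digest[:8], "big") % upper
```

Keying the hash with a secret keeps the values unpredictable to page authors while still letting the server reproduce them for any stored parse timestamp.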

Determinism is probably impossible to guarantee on 100% of our content, but we could probably do the easy 80% without too much pain and pass around a flag saying whether the parse encountered any possible nondeterminism (i.e., called a blacklisted parser function, invoked a Scribunto module, called {{NUMBEROFARTICLES}}, would have used a revision that has since been deleted, etc.).
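
The flag described above might be carried alongside the parse output along these lines. This is only a sketch: `ParseResult`, `record_parser_function`, `record_deleted_revision`, and the contents of the blacklist are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical blacklist; NUMBEROFARTICLES and Scribunto's #invoke are the
# only entries suggested by the discussion above.
NONDETERMINISTIC_FUNCTIONS = frozenset({"NUMBEROFARTICLES", "#invoke"})


@dataclass
class ParseResult:
    html: str
    # Every reason the parse may not be reproducible; empty means "safe".
    nondeterminism_reasons: List[str] = field(default_factory=list)

    @property
    def is_deterministic(self) -> bool:
        return not self.nondeterminism_reasons


def record_parser_function(result: ParseResult, name: str) -> None:
    """Flag the parse if a blacklisted parser function was called."""
    if name in NONDETERMINISTIC_FUNCTIONS:
        result.nondeterminism_reasons.append(f"parser function: {name}")


def record_deleted_revision(result: ParseResult, title: str) -> None:
    """Flag the parse if it would have used a since-deleted revision."""
    result.nondeterminism_reasons.append(f"deleted revision: {title}")
```

Recording reasons rather than a bare boolean would also make it possible to measure how much content actually falls into the hard cases.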

Then we'd have to decide what to do in those 20% of the cases, mitigated perhaps by the fact that this is only 20% of the cache misses, and that the "hard 20% of MediaWiki features" are probably not used by anything close to 20% of WM articles. Options include:

- Perhaps it's not too bad if we just say "the edit fails" in the hard cases.
- We might be able to bound the nondeterminism to a DOM tree (aka balanced templates) and only fail the edit if it involved that subtree.
- Maybe VE tries to replay its transaction log against a newly-parsed version (akin to an "edit conflict"), and that's then successful "most" of the time.
- In the worst case, we end up having to fall back to persistent storage for the "20%", and we haven't simplified our cache infrastructure at all. We may have reduced the amount of expensive reliable persistent storage, however.

Current planning is that we should integrate appropriate persistent editing-session storage into the existing PHP ParserCache mechanism (or continue to use RESTBase), and not attempt this "epic scale" determinism task. But we can continue to discuss the idea here and collect subtasks to scope the work that would be required.

Old task description by @GWicke:
In RESTBase we'd like to avoid storing template updates if none of the content actually changed. I started to look into using simple string equality to avoid storing a template update, but noticed that basically every re-render differs from the previous render in its about attribute values and citation links. It seems that about id assignment changes depending on async execution order. Even if there was an actual change in page rendering, these random about attributes make the RESTBase compression ratios significantly worse than they could be.

So, my requests are:

- make sure that two subsequent parses using the same input data (article, templates etc.) always result in the same HTML string
- minimize the differences introduced by re-renders with different templates; for example, it would be great to keep id & about attribute changes as local as possible (don't re-number all following ids if one element was added); see also T87556.

Related Objects

Status	Assigned	Task
Invalid	None	T93751 RFC: Next steps for long-term revision storage -- space needs, storage hierarchies
Resolved	• GWicke	T93779 Only store a new render of Parsoid HTML / data-parsoid revision if the content actually changed after a template update
Open	None	T93715 [EPIC] Make Parsoid HTML output completely deterministic
Resolved	• marcoil	T63165 Parsoid's Cite extension sometimes produces different ids for the same <ref> source
Open	None	T206222 Make "about" attribute IDs deterministic

Event Timeline

• GWicke created this task. Mar 24 2015, 5:09 AM

• GWicke raised the priority of this task from to Needs Triage.

• GWicke updated the task description.

• GWicke added projects: Parsoid, RESTBase.

• GWicke added subscribers: • GWicke, • mobrovac.

Restricted Application added a subscriber: Aklapper. Mar 24 2015, 5:09 AM

• GWicke removed a project: Parsoid. Mar 24 2015, 5:09 AM

• GWicke set Security to None.

• GWicke edited projects, added Parsoid; removed RESTBase.

• GWicke added a subscriber: RESTBase.

• GWicke mentioned this in T93751: RFC: Next steps for long-term revision storage -- space needs, storage hierarchies. Mar 24 2015, 3:15 PM

• GWicke added a parent task: T93751: RFC: Next steps for long-term revision storage -- space needs, storage hierarchies.

• GWicke triaged this task as High priority. Mar 24 2015, 3:17 PM

Quick comment. This should not be very difficult. A very crude implementation could just do a final pass on the DOM and reassign about ids deterministically. But, there are probably better ways than that.
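
A very crude version of such a final pass could look like the following Python sketch. It assumes about values have the form `#mwt<number>` and, for brevity, rewrites the serialized HTML with a regular expression rather than walking the DOM; the function name is hypothetical.

```python
import re

# Assumption: about ids look like about="#mwt12".
ABOUT_ATTR = re.compile(r'about="(#mwt\d+)"')


def renumber_about_ids(html: str) -> str:
    """Reassign about ids in order of first appearance in the document."""
    mapping = {}

    def replace(match):
        old_id = match.group(1)
        if old_id not in mapping:
            # The first distinct id seen becomes #mwt1, the next #mwt2, ...
            mapping[old_id] = f"#mwt{len(mapping) + 1}"
        return f'about="{mapping[old_id]}"'

    return ABOUT_ATTR.sub(replace, html)
```

Elements that shared an about id before the pass still share one afterwards, so template grouping is preserved while the numbering no longer depends on async execution order.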

ssastry moved this task from Needs Triage to In Progress on the Parsoid board. Mar 24 2015, 4:36 PM

• GWicke added a parent task: T93779: Only store a new render of Parsoid HTML / data-parsoid revision if the content actually changed after a template update. Mar 24 2015, 6:09 PM

cscott subscribed. Mar 24 2015, 10:03 PM

Summary of IRC discussion:

There are two different issues here. Cite ref ids being non-deterministic is a bug. I'll investigate that.

As for about ids, this is a little bit more complex. If we assign about ids before kicking off subpipelines (or maybe even pre-assigned in the tokenizer), and use hierarchical ids for nested about ids, as long as we keep about ids assigned in top-level order, about ids should be deterministic. This task is up for grabs right now. If no one has taken it while I am investigating the cite ref id issue, I'll pick it up.
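
The hierarchical-id idea could be sketched as below. The class name and the dotted id format (`#mwt2.1`) are assumptions for illustration only; the point is that top-level ids are handed out in source order before any subpipeline starts, and each subpipeline derives nested ids from its parent's id.

```python
class AboutIdAllocator:
    """Hands out about ids that do not depend on async completion order."""

    def __init__(self, prefix: str = "#mwt"):
        self._prefix = prefix
        self._next = 1

    def allocate(self) -> str:
        about_id = f"{self._prefix}{self._next}"
        self._next += 1
        return about_id

    def child(self, about_id: str) -> "AboutIdAllocator":
        # A subpipeline gets its own counter, namespaced by the parent id.
        return AboutIdAllocator(prefix=f"{about_id}.")


# Top-level ids are assigned in document order, before subpipelines run.
top = AboutIdAllocator()
first, second = top.allocate(), top.allocate()

# Subpipelines may then run in any order without affecting the result.
nested_in_second = top.child(second).allocate()
nested_in_first = top.child(first).allocate()
```

Because each subpipeline only touches its own namespaced counter, the order in which subpipelines finish cannot change any id.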

For us the most pressing issue is the non-determinism in cite links. The about attributes are relatively straightforward to work around by ignoring their value in comparisons.
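
One way to ignore about values in a comparison is to blank them out on both sides before testing string equality. The Python sketch below is a hypothetical helper, not RESTBase code, and it uses a regular expression on the serialized HTML for brevity.

```python
import re

# Matches the value of an about="..." attribute (but not e.g. data-about).
ABOUT_VALUE = re.compile(r'(?<![\w-])(about=")[^"]*(")')


def same_ignoring_about(old_html: str, new_html: str) -> bool:
    """Compare two renders while treating all about values as equal."""

    def mask(html: str) -> str:
        # Keep the attribute itself, drop only its value.
        return ABOUT_VALUE.sub(r"\1\2", html)

    return mask(old_html) == mask(new_html)
```

This masks the values entirely, so it cannot detect a change in which elements share an about id.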

I'm investigating the Cite refs part of the issue. It seems that the changes are in the ids for the <reference> entries, which changes the reflinks and the data-mw.body.id fields.

An example found by Subbu was [[en:Minneapolis]], which changed

<a href="#cite_note-GR3-3">[3]</a> to <a href="#cite_note-GR3-269">[3]</a>
"body":{"id":"mw-reference-text-cite_note-GR3-3"} to "body":{"id":"mw-reference-text-cite_note-GR3-269"}

Change 199643 had a related patch set uploaded (by Marcoil):
T93715: Ensure reference index is reset at the end of document

https://gerrit.wikimedia.org/r/199643

gerritbot added a project: Patch-For-Review. Mar 25 2015, 5:49 PM

Placing up for grabs as the Cite part is being dealt with at T63165.

ssastry mentioned this in T93973: Technical Debt: Eliminate all state from Cite.references object. Mar 25 2015, 11:31 PM

ssastry mentioned this in T93974: Tech Debt: Allocate native extension objects once per document instead of reusing it across documents. Mar 25 2015, 11:33 PM

• marcoil removed a project: Patch-For-Review. Mar 26 2015, 11:28 AM

• GWicke mentioned this in T93779: Only store a new render of Parsoid HTML / data-parsoid revision if the content actually changed after a template update. Mar 27 2015, 12:31 AM

• GWicke mentioned this in T94422: Consistently use the same render for html2wt processing after an edit. Mar 30 2015, 3:59 PM

• GWicke renamed this task from Make HTML output deterministic to Make HTML output as deterministic as possible. Mar 30 2015, 7:10 PM

• GWicke updated the task description.

• GWicke updated the task description. Mar 30 2015, 7:16 PM

• marcoil closed subtask T63165: Parsoid's Cite extension sometimes produces different ids for the same <ref> source as Resolved. Mar 31 2015, 9:55 AM

• GWicke renamed this task from Make HTML output as deterministic as possible to Make HTML output as deterministic / stable as possible. Mar 31 2015, 6:46 PM

• marcoil mentioned this in rGPARe77a61fa75c7: T63165: Ensure reference index is reset at the end of document. Apr 8 2015, 4:28 PM

Qichen.Tu subscribed. Apr 9 2015, 7:11 AM

ssastry moved this task from In Progress to Needs Triage on the Parsoid board. May 26 2015, 4:41 PM

Liuxinyu970226 subscribed. Jun 3 2015, 11:45 PM

Arlolra lowered the priority of this task from High to Medium. Jan 20 2017, 2:22 AM

Arlolra mentioned this in T151474: Investigate source of non-determinism in rt. Mar 28 2018, 7:10 PM

This was brought up again as a desideratum due to caching/storage concerns. Since we don't actually provide data-parsoid to VE, we currently need to *guarantee* persistent storage of a matched set of Parsoid HTML/data-parsoid for that HTML for the entire duration of an editing session, to be certain that the html2wt phase can get back the appropriate data-parsoid.

If the parse was entirely deterministic, we could use non-persistent storage (cache) and simply regenerate the appropriate data-parsoid as needed if it had expired from the cache.
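
The cache-with-regeneration behaviour described here could be sketched as follows. `DataParsoidCache`, `render_id`, and the `reparse` callback are hypothetical names; the sketch only works if `reparse` is fully deterministic, which is exactly the property this task asks for.

```python
from typing import Callable, Dict


class DataParsoidCache:
    """Non-persistent store that regenerates expired entries on demand."""

    def __init__(self, reparse: Callable[[str], str]):
        self._entries: Dict[str, str] = {}
        # Assumption: reparse(render_id) reproduces the original
        # data-parsoid exactly for that render.
        self._reparse = reparse

    def put(self, render_id: str, data_parsoid: str) -> None:
        self._entries[render_id] = data_parsoid

    def evict(self, render_id: str) -> None:
        # Simulates an entry expiring from the cache.
        self._entries.pop(render_id, None)

    def get(self, render_id: str) -> str:
        if render_id not in self._entries:
            # Cache miss: regenerate instead of failing the edit.
            self._entries[render_id] = self._reparse(render_id)
        return self._entries[render_id]
```

If the reparse were not deterministic, `get` could silently return data-parsoid that no longer matches the HTML the client edited, which is why the determinism requirement is strict.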

cscott renamed this task from Make HTML output as deterministic / stable as possible to [EPIC] Make Parsoid HTML output completely deterministic. Oct 4 2018, 2:44 PM

cscott updated the task description.

cscott updated the task description. Oct 4 2018, 2:48 PM

cscott updated the task description. Oct 4 2018, 2:51 PM

ssastry mentioned this in T206066: Wikimedia Technical Conference 2018 Session - Identifying the requirements and goals for the parser. Oct 11 2018, 5:15 PM

cscott updated the task description. Oct 17 2018, 2:10 AM

cscott mentioned this in T216289: Parsoid is incompatible with node v11 due to about ID nondeterminism. Feb 19 2019, 4:52 PM

LGoto moved this task from Needs Triage to Backlog on the Parsoid board. Feb 15 2020, 9:43 PM

ssastry mentioned this in T246906: VisualEditor causes extra edit conflict noise when headings include whitespace. Mar 5 2020, 2:47 PM

ssastry moved this task from Backlog to Feature requests on the Parsoid board. Mar 7 2020, 8:58 PM

Aklapper added a project: RESTBase. Jun 6 2021, 4:34 PM

Aklapper added a project: Epic.

Aklapper removed subscribers: • marcoil, RESTBase, • GWicke.