Page MenuHomePhabricator

Refactor WTS to be async
Closed, ResolvedPublic

Description

WTS is currently sync. It would be useful to refactor this to allow for async components of WTS.

In particular:

  • serializing images might need to do an imageinfo query (we work around this at the moment by leaning on data-file-width and data-file-height attributes in the input HTML).
  • TemplateData might need to be consulted when serializing template attributes (T104599)

There are also a few internal interfaces to Parsoid which we wanted to make return Promises, but had to add evil sync backdoors specifically for the use of WTS. It would be good to get rid of these.

Event Timeline

cscott assigned this task to ssastry.
cscott raised the priority of this task from to Needs Triage.
cscott updated the task description. (Show Details)
cscott added a project: Parsoid.
cscott subscribed.
ssastry triaged this task as Medium priority.
ssastry set Security to None.

Some notes from @ssastry,

I think a stepwise refinement might be one way to do it.

If you restrict async behavior to just the individual tag handlers (and
leave the rest of it sync), this may not be too difficult. So, images,
templates, and extensions are the only thing that might need to wait on
additional information, i.e. just pause additional work. The
disadvantage of the simple approach is that we'll be twiddling our
thumbs while results come back.

A more sophisticated strategy might push top-level children of <body>
into a queue and and serialize them in parallel, which might help hide
latencies from template data fetch. But, this might just be a wrapper on
top of the previous pause-and-wait approach which would then stitch the
individual pieces of wikitext together with appropriate separators.
There might be some selser and separator handling gotchas .. but might work.

There might be other latency-hiding strategies, but for starters, a
simple pause-and-resume solution might get us out the door. In most
edits and selser scenarios, there should not be any template-data
fetches which should leave the latencies unchanged (but will affect full
serialization latencies - which only matters for rt-testing and Flow,
but flow content is usually not very complex and it shouldn't matter
there anyway).

And, Flow will also have to move to use selective serialization if they want to address complaints of normalization of wikitext between wt <-> VE switches ... they already have it on their todo list.

One fairly elegant way to handle possibly-async constructs I have used a couple times recently is this:

  • Sync pass:
    • have an array accumulator for results
    • if the operation is sync, push the value to the accumulator
    • if the operation is async, push a promise for the value to the accumulator
  • Async result collection pass:
    • Iterate over the accumulator with Promise.each (Bluebird) and emit each result / chunk in a streaming fashion, or wait for all promises to resolve with Promise.all (ES6).

More notes from @ssastry,

I see the html -> wt operation as a reduction operation on the tree,
where you can recursively (and in parallel, and independently) compute
the wikitext output for a subtree, and in a separate combining pass,
compute the required newline separators + nowiki wrappers. But, that is
a much bigger rewrite and there might be some ugly details that crop up.
So, probably not worth going there quite yet, and nothing much to be
gained there right now either.

So, I prefer an accumulator model .. i.e. walk the tree and use a
reduce/accumulator interface to combine the output of the next node into
the already emitted wikitext. The accumulator / reducer can figure out
how to combine the wikitext emitted by a new node N, generate the
newline/whitespace separator required, as well as any required nowiki
wrappers. This reducer will probably buffer output on a per-line basis,
so the reduction operation might be somewhat delayed.

Change 251194 had a related patch set uploaded (by Arlolra):
T115720: Refactor WTS to be async

https://gerrit.wikimedia.org/r/251194

The patch implement:

for starters, a simple pause-and-resume solution might get us out the door

Arlolra moved this task from Needs Triage to In Progress on the Parsoid board.

Change 251194 merged by jenkins-bot:
T115720: Refactor WTS to be async

https://gerrit.wikimedia.org/r/251194