Page MenuHomePhabricator

Extract and use a token transformation interface (API) in place of custom token handlers
Closed, ResolvedPublic

Description

  • TokenHandler right now implements a stub (init, constructor, reset)
  • Most transformers have a handler for
    • All tokens (onAny)
    • Newline token (onNewline)
    • EOF token (onEOF)
    • Specific token types ← this would need some generic handler name / signature (onQuote, onListItem, etc.)
  • Flesh out TokenHandler abstract class based on the above

Not only is this going to be good documentation, this also makes the porting simpler and more performant

Event Timeline

ssastry triaged this task as Medium priority.Sep 24 2018, 8:34 PM
ssastry moved this task from Backlog to Post-Port Work on the Parsoid-PHP board.

Looks like at least for the sync token transformers, the following four handlers would be sufficient: onTag, onNewline, onEOF, onAny in priority order with the default implementation in the TokenHandler base class that simply returns the token unchanged. Individual transformers would override one or more of these handlers.

So, SyncTTM's transformTokens will end up with a double while loop (outer loop over transformers, inner loop over the transformer's handlers).

For the outer loop, order of transformers are defined by the order in which they are linked together in the parser pipeline.

For the inner loop, the order of handlers are predefined ( tag < newline < eof < any). So, all the handler registration (add/remove) and get logic can be removed. If required, we can honor a skip return flag in case the rest of the current transformer's handlers need to skipped.

This will make for much simpler and comprehensible sync logic. Since the PHP ported code will use the same sync loop (but blocking sync) for the current async logic, the token transformer will end up hugely simplified.

The performance impact of this change is however unclear. T209118: Add time usage and call count to TokenTransformMangers add, remove and getTransform functions showed that the add/remove/get takes about 3% of time so we aren't expecting any significant performance improvement from this change. However, now that every single token will go through every single handler of every single transformer (with the majority of them being the v => v identity/nop functions, this may potentially hurt performance.

So, we shouldn't roll this out without thoroughly evaluating perf impacts. Any change we make here will require (a) changes to transformTests.js & transformTests.php code (b) regeneration of token transformer test files.

I decided to a quick whack at this to see what shakes out ... this can actually lead to a good amount of code simplification. Performance doesn't seem to be impacted. The code right now I have makes this work with both the NewSyncTTM and the old SyncTTM, (with parser.js configuring with one or the other), but if we find that NewSyncTTM is going to work out, we can kill ranks, add, remove, get transform code from all handlers.

Change 483151 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] Simplify SyncTTM and handlers

https://gerrit.wikimedia.org/r/483151

Change 483151 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Simplify SyncTTM and handlers

https://gerrit.wikimedia.org/r/483151