Page MenuHomePhabricator

Fix token datastructure to fix potential perfomance issue
Open, MediumPublic

Description

Right now, Parsoid stores attributes of a token as an array of (k,v,srcOffsets) triple in the Token.attribs property.
However, an attribute lookup is now an array scan (see Util.lookup) which is unnecessarily expensive.
A better attribs structure would be a map.

However, there seems to be two issues that get in the way of making this fix.

  1. key lookup is whitespace insensitive
  2. based on code in setAttribute in parser.defines.js, keys need not be strings.
  3. code in transformers seem to assume ordering of attributes (that attribute 0 is the template name, for example). Also, the order actually matters in some cases like template args.

(1) might be easier to work around.

For (2) and (3), the solution might be related. There are exactly two instances of new KV(tu.flattenIfArray(..), ...) in the PEG tokenizer and in this case, the key is an array of tokens. Both of them are where the template name or template arg (rare on a top-level page) is itself templated. So, in these cases, attribute key is not a string. Looks like the right fix is to add a synthetic kv pair for the template name instead of implicitly assuming the first attribute is the template name or template arg. i.e. new KV('templatename', array-or-string-here). The trouble here that needs fixing is a potential conflict with a name template arg called 'templatename'.

In any case, this array-scan based attribute lookup is probably a perf. hole waiting to be fixed.

Event Timeline

ssastry triaged this task as Medium priority.Sep 25 2018, 7:34 PM
ssastry moved this task from Backlog to Parsoid Fixes on the Parsoid-PHP board.

Change 462806 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] Remove unnecessary check for whether an attribute key is a non-string

https://gerrit.wikimedia.org/r/462806

Overall, based on poking at this for a bit today, this is going to be a little tricky and might need to be done in multiple steps.

Change 462806 abandoned by Subramanya Sastry:
Remove unnecessary check for whether an attribute key is a non-string

Reason:
Till template tokens are fixed to use string keys, this won't work.

https://gerrit.wikimedia.org/r/462806