Maniphest T205479

Fix token data structure to fix potential performance issue
Open, Medium, Public

Assigned To

None

Authored By

	ssastry
	Sep 25 2018, 7:34 PM

Description

Right now, Parsoid stores the attributes of a token as an array of (k, v, srcOffsets) triples in the Token.attribs property.
As a result, every attribute lookup is a linear array scan (see Util.lookup), which is unnecessarily expensive.
A map would be a better structure for attribs.
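
To make the cost difference concrete, here is a minimal sketch contrasting the two lookup strategies. The KV class mirrors the (k, v, srcOffsets) triple described above, but lookupByScan and buildAttribIndex are hypothetical names for illustration and are not Parsoid's actual Util.lookup code. The sketch also assumes the first matching key wins.

```javascript
// Hypothetical stand-in for a token attribute; the field names follow the
// (k, v, srcOffsets) triple described in this task.
class KV {
  constructor(k, v, srcOffsets = null) {
    this.k = k;
    this.v = v;
    this.srcOffsets = srcOffsets;
  }
}

// Array-scan lookup: O(n) per call. Illustrative only, not Util.lookup itself.
function lookupByScan(attribs, key) {
  for (const kv of attribs) {
    if (kv.k === key) {
      return kv.v;
    }
  }
  return null;
}

// Build a Map index once; later lookups are O(1) on average.
function buildAttribIndex(attribs) {
  const index = new Map();
  for (const kv of attribs) {
    // Assumption: keep the first occurrence so results match the scan above.
    if (!index.has(kv.k)) {
      index.set(kv.k, kv);
    }
  }
  return index;
}

const attribs = [
  new KV('href', 'Foo'),
  new KV('class', 'mw-link'),
  new KV('href', 'Bar'),
];
const index = buildAttribIndex(attribs);
```

With this setup, lookupByScan(attribs, 'href') and index.get('href').v both yield 'Foo', but only the scan walks the array on every call.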

However, there seem to be three issues that get in the way of making this fix:

(1) Key lookup is whitespace-insensitive.
(2) Based on the code in setAttribute in parser.defines.js, keys need not be strings.
(3) Code in the transformers seems to assume an ordering of attributes (that attribute 0 is the template name, for example). Also, the order actually matters in some cases, like template args.

(1) might be easier to work around.
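
One possible workaround for the whitespace issue is to normalize keys on both insert and lookup. The sketch below assumes that "whitespace insensitive" means leading and trailing whitespace is ignored; normalizeKey and AttribMap are hypothetical names, not existing Parsoid code.

```javascript
// Assumption: whitespace-insensitive means surrounding whitespace is ignored.
// Non-string keys (see the second issue) are passed through untouched.
function normalizeKey(key) {
  return typeof key === 'string' ? key.trim() : key;
}

// Hypothetical wrapper that applies the same normalization on every access.
class AttribMap {
  constructor() {
    this.map = new Map();
  }

  set(key, value) {
    this.map.set(normalizeKey(key), value);
  }

  get(key) {
    const k = normalizeKey(key);
    return this.map.has(k) ? this.map.get(k) : null;
  }
}

const attrs = new AttribMap();
attrs.set(' href ', 'Foo');
const found = attrs.get('href');
const missing = attrs.get('title');
```

Here found is 'Foo' even though the key was stored with surrounding spaces, and missing is null.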

For (2) and (3), the solutions might be related. There are exactly two instances of new KV(tu.flattenIfArray(..), ...) in the PEG tokenizer, and in both the key is an array of tokens. Both occur where the template name or template arg (rare on a top-level page) is itself templated. So, in these cases, the attribute key is not a string. It looks like the right fix is to add a synthetic kv pair for the template name, i.e. new KV('templatename', array-or-string-here), instead of implicitly assuming that the first attribute is the template name or template arg. The trouble that still needs fixing is a potential conflict with a named template arg called 'templatename'.
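
As one way to picture the 'templatename' conflict and a possible way around it, the sketch below keeps the template target in its own slot rather than in the same key space as the args. This is only an illustration of one option, not the fix chosen for Parsoid; TemplateAttribs, target, setArg, and getArg are all hypothetical names, and the token array is made-up sample data.

```javascript
// Hypothetical container: the template target lives in its own slot, so it
// cannot collide with a user-supplied argument named 'templatename'.
class TemplateAttribs {
  constructor(target) {
    // A string, or an array of tokens when the name is itself templated.
    this.target = target;
    // Map iterates in insertion order, so argument order is retained.
    this.args = new Map();
  }

  setArg(name, value) {
    this.args.set(name, value);
  }

  getArg(name) {
    return this.args.has(name) ? this.args.get(name) : null;
  }
}

// Made-up sample: a templated template name represented as an array.
const tpl = new TemplateAttribs(['{{', 'echo|foo', '}}']);
tpl.setArg('templatename', 'user value');
tpl.setArg('1', 'first positional');
const argOrder = Array.from(tpl.args.keys());
```

A Map is used rather than a plain object because a plain object enumerates integer-like keys such as '1' before other string keys, which would lose the argument order that the transformers rely on.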

In any case, this array-scan based attribute lookup is probably a perf. hole waiting to be fixed.

Details

	Subject	Repo	Branch	Lines +/-
	Remove unnecessary check for whether an attribute key is a non-string	mediawiki/services/parsoid	master	+1 -1

Related Objects

Mentioned In: T205491: QuoteTransformer and quote-tokens use a ".value" property on the token instead of adding it to the token's attributes

Event Timeline

ssastry created this task. Sep 25 2018, 7:34 PM

Restricted Application added a subscriber: Aklapper. Sep 25 2018, 7:34 PM

ssastry triaged this task as Medium priority. Sep 25 2018, 7:34 PM

ssastry moved this task from Backlog to Parsoid Fixes on the Parsoid-PHP board.

Change 462806 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] Remove unnecessary check for whether an attribute key is a non-string

https://gerrit.wikimedia.org/r/462806

gerritbot added a project: Patch-For-Review. Sep 25 2018, 7:59 PM

ssastry updated the task description. Sep 25 2018, 8:39 PM

ssastry updated the task description. Sep 25 2018, 8:48 PM

ssastry mentioned this in T205491: QuoteTransformer and quote-tokens use a ".value" property on the token instead of adding it to the token's attributes. Sep 25 2018, 8:52 PM

ssastry updated the task description. Sep 25 2018, 9:07 PM

Overall, based on poking at this for a bit today, this is going to be a little tricky and might need to be done in multiple steps.

Change 462806 abandoned by Subramanya Sastry:
Remove unnecessary check for whether an attribute key is a non-string

Reason:
Till template tokens are fixed to use string keys, this won't work.

https://gerrit.wikimedia.org/r/462806

ssastry removed a project: Patch-For-Review. Sep 25 2018, 11:08 PM

ssastry moved this task from Parsoid Fixes to Backlog on the Parsoid-PHP board. Feb 5 2019, 11:07 PM

ssastry moved this task from Backlog to Performance on the Parsoid-PHP board. Apr 16 2019, 9:01 PM

ssastry moved this task from Performance to Post-Port Work on the Parsoid-PHP board. Sep 10 2019, 4:45 AM

ssastry moved this task from Post-Port Work to Performance on the Parsoid-PHP board. Dec 8 2019, 3:22 AM

Aklapper edited projects, added Parsoid; removed Parsoid-PHP. Apr 10 2020, 4:27 PM

ssastry moved this task from Needs Triage to Performance on the Parsoid board. Apr 10 2020, 4:51 PM
