
Introduce compound tokens in the parsing pipeline
Closed, Resolved, Public

Description

In the tokenizer (where the structure is clear), and again partway through processing once the TokenStreamPatcher completes, it should be possible to introduce some kinds of compound tokens so that downstream stages do not needlessly process individual tokens.

For example, table tags might be a good compound token to introduce. List, Indent-Pre, and Paragraph wrapping could then process the content of table cells only in nested transformers with clean state. And where sufficient information is available about the content (e.g. no list items, no newlines, no indent-pre), those transformers can be skipped entirely. This can improve performance on pages with large tables by greatly reducing the volume of tokens processed.
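The idea above can be sketched roughly as follows. This is only an illustrative Python sketch, not Parsoid's code (Parsoid is written in PHP). The names `Token`, `CompoundToken`, `group_cells`, and `list_pass`, along with the `td` / `end_td` token kinds, are hypothetical. It assumes cells are not nested and that the grouping step can record simple facts about a cell's content as it builds the compound token.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Token:
    # Hypothetical flat token; kind is e.g. "text", "newline", "listitem", "td", "end_td".
    kind: str
    value: str = ""


@dataclass
class CompoundToken:
    # Hypothetical compound token wrapping the content of one table cell.
    kind: str
    children: List[Token] = field(default_factory=list)
    # Content summary recorded while grouping, so later passes can skip the cell.
    has_newline: bool = False
    has_list_item: bool = False


Item = Union[Token, CompoundToken]


def group_cells(tokens: List[Token]) -> List[Item]:
    """Fold each td ... end_td run into a single CompoundToken (no nesting)."""
    out: List[Item] = []
    current = None
    for tok in tokens:
        if tok.kind == "td" and current is None:
            current = CompoundToken("cell")
        elif tok.kind == "end_td" and current is not None:
            out.append(current)
            current = None
        elif current is not None:
            current.children.append(tok)
            current.has_newline |= tok.kind == "newline"
            current.has_list_item |= tok.kind == "listitem"
        else:
            out.append(tok)
    if current is not None:
        # Unclosed cell at end of input: emit what was collected.
        out.append(current)
    return out


def list_pass(items: List[Item]) -> int:
    """Stand-in for a list handler; returns how many tokens it had to inspect."""
    inspected = 0
    for item in items:
        if isinstance(item, CompoundToken):
            if not item.has_list_item:
                # Nothing list-related inside: skip the whole cell.
                continue
            # Otherwise the cell would be handled by a nested pass with clean state.
            inspected += len(item.children)
        else:
            inspected += 1
    return inspected


tokens = [
    Token("text", "a"),
    Token("td"), Token("text", "x"), Token("text", "y"), Token("end_td"),
    Token("td"), Token("listitem", "*"), Token("text", "z"), Token("end_td"),
]
grouped = group_cells(tokens)
inspected = list_pass(grouped)
```

In this toy input the first cell contains no list items, so the list pass never looks inside it. On a page with large tables, most cells would presumably fall into that category, which is where the saving would come from.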

Event Timeline

ssastry triaged this task as Medium priority. Nov 24 2020, 12:19 AM
ssastry moved this task from Needs Triage to Tech Debt / Big changes on the Parsoid board.

Change #1141985 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/services/parsoid@master] WIP: Construct a compound ListTk token for wiki-lists

https://gerrit.wikimedia.org/r/1141985

Change #1142781 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/services/parsoid@master] WIP: Add IndentPre compound token

https://gerrit.wikimedia.org/r/1142781

The patches above create compound tokens for List & Indent-Pre. Tables are a bit trickier -- I haven't looked into it.

Change #1141985 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Construct a compound ListTk token for wiki-lists

https://gerrit.wikimedia.org/r/1141985

Change #1142781 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Add IndentPre compound token

https://gerrit.wikimedia.org/r/1142781

Change #1147810 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.22.0-a3

https://gerrit.wikimedia.org/r/1147810

Change #1147810 merged by jenkins-bot:

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.22.0-a3

https://gerrit.wikimedia.org/r/1147810

Change #1149535 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/services/parsoid@master] Introduce LineTk to more efficiently process tokens

https://gerrit.wikimedia.org/r/1149535

Change #1149770 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/services/parsoid@master] Introduce EmptyLineTk to replace the EmptyLine meta tag

https://gerrit.wikimedia.org/r/1149770

This is now set up, and some compound tokens are being created. It does improve performance on some bigger pages. I am going to resolve this.

Change #1149770 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Introduce EmptyLineTk to replace the EmptyLine meta tag

https://gerrit.wikimedia.org/r/1149770

Change #1154863 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.22.0-a6

https://gerrit.wikimedia.org/r/1154863

Change #1154863 merged by jenkins-bot:

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.22.0-a6

https://gerrit.wikimedia.org/r/1154863

Change #1162957 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.22.0-a3

https://gerrit.wikimedia.org/r/1162957

Change #1162957 abandoned by Jgiannelos:

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.22.0-a3

https://gerrit.wikimedia.org/r/1162957

Change #1149535 abandoned by Subramanya Sastry:

[mediawiki/services/parsoid@master] Introduce LineTk to more efficiently process tokens

Reason:

Not worth it for now. The reasoning is documented in T395082, and we can restore the patch if we pick up the work again later.

https://gerrit.wikimedia.org/r/1149535