
Parsoid needs to implement the Pre-Save Transform (PST)
Open, Medium, Public

Description

The PST is a separate parser task, which is exposed by (for example) https://gerrit.wikimedia.org/r/571109

Parsoid doesn't (yet) have any implementation of this.

There are a couple of ways we could implement this:

  1. As a wikitext-to-wikitext transformation that uses the main Parsoid tokenizer but operates on the token stream, rather than going all the way through wt2html -> html2wt as our current wt2wt tests do. We'd intercept the token stream, do signature processing, expand {{subst}}, and then serialize back to wikitext. We might need a token-level version of selser to handle the cases where we don't preserve 100% of the original whitespace formatting. The benefit is that { and [ bracket matching would be "guaranteed" by construction to match the wikitext handling precisely, even when new features like heredoc arguments are added or language variant syntax is involved.
  2. As a separate PEG grammar that reuses as many rules of the main tokenizer as possible, or maybe as a new entry point to our existing tokenizer that shares the same source code file.
  3. As a separate PST-only code base, maybe with a standalone spec that is wikitext-independent. This could be based on the legacy parser, the legacy preprocessor, or a new codebase (maybe using PEG). The PST in this version would be a separate thing, completely independent of wikitext parsing (although sharing some syntax features). If it doesn't match wikitext exactly in some corner cases (say, it doesn't support heredoc arguments in subst), that's a Documented Feature. This might also free the PST (or PST library?) to be tweaked in various ways for non-wikitext content types, like JavaScript.
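
To make option 1 a little more concrete, here is a minimal sketch in Python of a token-stream PST. It is not Parsoid's tokenizer or API: `Token`, `tokenize`, and `pre_save_transform` are hypothetical names, the toy tokenizer only recognizes `<nowiki>` spans and runs of three to five tildes, and the signature and timestamp strings are assumed to be supplied by the caller. {{subst}} expansion is left out entirely.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    kind: str  # "nowiki", "sig", or "text" (hypothetical token kinds)
    src: str   # exact source text, kept so untouched tokens round-trip


# Toy tokenizer rules (an assumption, far simpler than the real grammar):
# a <nowiki>...</nowiki> span, or a run of 3 to 5 tildes.
_TOKEN_RE = re.compile(
    r"(?P<nowiki><nowiki>.*?</nowiki>)|(?P<sig>~{3,5})",
    re.DOTALL | re.IGNORECASE,
)


def tokenize(wikitext):
    """Split wikitext into tokens that together cover every input character."""
    tokens = []
    pos = 0
    for m in _TOKEN_RE.finditer(wikitext):
        if m.start() > pos:
            tokens.append(Token("text", wikitext[pos:m.start()]))
        tokens.append(Token(m.lastgroup, m.group()))
        pos = m.end()
    if pos < len(wikitext):
        tokens.append(Token("text", wikitext[pos:]))
    return tokens


def pre_save_transform(wikitext, user_sig, timestamp):
    """Rewrite signature tokens and serialize everything else verbatim."""
    out = []
    for tok in tokenize(wikitext):
        if tok.kind == "sig":
            n = len(tok.src)
            if n == 3:
                out.append(user_sig)                   # ~~~   signature only
            elif n == 4:
                out.append(f"{user_sig} {timestamp}")  # ~~~~  signature and timestamp
            else:
                out.append(timestamp)                  # ~~~~~ timestamp only
        else:
            out.append(tok.src)  # text and nowiki tokens are emitted unchanged
    return "".join(out)


if __name__ == "__main__":
    print(pre_save_transform(
        "Hi ~~~~ <nowiki>~~~~</nowiki>",
        "[[User:A|A]]",
        "12:00, 1 January 2020 (UTC)",
    ))
```

Because every token carries its original source text, anything the transform does not touch is serialized back byte for byte, which is the property a token-level selser would need to provide for the real token stream.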

Event Timeline

cscott created this task. Mar 6 2020, 6:03 PM
Restricted Application added subscribers: Liuxinyu970226, Aklapper. Mar 6 2020, 6:03 PM
ssastry triaged this task as Medium priority. Mar 11 2020, 4:08 AM

Change 599907 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] Add ~ to text_char

https://gerrit.wikimedia.org/r/599907

Arlolra added a subscriber: Arlolra. Jun 1 2020, 5:02 PM

The above patch and https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/256052 remove some support for PST from the grammar. Noting them here in case they need to be restored.

cscott updated the task description. Jun 1 2020, 5:09 PM

Change 599907 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Add ~ to text_char

https://gerrit.wikimedia.org/r/599907

Change 603571 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/vendor@master] Bump Parsoid to v0.12.0-a16

https://gerrit.wikimedia.org/r/603571

Change 603571 merged by jenkins-bot:
[mediawiki/vendor@master] Bump Parsoid to v0.12.0-a16

https://gerrit.wikimedia.org/r/603571