The Pre-Save Transform: https://www.mediawiki.org/wiki/Pre-save_transforms
The PST is a parser operation separate from ordinary parsing; it is exposed (for example) by https://gerrit.wikimedia.org/r/571109
Parsoid doesn't (yet) have any implementation of this.
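For concreteness, here's a minimal sketch of the tilde-signature part of the PST's user-visible behavior: three tildes produce a signature, four a signature plus timestamp, five a timestamp only. {{subst:...}} expansion requires the template machinery and is elided, and the real PST also avoids expanding tildes inside <nowiki> and similar contexts. The function name and placeholder signature below are invented for illustration.

```php
<?php
// Tilde-signature expansion, the simplest PST behavior. strtr() tries
// the longest keys first, so '~~~~~' wins over '~~~~' and '~~~'.
function pstSignatures(string $wikitext, string $userLink): string {
    $ts = gmdate('H:i, j F Y') . ' (UTC)';
    return strtr($wikitext, [
        '~~~~~' => $ts,               // five tildes: timestamp only
        '~~~~'  => "$userLink $ts",   // four: signature + timestamp
        '~~~'   => $userLink,         // three: signature only
    ]);
}

// echo pstSignatures('Agreed. ~~~~', '[[User:Example|Example]]');
// => 'Agreed. [[User:Example|Example]] 14:05, 1 January 2024 (UTC)'
```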
There are a few ways we could implement this:
- As a wikitext-to-wikitext transformation that uses the main Parsoid tokenizer but operates on the token stream, rather than going all the way through wt2html -> html2wt as our current wt2wt tests do. We'd intercept the token stream, do signature processing, expand {{subst}}, and then serialize back to wikitext. We might need some version of selser at the token level to handle cases where we don't preserve 100% of the original's whitespace formatting. The benefit is that { and [ bracket-matching would be "guaranteed" by construction to match the wikitext handling precisely, even when new features like heredoc arguments are added or language-variant syntax is involved. (Sketched first after this list.)
- As a separate PEG grammar that reuses as many rules of the main tokenizer as possible; or perhaps as a new entry point into our existing tokenizer, sharing the same grammar source file. (Second sketch below.)
- As a separate PST-only code base, perhaps with a standalone, wikitext-independent spec. This could be based on the legacy parser, the legacy preprocessor, or a new codebase (perhaps using PEG). In this version the PST would be completely independent of wikitext parsing (although sharing some syntax features). If it didn't match wikitext exactly in some corner cases (say, not supporting heredoc arguments inside subst), that would be a Documented Feature. This might also free the PST (or a PST library?) to be adapted for non-wikitext content types, like JavaScript. (Third sketch below.)
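First, a sketch of the token-stream option under stated assumptions: the token classes and functions below are invented stand-ins, not Parsoid's real token API. The key idea is that every token keeps its original source text, so regions the PST doesn't touch round-trip byte-for-byte; that is the token-level "selser" mentioned above.

```php
<?php
// Hypothetical token types; Parsoid's real token classes differ.
interface Token {
    public function toWikitext(): string;
}

final class TextToken implements Token {
    public function __construct(public readonly string $text) {}
    public function toWikitext(): string { return $this->text; }
}

final class TemplateToken implements Token {
    public function __construct(
        public readonly string $target,  // e.g. "subst:foo"
        public readonly string $source   // original wikitext span, kept for selser
    ) {}
    public function toWikitext(): string { return $this->source; }
}

/**
 * The PST pass: rewrite only the tokens the PST cares about; everything
 * else flows through with its original source text intact.
 * @param Token[] $tokens
 * @return Token[]
 */
function pstPass(array $tokens, callable $expandTemplate): array {
    $out = [];
    foreach ($tokens as $tok) {
        if ($tok instanceof TemplateToken && preg_match('/^subst:/i', $tok->target)) {
            // {{subst:...}} is replaced by its expansion at save time.
            $out[] = new TextToken($expandTemplate($tok->target));
        } else {
            $out[] = $tok;
        }
    }
    return $out;
}

/** @param Token[] $tokens */
function tokensToWikitext(array $tokens): string {
    return implode('', array_map(fn(Token $t) => $t->toWikitext(), $tokens));
}
```

Because the tokens come from the main tokenizer, the bracket matching is the tokenizer's own, which is the "guaranteed by construction" property this option is after.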
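Second, a sketch of the shared-grammar structure: plain methods stand in for PEG productions (all names invented) to show one grammar source with a shared rule and two entry points. In the real tokenizer these would be productions in the same grammar file, with the PST entry point treating everything outside the shared constructs as literal text.

```php
<?php
// Shared-rule sketch; methods are invented stand-ins for PEG rules.
final class MiniGrammar {
    // Shared "rule": match a {{...}} span starting at $pos; returns the
    // end offset, or null on no match. (The real rule does proper nested
    // bracket matching, heredoc arguments, etc.)
    public static function templateSpan(string $s, int $pos): ?int {
        if (substr($s, $pos, 2) !== '{{') {
            return null;
        }
        $close = strpos($s, '}}', $pos + 2);
        return $close === false ? null : $close + 2;
    }

    // Entry point for the full parse: tokenizes everything (elided here).
    public static function startFull(string $s): array {
        return []; // ...full tokenization...
    }

    // Entry point for the PST: scans for just the PST-relevant spans,
    // reusing the shared rule so bracket matching can't diverge between
    // the two grammars.
    public static function startPst(string $s): array {
        $spans = [];
        $len = strlen($s);
        for ($i = 0; $i < $len; $i++) {
            $end = self::templateSpan($s, $i);
            if ($end !== null) {
                $spans[] = [$i, $end]; // candidate {{subst:...}} span
                $i = $end - 1;
            }
        }
        return $spans;
    }
}
```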
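Third, a sketch of the standalone-library shape: per-content-model PST handlers decoupled from wikitext parsing (all names invented). This is the "PST library" idea, where a handler for JavaScript (or any other content model) could implement its own save-time rewrites without knowing anything about wikitext syntax.

```php
<?php
// A standalone PST library with its own small spec, independent of
// wikitext parsing. All interface/class names are hypothetical.
interface PreSaveTransform {
    public function transform(string $text, array $saveContext): string;
}

final class WikitextPST implements PreSaveTransform {
    public function transform(string $text, array $saveContext): string {
        // Signature handling only; a real implementation would also do
        // {{subst:...}} and the pipe trick, per its standalone spec.
        return str_replace('~~~~', $saveContext['signature'], $text);
    }
}

final class PSTRegistry {
    /** @var array<string, PreSaveTransform> */
    private array $handlers = [];

    public function register(string $contentModel, PreSaveTransform $pst): void {
        $this->handlers[$contentModel] = $pst;
    }

    public function transform(string $model, string $text, array $ctx): string {
        // Content models without a handler pass through unchanged.
        return isset($this->handlers[$model])
            ? $this->handlers[$model]->transform($text, $ctx)
            : $text;
    }
}

// Usage:
// $registry = new PSTRegistry();
// $registry->register('wikitext', new WikitextPST());
// $registry->transform('wikitext', 'Agreed. ~~~~',
//     ['signature' => '[[User:Example|Example]] 14:05, 1 January 2024 (UTC)']);
```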