
Create wikitext tokenizer with rules identical to Parser of MW Core
Open, Low, Public

Description

The CodeMirror extension tokenizes wikitext differently from the MW Core Parser.
This is not a problem for plain wikitext, but complex wikitext needs a different approach (see T108455 and T108450 for examples).
The main difference is that CodeMirror scans for tokens sequentially through the text, while the Parser searches for tokens across the whole text.

The problem is that a string that looks like a token at its beginning may turn out not to be one.
Incorrect syntax highlighting complicates visual perception, but backtracking to correct it reduces performance.

Perhaps the best way is a combined method: while an editor is writing an article the end of the text is not yet known, but it is probably more comfortable if the wikitext is highlighted anyway.
Alternatively, closing tokens could be added automatically: for example, if the editor types '{{', insert '}}' after the cursor.
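The auto-closing idea could look something like the sketch below. This is a hypothetical helper, not the extension's actual code; the `closerFor` name and the token table are assumptions made for illustration. The point is that inserting the matching closer keeps the text balanced, so the tokenizer never sees a dangling opener.

```javascript
// Hypothetical sketch of auto-closing tokens (not the real CodeMirror API):
// when the editor types an opening token, insert its closer after the cursor.
const CLOSERS = { '{{{': '}}}', '{{': '}}', '[[': ']]' };

// Given the text before the cursor, pick the closer to insert, if any.
function closerFor(textBeforeCursor) {
  // Check the longest openers first, so '{{{' wins over '{{'.
  const openers = Object.keys(CLOSERS).sort((a, b) => b.length - a.length);
  for (const open of openers) {
    if (textBeforeCursor.endsWith(open)) {
      return CLOSERS[open];
    }
  }
  return null;
}

console.log(closerFor('Some text {{')); // '}}'
console.log(closerFor('{{{'));          // '}}}'
console.log(closerFor('plain text'));   // null
```

Checking the longest opener first matters: after typing the third `{` of `{{{`, the editor should upgrade to a `}}}` closer rather than treat it as `{{` again.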

Event Timeline

Pastakhov raised the priority of this task from to Needs Triage.
Pastakhov updated the task description. (Show Details)
Pastakhov subscribed.

Not sure, but I think those tasks are different.
I'll look at how Parsoid works, thanks.

Pastakhov renamed this task from Create JS parser of wikitext similar Parser of MW Core to Create wikitext tokenizer with rules identical to Parser of MW Core.Aug 24 2015, 5:19 AM
Pastakhov updated the task description. (Show Details)
Pastakhov set Security to None.
Pastakhov added a subscriber: Florian.

The Parser's WikiText tokenizing is pretty complex (using multiple passes on the whole text). Fully matching it in real-time isn't likely to be possible. I think we should decline this task and instead try to address specific cases that are broken (some of which may not be fixable without degrading performance unacceptably).

If I remember correctly, I meant the order of the parser passes.
And performance should only increase. Currently the tokenizer sometimes (maybe always) goes back and parses the same text again, because when you look at the beginning of a string you cannot be sure which token it is until you find the end of the token.
For example, if you meet {{{{{ it can be:

{{{{{1}}}}} - a parameter inside a template transclusion
{{{{{ hello world - just five literal '{' characters
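The ambiguity above can be made concrete with a small sketch. The function below is a hypothetical illustration (not code from the extension): it shows that a run of five braces can only be classified by looking ahead for closers, which a strictly left-to-right tokenizer cannot do without backtracking.

```javascript
// Illustrative sketch: classify a run of five '{' by looking ahead.
// A sequential tokenizer would have to guess here and backtrack on failure.
function classifyFiveBraces(text, pos) {
  // pos points at the first '{' of a '{{{{{' run.
  const rest = text.slice(pos + 5);
  if (rest.includes('}}}}}')) {
    // e.g. '{{{{{1}}}}}': a '{{' transclusion wrapping a '{{{1}}}' parameter.
    return 'parameter inside template transclusion';
  }
  // e.g. '{{{{{ hello world': no closers, so just five literal braces.
  return 'plain text';
}

console.log(classifyFiveBraces('{{{{{1}}}}}', 0));
console.log(classifyFiveBraces('{{{{{ hello world', 0));
```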

The Parser's WikiText tokenizer works differently: it searches for tokens across the whole string.
For example, it first finds parameters, then templates, etc. This should be faster and more correct, but it is not suitable for cases where the string has not been written completely yet.
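That pass ordering can be sketched roughly as below. This is an assumed simplification of what the MW Core Parser does (the real Parser runs multiple, far more complex passes over the full text); the function name and the placeholder scheme are made up for illustration. Matching triple-brace parameters before double-brace templates is what lets `{{{{{1}}}}}` resolve correctly.

```javascript
// Rough sketch of multi-pass tokenizing over the whole text (an assumed
// simplification of the MW Core Parser's approach, not its real code).
function tokenizePasses(text) {
  const tokens = [];
  // Replace each match with a NUL-delimited placeholder so later passes
  // can still match around it.
  const placeholder = () => '\x00' + (tokens.length - 1) + '\x00';

  // Pass 1: parameters '{{{name}}}' (innermost braces first).
  text = text.replace(/\{\{\{([^{}]*)\}\}\}/g, (m, name) => {
    tokens.push({ type: 'parameter', name });
    return placeholder();
  });
  // Pass 2: templates '{{...}}', whose body may now contain placeholders.
  text = text.replace(/\{\{([^{}]*)\}\}/g, (m, body) => {
    tokens.push({ type: 'template', body });
    return placeholder();
  });
  return tokens;
}

// '{{{{{1}}}}}' yields a parameter token first, then the template around it.
console.log(tokenizePasses('{{{{{1}}}}}'));
```

Because each pass scans the whole string, the inner `{{{1}}}` is found before the surrounding `{{ ... }}`, with no backtracking. The trade-off, as noted above, is that this requires the complete text, which an editor has not finished typing yet.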

You could also have a look at my parser at https://de.wikipedia.org/wiki/Benutzer:Schnark/js/syntaxhighlight.js; for the template mess, search especially for "Multiple braces". My approach there isn't perfect, but I have never actually found a real-life instance where it broke.