Page MenuHomePhabricator

Investigation: Highlighting meaningful whitespace
Closed, ResolvedPublic8 Estimated Story PointsSpike

Description

Proposed changes to CodeMirror ext

Add a separate button (from the 'highlighter' syntax highlighting), to highlight meaningful whitespace with whitespace symbols. This would ignore any whitespace without impact on the final format of the template.

Whitespace.png (791×1 px, 245 KB)

Open questions for investigation

The following questions apply to the CodeMirror extension within two different editors: 2017 Wikitext editor (source view inside VE) and the 2010 wikitext editor.

  1. Is it possible to differentiate between meaningful and not-meaningful whitespace?
  2. How many different types of whitespace would be useful to show with a symbol? Currently showing individual spaces and newlines. Potentially add tabs and non-breaking spaces.
  3. Is it possible to apply only to source code for templates? Are there other situations where this would be useful?
  4. How might this proposal be affected by the stacked surface issue described in T254976?

Scope

  • This ticket is not about implementing whitespace symbols yet, we would just like to understand better the opportunities and limitations related to the topic.
  • This feature would only apply to syntax highlighting in the template namespace.
  • Other non-latin writing systems have special whitespace characters. Would be great to note them as they come up, but for this investigation we will limit to those used in the latin letters.

Notes
Should be done after T254976, as it's simpler and a better place to get familiar with CodeMirror. Any lessons learned there will likely make tacking this one easier.

Event Timeline

Restricted Application changed the subtype of this task from "Task" to "Spike". · View Herald Transcript
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
ECohen_WMDE renamed this task from CodeMirror Investigation: Highlighting meaningful whitespace to Investigation: Highlighting meaningful whitespace.Jun 10 2020, 9:50 AM

Notes:

  1. With "meaningful" we mean whitespace characters that actually affect how pages look in the end. However, this still leaves some gray area. It's still a continuum. On one end are examples like {{#if:___ where whitespace (here marked with underscores) does not matter at all. Can be zero or more spaces, can be line breaks, anything. On the other end is whitespace that would actually destroy the page if it's missing, e.g. the space in {{Infobox Asteroid}}. A weird edge-case is for example {{Infobox__Asteroid}}. The whitespace does have relevance, but it doesn't matter how much it is. How to highlight this?
  2. I can see two ways to go here. Either focus only on whitespace typically used in programming. That is spaces, line breaks, and tabs. Or highlight all whitespace, including non-breaking spaces, ideographic space, and many others. While the first highlights spaces that are used all the time, the later is more about highlighting copy-paste mistakes and such.
  3. I can see two approaches here. Either decide per namespace. Or make it so that this extra highlighting only happens inside of actual syntax, e.g. only inside {…}, but never in normal text. Note this might miss some relevant whitespace characters at the borders between normal text and syntax.
Lena_WMDE set the point value for this task to 8.Jul 16 2020, 9:01 AM

Change 616534 had a related patch set uploaded (by Thiemo Kreuz (WMDE); owner: Thiemo Kreuz (WMDE)):
[mediawiki/extensions/CodeMirror@master] [POC] Change MediaWiki mode to highlight (meaningful) whitespace

https://gerrit.wikimedia.org/r/616534

I started digging a bit deeper into the source code of the MediaWiki-extensions-CodeMirror extension. Some findings:

  • It looks like we don't need to touch anything but the CodeMirror codebase.
  • There is a syntax highlighting mode inside VisualEditor. The code for this is in the CodeMirror codebase as well. Look for files starting with ve. I was not able to test this, but as far as I can tell we don't need to touch this. When we change the syntax highlighting, it will "magically" work everywhere.
  • The majority of the code was written by @Pastakhov in 2014 in patch https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CodeMirror/+/163544 (direct link to the relevant file: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CodeMirror/+/163544/36/resources/mode/mediawiki/mediawiki.js).
  • What CodeMirror implements is indeed an actual tokenizer/parser. In other words: Yes, CodeMirror re-implements a good chunk of what the actual MediaWiki-Parser does (as well as Parsoid). There is no guarantee that the syntax highlighting created by CodeMirror matches what the actual parser does. There are many, many edge-cases and bugs, most of them probably never reported.
  • We need to learn and understand two concepts the code heavily refers to: streams and states. A stream is the stream of incoming characters. There is a class StringStream in the code. A state is just a JSON object that can be stuffed with whatever is needed. It is initialized in startState.
  • It looks like everything we want is possible. We can not only detect whitespace, we are aware of the context (e.g. inside vs. outside of a parser function).
  • Styling via CSS is not only preferred, I think it's the only way. I'm not sure how easy it is to insert replacement characters as shown in the mock, but there is probably a CSS trick to do this.

In this proof-of-concept I highlight spaces in two different contexts with two different shades of green:

Screenshot from 2020-07-27 16-58-02.png (220×451 px, 20 KB)

  • Yes, it is possible to highlight spaces only inside syntax elements, but not in the regular text. We don't even need to look at the namespace to do this. Usually, pretty much everything in a template is wrapped in some syntax elements, and only very few pieces are "static text". In contrast, virtually no article contains parser functions. This is in fact forbidden in most community guidelines. Our rule for the syntax highlighter can be the same, but the result will look different because of this fundamental difference between the Template: and all other namespaces.
  • I think there are no conflicts with T254976: Investigation: Bracket matching .
  • It's still an open question what we consider "meaningful". I suggest to discuss this in a separate task, collect all possible syntax elements, mark all places where whitespace might appear, and start highlighting what we find critical one by one.
  • I suggest to highlight only regular spaces and tabs, with 2 different symbols, but nothing else. This makes it possible to (kind of) see the difference between expected spaces, and other unexpected whitespace characters.
  • Alternatively we can come up with 3 styles: space, tab, and special highlighting for all other (invisible) whitespace characters. This would act as some kind of "unexpected character" warning.

Forgot to mention T243835#6339118. There is apparently a plan to upgrade to CodeMirror 6, which will be incompatible. I'm not sure what this means for this task. I think there is no other way than implementing it twice. However, the second time should be faster. At that point we know exactly which characters we want to highlight in which context, and how.

Note that an update to CodeMirror 6 means the wikitext language add-on needs to be rewritten first. Currently we don't know when this will happen, and which team will do it.

I had a quick look at the version 6 source, and it appears very different. While "streams" and "states" still appear somewhere, language add-ons are now managed via "grammars" for the "Lezer parser system". See https://lezer.codemirror.net, and https://github.com/lezer-parser/javascript/tree/master/src for an example parser. There is now an "input" (probably the same as the "stream" before), tokens, and a stack.

@thiemowmde wrote:

I suggest to highlight only regular spaces and tabs, with 2 different symbols, but nothing else. This makes it possible to (kind of) see the difference between expected spaces, and other unexpected whitespace characters.

I agree with Thiemo's notes about questions (1-3). Figuring out which whitespace is relevant is a huge effort, and anyway as a user I would expect to see all whitespace highlighted when I toggle visibility. I'd +1 for a simple toggle button, available in every namespace and with no special intelligence built-in. One strong argument IMO is, we can't know exactly what the editor will *think* is relevant. Maybe they're cleaning up double-spaces in normal text, for example, which isn't technically relevant but is important for the use case.

Either way, this additional feature of calculating whitespace relevance can be isolated, postponed, and integrated later if we decide it's needed.

  • It looks like we don't need to touch anything but the CodeMirror codebase.

mediawiki/extensions/CodeMirror, I'm assuming. Great news!

  • Styling via CSS is not only preferred, I think it's the only way. I'm not sure how easy it is to insert replacement characters as shown in the mock, but there is probably a CSS trick to do this.

Perhaps, or we can at least show a background color. Or, if we choose to highlight all whitespace then we could switch to a custom webfont with visible space glyphs.

There is apparently a plan to upgrade to CodeMirror 6, which will be incompatible. I'm not sure what this means for this task. I think there is no other way than implementing it twice.

Sounds right, so we can do the easy stuff now but want to avoid coupling any major investments to CodeMirror 5.

After this investigation and a discussion on the above, we decided to eliminate the 'meaningful' part of this feature. Distinguishing between the two is not a clear line and the added complexity does not have a clear pay off. Instead, it's much simpler to highlight all whitespaces following the same rules everywhere and the symbols will still be helpful in trouble-shooting and finding the source of errors caused by whitespace. The user then decides what is meaningful in their case.

We see the feature being split into two steps (but both would be toggled on/off as a single tool):

  • Base scope: Highlight all whitespace by replacing it with symbols in light grey - dots for individual spaces and tabs with dashes, possibly include a character for non-breaking spaces.
  • Nice-to-have: Find potentially problematic areas automatically and highlight the symbols with a light red background. This can include trailing spaces at the end of lines and double spaces.

Also, we are investigating a separate way to communicate newlines, so it would not be shown with a symbol: T259252: Investigation: Show line numbers along with syntax highlighting

Lastly, recording here in case it is useful for the future: Sublime Text includes this feature currently and there are additional plug-ins written for it as well. A screenshot of sublime taken from the stackoverflow discussion gives a good preview of the current intention, showing dots for spaces and dashes for tabs:

maIXK.png (676×1 px, 96 KB)

For the sake of documentation, these are some of our findings when we explored the idea of whitespace being "meaningful":

  • Let's look at an example: {{ a b | c d | e f = g h }}
    • All spaces between two characters are meaningful, …
    • … except that the template name a b can as well be written as a_b without having any effect.
    • All whitespace before and after template names (a b) as well as parameter names (e f) is meaningless.
    • Whitespace before and after parameter values (g h) is meaningless, …
    • … except it's meaningful when the parameter is unnamed ( c d ).
    • Whitespace outside of the curly brackets is meaningful, …
    • … except it's not.
  • Whitespace in parser functions like {{#if: a | b | c }} is meaningless. I'm not aware of an exception, but there might be one.
  • Single line breaks are meaningless in so far that they don't create a line break in the rendered article, but meaningful in so far that they create a space.
  • We can not say anything about whitespace inside parser tags like <poem>, <math> etc. This depends entirely on the extension that provides the tag.

Please checkout my experiment in T181677: Implement syntax highlight for U+00A0 (no-break space, nbsp) with cm-show-invisibles plugin.
Also note, that CodeEditor implements an 'invisibles' mode and it is behind a toolbar button.

Also a note, that CodeMirror keeps a 'bad char' list somewhere in a configuration variable. Such characters will be indicated with a red dot. A character that falls in this category is U+00AD (soft hyphen).

Note about a possible icon: The CodeEditor extension uses the ¶ character (U+00B6, called "PILCROW SIGN"). It does this by bringing their own custom .svg icon, and integrate it as if it would be a proper OOUI icon. I find the source code quite elegant: https://codesearch.wmcloud.org/search/?q=pilcrow. Might be worth pulling this icon into the OOUI icon set. @Volker_E, is there a process for this?

@thiemowmde Yes there is an already outlined process, both for the design T139351 and the integration T135081. I'm open to the ¶ character because there is also recognition about using it to show whitespace characters in Microsoft Word, probably the most known use.

Before starting the process though, I'd want to do some research to decide if this is the best fit for our use, and also propose a design that is in-line with the OOUI guidelines for how icons should look. In the link you shared I'm not seeing the current design. I've also tried finding/turning on the plug-in without success. Could you point me to the right preferences so that I can take a look?

Also thank you @TheDJ for pointing out this related feature!

Before starting the process though, I'd want to do some research to decide if this is the best fit for our use, and also propose a design that is in-line with the OOUI guidelines for how icons should look. In the link you shared I'm not seeing the current design. I've also tried finding/turning on the plug-in without success. Could you point me to the right preferences so that I can take a look?

Try "Enable editing toolbar" / "2010 wikitext editor", then open a page with a source code suffix such as, https://en-wmde-templates-alpha.wmcloud.org/w/index.php?title=User:Adamw/foo.js&action=edit

Change 616534 abandoned by Thiemo Kreuz (WMDE):
[mediawiki/extensions/CodeMirror@master] [POC] Change MediaWiki mode to highlight (meaningful) whitespace

Reason:

https://gerrit.wikimedia.org/r/616534