
VisualEditor leaving nowiki on enwiki
Closed, Duplicate (Public)

Description

There are different "types" of nowiki being added to enwiki by VE sourced edits:

Example diffs by type:

There are probably other types. Not all of the 6,432 and 8,815 are by VE, though a lot are. It may be that this is caused (in part) by user error in VE. Before we start looking into bot processes to automate fixes, or edit filters, it would be good to establish whether this is a fixable VE bug or something we will need to learn to deal with indefinitely.

Event Timeline

I remember trying VE years ago and having funny stuff happening when copy pasting. But that's a long time ago, haven't tried it recently.

Amire80 added subscribers: cscott, Amire80.

Disclaimer: I am not an expert developer of VE or Parsoid. I'm just a nerd editor who explored the topic of unnecessary <nowiki> tags too much.

This is not unique to the English Wikipedia. And this should probably be split into two or more bugs, or perhaps merged with some tasks tagged with Parsoid-Nowiki.

I fixed a lot of these in the Hebrew Wikipedia, and I also documented them: https://he.wikipedia.org/wiki/WP:VE/nowiki . The important parts of that page are translated into English for people from other Wikipedias who may be curious. I stopped the rigorous counting of the various causes for the appearance of these tags a few months ago because I had to prioritize things for the pandemic lockdowns, and also because the count seemed mostly consistent for several years.

The two causes you listed—external links and single quotes—are indeed quite common, but the most common reason by far is writing a trail after a link incorrectly. It may be less common in English because the English morphology is simpler than other languages', but I can still find quite a lot with insource:/\]\]\<nowiki/.

Although all of these are, to the best of my understanding, inserted by Parsoid (the backend) and not exactly by VE (the frontend), there are several interventions on the frontend VE side that could possibly reduce them.

For the case of external links, somebody (maybe @cscott?) once showed me that VE guesses, based on the user's keyboard usage, whether something that begins with http should be treated as an actual link or as a string that should be escaped. Unfortunately, I don't remember the details, but there are some heuristics that try to guess what to do based on whether the URL was pasted or typed letter by letter, whether the user changed it after inserting it (with backspace, etc.), or something along these lines. Maybe somebody who knows the guts of how the VE frontend handles keyboard events can explain. It would be nice if the designers could rethink this behavior. And at the very least, it should be documented in the user manual.

For the case of single quotes, frontend VE should be smarter about warning a user who types something that looks like wikitext. It has already been doing this for years for [[ and {{, and perhaps it can do better with ''. My naïve guess is that it's more a matter of good design and design research, and once an effective design is done, implementing it will be easy because the necessary functions for detecting wikitext and showing warnings basically exist already. (Funny thing about single quotes: In the Hebrew Wikipedia, my strong suspicion is that a lot of people type '' because they simply don't know how to type "! Both characters have been easily accessible on Hebrew keyboards for many decades and most people who type in Hebrew do it correctly, but some apparently don't. I don't know if the situation is the same in English. Maybe users conflate ''' with ", maybe some try to type wikitext, and maybe there's something else going on.)

@Amire80 that is a fascinating taxonomy you created. Glad you found this thread :) I'm having trouble understanding some of them as there are no examples. Intended space has an example, for example, so I understand where nowiki appears and might find those, e.g. with the regex []]{2}<nowiki/>[[]{2}. In fact, there are 8 intended space cases on enwiki:

If you ever decide to expand the description or add a new column to include (more) examples where possible, it would be very helpful towards developing tools and reports. It should be possible to detect many of them universally (like intended space), potentially making for a global bot/tool/report across all wikis.

> @Amire80 that is a fascinating taxonomy you created. Glad you found this thread :)

Thanks! It means a lot <3

> I'm having trouble understanding some of them as there are no examples. Intended space has an example, for example, so I understand where nowiki appears and might find those, e.g. with the regex []]{2}<nowiki/>[[]{2}. In fact, there are 8 intended space cases on enwiki:

The "intended trail" and the "intended space" are technically the same: in both of them, the editor was not supposed to put any characters immediately after a link. The difference between them is in what the user should have done. "Intended trail" is typing s immediately after [[dog]]; in such a case, the user probably wanted the result to be [[dog]]s, but it came out as [[dog]]<nowiki/>s. "Intended space" is typing something like [[dog]]breeds; in this case, the user probably wanted the result to be [[dog]] breeds, but it came out as [[dog]]<nowiki/>breeds.

So in both cases, the result is different from the (probable) intention, but Parsoid handles them differently. The reason I distinguish them in the table is that I hope that some day a designer who works with the VE team will see this and build a system that tries to be smarter about user intention, DWIM-style.

There are way more than eight cases of what I call "intended space" on the English Wikipedia. Your regex searches for two consecutive links, and what I meant is some plain text after a link. It's more like []]{2}<nowiki/>[a-z]. If I search for insource:/[]]{2}\<nowiki\/\>[a-z]/, I see these:

actual wikitext | likely intended text | classification
[[Sultan bin Muhammad Al-Qasimi|Sheikh Sultan Al Qasim]]<nowiki/>i | [[Sultan bin Muhammad Al-Qasimi]] | intended trail
[[Rwanda]]<nowiki/>n | [[Rwanda]]n | intended trail
[[Garden of Eden|Eden]]<nowiki/>ic world | [[Garden of Eden|Eden]]ic world | intended trail
notably [[Frontal Lobotomy|frontal lobotomies]]<nowiki/>carried out | notably [[Frontal Lobotomy|frontal lobotomies]] carried out | intended space

But you can see that you cannot easily sort "intended space" and "intended trail" using just regexes or code; you need to read the text and guess the user's intention. You can perhaps assume that some common English suffixes, such as -s, -an, -ic, -ing, -ed, etc., are "intended trail", and other things are "intended space", but it will probably still require some manual verification. (Also, it will have to be done differently in each language. I try to think about how each thing would work in different wikis and languages. Of course, if you only care about English, it's totally fine.)
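The suffix heuristic described above could be sketched roughly as follows. This is only an illustration, not an existing tool: the function name `classify`, the `TRAIL_SUFFIXES` set (the five suffixes named above plus "n" from the Rwanda example) and the exact regex are assumptions, and the output is a guess that still needs manual verification.

```python
import re

# Assumption: a tiny, English-only set of suffixes treated as link trails.
# Other languages would need their own lists.
TRAIL_SUFFIXES = {"s", "n", "an", "ic", "ing", "ed"}

# A closing "]]", then <nowiki/>, then a run of lowercase ASCII letters.
PATTERN = re.compile(r"\]\]<nowiki\s*/>([a-z]+)")


def classify(wikitext):
    """Return (following_word, guess) for each link followed by <nowiki/> and text."""
    results = []
    for match in PATTERN.finditer(wikitext):
        word = match.group(1)
        # Exact match against the suffix set; anything else is guessed
        # to be a separate word that was missing a space.
        if word in TRAIL_SUFFIXES:
            guess = "intended trail"
        else:
            guess = "intended space"
        results.append((word, guess))
    return results
```

With this sketch, `classify("[[dog]]<nowiki/>s")` guesses "intended trail" and `classify("[[dog]]<nowiki/>breeds")` guesses "intended space". A short real word that happens to be in the set, or a suffix that is not in it, will be misclassified, which is why the guess is only a starting point for a human reviewer.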

The cases you found with insource:/[]]{2}\<nowiki\/\>[[]{2}/ are more like what I call "escaping links" in my big table:

  • [[Eurozine|''Eurozin'']]<nowiki/>[[Eurozine|''e'']] <- this was probably supposed to be simply ''[[Eurozine]]'', but something somewhere went awry when the user tried to mix italics formatting with a link.
  • [[Starogard County|Starogard Gdań]]<nowiki/>[[Starogard County|ski]] <- this was obviously supposed to be [[Starogard County|Starogard Gdański]].
  • [[Democratic Republic of the Congo|Demo]]<nowiki/>[[Democratic Republic of the Congo|cratic Republic of the Congo]] <- this was obviously supposed to be just [[Democratic Republic of the Congo]].
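For the simplest of these "escaping links" cases, a bot-style cleanup might look something like the sketch below. It is an assumption-laden illustration, not the code of any existing bot: the name `merge_split_links` and the regex are made up for this example, and it deliberately leaves alone anything with quote markup (such as the Eurozine case), since that needs a human to decide.

```python
import re

# Two piped links with nothing but <nowiki/> between them.
SPLIT_LINK = re.compile(
    r"\[\[([^\[\]|]+)\|([^\[\]|]+)\]\]"
    r"<nowiki\s*/>"
    r"\[\[([^\[\]|]+)\|([^\[\]|]+)\]\]"
)


def merge_split_links(wikitext):
    """Merge adjacent links to the same target that were split by <nowiki/>."""

    def repl(match):
        target1, label1, target2, label2 = match.groups()
        # Only touch the text when both halves link to the same page and
        # neither label contains italic/bold quote markup.
        if target1 != target2 or "''" in label1 or "''" in label2:
            return match.group(0)
        label = label1 + label2
        if label == target1:
            # The joined label equals the target, so no pipe is needed.
            return "[[" + target1 + "]]"
        return "[[" + target1 + "|" + label + "]]"

    return SPLIT_LINK.sub(repl, wikitext)
```

Applied to the Starogard example this yields [[Starogard County|Starogard Gdański]], and the Democratic Republic of the Congo example collapses to a plain [[Democratic Republic of the Congo]]. The Eurozine example is returned unchanged.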

I find analyzing such cases very fun (especially when I'm not too busy with work and family...). Obvious patterns emerge after just a few days, and at least some of them are evidently similar across wikis and languages. Bots can fix some of them automatically, and in the Hebrew Wikipedia we added some very simple cases to periodic bot cleanup. Other patterns probably require manual fixing. Of course, it would be best if these things didn't happen at all ;)

> If you ever decide to expand the description or add a new column to include (more) examples where possible, it would be very helpful towards developing tools and reports. It should be possible to detect many of them universally (like intended space), potentially making for a global bot/tool/report across all wikis.

I'll try! A bit busy right now, but do take a look at the same page in a couple of weeks.

Superyetkin subscribed.

As the issue affects thousands of pages, it deserves a higher priority.

TmY_e12 raised the priority of this task from High to Needs Triage. May 29 2021, 6:29 PM
TmY_e12 subscribed.

It is not assigned yet.

Arlolra subscribed.

There's some good information here, so thank you to the participants, but the two types already have individual tasks, as mentioned in T282322#7072938