Clean up redundant <nowiki>
Open, Low · Public · 0 Story Points

Description

Please run a pywikibot/maintenance script on all wikis (or at least all wikis that enable VE) to cleanup redundant nowikis created by Parsoid bugs.

Amir Aharoni is researching the most common patterns of redundant nowikis, and based on his findings I defined the following replacements, which should catch at least part of them (I tested them all on hewiki and they should be safe):

| from | to | notes |
|------|----|-------|
| `\n *<nowiki> *</nowiki>` | `\n` | very common; not sure the leading `\n` is required, but it makes the replacement totally safe :) |
| `('''{2,3})<nowiki/>\1` | (empty string) | replace with an empty string (the match is equal to `<b></b>` or `<i></i>`) |
| `([^s])\]\]<nowiki/>` | `\1]]` | side effect: `[[link]]<nowiki/>s` => `[[link\|links]]` (the `s` becomes part of the link). This is site-wide behaviour controlled by linkTrail; spaces should be ignored, which requires different handling. |
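The three replacements above can be sketched as a small Python function, of the kind a pywikibot replace job might use. This is an approximation of my own, not the exact bot code: the quote pattern is written as `('{2,3})` to match `''<nowiki/>''` and `'''<nowiki/>'''` pairs (which I read as the intent of the table's `('''{2,3})`), and the third pattern uses a lookahead guard instead of the original `([^s])` capture, since the real link-trail characters vary per wiki.

```python
import re

# Sketch of the three redundant-nowiki replacements from the table above.
REPLACEMENTS = [
    # an empty <nowiki>...</nowiki> after a newline -> just the newline
    (re.compile(r"\n *<nowiki> *</nowiki>"), "\n"),
    # ''<nowiki/>'' or '''<nowiki/>''' -> empty string
    (re.compile(r"('{2,3})<nowiki/>\1"), ""),
    # ]]<nowiki/> -> ]] , guarded so a trailing link-trail letter
    # (e.g. the "s" in [[Astronaut]]<nowiki/>s) is left untouched
    (re.compile(r"\]\]<nowiki/>(?![a-z])"), "]]"),
]

def clean_nowiki(text: str) -> str:
    """Apply all redundant-nowiki replacements to one page's wikitext."""
    for pattern, repl in REPLACEMENTS:
        text = pattern.sub(repl, text)
    return text
```

A bot run would feed each article's wikitext through `clean_nowiki` and save only if the text changed.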

For future bug fixes: please define similar patterns (if possible) so we can fix all the Parsoid bugs in their post-mortem phase and get rid of this dirty wikitext.

See also:
https://lists.wikimedia.org/pipermail/wikitech-l/2015-June/082127.html

eranroz created this task. Aug 1 2015, 9:15 PM
eranroz updated the task description. (Show Details)
eranroz raised the priority of this task from to Needs Triage.
eranroz added subscribers: eranroz, Amire80.
Restricted Application added subscribers: Matanya, Aklapper. Aug 1 2015, 9:15 PM
Krenair set Security to None.
Krenair added a subscriber: Krenair.

Thanks for creating the task. I agree that it's a good idea to make some common and safe replacements by a commonly maintained bot script in all wikis.

To clarify, I only research this in the Hebrew Wikipedia, which usually has fewer than 20 nowiki edits a day. I like doing it thoroughly and manually, and I only have time to do that in one wiki, though I'd be happy to share my experience with people in other languages, who may find other curious local cases.

('''{2,3})<nowiki/>\1 (that is, ''<nowiki/>'' or '''<nowiki/>''') is indeed very frequent, and I see no reason not to clean these up.

''<nowiki/>''' and '''<nowiki/>''', as well as just <nowiki/> with nothing else, frequently appear at the end of paragraphs, and then they can be removed together with all the whitespace before and after them.

([^s])\]\]<nowiki/> is very frequent as well. Auto-replacing may change the semantics, but I really see no place where this could be the intended behavior. In all the cases that I've seen, it was something like [[Moscow ]]mayor, and it should definitely be [[Moscow]] mayor.

Stuff like [[Astronaut]]<nowiki/>s is frequent, but it probably can't be safely auto-replaced without changing semantics in a bad way.
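The [[Astronaut]]<nowiki/>s case above can be illustrated with a short check: whether removing the <nowiki/> changes rendering depends on the wiki's linktrail setting. The pattern below uses `[a-z]+`, the usual English Wikipedia linktrail, purely as an example; a real bot would have to fetch each site's linktrail, so treat this as a sketch.

```python
import re

# Hypothetical English-style linktrail: lowercase letters immediately
# after ]] are absorbed into the link. Other wikis define this differently.
LINKTRAIL = re.compile(r"^[a-z]+")

def nowiki_is_significant(text_after_nowiki: str) -> bool:
    """True if removing ]]<nowiki/> here would change rendering,
    because the following text would be absorbed into the link."""
    return LINKTRAIL.match(text_after_nowiki) is not None
```

So `[[Astronaut]]<nowiki/>s` is significant (removing it would produce the link "Astronauts"), while `[[Moscow]]<nowiki/> mayor` is not.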

T106641 describes another frequent reason for nowiki, though it's not a thing that can be safely auto-replaced, and it may be unique to Hebrew.

Oh, and there's also [[link|<nowiki/>]]. It renders nothing on the page. From inspecting hundreds of diffs, I can see that the editors always add the correct link themselves - I guess they can see that something is wrong.

So such links should also be auto-removed.
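A minimal sketch of that auto-removal, assuming a plain regex pass over the wikitext (the pattern is my own approximation, not from the task): since [[link|<nowiki/>]] renders as nothing, the whole construct can simply be dropped.

```python
import re

# Matches [[Target|<nowiki/>]] - a link whose visible label is an
# empty <nowiki/>, i.e. a link that renders as nothing at all.
EMPTY_LABEL_LINK = re.compile(r"\[\[[^\[\]|]+\|<nowiki/>\]\]")

def drop_empty_label_links(text: str) -> str:
    """Remove links whose only label is <nowiki/>."""
    return EMPTY_LABEL_LINK.sub("", text)
```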

Stryn added a subscriber: Stryn. Aug 2 2015, 12:07 PM

The second one can cause issues like this (it removed '''<nowiki/>''' but produced an incorrect result).

What is incorrect in the above example?

This is the output HTML I see:

The nowiki version (681102887): <i><b>k</b></i><b>×<i>k</i>minor</b>
Cleaned up (681103205): <b><i>k</i>×<i>k</i>minor</b>

Okay, in order to fix the first one I ran my bot to do the job in these languages:
['en', 'de', 'es', 'fr', 'it', 'nl', 'pl', 'ru', 'sv', 'vi', 'fa', 'ar', 'id', 'ms', 'ca', 'cs', 'ko', 'hu', 'ja', 'no', 'zh', 'pt', 'ro', 'sr', 'sh', 'fi', 'tr', 'uk']

If you want, I will run this for all languages, and then move on to the next regex. Thanks

Started the clean up: An example

There is a wrong task number in the comment.

I didn't notice, thanks. I fixed it.

I wish you hadn't called the bug a 'Parsoid bug' ... because it is a bug in the VE <-> Parsoid relationship. ;-) Anyway, not important, I am just nitpicking. But it's good to see it getting cleaned up. Nice work @eranroz and @Ladsgroup.

Initially I wanted to use "Visual Editor bug", but the projects on this bug are not VE; they are Parsoid tags/projects. So I changed it to Parsoid. It doesn't really matter; I'm fine with anything people choose :)

Still working. An example. I hope it will finish soon.

The first part (\n *<nowiki> *</nowiki>) is cleaned up now in all Wikipedia languages. Several thousand edits have been made. Final results are in P2069.

I'll start working on the second part :)

ssastry triaged this task as Normal priority.


P2069 is a bit cryptic ... how do I read those results?


en {'totalhits': 5} means that on en.wikipedia.org there are now only 5 remaining matches of "\n *<nowiki> *</nowiki>" in articles (down from ~500).
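For context, a per-wiki totalhits figure like that can be obtained from the MediaWiki search API with a CirrusSearch insource:// regex query, asking only for the hit count. This is a sketch under that assumption; it builds the request URL but does not perform the HTTP call.

```python
from urllib.parse import urlencode

def count_matches_url(lang: str, regex: str) -> str:
    """Build a MediaWiki API URL that returns searchinfo.totalhits for
    an insource regex query on <lang>.wikipedia.org."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": "insource:/" + regex + "/",
        "srinfo": "totalhits",   # only the hit count is needed
        "srlimit": 1,
        "format": "json",
    }
    return f"https://{lang}.wikipedia.org/w/api.php?" + urlencode(params)
```

Fetching that URL returns JSON whose query.searchinfo.totalhits field is the per-wiki count reported above.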


Ah, I see. I thought it was some scaled version of the number of titles that were fixed up.

Fixing the second kind of issue is in progress now. Some examples

Jdforrester-WMF edited a custom field.
Ladsgroup lowered the priority of this task from Normal to Low. Sep 25 2015, 2:46 PM
Ladsgroup removed Ladsgroup as the assignee of this task.

I fixed as much as I could (the second kind of errors are all cleaned up). Since the errors have decreased drastically, I'm lowering the priority.

@Ladsgroup, even if it's a bit late, I want to mention that abusing Dexbot's bot flag for making these changes is not OK. I can understand that you considered the changes risk-free, and I am all too familiar with the hoops some communities make bot operators jump through, but still, policies are there for a reason and should be respected.

I also ask other bot operators considering taking over this task to follow all local rules on the wikis they work on.

Hey @Strainu, I requested the flag on more than 50 wikis for removing Link FA and Link GA templates. I respect policies and communities, but cleaning up glitches caused by MediaWiki software is not the same thing as an ordinary bot operation. I believe in common sense, and I'm sure I did no harm at all to anyone or anything.

Saying "get a bot flag for this action" amounts to "don't do it", since it's too much headache for a little fix (one or two edits per wiki). Next time, if the number of fixes is low, I will probably do it from my own account.

It's more like hundreds/thousands of edits per wiki, depending on size:
https://ro.wikipedia.org/w/index.php?title=Special:Contribu%C8%9Bii/Dexbot&offset=&limit=100&target=Dexbot
https://en.wikipedia.org/w/index.php?title=Special:Contributions/Dexbot&offset=20150920120249&limit=100&target=Dexbot

Software glitches should be fixed in software, not by bots, IMO. But still, I feel that this is a task that the communities should have been informed of.

  1. @Ladsgroup - did the bot finish its work on all sites?
  2. @ssastry / @Amire80 / @Ladsgroup - Are the relevant bugs that cause those nowikis tracked and/or fixed? Do we want to somehow define a "policy" or a standard "workflow", so that the Parsoid team will report fixed bugs to be cleaned up by a bot wiki-wide?


Note that if Parsoid gets HTML that requires a nowiki, Parsoid will add it. For example, if Parsoid gets an ISBN-like string or a URL in the HTML as plain text, those will be nowikied. Similarly, if there is link-trail-like text, it will get a <nowiki/> separator.

So, I don't think nowikis will completely disappear -- they will be there while the software-editor (CX, VE) and human-editor interaction introduces HTML as above.

In any case, if there are bots that can fix these scenarios, that also suggests that in most (but not all) of those scenarios we could introduce HTML normalization steps to do the same thing in Parsoid. So, let us focus on identifying such scenarios.

I agree, sometimes nowikis are required. But in the scenarios where nowikis are redundant, once Parsoid normalizes them (i.e. the bug is confirmed as producing an "invalid" nowiki), do we want to normalize OLD edits / EXISTING content (by bot or by maintenance script), and not only prevent the nowikis in future edits?

@Ladsgroup It should also be safe to replace <nowiki>''</nowiki> with ". Tested on the Latvian Wikipedia.
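That suggestion is a straightforward literal substitution; whether a nowikied '' should become a typographic " is a per-wiki editorial call, so the sketch below only shows the mechanical replacement.

```python
import re

def replace_nowiki_quotes(text: str) -> str:
    """Replace a nowikied pair of apostrophes (which renders as '')
    with a plain double-quote character, per the lvwiki suggestion."""
    return re.sub(r"<nowiki>''</nowiki>", '"', text)
```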