Stray __TOC__ added by Parsoid in a 3-day window when 1.41.0-wmf.7 group2 wikis had been rolled back.
Closed, Resolved | Public | BUG REPORT

Description

Steps to reproduce

  1. Look at https://hu.wikipedia.org/w/index.php?title=Wikipédia:Kategóriajavaslatok&diff=26081817, an edit done with the reply tool.

Actual result

  1. In addition to the comment, a stray __TARTALOMJEGYZÉK__ (__TOC__) magic word was added.

Expected result

  1. The only change is the new comment.

Event Timeline

I expect this addition was also unintentional.

I think this is caused by https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/903797 and VE interaction. I am not sure how that is happening since I cannot reproduce it locally, but will investigate -- I imagine it is some specific kind of interaction causing this.

Oh .. but that patch is part of v0.18.0-a7 in Parsoid which was to go out as part of 1.41.0-wmf.7 ... But, group2 wikis (enwiki, huwiki, cswiki) don't have wmf.7 yet .. So, I am confused!

Did 1.41.0-wmf.7 get rolled out to group2 and then get rolled back?

Indeed, it got rolled back in T330213#8828193. So, it looks like wmf.7 was on group2 wikis for about 3 hours, during which new Parsoid HTML ended up in RESTBase for some subset of pages. When those pages were later edited and hit the older version of Parsoid, we dirtied the pages! This is totally my oversight -- we should have recognized this as effectively a minor HTML version change and followed our established process.

At this time, unless this is causing major disruptions, we could wait for the train to be rolled forward to group2 wikis on Monday.

Could affected pages be collected somehow so that they can be fixed manually or using a bot (depending on the amount)? It’s good to know it will fix itself with the train (I agree it’s not a “major disruption”), but the already-bad pages probably won’t be fixed by future DiscussionTools edits.

Yes, I've been pondering that question. The affected pages won't be fixed on their own. One obvious solution is to write a script to process the RC (recent changes) from the affected time window with the visualeditor or discussiontools tags and look for TOC in the diff. But, I am trying to think if there is a simpler solution than that. We'll figure out a strategy this coming week.
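
As a rough sketch of that first step (not the script that was eventually used), the recent-changes scan could start from Action API query parameters like the ones below. The `rc*` parameter names belong to the `list=recentchanges` module; the function name `rc_query_params` and the timestamps in the usage lines are illustrative assumptions.

```python
from urllib.parse import urlencode


def rc_query_params(tag, start, end, namespace=None):
    """Build Action API parameters listing tagged recent changes.

    `start` and `end` are ISO 8601 timestamps; with rcdir=newer the
    older timestamp goes in rcstart and the newer one in rcend.
    """
    params = {
        "action": "query",
        "format": "json",
        "list": "recentchanges",
        "rctag": tag,                     # e.g. "visualeditor" or "discussiontools"
        "rcdir": "newer",                 # walk forward in time
        "rcstart": start,
        "rcend": end,
        "rcprop": "ids|title|timestamp",  # revision ids are needed to fetch diffs
        "rclimit": "max",
    }
    if namespace is not None:
        params["rcnamespace"] = str(namespace)
    return params


if __name__ == "__main__":
    # Placeholder timestamps, not the actual affected window.
    params = rc_query_params(
        "discussiontools", "2023-05-01T00:00:00Z", "2023-05-02T00:00:00Z"
    )
    print("https://hu.wikipedia.org/w/api.php?" + urlencode(params))
```

Each returned entry would then still need its diff (or both revisions' text) fetched and checked, and that per-edit fetch is the slow part of such a scan.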

MSantos triaged this task as High priority.May 8 2023, 3:07 PM
MSantos moved this task from Backlog to In Progress on the Content-Transform-Team-WIP board.

I think we should be able to adapt this existing script easily to gather this info.

I did a few tweaks and I have it running against enwiki VE edits in the main namespace and it seems to be working, but since this needs to fetch the diff for every tagged edit, it is going to take a while to run through. I might have to run this script on a server somewhere. It also has a few false positives (because there is TOC in the diff both before/after -- I could make it smarter but this is just a quick trial test run).
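
One way to avoid the before/after false positives mentioned above would be to compare how often the magic word occurs in the old and new revision text, rather than searching the diff for it. This is a minimal sketch under that assumption, not the script from this task: the function names and the alias tuple are hypothetical, both revisions' wikitext is assumed to be already fetched, and case-insensitive matching is an assumption.

```python
import re

# Aliases to look for. "__TARTALOMJEGYZÉK__" is the Hungarian form quoted
# in this task; other wikis would need their own localized aliases.
TOC_ALIASES = ("__TOC__", "__TARTALOMJEGYZÉK__")


def count_toc(wikitext, aliases=TOC_ALIASES):
    """Count occurrences of any TOC alias in the given wikitext."""
    pattern = "|".join(re.escape(alias) for alias in aliases)
    return len(re.findall(pattern, wikitext, flags=re.IGNORECASE))


def is_dirty_toc_edit(old_text, new_text, aliases=TOC_ALIASES):
    """Return True only if the edit increased the number of TOC magic words."""
    return count_toc(new_text, aliases) > count_toc(old_text, aliases)


if __name__ == "__main__":
    before = "== Section ==\nSome comment."
    after = "__TOC__\n== Section ==\nSome comment.\n:A reply."
    print(is_dirty_toc_edit(before, after))
```

A page that already contained the magic word before the edit is not flagged. A comment that merely mentions `__TOC__` (for example, a report about this bug) would still be counted, so some manual review remains necessary.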

But, from the run so far, it looks like about one edit every 10 mins has this TOC dirtying, which would probably mean several hundred pages across all wikis (enwiki probably has the highest edit frequency of all wikis). What is the best way to surface this list of dirtied pages? Add it as a paste on this phab task?

I will have to tweak the script to use the localized name of TOC on a wiki, and then run it with both visualeditor and discussion tools edit filters for the timeframe when the rollback was in place. I will then probably let this run on scandium or some labs server.
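
The localized names could be looked up per wiki via `action=query&meta=siteinfo&siprop=magicwords`. The sketch below assumes the JSON response lists each magic word with `name` and `aliases` keys; the helper name `toc_aliases` is hypothetical, and the sample response is hand-written for illustration, not real API output.

```python
def toc_aliases(siteinfo):
    """Extract the aliases of the 'toc' magic word from a siteinfo response.

    Falls back to the canonical English form if no entry is found.
    """
    for word in siteinfo.get("query", {}).get("magicwords", []):
        if word.get("name") == "toc":
            return list(word.get("aliases", []))
    return ["__TOC__"]


if __name__ == "__main__":
    # Hand-written sample in the assumed response shape.
    sample = {
        "query": {
            "magicwords": [
                {"name": "notoc", "aliases": ["__NOTOC__"]},
                {"name": "toc", "aliases": ["__TARTALOMJEGYZÉK__", "__TOC__"]},
            ]
        }
    }
    print(toc_aliases(sample))
```

The resulting list can then be used as the set of strings to search for on that wiki.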

Change 920285 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/services/parsoid@master] WIP: Prototype script to process RC stream for dirty TOC edits

https://gerrit.wikimedia.org/r/920285

Any progress on this? It’s been four weeks since I reported this issue, and as time goes on, the issue becomes more severe and harder to fix:

  • While it caused no visual changes back then, it may well have caused some in the meantime: if a new first section was added, the TOC may now be before the second section; if the first section was removed, it may now be in the middle of the text; and if there were more than three sections at the time but there are at most three currently, the TOC may appear unnecessarily (this last case is probably most likely on automatically archived talk pages).
  • With more new edits, the likelihood of edit conflicts grows, making it impossible to revert by bot.

My apologies - I was waiting for reviews of my patch, but I can instead run this without review and tweak it later if necessary. I'll try to get these results early next week.

Alright, I updated the script one final time to make sure I run it on all non-closed non-private wikis (~750+) and kicked it off on the parsing-qa-02 VM. I expect I'll have results in 24 hours or less.

The timestamps I picked for May 5 may not be the right ones, since I am seeing some edits from before the start timestamp. Anyway, once this run completes, I can rerun the script with a different timestamp range and merge the results.

But, FWIW, this run found about 70 diffs across all 750+ wikis caused by the use of the reply tool (many of which are also false positives because the talk page has reports of the stray TOC additions, which then get picked up by the script as a dirty diff).

The run with visualeditor tag is still going on and I expect that to find a much larger number of dirty diffs.

This current run is from 2023-05-05T18:25:20Z ... 2023-05-08T14:40:02Z. But that was the wrong start timestamp. Once this run completes, will rerun script for range 2023-05-04T20:51:13Z -- 2023-05-05T18:25:20Z and that will pretty much do it.

Okay, thanks!

Looks like based on the results so far (> 100K edits examined), about 0.8% of them have been dirtied across all wikis. enwiki (which has completed) has about 333 dirty diffs and is the highest so far.

All discussiontools edits got fully processed for the affected timeframe. Since RC logs for May 4th cleared before the second script run completed for the visualeditor tag, RC entries for about 3 hours on May 4th are missing for a bunch of wikis. I will need to query the db to find relevant edits for everything after frwiki. I will look into it on Monday.

What is the best way to share these results?

ssastry renamed this task from "Stray __TOC__ added by the reply tool" to "Stray __TOC__ added by Parsoid in a 3-day window when 1.41.0-wmf.7 group2 wikis had been rolled back." Jun 4 2023, 4:17 AM

Okay, the script runs are complete. Some stats.

~200K edits were examined and about 1600 dirty diffs were found across all wikis (~100 are false positive reports on talk pages because of TOC matches later on the page). frwiki has 589 dirty diffs and enwiki has 333. All other wikis have fewer than 100. All but 13 wikis have fewer than 10 dirty diffs.

As reported above, on a few Wikipedias (alphabetically, everything after frwiki), data is missing for about 3 hours on May 4 because I didn't get the script run completed before the RC logs cleared entries older than 30 days. But, looking at the stats as I have them now (which I am including below), I think a few tens of diffs (across multiple wikis) might have been missed for those 3 hours over a 3-day period (3 hrs over 3 days ≈ 1/25, and 1/25 of 1500 = 60, but in reality it is going to be smaller because almost 1000 of those 1500 are from wikis with complete data). So, I am not yet sure it is worth putting in additional effort to identify them all precisely. Input welcome.
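
The back-of-the-envelope estimate above can be spelled out as follows. This sketch assumes dirty edits were spread evenly over the window and uses the rounded figures from the comment (1500 dirty diffs, about 1000 of them on wikis with complete data); the variable and function names are illustrative.

```python
TOTAL_DIRTY = 1500        # rounded count of dirty diffs found
COMPLETE = 1000           # roughly how many came from wikis with complete data
MISSING_HOURS = 3         # gap in the RC data on May 4
WINDOW_HOURS = 3 * 24     # the 3-day window, in hours


def estimate_missed(dirty_count, missing_hours=MISSING_HOURS,
                    window_hours=WINDOW_HOURS):
    """Scale a dirty-diff count by the fraction of the window that is missing."""
    return dirty_count * missing_hours / window_hours


# Naive upper bound: apply the missing fraction (3/72 = 1/24, close to the
# 1/25 used above) to every dirty diff.
upper_bound = estimate_missed(TOTAL_DIRTY)          # 62.5

# Refined: only wikis with incomplete data can have missed diffs.
refined = estimate_missed(TOTAL_DIRTY - COMPLETE)   # about 20.8

if __name__ == "__main__":
    print(f"upper bound: {upper_bound:.1f}, refined: {refined:.1f}")
```

Under these assumptions the number of missed diffs is on the order of 20 to 60, which matches the "few tens" estimate.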

589 frwiki 
333 enwiki
 98 ukwiki
 85 zhwiki
 69 dewiki
 66 eswiki
 52 ruwiki
 35 trwiki
 33 plwiki
 26 idwiki
 21 jawiki
 14 fawiki
 14 cswiki
  9 rowiki
  9 arwiki
  7 elwiki
  6 viwiki
  6 euwiki
  5 svwiki
  5 skwiki
  5 nlwiki
  4 srwiki
  4 ptwiki
  4 nowiki
  4 hifwiki
  4 fiwiki
  3 thwiki
  3 tewiki
  3 huwiki
  3 etwiki
  2 sowiki
  2 quwiki
  2 mtwiki
  2 ltwiki
  2 lbwiki
  2 kowiki
  2 hywiki
  2 bnwiki
  2 bgwiki
  1 vowiki
  1 uzwiki
  1 tawiki
  1 slwiki
  1 simplewiki
  1 mkwiki
  1 extwiki
  1 dagwiki
  1 azwiki

So, the last thing remaining is dumping this list of diffs somewhere and figuring out how to go about fixing them. enwiki, frwiki and a few others will definitely benefit from some bot help. But, the rest could probably be tackled by manual edits -- I am happy to take on some of this myself.

Dumping the pages and/or specific diff links into a phab paste and then advertising it in User-notice should work to get eyes on it; people can cross-post from there, e.g. to en.wp. Or paste it on a wiki of your choice, e.g. mw wiki.

Good idea. I created https://www.mediawiki.org/wiki/Parsoid/Deployments/T336101_followup .. which should also make it easy for editors to fix the page. Hopefully editors can strike out the entry there after fixing it.

I came across this ticket just now. I am the author of the above edit. The edit was done under Monobook with the oldest working interface you can imagine. :-)

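Looks like a huwiki editor didn't like me fixing the dirty diff on their user page (https://hu.wikipedia.org/w/index.php?title=Szerkeszt%C5%91:Pelenczei_Bal%C3%A1zs&oldid=prev&diff=26180142)!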

Not so surprising. :-) But this is not a talk page, and the reply tool does not work here. The magic word may have been put there on purpose.

The dirty diff was caused by the use of visualeditor, not the reply tool. In any case, if the editor is fine with it, I don't have a problem. :)

Change 920285 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Script to process RC stream for dirty TOC edits

https://gerrit.wikimedia.org/r/920285

Re: Tech News - What wording would you suggest as the content? My best guess is something like this (improvements/tweaks appreciated!):

For a few hours last month, some pages edited with VisualEditor or DiscussionTools had an unintended __TOC__ (or its localized form) added during an edit. There is a listing of affected pages sorted by wiki, that may still need to be fixed.

Thanks! The only change I would recommend is: to change "For a few hours" to "For 3 days" and also mention that this only impacted group2 wikis, so mostly wikipedias, not other wikis.

Change 929160 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/vendor@master] Bump parsoid to 0.18.0-a14

https://gerrit.wikimedia.org/r/929160

Change 929160 merged by jenkins-bot:

[mediawiki/vendor@master] Bump parsoid to 0.18.0-a14

https://gerrit.wikimedia.org/r/929160

I am going to close this task. The wiki page with affected pages has helped fix the majority of pages. I'll probably go in and fix any remaining pages over this week.