| Subject | Repo | Branch | Lines +/- | |
|---|---|---|---|---|
| Remove nlwiki from windows-1252 encoding | operations/mediawiki-config | master | +0 -1 |
Details
| Status | Subtype | Assigned | Task | ||
|---|---|---|---|---|---|
| Open | None | T128149 Remove wgLegacyEncoding feature of Revision/BlobStore | |||
| Resolved | Ladsgroup | T128150 Stop needing to use wgLegacyEncoding in Wikimedia cluster production | |||
| Resolved | Ladsgroup | T128154 Migrate all old DB rows from windows-1252 to UTF-8 on nlwiki |
Event Timeline
Change 383012 had a related patch set uploaded (by Zoranzoki21; owner: Zoranzoki21):
[operations/mediawiki-config@master] Migrate all old DB rows from windows-1252 to UTF-8 on several wikis:
Change 383012 abandoned by Zoranzoki21:
Migrate all old DB rows from windows-1252 to UTF-8 on several wikis:
Reason:
I will abandon this change. Told me please to restore this patch, if it will be need. Sorry for much emails about this.
mysql:research@s2-analytics-replica.eqiad.wmnet [nlwiki]> select old_flags, count(*) from text group by old_flags limit 50; +---------------------+----------+ | old_flags | count(*) | +---------------------+----------+ | | 8027 | | 0 | 1 | | error | 6 | | external,gzip | 863 | | external,object | 84 | | external,utf-8 | 9231979 | | gzip | 14791 | | object | 182513 | | utf-8,gzip | 1079827 | | utf-8,gzip,external | 54472003 | +---------------------+----------+ 10 rows in set (2 min 42.773 sec) mysql:research@s2-analytics-replica.eqiad.wmnet [nlwiki]> select max(old_id) from text where not old_flags like '%external%'; +-------------+ | max(old_id) | +-------------+ | 51326651 | +-------------+ 1 row in set (29.320 sec)
This will be fun
These pages in mw ns actually on legacy encoding but on their latest version meaning it might break the wiki if I move them (and if they have invalid utf-8):
+--------------------+ | page_title | +--------------------+ | Watchmethod-recent | | Watchmethod-list | | Sig_tip | | Nstab-wp | | Portal-url | | Allarticles | | Media_tip | | Math_sample | | Math_syntax_error | | Ipb_expiry_invalid | +--------------------+ 10 rows in set (5.995 sec)
Was a mess to find them:
select old_id from text where old_id in (select SUBSTR(content_address, 4) from slots join content on slot_content_id = content_id where slot_revision_id in (select rev_id from revision join page on rev_page = page_id where page_namespace = 8)) and not old_flags like '%utf-8%';
Just for verbosity, these mediawiki pages have a legacy encoding revision in their history:
+-----------------------------+ | page_title | +-----------------------------+ | Whitelistedittext | | Histlegend | | Math_syntax_error | | Exbeforeblank | | Protectlogtext | | Movethispage | | Rcnote | | Movearticle | | Movenologintext | | Movepagebtn | | Watchmethod-recent | | Math_sample | | Extlink_sample | | Italic_sample | | Article | | Move | | Nstab-user | | Unblocklogentry | | Undelete_short | | Nstab-main | | Nstab-special | | Category_header | | Range_block_disabled | | Summary | | Movepagetext | | Aboutpage | | Anontalkpagetext | | Brokenredirectstext | | Disambiguationspage | | Doubleredirectstext | | Edithelppage | | Editingold | | Explainconflict | | Helppage | | Imagelisttext | | Longpages | | Newpages | | Protectedpagewarning | | Searchresulttext | | Shortpages | | Sitesubtitle | | Whatlinkshere | | Sorbs_create_account_reason | | Help | | Loginprompt | | Blockedtext | | Newarticletext | | Noarticletext | | Copyrightwarning | | Nonefound | | Searchdisabled | | Recentchangestext | | Uploadtext | | Uploadlogpagetext | | Watchmethod-list | | Confirmdeletetext | | Blockiptext | | Blockipsuccesstext | | Blocklogentry | | Pagemovedsub | | Talkexists | | Movedto | | Disclaimerpage | | Media_tip | | Sig_tip | | Googlesearch | | Ipb_expiry_invalid | | Proxyblockreason | | Copyright | | Nstab-wp | | Portal-url | | Spamprotectiontext | | Monobook.js | | Monobook.css | | Sitenotice | | Sharedupload | | Allarticles | | Copyrightwarning2 | | Listingcontinuesabbrev | | Sitesupport-url | | Spamprotectionmatch | | Tog-fancysig | | Rightslogtext | +-----------------------------+ 83 rows in set (0.003 sec)
Mentioned in SAL (#wikimedia-operations) [2023-06-08T11:28:36Z] <Amir1> mwscript maintenance/storage/moveToExternal.php --wiki=nlwiki --iconv DB cluster26 (T128154)
So for nlwiki, there are twenty rows that can't be fixed:
Resolving 20 stubs Error at old_id 880583: can't find main text row old_id 759483 Error at old_id 880584: can't find main text row old_id 696197 Error at old_id 880585: can't find main text row old_id 696197 Error at old_id 880586: can't find main text row old_id 696197 Error at old_id 880587: can't find main text row old_id 696197 Error at old_id 880588: can't find main text row old_id 696197 Error at old_id 880589: can't find main text row old_id 696197 Error at old_id 880590: can't find main text row old_id 696197 Error at old_id 880591: can't find main text row old_id 696197 Error at old_id 880592: can't find main text row old_id 696197 Error at old_id 880593: can't find main text row old_id 696197 Error at old_id 880594: can't find main text row old_id 696197 Error at old_id 880595: can't find main text row old_id 696197 Error at old_id 880596: can't find main text row old_id 696197 Error at old_id 880597: can't find main text row old_id 696197 Error at old_id 880598: can't find main text row old_id 696197 Error at old_id 880599: can't find main text row old_id 696197 Error at old_id 880600: can't find main text row old_id 696197 Error at old_id 880601: can't find main text row old_id 696197 Error at old_id 880602: can't find main text row old_id 696197 0 of 20 stubs resolved
Should we mark them as bad blob and move on?
FWIW, optimizing text table after moving 1M entries to ES:
Before:
root@dbstore1007:/srv/sqldata.s2/nlwiki# ls -Ssh | head total 74G 13G revision.ibd 12G pagelinks.ibd 7.5G text.ibd 5.1G __wmf_checksums.ibd 5.0G externallinks.ibd 4.1G slots.ibd 3.5G logging.ibd 3.3G content.ibd 2.8G comment.ibd
After:
root@dbstore1007:/srv/sqldata.s2/nlwiki# ls -Ssh | head --lines 11 total 69G 13G revision.ibd 12G pagelinks.ibd 5.1G __wmf_checksums.ibd 5.0G externallinks.ibd 4.1G slots.ibd 3.5G logging.ibd 3.3G content.ibd 2.8G comment.ibd 2.5G templatelinks.ibd 2.3G text.ibd
20 rows but only two old_ids?
Should we mark them as bad blob and move on?
Probably, yes. :-(
I think that's not an old_id, it's cur_id and I think the cur row doesn't exist in table so it can't resolve the stub.
Should we mark them as bad blob and move on?
Probably, yes. :-(
I think Tim was working on something about this, I'll go bother him.
Doing a basic check to see if we can recover the lost data somehow.
My favorite thing is that while 19 out of 20 broken revisions refer to one non-existing row (696197) all come up with different hashes for content of that row:
mysql:research@s2-analytics-replica.eqiad.wmnet [nlwiki]> select * from text where old_id in (880583, 880584, 880585, 880586, 880587, 880588, 880589, 880590, 880591, 880592, 880593, 880594, 880595, 880596, 880597, 880598, 880599, 880600, 880601, 880602);
+--------+-----------------------------------------------------------------------------------------------------------+-----------+
| old_id | old_text | old_flags |
+--------+-----------------------------------------------------------------------------------------------------------+-----------+
| 880583 | O:15:"historyblobstub":2:{s:6:"mOldId";s:6:"759483";s:5:"mHash";s:32:"b81d6af2221802d0378599689748c374";} | object |
| 880584 | O:15:"historyblobstub":2:{s:6:"mOldId";s:6:"696197";s:5:"mHash";s:32:"d2c57d0eb947b3a4b422593fab13ebba";} | object |
| 880585 | O:15:"historyblobstub":2:{s:6:"mOldId";s:6:"696197";s:5:"mHash";s:32:"14e4347d45263bbe5291c08ff8c456da";} | object |
| 880586 | O:15:"historyblobstub":2:{s:6:"mOldId";s:6:"696197";s:5:"mHash";s:32:"6e6ff906bebd62d722e38786aef2a303";} | object |
| 880587 | O:15:"historyblobstub":2:{s:6:"mOldId";s:6:"696197";s:5:"mHash";s:32:"10ffbbe23e57ce3e4b854116f38c1676";} | object |
| 880588 | O:15:"historyblobstub":2:{s:6:"mOldId";s:6:"696197";s:5:"mHash";s:32:"bf17eb9f7291849bff8f794032c21fbc";} | object |
| 880589 | O:15:"historyblobstub":2:{s:6:"mOldId";s:6:"696197";s:5:"mHash";s:32:"62210dc99762fedfc79c14e56b72d2fc";} | object |
| 880590 | O:15:"historyblobstub":2:{s:6:"mOldId";s:6:"696197";s:5:"mHash";s:32:"41f827984d23afa3afa8d53d6e955e08";} | object |
| 880591 | O:15:"historyblobstub":2:{s:6:"mOldId";s:6:"696197";s:5:"mHash";s:32:"4bc02b17908ebe711a6f9def6ff99e88";} | object |
| 880592 | O:15:"historyblobstub":2:{s:6:"mOldId";s:6:"696197";s:5:"mHash";s:32:"bff7094e608cc2143a2ae56b43179467";} | object |
| 880593 | O:15:"historyblobstub":2:{s:6:"mOldId";s:6:"696197";s:5:"mHash";s:32:"a8641adf5965dfa9813fe6b15fed6f54";} | object |
| 880594 | O:15:"historyblobstub":2:{s:6:"mOldId";s:6:"696197";s:5:"mHash";s:32:"7f268865de3b84bd3fe8b0a4eaf0fe8f";} | object |
| 880595 | O:15:"historyblobstub":2:{s:6:"mOldId";s:6:"696197";s:5:"mHash";s:32:"7ba993ffb9c4322efc38adddb6725f6b";} | object |
| 880596 | O:15:"historyblobstub":2:{s:6:"mOldId";s:6:"696197";s:5:"mHash";s:32:"33cd4f3fe539498a37dddf72e6f955f7";} | object |
| 880597 | O:15:"historyblobstub":2:{s:6:"mOldId";s:6:"696197";s:5:"mHash";s:32:"f4d113b5b1ace53a97edd4ad298c0e05";} | object |
| 880598 | O:15:"historyblobstub":2:{s:6:"mOldId";s:6:"696197";s:5:"mHash";s:32:"0d016ef74b5cddd6d9aaa6752620f573";} | object |
| 880599 | O:15:"historyblobstub":2:{s:6:"mOldId";s:6:"696197";s:5:"mHash";s:32:"f4d113b5b1ace53a97edd4ad298c0e05";} | object |
| 880600 | O:15:"historyblobstub":2:{s:6:"mOldId";s:6:"696197";s:5:"mHash";s:32:"0811fc3818f1b60de72a1b48cc445d07";} | object |
| 880601 | O:15:"historyblobstub":2:{s:6:"mOldId";s:6:"696197";s:5:"mHash";s:32:"757c144cfb0eadc3b2e330a1f6599e2f";} | object |
| 880602 | O:15:"historyblobstub":2:{s:6:"mOldId";s:6:"696197";s:5:"mHash";s:32:"b59ea154872f10e9669a25453f4c45a8";} | object |
+--------+-----------------------------------------------------------------------------------------------------------+-----------+
20 rows in set (0.001 sec)One row before the broken batch is stored in DB://rc1/588310/1 (just for the record)
mysql:research@s2-analytics-replica.eqiad.wmnet [nlwiki]> select * from text where old_id = 880582; +--------+-------------------+----------------+ | old_id | old_text | old_flags | +--------+-------------------+----------------+ | 880582 | DB://rc1/588310/1 | external,utf-8 | +--------+-------------------+----------------+ 1 row in set (0.001 sec)
Mentioned in SAL (#wikimedia-operations) [2023-06-14T14:36:44Z] <Amir1> mwscript findBadBlobs.php --wiki=nlwiki --revisions 880583,880584,880585,880586,880587,880588,880589,880590,880591,880592,880593,880594,880595,880596,880597,880598,880599,880600,880601,880602 --mark "T128154"
Change 930192 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):
[operations/mediawiki-config@master] Remove nlwiki from windows-1252 encoding
Change 930192 merged by jenkins-bot:
[operations/mediawiki-config@master] Remove nlwiki from windows-1252 encoding
Mentioned in SAL (#wikimedia-operations) [2023-06-15T11:14:39Z] <ladsgroup@deploy1002> Started scap: Backport for [[gerrit:930192|Remove nlwiki from windows-1252 encoding (T128154)]]
Mentioned in SAL (#wikimedia-operations) [2023-06-15T11:16:14Z] <ladsgroup@deploy1002> ladsgroup: Backport for [[gerrit:930192|Remove nlwiki from windows-1252 encoding (T128154)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
Mentioned in SAL (#wikimedia-operations) [2023-06-15T11:32:17Z] <ladsgroup@deploy1002> Finished scap: Backport for [[gerrit:930192|Remove nlwiki from windows-1252 encoding (T128154)]] (duration: 17m 38s)