Page MenuHomePhabricator

Migrate all old DB rows from windows-1252 to UTF-8 on nlwiki
Closed, ResolvedPublic

Event Timeline

Change 383012 had a related patch set uploaded (by Zoranzoki21; owner: Zoranzoki21):
[operations/mediawiki-config@master] Migrate all old DB rows from windows-1252 to UTF-8 on several wikis:

https://gerrit.wikimedia.org/r/383012

Change 383012 abandoned by Zoranzoki21:
Migrate all old DB rows from windows-1252 to UTF-8 on several wikis:

Reason:
I will abandon this change. Told me please to restore this patch, if it will be need. Sorry for much emails about this.

https://gerrit.wikimedia.org/r/383012

MarcoAurelio subscribed.

sorry, something weird happened when trying to sort the columns on the workboard

mysql:research@s2-analytics-replica.eqiad.wmnet [nlwiki]> select old_flags, count(*) from text group by old_flags limit 50;
+---------------------+----------+
| old_flags           | count(*) |
+---------------------+----------+
|                     |     8027 |
| 0                   |        1 |
| error               |        6 |
| external,gzip       |      863 |
| external,object     |       84 |
| external,utf-8      |  9231979 |
| gzip                |    14791 |
| object              |   182513 |
| utf-8,gzip          |  1079827 |
| utf-8,gzip,external | 54472003 |
+---------------------+----------+
10 rows in set (2 min 42.773 sec)

mysql:research@s2-analytics-replica.eqiad.wmnet [nlwiki]> select max(old_id) from text where not old_flags like '%external%';
+-------------+
| max(old_id) |
+-------------+
|    51326651 |
+-------------+
1 row in set (29.320 sec)

This will be fun

These pages in mw ns actually on legacy encoding but on their latest version meaning it might break the wiki if I move them (and if they have invalid utf-8):

+--------------------+
| page_title         |
+--------------------+
| Watchmethod-recent |
| Watchmethod-list   |
| Sig_tip            |
| Nstab-wp           |
| Portal-url         |
| Allarticles        |
| Media_tip          |
| Math_sample        |
| Math_syntax_error  |
| Ipb_expiry_invalid |
+--------------------+
10 rows in set (5.995 sec)

Was a mess to find them:

select old_id from text where old_id in (select SUBSTR(content_address, 4) from slots join content on slot_content_id = content_id where slot_revision_id in (select rev_id from revision join page on rev_page = page_id where page_namespace = 8)) and not old_flags like '%utf-8%';

Just for verbosity, these mediawiki pages have a legacy encoding revision in their history:

+-----------------------------+
| page_title                  |
+-----------------------------+
| Whitelistedittext           |
| Histlegend                  |
| Math_syntax_error           |
| Exbeforeblank               |
| Protectlogtext              |
| Movethispage                |
| Rcnote                      |
| Movearticle                 |
| Movenologintext             |
| Movepagebtn                 |
| Watchmethod-recent          |
| Math_sample                 |
| Extlink_sample              |
| Italic_sample               |
| Article                     |
| Move                        |
| Nstab-user                  |
| Unblocklogentry             |
| Undelete_short              |
| Nstab-main                  |
| Nstab-special               |
| Category_header             |
| Range_block_disabled        |
| Summary                     |
| Movepagetext                |
| Aboutpage                   |
| Anontalkpagetext            |
| Brokenredirectstext         |
| Disambiguationspage         |
| Doubleredirectstext         |
| Edithelppage                |
| Editingold                  |
| Explainconflict             |
| Helppage                    |
| Imagelisttext               |
| Longpages                   |
| Newpages                    |
| Protectedpagewarning        |
| Searchresulttext            |
| Shortpages                  |
| Sitesubtitle                |
| Whatlinkshere               |
| Sorbs_create_account_reason |
| Help                        |
| Loginprompt                 |
| Blockedtext                 |
| Newarticletext              |
| Noarticletext               |
| Copyrightwarning            |
| Nonefound                   |
| Searchdisabled              |
| Recentchangestext           |
| Uploadtext                  |
| Uploadlogpagetext           |
| Watchmethod-list            |
| Confirmdeletetext           |
| Blockiptext                 |
| Blockipsuccesstext          |
| Blocklogentry               |
| Pagemovedsub                |
| Talkexists                  |
| Movedto                     |
| Disclaimerpage              |
| Media_tip                   |
| Sig_tip                     |
| Googlesearch                |
| Ipb_expiry_invalid          |
| Proxyblockreason            |
| Copyright                   |
| Nstab-wp                    |
| Portal-url                  |
| Spamprotectiontext          |
| Monobook.js                 |
| Monobook.css                |
| Sitenotice                  |
| Sharedupload                |
| Allarticles                 |
| Copyrightwarning2           |
| Listingcontinuesabbrev      |
| Sitesupport-url             |
| Spamprotectionmatch         |
| Tog-fancysig                |
| Rightslogtext               |
+-----------------------------+
83 rows in set (0.003 sec)

Mentioned in SAL (#wikimedia-operations) [2023-06-08T11:28:36Z] <Amir1> mwscript maintenance/storage/moveToExternal.php --wiki=nlwiki --iconv DB cluster26 (T128154)

So for nlwiki, there are twenty rows that can't be fixed:

Resolving 20 stubs
Error at old_id 880583: can't find main text row old_id 759483
Error at old_id 880584: can't find main text row old_id 696197
Error at old_id 880585: can't find main text row old_id 696197
Error at old_id 880586: can't find main text row old_id 696197
Error at old_id 880587: can't find main text row old_id 696197
Error at old_id 880588: can't find main text row old_id 696197
Error at old_id 880589: can't find main text row old_id 696197
Error at old_id 880590: can't find main text row old_id 696197
Error at old_id 880591: can't find main text row old_id 696197
Error at old_id 880592: can't find main text row old_id 696197
Error at old_id 880593: can't find main text row old_id 696197
Error at old_id 880594: can't find main text row old_id 696197
Error at old_id 880595: can't find main text row old_id 696197
Error at old_id 880596: can't find main text row old_id 696197
Error at old_id 880597: can't find main text row old_id 696197
Error at old_id 880598: can't find main text row old_id 696197
Error at old_id 880599: can't find main text row old_id 696197
Error at old_id 880600: can't find main text row old_id 696197
Error at old_id 880601: can't find main text row old_id 696197
Error at old_id 880602: can't find main text row old_id 696197
0 of 20 stubs resolved

Should we mark them as bad blob and move on?

FWIW, optimizing text table after moving 1M entries to ES:

Before:

root@dbstore1007:/srv/sqldata.s2/nlwiki# ls -Ssh | head
total 74G
 13G revision.ibd
 12G pagelinks.ibd
7.5G text.ibd
5.1G __wmf_checksums.ibd
5.0G externallinks.ibd
4.1G slots.ibd
3.5G logging.ibd
3.3G content.ibd
2.8G comment.ibd

After:

root@dbstore1007:/srv/sqldata.s2/nlwiki# ls -Ssh | head --lines 11
total 69G
 13G revision.ibd
 12G pagelinks.ibd
5.1G __wmf_checksums.ibd
5.0G externallinks.ibd
4.1G slots.ibd
3.5G logging.ibd
3.3G content.ibd
2.8G comment.ibd
2.5G templatelinks.ibd
2.3G text.ibd

So for nlwiki, there are twenty rows that can't be fixed

20 rows but only two old_ids?

Should we mark them as bad blob and move on?

Probably, yes. :-(

So for nlwiki, there are twenty rows that can't be fixed

20 rows but only two old_ids?

I think that's not an old_id, it's cur_id and I think the cur row doesn't exist in table so it can't resolve the stub.

Should we mark them as bad blob and move on?

Probably, yes. :-(

I think Tim was working on something about this, I'll go bother him.

Doing a basic check to see if we can recover the lost data somehow.

My favorite thing is that while 19 out of 20 broken revisions refer to one non-existing row (696197) all come up with different hashes for content of that row:

mysql:research@s2-analytics-replica.eqiad.wmnet [nlwiki]> select * from text where old_id in (880583, 880584, 880585, 880586, 880587, 880588, 880589, 880590, 880591, 880592, 880593, 880594, 880595, 880596, 880597, 880598, 880599, 880600, 880601, 880602);
+--------+-----------------------------------------------------------------------------------------------------------+-----------+
| old_id | old_text                                                                                                  | old_flags |
+--------+-----------------------------------------------------------------------------------------------------------+-----------+
| 880583 | O:15:"historyblobstub":2:{s:6:"mOldId";s:6:"759483";s:5:"mHash";s:32:"b81d6af2221802d0378599689748c374";} | object    |
| 880584 | O:15:"historyblobstub":2:{s:6:"mOldId";s:6:"696197";s:5:"mHash";s:32:"d2c57d0eb947b3a4b422593fab13ebba";} | object    |
| 880585 | O:15:"historyblobstub":2:{s:6:"mOldId";s:6:"696197";s:5:"mHash";s:32:"14e4347d45263bbe5291c08ff8c456da";} | object    |
| 880586 | O:15:"historyblobstub":2:{s:6:"mOldId";s:6:"696197";s:5:"mHash";s:32:"6e6ff906bebd62d722e38786aef2a303";} | object    |
| 880587 | O:15:"historyblobstub":2:{s:6:"mOldId";s:6:"696197";s:5:"mHash";s:32:"10ffbbe23e57ce3e4b854116f38c1676";} | object    |
| 880588 | O:15:"historyblobstub":2:{s:6:"mOldId";s:6:"696197";s:5:"mHash";s:32:"bf17eb9f7291849bff8f794032c21fbc";} | object    |
| 880589 | O:15:"historyblobstub":2:{s:6:"mOldId";s:6:"696197";s:5:"mHash";s:32:"62210dc99762fedfc79c14e56b72d2fc";} | object    |
| 880590 | O:15:"historyblobstub":2:{s:6:"mOldId";s:6:"696197";s:5:"mHash";s:32:"41f827984d23afa3afa8d53d6e955e08";} | object    |
| 880591 | O:15:"historyblobstub":2:{s:6:"mOldId";s:6:"696197";s:5:"mHash";s:32:"4bc02b17908ebe711a6f9def6ff99e88";} | object    |
| 880592 | O:15:"historyblobstub":2:{s:6:"mOldId";s:6:"696197";s:5:"mHash";s:32:"bff7094e608cc2143a2ae56b43179467";} | object    |
| 880593 | O:15:"historyblobstub":2:{s:6:"mOldId";s:6:"696197";s:5:"mHash";s:32:"a8641adf5965dfa9813fe6b15fed6f54";} | object    |
| 880594 | O:15:"historyblobstub":2:{s:6:"mOldId";s:6:"696197";s:5:"mHash";s:32:"7f268865de3b84bd3fe8b0a4eaf0fe8f";} | object    |
| 880595 | O:15:"historyblobstub":2:{s:6:"mOldId";s:6:"696197";s:5:"mHash";s:32:"7ba993ffb9c4322efc38adddb6725f6b";} | object    |
| 880596 | O:15:"historyblobstub":2:{s:6:"mOldId";s:6:"696197";s:5:"mHash";s:32:"33cd4f3fe539498a37dddf72e6f955f7";} | object    |
| 880597 | O:15:"historyblobstub":2:{s:6:"mOldId";s:6:"696197";s:5:"mHash";s:32:"f4d113b5b1ace53a97edd4ad298c0e05";} | object    |
| 880598 | O:15:"historyblobstub":2:{s:6:"mOldId";s:6:"696197";s:5:"mHash";s:32:"0d016ef74b5cddd6d9aaa6752620f573";} | object    |
| 880599 | O:15:"historyblobstub":2:{s:6:"mOldId";s:6:"696197";s:5:"mHash";s:32:"f4d113b5b1ace53a97edd4ad298c0e05";} | object    |
| 880600 | O:15:"historyblobstub":2:{s:6:"mOldId";s:6:"696197";s:5:"mHash";s:32:"0811fc3818f1b60de72a1b48cc445d07";} | object    |
| 880601 | O:15:"historyblobstub":2:{s:6:"mOldId";s:6:"696197";s:5:"mHash";s:32:"757c144cfb0eadc3b2e330a1f6599e2f";} | object    |
| 880602 | O:15:"historyblobstub":2:{s:6:"mOldId";s:6:"696197";s:5:"mHash";s:32:"b59ea154872f10e9669a25453f4c45a8";} | object    |
+--------+-----------------------------------------------------------------------------------------------------------+-----------+
20 rows in set (0.001 sec)

One row before the broken batch is stored in DB://rc1/588310/1 (just for the record)

mysql:research@s2-analytics-replica.eqiad.wmnet [nlwiki]> select * from text where old_id = 880582;
+--------+-------------------+----------------+
| old_id | old_text          | old_flags      |
+--------+-------------------+----------------+
| 880582 | DB://rc1/588310/1 | external,utf-8 |
+--------+-------------------+----------------+
1 row in set (0.001 sec)

Mentioned in SAL (#wikimedia-operations) [2023-06-14T14:36:44Z] <Amir1> mwscript findBadBlobs.php --wiki=nlwiki --revisions 880583,880584,880585,880586,880587,880588,880589,880590,880591,880592,880593,880594,880595,880596,880597,880598,880599,880600,880601,880602 --mark "T128154"

Change 930192 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Remove nlwiki from windows-1252 encoding

https://gerrit.wikimedia.org/r/930192

Change 930192 merged by jenkins-bot:

[operations/mediawiki-config@master] Remove nlwiki from windows-1252 encoding

https://gerrit.wikimedia.org/r/930192

Mentioned in SAL (#wikimedia-operations) [2023-06-15T11:14:39Z] <ladsgroup@deploy1002> Started scap: Backport for [[gerrit:930192|Remove nlwiki from windows-1252 encoding (T128154)]]

Mentioned in SAL (#wikimedia-operations) [2023-06-15T11:16:14Z] <ladsgroup@deploy1002> ladsgroup: Backport for [[gerrit:930192|Remove nlwiki from windows-1252 encoding (T128154)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-06-15T11:32:17Z] <ladsgroup@deploy1002> Finished scap: Backport for [[gerrit:930192|Remove nlwiki from windows-1252 encoding (T128154)]] (duration: 17m 38s)

Ladsgroup claimed this task.
Ladsgroup added a project: DBA.
Ladsgroup moved this task from Triage to Done on the DBA board.