nl.wiktionary.org edits from May 2004 corrupt "PHP Warning: gzinflate(): data error" (fatal RevisionAccessException)
Open, Medium, Public

Description

DBAs noticed over a thousand errors from snapshot1009 while it was producing the nlwiktionary metadata dumps. Tracking down four specific instances, I found bad text table entries: the text was presumably deflated at some point but is now apparently garbage. Dumps continue as usual, so this doesn't break them.

We should track and do something about these entries; there are surely some on other wikis as well.

Note that, according to the stack trace, the failure to retrieve a sha1 for the revision results in a lookup of the content. We are dumping metadata only, so content lookups should never happen.
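
For context, frames #12 through #16 of the trace below show why a metadata-only dump ends up fetching content anyway: when rev_sha1 is empty, the hash has to be recomputed from the revision text, which forces a blob fetch. A rough, hypothetical illustration of that fallback (not the actual MediaWiki code, just the shape of it):

// Hypothetical sketch of the sha1 fallback implied by frames #12 through #16 below.
function getRevisionSha1( ?string $storedSha1, callable $loadContent ): string {
    if ( $storedSha1 !== null && $storedSha1 !== '' ) {
        // Normal case: rev_sha1 is populated and no content lookup is needed.
        return $storedSha1;
    }
    // Empty rev_sha1 (as in the revision rows further down): the content blob
    // must be fetched and hashed, and that fetch is what hits the corrupt gzip data.
    $text = $loadContent();
    return sha1( $text ); // MediaWiki stores a base-36 SHA-1, but the idea is the same
}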

Sample stack trace:

PHP Warning: gzinflate(): data error

#0 [internal function]: MWExceptionHandler::handleError(integer, string, string, integer, array)
#1 /srv/mediawiki/php-1.36.0-wmf.13/includes/Storage/SqlBlobStore.php(593): gzinflate(string)
#2 /srv/mediawiki/php-1.36.0-wmf.13/includes/Storage/SqlBlobStore.php(520): MediaWiki\Storage\SqlBlobStore->decompressData(string, array)
#3 /srv/mediawiki/php-1.36.0-wmf.13/includes/Storage/SqlBlobStore.php(430): MediaWiki\Storage\SqlBlobStore->expandBlob(string, array, string)
#4 /srv/mediawiki/php-1.36.0-wmf.13/includes/Storage/SqlBlobStore.php(286): MediaWiki\Storage\SqlBlobStore->fetchBlobs(array, integer)
#5 /srv/mediawiki/php-1.36.0-wmf.13/includes/libs/objectcache/wancache/WANObjectCache.php(1548): MediaWiki\Storage\SqlBlobStore->MediaWiki\Storage\{closure}(boolean, integer, array, NULL, array)
#6 /srv/mediawiki/php-1.36.0-wmf.13/includes/libs/objectcache/wancache/WANObjectCache.php(1376): WANObjectCache->fetchOrRegenerate(string, integer, Closure, array, array)
#7 /srv/mediawiki/php-1.36.0-wmf.13/includes/Storage/SqlBlobStore.php(291): WANObjectCache->getWithSetCallback(string, integer, Closure, array)
#8 /srv/mediawiki/php-1.36.0-wmf.13/includes/Revision/RevisionStore.php(1046): MediaWiki\Storage\SqlBlobStore->getBlob(string, integer)
#9 /srv/mediawiki/php-1.36.0-wmf.13/includes/Revision/RevisionStore.php(1312): MediaWiki\Revision\RevisionStore->loadSlotContent(MediaWiki\Revision\SlotRecord, NULL, NULL, NULL, integer)
#10 [internal function]: MediaWiki\Revision\RevisionStore->MediaWiki\Revision\{closure}(MediaWiki\Revision\SlotRecord)
#11 /srv/mediawiki/php-1.36.0-wmf.13/includes/Revision/SlotRecord.php(300): call_user_func(Closure, MediaWiki\Revision\SlotRecord)
#12 /srv/mediawiki/php-1.36.0-wmf.13/includes/Revision/SlotRecord.php(544): MediaWiki\Revision\SlotRecord->getContent()
#13 /srv/mediawiki/php-1.36.0-wmf.13/includes/Revision/RevisionSlots.php(202): MediaWiki\Revision\SlotRecord->getSha1()
#14 [internal function]: MediaWiki\Revision\RevisionSlots->MediaWiki\Revision\{closure}(NULL, MediaWiki\Revision\SlotRecord)
#15 /srv/mediawiki/php-1.36.0-wmf.13/includes/Revision/RevisionSlots.php(204): array_reduce(array, Closure, NULL)
#16 /srv/mediawiki/php-1.36.0-wmf.13/includes/Revision/RevisionStoreRecord.php(178): MediaWiki\Revision\RevisionSlots->computeSha1()
#17 /srv/mediawiki/php-1.36.0-wmf.13/includes/export/XmlDumpWriter.php(403): MediaWiki\Revision\RevisionStoreRecord->getSha1()
#18 /srv/mediawiki/php-1.36.0-wmf.13/includes/export/XmlDumpWriter.php(316): XmlDumpWriter->{closure}()
#19 /srv/mediawiki/php-1.36.0-wmf.13/includes/export/XmlDumpWriter.php(405): XmlDumpWriter->invokeLenient(Closure, string)
#20 /srv/mediawiki/php-1.36.0-wmf.13/includes/export/WikiExporter.php(536): XmlDumpWriter->writeRevision(stdClass, array)
#21 /srv/mediawiki/php-1.36.0-wmf.13/includes/export/WikiExporter.php(479): WikiExporter->outputPageStreamBatch(Wikimedia\Rdbms\ResultWrapper, stdClass)
#22 /srv/mediawiki/php-1.36.0-wmf.13/includes/export/WikiExporter.php(299): WikiExporter->dumpPages(string, boolean)
#23 /srv/mediawiki/php-1.36.0-wmf.13/includes/export/WikiExporter.php(184): WikiExporter->dumpFrom(string, boolean)
#24 /srv/mediawiki/php-1.36.0-wmf.13/maintenance/includes/BackupDumper.php(318): WikiExporter->pagesByRange(integer, integer, boolean)
#25 /srv/mediawiki/php-1.36.0-wmf.13/maintenance/dumpBackup.php(82): BackupDumper->dump(integer, integer)
#26 /srv/mediawiki/php-1.36.0-wmf.13/maintenance/doMaintenance.php(106): DumpBackup->execute()
#27 /srv/mediawiki/php-1.36.0-wmf.13/maintenance/dumpBackup.php(144): require_once(string)
#28 /srv/mediawiki/multiversion/MWScript.php(101): require_once(string)
#29 {main}

See https://logstash.wikimedia.org/goto/bc9d687961c403d0d34ef3df56d24c16

Event Timeline

Page id: 37917 on nlwiktionary

Info from metadata dump:

<revision>
  <id>2307</id>
  <parentid>1373</parentid>
  <timestamp>2004-05-31T13:01:59Z</timestamp>
  <contributor>
    <username>GerardM</username>
    <id>13</id>
  </contributor>
  <model>wikitext</model>
  <format>text/x-wiki</format>
  <text bytes="0" id="2307" />
  <sha1 />
</revision>
<revision>
  <id>2396</id>
  <parentid>2307</parentid>
  <timestamp>2004-06-11T22:49:42Z</timestamp>
  <contributor>
    <username>Bemoeial</username>
    <id>7</id>
  </contributor>
  <model>wikitext</model>
  <format>text/x-wiki</format>
  <text bytes="0" id="2396" />
  <sha1 />
</revision>
<revision>
  <id>2397</id>
  <parentid>2396</parentid>
  <timestamp>2004-06-12T14:33:09Z</timestamp>
  <contributor>
    <username>Bemoeial</username>
    <id>7</id>
  </contributor>
  <model>wikitext</model>
  <format>text/x-wiki</format>
  <text bytes="0" id="2397" />
  <sha1 />
</revision>
<revision>
  <id>3164</id>
  <parentid>2397</parentid>
  <timestamp>2004-06-12T14:33:46Z</timestamp>
  <contributor>
    <username>Bemoeial</username>
    <id>7</id>
  </contributor>
  <comment>/* te verwijderen vanaf 26-06-2004 */</comment>
  <model>wikitext</model>
  <format>text/x-wiki</format>
  <text bytes="0" id="3164" />
  <sha1 />
</revision>

Revision data:

wikiadmin@10.192.32.183(nlwiktionary)> select * from revision where rev_id in (2307,2396,2397,3164);
+--------+----------+----------------+-----------+----------------+----------------+-------------+---------+---------------+----------+
| rev_id | rev_page | rev_comment_id | rev_actor | rev_timestamp  | rev_minor_edit | rev_deleted | rev_len | rev_parent_id | rev_sha1 |
+--------+----------+----------------+-----------+----------------+----------------+-------------+---------+---------------+----------+
|   2307 |    37917 |              0 |         0 | 20040531130159 |              0 |           0 |    NULL |          1373 |          |
|   2396 |    37917 |              0 |         0 | 20040611224942 |              0 |           0 |    NULL |          2307 |          |
|   2397 |    37917 |              0 |         0 | 20040612143309 |              0 |           0 |    NULL |          2396 |          |
|   3164 |    37917 |              0 |         0 | 20040612143346 |              0 |           0 |    NULL |          2397 |          |
+--------+----------+----------------+-----------+----------------+----------------+-------------+---------+---------------+----------+
4 rows in set (0.00 sec)

Text data:

wikiadmin@10.192.32.183(nlwiktionary)> select * from text where old_id in (2307,2396,2397,3164);
+--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+
| old_id | old_text                                                                                                                                                                                                                                                                                                                                                                                                                                                        | old_flags |
+--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+
|   2307 | «®Î-N·
I�/K-*ÏÌJI-JÍ‹/HLÏÌK,®­ååA] àåòÏJ—$&i…Ì’ÔÜb…ÜüôÔ<…ÜÔ’T Užš                                                                                                                                                                                                                                                                                                                                                                                          | gzip      |
|   2396 | «®I�/K-*ÏÌJI-JÍ‹/HLÏÌK,®­åÒ韔,.ILÒ
™%©¹Å
¹ùé©y
¹©%©@ª<5                                                                                                                                                                                                                                                                                                                                                                                                | gzip      |
|   2397 | ]�OOÂPÄï|Š½‰Æþ…C“^ÔÄ“ñ Æ!dK‡²Ð·�¼¾B"òÝÝ**ñ¶ÉLf~;Çã‹=ÂA6t±ãR”ëÓ©“eA�íYyEÃIÒŸ$Ã~LYÖ¹¡Ùì	…ð›l%}ÀbÊùœòª¾�ô�yKkïC$ד‰Rª7œWÞk¢Ì.4â¬Pj*˜ÃÞ¼©e¿œ-—ÑÝ5â;ª•%”¨ŠH­™¸ª`­*ˆIYÛ”‚~ÙÎW/ öQ05`Û†´i”¾¦ŒÓшº¯ÓûëN;¥Cl
                    (ÛÕžÕ5H£´�#\MΗ¦ÿÙ>                                                                                                                                                                                | gzip      |
|   3164 | ]�OOÂ@Åï|Š¹‰Æþ…C“^ÔÄ“ñ Æ!dÊ>Ê´ÝY²ÝB"òÝÝ**ñ6™·ûÞoÞñ8Çj�ÒÀCW;.D¹9�zYFt¡Ñž•74ž%ÃY2§”e½ž`„ߤ’ô«Ê;Ë%%äT�‘ôíyK[ç| ;�‰’5%çµsš(³õ­Ø€4ú½œ×—vý-Â;êMüU 6�öÑŒ¸®“T’&°�foè—çœ|Õ��¶„x…�ªÇ:tç´þqX4;^ƒ¶Èsè Ò<"�h|zë \ü
               ‘k4¦²Uúªc4M'ê¿Îï¯{]}¡
YWDýïÙ'                                                                                                                                                                                     | gzip      |
+--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+
4 rows in set (0.00 sec)

So sometime around June 2004, things broke.

ArielGlenn renamed this task from Corrupt entries in text table for nlwiktionary causing to Corrupt entries in text table for nlwiktionary causing a lot of MW PHP Warning. Oct 20 2020, 10:34 AM

Might be a case for findBadBlobs.php, and I really think we'll see this on a bunch of wikis.

Note that these errors can be regenerated at any time by going to a relatively idle snapshot host and, as the dumpsgen user, running

php /srv/mediawiki/multiversion/MWScript.php dumpBackup.php --wiki=nlwiktionary --full --stub --report=1000 --output=file:/mnt/dumpsdata/temp/dumpsgen/nlwiktionary-20201020-stub-meta-history.xml --start=37917 --end 37918

Note that the initial report via IRC of over a thousand errors seems to consist entirely of these sorts of errors:

dumpsgen@snapshot1005:/mnt/dumpsdata/xmldatadumps/public/nlwiktionary/20201020$ zcat nlwiktionary-20201020-stub-meta-history.xml.gz  | grep '<sha1 />' | wc -l
1817

I'm running a crap bash script on the fallback dumps NFS server (dumpsdata1003), crunching metadata XML files to see the pattern of empty sha1s for revisions across all the wikis. I'll drop a report in here when it's complete. It's running in a screen session from ariel, as the dumpsgen user.

The following wikis have revisions with an empty sha1:
anwiki azwiki cawiki commonswiki dewiki dewikiversity dewiktionary elwiki enwiki eswiki eswiktionary etwiki frwiki hrwiki nlwiki nlwiktionary nowiki ocwiki plwiki ptwiki slwiki viwiki zhwiki

The crap script I ran to generate lists of all revs with empty sha1s, and lists of just the dates:

#!/bin/bash

cd /data/xmldatadumps/public
wikis=$( ls -d *wik* )
#wikis="nlwiktionary"

# first, get the raw info for each wiki; then produce a uniq list of problem dates for each wiki

for wiki in $wikis; do
    stubsfile="${wiki}/20201001/${wiki}-20201001-stub-meta-history.xml.gz"
    if [ -e "$stubsfile" ]; then
        zcat "$stubsfile"  | egrep  '^      <(id|sha1|timestamp)' | sed -e 's|^      ||g; s|</id>||g; s|</timestamp>||g;' |  sed 'N;N;s|\n| |g'| grep '<sha1 />'| sed -e 's|T.*Z <sh| <sh|g; s|<i
d>|<revid>|g;' | gzip >  "/data/xmldatadumps/temp/${wiki}-empty-sha1-info.gz"
        zcat "/data/xmldatadumps/temp/${wiki}-empty-sha1-info.gz" | sed -e 's|<revid>.*<timestamp>||g;' | sort | uniq | gzip > "/data/xmldatadumps/temp/${wiki}-empty-sha1-dates.gz"
    else
        echo "no such file "$stubsfile", skipping"
    fi
done

Number of affected revs per wiki:

anwiki: number of affected revs: 1
azwiki: number of affected revs: 3
cawiki: number of affected revs: 2
commonswiki: number of affected revs: 1
dewiki: number of affected revs: 979
dewikiversity: number of affected revs: 1
dewiktionary: number of affected revs: 1
elwiki: number of affected revs: 1
enwiki: number of affected revs: 403
eswiki: number of affected revs: 48
eswiktionary: number of affected revs: 71
etwiki: number of affected revs: 1
frwiki: number of affected revs: 2
hrwiki: number of affected revs: 1
nlwiki: number of affected revs: 20
nlwiktionary: number of affected revs: 1817
nowiki: number of affected revs: 2
ocwiki: number of affected revs: 1
plwiki: number of affected revs: 3
ptwiki: number of affected revs: 1
slwiki: number of affected revs: 1
viwiki: number of affected revs: 1
zhwiki: number of affected revs: 1

Dates of problem revs for everything except dewiki, enwiki, eswiki, nlwiktionary:

anwiki: dates of problem revs
2009-03-09 
azwiki: dates of problem revs
2009-03-09 
2009-03-10 
cawiki: dates of problem revs
2009-03-09 
commonswiki: dates of problem revs
2004-12-27 
dewikiversity: dates of problem revs
2008-10-26 
dewiktionary: dates of problem revs
2004-05-01 
elwiki: dates of problem revs
2009-03-09 
eswiktionary: dates of problem revs
2004-06-05 
2004-06-06 
2004-06-08 
etwiki: dates of problem revs
2009-03-09 
frwiki: dates of problem revs
2008-03-13 
2009-03-10 
hrwiki: dates of problem revs
2009-03-09 
nlwiki: dates of problem revs
2004-09-09 
2004-09-17 
2004-09-18 
2004-10-10 
2004-10-26 
2004-10-27 
nowiki: dates of problem revs
2009-03-09 
ocwiki: dates of problem revs
2009-03-09 
plwiki: dates of problem revs
2004-08-28 
ptwiki: dates of problem revs
2009-03-09 
slwiki: dates of problem revs
2008-03-13 
viwiki: dates of problem revs
2009-03-10 
zhwiki: dates of problem revs
2009-03-09

I have the lists of affected revs and dates for all wikis in files in /data/xmldatadumps/temp/ on dumpsdata1003.

A note about the range of affected revisions on the wikis that have a large number of them:

  • dewiki, from 2002-09-13 to 2005-05-14 and then also 2006-04-09 and 2009-03-09
  • enwiki, from 2001-10-01 to 2005-08-25 and then also 2006-04-09 and 2009-03-09
  • eswiki, from 2004-02-19 through 2004-11-25 and then 2009-03-09
  • nlwiktionary: 2003-12-23, and then 2004-04-03 to 2004-07-25

At this point, input from others is needed. If we are to run findBadBlobs, should we first look for old bugs about corrupted data, revisions, or text? Is there any easy way to find those old bugs, or does someone perhaps remember or have a record of these sorts of issues from the 2004 period or the single date in 2009? Is running that script the way to go, or is some other approach better?

AMooney triaged this task as Medium priority. Nov 2 2020, 2:33 PM
Krinkle renamed this task from Corrupt entries in text table for nlwiktionary causing a lot of MW PHP Warning to Corrupt nlwiktionary text causing "PHP Warning: gzinflate(): data error". Jan 27 2021, 8:04 PM
Krinkle moved this task from Untriaged to Older on the Wikimedia-production-error board.
Krinkle renamed this task from Corrupt nlwiktionary text causing "PHP Warning: gzinflate(): data error" to nl.wiktionary.org faces "PHP Warning: gzinflate(): data error" (sometimes with fatal RevisionAccessException). Apr 18 2021, 10:42 PM
Krinkle subscribed.

I can reproduce this at https://nl.m.wiktionary.org/wiki/Speciaal:MobielVerschillen/7374.

[1726c3b8-fb9f-4cb6-b532-3b5ac293b321] /wiki/Speciaal:MobielVerschillen/7374   PHP Warning: gzinflate(): data error

#1 /srv/mediawiki/php-1.37.0-wmf.1/includes/Storage/SqlBlobStore.php(593): gzinflate(string)
#2 /srv/mediawiki/php-1.37.0-wmf.1/includes/Storage/SqlBlobStore.php(520): MediaWiki\Storage\SqlBlobStore->decompressData(string, array)
#3 /srv/mediawiki/php-1.37.0-wmf.1/includes/Storage/SqlBlobStore.php(430): MediaWiki\Storage\SqlBlobStore->expandBlob(string, array, string)
…
#9 /srv/mediawiki/php-1.37.0-wmf.1/includes/Revision/RevisionStore.php(1463): MediaWiki\Revision\RevisionStore->loadSlotContent(MediaWiki\Revision\SlotRecord, NULL, NULL, NULL, integer)
…
#19 /srv/mediawiki/php-1.37.0-wmf.1/extensions/MobileFrontend/includes/specials/SpecialMobileDiff.php(192): DifferenceEngine->showDiffPage(boolean)

Followed by:

RevisionAccessException
MediaWiki\Revision\RevisionAccessException: Failed to load data blob from tt:3488: Bad data in text row 3488. Use findBadBlobs.php to remedy.. If this problem persist, use the findBadBlobs maintenance script to investigate the issue and mark bad blobs.

I see a spike of 1815 of these around 09:14 UTC today for nlwiktionary.

nlwiktionary and dewiktionary spike at 9:00 UTC today

https://nl.wiktionary.org/w/index.php?oldid=22&uselang=en
[79d39db3-ed1e-440d-b472-e0453f2bad40] 2022-05-09 22:41:06: Fatal exception of type "MediaWiki\Revision\RevisionAccessException"

(nlwiktionary)> SELECT * FROM slots WHERE slot_revision_id=22;
+------------------+--------------+-----------------+-------------+
| slot_revision_id | slot_role_id | slot_content_id | slot_origin |
+------------------+--------------+-----------------+-------------+
|               22 |            1 |            3949 |          22 |
+------------------+--------------+-----------------+-------------+
1 row in set (0.001 sec)

(nlwiktionary)> SELECT * FROM text WHERE old_id=22 LIMIT 1;
+--------+----------------------------------+-----------+
| old_id | old_text                         | old_flags |
+--------+----------------------------------+-----------+
|     22 | –;¸àïnre)¯’ÔÍÊ…´fv…´F˜ÉÙ®CbÀîÂ`úóòÇ@ÿ?±™ß�£ßƒÝø�9âã³2A€!¸äv'Ô0ÃÛ‚›îJ¿Ÿ#âéó-Í€~{… | gzip      |
+--------+----------------------------------+-----------+
1 row in set (0.001 sec)

So we do have something there, but rather than a reference to something like DB://cluster#/### it is an inline gzip-compressed blob. Attempting to gzinflate it, however, returns false.

As it lacks the utf-8 flag, I checked whether it might be in the 'windows-1252' charset ($wgLegacyEncoding, though that is not set for nlwiktionary). But alas, no luck with that either.

return gzinflate($data->old_text);
# bool(false)

return $data->old_text;
# –;¸àïnre)¯’ÔÍÊ…´fv…´F˜ÉÙ®CbÀîÂ`úóòÇ@ÿ?±™ß�£ßƒÝø�9âã³2A€!¸äv'Ô0ÃÛ‚›îJ¿Ÿ#âéó-Í€~{uZrƒ(Àž�u¥ÔëRëO`Î5^,•ÆßÛ¿ù¹‹9ñioã

return iconv('windows-1252', 'UTF-8//IGNORE', $data->old_text);
# –;¸àïnre’ÔÃÊ…´fv…´F˜ÉÙ®CbÀîÂ`úóòÇ@ÿ?±™ß�£ßƒÃø�9âã³2AšÃ¸äv'Ô0ÃÛ‚›îJ¿Ÿ#âéó-À~{uZrÆ’(Àž�u¥ÔëRë^,•ÆßÛ¿ù¹‹9ñioã

return gzinflate(iconv('windows-1252', 'UTF-8//IGNORE', $data->old_text));
# bool(false)

I've gone through a number of other charset pairs with iconv, including with and without gzinflate, but didn't find anything that led to recognisable text or e.g. serialised PHP syntax.
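
For the record, that brute force looked roughly like this; a sketch only, where $blob stands for the raw old_text value and the charset list is just a sample:

// Sketch of the charset-pair brute force described above; nothing here is MediaWiki-specific.
$charsets = [ 'windows-1252', 'ISO-8859-1', 'ISO-8859-15', 'UTF-8' ];
foreach ( $charsets as $from ) {
    foreach ( $charsets as $to ) {
        if ( $from === $to ) {
            continue;
        }
        $converted = @iconv( $from, "$to//IGNORE", $blob );
        if ( $converted === false ) {
            continue;
        }
        // Try the converted bytes both as-is and as raw deflate data.
        foreach ( [ $converted, @gzinflate( $converted ) ] as $candidate ) {
            if ( is_string( $candidate ) && preg_match( '/\[\[|\{\{|==/', $candidate ) ) {
                echo "possible match via $from -> $to\n";
            }
        }
    }
}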

The revision in question belongs to the ja entry of nl.wiktionary.org at https://nl.wiktionary.org/w/index.php?title=ja. I find that each of that page's first ten revisions, from May 2004, is similarly broken, from revision 22 up to and including revision 3161. There are, however, revisions of other pages from May 2004 on that same wiki that are not broken; e.g. https://nl.wiktionary.org/w/index.php?oldid=1300 is fine.

SAL: https://wikitech.wikimedia.org/wiki/Server_Admin_Log/Archive_1.

Mentioned in SAL (#wikimedia-operations) [2022-07-29T22:09:16Z] <Krinkle> findBadBlobs.php nlwiktionary --revisions 22 --mark 'Invalid gzip, T265989'

I used findBadBlobs.php to scan with a low limit (100 revs) from 2004-05-04 on nlwiktionary, and indeed found numerous other bad blobs. I then stepped back in time until there was a run of 100 revs without a bad blob.

nlwiktionary Jan to March 2004 - OK
$ mwscript findBadBlobs.php nlwiktionary --scan-from 2004-01-01T00:00:00 --limit 200
Scanning revisions table, 200 rows starting at rev_timestamp 20040101000000
        - Scanned a batch of 200 revisions, up to revision 301235 (20040317163014)  
Scanning archive table by ar_rev_id, 33490 to 301236
        - Scanned a batch of 200 archived revisions, up to revision 33690 (20040317163014)
nlwiktionary April 2004 - Bad
$ mwscript findBadBlobs.php nlwiktionary --scan-from 2004-01-01T00:00:00 --limit 300
Scanning revisions table, 300 rows starting at rev_timestamp 20040101000000
! Found bad blob on revision 2093 from 20040403182737 (main slot): content_id=5029, address=<tt:2093>
! Found bad blob on revision 2592 from 20040406090514 (main slot): content_id=5279, address=<tt:2592>
! Found bad blob on revision 2067 from 20040407131506 (main slot): content_id=5028, address=<tt:2067>
! Found bad blob on revision 249 from 20040501185440 (main slot): content_id=4073, address=<tt:249>
! Found bad blob on revision 251 from 20040501190827 (main slot): content_id=4075, address=<tt:251>
! Found bad blob on revision 1 from 20040501192407 (main slot): content_id=3933, address=<tt:1>
! Found bad blob on revision 239 from 20040501193827 (main slot): content_id=4063, address=<tt:239>
! Found bad blob on revision 28 from 20040501194629 (main slot): content_id=3955, address=<tt:28>
! Found bad blob on revision 2 from 20040501195750 (main slot): content_id=3934, address=<tt:2>
! Found bad blob on revision 9 from 20040501200123 (main slot): content_id=3939, address=<tt:9>
! Found bad blob on revision 168 from 20040501200328 (main slot): content_id=4025, address=<tt:168>
Scanned a batch of 300 revisions, up to revision 301244 (20040501200328)

Scanning archive table by ar_rev_id, 0 to 301245
! Found bad blob on revision 12 from 20040425235525 (main slot): content_id=3716308, address=<tt:12>
! Found bad blob on revision 18 from 20040329105152 (main slot): content_id=3716309, address=<tt:18>

Scanned a batch of 300 archived revisions, up to revision 791 (20040501200328)

Found 125 bad revisions.
$ mwscript findBadBlobs.php nlwiktionary --scan-from 2004-05-01T00:00:00 --limit 5000
Scanning revisions table, 5000 rows starting at rev_timestamp 20040501000000
! Found bad blob on revision 249 from 20040501185440
! Found bad blob on revision 251 from 20040501190827
! Found bad blob on revision 1 from 20040501192407
! Found bad blob on revision 239 from 20040501193827
! Found bad blob on revision 28 from 20040501194629
! Found bad blob on revision 2 from 20040501195750
! Found bad blob on revision 9 from 20040501200123
! Found bad blob on revision 168 from 20040501200328
! Found bad blob on revision 3 from 20040501200958
! Found bad blob on revision 4 from 20040501211825
! Found bad blob on revision 14 from 20040501234547
! Found bad blob on revision 8 from 20040502065500
! Found bad blob on revision 10 from 20040502074724
! Found bad blob on revision 7 from 20040502095513

! Found bad blob on revision 349 from 20040511213726
! Found bad blob on revision 3013 from 20040512104923
! Found bad blob on revision 3088 from 20040512124023
! Found bad blob on revision 3165 from 20040512130300

Scanned a batch of 1000 revisions, up to revision 301253 (20040802075855)

Scanning archive table by ar_rev_id, 0 to 301254
! Found bad blob on revision 12 from 20040425235525
! Found bad blob on revision 18 from 20040329105152
! Found bad blob on revision 32 from 20040505151705
! Found bad blob on revision 33 from 20040505151734
! Found bad blob on revision 42 from 20040505163313
! Found bad blob on revision 45 from 20040505165211
! Found bad blob on revision 48 from 20040505163400
! Found bad blob on revision 59 from 20040505165256

! Found bad blob on revision 4098 from 20040716162301
! Found bad blob on revision 4100 from 20040716162545
! Found bad blob on revision 4101 from 20040716162912
! Found bad blob on revision 4102 from 20040716162921
! Found bad blob on revision 4105 from 20040630210610
! Found bad blob on revision 4106 from 20040716201850
! Found bad blob on revision 4116 from 20040609120428
! Found bad blob on revision 4119 from 20040718005425
! Found bad blob on revision 4127 from 20040716202548
! Found bad blob on revision 4129 from 20040719121243
! Found bad blob on revision 4136 from 20040612212532
! Found bad blob on revision 4154 from 20040719150223
! Found bad blob on revision 4173 from 20040724135534
! Found bad blob on revision 4174 from 20040724135737
! Found bad blob on revision 4176 from 20040617215747
- Scanned a batch of 1000 archived revisions, up to revision 11867 (20040802075855)
- Scanned a batch of 1000 archived revisions, up to revision 14657 (20040802075855)
- Scanned a batch of 1000 archived revisions, up to revision 32778 (20040802075855)
- Scanned a batch of 1000 archived revisions, up to revision 33882 (20040802075855)
The range of archive rows scanned is based on the range of revision IDs scanned in the revision table.
Found 2371 bad revisions.

Edits from 2003 and early 2004 appear to be okay, such as those seen at https://nl.wiktionary.org/w/index.php?title=WikiWoordenboek:Lijst_van_Engelse_woorden/b&dir=prev&action=history&limit=10.

The fact that these 2003 edits have a higher revision ID than the broken ones from May 2004, including revision 1 and many others among the first few hundred revisions, suggests to me the following scenario:

  • The wiki was created in May 2004, and something happened at some point (possibly many years later) that caused the first week of edits to be lost.
  • At some point during that week, edits from 2003 were imported; most of these are not broken.
  • Additional edits that first week, after the import, are also broken.
  • Edits from after the first week remain fine.
(nlwiktionary)> SELECT old_id, SUBSTR(old_text, 1, 10) FROM text ORDER BY old_id ASC LIMIT 1000;
+--------+-------------------------+
| old_id | SUBSTR(old_text, 1, 10) |
+--------+-------------------------+
|      1 | ]Ž½
1�                 |
|      2 | ÍTQkÔ@~                |
|      3 | åZmSþ                |
|      4 | ¥Y[SÛÈ�                  |
|      7 | UTÑNÛ@|                |
|      8 | ÕVÍnÛF                 |
…
|     65 | ¥ZYsÛ¸�                  |
…
|    350 | •ZÛn                 |
|    351 | MOA‚0�                 |
|    352 | •ZÛn                 |
|    353 | {{msg:-nl-              |
|    354 | [[geslacht              |
|    355 | [[Afbeeldi              |
|    356 | {{msg:-nou              |
Krinkle renamed this task from nl.wiktionary.org faces "PHP Warning: gzinflate(): data error" (sometimes with fatal RevisionAccessException) to nl.wiktionary.org edits from May 2004 corrupt "PHP Warning: gzinflate(): data error" (fatal RevisionAccessException). Jul 29 2022, 10:40 PM

Mentioned in SAL (#wikimedia-operations) [2022-07-29T22:43:45Z] <Krinkle> krinkle@mwmaint1002$ mwscript findBadBlobs.php nlwiktionary; mark 2371 blobs from May 2004 as "Invalid gzip, T265989"

I've marked them as bad blobs to resolve the production error. I've re-triaged the task as Wikimedia-database-issue (Bad data) to potentially investigate further and/or recover at some future point.

Can you provide more examples of the bad data from affected revisions? The –; at the start feels suspect to me; could it be a delimiter that isn't supposed to be there?

While of course hex dumps would be helpful, I already did notice something:

return iconv('windows-1252', 'UTF-8//IGNORE', $data->old_text);
# –;¸àïnre’ÔÃÊ…´fv…´F˜ÉÙ®CbÀîÂ`úóòÇ@ÿ?±™ß�£ßƒÃø�9âã³2AšÃ¸äv'Ô0ÃÛ‚›îJ¿Ÿ#âéó-À~{uZrÆ’(Àž�u¥ÔëRë^,•ÆßÛ¿ù¹‹9ñioã

The substring � is what you would get if $data->old_text contains the bytes EF BF BD (U+FFFD REPLACEMENT CHARACTER in UTF-8). So it is quite possible that this resulted from a bad UTF-8 conversion similar to the one that caused data corruption on eswiktionary (T2950). At least in the case of revision 22, the corruption does not appear to have happened recently; the text is missing from nlwiktionary-20060703-pages-meta-history.xml.7z.
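
A quick way to test that theory against more of the affected rows (a sketch; $blob is assumed to hold the raw old_text bytes, fetched however is convenient):

// Count and locate EF BF BD (U+FFFD encoded as UTF-8) sequences in a raw old_text blob.
$needle = "\xEF\xBF\xBD";
echo substr_count( $blob, $needle ) . " replacement-character sequences found\n";
$offset = 0;
while ( ( $pos = strpos( $blob, $needle, $offset ) ) !== false ) {
    // Print a small hex window around each hit, to compare the damage across revisions.
    echo $pos . ': ' . bin2hex( substr( $blob, max( 0, $pos - 4 ), 11 ) ) . "\n";
    $offset = $pos + 3;
}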

Some history

T2950 lacks some details as to how this sort of corruption likely happened, so I did a little research.

As mentioned in the wikitech-l thread "Corrupt old entries on eswiktionary", some wikis were converted to UTF-8 without the use of $wgLegacyEncoding (which was added to speed up migrations of larger wikis) by first making a backup dump, then converting the character encoding of the dump and restoring the database tables from the converted dump. Most likely, some version of a C++ program written by Med was used. (wikitech-l message from Shaihulud following the frwiki conversion, message from Med saying why this approach was taken, later message with a URL for Med's software, Wayback Machine copies of some of Med's files)

Med's conversion program was a simple iconv-like one that used Qt classes to convert from a legacy encoding (such as "ISO8859-1") to UTF-8, with the additional feature of decoding HTML entities. The program was totally unaware of MediaWiki's database schema, and if run on a dump of the old table, it could easily and irreversibly corrupt compressed revision text. At least in the case of the version used for the eswiktionary conversion, any of several byte values in the input could become EF BF BD. It seems that Shaihulud became aware of this problem prior to performing the nlwiktionary conversion. So why might Shaihulud possibly have continued to use that conversion program on the old table? Maybe Shaihulud was unaware that some nlwiktionary revisions had been compressed. Unfortunately, it seems that this issue went unnoticed, or at least was ignored, for nearly six years.
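
To illustrate why such a conversion is destructive for compressed rows (a toy example, not Med's actual tool): rows with old_flags='gzip' hold raw gzdeflate() output, and once those bytes are pushed through a character-set conversion the deflate stream is ruined, after which gzinflate() reports exactly this kind of data error.

// Toy demonstration: a charset conversion applied to already-deflated bytes is irreversible.
$original  = "{{-nl-}}\n'''voorbeeld'''\n";             // some wikitext-ish sample text
$deflated  = gzdeflate( $original );                     // roughly what an old_flags='gzip' row contains
$converted = iconv( 'ISO-8859-1', 'UTF-8', $deflated );  // what an iconv-like dump converter would do
var_dump( gzinflate( $deflated ) );   // string: the untouched blob round-trips fine
var_dump( gzinflate( $converted ) );  // almost certainly bool(false) plus "gzinflate(): data error"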

Backups?

In a wikitech-l thread from June 2010, the author of this task, @ArielGlenn, asked if anyone still had a full history dump of nlwiktionary from between June and October 2004. Though WMF's copies may well be long gone by now, I did manage to find SQL dumps of nlwiktionary's cur and old tables in the Wayback Machine.

These particular dumps are too old to include all of the revisions that were corrupted, though they should include many of them. (The only newer nlwiktionary SQL dump I have found so far is the one from March 2005 on dumps.wikimedia.org, which includes only the cur table. Also, if the corruption happened during the UTF-8 conversion, that dump would be way too new anyway; see the wikitech-l post "nl:wiktionary now on UTF-8 :)", from July 2004.)

A script could be written that would read cur/old SQL dumps, try to match revisions contained in the dumps with revisions on the site, and restore any corrupted revisions from the dumps. If such a script were capable of matching revisions not only by revision IDs (which, prior to MediaWiki 1.5, changed after a delete/undelete cycle) but also by user IDs/IPs and timestamps, or by MD5 hashes, maybe it could be useful for T147146 as well.
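
A rough sketch of what the matching step of such a script might look like; the field names are hypothetical, and it assumes the 2004 cur/old dump has already been parsed into per-revision arrays:

// Sketch: index dump rows by (timestamp, user text) and look up each corrupted live revision.
// $dumpRows are rows parsed from the 2004 old-table dump; $badRevs are the revisions that
// findBadBlobs.php flagged, carried here with their live timestamp and actor name/IP.
function buildDumpIndex( array $dumpRows ): array {
    $index = [];
    foreach ( $dumpRows as $row ) {
        $key = $row['old_timestamp'] . '|' . $row['old_user_text'];
        $index[$key][] = $row;
    }
    return $index;
}

function findRestoreCandidates( array $badRevs, array $index ): array {
    $candidates = [];
    foreach ( $badRevs as $rev ) {
        $key = $rev['rev_timestamp'] . '|' . $rev['actor_name'];
        if ( isset( $index[$key] ) && count( $index[$key] ) === 1 ) {
            // Unambiguous match: propose the dump text as the replacement for the bad blob.
            $candidates[ $rev['rev_id'] ] = $index[$key][0]['old_text'];
        }
        // Anything ambiguous or missing would need manual review before restoring.
    }
    return $candidates;
}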

If you are reading this and are interested in writing such a script, please don't assume I will end up completing the work. (After all, I spent many hours working on a script to fix T24624, and never ended up finishing it. In that case, ultimately it was decided to delete the data as part of the MCR work. Hopefully, this time will be different.)