
Database corruption due to compressOld array plus bug, April 2006
Open, Medium, Public

Description

Timeline:

  • May 2005: compressOld.php run, without the 120 byte threshold
  • April ~20, 2006: compressOld.php run again, this time with the 120 byte threshold. The filter for previously compressed rows was ignored due to the array plus bug. Texts longer than 120 bytes were rewritten, but texts shorter than 120 bytes were left as HistoryBlobStub objects pointing to nonexistent data.
  • April 22, 2006: checkStorage.php written
  • April 24, 2006: checkStorage.php run on all affected wikis. Corrupted text was restored from the XML backups available at the time. Empty revisions were not restored, and text which was too new for the latest backup was not restored.
  • December 2017: Storage errors changed from returning null/false to throwing
  • January 2022: This bug, original report:

Revision 5730218 in German Wikipedia contains bad text, and as a result https://de.wikipedia.org/w/index.php?oldid=5730218 causes a fatal error. This probably needs recovering from backup if the corruption is new. Something hit that revision 20k times and caused an alert: https://logstash.wikimedia.org/goto/a8385d0d2cc5ec8e24fc3df102af4388

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2022-01-18T11:28:15Z] <Amir1> mwscript findBadBlobs.php --wiki=dewiki --revisions 5730218 --mark "T299387"

Now it won't fatal anymore at least.

Hey, should I try to search the backups for that blob? I guess there is a very small chance it is there, but it takes very little to check. Do you know the missing key on the ES database to search for?

Sadly I cannot know the blob_id that was failing, because after "fixing it", it is overridden in the metadata. Do the logs have this id as part of the error? Otherwise I would have to check the backups first to find the old metadata content.

I think I found it, the referenced text row says:

   old_id: 5730218
 old_text: O:15:"historyblobstub":2:{s:6:"mOldId";s:7:"5815762";s:5:"mHash";s:32:"d41d8cd98f00b204e9800998ecf8427e";}
old_flags: object

I think I need help interpreting this. I know how to handle and recover something like DB://clusterX/<blobid>/<hash>, but I am not sure where to look in the backups (content) for this. Or would this be the (metadata) row to search for? Is this some kind of compression? Is it referencing another row in the same table? I am lost.

After reading some docs, the text row seems to indicate "it has the same content as text row old_id = 5815762" (which is revision 5806950: https://de.wikipedia.org/w/index.php?title=Enzym&oldid=5806950), and that seems very plausible in context.

I am unsure why the row is failing. Could this be a software or infra bug (e.g. a serialization formatting difference) rather than a data issue? The data seems as expected (although I am not familiar enough with MediaWiki to be sure).

The revision is vandalism with zero useful information, but the bug or compatibility issue may be interesting for the MediaWiki developers.

If this turns out to be a software regression, the way to revert the metadata edit (which was non-destructive in any case) would be:

UPDATE `dewiki`.`content` SET content_address = 'tt:5730218' WHERE content_id = 5349895;

Sorry to add you, @daniel, but you are the biggest expert I know on the metadata tables, and the content table in particular. Could you help me come up with an explanation for why something like "object-type text table references on dewiki" may have regressed? Or PHP serialization changes? See my comments above.

I think I need help interpreting this. I know how to handle and recover something like DB://clusterX/<blobid>/<hash>, but I am not sure where to look in the backups (content) for this. Or would this be the (metadata) row to search for? Is this some kind of compression? Is it referencing another row in the same table? I am lost.

It's a HistoryBlobStub, which is indeed part of a compression scheme: multiple revisions are compressed together into a blob, and a stub is put in their place that points to the location of that compressed blob and to a text within it (by hash). The blob itself seems to be stored in ES, at DB://cluster5/16645/3f07bcd2ed6e40844da84f4cdfd70143.
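
To make the indirection concrete, here is a minimal sketch of how such a stub is meant to be resolved, loosely following HistoryBlobStub::getText(); variable names are illustrative and error handling is omitted:

$dbr = wfGetDB( DB_REPLICA );
// $row is the text row holding the stub (old_id 5730218 above).
$stub = unserialize( $row->old_text );   // mOldId = 5815762, mHash = d41d8cd9...

// 1. Load the "main" text row the stub points at.
$mainRow = $dbr->selectRow( 'text', [ 'old_text', 'old_flags' ],
    [ 'old_id' => $stub->getLocation() ] );

// 2. If that row is an external pointer (e.g. DB://cluster5/16645), fetch the blob from ES.
$flags = explode( ',', $mainRow->old_flags );
$blobText = in_array( 'external', $flags )
    ? MediaWiki\MediaWikiServices::getInstance()->getExternalStoreAccess()
        ->fetchFromURL( $mainRow->old_text )
    : $mainRow->old_text;

// 3. That blob is a serialized ConcatenatedGzipHistoryBlob holding many texts keyed by MD5 hash.
$cgz = unserialize( $blobText );
$text = $cgz->getItem( $stub->mHash );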

After reading some docs, the text row seems to indicate "it has the same content as text row old_id = 5815762" (which is revision 5806950: https://de.wikipedia.org/w/index.php?title=Enzym&oldid=5806950), and that seems very plausible in context.

No, it doesn't have the same content - that blob should be a meta-blob: it contains multiple blobs in an array (and is then compressed). @tstarling is the expert on how exactly this works.

But... tt:5815762 doesn't contain a "meta-blob", or any compressed data. So the old_id in the HistoryBlobStub is wrong. It's pointing to the wrong blob. I can't think of a way for this to happen after the fact, unless a bit was flipped somewhere.

How did you find revision 5806950? That points to blob tt:5806950, which is very similar, but not exactly the same.

A server admin log entry from March 30, 2006 indicates that I ran resolveStubs.php. This was supposed to remove all HistoryBlobStub objects from the database.

T22757#265050 mentions that resolveStubs.php somehow failed to resolve all stubs, and that some remaining stubs were corrupted by recompressTracked.

This revision in particular was apparently not corrupted by recompressTracked. The data is there, same as always:

$ mwscript maintenance/eval.php --wiki=dewiki
> $dbc = MediaWiki\MediaWikiServices::getInstance()->getDBLoadBalancerFactory()->getExternalLB('cluster5')->getConnectionRef(DB_REPLICA);
> $blob = $dbc->selectField('blobs','blob_text', ['blob_id' => 16645]);
> $obj = unserialize($blob);
> print $obj->getItem('d41d8cd98f00b204e9800998ecf8427e');

{{Vandalismussperre}}
{{keine Auskunft}}

Außer diesem Artikel "Auferstehung" gibt es noch einen Artikel [[Auferstehung_Jesu_Christi]]. Der artikelteil "Die historisch-kritische Diskussion ist eine umfangreichen Darstellung philsosophischer und theologischer Gedanken zur Auferstehung Jesu und beginnt mit dem Abschnitt "Rationalismus"--[[Benutzer:Ulamm|Ulamm]] 11:20, 13. Nov. 2006 (CET)

...

Maybe something is just going wrong with HistoryBlobStub.php lines 107-116:

			if ( in_array( 'external', $flags ) ) {
				$url = $row->old_text;
				$parts = explode( '://', $url, 2 );
				if ( !isset( $parts[1] ) || $parts[1] == '' ) {
					return false;
				}
				$row->old_text = MediaWikiServices::getInstance()
					->getExternalStoreAccess()
					->fetchFromURL( $url );
			}

Ideally we would resolve the stubs. There are a lot:

MariaDB [dewiki]> select count(*) from text where old_text like 'O:15:"historyblobstub"%' and old_id<10000000;
+----------+
| count(*) |
+----------+
|     3449 |
+----------+
1 row in set (7.545 sec)

$blob = $dbc->selectField('blobs','blob_text', ['blob_id' => 16645]);
$obj = unserialize($blob);
print $obj->getItem('d41d8cd98f00b204e9800998ecf8427e');

Actually $obj here is a DiffHistoryBlob and DiffHistoryBlob::getItem() casts the key to an int, so the text is just some random text, not actually the text of the revision.
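
For illustration, assuming that cast behaves like a plain PHP (int) cast: a hex hash that doesn't start with a digit collapses to 0, so the lookup returns whatever item happens to sit at index 0.

var_dump( (int)'d41d8cd98f00b204e9800998ecf8427e' );   // int(0)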

So the original text was probably lost due to T22757.

Haha d41d8cd98f00b204e9800998ecf8427e is the MD5 hash of the empty string. We can recover the text from the hash.
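
Easy to verify in eval.php:

> print md5( '' );
d41d8cd98f00b204e9800998ecf8427e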

Would it be possible to introduce a revision that fixes the missing reference? If you document the process, I can fix future cases (or at least avoid errors).

$blob = $dbc->selectField('blobs','blob_text', ['blob_id' => 16645]);
$obj = unserialize($blob);
print $obj->getItem('d41d8cd98f00b204e9800998ecf8427e');

Actually $obj here is a DiffHistoryBlob and DiffHistoryBlob::getItem() casts the key to an int, so the text is just some random text, not actually the text of the revision.

The table should have been blobs_cluster5, not blobs. Then you get a CGZ blob. But it's still not the right CGZ blob.
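
For reference, the earlier eval.php lines with the corrected table name (as noted above, this does return a CGZ blob, just still not the right one for this revision):

$dbc = MediaWiki\MediaWikiServices::getInstance()->getDBLoadBalancerFactory()->getExternalLB('cluster5')->getConnectionRef(DB_REPLICA);
$blob = $dbc->selectField('blobs_cluster5', 'blob_text', ['blob_id' => 16645]);
$obj = unserialize($blob);
print get_class($obj);   // a ConcatenatedGzipHistoryBlob rather than a DiffHistoryBlob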

Of the 3449 HistoryBlobStub objects in dewiki, 2269 are still working fine. They probably point to the 1129 CGZ blobs that are still directly in dewiki's text table. This is the correct usage of HistoryBlobStub, although I thought we had removed all instances of it from production.

Of the remainder, 243 are the empty string. 27 more have a hash which is in a CGZ blob in the text table (probably by coincidence). That leaves 910 probably lost.

Would it be possible to introduce a revision that fixes the missing reference? If you document the process, I can fix future cases (or at least avoid errors).

I will probably write a script, or update the existing scripts, or both. The idea will be to resolve this category of errors automatically. New instances of this category of error are not being created, because we stopped trying to do multi-revision text compression. For the last decade, we just threw hardware at the problem.

[0448][tstarling@mwmaint1002:~]$ mwscript maintenance/storage/storageTypeStats.php --wiki=dewiki
Using bin size of 100000
226000000

Flags                         Class                                   Count               old_id range                 
------------------------------------------------------------------------------------------------------------------------
external,utf-8                CGZ pointer                             15235546            0             - 54600000     
external,utf-8                DHB pointer                             21783105            0             - 54600000     
gzip                          [none]                                  39406               0             - 6800000      
object                        historyblobstub                         3449                200000        - 5900000      
object                        concatenatedgziphistoryblob             1129                4000000       - 5900000      
object                        historyblobcurstub                      313429              6700000       - 7400000      
0                             [none]                                  235                 2600000       - 6800000      
[none]                        [none]                                  551                 3700000       - 6800000      
utf-8,gzip                    [none]                                  779275              3700000       - 28100000     
utf-8,gzip,external           simple pointer                          184983686           12400000      - 226300000    
external,utf8                 DHB pointer                             1611064             33900000      - 46800000     
gzip,external                 simple pointer                          1                   59600000      - 59700000     
error                         [none]                                  2524                172100000     - 172600000    
external,object               simple pointer                          1580                172100000     - 172600000

Change 858837 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] Update moveToExternal and resolveStubs

https://gerrit.wikimedia.org/r/858837

EDIT: removed incorrect explanation. compressOld.inc had an overcomplicated condition for old_flags, but testing shows that it prevents recompression of a CGZ main text pointer like DB://cluster5/1234 since such pointers have old_flags like %object%.

SAL April 24 is possibly related:

11:20 Tim: checkStorage.php completed, I'm now using it to fix the wikis corrupted by a bug in compressOld.php. Sample output at http://p.defau.lt/?z9EhTaOllcxImxBw0VacnQ , the rest is in /home/wikipedia/logs/checkStorage . Currently running on srv31, I might need to move to benet later to get higher dump filtering speeds

checkStorage.php:

		if ( $text === '' ) {
			// This is what happens if the revision was broken at the time the
			// dump was made. Unfortunately, it also happens if the revision was
			// legitimately blank, so there's no way to tell the difference. To
			// be safe, we'll skip it and leave it broken
			$id = $id ? $id : '';
			echo "Revision $id is blank in the dump, may have been broken before export\n";
			return;
		}

That explains why the empty revisions weren't fixed at the time.

Mysteries remain. There should never have been HistoryBlobStub objects for empty revisions in the first place, because since r6138, October 2004, texts shorter than 120 bytes were skipped. The revision in the task description has a timestamp of 2005-05-10.

We know that a version of compressOld.php without the 120 byte threshold was run on dewiki, because that's the only thing that could create the empty HBS objects. It was run some time after 2005-05-14, which is the latest rev_timestamp for such an object. The upgrade to MW 1.5 was done in June-July 2005. The compressOld.inc in the REL1_4 branch indeed does not have the 120 byte threshold. So that all makes sense since the SAL shows me running compressOld.php in late May.

We know that when the CGZ objects were later overwritten, that was done by compressOld.php with the --extdb option, because that is the only thing that can write CGZ objects with empty mDefaultHash. I confirmed that in external CGZ objects targeted by HBS objects, the mDefaultHash is always empty, with eval.php as follows:

$dbr = wfGetDB(DB_REPLICA);
$res = $dbr->query('select * from text where old_id<5900000 and old_text like \'O:15:"historyblobstub"%\'');
$exa = MediaWiki\MediaWikiServices::getInstance()->getExternalStoreAccess();
foreach ( $res as $row ) { $obj = unserialize($row->old_text); $mainRow = $dbr->selectRow('text','*',['old_id' => $obj->getLocation()]); if ( strpos($mainRow->old_flags, 'external') === false) continue; $parts = explode('/', $mainRow->old_text); $blob = $exa->fetchFromURL( "DB://{$parts[2]}/{$parts[3]}" ); $mainObj = unserialize($blob); print "{$row->old_id}: "; var_dump( $mainObj->mDefaultHash ); }

The latest old_id that points to DB://cluster4/%/% is 15778868, which has a rev_timestamp of 2006-04-20. And as previously noted, I wrote checkStorage.php on April 22 and ran it on April 24.

I think the relevant bug was misuse of array plus, causing $conds[0] to be ignored. I think I fixed it in the production working copy, and the fix was eventually committed to Subversion by Brion in d88bf87284c59097878f761cdbcfa27c75b6262c.
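
For readers unfamiliar with the pitfall: PHP's array plus operator keeps the keys of its left-hand operand, so combining two condition arrays that both use numeric key 0 silently drops one of them. A hypothetical illustration (the condition strings here are made up, not the actual compressOld code):

// The filter for already-compressed rows sits at numeric key 0.
$conds = [ "old_flags NOT LIKE '%object%'" ];
// Some other condition array, also starting at key 0.
$extraConds = [ 'old_namespace = 0' ];

// Array plus keeps the left operand's keys, so $conds[0] is silently dropped:
print_r( $extraConds + $conds );               // [0] => old_namespace = 0
// array_merge() renumbers integer keys and keeps both conditions:
print_r( array_merge( $extraConds, $conds ) ); // [0] => old_namespace = 0, [1] => old_flags NOT LIKE '%object%'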

tstarling renamed this task from Bad revision in German Wikipedia to Database corruption due to compressOld array plus bug, April 2006. Nov 27 2022, 11:12 PM
tstarling updated the task description.

Change 858837 merged by jenkins-bot:

[mediawiki/core@master] Update moveToExternal and resolveStubs

https://gerrit.wikimedia.org/r/858837

I'm running this for nlwiki for an unrelated reason. These are the ones it couldn't find:

Error at old_id 880583: can't find main text row old_id 759483
Error at old_id 880584: can't find main text row old_id 696197
Error at old_id 880585: can't find main text row old_id 696197
Error at old_id 880586: can't find main text row old_id 696197
Error at old_id 880587: can't find main text row old_id 696197
Error at old_id 880588: can't find main text row old_id 696197
Error at old_id 880589: can't find main text row old_id 696197
Error at old_id 880590: can't find main text row old_id 696197
Error at old_id 880591: can't find main text row old_id 696197
Error at old_id 880592: can't find main text row old_id 696197
Error at old_id 880593: can't find main text row old_id 696197
Error at old_id 880594: can't find main text row old_id 696197
Error at old_id 880595: can't find main text row old_id 696197
Error at old_id 880596: can't find main text row old_id 696197
Error at old_id 880597: can't find main text row old_id 696197
Error at old_id 880598: can't find main text row old_id 696197
Error at old_id 880599: can't find main text row old_id 696197
Error at old_id 880600: can't find main text row old_id 696197
Error at old_id 880601: can't find main text row old_id 696197
Error at old_id 880602: can't find main text row old_id 696197

Mentioned in SAL (#wikimedia-operations) [2023-06-13T11:45:15Z] <Amir1> cat wikis_having_stubs | xargs -I {} bash -c 'echo {}; touch /home/ladsgroup/{}.undo.sql; chmod 777 /home/ladsgroup/{}.undo.sql; mwscript maintenance/storage/moveToExternal.php --wiki={} --end 200000000 --undo /home/ladsgroup/{}.undo.sql DB cluster26' (T299387)

Until last week, these wikis had unresolved stubs:

afwiki
arwiki
cawiki
commonswiki
dewiki
enwiki
enwikibooks
enwikinews
enwikiquote
enwiktionary
eswiki
etwiki
fiwiki
frwiki
hewiki
huwiki
itwiki
jvwiki
kuwiki
labswiki
lnwiki
metawiki
nlwiki
nowiki
plwiki
ruwiki
slwiki
sourceswiki
zh_min_nanwiki

I ran moveToExternal.php on every wiki that had stubs, and now only these are left with unresolved stubs:

arwiki
commonswiki
dewiki
enwiki
enwikibooks
eswiki
frwiki
itwiki
nlwiki
plwiki

I guess @tstarling would like to take a look at that (I'm focused on finishing the legacy encoding work instead).