Corruption of text from early 2005 due to HistoryBlobStub pointers broken by recompressTracked.php
Closed, Resolved (Public)

Description

In several articles, revision text from early 2005 appears blank when viewed in the English Wikipedia. IIRC the blank revision text appears around the time that Wikipedia changed its compression formats. This has been discussed several times on the English Wikipedia village pump; the URLs below contain plenty of examples of this problem:

http://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)/Archive_64#Old_versions_of_articles_missing

http://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)/Archive_62#Revision_content_disappeared


Version: unspecified
Severity: major
URL: http://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)/Archive 65#Blank_sequence_of_pages_in_an_article's_history

Details

bzimport set Reference to bz20757.
Graham87 created this task. Sep 21 2009, 2:32 PM
brion added a comment. Sep 21 2009, 4:49 PM

Tim, can you take a peek?

ISTR we cleaned up some similar items recently, where the old revs had ended up stored with incorrect compression flags, which led to them being loaded incorrectly... we might have more of such. :(

This may be the same issue faced by enciclopedia.us.es, regarding compressed revisions.

Might be related:
This diff says there are 2950 intermediate revisions, but there are none in the history:
http://de.wikipedia.org/w/index.php?title=Benutzer_Diskussion%3ADickbauch&diff=25461813&oldid=18692073&uselang=en
http://de.wikipedia.org/w/index.php?title=Benutzer_Diskussion:Dickbauch&offset=20061230130858&limit=2&action=history&uselang=en

This diff doesn't say there are any intermediate revisions, however one revision is from 2006 and another one from 2004 - there are at least hundreds of intermediate revisions.
http://de.wikipedia.org/w/index.php?title=Benutzer_Diskussion:Dickbauch&diff=next&oldid=18692073&uselang=en

Also see this current village pump discussion:
http://en.wikipedia.org/w/index.php?title=Wikipedia:Village_pump_(technical)&oldid=322294276#Lost_page_histories

Do you need *all* the examples of this issue to be reported here, or can a database query or some other process be used to fix all the places where this occurred, like what happened at bug 19990?

There are also a few more instances at:
http://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)/Archive_66

in the sections entitled "Lost page histories" (described above), "Missing revision content on Magic Knight Rayearth", and "revision history oddities".

Also see this current discussion:

http://en.wikipedia.org/w/index.php?title=Wikipedia:Village_pump_(technical)&oldid=324595859
Under the section "Blank revisions, tracking them".

asmarin wrote:

I have a similar problem with my site http://enciclopedia.us.es.

I ran CompressOld.php over my database, and it had a bug as of release 1.14.x.

If you look at changes made before compressOld was applied, nothing is shown.

It works on 1.14.x, but not on versions > 1.15. See http://encicloold.us.es.

A patch was never released.

Another example is here, in the edits before the cut and paste move:
http://en.wikipedia.org/w/index.php?title=Eurydice_(mythology)&action=history

I won't history merge it yet, to avoid what happened in bug 19990.

All edits from 2005, besides the first and the last ones from that year, are blank in this page history:
http://en.wikipedia.org/w/index.php?title=Causes_of_sexual_orientation&action=history

This is not what I thought it was. It is a bug in recompressTracked.php. I am looking at it now. It should be recoverable.

OK I've checked a lot of these test cases, and they all seem to be the same, so I'm changing the summary. All of the relevant revisions should now be serving errors instead of pretending to be blank.

The original version of compressOld.php concatenated several revisions into one "blob" and stored it in a random row in the old table. Then the other old rows which needed data from the concatenated blob would get a pointer object, called a HistoryBlobStub. This pointer object gave an old_id and content hash which located the text for that revision.
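As a rough illustration of the layout described above, here is a minimal Python sketch of a concatenated blob and the stub pointers that refer into it. It is not the MediaWiki implementation: the class names `ConcatenatedBlob` and `BlobStub`, the `load_text` helper, and the use of MD5 as the content hash are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ConcatenatedBlob:
    """Several revision texts stored together, keyed by content hash."""
    items: dict = field(default_factory=dict)

    def add(self, text: str) -> str:
        # Hash choice is illustrative only.
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        self.items[digest] = text
        return digest


@dataclass
class BlobStub:
    """Pointer left in a row whose text now lives inside a blob."""
    old_id: int        # row that holds the concatenated blob
    content_hash: str  # which item inside that blob


def load_text(old_table: dict, row_id: int) -> str:
    """Return the revision text for a row, following a stub if present."""
    value = old_table[row_id]
    if isinstance(value, BlobStub):
        blob = old_table[value.old_id]
        return blob.items[value.content_hash]
    return value


# Three revisions compressed into one blob, stored in (hypothetical) row 2.
blob = ConcatenatedBlob()
hashes = [blob.add(t) for t in ("rev one", "rev two", "rev three")]
old_table = {
    1: BlobStub(old_id=2, content_hash=hashes[0]),
    2: blob,
    3: BlobStub(old_id=2, content_hash=hashes[2]),
    4: "an uncompressed revision",
}
```

The point of the sketch is that a stub row carries no text of its own: if the row named by `old_id` changes or disappears, every stub that refers to it stops resolving.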

After we started using external storage (ES), all the bulk data was moved out of the core database. Now, to load a HistoryBlobStub, MW would first load the old_id where the concatenated text used to be, where it would find a second pointer (with old_flags=external), then it would follow the second pointer to load the blob from ES. This was an inefficient situation, so I introduced a new pointer type (the "two-part CGZ URL") which pointed directly from the rows where the stub objects used to be, into ES.
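The difference between the two lookup paths can be sketched as follows. This is a simplified Python model, not MediaWiki code: the `DB://<cluster>/<blob id>/<item id>` URL layout, the table shapes, and every function name here are assumptions for illustration.

```python
def parse_es_url(url):
    """Split 'DB://<cluster>/<blob id>[/<item id>]' into parts (layout assumed)."""
    parts = url.split("/")
    cluster, blob_id = parts[2], parts[3]
    item_id = parts[4] if len(parts) > 4 else None
    return cluster, blob_id, item_id


# External storage, modelled as {(cluster, blob_id): {item_id: text}}.
external_store = {("cluster1", "42"): {"a1": "rev one", "b2": "rev two"}}


def fetch_from_es(url):
    """Fetch one item out of a concatenated blob in external storage."""
    cluster, blob_id, item_id = parse_es_url(url)
    return external_store[(cluster, blob_id)][item_id]


# Before: stub row -> old blob row -> ES (two reads from the core table).
before = {
    10: ("stub", 11, "a1"),               # (kind, old_id, item hash)
    11: ("external", "DB://cluster1/42"),  # second pointer, into ES
}

# After: the row points straight into ES with a two-part URL.
after = {10: ("external", "DB://cluster1/42/a1")}


def load_before(row_id):
    _, old_id, item = before[row_id]   # first core-table read
    _, url = before[old_id]            # second core-table read
    return fetch_from_es(url + "/" + item)


def load_after(row_id):
    _, url = after[row_id]             # single core-table read
    return fetch_from_es(url)
```

Both paths return the same text; the direct form simply drops the intermediate row from the lookup.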

I then wrote a script called resolveStubs.php, and ran it, removing all HistoryBlobStub objects from the database. Or at least, that's what I thought I did. It transpires that these missing revisions above are all HistoryBlobStub objects that somehow escaped resolveStubs.php.

The current generation of recompression script, trackBlobs/recompressTracked, has no appropriate handling for HistoryBlobStub. It leaves the HistoryBlobStub objects in place, but removes the CGZ objects they point to, creating a broken pointer.
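To make the failure mode concrete, here is a small Python simulation of a recompression pass that repoints the rows it understands and then deletes the old blobs, while leaving stub rows untouched. It is a deliberately simplified model under stated assumptions: the row shapes, store keys, and function names are hypothetical, and the real trackBlobs/recompressTracked logic is considerably more involved.

```python
def recompress(text_table, old_store, new_store):
    """Move blobs to new_store, repoint rows, then drop the old blobs.

    Deliberately ignores 'stub' rows, to mirror the bug described above.
    """
    moved = {}
    for row_id, (kind, target) in list(text_table.items()):
        if kind == "external":
            new_key = "new:" + target
            new_store[new_key] = old_store[target]
            moved[target] = new_key
            text_table[row_id] = ("external", new_key)
        # kind == "stub": no handling, the row keeps its old target
    for key in moved:
        del old_store[key]


def load(text_table, old_store, new_store, row_id):
    """Resolve a row's pointer against whichever store still has it."""
    _, target = text_table[row_id]
    if target in new_store:
        return new_store[target]
    return old_store[target]  # raises KeyError if the blob was removed


old_store = {"cgz:7": "text of an early 2005 revision"}
new_store = {}
text_table = {
    1: ("external", "cgz:7"),  # tracked and repointed by the pass
    2: ("stub", "cgz:7"),      # escaped stub: still points at the old blob
}

recompress(text_table, old_store, new_store)
```

After the pass, row 1 loads from the new store, while row 2 still names `cgz:7`, which no longer exists anywhere.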

Due to a bug in Revision.php, the broken pointer was displayed as a blank page instead of an error message. This is fixed in r62119.
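The difference between "pretending to be blank" and "serving an error" can be shown with a short Python sketch. This is not the actual change made in r62119; the function names and the `MissingTextError` class are hypothetical and only illustrate the general pattern.

```python
class MissingTextError(Exception):
    """Raised when a text pointer cannot be resolved."""


def get_text_lenient(store, pointer):
    """Buggy pattern: a failed lookup looks exactly like an empty page."""
    return store.get(pointer, "")


def get_text_strict(store, pointer):
    """Safer pattern: a failed lookup is reported as an error."""
    try:
        return store[pointer]
    except KeyError:
        raise MissingTextError(f"no text found for pointer {pointer!r}") from None


store = {"cgz:1": "some revision text", "cgz:2": ""}
```

With the lenient version, a genuinely empty revision (`cgz:2`) and a broken pointer (`cgz:3`) both come back as an empty string, so corruption can go unnoticed for years. The strict version keeps the two cases distinct.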

Luckily I was fairly paranoid when I wrote trackBlobs/recompressTracked, and all the data required for recovery appears to have been retained. It's just a matter of writing a bug fix script.

Thanks Tim for looking into this. I've added some text about this bug to:
http://en.wikipedia.org/wiki/MediaWiki:Missing-article

It'd be confusing to have this error message pop up when someone is checking the history of a page. Since I had to read through your explanation twice to understand, I hope that "database glitch" is OK for now as a layman's explanation.

That doesn't explain the existence of wrong ConcatenatedGzipHistoryBlob objects (the serialized mItems length doesn't match the real one).
Perhaps they were indeed different issues :S

Report different issues on a separate bug report please.

All the test cases on the English Wikipedia should be fixed now:

  • 1.3 million revisions were broken by this bug and are now fixed
  • 177 revisions were unrecoverable due to being damaged by a previous compression script some years ago, while cluster4 and cluster5 were current.
  • 333 revisions were unrecoverable due to the text row being missing, probably due to a bug in the original 2005 compression script.

The fix script still needs to be run on the other wikis, so this bug has to stay open for now.

Are you going to provide a list of the unrecoverable revisions?

They're not really relevant to this bug. Maybe they are listed on some other bug report already.

Does this error message at the plasma page have anything to do with bug 20,757, or the fix for it:
http://en.wikipedia.org/w/index.php?title=Plasma&oldid=9752546

I undid a braindead history merge from "plasma" to "plasma physics", before the script was run in the English Wikipedia. Since the history merge tangled many edits together from January 2005, I wonder if my machinations at the plasma and plasma physics pages in January 2010 caused something to break.

I'm fairly sure that the above revision was visible before I untangled the history at plasma physics.

So Tim ran the fixup script on all other wikis on Feb 27th and none of them were affected. I don't know if there is anything else that needs to be checked before this bug is closed, though.

The bug is almost resolved, then. I'm still curious about the problem with the plasma article that I described in comment 29; it turns out to affect all edits from 12:22, 28 January 2005 (UTC) to 00:00, 16 April 2005 (UTC). I'd like to know (a) whether it is a result of this bug and (b) whether the affected revisions are recoverable.

(In reply to comment #31)

According to the fixup script, those revisions are unrecoverable.

I had a look at a few random revisions (9752546, 11243046, 11397897) from the time period you mentioned. The text pointer for these revisions goes to a single location in cluster5, with the same id and itemid. I seem to be able to retrieve something from there manually, plugging the pointer into ExternalStore::fetchFromUrl(), but it's one text item, not a concatenated set of texts. I can't say if your history unmerge had anything to do with it.

Ariel, can you check 44320111 from bug 8689 against that list?

Perhaps the list of unrecoverable revisions could be added to the ticket or something? That would help match any other cases we find against this problem and help find issues that are something other than this problem.

EN.WP.ST47 wrote:

It seems that between Tim and Ariel the repair scripts have been run and all test cases are fixed except the most recent one, which refers to bug 8689. However, that bug has been resolved, and the referenced revision text appears to be available. Marking this as fixed?
