Corruption of article dumps in the dump from 20240520 in multiple languages
Open, High · Public · BUG REPORT

Description

The size of the pages-articles.xml.bz2 dumps decreased between 20240501 and 20240520 for multiple languages, and the content of some pages is missing.

For instance:

  • Report on-wiki for frwiki (the size went down from 5.7 GB to 5.3 GB).
  • Email about dewiki on xmldatadumps-l (6.6 GB -> 6.1 GB).
  • For enwiki, the size went down from 20.6 GB to 18.0 GB.

See also another report of a page without text on dewiktionary: T365425
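A quick way to check a downloaded dump for empty pages (a sketch; the frwiki filename assumes the standard dump naming, and the pattern matches self-closing <text ... /> elements):

bzcat frwiki-20240520-pages-articles.xml.bz2 | grep -c '<text bytes="[0-9]*" />'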

Event Timeline

I can confirm the missing pages. For instance, in the French Wiktionary pages-articles dump:

bzcat dumps/fr/20240520/frwiktionary-20240520-pages-articles.xml.bz2 | grep -B 1 -A 20 -e ">Module:lexique<"
  <page>
    <title>Module:lexique</title>
    <ns>828</ns>
    <id>4185217</id>
    <revision>
      <id>34652150</id>
      <parentid>33410406</parentid>
      <timestamp>2024-05-19T11:50:10Z</timestamp>
      <contributor>
        <username>Lepticed7</username>
        <id>204645</id>
      </contributor>
      <comment>Renommage des catégories</comment>
      <model>Scribunto</model>
      <format>text/plain</format>
      <text bytes="11714" />
      <sha1>jdvyl05peuuv5jzhtkv1j3xsg1xxn2v</sha1>
    </revision>
  </page>

Here the text element is indeed empty.

This seems to happen on pages that were edited between 20240501 and 20240520. For instance, the Module:lexique page was edited in mid-May (the revision above is from the 19th), while the Module:string page, whose text is not empty, was last modified before 20240501:

  <page>
    <title>Module:string</title>
    <ns>828</ns>
    <id>3725531</id>
    <revision>
      <id>31473499</id>
      <parentid>31410121</parentid>
      <timestamp>2023-02-07T22:23:48Z</timestamp>
      <contributor>
        <username>Pamputt</username>
        <id>2901</id>
      </contributor>
      <minor />
      <comment>A protégé « [[Module:string]] » : Modèle ou module sensible ou répandu ([Modifier = Autoriser uniquement les utilisateurs autoconfirmés] (infini) [Renommer = Autoriser uniquement les utilisateurs autoconfirmés] (infini))</comment>
      <model>Scribunto</model>
      <format>text/plain</format>
      <text bytes="20195" xml:space="preserve">local m_params = require(&quot;Module:paramètres&quot;)
local str = {}

-- Cannot include null byte.
local UTF8_char = &quot;[\1-\127\194-\244][\128-\191]*&quot;
...

Also, here are some figures suggesting that the problem appeared strictly after May 1st, 2024 (on a sample of Wiktionary dumps):

for f in dumps/??/20240420/*.bz2; do echo -n $f " -> " ; bzcat $f | grep "<text bytes=" | grep "/>" | wc -l; done
dumps/ca/20240420/cawiktionary-20240420-pages-articles.xml.bz2  ->        5
dumps/el/20240420/elwiktionary-20240420-pages-articles.xml.bz2  ->       13
dumps/fi/20240420/fiwiktionary-20240420-pages-articles.xml.bz2  ->       22
dumps/ga/20240420/gawiktionary-20240420-pages-articles.xml.bz2  ->       22
dumps/nl/20240420/nlwiktionary-20240420-pages-articles.xml.bz2  ->        5
for f in dumps/??/20240501/*.bz2; do echo -n $f " -> " ; bzcat $f | grep "<text bytes=" | grep "/>" | wc -l; done
dumps/ca/20240501/cawiktionary-20240501-pages-articles.xml.bz2  ->        5
dumps/el/20240501/elwiktionary-20240501-pages-articles.xml.bz2  ->       13
dumps/fi/20240501/fiwiktionary-20240501-pages-articles.xml.bz2  ->       22
dumps/ga/20240501/gawiktionary-20240501-pages-articles.xml.bz2  ->       22
dumps/nl/20240501/nlwiktionary-20240501-pages-articles.xml.bz2  ->        5

And the 20240520 dumps have a lot more empty pages:

for f in dumps/??/20240520/*.bz2; do echo -n $f " -> " ; bzcat $f | grep "<text bytes=" | grep "/>" | wc -l; done
dumps/ca/20240520/cawiktionary-20240520-pages-articles.xml.bz2  ->     1992
dumps/el/20240520/elwiktionary-20240520-pages-articles.xml.bz2  ->     2059
dumps/fi/20240520/fiwiktionary-20240520-pages-articles.xml.bz2  ->    21771
dumps/fr/20240520/frwiktionary-20240520-pages-articles.xml.bz2  ->   199315
dumps/ga/20240520/gawiktionary-20240520-pages-articles.xml.bz2  ->       25
dumps/nl/20240520/nlwiktionary-20240520-pages-articles.xml.bz2  ->     2815

And some figures suggesting that the bug appeared on the 8th of May, around 18:42:

bzcat dumps/fr/20240520/frwiktionary-20240520-pages-articles.xml.bz2 | grep -A 12 -e "<timestamp>2024-05-" | grep -B 12 -e "<text bytes=.*/>" | grep "<timestamp>" | sort > empty-timestamps.txt

This extracts most pages with a timestamp in May, then keeps only the empty pages and extracts their timestamps. Sorting then gives the earliest timestamps corresponding to an empty entry:

head -4 empty-timestamps.txt
<timestamp>2024-05-08T18:42:42Z</timestamp>
<timestamp>2024-05-08T18:43:37Z</timestamp>
<timestamp>2024-05-08T18:44:05Z</timestamp>
<timestamp>2024-05-08T18:44:12Z</timestamp>

ruwiktionary is the same: ruwiktionary-20240501-pages-meta-current.xml was good, but ruwiktionary-20240520-pages-meta-current.xml is bad. The pattern <text bytes="\d{2,10}" /> is found 8717 times in the 20240520 file (including ns0 pages) and zero times in the 20240501 file.
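That count can be reproduced with something along these lines (a sketch; filename as above; grep -c counts matching lines, which is equivalent here because each empty <text /> element sits on its own line):

grep -cE '<text bytes="[0-9]{2,10}" />' ruwiktionary-20240520-pages-meta-current.xml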

I think the dump generation process should be stopped. All produced files with article content should be removed from the servers, and a notice should be placed on the dump download pages that the 20240520 run is a failure. It makes no sense to continue a known-erroneous process, and stopping it can prevent wwc (world-wide confusion).

@xcollazo Hi Xabriel. IMO the corruption in the latest dumps is a *HUGE* issue. I only just discovered it now when trying to figure out why some pages that should have been found were not. Why is this bug report still marked as "needs triage" instead of "major bug", and why is there no notice on dumps.wikimedia.org that the dumps are corrupted?

BTW I'm going to have to redo at least a day's worth of work because of this.

@Formatierer and @Benwing2:

Agreed this issue should be solved. My team is aware, and we are discussing getting it prioritized.

CC @VirginiaPoundstone @WDoranWMF

Report via VRTS (Ticket#2024052810009724) for dewiki:

$> bzcat dewiki-latest-pages-articles.xml.bz2 | grep -B 1 -A 20 '<title>Bodensee</title>'

You can see that the XML field <text> has no content. (Translated with DeepL.)

<page>
  <title>Bodensee</title>
  <ns>0</ns>
  <id>540</id>
  <revision>
    <id>245128636</id>
    <parentid>244981700</parentid>
    <timestamp>2024-05-20T04:41:29Z</timestamp>
    <contributor>
      <username>Hutch</username>
      <id>114381</id>
    </contributor>
    <minor />
    <comment>Abschnittlink korrigiert</comment>
    <model>wikitext</model>
    <format>text/x-wiki</format>
    <text bytes="196185" />
    <sha1>flah3qkis3fjh8s8lli2rescjrv6o0g</sha1>
  </revision>
</page>

I think the more important question is: will this be fixed before the next dump?

There should be no next dump until this is fixed. It just makes no sense.

@xcollazo Thank you for triaging this bug. However I notice that the broken dumps are still present on dumps.wikimedia.org, and there is still no indication on the site that the dumps are broken. This seems a very bad user experience and will likely reduce trust in Wikimedia in the future. I would recommend you take action ASAP (today if possible) to rectify this, either by adding a prominent notice on the site that the dumps are broken or removing/hiding the dump files entirely.

And in my opinion this bug is also a show-stopper for the corresponding MediaWiki software release, because the software is used on several platforms out in the world. If some of the admins use the dump functionality, for whatever purpose, MediaWiki will say: "Surprise, surprise, some of your data is lost, we knew this, but we didn't tell you. Take it as a funny challenge."

Has the cause of this bug been identified or not?

I don't have time to install and try things in order to debug this myself, but for what it's worth, here are my findings (mainly based on browsing and reading code).

  • The disappearing elements are the ones that were revised after May 8th.
  • By browsing the different updates, I came across this one: 1020948: SqlBlobStore: Directly store ES addresses in content table | https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1020948
    • It was merged (I don't know exactly when, but around the 8th).
    • The logic of storeBlob changed: before, all calls went through the DB-update part; now, only the calls where useExternalStore is false do.

Hypothesis: with this ES change, new revisions are stored externally only, but the dump process relies on the DB to get the text data back.

Sorry if this does not make sense; I just skimmed through the code and am really not an expert in MediaWiki (or even PHP...).
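One way to probe this hypothesis from the outside (a sketch; the revision id is the Module:lexique one from above): if the public API still returns the revision text, the data itself is intact and only the dump path is broken.

curl -s 'https://fr.wiktionary.org/w/api.php?action=query&revids=34652150&prop=revisions&rvslots=main&rvprop=content&format=json' | head -c 400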

And in my opinion this bug is also a show-stopper for the corresponding MediaWiki software release, because the software is used on several platforms out in the world. If some of the admins use the dump functionality, for whatever purpose, MediaWiki will say: "Surprise, surprise, some of your data is lost, we knew this, but we didn't tell you. Take it as a funny challenge."

I agree, no MediaWiki version should have such a breaking bug, and the dumps should really be marked as faulty or removed on dumps.wikimedia.org.

...
either by adding a prominent notice on the site that the dumps are broken or removing/hiding the dump files entirely.

Fair enough. This issue likely hit all XML dumps from 20240520, so I will delete them all.

Deleted the source data via snapshot1014.eqiad.wmnet with a script like this one: P63721.

Unfortunately, the NFS server that serves the public is separate (clouddumps1002.eqiad.wmnet), with its own rsync configuration that keeps older data even when deleted from the source, and I have no SSH access to it. I will ask an SRE for help. We should be able to reuse P63721.

CC @Gehel

Has the cause of this bug been identified or not?

See parent task T365155.

Change #1037845 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Temporarily disable XML dumps on snapshot hosts

https://gerrit.wikimedia.org/r/1037845

Change #1037845 merged by Btullis:

[operations/puppet@production] Temporarily disable XML dumps on snapshot hosts

https://gerrit.wikimedia.org/r/1037845

20240520 XML dumps should now be unavailable publicly as well. Thanks so much for your help @BTullis!

Cursory check:

$ curl https://dumps.wikimedia.org/frwiktionary/20240520/
<html>
<head><title>404 Not Found</title></head>
<body>
<center><h1>404 Not Found</h1></center>
<hr><center>nginx/1.18.0</center>
</body>
</html>

Change #1038845 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Revert "Temporarily disable XML dumps on snapshot hosts"

https://gerrit.wikimedia.org/r/1038845

Change #1038845 merged by Btullis:

[operations/puppet@production] Revert "Temporarily disable XML dumps on snapshot hosts"

https://gerrit.wikimedia.org/r/1038845

I have reverted the patch that temporarily disabled the dumps and deployed to all four affected snapshot hosts.

The timer will start at 20:05 UTC today, as per:

btullis@cumin1002:~$ sudo cumin 'snapshot10[10-13].eqiad.wmnet' 'systemctl show fulldumps-rest.timer |grep next_elapse'
4 hosts will be targeted:
snapshot[1010-1013].eqiad.wmnet
OK to proceed on 4 hosts? Enter the number of affected hosts to confirm or "q" to quit: 4
===== NODE GROUP =====
(4) snapshot[1010-1013].eqiad.wmnet
----- OUTPUT of 'systemctl show f...grep next_elapse' -----
TimersCalendar={ OnCalendar=*-*-01..14 08,20:05:00 ; next_elapse=Wed 2024-06-05 20:05:00 UTC }

sha1 is redundant now?

tail --lines=100 dewiktionary-20240601-pages-articles.xml

<page>
  <title>Kleinformats</title>
  <ns>0</ns>
  <id>1375865</id>
  <revision>
    <id>10073572</id>
    <timestamp>2024-06-05T19:33:41Z</timestamp>
    <contributor>
      <username>Udo T.</username>
      <id>91150</id>
    </contributor>
    <comment>neu (autoedit/[[Benutzer:Formatierer/checkpage FAQ|checkpage]] 3.62)</comment>
    <origin>10073572</origin>
    <model>wikitext</model>
    <format>text/x-wiki</format>
    <text bytes="376" sha1="h5vgfi6kft6nz73vuirg42e07dj8xgg" xml:space="preserve">TEXTCONTENT IS OK NOW, BUT REMOVED BY ME</text>
    <sha1>h5vgfi6kft6nz73vuirg42e07dj8xgg</sha1>
  </revision>
</page>

This is because of Multi Content Revisions.

TL;DR: A while ago, a mechanism was built so that revisions can have multiple 'slots', with the main slot being the usual wikitext, and other slots containing extra data. AFAIK, only commonswiki has implemented multiple slots. The XML schema of the dumps was updated to support dumping these extra 'slots', but for some reason we never moved to the new schema. One implication is that the revision <sha1> tag is now a "hash of hashes", and the embedded sha1 is the hash of that specific slot's content. Since most wikis do not currently use multiple slots, both fingerprints match.
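For illustration, the two fingerprints can be compared directly (a sketch; filename as in the example above; on a wiki that uses only the main slot, the two values should be identical):

tail --lines=100 dewiktionary-20240601-pages-articles.xml | grep -oE 'sha1="[a-z0-9]+"|<sha1>[a-z0-9]+</sha1>'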