
Fix Dumps - errors exporting good revisions
Closed, Resolved · Public

Description

In T376926#10238695 we found lots of errors coming from the XML content dumps in the last couple of months. Some wikis run to completion while others slow down a lot and struggle. Most importantly, data corruption is suspected, as good revisions seem to fail to export even after multiple retries.

This task will look into why, and has three deliverables:

  1. fix for the problem
  2. a guide for Data Platform Engineers looking into similar issues. How would one debug this issue? MW is complex, so we'll stick to this very specific issue and hope the "similar" part emerges.
  3. better collaboration with MW folks in general. Huge kudos to @pmiazga for offering support and looking over our shoulder here.

Event Timeline


Looking in dumpsgen@snapshot1010:/var/log/dumps/fulldumps_04_520430.log.gz, the errors are definitely also from other wikis like itwiki, so the count I mentioned might be too large for just frwiki. But here are some revisions manually picked out by looking for chunks like these (sometimes we see Generic error, sometimes Received text is unplausible):

2024-10-11 13:15:35: frwiki (ID 2464140) 1011 pages (0.0|1.4/sec all|curr), 185000 revs (2.6|1.3/sec all|curr), 97.9%|97.9% prefetched (all|curr), ETA 2027-06-17 15:24:41 [max 219348239]
Rebooting getText infrastructure failed (DB is set and has not been closed by the Load Balancer) Trying to continue anyways
getting/checking text tt:331925 failed (Received text is unplausible for id tt:331925) for revision 331925

... list is: 331925,346198,346206,296457,346221

# so then check if they're bad blobs:
mwscript-k8s -- maintenance/findBadBlobs.php --wiki=frwiki --revisions 331925,346198,346206,296457,346221

And logs say:

Scanning 5 ids
	- Scanned a batch of 5 revisions
Found 0 bad revisions.

So these are ids we should be able to repro the problem with. Now to figure out how I can minimally recreate this code path.

quick update: unfortunately I couldn't figure this out in time and I have a conference to attend this week. I'll try and keep checking but realistically I only have time again next Monday.

FYI: some more context that might be useful is these errors when they were safe to ignore: T315902

At a glance, Received text is unplausible is based on a size check (it compares the bytes attribute of the XML dump, which in the end is the content_size DB field, to the actual text length), and findBadBlobs doesn't check for that. It seems fragile in any case: content_size is in bogo-bytes; it's not meant for actual size checks, AIUI.
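To make the fragility concrete, here is a minimal sketch in Python of that kind of size-based plausibility check. All names are illustrative; this is not the actual MediaWiki code, which is PHP.

```python
from typing import Optional

def is_plausible(text: Optional[str], expected_bytes: int) -> bool:
    # Sketch of a size-based plausibility check: compare the stub's
    # bytes attribute (ultimately the content_size DB field) to the
    # fetched text's length. Illustrative only.
    if text is None:
        return False
    # content_size is in "bogo-bytes", so comparing it to a real byte
    # count is inherently fragile, which is exactly the concern above.
    return len(text.encode("utf-8")) == expected_bytes
```

Note how any multi-byte character makes the comparison fail even for a perfectly good revision, which is one way an encoding quirk could masquerade as "unplausible" text.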

They look like normal wikitext revisions for what it's worth:

mysql:research@dbstore1009.eqiad.wmnet [frwiki]> select content_size, model_name, role_name from content join slots on slot_content_id = content_id join revision on slot_revision_id = rev_id join content_models on model_id = content_model join slot_roles on role_id = slot_role_id where rev_id in (331925,346198,346206,296457,346221);
+--------------+------------+-----------+
| content_size | model_name | role_name |
+--------------+------------+-----------+
|        12265 | wikitext   | main      |
|         2905 | wikitext   | main      |
|         1881 | wikitext   | main      |
|         1925 | wikitext   | main      |
|         1946 | wikitext   | main      |
+--------------+------------+-----------+

so probably some sort of character encoding issue?

@BTullis Let's pause the production of the dumps given that they are stuck in a retry loop that's putting load on the production servers.

Change #1083136 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Pause the XML/SQL dumps due to potential data quality issues

https://gerrit.wikimedia.org/r/1083136

I have now created this patch: Pause the XML/SQL dumps due to potential data quality issues
This will prevent the XML/SQL Dumps from running on the 1st and 20th of each month.

The wikidata entities and other miscellaneous dumps will not be affected.

As for the currently running dumps, I will find any remaining processes from the dumps that started on October 1st.
I believe that the affected wikis will be: frwiki and wikidatawiki.
These two dumps will remain incomplete.

The dumps that started on October 20th are all complete, with the exception of commonswiki but that will likely finish soon.
These partial dumps on the 20th don't suffer the same issues as those that start on the 1st of the month.

Change #1083136 merged by Btullis:

[operations/puppet@production] Pause the XML/SQL dumps due to potential data quality issues

https://gerrit.wikimedia.org/r/1083136

Copy-pasting relevant info here to help find the root cause:

Ok, so I have to point out that I found a lot of the familiar Rebooting getText infrastructure failed (DB is set and has not been closed by the Load Balancer) Trying to continue anyways errors. We're used to seeing these for bad/corrupt revisions. But now we're seeing very large numbers for good revisions.

I was asked to take a look at this part. Looking at the code, it is clear that the issue is a symptom of "failed" revisions and fixing that removes this part too.

There are more details at T377594#10243166 that present evidence of some of these revisions being good revisions.

Yup, that's why I put the failed in scare quotes. The issue just shows up when you have a different problem and causes red herrings.

But on "db is not closed": I think it's just that MW, for a really long time now, hasn't let any layer above rdbms actually close the DB connection or say "pick another database". The whole $this->rotateDb(); call is outdated and should go away. It's adding pressure on the databases for no real good reason (maybe it was useful a long time ago, but definitely not anymore).

@aaron has a patch that removes rotateDb() over at https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1078491. @Ladsgroup would you mind reviewing that patch?

That patch needed a manual rebase when you commented; that's now fixed. I reviewed it and it can go in soon. Dumps 1.0 breaks a lot of layering in MW. If a replica is giving you "bad" data, in almost all cases you can't get a better one from another replica.

back from conference, on this as my top priority. Thanks so much all for thoughts above.

I investigated (detailed log here https://www.mediawiki.org/wiki/SQL/XML_Dumps/Debugging_PHP)

I spoke with Amir, and I found two problems with the way the dumps scripts are currently written. Basically, what's going on is that two big jobs are happening in parallel. One is a big migration whereby the text table is being cleaned up and moved to the content table. The other is dumps running on all the wikis. It seems that dumps is somehow using the old addresses that the migration is deleting. So when dumps goes to actually get the blob at the address, the blob is gone and the address has been erased from the DB. At the rate this is happening, hundreds of thousands of times, Amir thinks it can't possibly be a replication / lock / lag issue. It has to be that dumps is somehow systematically saving the old addresses. I'm very fuzzy on how stubs and Dumps 1's two-pass approach work; I don't know whether "prefetch" or stubs have those addresses. But the main point is that somehow dumps is using an address that is dead by the time it tries it. And then it retries over and over with the same address.
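The suspected race can be sketched abstractly. All names here are hypothetical; this is an illustration of the failure mode, not the actual code paths.

```python
# Hypothetical sketch of the suspected race between the two-pass dump
# and the text-table migration (all names illustrative).
blob_store = {"old/addr/1": "wikitext of rev 331925"}

def first_pass_stub(rev_id):
    # Pass 1 records the content address current at stub time.
    return {"rev_id": rev_id, "address": "old/addr/1"}

def migrate():
    # The migration then moves the blob and deletes the old address.
    blob_store["new/addr/9"] = blob_store.pop("old/addr/1")

def second_pass_fetch(stub):
    # Pass 2 fetches by the (now stale) address from the stub; every
    # retry reuses the same dead address, so it can never succeed.
    return blob_store.get(stub["address"])

stub = first_pass_stub(331925)
migrate()
assert second_pass_fetch(stub) is None  # stale address: permanent failure
```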

  1. getTextDb seems to us to break the abstractions that MW is comfortable with. It looks up an address when it should be using a revision id. If it used the revision id and ran into issues because of the ongoing migration, then by the time it retried those issues would presumably be resolved and the revision would be mapped to a good content address.
  2. Some code paths are not considered, either in MW core or in dumps. Manually executing this line in my debug setup, I get null as the result, as in, $text is null. But the check one frame up the stack is for === false. Since null === false is itself false, this falls through. It's possible that the function being called should never return null, or that all of its callers should be found and the === false checks changed.
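The second problem is a strict-comparison pitfall. As an analogy only (the real code is PHP), the same shape of bug in Python looks like this:

```python
# Python analogy for the PHP pitfall described above: the callee
# returns null (None) on an unconsidered code path, but the caller's
# error check only tests for the False sentinel. Since null === false
# is false in PHP, the error falls straight through unhandled.
def get_text(address):
    # Hypothetical stand-in: this code path yields None, not False.
    return None

text = get_text("es:DB://cluster31/2024259")
handled = False
if text is False:  # the only error check in the caller
    handled = True
assert text is None and not handled  # the None result slipped past
```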

If we fix these issues, in theory re-running Dumps would just work. So depending on how quickly we can fix them, maybe that's the best option. Less good options are trying to understand where those old addresses are coming from and finding ways for dumps to not use them. For example, if stubs are the source, maybe we can run without stubs.

(ping @EBernhardson because rumor is you're going to work on this)

It has to be that dumps is somehow systematically saving the old addresses. I'm very fuzzy on how stubs and Dumps 1's two-pass approach work, I don't know whether "prefetch" or "stub"s have those addresses.

Yes, that is how dumps work. The first pass creates the stub files. Stub files have all the metadata except the actual wikitext. Using the failed frwiki 20241001 full dump as an example, one such stub file is frwiki-20241001-stub-articles6.xml.gz. If we open it up, we find the following:

% tail -n 20 frwiki-20241001-stub-articles6.xml
  <page>
    <title>Anabantiformes</title>
    <ns>0</ns>
    <id>16502304</id>
    <revision>
      <id>219083900</id>
      <timestamp>2024-10-01T15:20:02Z</timestamp>
      <contributor>
        <username>Givet</username>
        <id>114686</id>
      </contributor>
      <comment>ébauche</comment>
      <origin>219083900</origin>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text bytes="1789" sha1="pogi4px3jvnq2axlifr9csmuk9rfa60" location="es:DB://cluster31/2024259?flags=utf-8,gzip" id="44423a2f2f636c757374657233312f32303234323539" />
      <sha1>pogi4px3jvnq2axlifr9csmuk9rfa60</sha1>
    </revision>
  </page>
</mediawiki>

The second pass will then take these stubs and enrich them with the wikitext. As we can see, there is definitely an internal location and id which presumably point to the wikitext:

<text bytes="1789" sha1="pogi4px3jvnq2axlifr9csmuk9rfa60" location="es:DB://cluster31/2024259?flags=utf-8,gzip" id="44423a2f2f636c757374657233312f32303234323539" />

On the second pass, dumps will try to fetch that, and replace the <text> node with it.
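The second-pass enrichment can be sketched like so. This is illustrative Python (the real logic lives in dumpTextPass.php), and fetch_blob is a hypothetical stand-in for the external-store client.

```python
import xml.etree.ElementTree as ET

# Sketch of the second pass: parse a stub <revision>, fetch the
# wikitext by the recorded location, and fill in the empty <text/>
# node. Illustrative only; not the actual MediaWiki implementation.
STUB = (
    '<revision><id>219083900</id>'
    '<text bytes="1789" location="es:DB://cluster31/2024259?flags=utf-8,gzip" />'
    '</revision>'
)

def fetch_blob(location):
    # Hypothetical stand-in for the external-store fetch that resolves
    # es:DB:// addresses against the external storage clusters.
    return "the revision wikitext"

rev = ET.fromstring(STUB)
text_node = rev.find("text")
text_node.text = fetch_blob(text_node.get("location"))
assert text_node.text == "the revision wikitext"
```

If the location recorded at stub time has since been migrated away, this fetch is the step that fails, and retrying with the same stale location cannot help.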

The stubs job for frwiki was successful early in the month, at 2024-10-01 18:50:57, as per the dump page. Thus, presumably, the addresses changed between then and the time we picked up the pages-meta-history job (aka the full revision dump).

makes perfect sense, thanks @xcollazo. So even if we erase stubs, any kind of two-pass algorithm will always have a race condition with the database migration. I see two possible solutions right now:

  • change the retry logic to look up the revision id based on the address from the stub
  • change the stub logic to write the revision id instead of the address

I took a look over this; the above all seems reasonable. The database side shouldn't be too bad: we can look up the content address via the slots table joined to the content table. We will need the slot primary key, the (slot_revision_id, slot_role_id) pair, to do the lookup; these can be stored in the <text/> tag instead of the address stored today. There is a minor question of how often to look up; the most reliable approach would probably be to repeat the lookup on each retry to load the text content.
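The refresh-on-retry idea can be sketched as follows. This is illustrative Python, not the actual patch (which lives in TextPassDumper); run_query and fetch_blob are hypothetical stand-ins for a DB handle and the external-store client, and the SQL is an assumed shape of the slots/content join described above.

```python
# Hypothetical sketch of refresh-on-retry: instead of reusing the
# stale address from the stub, re-resolve the content address from
# the slots/content tables on each failed fetch.
REFRESH_SQL = """
SELECT content_address
  FROM slots JOIN content ON slot_content_id = content_id
 WHERE slot_revision_id = %s AND slot_role_id = %s
"""

def fetch_text(rev_id, role_id, address, run_query, fetch_blob, max_tries=3):
    for attempt in range(max_tries):
        if attempt > 0:
            # The key change: re-resolve the content address on every
            # retry, so a migration that moved the blob between the
            # stub pass and now no longer dooms each retry.
            address = run_query(REFRESH_SQL, (rev_id, role_id))
        text = fetch_blob(address)
        if text is not None:
            return text
    return None

# Simulated retry: the first fetch uses a stale address and fails; the
# refreshed lookup returns the new address and succeeds.
store = {"new/addr": "wikitext"}
assert fetch_text(
    331925, 1, "old/addr",
    run_query=lambda sql, params: "new/addr",
    fetch_blob=store.get,
) == "wikitext"
```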

Change #1085634 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/core@master] TextPassDumper: refresh content address on failure

https://gerrit.wikimedia.org/r/1085634

@pmiazga could you please take a look at Erik's patch? I'll try and set up a test on real data on Monday

To reproduce, I checked mwmaint2002 to see which wikis were currently migrating, saw that enwikibooks was early-ish in its migration, and ran a dump manually on snapshot1012 via:

sudo -u dumpsgen /usr/bin/php7.4 \
    /srv/mediawiki/multiversion/MWScript.php \
    dumpBackup.php \
    --wiki=enwikibooks \
    --current \
    --stub \
    --report 1000 \
    --output gzip:/mnt/dumpsdata/xmldatadumps/temp/e/ebernhardson-test/enwikibooks-20251105-stub-current.xml.gz

This resulted in an XML file with both good and bad text locations, as the recorded locations were being actively migrated. I ran the dump through P70950 to remove all the pages that had already been migrated, to speed up reproduction. The stub file prepared this way is then run through the following to reproduce the errors:

sudo -u dumpsgen /usr/bin/php7.4 \
        /srv/mediawiki/multiversion/MWScript.php \
        dumpTextPass.php \
        --wiki=enwikibooks \
        --stub=gzip:/home/ebernhardson/enwikibooks-filtered.xml.gz \
        --dbgroupdefault=dump \
        --report=1000 \
        --spawn=/usr/bin/php7.4 \
        --output=bzip2:/dev/null \
        --full | tee enwikibooks.log

I've then re-run it after applying the patch to snapshot1012 to verify it fixes the identified issue.

For next steps this needs to get through code review, and we will need to deploy through the normal means (backport or otherwise).

Change #1087592 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/core@wmf/1.44.0-wmf.1] TextPassDumper: refresh content address on failure

https://gerrit.wikimedia.org/r/1087592

Change #1087593 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/core@wmf/1.44.0-wmf.2] TextPassDumper: refresh content address on failure

https://gerrit.wikimedia.org/r/1087593

Change #1085634 merged by jenkins-bot:

[mediawiki/core@master] TextPassDumper: refresh content address on failure

https://gerrit.wikimedia.org/r/1085634

Change #1087598 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] Revert "Pause the XML/SQL dumps due to potential data quality issues"

https://gerrit.wikimedia.org/r/1087598

Change #1087592 merged by jenkins-bot:

[mediawiki/core@wmf/1.44.0-wmf.1] TextPassDumper: refresh content address on failure

https://gerrit.wikimedia.org/r/1087592

Change #1087593 merged by jenkins-bot:

[mediawiki/core@wmf/1.44.0-wmf.2] TextPassDumper: refresh content address on failure

https://gerrit.wikimedia.org/r/1087593

Mentioned in SAL (#wikimedia-operations) [2024-11-05T23:53:42Z] <ebernhardson@deploy2002> Started scap sync-world: Backport for [[gerrit:1087592|TextPassDumper: refresh content address on failure (T377594)]], [[gerrit:rCICF1087593a763e|TextPassDumper: refresh content address on failure (T377594)]]

Mentioned in SAL (#wikimedia-operations) [2024-11-05T23:56:37Z] <ebernhardson@deploy2002> ebernhardson: Backport for [[gerrit:1087592|TextPassDumper: refresh content address on failure (T377594)]], [[gerrit:rCICF1087593a763e|TextPassDumper: refresh content address on failure (T377594)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-11-06T00:02:31Z] <ebernhardson@deploy2002> Finished scap sync-world: Backport for [[gerrit:1087592|TextPassDumper: refresh content address on failure (T377594)]], [[gerrit:rCICF1087593a763e|TextPassDumper: refresh content address on failure (T377594)]] (duration: 08m 48s)

Change #1087598 merged by Ryan Kemper:

[operations/puppet@production] Resume XML/SQL dumps now that data qual fixed

https://gerrit.wikimedia.org/r/1087598

Mentioned in SAL (#wikimedia-operations) [2024-11-06T00:21:29Z] <ryankemper> T377594 Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/1087598; ran puppet on snapshot101[0-7]*. These dumps should be re-enabled now

The new code is running right now and hasn't completely blown up. I'm going to call this a success and deal with any new issues as they come. Thank you very much to everyone who chimed in and helped, and big ups to Erik for the great patch.

We consider the issue / ticket resolved. The dumps pipeline has been resumed as of November 8th.
The November dumps (October data) are running with the fix on a schedule delayed by about a week. Most dumps are complete with wikidatawiki still being in progress (Nov 14th). Logs show that the fix is working well and we believe the new dumps to be accurate.
Dumps should be provided on the regular cadence going forward.