
Text id verification makes dumps skip many good rows
Closed, Resolved, Public

Description

A bunch of recoverable errors were reported at https://groups.google.com/a/wikimedia.org/g/ops-dumps/c/efjbIJHS--Q/m/LekH2tawAgAJ. A sample:

*** Wiki: wikidatawiki
=====================
[20240512132606]: Skipping bad text id 44423a2f2f636c757374657232392f3739303236383533 of revision 2148925565

[20240512132620]: Skipping bad text id 44423a2f2f636c757374657232392f3739303435343333 of revision 2148962571

[20240512133014]: Skipping bad text id 44423a2f2f636c757374657232382f3739303137383638 of revision 2148962836

[20240512133027]: Skipping bad text id 44423a2f2f636c757374657232392f3739303336323538 of revision 2148944452

[20240512133121]: Skipping bad text id 44423a2f2f636c757374657232392f3739303339343635 of revision 2148950677

[20240512133833]: Skipping bad text id 44423a2f2f636c757374657232392f3739303339373430 of revision 2148951238

[20240512134223]: Skipping bad text id 44423a2f2f636c757374657232382f3739303035343633 of revision 2148937955

[20240512134256]: Skipping bad text id 44423a2f2f636c757374657232382f3739303033373434 of revision 2148934567

I have not seen these errors before, and they coincide with https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1020948 being merged, which is part of T362566.

Perusing the code, the message is coming from:

...
	protected function getText( $id, $model = null, $format = null, $expSize = null ) {
		if ( !$this->isValidTextId( $id ) ) {
			$msg = "Skipping bad text id " . $id . " of revision " . $this->thisRev;
			$this->progress( $msg );
			return '';
		}
...

and isValidTextId() is defined as:

	private function isValidTextId( $id ) {
		if ( preg_match( '/:/', $id ) ) {
			return $id !== 'tt:0';
		} elseif ( preg_match( '/^\d+$/', $id ) ) {
			return intval( $id ) > 0;
		}

		return false;
	}
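
For illustration, here is a minimal sketch (in Python, just mirroring the check above) of why the long hex ids in the log above fail this validation: they contain no literal colon and are not purely decimal digits, so both branches miss and the function returns false. The numeric id below is only an illustrative value.

import re

def is_valid_text_id(text_id: str) -> bool:
    # Python rendering of the PHP check above, for illustration only.
    if re.search(r":", text_id):
        return text_id != "tt:0"
    elif re.fullmatch(r"\d+", text_id):
        return int(text_id) > 0
    return False

print(is_valid_text_id("12345"))   # True  (legacy numeric text id, illustrative value)
print(is_valid_text_id("tt:0"))    # False (explicitly rejected)
print(is_valid_text_id("44423a2f2f636c757374657232392f3739303236383533"))  # False: no colon, not numeric, so the revision text gets skipped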

We should get rid of that part and bump the XML version as Daniel suggested. Wanna do it?

I'm afraid I am not versed in MediaWiki development. I just happen to be the guy that inherited Dumps 1.0 via a series of unfortunate events.

I am happy to bump the XML version and get rid of the text id as part of the Dumps 2.0 work though (Dumps 2.0 is a Hadoop/Flink/Spark tech stack, which I am very comfortable with).

We all have had such misfortunes! Fixing the issue in MediaWiki is rather easy: we just need to drop that piece of code, bump the version number in several places, update the schema validation, and update the tests. Here is an example: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/464768

I'd be more than happy to review or help getting it done.

Event Timeline

xcollazo renamed this task from Remove text ids from XML dump schema to Text id verification makes dumps miss many good rows. May 22 2024, 2:02 PM
xcollazo renamed this task from Text id verification makes dumps miss many good rows to Text id verification makes dumps skip many good rows.

Any news on this? Will the next dump process be correct?

The submitted dump code seems reasonable, but it uses getNativeData in the writeText method, which has been deprecated for a long time.

Meanwhile, there are changes that landed on 8 May (which is the date of the revisions that have empty text) that hard-deprecate several functions, among them getNativeData.

Not sure this is relevant; hard deprecation should emit a warning and not break the function (shouldn't it?), but it may be worth a look.

Hypothesis: the effective class of the content has changed and the wrong call is generated (getNativeContent instead of $content->serialize( $contentFormat ), or vice versa...)

Any news on this? Will the next dump process be correct?

Next full dump starts in 4 days. It is unlikely we will have a fix by then, but this bug is being discussed.

Making the fix isn't that hard, I can backport it if you manage to get it done and land it.

@Ladsgroup it might be fastest if you make the fix, I don't think we are likely to have the bandwidth or MW dev experience available. Is that feasible?

Actually, as an update @Ladsgroup, @dr0ptp4kt is looking into the issue.

I scheduled a meeting tomorrow afternoon UTC+2 time with @Ladsgroup @daniel @xcollazo @Milimetric to troubleshoot.

This has been discussed in some other places, but let's defer the dump until next week and be aware of downstream jobs. That way we have Friday and Monday to hopefully roll a fix and verify that it doesn't introduce any other surprise bugs, and hopefully can get everything else enqueued following a fix. @WDoranWMF I think you have this covered, but wanted to make sure I wrote it down here as well.

Dan, Xabriel, and I looked at the code, and the hunch is to mask the symptom / react to the original change in TextPassDumper.php (and double-check whether any other modules depend on the things in TextPassDumper.php); this kind of mirrors the comment in T364250#9792093. But we'll want to discuss together as a group.

I'm still trying to wrap my head around where these giant IDs are coming from. Looking at the relevant database row for one of the revisions in question, I got:

MariaDB [acewiki_p]> select * from slots join content on content_id = slot_content_id where slot_revision_id = 150761\G
*************************** 1. row ***************************
slot_revision_id: 150761
    slot_role_id: 1
 slot_content_id: 147080
     slot_origin: 150761
      content_id: 147080
    content_size: 105744
    content_sha1: km57g051c661etp7zellc7wvi9nttnw
   content_model: 1
 content_address: es:DB://cluster30/1?flags=utf-8,gzip
1 row in set (0.002 sec)

That looks fine at a glance... What am I missing?

The giant hex strings are encoded ES addresses:

php > $s = hex2bin( "44423a2f2f636c757374657232392f3739303339373430" );
php > print "$s\n";
DB://cluster29/79039740

This is an ES address, but it doesn't have the "es" prefix, and it doesn't have all the "query parameters" I would expect from the new entries.
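
For completeness, a small sketch (Python, equivalent to the hex2bin call above) decoding other sample ids from the task description shows they are all ES addresses in the same bare form:

# Decode more of the sample "bad text ids" from the task description.
samples = [
    "44423a2f2f636c757374657232392f3739303236383533",
    "44423a2f2f636c757374657232382f3739303137383638",
]
for s in samples:
    print(bytes.fromhex(s).decode("utf-8"))
# DB://cluster29/79026853
# DB://cluster28/79017868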

In summary of a conversation we just had: this would just not be a problem if we used dump schema version 0.11. To my surprise, I learned that we are still using 0.10 in production, which still relies on text table IDs. We designed version 0.11 in 2019 with support for content addresses, to avoid the issue we are now encountering.

Change #1037809 had a related patch set uploaded (by Milimetric; author: Milimetric):

[mediawiki/core@master] Hack the location attribute to patch v10 dumps

https://gerrit.wikimedia.org/r/1037809

The next steps are:

  • @Milimetric connecting with @xcollazo to run dump on a server against simplewiki using the schema version 11 as a command line flag
  • @Milimetric preparing a DRAFT patch in case that doesn't work, as a hack to work around the issue.

If the dump on a server against simplewiki using the schema version 11 as a command line flag does NOT work, then the hack will undergo code review and deployment.

If the simplewiki dump DOES work, the plan would be to change the dump-invoking scripts for production to use the version of 11 (via Puppet repo), OR apply the change to the $wgXmlDumpSchemaVersion variable so that it applies universally. @daniel would prefer application to the $wg variable, but the one thing to note there is this would also then apply to Special:Export...whereas if the change happens at the dump-invoking scripts (via Puppet) only, then Special:Export would be unaffected.

Based on a small local test on Daniel's machine, it looks like the actual XML does not bear any deletions of elements or major restructuring of the XML tree shape; the additions are:

  • new attributes within an existing XML tag
  • introduction of slot content - this is a material change but does not change the nesting level of the XML tree

But, the main slot still exists where it always does. This means that (most likely, unless we're missing something) most downstream jobs would probably work okay even if the schema is bumped to version 11. However, there is a risk that code generation tools using XML schema may be dependent on the precise expression of the attributes or permissible XML elements (i.e., new attributes and new XML element types may not conform to the generated code).

Now, if a downstream job is dependent on the schema version expressed in the header of a dump, well, it will inherently be affected. For those with access to Turnilo, the volume of Special:Export access is visible at https://w.wiki/AFLu .

The instructions for running maintenance scripts are here:

https://wikitech.wikimedia.org/wiki/Maintenance_server

This can probably be done at mwdebug1001.eqiad.wmnet .

A message will need to go out to the xmldumps and wikitech-l lists to notify of the upcoming change. @WDoranWMF we'll want to get something drafted - we'll need to connect with you later on this if we don't just go ahead and send it. Those of us in Americas will be connecting later today to get this prepared, and more updates will happen on ticket. We can put the draft message here on ticket so that it is clear what will be sent.

First, let's confirm that regardless of V10 or V11, a 'single' pass dump via dumpBackup.php is happy. Let's use simplewiki:

ssh mwmaint1002.eqiad.wmnet


pwd
/home/xcollazo/dumps/simplewiki

cat pagelist.txt 
1_(number)
2_(number)
3_(number)

mwscript dumpBackup.php \
  --wiki=simplewiki \
  --current \
  --schema-version 0.10 \
  --pagelist pagelist.txt \
  > output-010.xml

mwscript dumpBackup.php \
  --wiki=simplewiki \
  --current \
  --schema-version 0.11 \
  --pagelist pagelist.txt \
  > output-011.xml


colordiff output-010.xml output-011.xml 
1c1
< <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
---
> <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.11/ http://www.mediawiki.org/xml/export-0.11.xsd" version="0.11" xml:lang="en">
43a44
>       <origin>9441215</origin>
46c47
<       <text bytes="4635" xml:space="preserve">{{Infobox number
---
>       <text bytes="4635" sha1="facd748ca5lha58nlgje5nwvfnbxk2l" xml:space="preserve">{{Infobox number
136a138
>       <origin>9538115</origin>
139c141
<       <text bytes="2839" xml:space="preserve">{{otherusesof|2|2 (disambiguation)}}
---
>       <text bytes="2839" sha1="cxf2wti1hsisxlr1oikxuh83v3pf4wk" xml:space="preserve">{{otherusesof|2|2 (disambiguation)}}
219a222
>       <origin>9097931</origin>
222c225
<       <text bytes="2266" xml:space="preserve">{{Infobox number
---
>       <text bytes="2266" sha1="6nh3w6kxx4ytpsbzg7u7zmltckml3it" xml:space="preserve">{{Infobox number

simplewiki does not have any slots other than MAIN, and indeed we see little difference in the diff above. However, unexpectedly to me, there is a new <origin> tag whose value is equal to the revision <id>. I'm not sure what the intention is with that tag, but it is allowed as per export-0.11.xsd.

Let's now try a 2-pass dump, mimicking what actually happens in our monthly full and current dumps, and using a user-submitted offending example from T365501#9819042 on frwiktionary:

First, let's repro the issue with V10:

pwd
/home/xcollazo/dumps/frwiktionary

cat pagelist.txt
Module:lexique

We do stubs first:

mwscript dumpBackup.php \
  --wiki=frwiktionary \
  --current \
  --schema-version 0.10 \
  --pagelist pagelist.txt \
  --stub \
  > output-010.xml

Then do 2nd pass:

cat output-010.xml | mwscript dumpTextPass.php --wiki=frwiktionary > output-010-2nd-pass.xml

Skipping bad text id 44423a2f2f636c757374657233312f3632303834 of revision 34652150
      204645
      
      
2024-05-31 17:14:06: frwiktionary (ID 18958) 1 pages (185.8|185.8/sec all|curr), 1 revs (185.8|185.8/sec all|curr), ETA 2024-06-02 21:29:29 [max 34955301]


xcollazo@mwmaint1002:~/dumps/frwiktionary$ colordiff output-010.xml output-010-2nd-pass.xml 
73c73
<       <text bytes="11714" id="44423a2f2f636c757374657233312f3632303834" />
---
>       <text bytes="11714" />
77c77
< </mediawiki>
---
> </mediawiki>
\ No newline at end of file

This indeed repros the issue at T365501#9819042.

Now let's try with V11:

mwscript dumpBackup.php \
  --wiki=frwiktionary \
  --current \
  --schema-version 0.11 \
  --pagelist pagelist.txt \
  --stub \
  > output-011.xml

cat output-011.xml | mwscript dumpTextPass.php --wiki=frwiktionary --schema-version 0.11 > output-011-2nd-pass.xml
2024-05-31 17:21:27: frwiktionary (ID 14561) 1 pages (102.4|102.4/sec all|curr), 1 revs (102.4|102.4/sec all|curr), ETA 2024-06-04 16:08:34 [max 34955333]

colordiff output-011.xml output-011-2nd-pass.xml 
74c74,435
<       <text bytes="11714" sha1="jdvyl05peuuv5jzhtkv1j3xsg1xxn2v" location="es:DB://cluster31/62084?flags=utf-8,gzip" id="44423a2f2f636c757374657233312f3632303834" />
---
>       <text bytes="11714" sha1="jdvyl05peuuv5jzhtkv1j3xsg1xxn2v" xml:space="preserve">local m_bases = require(&quot;Module:bases&quot;)
> local m_langs = require(&quot;Module:langues&quot;)
> local m_table = require(&quot;Module:table&quot;)
> local m_string = require(&quot;Module:string&quot;)
> local m_params = require(&quot;Module:paramètres&quot;)
> 
> local tree = mw.loadData(&quot;Module:lexique/data&quot;)
> 
...

So indeed moving to V11 fixes the issue. Notice further that the location and id attributes that we had discussed in our last sync are removed by this second pass, and that is indeed the behavior that we wanted, considering these are internal and not needed for public dumps.

I still want to try a couple of pages from commonswiki as that is, AFAIK, the only wiki with multiple slots. Will do so in a separate comment.

Ok now let's try commonswiki, considering that this wiki can contain multiple slots.

First, figure out some pages that contain multiple slots as I presume not all do. Using presto:

presto:wmf_raw> select * from mediawiki_slot_roles where wiki_db='commonswiki' and snapshot='2024-04';
 role_id | role_name | snapshot |   wiki_db   
---------+-----------+----------+-------------
       1 | main      | 2024-04  | commonswiki 
       2 | mediainfo | 2024-04  | commonswiki 
(2 rows)


presto:wmf_raw> select * from mediawiki_slots where slot_role_id = 2 and wiki_db='commonswiki' and snapshot='2024-04' limit 3;
 slot_revision_id | slot_role_id | slot_content_id | slot_origin | snapshot |   wiki_db   
------------------+--------------+-----------------+-------------+----------+-------------
        460623414 |            2 |       457018038 |   460623414 | 2024-04  | commonswiki 
        460623415 |            2 |       457018041 |   460623415 | 2024-04  | commonswiki 
        460623416 |            2 |       457018036 |   460623416 | 2024-04  | commonswiki 
(3 rows)


presto:wmf_raw> select rev_id, rev_page from mediawiki_revision where rev_id IN (460623414,460623415,460623416) and wiki_db='commonswiki' and snapshot='2024-04';
  rev_id   | rev_page 
-----------+----------
 460623414 | 76615328 
 460623416 | 18966948 
(2 rows)

presto:wmf_raw> select page_id, page_title, page_namespace from mediawiki_page where page_id IN (76615328, 18966948) and wiki_db='commonswiki' and snapshot='2024-04';
 page_id  |                page_title                 | page_namespace 
----------+-------------------------------------------+----------------
 76615328 | Парк_Революции,_г._Ростов-на-Дону._35.jpg |              6 
 18966948 | Lehi_AR_2012-04_009.jpg                   |              6 
(2 rows)

All right, let's use those two, back on mwmaint1002.eqiad.wmnet:

pwd
/home/xcollazo/dumps/commonswiki

cat pagelist.txt 
File:Парк_Революции,_г._Ростов-на-Дону._35.jpg
File:Lehi_AR_2012-04_009.jpg

mwscript dumpBackup.php \
  --wiki=commonswiki \
  --current \
  --schema-version 0.11 \
  --pagelist pagelist.txt \
  --stub \
  > output-011.xml


cat output-011.xml | mwscript dumpTextPass.php --wiki=commonswiki --schema-version 0.11 > output-011-2nd-pass.xml

redacted diff output below (full diff at P63783):

colordiff output-011.xml output-011-2nd-pass.xml 
58c58
<       <minor/>
---
>       <minor />
63c63,76
<       <text bytes="355" sha1="8gw0pd8r1guw81l9k1x9sxbrcdawksh" location="es:DB://cluster31/1032792?flags=utf-8,gzip" id="44423a2f2f636c757374657233312f31303332373932" />
---
>       <text bytes="355" sha1="8gw0pd8r1guw81l9k1x9sxbrcdawksh" xml:space="preserve">=={{int:filedesc}}==
> {{Information
> |description={{ru|1=Парк Революции, г. Ростов-на-Дону.}}
> |date=2015-07-10 10:11:16
> |source={{own}}
...
69c82
<         <text bytes="5508" sha1="6ccqqgkzy5zztzn96qonwx3xi2l1pxf" location="tt:725974142" />
---
>         <text bytes="5508" sha1="6ccqqgkzy5zztzn96qonwx3xi2l1pxf" xml:space="preserve">{&quot;type&quot;:&quot;mediainfo&quot;,&quot;id&quot;:&quot;M76615328&quot;,   ...    </text>
90c103,117
<       <text bytes="293" sha1="gf5hzrkhsx15xvxy4l3a7tk8je5osjz" location="tt:353128526" id="353128526" />
---
>       <text bytes="293" sha1="gf5hzrkhsx15xvxy4l3a7tk8je5osjz" xml:space="preserve">== {{int:filedesc}} ==
> {{Information
> |Description=Lehi, Arkansas.
> 
> |Source={{own}}
...
96c123
<         <text bytes="2719" sha1="qpt69mnztoskcgnkctp83g80ccqp5b0" location="tt:705418432" />
---
>         <text bytes="2719" sha1="qpt69mnztoskcgnkctp83g80ccqp5b0" xml:space="preserve">{&quot;type&quot;:&quot;mediainfo&quot;,&quot;id&quot;    .....   </text>
101c128
< </mediawiki>
---
> </mediawiki>

The diff appears correct to me, with both the MAIN slot <text> being resolved under the <revision> tag, and the MEDIAINFO slot being resolved under the <content><text> tags.

Full output of output-011.xml at P63784, and full output of output-011-2nd-pass.xml at P63786 for further perusal by other folks, but I think we are good 🎉 .

@xcollazo There's another case that may be worth testing:

When we use dumpTextPass to generate a full dump of all revisions, we feed it not only the stub dump as input, but also the previous full dump using the --prefetch option. Assuming the new stub dump and the old full dump are in the same order, we can use the old full dump to get the text of nearly all the old revisions, and we also have to hit the database for revisions added after the previous full dump was generated.

So, what happens if we have a full dump in 0.10 and a stub dump in 0.11, and feed these two to dumpTextPass? It should work, but does it? I'm particularly unsure about old revisions that have slots. They will not be found in the "prefetch" dump. Which by itself is not a problem, we can get them from the database - but that may cause the two streams to get out of whack, so we are then trying to load everything from the database, which would be waaaayyyyy slow.
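
To make the prefetch idea concrete, here is an illustrative Python sketch (not the actual TextPassDumper logic; fill_texts and fetch_from_db are hypothetical names) of the merge: both streams are walked in the same (page, revision) order, text is reused from the old full dump when the revision is present there with content, and the database is hit only for revisions the prefetch stream cannot supply.

def fill_texts(stub_revisions, prefetch_revisions, fetch_from_db):
    # Both iterators yield ((page_id, rev_id), text) in the same sort order;
    # in the stub stream the text is always None, since stubs carry no content.
    prefetch = iter(prefetch_revisions)
    cached_key, cached_text = next(prefetch, (None, None))
    for key, _ in stub_revisions:
        # Advance the prefetch stream until it catches up with the stub stream.
        while cached_key is not None and cached_key < key:
            cached_key, cached_text = next(prefetch, (None, None))
        if cached_key == key and cached_text:
            yield key, cached_text           # reuse text from the previous full dump
        else:
            yield key, fetch_from_db(key)    # new (or empty) revision: go to the database

The while loop is what keeps the two streams aligned; if the ordering assumption breaks, every lookup falls through to fetch_from_db, which is the slow path described above.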

Here's the verbiage for the email.

To: wikitech-l
(then cross post to xmldatadumps-l, analytics)

Subject: (Possible breaking change) XML pages-articles dumps bug with missing revision text for some records; fix in progress with schema change

As described on Phabricator a bug [1] surfaced whereby the "pages-articles" XML dumps on https://dumps.wikimedia.org/ bear incomplete records.

A possible fix has been identified, and it involves bumping the dump schema version from version 0.10 to version 0.11 [2], which could be a breaking change for some.

MORE DETAILS:

Due to the bug that surfaced, a nontrivial number of <text> nodes representing article text show up as empty, in a fashion like so:

<text bytes="123456789" />

A potential fix in T365155 [3] has been identified. Assuming further testing looks good, XML dumps will be kicked off again starting next week in order to restore the missing records as soon as possible. It will take a while for new dumps to be generated as it is a compute intensive operation. More progress will be reported at T365155 and new dumps will eventually show up on dumps.wikimedia.org .

Although a number of pipelines may not notice the change associated with the schema bump, if your dump ingestion tooling or use of Special:Export relies on the specific shape of the XML at version 0.10 (e.g., because of code generation tools), please examine the differences between version 0.10 and version 0.11. One notable addition in version 0.11 is the addition of MCR [4] fields.

Thank you for your patience while this issue is resolved.

-Adam

[1]
https://phabricator.wikimedia.org/T365501

[2]
https://www.mediawiki.org/xml/export-0.10.xsd

and

https://www.mediawiki.org/xml/export-0.11.xsd

Schema version 0.11 has existed in MediaWiki for over 6 years, but Wikimedia wikis have been using version 0.10.

[3]
https://phabricator.wikimedia.org/T365155#9851025

and

https://phabricator.wikimedia.org/T365155#9851160

[4]
https://www.mediawiki.org/wiki/Multi-Content_Revisions

@xcollazo Quick update: I tested this locally, with maintenance/run TextPassDumper --stub=file:cache/stub-dump-11.xml --prefetch=file:cache/full-dump-10.xml > cache/new-dump-11.xml.

I edited the cache/full-dump-10.xml to have some empty <text/> tags, to emulate the problem we are facing in production.

The good news is that it seems to work perfectly! Here's an excerpt from a diff:

--- cache/full-dump-10.xml      2024-05-31 20:54:52.928952601 +0200
+++ cache/new-dump-11.xml       2024-05-31 20:55:01.652701098 +0200
@@ -1,4 +1,4 @@
-<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
+<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.11/ http://www.mediawiki.org/xml/export-0.11.xsd" version="0.11" xml:lang="en">
   <siteinfo>
     <sitename>MyWiki</sitename>
     <dbname>my_wiki</dbname>
@@ -40,9 +40,10 @@
         <id>1</id>
       </contributor>
       <comment>yooohooo!</comment>
+      <origin>1</origin>
       <model>wikitext</model>
       <format>text/x-wiki</format>
-      <text bytes="9" xml:space="preserve"/>
+      <text bytes="9" sha1="1m32kpiapvnk93sg42tg7z4zbc28a83" xml:space="preserve">yooohooo!</text>
       <sha1>1m32kpiapvnk93sg42tg7z4zbc28a83</sha1>
     </revision>
     <revision>
@@ -54,9 +55,10 @@
         <id>1</id>
       </contributor>
       <comment>[[File:Example.png|thumb]]</comment>
+      <origin>2</origin>
       <model>wikitext</model>
       <format>text/x-wiki</format>
-      <text bytes="37" xml:space="preserve">yooohooo!
+      <text bytes="37" sha1="43bgjyj9g9mw5082ugkywaq5b6471v5" xml:space="preserve">yooohooo!

 [[File:Example.png|thumb]]</text>

A few things to note:

  • the missing text of revision 1 got backfilled from the database.
  • we are adding sha1 attributes to the <text> element, and we are adding <origin> elements to the <revision> element. That will increase the dump size somewhat.

The bad news is: the next run may take significantly longer than usual, since it has to fetch all the missing text from the database.

Change #1037845 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Temporarily disable XML dumps on snapshot hosts

https://gerrit.wikimedia.org/r/1037845

Mentioned in SAL (#wikimedia-analytics) [2024-05-31T19:35:30Z] <btullis> dumpsgen@clouddumps1002:/srv/dumps/xmldatadumps/public$ find . -maxdepth 2 -wholename '*/20240520' -exec rm -rf {} \; (disabling tomorrow's XML dumps for T365155)

Change #1037845 merged by Btullis:

[operations/puppet@production] Temporarily disable XML dumps on snapshot hosts

https://gerrit.wikimedia.org/r/1037845

...
The bad news is: the next run may take significantly longer than usual, since it has to fetch all the missing text from the database.

The next full dump would have normally run on 20240601. Considering the last full dump from 20240501 did finish before all this happened, we should be good to prefetch from it. The next ‘partial’ dump however, i.e. just the current revision of all pages, will struggle, as we deleted the 20240520 run from the nodes that generate the dumps since it was faulty. That one typically only takes a couple of days, so if it takes double we should still be ok.

As per T365155#9850367, seems like next steps are:

If the simplewiki dump DOES work, the plan would be to change the dump-invoking scripts for production to use the version of 11 (via Puppet repo), OR apply the change to the $wgXmlDumpSchemaVersion variable so that it applies universally. @daniel would prefer application to the $wg variable, but the one thing to note there is this would also then apply to Special:Export...whereas if the change happens at the dump-invoking scripts (via Puppet) only, then Special:Export would be unaffected.

I vote for setting the $wgXmlDumpSchemaVersion so that it applies universally. One consistent way to dump all around.

I would also like to take this opportunity to upgrade the three remaining snapshot hosts that are still running buster to bullseye, for T325228: Migrate Dumps Snapshot hosts from Buster to Bullseye

Those three are:

  • snapshot1012 - English Wikipedia dumps
  • snapshot1010 - all other xml/sql dumps - plus the monitor and html page writer
  • snapshot1013 - currently unused

I can do snapshot1013 straight away, but I think it best that I don't do the other two without agreement from @xcollazo / @daniel (in case I am missing anything).
snapshot1011, which does the wikidata xml/sql dump, has already been upgraded to bullseye, so we know that the dump scripts work on the newer version.

I can do snapshot1013 straight away, but I think it best that I don't do the other two without agreement from @xcollazo / @daniel (in case I am missing anything).

I'm not aware of any reason not to do it, but that doesn't mean much.

I would also like to take this opportunity to upgrade the three remaining snapshot hosts that are still running buster to bullseye, for T325228: Migrate Dumps Snapshot hosts from Buster to Bullseye

Now or never! Go for it!

I would also like to take this opportunity to upgrade the three remaining snapshot hosts that are still running buster to bullseye, for T325228: Migrate Dumps Snapshot hosts from Buster to Bullseye

Now or never! Go for it!

Thanks. That's under way now.

I vote for setting the $wgXmlDumpSchemaVersion so that it applies universally. One consistent way to dump all around.

I would prefer the `$wgXmlDumpSchemaVersion` approach as well.

Change #1038392 had a related patch set uploaded (by Dr0ptp4kt; author: Dr0ptp4kt):

[operations/mediawiki-config@master] Bump XML dump schema to version 0.11

https://gerrit.wikimedia.org/r/1038392

Change #1037809 abandoned by Milimetric:

[mediawiki/core@master] Hack the location attribute to patch v10 dumps

Reason:

decided to upgrade dumps to v11

https://gerrit.wikimedia.org/r/1037809

Did a cursory check on the dumps code to make sure we were not overriding the XML XSD version.

The Python scripts that control the process are not overriding it.

Back in PHP land, we are pulling from the global variable for the first pass here and for the second pass here.

So AFAICT, once we change the default via https://gerrit.wikimedia.org/r/1038392, the XML dump infrastructure will pick it up and do the right thing.

Change #1038392 merged by jenkins-bot:

[operations/mediawiki-config@master] Bump XML dump schema to version 0.11

https://gerrit.wikimedia.org/r/1038392

Mentioned in SAL (#wikimedia-operations) [2024-06-05T16:34:05Z] <ladsgroup@deploy1002> Started scap: Backport for [[gerrit:rGERRIT103839223877|Bump XML dump schema to version 0.11 (T365155)]]

Mentioned in SAL (#wikimedia-operations) [2024-06-05T16:36:34Z] <ladsgroup@deploy1002> ladsgroup and dr0ptp4kt: Backport for [[gerrit:rGERRIT103839223877|Bump XML dump schema to version 0.11 (T365155)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-06-05T16:52:28Z] <ladsgroup@deploy1002> Finished scap: Backport for [[gerrit:rGERRIT103839223877|Bump XML dump schema to version 0.11 (T365155)]] (duration: 18m 23s)

Change #1038845 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Revert "Temporarily disable XML dumps on snapshot hosts"

https://gerrit.wikimedia.org/r/1038845

Change #1038845 merged by Btullis:

[operations/puppet@production] Revert "Temporarily disable XML dumps on snapshot hosts"

https://gerrit.wikimedia.org/r/1038845

I have reverted the patch that temporarily disabled the dumps and deployed to all four affected snapshot hosts.

The timer will start at 20:05 UTC today, as per:

btullis@cumin1002:~$ sudo cumin 'snapshot10[10-13].eqiad.wmnet' 'systemctl show fulldumps-rest.timer |grep next_elapse'
4 hosts will be targeted:
snapshot[1010-1013].eqiad.wmnet
OK to proceed on 4 hosts? Enter the number of affected hosts to confirm or "q" to quit: 4
===== NODE GROUP =====                                                                                                                                                                                             
(4) snapshot[1010-1013].eqiad.wmnet                                                                                                                                                                                
----- OUTPUT of 'systemctl show f...grep next_elapse' -----                                                                                                                                                        
TimersCalendar={ OnCalendar=*-*-01..14 08,20:05:00 ; next_elapse=Wed 2024-06-05 20:05:00 UTC }

Most XML Dumps have started. Some of them, such as simplewiki, are already doing the first pass. It will be a while till any of them starts the second pass where we'd know if we are doing well or not.

The wikidatawiki dump appears stalled, or it died on "First-pass for page XML data dumps". https://dumps.wikimedia.org/wikidatawiki/20240601/.
Also, it normally dumps tables such as "A few statistics such as the page count" (site_stats.sql) before processing stub-meta. https://dumps.wikimedia.org/wikidatawiki/20240501/

Update: The dump is running. The stub-meta finished 2024-06-08 16:09:16. For some reason the progress page was not updated in a timely manner.

Thanks for the update @Bamyers99.

The heavyweights enwiki, wikidatawiki and commonswiki all appear to be making progress. Minor self-recoverable exceptions have happened, but nothing out of the ordinary.

Some of the smaller wikis have finished.

An example: simplewiktionary:
https://dumps.wikimedia.org/simplewiktionary/20240601/ output appears correct when manually perusing some of the files. Files appear to be ~8% bigger all around when compared to https://dumps.wikimedia.org/simplewiktionary/20240501/

Most wikis have now finished. Only these 7 remain: ruwiki, dewiki, enwiki, commonswiki, frwiki, wikidatawiki, and zhwiki.

All 7 are making progress, but enwiki is struggling on a couple of files. I'm monitoring that issue now.

Had to kill enwiki dump to recover from the stalling issue:

hostname -f
snapshot1012.eqiad.wmnet

sudo -u dumpsgen bash

python3 dumpadmin.py --kill --configfile /etc/dumps/confs/wikidump.conf.dumps:en --wiki enwiki --dryrun
would kill processes ['2528923', '2528934', '2528935', '2528947']

python3 dumpadmin.py --kill --configfile /etc/dumps/confs/wikidump.conf.dumps:en --wiki enwiki
python3 dumpadmin.py --kill --configfile /etc/dumps/confs/wikidump.conf.dumps:en --wiki enwiki --dryrun
would kill processes []

Activity picked up right away and appears to be making progress:

python3 dumpadmin.py --kill --configfile /etc/dumps/confs/wikidump.conf.dumps:en --wiki enwiki --dryrun
would kill processes ['2789246', '2789247', '2789248', '2789249', '2789250', '2789251', '2789252', '2789253', '2789254', '2789255', '2789256', '2789257', '2789258', '2789259', '2789260', '2789261', '2789263', '2789264', '2789265', '2789269', '2789272', '2789274', '2789276', '2789279', '2789281', '2789284', '2789460', '2789596', '2789610', '2789628', '2789633', '2789636', '2789714', '2789721', '2789724', '2789727', '2789730', '2789733', '2789736', '2789739', '2789742', '2789750', '2789756', '2789760', '2789763', '2789766', '2789790', '2789814', '2789834', '2789840', '2789864', '2789867']

Now commonswiki is acting up with a bunch of:

[20240613032540]: Rebooting getText infrastructure failed (DB is set and has not been closed by the Load Balancer) Trying to continue anyways
Rebooting getText infrastructure failed (DB is set and has not been closed by the Load Balancer) Trying to continue anyways
Rebooting getText infrastructure failed (DB is set and has not been closed by the Load Balancer) Trying to continue anyways
Rebooting getText infrastructure failed (DB is set and has not been closed by the Load Balancer) Trying to continue anyways
...

Killing current processes so that it can recover:

hostname -f
snapshot1010.eqiad.wmnet

sudo -u dumpsgen bash

cd /srv/deployment/dumps/dumps/xmldumps-backup


python3 dumpadmin.py --kill --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --wiki commonswiki --dryrun
would kill processes ['2998421']

python3 dumpadmin.py --kill --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --wiki commonswiki

ruwiki, dewiki, frwiki, zhwiki have actually finished successfully as per logs. Example from zhwiki:

dumpsgen@snapshot1010:/mnt/dumpsdata/xmldatadumps/private$ tail -n 2 zhwiki/20240601/dumplog.txt 
2024-06-11 20:44:18: zhwiki Removing old rss feed /mnt/dumpsdata/xmldatadumps/public/zhwiki/latest/zhwiki-latest-pages-meta-history1.xml-p2275p7615.7z-rss.xml for link /mnt/dumpsdata/xmldatadumps/public/zhwiki/latest/zhwiki-latest-pages-meta-history1.xml-p2275p7615.7z
2024-06-11 20:44:20: zhwiki SUCCESS: done.

But, for some reason, the rsync process has not picked it up to show recent files and success on the web UI. Will investigate.

CC @BTullis

ruwiki, dewiki, frwiki, zhwiki have actually finished successfully as per logs. Example from zhwiki:

dumpsgen@snapshot1010:/mnt/dumpsdata/xmldatadumps/private$ tail -n 2 zhwiki/20240601/dumplog.txt 
2024-06-11 20:44:18: zhwiki Removing old rss feed /mnt/dumpsdata/xmldatadumps/public/zhwiki/latest/zhwiki-latest-pages-meta-history1.xml-p2275p7615.7z-rss.xml for link /mnt/dumpsdata/xmldatadumps/public/zhwiki/latest/zhwiki-latest-pages-meta-history1.xml-p2275p7615.7z
2024-06-11 20:44:20: zhwiki SUCCESS: done.

But, for some reason, the rsync process has not picked it up to show recent files and success on the web UI. Will investigate.

CC @BTullis

Ben and I took a look at this. We confirmed that, for example, zhwiki has indeed finished successfully, and that its index.html file shows it as such.

The rsync process that copies data from Dumpsdata1006 to clouddumps1002, for some reason, has been slow to reflect these changes.

We restarted the process and will give it some time to catch up. Hopefully this self-resolves, but if not, we will take a deeper look on Monday.

commonswiki now has straggling files:

PROBLEM: commonswiki has file commonswiki/20240601/commonswiki-20240601-pages-meta-current2.xml-p20587571p22087570.bz2.inprog at least 4 hours older than lock

Thus nuking processes:

ssh snapshot1010.eqiad.wmnet

sudo -u dumpsgen bash

cd /srv/deployment/dumps/dumps/xmldumps-backup

python3 dumpadmin.py --kill --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --wiki commonswiki --dryrun
would kill processes ['3687316', '3687317', '3687318', '3687319', '3687320', '3687321', '3687347', '3687349', '3687351', '3687353', '3687358', '3687361']

python3 dumpadmin.py --kill --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --wiki commonswiki

Also deleted stale inprog files.

Activity picked up right away and appears to be making progress.

The rsync process that copies data from Dumpsdata1006 to clouddumps1002 was able to recover.

All dumps except commonswiki and wikidatawiki have now finished.

commonswiki seems to be making active progress.

wikidatawiki has stalled with the following:

PROBLEM: wikidatawiki has file wikidatawiki/20240601/wikidatawiki-20240601-pages-meta-current17.xml-p29578206p31078205.bz2.inprog at least 4 hours older than lock

So, one more time:

hostname -f
snapshot1011.eqiad.wmnet

sudo -u dumpsgen bash

cd /srv/deployment/dumps/dumps/xmldumps-backup

python3 dumpadmin.py --kill --configfile /etc/dumps/confs/wikidump.conf.dumps:wd --wiki wikidatawiki --dryrun
would kill processes ['1816676', '1816684', '1816686', '1816688', '1816690', '1816692', '1816698', '1816700']

Activity picked up right away and appears to be making progress.

wikidatawiki finished the dump of current pages (aka pages-meta-current), but appeared to be idle.

We are close to the 20th, which is the start of the next dump, and wikidatawiki had not even started the full history dump (aka pages-meta-history).

Thus moving it to a spare node as I would like us to have a full set of successful dumps after the changes from this ticket:

Halting the run on snapshot1011:

hostname -f
snapshot1011.eqiad.wmnet

sudo -u dumpsgen bash

cd /srv/deployment/dumps/dumps/xmldumps-backup

python3 dumpadmin.py --kill --configfile /etc/dumps/confs/wikidump.conf.dumps:wd --wiki wikidatawiki --dryrun
would kill processes []

python3 dumpadmin.py --kill --configfile /etc/dumps/confs/wikidump.conf.dumps:wd --wiki wikidatawiki

python3 dumpadmin.py --unlock --configfile /etc/dumps/confs/wikidump.conf.dumps:wd --wiki wikidatawiki

Re-enabling it on snapshot1014:

ssh snapshot1014.eqiad.wmnet

sudo -u dumpsgen bash

cd /srv/deployment/dumps/dumps/xmldumps-backup

screen -S wikidatawiki-20240601

bash ./worker --date 20240601 --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps:wd --wiki wikidatawiki

Does this issue affect the loading of tables like category? I imported the enwiki-20240501-pages-meta-current.xml.bz2 dump using importDump.php and the row count for category is 888 but when I check the count on Quarry against the enwiki replica, I get 2361850 rows. I'm running

SELECT COUNT(*) FROM category

Does this issue affect the loading of tables like category? I imported the enwiki-20240501-pages-meta-current.xml.bz2 dump using importDump.php and the row count for category is 888 but when I check the count on Quarry against the enwiki replica, I get 2361850 rows. I'm running

SELECT COUNT(*) FROM category

This ticket is unrelated. You can get the content of the category table for enwiki from enwiki-20240601-category.sql.gz.

Does this issue affect the loading of tables like category? I imported the enwiki-20240501-pages-meta-current.xml.bz2 dump using importDump.php and the row count for category is 888 but when I check the count on Quarry against the enwiki replica, I get 2361850 rows. I'm running

SELECT COUNT(*) FROM category

This ticket is unrelated. You can get the content of the category table for enwiki from enwiki-20240601-category.sql.gz.

Thank you, this worked.

wikidatawiki started lagging behind a bit with the following error:

PROBLEM: wikidatawiki has file wikidatawiki/20240601/wikidatawiki-20240601-pages-meta-history21.xml-p47161304p47243979.bz2.inprog at least 4 hours older than lock

I would still like us to have a full set of dumps for 20240601, thus restarting on the same spare node:

ssh snapshot1014.eqiad.wmnet

sudo -u dumpsgen bash

cd /srv/deployment/dumps/dumps/xmldumps-backup

python3 dumpadmin.py --kill --configfile /etc/dumps/confs/wikidump.conf.dumps:wd --wiki wikidatawiki --dryrun
would kill processes ['448212', '2007603', '2007605', '2007614', '2007617']

python3 dumpadmin.py --kill --configfile /etc/dumps/confs/wikidump.conf.dumps:wd --wiki wikidatawiki

cd /mnt/dumpsdata/xmldatadumps/public/wikidatawiki/20240601
rm *.inprog

rm /mnt/dumpsdata/xmldatadumps/private/wikidatawiki/lock_20240601


screen -xS wikidatawiki-20240601
cd /srv/deployment/dumps/dumps/xmldumps-backup

bash ./worker --date 20240601 --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps:wd --wiki wikidatawiki

It should finish in the next couple days.

20240601 wikidatawiki dump finally succeeded:

2024-07-05 08:39:15: wikidatawiki Reading wikidatawiki-20240601-pages-articles-multistream-index.txt.bz2 checksum for sha1 from file /mnt/dumpsdata/xmldatadumps/public/wikidatawiki/20240601/sha1sums-wikidatawiki-20240601-pages-articles-multistream-index.txt.bz2.txt
2024-07-05 08:40:23: wikidatawiki Checkdir dir /mnt/dumpsdata/xmldatadumps/public/wikidatawiki/latest ...
2024-07-05 08:40:47: wikidatawiki Checkdir dir /mnt/dumpsdata/xmldatadumps/public/wikidatawiki/latest ...
2024-07-05 08:42:00: wikidatawiki SUCCESS: done.

Files are not rsynced publicly yet, but should be in the next day or so.

We have a full set of dumps! 🎉

Hello. This is an empty text entry in wikidatawiki-20240701-pages-articles-multistream6.xml-p4469005p5969004.bz2 as downloaded from https://dumps.wikimedia.org/wikidatawiki/20240701/:

<page>
  <title>Q6157973</title>
  <ns>0</ns>
  <id>5952866</id>
  <revision>
    <id>2136476381</id>
    <parentid>2045646165</parentid>
    <timestamp>2024-04-24T18:13:17Z</timestamp>
    <contributor>
      <username>William Avery Bot</username>
      <id>2964320</id>
    </contributor>
    <comment>/* wbeditentity-update:0| */ Changing runeberg.org URLs to https (×7). See [[Wikidata:Requests_for_permissions/Bot/William_Avery_Bot_11|request for permission]]</comment>
    <origin>2136476381</origin>
    <model>wikibase-item</model>
    <format>application/json</format>
    <text bytes="90348" sha1="aimrgeqjzqp6d3oz5qc8awzr5nhi9da" />
    <sha1>aimrgeqjzqp6d3oz5qc8awzr5nhi9da</sha1>
  </revision>
</page>

Shouldn't this be already fixed with the deployed patch?

Hello. This is an empty text entry in wikidatawiki-20240701-pages-articles-multistream6.xml-p4469005p5969004.bz2 as downloaded from https://dumps.wikimedia.org/wikidatawiki/20240701/:

<page>
  <title>Q6157973</title>
  <ns>0</ns>
  <id>5952866</id>
  <revision>
    <id>2136476381</id>
    <parentid>2045646165</parentid>
    <timestamp>2024-04-24T18:13:17Z</timestamp>
    <contributor>
      <username>William Avery Bot</username>
      <id>2964320</id>
    </contributor>
    <comment>/* wbeditentity-update:0| */ Changing runeberg.org URLs to https (×7). See [[Wikidata:Requests_for_permissions/Bot/William_Avery_Bot_11|request for permission]]</comment>
    <origin>2136476381</origin>
    <model>wikibase-item</model>
    <format>application/json</format>
    <text bytes="90348" sha1="aimrgeqjzqp6d3oz5qc8awzr5nhi9da" />
    <sha1>aimrgeqjzqp6d3oz5qc8awzr5nhi9da</sha1>
  </revision>
</page>

Shouldn't this be already fixed with the deployed patch?

Here are the counts of total <text> openings in that file:

cat wikidatawiki-20240701-pages-articles-multistream6.xml-p4469005p5969004 | grep '<text ' | wc -l
 1316226

Total endings with content:

cat wikidatawiki-20240701-pages-articles-multistream6.xml-p4469005p5969004 | grep '</text>' | wc -l
 1314838

So presumably there are 1316226-1314838=1388 revisions in this file that did not make it. Sporadic missing text was, and still is, possible even after the work of this ticket. This happens because for these particular revision fetches there was a runtime failure, such as a database connection error. The dumps process retries, but after exhausting the retries it does not fail; instead it continues, as it would be too expensive to restart the whole process.
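
If it helps downstream consumers, here is a rough Python sketch (the file name is just the example above; the approach is only one of several possible) that streams a 0.11 dump and lists the revision ids whose main-slot <text> declares a nonzero byte count but carries no content:

import bz2
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.11/}"

def empty_text_revisions(path):
    # Stream the dump and yield revision ids whose main-slot <text>
    # declares bytes > 0 but has no character content.
    with bz2.open(path, "rb") as f:
        for _, elem in ET.iterparse(f, events=("end",)):
            if elem.tag != NS + "revision":
                continue
            rev_id = elem.findtext(NS + "id")
            text = elem.find(NS + "text")  # direct child only, i.e. the main slot
            if text is not None and int(text.get("bytes", "0")) > 0 and not (text.text or "").strip():
                yield rev_id
            elem.clear()  # drop the bulky revision content we have already inspected

for rev_id in empty_text_revisions("wikidatawiki-20240701-pages-articles-multistream6.xml-p4469005p5969004.bz2"):
    print(rev_id)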

Got it, thank you for the clarification!