Truncated XML
Closed, Resolved (Public)

Description

Hello folks at Wikimedia,
I'm sorry to report that some of your latest meta history dumps (both 7zip and bzip2) are truncated at the end; the tail of the XML follows.

Keep up the great work 💪
Enrico

</page>
<page>
  <title>Progetto:Biografie/Attività/Restauratori</title>
  <ns>102</ns>
  <id>3340028</id>
  <revision>
    <id>38358168</id>
    <timestamp>2011-02-06T20:57:20Z</timestamp>
    <contributor>
      <username>Biobot</username>
      <id>124123</id>
    </contributor>
    <minor />
    <comment>Biobot [[Utente:Biobot#6.6|6.6]]</comment>
    <model>wikitext</model>
    <format>text/x-wiki</format>
    <text xml:space="preserve" />
    <sha1>rhk3cnaawxhi00lyn4mmpbibmfy756m</sha1>
  </revision>

Event Timeline

Several files from the latest itwiki full run are affected. Listing them here:

-rw-r--r-- 1 dumpsgen dumpsgen 970M Dec  5 17:31 itwiki-20181201-pages-meta-history5.xml-p3147943p3289692.bz2.inprog
-rw-r--r-- 1 dumpsgen dumpsgen 1.2G Dec  5 19:07 itwiki-20181201-pages-meta-history5.xml-p3289693p3409278.bz2.inprog
-rw-r--r-- 1 dumpsgen dumpsgen 925M Dec  5 20:24 itwiki-20181201-pages-meta-history5.xml-p3409279p3562778.bz2.inprog
-rw-r--r-- 1 dumpsgen dumpsgen 816M Dec  5 21:45 itwiki-20181201-pages-meta-history5.xml-p3562779p3709334.bz2.inprog
-rw-r--r-- 1 dumpsgen dumpsgen 1.3G Dec  5 23:16 itwiki-20181201-pages-meta-history5.xml-p3709335p3864178.bz2.inprog
-rw-r--r-- 1 dumpsgen dumpsgen 906M Dec  6 00:33 itwiki-20181201-pages-meta-history5.xml-p3864179p4035534.bz2.inprog

I will check into this.

Two more files, from metawiki:

./metawiki/20181201/metawiki-20181201-pages-meta-history1.xml-p78478p303065.bz2.inprog
./metawiki/20181201/metawiki-20181201-pages-meta-history1.xml-p2p78477.bz2.inprog

Everything else from the 20181201 run is good, as well as from the current run.

It looks like the inprog files are complete. I'll double check and then move them to permanent locations and update the status files for these two wikis. Then the files should be available for download later today or at worst tomorrow.

The bz2 files are updated, along with the status files; they should be available for download in a few hours. The md5sums and sha1s are not yet fixed up, nor are the 7z files. That will happen later today or early tomorrow.

The 7z files are in the process of being regenerated; this will take several hours at least.

Good to know, thanks 👍

By the way, the current full English dump seems to be free of this bug, but the previous months' dumps were affected as well. On the other hand, only the current Italian dump seems to be affected.

Enrico

Hash sum files have been updated. Everything should be available for download by sometime tomorrow.

Actually, itwiki-20181201-pages-meta-history5.xml-p3147943p3289692.7z doesn't match its published checksum. The SHA1 I get is

5a684433a2e6c561f96fefcb480b8ee3318dd140

while it's reported as being:

6340dc5bdbdf1a111b70636ebf170c86d24ae2ed
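
For anyone who wants to repeat this kind of comparison locally, here is a minimal sketch in Python. The helper names sha1_of_file and matches_published are made up for illustration and are not part of the dumps tooling; only the standard hashlib module is used.

```python
import hashlib


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA1 digest of a file, reading it in chunks
    so that multi-gigabyte dump files do not have to fit in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_hex):
    """Compare the computed digest with a published one.
    (Hypothetical helper; the published value is passed in by the caller.)"""
    return sha1_of_file(path) == published_hex.strip().lower()
```

Calling matches_published() with the path of the downloaded 7z file and the hash from the published sha1sums file performs the comparison described above.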

Change 481893 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/dumps@master] Check for truncated file content in certain circumstances

https://gerrit.wikimedia.org/r/481893

The corrected status files, while generated on Dec 27th, were not rsynced out from the primary host to the web server and labs nfs server, because they were not from a current run. I have pushed them out manually. There is a patch to our rsync procedures which will address this, in the process of streamlining something else: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/481303/

Apologies for the delay!

Found the bug at last! In jobs.py, command_completion_callback(), the very first line,

if not proc.exited_successfully():

is wrong. It should be

if not series.exited_successfully():

The bug was introduced in Iaa6f140f14f3e92d0832132d86ec825885a80a91 when temp output files were added. Since then we've been saved because output files have always been found to be truncated when there's an error. These files weren't; a full bz2 block and bz2 footer were written before the file was closed.

pylint never caught that error, which is a bit special.

With this in mind I can deploy a much shorter fix for the issue and then think about whether to check last lines of compressed xml files separately.
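
To illustrate what checking the last lines of a compressed XML file could look like, here is a rough sketch in Python. It is not the code from the dumps repository: the function name ends_with_closing_tag and its parameters are invented for this example, and the closing tag is passed in as an assumption (MediaWiki XML dumps normally end with </mediawiki>). It covers the case described above, where the bz2 stream itself is valid but the XML inside stops early.

```python
import bz2


def ends_with_closing_tag(path, closing_tag=b"</mediawiki>", tail_bytes=4096):
    """Return True if the decompressed content of a bz2 file ends with
    closing_tag (ignoring trailing whitespace), False otherwise.

    Illustrative only: this decompresses the whole file, which is slow
    for large dumps, and keeps just the last tail_bytes bytes around.
    """
    tail = b""
    try:
        with bz2.open(path, "rb") as stream:
            for chunk in iter(lambda: stream.read(1024 * 1024), b""):
                tail = (tail + chunk)[-tail_bytes:]
    except (OSError, EOFError):
        # The bz2 stream itself is damaged or cut off.
        return False
    return tail.rstrip().endswith(closing_tag)
```

A file whose bz2 stream is intact but whose XML stops after a </revision> tag, like the sample in the description, would fail this check even though it decompresses without errors.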

Change 482042 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/dumps@master] Fix a long-standing bug that allowed some incomplete output files to sneak in

https://gerrit.wikimedia.org/r/482042

Change 482042 merged by ArielGlenn:
[operations/dumps@master] Fix a long-standing bug that allowed some incomplete output files to sneak in

https://gerrit.wikimedia.org/r/482042

This is now deployed, which takes care of the problem in the short term; new runs of the dumps will pick up this change.

Change 482293 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/dumps@master] file truncation checks also now optionally check for last xml tag

https://gerrit.wikimedia.org/r/482293

I will abandon the earlier change https://gerrit.wikimedia.org/r/481893 if the current patch works as I expect, since the new patch is much cleaner.

The December dump now runs smoothly, thanks a lot for the fix 👍 I'll let you know if anything comes up with the new ones.

Wish you a happy new year,
Enrico

Change 482293 merged by ArielGlenn:
[operations/dumps@master] file truncation checks also now optionally check for last xml tag

https://gerrit.wikimedia.org/r/482293

Change 481893 abandoned by ArielGlenn:
Check for truncated file content in certain circumstances

Reason:
Abandoned in favor of https://gerrit.wikimedia.org/r/#/c/operations/dumps/+/482293/

https://gerrit.wikimedia.org/r/481893

ArielGlenn claimed this task.

Having merged everything I wanted to merge, and abandoned everything I wanted to abandon, I'm closing this. If you see any issues on future dumps, please do report them, and thanks!