
Missing content in cirrus search dump
Closed, ResolvedPublic

Description

https://www.mediawiki.org/wiki/Topic:Vsgsayf31ks8akx8

enwiki-20200727-cirrussearch-content.json.gz

Some articles I expected to find are missing, e.g. "Paper Mario" and "Sakura Wars (1996 video game)". On further examination, the JSON I have has a little over 5M entries with a "title" property, but English Wikipedia passed 6M articles in January, so that number seems off...
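The entry count can be reproduced by counting lines that carry a "title" key (a hedged sketch; the dump is one JSON object per line, and this is demonstrated on a tiny stand-in file since the real dump is ~31 GB):

```shell
# Count entries with a "title" property in a cirrussearch-style dump.
# sample.json.gz is a tiny stand-in, not the real enwiki dump.
printf '%s\n' '{"index":{"_id":"1"}}' '{"title":"Paper Mario"}' \
              '{"index":{"_id":"2"}}' '{"title":"Sakura Wars"}' \
  | gzip > sample.json.gz
zcat sample.json.gz | grep -c '"title"'   # prints 2
```

Against the real file the same pipeline would be `zcat enwiki-20200727-cirrussearch-content.json.gz | grep -c '"title"'`, which should land near the ~6M articles mentioned above if the dump is complete.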

Event Timeline

Hello, I'm the person that posted this on the other help board. Sorry for not posting it here; I wasn't sure where the right place to post was.

Let me know if there's any other information I can provide.

The plan here is to spend no more than 3 days investigating and check back in at the next sprint planning meeting on Sept 7.

There are stale .tmp files in the dump directory, which is suspicious.
One possibility is that the script is run concurrently, either twice from the same machine or from another machine if it's writing to a shared folder.
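A few commands that could help look for concurrent runs or duplicate cron entries (a hedged sketch; the `dumpsgen` user and the cron locations are assumptions, not confirmed paths):

```shell
# List any running copies of the dump script, with their full command lines.
pgrep -af dumpcirrussearch.sh || echo "no running copies found"
# Check the scheduling user's crontab and the system-wide cron definitions
# for duplicate entries.
crontab -l -u dumpsgen 2>/dev/null | grep -i cirrus || true
grep -ri cirrus /etc/cron.d/ 2>/dev/null || true
```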

Change 623783 had a related patch set uploaded (by DCausse; owner: DCausse):
[operations/puppet@production] [cirrusdumps] Skip wikis with existing dump files

https://gerrit.wikimedia.org/r/623783

The patch above does not fix the root cause (it just adds a bit more error handling/logging). I suspect dumpcirrussearch.sh is being run concurrently, but I'm not sure where to start looking for possible duplicate or stale cron entries. @ArielGlenn, would you have any advice on this (I'm not even sure I'm on the right track)? Thanks!

Worth noting that I don't see such tmp files on /mnt/dumpsdata/otherdumps/cirrussearch/ mounted from snapshot1008, so my earlier explanation is probably wrong; these tmp files are likely left over by an rsync process between this folder and the folder behind https://dumps.wikimedia.org/other/cirrussearch.
I checked frwiki-20200831-cirrussearch-content.json.gz, which has a stale tmp file, and the number of articles in that dump seems correct.
These stale tmp files are probably a red herring. I'm going to assume that the dump script simply failed to dump all the content it was supposed to for enwiki-20200727-cirrussearch-content.json.gz due to some error (the logs for this dump have been cleaned up already). The extra error checking/logging added in the attached patch should hopefully help avoid promoting partial dump files to the public site.
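Avoiding the promotion of partial files usually comes down to the write-to-temp-then-rename pattern. A minimal sketch (file names and the `produce_dump` step are placeholders, not the actual dumpcirrussearch.sh logic):

```shell
# Write the dump to a temp file and only rename it into place on success,
# so a failed run never publishes a partial file.
out="example-cirrussearch-content.json.gz"
tmp="${out}.tmp"
produce_dump() { printf '{"title":"example"}\n' | gzip; }  # stand-in for the real dump step
if produce_dump > "$tmp"; then
    mv "$tmp" "$out"    # rename is atomic within one filesystem
else
    rm -f "$tmp"        # clean up rather than leave a stale .tmp behind
fi
```

The rename is only atomic when the temp file lives on the same filesystem as the final destination, which is one reason to use a temp dir next to the output rather than /tmp.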


For the record there's only one copy of the process running over there right now:

ariel@snapshot1008:~$ ps axuww | grep -i cirr
dumpsgen 126079  0.0  0.0   4276   744 ?        Ss   Aug31   0:00 /bin/sh -c /usr/local/bin/dumpcirrussearch.sh --config /etc/dumps/confs/wikidump.conf.other
dumpsgen 126081  0.0  0.0  11212  3068 ?        S    Aug31   0:00 /bin/bash /usr/local/bin/dumpcirrussearch.sh --config /etc/dumps/confs/wikidump.conf.other
dumpsgen 126101  0.0  0.0  11212  2152 ?        S    Aug31   0:01 /bin/bash /usr/local/bin/dumpcirrussearch.sh --config /etc/dumps/confs/wikidump.conf.other
dumpsgen 127889 32.8  0.1 488516 98188 ?        S    13:42  16:42 /usr/bin/php7.2 /srv/mediawiki/multiversion/MWScript.php extensions/CirrusSearch/maintenance/DumpIndex.php --wiki=metawiki --indexType=general

These jobs typically take three days to run.

<snip>


I see some logs from August 27 for the Aug 24th run in snapshot1008:/var/log/cirrusdump, was there another log you were trying to check?


I'd need a log file from July: /var/log/cirrusdump/cirrusdump-enwiki-20200727-content.log.

Ah rats, that is indeed too old.

I had a look at the sizes of the relevant file on our web server:

ariel@labstore1007:/srv/dumps/xmldatadumps/public/other/cirrussearch$ ls -l 20200*/enwiki-*content*gz
-rw-r--r-- 1 dumpsgen dumpsgen 31147816314 Jun 23 13:53 20200622/enwiki-20200622-cirrussearch-content.json.gz
-rw-r--r-- 1 dumpsgen dumpsgen 31198064043 Jun 30 15:14 20200629/enwiki-20200629-cirrussearch-content.json.gz
-rw-r--r-- 1 dumpsgen dumpsgen 31244306569 Jul  7 16:44 20200706/enwiki-20200706-cirrussearch-content.json.gz
-rw-r--r-- 1 dumpsgen dumpsgen  1615456056 Jul 14 04:03 20200713/enwiki-20200713-cirrussearch-content.json.gz
-rw-r--r-- 1 dumpsgen dumpsgen 31355629658 Jul 21 19:05 20200720/enwiki-20200720-cirrussearch-content.json.gz
-rw-r--r-- 1 dumpsgen dumpsgen 22550187789 Jul 28 18:16 20200727/enwiki-20200727-cirrussearch-content.json.gz
-rw-r--r-- 1 dumpsgen dumpsgen 31452325788 Aug  4 08:58 20200803/enwiki-20200803-cirrussearch-content.json.gz
-rw-r--r-- 1 dumpsgen dumpsgen 31516448290 Aug 11 21:41 20200810/enwiki-20200810-cirrussearch-content.json.gz
-rw-r--r-- 1 dumpsgen dumpsgen 31557488458 Aug 18 22:41 20200817/enwiki-20200817-cirrussearch-content.json.gz
-rw-r--r-- 1 dumpsgen dumpsgen 31621027454 Aug 25 23:18 20200824/enwiki-20200824-cirrussearch-content.json.gz
-rw-r--r-- 1 dumpsgen dumpsgen 31647418627 Sep  1 23:28 20200831/enwiki-20200831-cirrussearch-content.json.gz

So the output does indeed look truncated for that one run.
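One quick check on a suspect file is `gzip -t`, which verifies stream integrity. The caveat: a dump that exited early but closed its gzip stream cleanly would still pass, so comparing sizes against neighbouring runs, as above, remains the stronger signal. A small demonstration:

```shell
# gzip -t detects a file cut off mid-stream; shown here on a tiny sample.
printf 'hello\n' | gzip > ok.gz
head -c 10 ok.gz > cut.gz        # keep only the gzip header, drop data and trailer
gzip -t ok.gz && echo "ok.gz passes"
gzip -t cut.gz 2>/dev/null || echo "cut.gz fails the integrity check"
```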

Change 623783 merged by Ryan Kemper:
[operations/puppet@production] [cirrusdumps] use temp dir and add better error handling

https://gerrit.wikimedia.org/r/623783