commonswiki dump failure for 20240401
Closed, Resolved · Public

Description

Got a bunch of these messages from ops-dumps:

Systemd timer ran the following command:

/bin/bash /usr/local/bin/job_watcher.sh --dumpsbasedir /data/xmldatadumps/public --locksbasedir /data/xmldatadumps/private

Its return value was 0 and emitted the following output:

PROBLEM: commonswiki has file commonswiki/20240401/commonswiki-20240401-pages-meta-history6.xml-p109903254p110194293.bz2.inprog at least 4 hours older than lock
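
The message above presumably compares the modification time of the .inprog file with that of the lock file. The actual logic of job_watcher.sh is not shown in this task, so the following is only a sketch of how the same comparison could be made by hand, assuming GNU stat (its -c %Y option). The helper name older_than_lock is hypothetical.

```shell
# Hypothetical helper, not part of the dumps tooling.
# Succeeds if FILE was last modified at least HOURS hours before LOCK.
# Assumes GNU stat, where "-c %Y" prints the mtime as seconds since the epoch.
older_than_lock() {
    file=$1
    lock=$2
    hours=$3
    file_mtime=$(stat -c %Y "$file") || return 2
    lock_mtime=$(stat -c %Y "$lock") || return 2
    [ $(( lock_mtime - file_mtime )) -ge $(( hours * 3600 )) ]
}
```

For example, older_than_lock FILE /mnt/dumpsdata/xmldatadumps/private/commonswiki/lock_20240401 4 && echo stale, where FILE is the .inprog file named in the message.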

I attempted to unstick it a couple of times, but that did not work (see email thread for details).

I therefore decided to kill the job for good and rerun it on the testbed host, as per https://wikitech.wikimedia.org/wiki/Dumps/Rerunning_a_job#Rerunning_a_complete_dump

Event Timeline

Here are the steps I took following https://wikitech.wikimedia.org/wiki/Dumps/Rerunning_a_job#Rerunning_a_complete_dump:

Kill the commonswiki dump:

hostname -f
snapshot1013.eqiad.wmnet

sudo -u dumpsgen bash

cd /srv/deployment/dumps/dumps/xmldumps-backup

python3 dumpadmin.py --kill --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --wiki commonswiki --dryrun
would kill processes ['44736', '44833', '64116', '64245']

python3 dumpadmin.py --kill --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --wiki commonswiki

Check everything is dead:

dumpsgen@snapshot1013:/srv/deployment/dumps/dumps/xmldumps-backup$ ps -aux | grep 44736
dumpsgen 21980  0.0  0.0   6072   892 pts/0    S+   21:41   0:00 grep 44736
dumpsgen@snapshot1013:/srv/deployment/dumps/dumps/xmldumps-backup$ ps -aux | grep 44833
dumpsgen 21989  0.0  0.0   6072   892 pts/0    S+   21:41   0:00 grep 44833
dumpsgen@snapshot1013:/srv/deployment/dumps/dumps/xmldumps-backup$ ps -aux | grep 64116
dumpsgen 21991  0.0  0.0   6072   892 pts/0    S+   21:41   0:00 grep 64116
dumpsgen@snapshot1013:/srv/deployment/dumps/dumps/xmldumps-backup$ ps -aux | grep 64245
dumpsgen 22005  0.0  0.0   6072   828 pts/0    S+   21:42   0:00 grep 64245
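
Note that each ps -aux | grep PID above matches only the grep process itself, which is how "no such process" looks with this approach. A minimal sketch of a less ambiguous check, assuming the killed processes belonged to the current user (dumpsgen here). The helper name pids_still_alive is hypothetical and not part of the dumps tooling.

```shell
# Hypothetical helper: report, for each PID argument, whether it is still alive.
# "kill -0" sends no signal; it only tests whether the process can be signalled.
# Caveat: a process owned by another user also fails this test, so the result
# is only meaningful for processes owned by the user running the check.
pids_still_alive() {
    alive=0
    for pid in "$@"; do
        if kill -0 "$pid" 2>/dev/null; then
            echo "still running: $pid"
            alive=1
        else
            echo "gone: $pid"
        fi
    done
    # Exit status 0 means every PID is gone, 1 means at least one survives.
    return "$alive"
}
```

For the kill above, this would be called as pids_still_alive 44736 44833 64116 64245.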

Clean up locks:

python3 dumpadmin.py --unlock --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --wiki commonswiki --dryrun
would remove lock /mnt/dumpsdata/xmldatadumps/private/commonswiki/lock_20240401 for wiki commonswiki

python3 dumpadmin.py --unlock --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --wiki commonswiki

Verify lock file is gone:

dumpsgen@snapshot1013:/mnt/dumpsdata/xmldatadumps/private/commonswiki$ pwd
/mnt/dumpsdata/xmldatadumps/private/commonswiki
dumpsgen@snapshot1013:/mnt/dumpsdata/xmldatadumps/private/commonswiki$ ls
20231201  20231220  20240101  20240120  20240201  20240220  20240301  20240320  20240401
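
Instead of reading the ls output by eye, the same verification could be scripted. This is only a sketch, assuming lock files are always named lock_DATE as in the dry-run output above; lock_is_gone is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if no lock_* file remains in the directory.
lock_is_gone() {
    dir=$1
    for f in "$dir"/lock_*; do
        # An unmatched glob stays literal, so confirm the path really exists.
        if [ -e "$f" ]; then
            echo "lock still present: $f"
            return 1
        fi
    done
    echo "no lock files in $dir"
}
```

Here it would be run as lock_is_gone /mnt/dumpsdata/xmldatadumps/private/commonswiki.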

Re-running commonswiki on testbed host:

ssh snapshot1009.eqiad.wmnet
sudo -u dumpsgen bash

rm stale .inprog files:

dumpsgen@snapshot1009:/mnt/dumpsdata/xmldatadumps/public/commonswiki/20240401$ ls -lsha *inprog*
209M -rw-r--r-- 1 dumpsgen dumpsgen 209M Apr 12 21:40 commonswiki-20240401-pages-meta-history6.xml-p109903254p110194293.bz2.inprog
177M -rw-r--r-- 1 dumpsgen dumpsgen 177M Apr 12 21:51 commonswiki-20240401-pages-meta-history6.xml-p119976192p120285167.bz2.inprog
dumpsgen@snapshot1009:/mnt/dumpsdata/xmldatadumps/public/commonswiki/20240401$ rm *.inprog
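
The list-then-remove step above can be wrapped so that removal is opt-in, mirroring the --dryrun habit used with dumpadmin.py. This is a sketch only; clean_inprog and its --delete argument are hypothetical and not part of the dumps tooling.

```shell
# Hypothetical helper: list the *.inprog files directly inside a dump directory
# and remove them only when "--delete" is passed as the second argument.
clean_inprog() {
    dir=$1
    mode=${2:-}
    for f in "$dir"/*.inprog; do
        # An unmatched glob stays literal; skip it.
        [ -e "$f" ] || continue
        if [ "$mode" = "--delete" ]; then
            rm -- "$f" && echo "removed $f"
        else
            echo "would remove $f"
        fi
    done
}
```

Run it once without --delete against /mnt/dumpsdata/xmldatadumps/public/commonswiki/20240401 to review the list, then again with --delete.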

And now we attempt to rerun:

cd /srv/deployment/dumps/dumps/xmldumps-backup
bash ./worker --date last --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --wiki commonswiki

Unfortunately, after running for more than 2 days, the commonswiki dump got stuck again with the same problem as in the description, on the same file.

I was running the process in a terminal window, so I killed it with a simple CTRL-C.

Helper script says there is nothing else to kill:

python3 dumpadmin.py --kill --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --wiki commonswiki --dryrun
would kill processes []

Let's clean up the lock file still:

dumpsgen@snapshot1009:/srv/deployment/dumps/dumps/xmldumps-backup$ python3 dumpadmin.py --unlock --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --wiki commonswiki --dryrun
would remove lock /mnt/dumpsdata/xmldatadumps/private/commonswiki/lock_20240401 for wiki commonswiki
dumpsgen@snapshot1009:/srv/deployment/dumps/dumps/xmldumps-backup$ python3 dumpadmin.py --unlock --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --wiki commonswiki

Not much else to do here. This month, there will be no full commonswiki dump (i.e. "All pages with complete page edit history").

For some reason, we are reattempting the 20240401 commonswiki dump, and it is failing with the same issue.

So one more time:

Kill the commonswiki dump:

cat /mnt/dumpsdata/xmldatadumps/private/commonswiki/lock_20240401 
snapshot1011.eqiad.wmnet 43210
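
The lock file is what identifies the host to log in to. A small sketch for reading it, assuming the content is always a single line of the form "HOST NUMBER" as shown above; treating the second field as a process ID is an assumption, and lock_owner is a hypothetical helper name.

```shell
# Hypothetical helper: print the two fields recorded in a lock file.
# Assumes one line of the form "<host> <pid>"; the "pid" label is a guess.
lock_owner() {
    # read returns non-zero when the last line has no trailing newline,
    # so accept the result as long as a host was actually parsed.
    read -r host pid < "$1" || [ -n "$host" ] || return 1
    echo "host=$host pid=$pid"
}
```

Comparing the printed host with the output of hostname -f confirms that the kill is being run on the right machine.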

hostname -f
snapshot1011.eqiad.wmnet

sudo -u dumpsgen bash

cd /srv/deployment/dumps/dumps/xmldumps-backup

python3 dumpadmin.py --kill --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --wiki commonswiki --dryrun
would kill processes ['49891', '49936']

python3 dumpadmin.py --kill --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --wiki commonswiki

Check everything is dead:

dumpsgen@snapshot1011:/srv/deployment/dumps/dumps/xmldumps-backup$ ps -aux | grep 49891
dumpsgen 16719  0.0  0.0   6072   888 pts/0    S+   16:24   0:00 grep 49891
dumpsgen@snapshot1011:/srv/deployment/dumps/dumps/xmldumps-backup$ ps -aux | grep 49936
dumpsgen 16739  0.0  0.0   6072   892 pts/0    S+   16:25   0:00 grep 49936

Clean up locks:

dumpsgen@snapshot1011:/srv/deployment/dumps/dumps/xmldumps-backup$ python3 dumpadmin.py --unlock --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --wiki commonswiki --dryrun
would remove lock /mnt/dumpsdata/xmldatadumps/private/commonswiki/lock_20240401 for wiki commonswiki
dumpsgen@snapshot1011:/srv/deployment/dumps/dumps/xmldumps-backup$ python3 dumpadmin.py --unlock --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --wiki commonswiki

Verify lock file is gone:

dumpsgen@snapshot1011:/srv/deployment/dumps/dumps/xmldumps-backup$ ls /mnt/dumpsdata/xmldatadumps/private/commonswiki
20231201  20231220  20240101  20240120  20240201  20240220  20240301  20240320  20240401

Got the failure email, so that is good: https://groups.google.com/a/wikimedia.org/g/ops-dumps/c/YrCE7k_xOno/m/6VZ4lRqgAgAJ