commonswiki and enwiki dumps thrashing
Open, High, Public

Description

For ~22 hours now, the commonswiki and enwiki dumps have been thrashing.

Email thread from ops-dumps: https://groups.google.com/a/wikimedia.org/g/ops-dumps/c/lyErVXUIKXk

The latest email reads like so:

PROBLEM: commonswiki has file commonswiki/20240501/commonswiki-20240501-pages-meta-history1.xml-p7864184p8057557.bz2.inprog at least 4 hours older than lock
PROBLEM: commonswiki has file commonswiki/20240501/commonswiki-20240501-pages-meta-history1.xml-p8057558p8233432.bz2.inprog at least 4 hours older than lock
PROBLEM: commonswiki has file commonswiki/20240501/commonswiki-20240501-pages-meta-history1.xml-p8233433p8425544.bz2.inprog at least 4 hours older than lock
PROBLEM: commonswiki has file commonswiki/20240501/commonswiki-20240501-pages-meta-history1.xml-p8611890p8796610.bz2.inprog at least 4 hours older than lock
PROBLEM: enwiki has file enwiki/20240501/enwiki-20240501-pages-meta-history27.xml-p70704075p70976995.bz2.inprog at least 4 hours older than lock
PROBLEM: enwiki has file enwiki/20240501/enwiki-20240501-pages-meta-history27.xml-p71883811p72200615.bz2.inprog at least 4 hours older than lock
PROBLEM: enwiki has file enwiki/20240501/enwiki-20240501-pages-meta-history27.xml-p72200616p72496357.bz2.inprog at least 4 hours older than lock
PROBLEM: enwiki has file enwiki/20240501/enwiki-20240501-pages-meta-history27.xml-p73502224p73857908.bz2.inprog at least 4 hours older than lock
PROBLEM: enwiki has file enwiki/20240501/enwiki-20240501-pages-meta-history27.xml-p74947052p75285188.bz2.inprog at least 4 hours older than lock
PROBLEM: enwiki has file enwiki/20240501/enwiki-20240501-pages-meta-history27.xml-p76565476p76788710.bz2.inprog at least 4 hours older than lock

These are the same symptoms that made commonswiki fail last month in T362454.
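The alert wording ("at least 4 hours older than lock") reads like a comparison of file modification times. A minimal Python sketch of such a check, assuming the monitor compares the mtime of each .inprog file against the mtime of the lock file (the thread does not confirm this); find_stale_inprog and min_age_hours are illustrative names, not part of the dumps tooling:

```python
from pathlib import Path


def find_stale_inprog(dump_dir, lock_file, min_age_hours=4.0):
    """Return names of .inprog files whose mtime is at least
    min_age_hours older than the lock file's mtime.

    Assumption: the alert is based on mtimes; the real monitor may differ.
    """
    lock_mtime = Path(lock_file).stat().st_mtime
    cutoff = lock_mtime - min_age_hours * 3600
    return sorted(
        p.name
        for p in Path(dump_dir).glob("*.inprog")
        if p.stat().st_mtime <= cutoff
    )
```

A file that keeps being written to would have a recent mtime and drop out of the result, so a list that stays non-empty for hours hints at stuck writers.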

Event Timeline

Figure out where things are running:

pwd
/mnt/dumpsdata/xmldatadumps/private/commonswiki

cat lock_20240501 
snapshot1010.eqiad.wmnet 8415


pwd
/mnt/dumpsdata/xmldatadumps/private/enwiki

cat lock_20240501
snapshot1012.eqiad.wmnet 3699
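Both lock files above hold a hostname and what looks like a PID, separated by a space. A small sketch for reading that pair, assuming the lock file always has exactly these two whitespace-separated fields (only the two samples above back this up); LockInfo and parse_lock are made-up names:

```python
from typing import NamedTuple


class LockInfo(NamedTuple):
    host: str
    pid: int


def parse_lock(text: str) -> LockInfo:
    """Split lock file contents into (host, pid).

    Assumption: the format is '<hostname> <pid>', as in the samples above.
    Raises ValueError if the contents do not have exactly two fields.
    """
    host, pid = text.split()
    return LockInfo(host, int(pid))


info = parse_lock("snapshot1010.eqiad.wmnet 8415\n")
print(info.host, info.pid)
```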

Now let's go to the node running commonswiki and nuke all running processes:

ssh snapshot1010.eqiad.wmnet

sudo -u dumpsgen bash

cd /srv/deployment/dumps/dumps/xmldumps-backup

python3 dumpadmin.py --kill --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --wiki commonswiki --dryrun
would kill processes ['59602', '59603', '59604', '59605', '59606', '59607', '59651', '59653', '59659', '59660', '59661', '59662']

python3 dumpadmin.py --kill --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --wiki commonswiki
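To compare what a dry run reports before and after a kill, the "would kill processes [...]" line can be turned into a list of PIDs. A sketch that parses only the output format shown above; parse_dryrun is a hypothetical helper, not something dumpadmin.py provides:

```python
import ast
from typing import List

PREFIX = "would kill processes "


def parse_dryrun(output: str) -> List[int]:
    """Extract PIDs from a 'would kill processes [...]' line.

    Returns an empty list if no such line is present in the output.
    """
    for line in output.splitlines():
        if line.startswith(PREFIX):
            # The remainder is a Python-style list of quoted PID strings.
            return [int(pid) for pid in ast.literal_eval(line[len(PREFIX):])]
    return []
```

If a later dry run still returns any of the same PIDs, the kill did not take effect for those processes.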

Same with enwiki:

ssh snapshot1012.eqiad.wmnet

sudo -u dumpsgen bash

cd /srv/deployment/dumps/dumps/xmldumps-backup

python3 dumpadmin.py --kill --configfile /etc/dumps/confs/wikidump.conf.dumps:en --wiki enwiki --dryrun


python3 dumpadmin.py --kill --configfile /etc/dumps/confs/wikidump.conf.dumps:en --wiki enwiki

Now we wait and see whether they get picked up automatically again.

Update:

enwiki appears to be doing well, with no further warning emails in the last 24 hours.

commonswiki, and now zhwiki, continue to have issues:

PROBLEM: commonswiki has file commonswiki/20240501/commonswiki-20240501-pages-meta-history1.xml-p7864184p8057557.bz2.inprog at least 4 hours older than lock
PROBLEM: commonswiki has file commonswiki/20240501/commonswiki-20240501-pages-meta-history1.xml-p8057558p8233432.bz2.inprog at least 4 hours older than lock
PROBLEM: commonswiki has file commonswiki/20240501/commonswiki-20240501-pages-meta-history1.xml-p8233433p8425544.bz2.inprog at least 4 hours older than lock
PROBLEM: commonswiki has file commonswiki/20240501/commonswiki-20240501-pages-meta-history1.xml-p8611890p8796610.bz2.inprog at least 4 hours older than lock
PROBLEM: commonswiki has file commonswiki/20240501/commonswiki-20240501-pages-meta-history3.xml-p28754100p28993268.bz2.inprog at least 4 hours older than lock
PROBLEM: commonswiki has file commonswiki/20240501/commonswiki-20240501-pages-meta-history3.xml-p28993269p29241308.bz2.inprog at least 4 hours older than lock
PROBLEM: commonswiki has file commonswiki/20240501/commonswiki-20240501-pages-meta-history3.xml-p29241309p29459407.bz2.inprog at least 4 hours older than lock
PROBLEM: commonswiki has file commonswiki/20240501/commonswiki-20240501-pages-meta-history3.xml-p29459408p29695693.bz2.inprog at least 4 hours older than lock
PROBLEM: commonswiki has file commonswiki/20240501/commonswiki-20240501-pages-meta-history3.xml-p29695694p29954512.bz2.inprog at least 4 hours older than lock
PROBLEM: commonswiki has file commonswiki/20240501/commonswiki-20240501-pages-meta-history3.xml-p29954513p30215144.bz2.inprog at least 4 hours older than lock
PROBLEM: commonswiki has file commonswiki/20240501/commonswiki-20240501-pages-meta-history4.xml-p44916064p45208003.bz2.inprog at least 4 hours older than lock
PROBLEM: commonswiki has file commonswiki/20240501/commonswiki-20240501-pages-meta-history4.xml-p45208004p45493510.bz2.inprog at least 4 hours older than lock
PROBLEM: commonswiki has file commonswiki/20240501/commonswiki-20240501-pages-meta-history4.xml-p45493511p45811238.bz2.inprog at least 4 hours older than lock
PROBLEM: commonswiki has file commonswiki/20240501/commonswiki-20240501-pages-meta-history4.xml-p45811239p46114477.bz2.inprog at least 4 hours older than lock
PROBLEM: zhwiki has file zhwiki/20240501/zhwiki-20240501-pages-meta-history6.xml-p6480506p6712125.bz2.inprog at least 4 hours older than lock
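With the list growing, a per-wiki, per-part count is easier to eyeball than the raw lines. A sketch that groups the alert lines by wiki and pages-meta-history part number; the regex is fitted to the lines in this task only, and summarize is an illustrative name:

```python
import re
from collections import Counter

ALERT_RE = re.compile(
    r"^PROBLEM: (?P<wiki>\w+) has file "
    r"\S+-pages-meta-history(?P<part>\d+)\.xml-p\d+p\d+\.bz2\.inprog "
)


def summarize(alert_text):
    """Count alert lines per (wiki, history part); non-matching lines are ignored."""
    counts = Counter()
    for line in alert_text.splitlines():
        m = ALERT_RE.match(line)
        if m:
            counts[(m.group("wiki"), int(m.group("part")))] += 1
    return dict(counts)
```

For the list above this groups into commonswiki part 1 (4 files), part 3 (6 files), part 4 (4 files), and zhwiki part 6 (1 file).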

I will try and nuke them both.

ssh snapshot1010.eqiad.wmnet

sudo -u dumpsgen bash

cd /srv/deployment/dumps/dumps/xmldumps-backup

python3 dumpadmin.py --kill --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --wiki commonswiki --dryrun
would kill processes ['41324', '41325', '41326', '41327', '41328', '41329', '41374', '41375', '41377', '41382', '41383', '41385']

python3 dumpadmin.py --kill --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --wiki commonswiki

Now zhwiki:

pwd
/mnt/dumpsdata/xmldatadumps/private/zhwiki

cat lock_20240501 
snapshot1010.eqiad.wmnet 32055

cd /srv/deployment/dumps/dumps/xmldumps-backup

python3 dumpadmin.py --kill --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --wiki zhwiki --dryrun
would kill processes ['31429', '31430', '31433', '31434']

python3 dumpadmin.py --kill --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --wiki zhwiki

Milimetric renamed this task from "commonswiki and enwiki dumps trashing" to "commonswiki and enwiki dumps thrashing". Thu, May 9, 11:22 AM

enwiki and zhwiki have finished dumping successfully.

commonswiki is unfortunately still thrashing. Considering that it failed last month (see T362454), I will attempt to force-heal the dump again.

Both enwiki and zhwiki did send a failure email, while commonswiki did not. I suspect the dumpadmin.py --kill command is not killing all processes, so I will try nuking things manually.
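One way to look for leftovers is to filter a process listing by owner and wiki name. A rough sketch that works on the text output of `ps -eo pid,user,args`; it assumes stray dump processes run as dumpsgen and mention the wiki name in their command line, which this thread does not verify, and stray_pids is a made-up helper:

```python
def stray_pids(ps_output, wiki, user="dumpsgen"):
    """Return PIDs of processes owned by `user` whose command line mentions `wiki`.

    Expects the output of `ps -eo pid,user,args` (header line included or not).
    Assumption: matching on the wiki name in the arguments is good enough.
    """
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue
        pid, owner, args = fields
        if not pid.isdigit():
            continue  # header or malformed line
        if owner == user and wiki in args:
            pids.append(int(pid))
    return pids
```

Comparing this list with what the --dryrun reports would show whether anything is being missed.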

Host rebooted by btullis@cumin1002 with reason: Terminating stray dumps processes

commonswiki shows as 'aborted' after the node reboot. Gave it some time to see if the systemd unit would trigger a re-run, but it has not.

Rerunning commonswiki manually.

Verify that no dump processes are left running:

ssh snapshot1010.eqiad.wmnet

sudo -u dumpsgen bash

cd /srv/deployment/dumps/dumps/xmldumps-backup

python3 dumpadmin.py --kill --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --wiki commonswiki --dryrun
would kill processes []

Verify there is no lock:

$ pwd
/mnt/dumpsdata/xmldatadumps/private/commonswiki
dumpsgen@snapshot1010:/mnt/dumpsdata/xmldatadumps/private/commonswiki$ ls
20240101  20240120  20240201  20240220  20240301  20240320  20240401  20240420  20240501
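The same check can be scripted. A minimal sketch, assuming lock files always follow the lock_<date> naming seen earlier in this task (lock_20240501); lock_files is an illustrative name:

```python
from pathlib import Path


def lock_files(private_dir):
    """Return the names of any lock_* entries in a wiki's private dumps directory."""
    return sorted(p.name for p in Path(private_dir).glob("lock_*"))


# An empty list means no lock is held, as in the ls output above:
# lock_files("/mnt/dumpsdata/xmldatadumps/private/commonswiki")
```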

Rerun on the same node, snapshot1010.eqiad.wmnet, since we have time:

cd /srv/deployment/dumps/dumps/xmldumps-backup
screen -S commonswiki-20240501-rerun
bash ./worker --date last --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --wiki commonswiki

commonswiki has been running with no warning emails for ~14 hours now. Looks like it may finish successfully!

BTullis triaged this task as High priority. Fri, May 10, 2:40 PM
BTullis moved this task from Backlog to Active on the Dumps-Generation board.

commonswiki finished successfully over the weekend.

I've just reset the sensor on the mediawiki_wikitext_current DAG, which had timed out.

mediawiki_wikitext_history is currently running, about 1d 7h into the Spark job.

This concludes this particular saga of the Dumps 1.0. Tune in for next month's! :D