
Dumps of enwiki and wikidatawiki for 20241001 have not started
Open, High, Public

Description

We continue to see disruption in the scheduling of regular dumps for reasons unknown.

The latest link for wikidatawiki is still pointing to the 20240920 dump:
https://dumps.wikimedia.org/wikidatawiki/latest/

https://dumps.wikimedia.org/enwiki/20241001/ states that the dump has yet to start, and https://dumps.wikimedia.org/enwiki/20241001/dumpstatus.json returns a 404.
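
A check like the one above could be scripted. The following is a minimal sketch, assuming the URL layout seen in this report (wiki name, then run date, then dumpstatus.json); the helper names and the User-Agent string are hypothetical and not part of any dumps tooling.

```python
import urllib.error
import urllib.request

BASE_URL = "https://dumps.wikimedia.org"


def dumpstatus_url(wiki, run_date):
    """Build the dumpstatus.json URL for a wiki and a YYYYMMDD run date."""
    return f"{BASE_URL}/{wiki}/{run_date}/dumpstatus.json"


def http_status(url, user_agent="dump-status-check/0.1 (hypothetical)"):
    """Return the HTTP status code for a HEAD request to url.

    urlopen raises HTTPError for 4xx/5xx responses, so in that case the
    status code is read from the exception instead of the response.
    """
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": user_agent}
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
```

Calling http_status(dumpstatus_url("enwiki", "20241001")) would be expected to return 404 for as long as the run has not started, matching what is described here.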

This may be related to T375928 (skwikibooks dumps failing), and it also looks like another occurrence of T375692 (enwiki dump for 20240920 is delayed).

Event Timeline

The wikidata dump is supposed to run on snapshot1011, but on that host I can still see processes owned by dumpsgen that relate to the 20240920 run.

image.png (321×1 px, 98 KB)

They also seem to be doing a lot of sleeping.

According to https://dumps.wikimedia.org/wikidatawiki/20240920/ this dump is complete, so I am going to kill these stray processes and try to start the wikidata dump for 20241001.

I killed all of these processes, starting with the parent fulldumps.sh:

btullis@snapshot1011:~$ sudo kill 3754960
btullis@snapshot1011:~$ pstree -ap dumpsgen
No processes found.

Now restarting the service:

btullis@snapshot1011:~$ sudo systemctl restart fulldumps-rest.service

I could have waited until 20:05 for it to start by itself, but I wanted to see if it exited quickly.
I have started this in a screen session, as it does not return my terminal to me.

Looks to be proceeding correctly, though.
I can see content in /mnt/dumpsdata/xmldatadumps/private/wikidatawiki/20241001/dumplog.txt

Looking at enwiki on snapshot1012, I can see that there is still a process running and the commands look right, but it doesn't seem to be doing anything.

btullis@snapshot1012:~$ pstree -al dumpsgen
systemd-timer-m /usr/local/bin/systemd-timer-mail-wrapper --subject fulldumps-rest --mail-to root@snapshot1012.eqiad.wmnet --only-on-error /usr/local/bin/fulldumps.sh 01 14 enwiki full 18 silent
  └─fulldumps.sh /usr/local/bin/fulldumps.sh 01 14 enwiki full 18 silent
      └─python3 /srv/deployment/dumps/dumps/xmldumps-backup/dumpscheduler.py --slots 18 --commands /etc/dumps/stages/stages_full_enwiki --cache /etc/dumps/cache/running_cache.txt --directory /srv/deployment/dumps/dumps/xmldumps-backup --formatvars STARTDATE=20241001

There is a dumplog.txt file in the correct location, but its entries stop about two seconds after the run began.

btullis@snapshot1012:/mnt/dumpsdata/xmldatadumps/private/enwiki/20241001$ tail -f dumplog.txt 
2024-10-01 08:05:12: enwiki Creating /mnt/dumpsdata/xmldatadumps/public/enwiki/20241001 ...
2024-10-01 08:05:13: enwiki Checkdir dir /mnt/dumpsdata/xmldatadumps/public/enwiki/20241001 ...
2024-10-01 08:05:13: enwiki Checkdir dir /mnt/dumpsdata/xmldatadumps/private/enwiki/20241001 ...
2024-10-01 08:05:13: enwiki Cleaning up old dumps for enwiki
Preparing for job createdirs of enwiki
2024-10-01 08:05:13: enwiki Checkdir dir /mnt/dumpsdata/xmldatadumps/public/enwiki/20241001 ...
2024-10-01 08:05:13: enwiki Checkdir dir /mnt/dumpsdata/xmldatadumps/private/enwiki/20241001 ...
2024-10-01 08:05:14: enwiki Completed job createdirs for enwiki
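
To put a number on how long the run was active before it stalled, the log timestamps can be compared. This is a rough sketch, assuming every timestamped dumplog.txt line starts with "YYYY-MM-DD HH:MM:SS: " as in the excerpt above; the function names are hypothetical.

```python
import re
from datetime import datetime

# Matches the leading timestamp of lines such as
# "2024-10-01 08:05:12: enwiki Creating ..."
TIMESTAMP_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): ")


def log_timestamps(lines):
    """Return the timestamps found at the start of dumplog.txt lines."""
    stamps = []
    for line in lines:
        match = TIMESTAMP_RE.match(line)
        if match:  # lines like "Preparing for job ..." carry no timestamp
            stamps.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S"))
    return stamps


def active_seconds(lines):
    """Seconds between the first and last timestamped entries, or None."""
    stamps = log_timestamps(lines)
    if not stamps:
        return None
    return (stamps[-1] - stamps[0]).total_seconds()


# First line, an untimestamped line, and last line from the excerpt above.
sample = [
    "2024-10-01 08:05:12: enwiki Creating /mnt/dumpsdata/xmldatadumps/public/enwiki/20241001 ...",
    "Preparing for job createdirs of enwiki",
    "2024-10-01 08:05:14: enwiki Completed job createdirs for enwiki",
]
print(active_seconds(sample))  # 2.0
```

Comparing the last timestamp with the current time in the same way would show how long the process has been silent.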

I will kill this process and restart the service too.

I have tried killing and restarting the service several times, but with no effect.
It hangs when running this interactively:

dumpsgen@snapshot1012:/home/btullis$ /usr/local/bin/fulldumps.sh 01 14 enwiki full 18 silent

I can trace it down a little further, to where this command is hanging:

dumpsgen@snapshot1012:/home/btullis$ /usr/bin/python3 /srv/deployment/dumps/dumps/xmldumps-backup/dumpscheduler.py --slots 18 --commands /etc/dumps/stages/stages_full_enwiki --cache /etc/dumps/cache/running_cache.txt --directory /srv/deployment/dumps/dumps/xmldumps-backup --formatvars STARTDATE=20241001

Running sudo lsof -N also shows that no files have been opened on NFS.

OK, the dump of enwiki is now under way.
I am running it with the command:

dumpsgen@snapshot1012:/home/btullis$ /usr/local/bin/fulldumps.sh 01 14 enwiki full 28 silent

The only difference is the number of slots: this run uses 28, the value in place before we reduced it to 18 in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1070558 for T373904: Lower the available slots for the dump of enwiki to lower pressure on databases.

Presumably something in dumpscheduler.py does not cope with having only 18 slots, so I will investigate that further.
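
For the record, the hanging and working invocations can be compared argument by argument. The sketch below is only an illustration of that comparison; differing_args is a hypothetical helper, and the two command strings are copied from the sessions above.

```python
import shlex


def differing_args(command_a, command_b):
    """Return (position, value_a, value_b) for each argument that differs."""
    args_a = shlex.split(command_a)
    args_b = shlex.split(command_b)
    if len(args_a) != len(args_b):
        raise ValueError("commands have different numbers of arguments")
    return [
        (position, a, b)
        for position, (a, b) in enumerate(zip(args_a, args_b))
        if a != b
    ]


hanging = "/usr/local/bin/fulldumps.sh 01 14 enwiki full 18 silent"
working = "/usr/local/bin/fulldumps.sh 01 14 enwiki full 28 silent"
print(differing_args(hanging, working))  # [(5, '18', '28')]
```

Position 5 is the slots value, which fulldumps.sh passes on to dumpscheduler.py as --slots in the process tree shown earlier.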