Dumps of enwiki and wikidatawiki for 20241001 have not started
Closed, Resolved · Public

Description

We continue to see disruption in the scheduling of regular dumps for reasons unknown.

The latest link for wikidatawiki is still pointing to the 20240920 dump:
https://dumps.wikimedia.org/wikidatawiki/latest/

https://dumps.wikimedia.org/enwiki/20241001/ states that the dump has yet to start, and https://dumps.wikimedia.org/enwiki/20241001/dumpstatus.json returns a 404.
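
For anyone wanting to repeat this check for other wikis or run dates, here is a minimal Python sketch. It assumes only the URL layout visible in the links above; the helper names `dumpstatus_url` and `fetch_dumpstatus` are hypothetical, and nothing is assumed about the contents of dumpstatus.json. The User-Agent string is a placeholder.

```python
import json
import urllib.error
import urllib.request

BASE_URL = "https://dumps.wikimedia.org"


def dumpstatus_url(wiki: str, date: str) -> str:
    """Build the dumpstatus.json URL for a wiki and run date (YYYYMMDD)."""
    return f"{BASE_URL}/{wiki}/{date}/dumpstatus.json"


def fetch_dumpstatus(wiki: str, date: str, timeout: float = 10.0):
    """Return the parsed dumpstatus.json, or None if the server answers 404.

    A 404 is treated as "no status file published for this run yet", which
    is what this ticket reports for enwiki/20241001. Other HTTP errors are
    re-raised so that they are not mistaken for a dump that has not started.
    """
    request = urllib.request.Request(
        dumpstatus_url(wiki, date),
        headers={"User-Agent": "dump-status-check (example script)"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return json.load(response)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise
```

Calling `fetch_dumpstatus("enwiki", "20241001")` and getting `None` back would correspond to the 404 described above.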

This may be related to T375928: skwikibooks dumps failing and also seems like another occurrence of T375692: enwiki dump for 20240920 is delayed.

Event Timeline

The wikidata dump is supposed to run on snapshot1011, but when I look at that host I can see that there are still processes owned by dumpsgen that relate to the 20240920 run.

They also seem to be doing a lot of sleeping.

According to https://dumps.wikimedia.org/wikidatawiki/20240920/ this dump is complete, so I am going to kill these stray processes and try to start the wikidata dump for 20241001.

Killed all of the stray processes, starting with the parent fulldumps.sh:

btullis@snapshot1011:~$ sudo kill 3754960
btullis@snapshot1011:~$ pstree -ap dumpsgen
No processes found.

Now restarting the service:

btullis@snapshot1011:~$ sudo systemctl restart fulldumps-rest.service

I could have waited until 20:05 for it to start by itself, but I wanted to see if it exited quickly.
I have started this in a screen session, as it does not return my terminal to me.

It looks to be proceeding correctly, though.
I can see content in /mnt/dumpsdata/xmldatadumps/private/wikidatawiki/20241001/dumplog.txt.

Looking at enwiki on snapshot1012, I can see that there is still a process running and the commands look right, but it doesn't seem to be doing anything.

btullis@snapshot1012:~$ pstree -al dumpsgen
systemd-timer-m /usr/local/bin/systemd-timer-mail-wrapper --subject fulldumps-rest --mail-to root@snapshot1012.eqiad.wmnet --only-on-error /usr/local/bin/fulldumps.sh 01 14 enwiki full 18 silent
  └─fulldumps.sh /usr/local/bin/fulldumps.sh 01 14 enwiki full 18 silent
      └─python3 /srv/deployment/dumps/dumps/xmldumps-backup/dumpscheduler.py --slots 18 --commands /etc/dumps/stages/stages_full_enwiki --cache /etc/dumps/cache/running_cache.txt --directory /srv/deployment/dumps/dumps/xmldumps-backup --formatvars STARTDATE=20241001

There is a dumplog.txt file in the correct location, but it clearly stopped after a few seconds of running.

btullis@snapshot1012:/mnt/dumpsdata/xmldatadumps/private/enwiki/20241001$ tail -f dumplog.txt 
2024-10-01 08:05:12: enwiki Creating /mnt/dumpsdata/xmldatadumps/public/enwiki/20241001 ...
2024-10-01 08:05:13: enwiki Checkdir dir /mnt/dumpsdata/xmldatadumps/public/enwiki/20241001 ...
2024-10-01 08:05:13: enwiki Checkdir dir /mnt/dumpsdata/xmldatadumps/private/enwiki/20241001 ...
2024-10-01 08:05:13: enwiki Cleaning up old dumps for enwiki
Preparing for job createdirs of enwiki
2024-10-01 08:05:13: enwiki Checkdir dir /mnt/dumpsdata/xmldatadumps/public/enwiki/20241001 ...
2024-10-01 08:05:13: enwiki Checkdir dir /mnt/dumpsdata/xmldatadumps/private/enwiki/20241001 ...
2024-10-01 08:05:14: enwiki Completed job createdirs for enwiki
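
To make the "stopped after a few seconds" check repeatable, here is a minimal Python sketch that reads the last timestamped line of a dumplog.txt and compares it with the current time. It assumes the `YYYY-MM-DD HH:MM:SS: ` line prefix seen in the excerpt above and skips lines without it. The 30-minute idle threshold and the function names are assumptions for illustration, and the timestamps are compared as naive datetimes, so `now` has to be in the same timezone as the log.

```python
import re
from datetime import datetime, timedelta

# Matches the timestamp prefix used in the excerpt, e.g. "2024-10-01 08:05:14: ".
TIMESTAMP_PREFIX = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): ")


def last_log_time(lines):
    """Return the timestamp of the last timestamped line, or None if there is none."""
    latest = None
    for line in lines:
        match = TIMESTAMP_PREFIX.match(line)
        if match:
            latest = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    return latest


def is_stalled(lines, now, max_idle=timedelta(minutes=30)):
    """Report True if the log has no timestamps or has been idle longer than max_idle."""
    latest = last_log_time(lines)
    return latest is None or now - latest > max_idle
```

With the log shown above, the last timestamp is 08:05:14, so any check made more than half an hour later would flag the run as stalled.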

I will kill this process and restart the service too.

I have tried killing the process and restarting the service several times, but with no effect.
It also hangs when I run the same command interactively:

dumpsgen@snapshot1012:/home/btullis$ /usr/local/bin/fulldumps.sh 01 14 enwiki full 18 silent

I can trace it down a little further, to where this command is hanging:

dumpsgen@snapshot1012:/home/btullis$ /usr/bin/python3 /srv/deployment/dumps/dumps/xmldumps-backup/dumpscheduler.py --slots 18 --commands /etc/dumps/stages/stages_full_enwiki --cache /etc/dumps/cache/running_cache.txt --directory /srv/deployment/dumps/dumps/xmldumps-backup --formatvars STARTDATE=20241001

I can also see that no files have been opened on NFS by running sudo lsof -N.

OK, the dump of enwiki is now under way.
I am running it with the command:

dumpsgen@snapshot1012:/home/btullis$ /usr/local/bin/fulldumps.sh 01 14 enwiki full 28 silent

The difference is the number of slots. This run uses 28, which was the value before we reduced it to 18 in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1070558 for T373904: Lower the available slots for the dump of enwiki to lower presure on databases

Something in dumpscheduler.py presumably doesn't like having only 18 slots, so I will investigate that further.

Change #1080265 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Revert "Lower the number of slots that the enwiki dump uses"

https://gerrit.wikimedia.org/r/1080265

The dumps are proceeding. frwiki is still running, as is wikidatawiki. labswiki has a problem dumping content, but that is being tracked in T374952: Figure why we can't dump labswiki, aka Wikitech.

The dumps_fillin_wd service has run on snapshot1015 for the first time, I believe, and claims to have run to completion.

Oct 11 08:05:08 snapshot1015 systemd[1]: Starting snapshot - full dumps - fillin - wikidata...
Oct 14 14:54:59 snapshot1015 bash[3380982]: 20241011080508 getting stubs
Oct 14 14:54:59 snapshot1015 bash[3380982]: 20241011080510 checking that stubs are present
Oct 14 14:54:59 snapshot1015 bash[3380982]: 20241011080510 getting parts list
Oct 14 14:54:59 snapshot1015 bash[3380982]: 20241011080510 getting jobinfo arg
Oct 14 14:54:59 snapshot1015 bash[3380982]: 20241011080510 Doing parts "25", "26", "27"
Oct 14 14:54:59 snapshot1015 bash[3380982]: 20241011080511 jobinfo_arg: 25:65585259:75798893,26:75798894:88185873,27:88185874:124395022
Oct 14 14:54:59 snapshot1015 bash[3380982]: 20241011080511 doing fillin
Oct 14 14:54:59 snapshot1015 bash[3380982]: echo Started (3380983) 20241011080511 > /mnt/dumpsdata/xmldatadumps/temp/w/wikidatawiki/wikidatawiki-fixups-20241001-25-27-status.txt
Oct 14 14:54:59 snapshot1015 bash[3380982]: /bin/bash /srv/deployment/dumps/dumps/xmldumps-backup/fixup_scripts/do_dumptextpass_jobs.sh --wiki wikidatawiki --config /etc/dumps/confs/wikidump.conf.dumps:wd --dat>
Oct 14 14:54:59 snapshot1015 bash[3380982]: 20241014145450 (3380983) Dump fillin for wikidatawiki complete.
Oct 14 14:54:59 snapshot1015 bash[3380982]: 20241014145459 really done
Oct 14 14:54:59 snapshot1015 systemd[1]: dumps_fillin_wd.service: Succeeded.
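
As a reading aid for the jobinfo_arg line in the log above, here is a small Python sketch that splits the value into its three fields per part. Interpreting the fields as part number, first ID and last ID is an assumption based on the shape of the string, not something confirmed by the dumps code, and the names `PartRange`, `parse_jobinfo_arg` and `is_contiguous` are hypothetical.

```python
from typing import List, NamedTuple


class PartRange(NamedTuple):
    part: int   # assumed: dump part number
    start: int  # assumed: first ID covered by the part
    end: int    # assumed: last ID covered by the part


def parse_jobinfo_arg(value: str) -> List[PartRange]:
    """Parse 'part:start:end,part:start:end,...' into a list of PartRange."""
    ranges = []
    for chunk in value.split(","):
        part, start, end = (int(field) for field in chunk.split(":"))
        ranges.append(PartRange(part, start, end))
    return ranges


def is_contiguous(ranges: List[PartRange]) -> bool:
    """Check that each range starts exactly one past the end of the previous one."""
    return all(nxt.start == cur.end + 1 for cur, nxt in zip(ranges, ranges[1:]))
```

Applied to the value logged above, this yields three ranges for parts 25, 26 and 27, each starting one past the end of the previous one.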

I submitted a revert for the reduction of slots used for the enwiki dump, so that it will hopefully start on time on the 20th. We could still investigate why it didn't start with 18 slots available, but I didn't want to leave the reduced setting in place in the meantime.

Change #1080265 merged by Btullis:

[operations/puppet@production] Revert "Lower the number of slots that the enwiki dump uses"

https://gerrit.wikimedia.org/r/1080265

Resolving this ticket. I will keep monitoring the dumps that are still in progress, and I will make a note to check on Monday morning whether the 20241020 dumps started correctly.