Some dumps are not available since mid-May 2024
Closed, Resolved | Public

Description

Several dumps are not available:

Impact:

  • The categories graph on wdqs nodes is not updated: Categories update lag on wdqs1016 is CRITICAL: CRITICAL - Categories lag: 12 days
  • The search airflow dags importing RDF dumps are failing

Event Timeline

It might be related to T325228: Migrate Dumps Snapshot hosts from Buster to Bullseye and this patch: 1029220: Move dumps::generation::worker::dumper_misc_crons_only role | https://gerrit.wikimedia.org/r/c/operations/puppet/+/1029220 which was merged on May 16th.

All of the dumps mentioned here are in this list: T325228#9781322
...and I believe are managed from this manifest and related classes: https://github.com/wikimedia/operations-puppet/blob/production/modules/snapshot/manifests/systemdjobs.pp

I will check the logs of the jobs now.

I'm looking into this, but I haven't found an exact cause yet.
Looking at wikibase/wikidatawiki first:
I can see that on the current dumps generation NFS server dumpsdata1006 we have new files, but also some broken symlinks:

image.png (382×1 px, 158 KB)

However, on the current distribution server clouddumps1002, we don't have any of these new files, no files newer than May 15th.

image.png (688×1 px, 256 KB)
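For anyone repeating this check without the screenshots, here is a minimal sketch of the two observations above: listing broken symlinks under a dumps directory and finding the most recently modified regular file. The function names are hypothetical and nothing here is taken from the dumps tooling; it only uses the Python standard library.

```python
import os
from pathlib import Path


def broken_symlinks(root):
    """Return the symlinks under root whose targets do not exist."""
    broken = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = Path(dirpath) / name
            # is_symlink() inspects the link itself; exists() follows it.
            if path.is_symlink() and not path.exists():
                broken.append(path)
    return sorted(broken)


def newest_file(root):
    """Return (path, mtime) for the newest regular file under root, or None."""
    newest = None
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            # Ignore symlinks so that a "latest" link does not mask stale data.
            if path.is_symlink() or not path.is_file():
                continue
            mtime = path.stat().st_mtime
            if newest is None or mtime > newest[1]:
                newest = (path, mtime)
    return newest
```

Running `broken_symlinks` against the wikibase directory on the NFS server and `newest_file` against the matching directory on the distribution server would give the same comparison as the two screenshots; the exact directories to pass are left to the operator.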

On the host that generates these dumps, snapshot1017, we have a number of systemd processes running:

image.png (388×1 px, 117 KB)

...and log files being generated currently in /var/log/wikidatadump:

image.png (226×936 px, 72 KB)

So at the moment, my first line of enquiry is the rsync process that should be pulling new files from dumpsdata1006 to clouddumps1002.

Change #1036626 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Configure snapshot1017 to be the misc cron snapshot runner

https://gerrit.wikimedia.org/r/1036626

This is the list of jobs that run on this miscellaneous cron job host:

  • adds-changes
  • categoriesrdf-dump-daily
  • categoriesrdf-dump
  • cirrussearch-dump-s1
  • cirrussearch-dump-s11
  • cirrussearch-dump-s2
  • cirrussearch-dump-s3
  • cirrussearch-dump-s4
  • cirrussearch-dump-s5
  • cirrussearch-dump-s6
  • cirrussearch-dump-s7
  • cirrussearch-dump-s8
  • cirrussearch-dump
  • commonsjson-dump
  • commonsrdf-dump
  • global_blocks_dump
  • growth_mentorship_dump
  • list-media-per-project
  • pagetitles-ns0
  • pagetitles-ns6
  • shorturls
  • wikidatajson-dump
  • wikidatajson-lexemes-dump
  • wikidatardf-all-dumps
  • wikidatardf-lexemes-dumps
  • wikidatardf-truthy-dumps
  • xlation-dumps

Whilst checking with @JAllemandou we think we have found the problem. The NFS server host was set incorrectly on snapshot1017, so it was saving the dumps to dumpsdata1006 instead of dumpsdata1003.

We have a patch to switch the NFS server from dumpsdata1006 to dumpsdata1003. https://gerrit.wikimedia.org/r/c/1036626

However, it won't apply cleanly unless snapshot1017 can unmount /mnt/dumpsdata and re-mount it.
The following active dumps have files open on dumpsdata1006, so the unmount cannot succeed while they are running.

btullis@snapshot1017:/mnt/dumpsdata/otherdumps/wikibase/wikidatawiki$ sudo lsof -N
COMMAND     PID     USER   FD   TYPE DEVICE    SIZE/OFF       NODE NAME
gzip    2376144 dumpsgen    1w   REG   0,53  5719244800 2512942465 /mnt/dumpsdata/xmldatadumps/temp/commonswiki-20240527-cirrussearch-content.json.gz (dumpsdata1006.eqiad.wmnet:/data)
gzip    2539850 dumpsgen    1w   REG   0,53 31442550784 2512943304 /mnt/dumpsdata/xmldatadumps/temp/enwiki-20240527-cirrussearch-general.json.gz (dumpsdata1006.eqiad.wmnet:/data)
gzip    3106988 dumpsgen    1w   REG   0,53  5338562560 2512945446 /mnt/dumpsdata/xmldatadumps/temp/metawiki-20240527-cirrussearch-general.json.gz (dumpsdata1006.eqiad.wmnet:/data)
bash    3194400  btullis  cwd    DIR   0,53        4096  933756942 /mnt/dumpsdata/otherdumps/wikibase/wikidatawiki (dumpsdata1006.eqiad.wmnet:/data)
gzip    3198641 dumpsgen    1w   REG   0,53   710410240 2512945823 /mnt/dumpsdata/xmldatadumps/temp/wikidatawiki-20240528-cirrussearch-content.json.gz (dumpsdata1006.eqiad.wmnet:/data)
gzip    3266326 dumpsgen    1w   REG   0,53    64602112 2512946170 /mnt/dumpsdata/xmldatadumps/temp/wikidata-all.4-batch138.json.gz (dumpsdata1006.eqiad.wmnet:/data)
gzip    3267842 dumpsgen    1w   REG   0,53    44957696 2512946178 /mnt/dumpsdata/xmldatadumps/temp/wikidata-all.6-batch138.json.gz (dumpsdata1006.eqiad.wmnet:/data)
gzip    3268237 dumpsgen    1w   REG   0,53    43843584 2512946185 /mnt/dumpsdata/xmldatadumps/temp/wikidata-all.5-batch138.json.gz (dumpsdata1006.eqiad.wmnet:/data)
gzip    3268406 dumpsgen    1w   REG   0,53    78331904 2512946186 /mnt/dumpsdata/xmldatadumps/temp/wikidatattl-all.0-batch67.gz (dumpsdata1006.eqiad.wmnet:/data)
gzip    3268421 dumpsgen    1w   REG   0,53    77512704 2512946187 /mnt/dumpsdata/xmldatadumps/temp/wikidatattl-all.1-batch67.gz (dumpsdata1006.eqiad.wmnet:/data)
gzip    3268437 dumpsgen    1w   REG   0,53    78659584 2512946188 /mnt/dumpsdata/xmldatadumps/temp/wikidatattl-all.6-batch67.gz (dumpsdata1006.eqiad.wmnet:/data)
gzip    3268451 dumpsgen    1w   REG   0,53    77725696 2512946189 /mnt/dumpsdata/xmldatadumps/temp/wikidatattl-all.2-batch67.gz (dumpsdata1006.eqiad.wmnet:/data)
gzip    3268458 dumpsgen    1w   REG   0,53    77938688 2512946190 /mnt/dumpsdata/xmldatadumps/temp/wikidatattl-all.4-batch67.gz (dumpsdata1006.eqiad.wmnet:/data)
gzip    3268470 dumpsgen    1w   REG   0,53    77430784 2512946191 /mnt/dumpsdata/xmldatadumps/temp/wikidatattl-all.5-batch67.gz (dumpsdata1006.eqiad.wmnet:/data)
gzip    3269092 dumpsgen    1w   REG   0,53    38731776 2512946192 /mnt/dumpsdata/xmldatadumps/temp/wikidata-all.7-batch138.json.gz (dumpsdata1006.eqiad.wmnet:/data)
gzip    3269238 dumpsgen    1w   REG   0,53    70270976 2512946193 /mnt/dumpsdata/xmldatadumps/temp/wikidatattl-all.3-batch67.gz (dumpsdata1006.eqiad.wmnet:/data)
gzip    3270792 dumpsgen    1w   REG   0,53    38354944 2512946165 /mnt/dumpsdata/xmldatadumps/temp/wikidata-all.2-batch143.json.gz (dumpsdata1006.eqiad.wmnet:/data)
gzip    3271304 dumpsgen    1w   REG   0,53    49922048 2512946198 /mnt/dumpsdata/xmldatadumps/temp/wikidatattl-all.7-batch67.gz (dumpsdata1006.eqiad.wmnet:/data)
gzip    3274097 dumpsgen    1w   REG   0,53    20856832 2512946199 /mnt/dumpsdata/xmldatadumps/temp/commonsttl-mediainfo.4-batch171.gz (dumpsdata1006.eqiad.wmnet:/data)
gzip    3274100 dumpsgen    1w   REG   0,53    16072704 2512946200 /mnt/dumpsdata/xmldatadumps/temp/commons-mediainfo.4-batch171.json.gz (dumpsdata1006.eqiad.wmnet:/data)
gzip    3274765 dumpsgen    1w   REG   0,53    14893056 2512946202 /mnt/dumpsdata/xmldatadumps/temp/wikidata-all.1-batch143.json.gz (dumpsdata1006.eqiad.wmnet:/data)
gzip    3275733 dumpsgen    1w   REG   0,53     9732096 2512946173 /mnt/dumpsdata/xmldatadumps/temp/commons-mediainfo.7-batch171.json.gz (dumpsdata1006.eqiad.wmnet:/data)
gzip    3275736 dumpsgen    1w   REG   0,53    11812864 2512946203 /mnt/dumpsdata/xmldatadumps/temp/commonsttl-mediainfo.7-batch171.gz (dumpsdata1006.eqiad.wmnet:/data)
gzip    3276025 dumpsgen    1w   REG   0,53     8388608 2512946206 /mnt/dumpsdata/xmldatadumps/temp/commons-mediainfo.5-batch171.json.gz (dumpsdata1006.eqiad.wmnet:/data)
gzip    3276028 dumpsgen    1w   REG   0,53     9977856 2512946207 /mnt/dumpsdata/xmldatadumps/temp/commonsttl-mediainfo.5-batch171.gz (dumpsdata1006.eqiad.wmnet:/data)
gzip    3276053 dumpsgen    1w   REG   0,53     7323648 2512946208 /mnt/dumpsdata/xmldatadumps/temp/wikidata-all.0-batch143.json.gz (dumpsdata1006.eqiad.wmnet:/data)
gzip    3276225 dumpsgen    1w   REG   0,53     8110080 2512946210 /mnt/dumpsdata/xmldatadumps/temp/commons-mediainfo.6-batch171.json.gz (dumpsdata1006.eqiad.wmnet:/data)
gzip    3276228 dumpsgen    1w   REG   0,53     9633792 2512946211 /mnt/dumpsdata/xmldatadumps/temp/commonsttl-mediainfo.6-batch171.gz (dumpsdata1006.eqiad.wmnet:/data)
gzip    3276783 dumpsgen    1w   REG   0,53   214220800 2512946215 /mnt/dumpsdata/xmldatadumps/temp/pnbwiki-20240527-cirrussearch-content.json.gz (dumpsdata1006.eqiad.wmnet:/data)
gzip    3276860 dumpsgen    1w   REG   0,53     6930432 2512946216 /mnt/dumpsdata/xmldatadumps/temp/commonsttl-mediainfo.0-batch171.gz (dumpsdata1006.eqiad.wmnet:/data)
gzip    3276863 dumpsgen    1w   REG   0,53     5832704 2512946217 /mnt/dumpsdata/xmldatadumps/temp/commons-mediainfo.0-batch171.json.gz (dumpsdata1006.eqiad.wmnet:/data)
gzip    3277238 dumpsgen    1w   REG   0,53     4423680 2512946219 /mnt/dumpsdata/xmldatadumps/temp/commons-mediainfo.1-batch171.json.gz (dumpsdata1006.eqiad.wmnet:/data)
gzip    3277239 dumpsgen    1w   REG   0,53     5357568 2512946218 /mnt/dumpsdata/xmldatadumps/temp/commonsttl-mediainfo.1-batch171.gz (dumpsdata1006.eqiad.wmnet:/data)
gzip    3277312 dumpsgen    1w   REG   0,53     3522560 2512946220 /mnt/dumpsdata/xmldatadumps/temp/commons-mediainfo.3-batch171.json.gz (dumpsdata1006.eqiad.wmnet:/data)
gzip    3277319 dumpsgen    1w   REG   0,53     4390912 2512946221 /mnt/dumpsdata/xmldatadumps/temp/commonsttl-mediainfo.3-batch171.gz (dumpsdata1006.eqiad.wmnet:/data)
gzip    3277452 dumpsgen    1w   REG   0,53     2719744 2512946169 /mnt/dumpsdata/xmldatadumps/temp/commons-mediainfo.2-batch171.json.gz (dumpsdata1006.eqiad.wmnet:/data)
gzip    3277455 dumpsgen    1w   REG   0,53     3473408 2512946222 /mnt/dumpsdata/xmldatadumps/temp/commonsttl-mediainfo.2-batch171.gz (dumpsdata1006.eqiad.wmnet:/data)
gzip    3277749 dumpsgen    1w   REG   0,53     1605632 2512946224 /mnt/dumpsdata/xmldatadumps/temp/wikidata-all.3-batch139.json.gz (dumpsdata1006.eqiad.wmnet:/data)
sudo    3278010     root  cwd    DIR   0,53        4096  933756942 /mnt/dumpsdata/otherdumps/wikibase/wikidatawiki (dumpsdata1006.eqiad.wmnet:/data)
lsof    3278011     root  cwd    DIR   0,53        4096  933756942 /mnt/dumpsdata/otherdumps/wikibase/wikidatawiki (dumpsdata1006.eqiad.wmnet:/data)
lsof    3278012     root  cwd    DIR   0,53        4096  933756942 /mnt/dumpsdata/otherdumps/wikibase/wikidatawiki (dumpsdata1006.eqiad.wmnet:/data)
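A listing this long is easier to act on when grouped by process. The following is a minimal sketch, assuming the nine-column `lsof -N` layout shown above and paths without whitespace; `holders_by_command` is a hypothetical helper, not part of any existing tooling. The two sample rows are copied from the output above.

```python
from collections import defaultdict


def holders_by_command(lsof_text, mount_point="/mnt/dumpsdata"):
    """Map (command, user) to the sorted PIDs holding paths under mount_point."""
    holders = defaultdict(set)
    for line in lsof_text.splitlines():
        fields = line.split()
        # Skip the header row and anything too short to be a data row.
        if len(fields) < 9 or fields[0] == "COMMAND":
            continue
        command, pid, user, name = fields[0], fields[1], fields[2], fields[8]
        if name == mount_point or name.startswith(mount_point + "/"):
            holders[(command, user)].add(int(pid))
    return {key: sorted(pids) for key, pids in holders.items()}


sample = """\
COMMAND     PID     USER   FD   TYPE DEVICE    SIZE/OFF       NODE NAME
gzip    2376144 dumpsgen    1w   REG   0,53  5719244800 2512942465 /mnt/dumpsdata/xmldatadumps/temp/commonswiki-20240527-cirrussearch-content.json.gz (dumpsdata1006.eqiad.wmnet:/data)
bash    3194400  btullis  cwd    DIR   0,53        4096  933756942 /mnt/dumpsdata/otherdumps/wikibase/wikidatawiki (dumpsdata1006.eqiad.wmnet:/data)
"""

for (command, user), pids in holders_by_command(sample).items():
    print(command, user, pids)
```

Grouping this way separates the long-running `gzip` writers owned by dumpsgen from interactive shells that merely have their working directory on the mount, which only need a `cd` elsewhere.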

I believe that we may be able to fix this by running the following on dumpsdata1006:

/usr/local/bin/rsync-via-primary.sh --do_rsync_misc --do_rsync_miscsubs --miscdumpsdir /data/otherdumps --miscremotedirs clouddumps1001.wikimedia.org::data/xmldatadumps/public/other/,clouddumps1002.wikimedia.org::data/xmldatadumps/public/other/ --miscsubdirs incr,categoriesrdf --miscremotesubs dumpsdata1007.eqiad.wmnet::data/otherdumps/

This is the command that is run by the dumps-rsyncer systemd service on dumpsdata1003, as created by the dumps::generation::server::rsyncer_misc class.

We will want to be careful that it doesn't get overwritten by the same timer running again from dumpsdata1003, so it might be best to disable that timer on dumpsdata1003 until https://gerrit.wikimedia.org/r/c/1036626 can be merged and applied.

A dry-run of that command looks like this:

dumpsgen@dumpsdata1003:/data/otherdumps/wikibase/wikidatawiki$ /usr/local/bin/rsync-via-primary.sh --dryrun --do_rsync_misc --do_rsync_miscsubs --miscdumpsdir /data/otherdumps --miscremotedirs clouddumps1001.wikimedia.org::data/xmldatadumps/public/other/,clouddumps1002.wikimedia.org::data/xmldatadumps/public/other/ --miscsubdirs incr,categoriesrdf --miscremotesubs dumpsdata1007.eqiad.wmnet::data/otherdumps/
/usr/bin/rsync -a --contimeout=600 --timeout=600 --bwlimit=80000 /data/otherdumps/categoriesrdf /data/otherdumps/cirrussearch /data/otherdumps/commons /data/otherdumps/contenttranslation /data/otherdumps/globalblocks /data/otherdumps/growthmentorship /data/otherdumps/imageinfo /data/otherdumps/incr /data/otherdumps/machinevision /data/otherdumps/mediatitles /data/otherdumps/pagetitles /data/otherdumps/shorturls /data/otherdumps/testfiles /data/otherdumps/wikibase /data/otherdumps/wikidata clouddumps1001.wikimedia.org::data/xmldatadumps/public/other/
/usr/bin/rsync -a --contimeout=600 --timeout=600 --bwlimit=80000 /data/otherdumps/categoriesrdf /data/otherdumps/cirrussearch /data/otherdumps/commons /data/otherdumps/contenttranslation /data/otherdumps/globalblocks /data/otherdumps/growthmentorship /data/otherdumps/imageinfo /data/otherdumps/incr /data/otherdumps/machinevision /data/otherdumps/mediatitles /data/otherdumps/pagetitles /data/otherdumps/shorturls /data/otherdumps/testfiles /data/otherdumps/wikibase /data/otherdumps/wikidata clouddumps1002.wikimedia.org::data/xmldatadumps/public/other/
/usr/bin/rsync -a --contimeout=600 --timeout=600 --bwlimit=80000 /data/otherdumps/incr dumpsdata1007.eqiad.wmnet::data/otherdumps/
/usr/bin/rsync -a --contimeout=600 --timeout=600 --bwlimit=80000 /data/otherdumps/categoriesrdf dumpsdata1007.eqiad.wmnet::data/otherdumps/
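The dry-run output suggests how the arguments expand into rsync invocations: every top-level directory under `--miscdumpsdir` goes to each `--miscremotedirs` target, and each `--miscsubdirs` entry goes separately to each `--miscremotesubs` target. The sketch below reproduces that expansion. It is inferred only from the output above, not from the source of rsync-via-primary.sh, and `plan_misc_rsync` and its alphabetical ordering of sources are assumptions.

```python
# Flags copied from the dry-run output above.
RSYNC_BASE = [
    "/usr/bin/rsync", "-a", "--contimeout=600", "--timeout=600", "--bwlimit=80000",
]


def plan_misc_rsync(miscdumpsdir, local_subdirs, miscremotedirs,
                    miscsubdirs, miscremotesubs):
    """Return the rsync argument lists implied by the dry-run output."""
    commands = []
    # Assumption: sources are listed alphabetically, as in the dry run.
    sources = [f"{miscdumpsdir}/{name}" for name in sorted(local_subdirs)]
    # One full copy of every local directory per public remote.
    for remote in miscremotedirs:
        commands.append(RSYNC_BASE + sources + [remote])
    # One command per selected subdirectory per secondary remote.
    for remote in miscremotesubs:
        for sub in miscsubdirs:
            commands.append(RSYNC_BASE + [f"{miscdumpsdir}/{sub}", remote])
    return commands


plan = plan_misc_rsync(
    "/data/otherdumps",
    ["incr", "categoriesrdf", "wikibase"],
    ["clouddumps1001.wikimedia.org::data/xmldatadumps/public/other/",
     "clouddumps1002.wikimedia.org::data/xmldatadumps/public/other/"],
    ["incr", "categoriesrdf"],
    ["dumpsdata1007.eqiad.wmnet::data/otherdumps/"],
)
for command in plan:
    print(" ".join(command))
```

The practical consequence is that only `incr` and `categoriesrdf` reach dumpsdata1007, while the clouddumps hosts receive everything under /data/otherdumps.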

I have disabled puppet on dumpsdata1003 and temporarily disabled the dumps-rsyncer systemd timer.

btullis@dumpsdata1003:~$ sudo disable-puppet btullis-T366043
btullis@dumpsdata1003:~$ sudo systemctl disable dumps-rsyncer
Removed /etc/systemd/system/multi-user.target.wants/dumps-rsyncer.service.
btullis@dumpsdata1003:~$ sudo systemctl stop dumps-rsyncer
btullis@dumpsdata1003:~$ sudo systemctl status dumps-rsyncer
● dumps-rsyncer.service - Dumps misc rsyncer service
     Loaded: loaded (/lib/systemd/system/dumps-rsyncer.service; disabled; vendor preset: enabled)
     Active: inactive (dead)

May 28 11:52:39 dumpsdata1003 systemd[1]: Stopping Dumps misc rsyncer service...
May 28 11:52:39 dumpsdata1003 systemd[1]: dumps-rsyncer.service: Succeeded.
May 28 11:52:39 dumpsdata1003 systemd[1]: Stopped Dumps misc rsyncer service.
May 28 11:52:39 dumpsdata1003 systemd[1]: dumps-rsyncer.service: Consumed 4d 8h 5min 23.068s CPU time.

I will proceed to run the dumps sync using the ad-hoc command above.

Running the following command as the dumpsgen user, in a screen session, on dumpsdata1006.

dumpsgen@dumpsdata1006:/home/btullis$ /usr/local/bin/rsync-via-primary.sh --do_rsync_misc --do_rsync_miscsubs --miscdumpsdir /data/otherdumps --miscremotedirs clouddumps1001.wikimedia.org::data/xmldatadumps/public/other/,clouddumps1002.wikimedia.org::data/xmldatadumps/public/other/ --miscsubdirs incr,categoriesrdf --miscremotesubs dumpsdata1007.eqiad.wmnet::data/otherdumps/

Screenshot_20240528_191318_Firefox.jpg (1×1 px, 314 KB)

It looks like https://dumps.wikimedia.org/other/categoriesrdf/daily/ has been populated with the latest daily dumps. The others haven't changed yet.

@dcausse are you able to rerun the airflow DAGs on the new dailies yet?

@BTullis thanks! Categories are reloaded via a cronjob on all WDQS machines; the job is about to run in 30 mins.

I can confirm that this fixed the issue for the wdqs categories lag; alerts are resolving.

Gehel triaged this task as High priority.
Gehel moved this task from Incoming to 2024.05.27 - 2024.06.16 on the Data-Platform-SRE board.

Change #1036626 merged by Btullis:

[operations/puppet@production] Configure snapshot1017 to be the misc cron snapshot runner

https://gerrit.wikimedia.org/r/1036626

I had to kill three running wikibase dumps on snapshot1017:

  • commonsrdf-dump
  • commonsjson-dump
  • wikidatajson-dump

...and then I was able to change the NFS server with puppet.

Notice: /Stage[main]/Profile::Dumps::Generation::Worker::Common/Snapshot::Dumps::Datamount[dumpsdatamount]/Mount[/mnt/dumpsdata]/device: device changed 'dumpsdata1006.eqiad.wmnet:/data' to 'dumpsdata1003.eqiad.wmnet:/data'
Info: Computing checksum on file /etc/fstab
Info: /Stage[main]/Profile::Dumps::Generation::Worker::Common/Snapshot::Dumps::Datamount[dumpsdatamount]/Mount[/mnt/dumpsdata]: Scheduling refresh of Mount[/mnt/dumpsdata]
Info: Mount[/mnt/dumpsdata](provider=parsed): Remounting
Notice: /Stage[main]/Profile::Dumps::Generation::Worker::Common/Snapshot::Dumps::Datamount[dumpsdatamount]/Mount[/mnt/dumpsdata]: Triggered 'refresh' from 1 event
Info: /Stage[main]/Profile::Dumps::Generation::Worker::Common/Snapshot::Dumps::Datamount[dumpsdatamount]/Mount[/mnt/dumpsdata]: Scheduling refresh of Mount[/mnt/dumpsdata]
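After a remount like the one logged above, one way to confirm which NFS server now backs /mnt/dumpsdata is to read the mount table. This is a minimal sketch that parses text in the /proc/mounts format (device, mount point, filesystem type, options, and two numeric fields). The helper names are hypothetical, and the sample line, including its filesystem type and options, is illustrative rather than copied from snapshot1017.

```python
def nfs_device_for(mounts_text, mount_point):
    """Return the device mounted at mount_point, or None if nothing is mounted."""
    device = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] == mount_point:
            # Keep the last match: later mounts shadow earlier ones.
            device = fields[0]
    return device


def nfs_server_for(mounts_text, mount_point):
    """Return only the host part of an NFS device such as host:/export."""
    device = nfs_device_for(mounts_text, mount_point)
    if device is None or ":" not in device:
        return None
    return device.split(":", 1)[0]


# Illustrative sample in /proc/mounts format; not real output.
sample_mounts = """\
/dev/sda1 / ext4 rw,relatime 0 0
dumpsdata1003.eqiad.wmnet:/data /mnt/dumpsdata nfs rw,relatime 0 0
"""

print(nfs_server_for(sample_mounts, "/mnt/dumpsdata"))
```

On a live host the first argument would be the contents of /proc/mounts, for example `open("/proc/mounts").read()`.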

I will restart these services, but before doing so I will check that the rsync-via-primary script that is running on dumpsdata1006 has copied the latest versions to dumpsdata1003.

I briefly attempted to sync the relevant contents of /data/xmldatadumps/temp from dumpsdata1006 to dumpsdata1003, but this was excluded by the rsync configuration.

So I have just restarted the dumps on snapshot1017 with:

btullis@snapshot1017:~$ sudo systemctl restart commonsjson-dump.service commonsrdf-dump.service wikidatajson-dump.service

I believe that this is now resolved, but please do let me know if you feel that this is incorrect.