Hi, on the main page (Wikimedia Downloads) there are a lot of wikis marked as done, but in detail not all files have been created. For example, for itwiki the main page says "2025-01-05 17:21:52 itwiki: Dump complete", but when I open the download page I see a lot of files not ready, for example "waiting All pages, current versions only."
Regards
Valter
To explain myself better:
in https://dumps.wikimedia.org/backup-index-bydb.html they are all 'Done' except Commons; in reality, most (perhaps all) of the dumps are incomplete.
Thanks for the report.
We did temporarily disable the enwiki dumps (T368098#10420647), but that should not have affected other wikis such as itwiki.
Will investigate.
From snapshot1014:
itwiki run for 20250101 appears done:
xcollazo@snapshot1014:/mnt/dumpsdata/xmldatadumps/private/itwiki/20250101$ tail -n 1 dumplog.txt
2025-01-05 17:22:34: itwiki SUCCESS: done.
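(For a broader check, something like the following sketch can sweep all wikis under the same private directory layout and flag any run whose dumplog does not end with a SUCCESS marker; the path pattern is taken from the session above, the loop itself is just an illustration:)
for d in /mnt/dumpsdata/xmldatadumps/private/*/20250101; do
  # flag runs whose last dumplog line lacks the SUCCESS marker
  tail -n 1 "$d/dumplog.txt" 2>/dev/null | grep -q 'SUCCESS' || echo "no SUCCESS marker: $d"
done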
When visiting https://dumps.wikimedia.org/itwiki/20250101/, however, as reported there are missing artifacts:
waiting All pages with complete edit history (.7z)
waiting All pages with complete page edit history (.bz2)
waiting Recombine Log events to all pages and users
itwiki-20250101-pages-logging.xml.gz
waiting Log events to all pages and users. This contains the log of actions performed on pages and users.
itwiki-20250101-pages-logging1.xml.gz
itwiki-20250101-pages-logging2.xml.gz
itwiki-20250101-pages-logging3.xml.gz
itwiki-20250101-pages-logging4.xml.gz
itwiki-20250101-pages-logging5.xml.gz
itwiki-20250101-pages-logging6.xml.gz
waiting Recombine all pages, current versions only.
itwiki-20250101-pages-meta-current.xml.bz2
waiting All pages, current versions only.
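(As a side note, a quick programmatic way to list the jobs that are not done is to query the per-run dumpstatus.json; this is only a sketch, assuming the usual jobs/status layout of that file and that jq is available:)
curl -s https://dumps.wikimedia.org/itwiki/20250101/dumpstatus.json \
  | jq -r '.jobs | to_entries[] | select(.value.status != "done") | .key'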
Taking the waiting artifact "All pages with complete edit history (.7z)" as an example, the logs claim it was indeed generated:
xcollazo@snapshot1014:/mnt/dumpsdata/xmldatadumps/private/itwiki/20250101$ cat dumplog.txt | grep meta-history | tail -n 10
2025-01-05 17:21:13: itwiki Checksumming itwiki-20250101-pages-meta-history6.xml-p9316862p9633128.7z via md5
2025-01-05 17:21:18: itwiki Checksumming itwiki-20250101-pages-meta-history6.xml-p9316862p9633128.7z via sha1
2025-01-05 17:21:20: itwiki Adding symlink /mnt/dumpsdata/xmldatadumps/public/itwiki/latest/itwiki-latest-pages-meta-history6.xml-p9633129p10012680.7z -> ../20250101/itwiki-20250101-pages-meta-history6.xml-p9633129p10012680.7z
2025-01-05 17:21:20: itwiki adding rss feed file /mnt/dumpsdata/xmldatadumps/public/itwiki/latest/itwiki-latest-pages-meta-history6.xml-p9633129p10012680.7z-rss.xml
2025-01-05 17:21:20: itwiki Checksumming itwiki-20250101-pages-meta-history6.xml-p9633129p10012680.7z via md5
2025-01-05 17:21:26: itwiki Checksumming itwiki-20250101-pages-meta-history6.xml-p9633129p10012680.7z via sha1
2025-01-05 17:21:27: itwiki Adding symlink /mnt/dumpsdata/xmldatadumps/public/itwiki/latest/itwiki-latest-pages-meta-history6.xml-p10012681p10343795.7z -> ../20250101/itwiki-20250101-pages-meta-history6.xml-p10012681p10343795.7z
2025-01-05 17:21:27: itwiki adding rss feed file /mnt/dumpsdata/xmldatadumps/public/itwiki/latest/itwiki-latest-pages-meta-history6.xml-p10012681p10343795.7z-rss.xml
2025-01-05 17:21:27: itwiki Checksumming itwiki-20250101-pages-meta-history6.xml-p10012681p10343795.7z via md5
2025-01-05 17:21:31: itwiki Checksumming itwiki-20250101-pages-meta-history6.xml-p10012681p10343795.7z via sha1
And if I try to fetch the files from snapshot1014, they are indeed available:
xcollazo@snapshot1014:/mnt/dumpsdata/xmldatadumps/public/itwiki/20250101$ ls -lsha *meta-history*.bz2 | tail -n 10
1.8G -rw-r--r-- 1 dumpsgen dumpsgen 1.8G Jan 5 10:55 itwiki-20250101-pages-meta-history6.xml-p7301333p7640794.bz2
2.1G -rw-r--r-- 1 dumpsgen dumpsgen 2.1G Jan 5 11:00 itwiki-20250101-pages-meta-history6.xml-p7640795p7847892.bz2
972M -rw-r--r-- 1 dumpsgen dumpsgen 972M Jan 5 10:47 itwiki-20250101-pages-meta-history6.xml-p7847893p8084008.bz2
1.6G -rw-r--r-- 1 dumpsgen dumpsgen 1.6G Jan 5 10:52 itwiki-20250101-pages-meta-history6.xml-p8084009p8293262.bz2
1.1G -rw-r--r-- 1 dumpsgen dumpsgen 1.1G Jan 5 10:47 itwiki-20250101-pages-meta-history6.xml-p8293263p8520399.bz2
1.4G -rw-r--r-- 1 dumpsgen dumpsgen 1.4G Jan 5 10:51 itwiki-20250101-pages-meta-history6.xml-p8520400p8742496.bz2
1.3G -rw-r--r-- 1 dumpsgen dumpsgen 1.3G Jan 5 11:26 itwiki-20250101-pages-meta-history6.xml-p8742497p8991348.bz2
1.3G -rw-r--r-- 1 dumpsgen dumpsgen 1.3G Jan 5 11:28 itwiki-20250101-pages-meta-history6.xml-p8991349p9316861.bz2
1.2G -rw-r--r-- 1 dumpsgen dumpsgen 1.2G Jan 5 11:32 itwiki-20250101-pages-meta-history6.xml-p9316862p9633128.bz2
1.3G -rw-r--r-- 1 dumpsgen dumpsgen 1.3G Jan 5 11:32 itwiki-20250101-pages-meta-history6.xml-p9633129p10012680.bz2
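(As an extra sanity check, the on-disk files can be verified against the published checksums; a sketch, assuming the usual itwiki-20250101-md5sums.txt file is present in the same directory:)
cd /mnt/dumpsdata/xmldatadumps/public/itwiki/20250101
# verify only the meta-history6 bz2 parts listed above
grep 'pages-meta-history6.*\.bz2' itwiki-20250101-md5sums.txt | md5sum -c -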
So it looks like we need to run the rsync script manually? @Milimetric can you share the steps?
@BTullis Dan seems OOO till next week. Do you recall how to run the rsync script manually?
There is a service that runs continuously on dumpsdata1006.
The service is called dumps-rsyncer.service
It claims to be running:
btullis@dumpsdata1006:~$ systemctl status dumps-rsyncer.service
● dumps-rsyncer.service - Dumps rsyncer service
Loaded: loaded (/lib/systemd/system/dumps-rsyncer.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2024-08-05 14:39:35 UTC; 5 months 4 days ago
Main PID: 1301 (bash)
Tasks: 2 (limit: 76753)
Memory: 1.9G
CPU: 1w 2d 17h 48min 6.927s
CGroup: /system.slice/dumps-rsyncer.service
├─ 1301 /bin/bash /usr/local/bin/rsync-via-primary.sh --do_tarball --do_rsync_xml --xmldumpsdir /data/xmldatadumps/public --xmlremotedirs dumpsdata1007.eqiad.wmnet::data/xmldatadumps/public/,clou>
└─2524573 sleep 600
Jan 09 14:36:50 dumpsdata1006 dumps-rsyncer[2524480]: ls: write error: Broken pipe
Jan 09 14:36:57 dumpsdata1006 dumps-rsyncer[2524504]: ls: write error: Broken pipe
Jan 09 14:37:05 dumpsdata1006 dumps-rsyncer[2524526]: ls: write error: Broken pipe
But there are some broken pipe errors there, plus it is in a sleep 600 part of the cycle.
btullis@dumpsdata1006:~$ pgrep -fa rsync
1301 /bin/bash /usr/local/bin/rsync-via-primary.sh --do_tarball --do_rsync_xml --xmldumpsdir /data/xmldatadumps/public --xmlremotedirs dumpsdata1007.eqiad.wmnet::data/xmldatadumps/public/,clouddumps1001.wikimedia.org::data/xmldatadumps/public/,clouddumps1002.wikimedia.org::data/xmldatadumps/public/
1343 /usr/bin/rsync --daemon --no-detach
The command that it actually runs is called /usr/local/bin/rsync-via-primary.sh and it syncs to three destination servers.
I will restart the service to see if it kicks in properly.
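(For reference, the restart itself is just the standard systemd call, run with root privileges:)
sudo systemctl restart dumps-rsyncer.service
# then check that it came back up and review its recent log output
sudo journalctl -u dumps-rsyncer.service --since '10 minutes ago'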
btullis@dumpsdata1006:~$ systemctl status dumps-rsyncer.service
● dumps-rsyncer.service - Dumps rsyncer service
Loaded: loaded (/lib/systemd/system/dumps-rsyncer.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2025-01-09 14:49:26 UTC; 2s ago
Main PID: 2525445 (bash)
Tasks: 2 (limit: 76753)
Memory: 6.7M
CPU: 2.565s
CGroup: /system.slice/dumps-rsyncer.service
├─2525445 /bin/bash /usr/local/bin/rsync-via-primary.sh --do_tarball --do_rsync_xml --xmldumpsdir /data/xmldatadumps/public --xmlremotedirs dumpsdata1007.eqiad.wmnet::data/xmldatadumps/public/,clou>
└─2525469 /usr/bin/rsync -a --contimeout=600 --timeout=600 --bwlimit=80000 --exclude=**bad/ --exclude=**save/ --exclude=**not/ --exclude=**temp/ --exclude=**tmp/ --exclude=*.inprog --exclude=*.html>
Jan 09 14:49:26 dumpsdata1006 systemd[1]: Started Dumps rsyncer service.
Jan 09 14:49:26 dumpsdata1006 dumps-rsyncer[2525449]: ls: write error: Broken pipe
I think that it's the HTML files that are not being synced properly.
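(The systemctl output above truncates the rsync arguments; to read the full exclude list of the running rsync child, one option is to print its complete command line, using the child PID shown in the status output above:)
ps -ww -o args= -p 2525469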
Here we can see that the index.html file for itwiki/202501 on dumpsdata1006 shows that the dump is complete.
btullis@dumpsdata1006:/data/xmldatadumps/public/itwiki/20250101$ grep -A1 'class="status"' index.html
<p class="status">
<span class='done'>Dump complete</span>
Running the same command on clouddumps1002 shows the partial dumps.
btullis@clouddumps1002:/srv/dumps/xmldatadumps/public/itwiki/20250101$ grep -A1 'class="status"' index.html
<p class="status">
<span class='partial-dump'>Partial dump</span>
The dumps-rsyncer.service that I restarted on dumpsdata1006 specifically excludes *.html files, so I think that it must be another sync process that is broken.
I think that it is something to do with this make_statusfiles_tarball call here.
It is supposed to create a tarball of all of the most recent html and status files here.
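(To make the intent concrete, here is a loose sketch of what that step should end up bundling, written against the public dumps directory used above; this is an illustration of the expected contents, not the actual make_statusfiles_tarball implementation:)
cd /data/xmldatadumps/public
# gather the per-wiki index.html plus checksum/status files and bundle them
find . -maxdepth 3 \( -name 'index.html' -o -name '*-md5sums.txt' -o -name '*-sha1sums.txt' -o -name 'dumpstatus.json' \) -print0 \
  | tar czf dumpstatusfiles.tar.gz --null -T -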
The file is up-to-date on clouddumps1002...
btullis@clouddumps1002:~$ ls -l /srv/dumps/xmldatadumps/public/dumpstatusfiles.tar.gz
-rw-r--r-- 1 root root 3619402 Jan 9 19:19 /srv/dumps/xmldatadumps/public/dumpstatusfiles.tar.gz
But when extracting it, we can see that it only contains the md5 and sha1 checksum files. No HTML files per wiki.
btullis@clouddumps1002:~$ tar xzvf /srv/dumps/xmldatadumps/public/dumpstatusfiles.tar.gz
btullis@clouddumps1002:~$ tree | head -n 30
.
├── 404.html
├── aawiki
│   └── latest
│       ├── aawiki-latest-md5sums.txt -> ../20250101/aawiki-20250101-md5sums.txt
│       └── aawiki-latest-sha1sums.txt -> ../20250101/aawiki-20250101-sha1sums.txt
├── aawikibooks
│   └── latest
│       ├── aawikibooks-latest-md5sums.txt -> ../20250101/aawikibooks-20250101-md5sums.txt
│       └── aawikibooks-latest-sha1sums.txt -> ../20250101/aawikibooks-20250101-sha1sums.txt
├── aawiktionary
│   └── latest
│       ├── aawiktionary-latest-md5sums.txt -> ../20250101/aawiktionary-20250101-md5sums.txt
│       └── aawiktionary-latest-sha1sums.txt -> ../20250101/aawiktionary-20250101-sha1sums.txt
├── abwiki
│   └── latest
│       ├── abwiki-latest-md5sums.txt -> ../20250101/abwiki-20250101-md5sums.txt
│       └── abwiki-latest-sha1sums.txt -> ../20250101/abwiki-20250101-sha1sums.txt
├── abwiktionary
│   └── latest
│       ├── abwiktionary-latest-md5sums.txt -> ../20250101/abwiktionary-20250101-md5sums.txt
│       └── abwiktionary-latest-sha1sums.txt -> ../20250101/abwiktionary-20250101-sha1sums.txt
├── acewiki
│   └── latest
│       ├── acewiki-latest-md5sums.txt -> ../20250101/acewiki-20250101-md5sums.txt
│       └── acewiki-latest-sha1sums.txt -> ../20250101/acewiki-20250101-sha1sums.txt
├── advisorywiki
│   └── latest
│       ├── advisorywiki-latest-md5sums.txt -> ../20250101/advisorywiki-20250101-md5sums.txt
│       └── advisorywiki-latest-sha1sums.txt -> ../20250101/advisorywiki-20250101-sha1sums.txt
btullis@clouddumps1002:~$ ls -l *.html
-rw-r--r-- 1 btullis wikidev 127 Jul 17 2009 404.html
-rw-r--r-- 1 btullis wikidev 125918 Jan 9 19:18 backup-index-bydb.html
-rw-r--r-- 1 btullis wikidev 125923 Jan 9 19:18 backup-index.html
-rw-r--r-- 1 btullis wikidev 107505 Mar 25 2014 backup-index-sorted.html
-rw-r--r-- 1 btullis wikidev 108044 Jul 2 2014 backup-index-test-bydb.html
-rw-r--r-- 1 btullis wikidev 108049 Jul 2 2014 backup-index-test.html
-rwxr-xr-x 1 btullis wikidev 2392 Sep 7 2011 backups-of-old-wikis.html
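(For future reference, the same gap can be confirmed without extracting, by listing the archive and looking for per-wiki index.html entries, which a complete tarball would be expected to contain:)
tar tzf /srv/dumps/xmldatadumps/public/dumpstatusfiles.tar.gz | grep '/index\.html$'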
IIRC, we've been here before. Perhaps we should wait until @Milimetric comes back; he had a manual way of resyncing all this, it's just that, AFAIK, it is not documented.
Dan points to T364045#10019074 as the procedure.
I will attempt this now for itwiki and see if it works.
$ hostname -f
dumpsdata1006.eqiad.wmnet
/usr/bin/rsync -va --contimeout=600 --timeout=600 --exclude=**bad/ --exclude=**save/ --exclude=**not/ --exclude=**temp/ --exclude=**tmp/ --exclude=*.inprog --exclude=*.txt /data/xmldatadumps/public/itwiki dumpsdata1007.eqiad.wmnet::data/xmldatadumps/public/
/usr/bin/rsync -va --contimeout=600 --timeout=600 --exclude=**bad/ --exclude=**save/ --exclude=**not/ --exclude=**temp/ --exclude=**tmp/ --exclude=*.inprog --exclude=*.txt /data/xmldatadumps/public/itwiki clouddumps1001.wikimedia.org::data/xmldatadumps/public/
/usr/bin/rsync -va --contimeout=600 --timeout=600 --exclude=**bad/ --exclude=**save/ --exclude=**not/ --exclude=**temp/ --exclude=**tmp/ --exclude=*.inprog --exclude=*.txt /data/xmldatadumps/public/itwiki clouddumps1002.wikimedia.org::data/xmldatadumps/public/
Now when we visit https://dumps.wikimedia.org/itwiki/20250101/, we see all files as expected.
So this is it. Seems like we have to run this manually for all wikis.
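One way to do that would be a per-wiki loop over the public directory, reusing the same flags as the itwiki command above; this is only a sketch, and the wildcard form used below achieves the same effect in a single invocation per destination:
for wiki in /data/xmldatadumps/public/*/; do
  wiki="${wiki%/}"   # drop the trailing slash so rsync recreates the per-wiki directory at the destination
  for dest in dumpsdata1007.eqiad.wmnet clouddumps1001.wikimedia.org clouddumps1002.wikimedia.org; do
    /usr/bin/rsync -va --contimeout=600 --timeout=600 \
      --exclude='**bad/' --exclude='**save/' --exclude='**not/' --exclude='**temp/' --exclude='**tmp/' \
      --exclude='*.inprog' --exclude='*.txt' \
      "$wiki" "${dest}::data/xmldatadumps/public/"
  done
done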
Ran with a wildcard to hit all wikis:
$ hostname -f
dumpsdata1006.eqiad.wmnet
/usr/bin/rsync -va --contimeout=600 --timeout=600 --exclude=**bad/ --exclude=**save/ --exclude=**not/ --exclude=**temp/ --exclude=**tmp/ --exclude=*.inprog --exclude=*.txt /data/xmldatadumps/public/* dumpsdata1007.eqiad.wmnet::data/xmldatadumps/public/
/usr/bin/rsync -va --contimeout=600 --timeout=600 --exclude=**bad/ --exclude=**save/ --exclude=**not/ --exclude=**temp/ --exclude=**tmp/ --exclude=*.inprog --exclude=*.txt /data/xmldatadumps/public/* clouddumps1001.wikimedia.org::data/xmldatadumps/public/
/usr/bin/rsync -va --contimeout=600 --timeout=600 --exclude=**bad/ --exclude=**save/ --exclude=**not/ --exclude=**temp/ --exclude=**tmp/ --exclude=*.inprog --exclude=*.txt /data/xmldatadumps/public/* clouddumps1002.wikimedia.org::data/xmldatadumps/public/
Tried a couple wikis:
https://dumps.wikimedia.org/eswiki/20250101/
https://dumps.wikimedia.org/simplewiki/20250101/
https://dumps.wikimedia.org/dewiki/20250101/
They all look good.
@ValterVB please confirm things look good from your side, and if so, please close this ticket. Thanks!
Thanks @xcollazo. Will this delay impact the timeliness of any of the downstream pipelines?
This particular issue will not affect timelines. But, there is another issue that might: T383568.
Again the same problem: "2025-01-23 20:35:30 itwiki: Dump complete", but some dumps aren't ready, for example "waiting All pages, current versions only."
I've applied the same fix as in T383030#10449493.
Verified that itwiki looks good at https://dumps.wikimedia.org/itwiki/20250120/.
@ValterVB please verify.
Seems frwiki at https://dumps.wikimedia.org/frwiki/20250120/ has the same problem: "Partial dump" for many days, with no progress visible.
There are 2 skipped files on itwiki (also on other wikis):
2025-01-23 18:58:22 skipped All pages with complete edit history (.7z)
2025-01-23 18:58:22 skipped All pages with complete page edit history (.bz2)
I don't need them, but maybe they are a sign of problems.
Thanks
We typically do two runs per month. The one that starts on the 1st of the month includes those two jobs you mention.
The one that starts on the 20th skips them. So I think we are good.