
Wikimedia Downloads not complete
Closed, ResolvedPublic

Description

Hi, on the main page (Wikimedia Downloads) many wikis are marked as done, but in the detail pages not all files have been created. For example, for itwiki the main page shows "2025-01-05 17:21:52 itwiki: Dump complete", but when I open the download page I see many files that are not ready, for example "waiting All pages, current versions only."
Regards
Valter

Event Timeline

To explain myself better:
in https://dumps.wikimedia.org/backup-index-bydb.html all wikis are marked 'Done' except Commons; in reality, most (perhaps all) of the dumps are incomplete.

Thanks for the report.

We did temporarily disable the enwiki dumps (T368098#10420647), but that should not have affected other wikis such as itwiki.

Will investigate.

From snapshot1014:

The itwiki run for 20250101 appears done:

xcollazo@snapshot1014:/mnt/dumpsdata/xmldatadumps/private/itwiki/20250101$ tail -n 1 dumplog.txt 
2025-01-05 17:22:34: itwiki SUCCESS: done.

When visiting https://dumps.wikimedia.org/itwiki/20250101/, however, as reported, there are missing artifacts:

waiting All pages with complete edit history (.7z)
waiting All pages with complete page edit history (.bz2)
waiting Recombine Log events to all pages and users
itwiki-20250101-pages-logging.xml.gz
waiting Log events to all pages and users.
This contains the log of actions performed on pages and users.
itwiki-20250101-pages-logging1.xml.gz
itwiki-20250101-pages-logging2.xml.gz
itwiki-20250101-pages-logging3.xml.gz
itwiki-20250101-pages-logging4.xml.gz
itwiki-20250101-pages-logging5.xml.gz
itwiki-20250101-pages-logging6.xml.gz
waiting Recombine all pages, current versions only.
itwiki-20250101-pages-meta-current.xml.bz2
waiting All pages, current versions only.

Taking the artifact "All pages with complete edit history (.7z)" as an example, the logs claim it was indeed generated:

xcollazo@snapshot1014:/mnt/dumpsdata/xmldatadumps/private/itwiki/20250101$ cat dumplog.txt | grep meta-history | tail -n 10
2025-01-05 17:21:13: itwiki Checksumming itwiki-20250101-pages-meta-history6.xml-p9316862p9633128.7z via md5
2025-01-05 17:21:18: itwiki Checksumming itwiki-20250101-pages-meta-history6.xml-p9316862p9633128.7z via sha1
2025-01-05 17:21:20: itwiki Adding symlink /mnt/dumpsdata/xmldatadumps/public/itwiki/latest/itwiki-latest-pages-meta-history6.xml-p9633129p10012680.7z -> ../20250101/itwiki-20250101-pages-meta-history6.xml-p9633129p10012680.7z
2025-01-05 17:21:20: itwiki adding rss feed file /mnt/dumpsdata/xmldatadumps/public/itwiki/latest/itwiki-latest-pages-meta-history6.xml-p9633129p10012680.7z-rss.xml 
2025-01-05 17:21:20: itwiki Checksumming itwiki-20250101-pages-meta-history6.xml-p9633129p10012680.7z via md5
2025-01-05 17:21:26: itwiki Checksumming itwiki-20250101-pages-meta-history6.xml-p9633129p10012680.7z via sha1
2025-01-05 17:21:27: itwiki Adding symlink /mnt/dumpsdata/xmldatadumps/public/itwiki/latest/itwiki-latest-pages-meta-history6.xml-p10012681p10343795.7z -> ../20250101/itwiki-20250101-pages-meta-history6.xml-p10012681p10343795.7z
2025-01-05 17:21:27: itwiki adding rss feed file /mnt/dumpsdata/xmldatadumps/public/itwiki/latest/itwiki-latest-pages-meta-history6.xml-p10012681p10343795.7z-rss.xml 
2025-01-05 17:21:27: itwiki Checksumming itwiki-20250101-pages-meta-history6.xml-p10012681p10343795.7z via md5
2025-01-05 17:21:31: itwiki Checksumming itwiki-20250101-pages-meta-history6.xml-p10012681p10343795.7z via sha1

And if I look for the files on snapshot1014, they are indeed there:

xcollazo@snapshot1014:/mnt/dumpsdata/xmldatadumps/public/itwiki/20250101$ ls -lsha *meta-history*.bz2 | tail -n 10
 1.8G -rw-r--r-- 1 dumpsgen dumpsgen  1.8G Jan  5 10:55 itwiki-20250101-pages-meta-history6.xml-p7301333p7640794.bz2
 2.1G -rw-r--r-- 1 dumpsgen dumpsgen  2.1G Jan  5 11:00 itwiki-20250101-pages-meta-history6.xml-p7640795p7847892.bz2
 972M -rw-r--r-- 1 dumpsgen dumpsgen  972M Jan  5 10:47 itwiki-20250101-pages-meta-history6.xml-p7847893p8084008.bz2
 1.6G -rw-r--r-- 1 dumpsgen dumpsgen  1.6G Jan  5 10:52 itwiki-20250101-pages-meta-history6.xml-p8084009p8293262.bz2
 1.1G -rw-r--r-- 1 dumpsgen dumpsgen  1.1G Jan  5 10:47 itwiki-20250101-pages-meta-history6.xml-p8293263p8520399.bz2
 1.4G -rw-r--r-- 1 dumpsgen dumpsgen  1.4G Jan  5 10:51 itwiki-20250101-pages-meta-history6.xml-p8520400p8742496.bz2
 1.3G -rw-r--r-- 1 dumpsgen dumpsgen  1.3G Jan  5 11:26 itwiki-20250101-pages-meta-history6.xml-p8742497p8991348.bz2
 1.3G -rw-r--r-- 1 dumpsgen dumpsgen  1.3G Jan  5 11:28 itwiki-20250101-pages-meta-history6.xml-p8991349p9316861.bz2
 1.2G -rw-r--r-- 1 dumpsgen dumpsgen  1.2G Jan  5 11:32 itwiki-20250101-pages-meta-history6.xml-p9316862p9633128.bz2
 1.3G -rw-r--r-- 1 dumpsgen dumpsgen  1.3G Jan  5 11:32 itwiki-20250101-pages-meta-history6.xml-p9633129p10012680.bz2

So it looks like we need to run the rsync script manually? @Milimetric, can you share the steps?

@BTullis Dan seems OOO till next week. Do you recall how to run the rsync script manually?

There is a service that runs continuously on dumpsdata1006.
The service is called dumps-rsyncer.service.
It claims to be running:

btullis@dumpsdata1006:~$ systemctl status dumps-rsyncer.service 
● dumps-rsyncer.service - Dumps rsyncer service
     Loaded: loaded (/lib/systemd/system/dumps-rsyncer.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2024-08-05 14:39:35 UTC; 5 months 4 days ago
   Main PID: 1301 (bash)
      Tasks: 2 (limit: 76753)
     Memory: 1.9G
        CPU: 1w 2d 17h 48min 6.927s
     CGroup: /system.slice/dumps-rsyncer.service
             ├─   1301 /bin/bash /usr/local/bin/rsync-via-primary.sh --do_tarball --do_rsync_xml --xmldumpsdir /data/xmldatadumps/public --xmlremotedirs dumpsdata1007.eqiad.wmnet::data/xmldatadumps/public/,clou>
             └─2524573 sleep 600

Jan 09 14:36:50 dumpsdata1006 dumps-rsyncer[2524480]: ls: write error: Broken pipe
Jan 09 14:36:57 dumpsdata1006 dumps-rsyncer[2524504]: ls: write error: Broken pipe
Jan 09 14:37:05 dumpsdata1006 dumps-rsyncer[2524526]: ls: write error: Broken pipe

But there are some broken pipe errors there, plus it is in the sleep 600 part of its cycle.

btullis@dumpsdata1006:~$ pgrep -fa rsync
1301 /bin/bash /usr/local/bin/rsync-via-primary.sh --do_tarball --do_rsync_xml --xmldumpsdir /data/xmldatadumps/public --xmlremotedirs dumpsdata1007.eqiad.wmnet::data/xmldatadumps/public/,clouddumps1001.wikimedia.org::data/xmldatadumps/public/,clouddumps1002.wikimedia.org::data/xmldatadumps/public/
1343 /usr/bin/rsync --daemon --no-detach

The command that it actually runs is called /usr/local/bin/rsync-via-primary.sh and it syncs to three destination servers.
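The --xmlremotedirs flag takes those destinations as a single comma-separated list. As a rough sketch (assumed; the real script's parsing may differ), splitting such a list into individual targets looks like this:

```shell
# Hypothetical illustration: split a comma-separated --xmlremotedirs value
# into individual rsync targets, as the wrapper script presumably does.
remotes='dumpsdata1007.eqiad.wmnet::data/xmldatadumps/public/,clouddumps1001.wikimedia.org::data/xmldatadumps/public/,clouddumps1002.wikimedia.org::data/xmldatadumps/public/'
echo "$remotes" | tr ',' '\n' | while read -r target; do
  echo "would rsync to: $target"
done
```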

I will restart the service to see if it kicks in properly.

btullis@dumpsdata1006:~$ systemctl status dumps-rsyncer.service 
● dumps-rsyncer.service - Dumps rsyncer service
     Loaded: loaded (/lib/systemd/system/dumps-rsyncer.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2025-01-09 14:49:26 UTC; 2s ago
   Main PID: 2525445 (bash)
      Tasks: 2 (limit: 76753)
     Memory: 6.7M
        CPU: 2.565s
     CGroup: /system.slice/dumps-rsyncer.service
             ├─2525445 /bin/bash /usr/local/bin/rsync-via-primary.sh --do_tarball --do_rsync_xml --xmldumpsdir /data/xmldatadumps/public --xmlremotedirs dumpsdata1007.eqiad.wmnet::data/xmldatadumps/public/,clou>
             └─2525469 /usr/bin/rsync -a --contimeout=600 --timeout=600 --bwlimit=80000 --exclude=**bad/ --exclude=**save/ --exclude=**not/ --exclude=**temp/ --exclude=**tmp/ --exclude=*.inprog --exclude=*.html>

Jan 09 14:49:26 dumpsdata1006 systemd[1]: Started Dumps rsyncer service.
Jan 09 14:49:26 dumpsdata1006 dumps-rsyncer[2525449]: ls: write error: Broken pipe

I think that it's the HTML files that are not being synced properly.

Here we can see that the index.html file for itwiki/20250101 on dumpsdata1006 shows that the dump is complete.

btullis@dumpsdata1006:/data/xmldatadumps/public/itwiki/20250101$ grep -A1 'class="status"' index.html 
        <p class="status">
                <span class='done'>Dump complete</span>

Running the same command on clouddumps1002 shows the partial dumps.

btullis@clouddumps1002:/srv/dumps/xmldatadumps/public/itwiki/20250101$ grep -A1 'class="status"' index.html
        <p class="status">
                <span class='partial-dump'>Partial dump</span>

The dumps-rsyncer.service that I restarted on dumpsdata1006 specifically excludes *.html files, so I think that it must be another sync process that is broken.

I think that it has something to do with this make_statusfiles_tarball call here.

It is supposed to create a tarball of all of the most recent HTML and status files here.

The file is up-to-date on clouddumps1002...

btullis@clouddumps1002:~$ ls -l /srv/dumps/xmldatadumps/public/dumpstatusfiles.tar.gz 
-rw-r--r-- 1 root root 3619402 Jan  9 19:19 /srv/dumps/xmldatadumps/public/dumpstatusfiles.tar.gz

But when extracting it, we can see that it only contains the md5 and sha1 checksum files. No HTML files per wiki.

btullis@clouddumps1002:~$ tar xzvf /srv/dumps/xmldatadumps/public/dumpstatusfiles.tar.gz 

btullis@clouddumps1002:~$ tree|head -n 30
.
├── 404.html
├── aawiki
│   └── latest
│       ├── aawiki-latest-md5sums.txt -> ../20250101/aawiki-20250101-md5sums.txt
│       └── aawiki-latest-sha1sums.txt -> ../20250101/aawiki-20250101-sha1sums.txt
├── aawikibooks
│   └── latest
│       ├── aawikibooks-latest-md5sums.txt -> ../20250101/aawikibooks-20250101-md5sums.txt
│       └── aawikibooks-latest-sha1sums.txt -> ../20250101/aawikibooks-20250101-sha1sums.txt
├── aawiktionary
│   └── latest
│       ├── aawiktionary-latest-md5sums.txt -> ../20250101/aawiktionary-20250101-md5sums.txt
│       └── aawiktionary-latest-sha1sums.txt -> ../20250101/aawiktionary-20250101-sha1sums.txt
├── abwiki
│   └── latest
│       ├── abwiki-latest-md5sums.txt -> ../20250101/abwiki-20250101-md5sums.txt
│       └── abwiki-latest-sha1sums.txt -> ../20250101/abwiki-20250101-sha1sums.txt
├── abwiktionary
│   └── latest
│       ├── abwiktionary-latest-md5sums.txt -> ../20250101/abwiktionary-20250101-md5sums.txt
│       └── abwiktionary-latest-sha1sums.txt -> ../20250101/abwiktionary-20250101-sha1sums.txt
├── acewiki
│   └── latest
│       ├── acewiki-latest-md5sums.txt -> ../20250101/acewiki-20250101-md5sums.txt
│       └── acewiki-latest-sha1sums.txt -> ../20250101/acewiki-20250101-sha1sums.txt
├── advisorywiki
│   └── latest
│       ├── advisorywiki-latest-md5sums.txt -> ../20250101/advisorywiki-20250101-md5sums.txt
│       └── advisorywiki-latest-sha1sums.txt -> ../20250101/advisorywiki-20250101-sha1sums.txt
btullis@clouddumps1002:~$ ls -l *.html
-rw-r--r-- 1 btullis wikidev    127 Jul 17  2009 404.html
-rw-r--r-- 1 btullis wikidev 125918 Jan  9 19:18 backup-index-bydb.html
-rw-r--r-- 1 btullis wikidev 125923 Jan  9 19:18 backup-index.html
-rw-r--r-- 1 btullis wikidev 107505 Mar 25  2014 backup-index-sorted.html
-rw-r--r-- 1 btullis wikidev 108044 Jul  2  2014 backup-index-test-bydb.html
-rw-r--r-- 1 btullis wikidev 108049 Jul  2  2014 backup-index-test.html
-rwxr-xr-x 1 btullis wikidev   2392 Sep  7  2011 backups-of-old-wikis.html
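A lighter-weight check is to list the tarball without extracting it and count the HTML entries. Below is a stand-in demonstration under /tmp (the real path on clouddumps1002 is /srv/dumps/xmldatadumps/public/dumpstatusfiles.tar.gz):

```shell
# Build a stand-in tarball that mimics the observed contents
# (checksum files only, no per-wiki index.html) ...
mkdir -p /tmp/tarcheck/aawiki/latest
touch /tmp/tarcheck/aawiki/latest/aawiki-latest-md5sums.txt
touch /tmp/tarcheck/aawiki/latest/aawiki-latest-sha1sums.txt
tar czf /tmp/tarcheck/dumpstatusfiles.tar.gz -C /tmp/tarcheck aawiki
# ... then count HTML entries without extracting anything:
html_count=$(tar tzf /tmp/tarcheck/dumpstatusfiles.tar.gz | grep -c '\.html$' || true)
echo "html files in tarball: $html_count"   # prints 0 for this stand-in
```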

IIRC, we've been here before. Perhaps we should wait until @Milimetric comes back; he had a manual way of resyncing all this, it's just that, AFAIK, it is not documented.

Dan points to T364045#10019074 as the procedure.

I will attempt this now for itwiki and see if it works.

$ hostname -f
dumpsdata1006.eqiad.wmnet

/usr/bin/rsync -va --contimeout=600 --timeout=600 --exclude=**bad/ --exclude=**save/ --exclude=**not/ --exclude=**temp/ --exclude=**tmp/ --exclude=*.inprog --exclude=*.txt /data/xmldatadumps/public/itwiki dumpsdata1007.eqiad.wmnet::data/xmldatadumps/public/
/usr/bin/rsync -va --contimeout=600 --timeout=600 --exclude=**bad/ --exclude=**save/ --exclude=**not/ --exclude=**temp/ --exclude=**tmp/ --exclude=*.inprog --exclude=*.txt /data/xmldatadumps/public/itwiki clouddumps1001.wikimedia.org::data/xmldatadumps/public/
/usr/bin/rsync -va --contimeout=600 --timeout=600 --exclude=**bad/ --exclude=**save/ --exclude=**not/ --exclude=**temp/ --exclude=**tmp/ --exclude=*.inprog --exclude=*.txt /data/xmldatadumps/public/itwiki clouddumps1002.wikimedia.org::data/xmldatadumps/public/

Now when we visit https://dumps.wikimedia.org/itwiki/20250101/, we see all files as expected.

So this is it. Seems like we have to run this manually for all wikis.

Ran with a wildcard to hit all wikis:

$ hostname -f
dumpsdata1006.eqiad.wmnet

/usr/bin/rsync -va --contimeout=600 --timeout=600 --exclude=**bad/ --exclude=**save/ --exclude=**not/ --exclude=**temp/ --exclude=**tmp/ --exclude=*.inprog --exclude=*.txt /data/xmldatadumps/public/* dumpsdata1007.eqiad.wmnet::data/xmldatadumps/public/
/usr/bin/rsync -va --contimeout=600 --timeout=600 --exclude=**bad/ --exclude=**save/ --exclude=**not/ --exclude=**temp/ --exclude=**tmp/ --exclude=*.inprog --exclude=*.txt /data/xmldatadumps/public/* clouddumps1001.wikimedia.org::data/xmldatadumps/public/
/usr/bin/rsync -va --contimeout=600 --timeout=600 --exclude=**bad/ --exclude=**save/ --exclude=**not/ --exclude=**temp/ --exclude=**tmp/ --exclude=*.inprog --exclude=*.txt /data/xmldatadumps/public/* clouddumps1002.wikimedia.org::data/xmldatadumps/public/
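For future reference, an rsync dry run (-n) previews what would transfer before copying anything. Here is a local demonstration with throwaway directories (the real invocation targets the rsync daemons listed above):

```shell
# Set up a throwaway source/destination pair.
mkdir -p /tmp/rsync-demo/src /tmp/rsync-demo/dst
touch /tmp/rsync-demo/src/itwiki-20250101-pages-logging.xml.gz
touch /tmp/rsync-demo/src/leftover.inprog
# -n (dry run) lists candidate files but copies nothing; the exclude
# pattern mirrors the ones used in the real commands above.
rsync -van --exclude='*.inprog' /tmp/rsync-demo/src/ /tmp/rsync-demo/dst/
```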

Tried a couple of wikis:

https://dumps.wikimedia.org/eswiki/20250101/
https://dumps.wikimedia.org/simplewiki/20250101/
https://dumps.wikimedia.org/dewiki/20250101/

They all look good.

@ValterVB please confirm things look good from your side, and if so, please close this ticket. Thanks!

I downloaded the file and everything seems OK. Thank you

xcollazo claimed this task.

Thanks @xcollazo. Will this delay impact the timeliness of any of the downstream pipelines?

This particular issue will not affect timelines. But, there is another issue that might: T383568.

Again the same problem: "2025-01-23 20:35:30 itwiki: Dump complete", but some dumps aren't ready, for example "waiting All pages, current versions only."

The problem is not solved; it happened again with the dumps of 23 January 2025.

I've applied the same fix as in T383030#10449493.

Verified that itwiki looks good at https://dumps.wikimedia.org/itwiki/20250120/.

@ValterVB please verify.

frwiki at https://dumps.wikimedia.org/frwiki/20250120/ seems to have the same problem: "Partial dump" for many days now, with no visible progress.

There are 2 skipped files on itwiki (also on other wikis):
2025-01-23 18:58:22 skipped All pages with complete edit history (.7z)
2025-01-23 18:58:22 skipped All pages with complete page edit history (.bz2)
I don't need them, but maybe they are a sign of a problem.

Thanks

We typically do two runs per month. The one that starts on the 1st of the month includes those two jobs you mention.

The one that starts on the 20th skips them. So I think we are good.

Oops, I didn't know that. Thanks