
frwiki thrashing for 20241001 dump
Closed, Resolved (Public)

Description

As per https://groups.google.com/a/wikimedia.org/g/ops-dumps/c/seiCsdrBvu0/m/SQSUr9x0AAAJ

frwiki has been thrashing and not making progress:

PROBLEM: frwiki has file frwiki/20241001/frwiki-20241001-pages-meta-history1.xml-p11295p15049.bz2.inprog at least 4 hours older than lock
PROBLEM: frwiki has file frwiki/20241001/frwiki-20241001-pages-meta-history1.xml-p1210p2359.bz2.inprog at least 4 hours older than lock
PROBLEM: frwiki has file frwiki/20241001/frwiki-20241001-pages-meta-history1.xml-p19692p24985.bz2.inprog at least 4 hours older than lock
PROBLEM: frwiki has file frwiki/20241001/frwiki-20241001-pages-meta-history1.xml-p1p1209.bz2.inprog at least 4 hours older than lock
PROBLEM: frwiki has file frwiki/20241001/frwiki-20241001-pages-meta-history1.xml-p2360p3431.bz2.inprog at least 4 hours older than lock
PROBLEM: frwiki has file frwiki/20241001/frwiki-20241001-pages-meta-history1.xml-p31371p38442.bz2.inprog at least 4 hours older than lock
PROBLEM: frwiki has file frwiki/20241001/frwiki-20241001-pages-meta-history1.xml-p3432p6575.bz2.inprog at least 4 hours older than lock
PROBLEM: frwiki has file frwiki/20241001/frwiki-20241001-pages-meta-history1.xml-p6576p8686.bz2.inprog at least 4 hours older than lock
PROBLEM: frwiki has file frwiki/20241001/frwiki-20241001-pages-meta-history1.xml-p8687p11294.bz2.inprog at least 4 hours older than lock
PROBLEM: labswiki has file labswiki/20241001/labswiki-20241001-pages-meta-history.xml.bz2.inprog at least 4 hours older than lock
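
For reference, the condition behind these alerts can be approximated by hand. A rough sketch, not the actual job_watcher.sh logic (base paths taken from the job_watcher invocation quoted later in this task):

# Sketch: flag .inprog files whose mtime is at least 4 hours older than the lock.
lock=/data/xmldatadumps/private/frwiki/lock_20241001
lock_mtime=$(stat -c %Y "$lock")
for f in /data/xmldatadumps/public/frwiki/20241001/*.inprog; do
    [ -e "$f" ] || continue
    age=$(( lock_mtime - $(stat -c %Y "$f") ))
    [ "$age" -ge $(( 4 * 3600 )) ] && echo "PROBLEM: $f is ${age}s older than lock"
done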

Event Timeline

xcollazo changed the task status from Open to In Progress. Oct 10 2024, 5:18 PM

Figure out where things are running:

pwd
/mnt/dumpsdata/xmldatadumps/private/frwiki

cat lock_20241001 
snapshot1010.eqiad.wmnet 844031
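
The lock file format is "<hostname> <pid>". A quick liveness check for that PID, sketched (assumes SSH access to the snapshot host, as used below):

# Sketch: parse the lock and check whether the recorded PID is still alive.
read -r host pid < lock_20241001
ssh "$host" "ps -o pid,etime,cmd -p $pid --no-headers" \
    || echo "stale lock: PID $pid not running on $host"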

Now let's go to the node running frwiki and nuke all running processes:

ssh snapshot1010.eqiad.wmnet

sudo -u dumpsgen bash

cd /srv/deployment/dumps/dumps/xmldumps-backup

python3 dumpadmin.py --kill --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --wiki frwiki --dryrun
would kill processes ['2201330', '2201332', '2201334', '2201698', '2201716', '2201753']

python3 dumpadmin.py --kill --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --wiki frwiki
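
A post-kill sanity check, sketched (assumes the dump processes run as dumpsgen and mention frwiki on their command lines, as in the dry-run output above):

# Sketch: list any surviving frwiki dump processes.
pgrep -u dumpsgen -af frwiki || echo "no frwiki dump processes remain"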

Remove stale lock:

rm /mnt/dumpsdata/xmldatadumps/private/frwiki/lock_20241001

Now we wait and see if the dump gets picked up automatically again.

Not sure if related, but seeing lots of

PHP Notice: fwrite(): write of 137 bytes failed with errno=32 Broken pipe

on frwiki coming from dumps since 00:00 today.
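
errno=32 is EPIPE: the PHP process is writing into a pipe whose reader has already exited. That would be consistent with dumpTextPass.php feeding a compressor, as in the pstree output further down, though that link is a guess. A minimal illustration of the failure mode:

# EPIPE demo: the writer keeps writing after its pipe reader has gone away.
yes x | head -n 1 > /dev/null    # `yes` hits EPIPE/SIGPIPE once `head` exits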

From @BTullis via Slack:

Ben killed the frwiki run as it was not making progress. Then:

Now frwiki is back to running only on snapshot1010. It still seems to be making progress, but I don't know how to make it any faster. The files in /mnt/dumpsdata/xmldatadumps/public/frwiki/20241001 are still being written to regularly.
The retry-related errors are still appearing in /mnt/dumpsdata/xmldatadumps/private/frwiki/20241001/dumplog.txt, but it seems that it has been like this for a long time, as @Milimetric pointed out in the other thread: T315902 New error "DB is set and has not been closed by the Load Balancer" for certain bad revisions during page content dumps
So at this point I'm not sure what else we can do about the frwiki dumps, other than leave them and keep monitoring: https://dumps.wikimedia.org/frwiki/20241001/

From @BTullis via Slack:

Ben killed the frwiki run as it was not making progress. Then:

Now frwiki is back to running only on snapshot1010

For clarification, I only killed the frwiki processes that were running on snapshot1012.
I was looking at the output from:

btullis@cumin1002:~$ sudo cumin A:snapshot 'lsof -N|grep frwiki'

...and I saw that both snapshot1010 and snapshot1012 had the text file /mnt/dumpsdata/xmldatadumps/private/frwiki/20241001/dumplog.txt open for writing.

I wasn't used to seeing a dump running on multiple servers at the same time and thought that it was probably not helping. So I killed those processes that were running on snapshot1012.
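
For the record, pointing lsof at the specific file avoids the grep; a sketch of the same check:

# Sketch: which snapshot hosts hold the shared dumplog open, and under what PIDs?
sudo cumin A:snapshot 'lsof /mnt/dumpsdata/xmldatadumps/private/frwiki/20241001/dumplog.txt'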

As mentioned elsewhere, we should explore whether following a process similar to what was done in T346969: Clean up Bad Blobs would help the frwiki situation.

Well, we are in another weird state.

Progress seems to be made, as /mnt/dumpsdata/xmldatadumps/private/frwiki/20241001/dumplog.txt is being written to:

$ date
Tue 15 Oct 2024 04:47:46 PM UTC
$ ls -lsha dumplog.txt 
175M -rw-r--r-- 1 dumpsgen dumpsgen 175M Oct 15 16:48 dumplog.txt
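
Since mtime only shows the file was touched recently, a sketch for confirming it is actually growing:

# Sketch: sample the size twice; a growing log means the dump is still writing.
stat -c %s dumplog.txt; sleep 60; stat -c %s dumplog.txt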

However... there is no lock file!

xcollazo@snapshot1010:/mnt/dumpsdata/xmldatadumps/private/frwiki$ ls -ls *lock*
ls: cannot access '*lock*': No such file or directory

So, let's kill the process one more time...

ssh snapshot1010.eqiad.wmnet

sudo -u dumpsgen bash

cd /srv/deployment/dumps/dumps/xmldumps-backup

python3 dumpadmin.py --kill --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --wiki frwiki --dryrun
would kill processes ['3571243', '3571245', '3571246', '3571247', '3571248', '3571304', '3571548', '3571673', '3571711', '3571722']

python3 dumpadmin.py --kill --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --wiki frwiki

Remove stale files:

ls *inprog*
frwiki-20241001-pages-meta-history1.xml-p11295p15049.bz2.inprog
frwiki-20241001-pages-meta-history1.xml-p114983p127153.bz2.inprog
frwiki-20241001-pages-meta-history1.xml-p1210p2359.bz2.inprog
frwiki-20241001-pages-meta-history1.xml-p141400p155089.bz2.inprog
frwiki-20241001-pages-meta-history1.xml-p15050p19691.bz2.inprog
frwiki-20241001-pages-meta-history1.xml-p19692p24985.bz2.inprog
frwiki-20241001-pages-meta-history1.xml-p1p1209.bz2.inprog
frwiki-20241001-pages-meta-history1.xml-p2360p3431.bz2.inprog
frwiki-20241001-pages-meta-history1.xml-p24986p31370.bz2.inprog
frwiki-20241001-pages-meta-history1.xml-p31371p38442.bz2.inprog
frwiki-20241001-pages-meta-history1.xml-p3432p6575.bz2.inprog
frwiki-20241001-pages-meta-history1.xml-p38443p44198.bz2.inprog
frwiki-20241001-pages-meta-history1.xml-p44199p50597.bz2.inprog
frwiki-20241001-pages-meta-history1.xml-p50598p56007.bz2.inprog
frwiki-20241001-pages-meta-history1.xml-p56008p63399.bz2.inprog
frwiki-20241001-pages-meta-history1.xml-p63400p72015.bz2.inprog
frwiki-20241001-pages-meta-history1.xml-p6576p8686.bz2.inprog
frwiki-20241001-pages-meta-history1.xml-p8687p11294.bz2.inprog
frwiki-20241001-pages-meta-history2.xml-p321562p340412.bz2.inprog
frwiki-20241001-pages-meta-history2.xml-p524440p538104.bz2.inprog
frwiki-20241001-pages-meta-history2.xml-p600164p639273.bz2.inprog
frwiki-20241001-pages-meta-history2.xml-p679071p717138.bz2.inprog
frwiki-20241001-pages-meta-history2.xml-p793698p835207.bz2.inprog
frwiki-20241001-pages-meta-history3.xml-p1050823p1102944.bz2.inprog
frwiki-20241001-pages-meta-history3.xml-p1102945p1158283.bz2.inprog
frwiki-20241001-pages-meta-history3.xml-p1214055p1261475.bz2.inprog
frwiki-20241001-pages-meta-history3.xml-p1261476p1292831.bz2.inprog
frwiki-20241001-pages-meta-history3.xml-p1398944p1457344.bz2.inprog
frwiki-20241001-pages-meta-history3.xml-p1922905p1996751.bz2.inprog
frwiki-20241001-pages-meta-history3.xml-p2306040p2491170.bz2.inprog
frwiki-20241001-pages-meta-history3.xml-p2491171p2697716.bz2.inprog
frwiki-20241001-pages-meta-history3.xml-p2697717p2781677.bz2.inprog
frwiki-20241001-pages-meta-history4.xml-p2977215p3068240.bz2.inprog
frwiki-20241001-pages-meta-history4.xml-p3068241p3157746.bz2.inprog
frwiki-20241001-pages-meta-history4.xml-p3157747p3246400.bz2.inprog
frwiki-20241001-pages-meta-history4.xml-p3246401p3375561.bz2.inprog
frwiki-20241001-pages-meta-history4.xml-p3818811p3916324.bz2.inprog
frwiki-20241001-pages-meta-history4.xml-p3916325p4007512.bz2.inprog
frwiki-20241001-pages-meta-history4.xml-p4007513p4094891.bz2.inprog
frwiki-20241001-pages-meta-history4.xml-p4094892p4207173.bz2.inprog
frwiki-20241001-pages-meta-history4.xml-p4207174p4315792.bz2.inprog
frwiki-20241001-pages-meta-history4.xml-p4402791p4515908.bz2.inprog
frwiki-20241001-pages-meta-history4.xml-p4515909p4619006.bz2.inprog
frwiki-20241001-pages-meta-history4.xml-p4619007p4683649.bz2.inprog
frwiki-20241001-pages-meta-history4.xml-p4683650p4777417.bz2.inprog
dumpsgen@snapshot1010:/mnt/dumpsdata/xmldatadumps/public/frwiki/20241001$ rm *inprog*
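
As done later in this task, counting before deleting is a cheap guard against globbing in the wrong directory. Sketch:

# Sketch: review what the glob matches before removing anything.
ls -- *inprog* | wc -l
ls -- *inprog*
rm -- *inprog*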

IIRC, we are out of the cron range, so we will have to relaunch manually.

Ok, even after the above, we still had a process writing with no lock.

@BTullis assisted in identifying the process.

I killed it via kill -9, and now we have a healthy worker:

pstree -a dumpsgen
bash /srv/deployment/dumps/dumps/xmldumps-backup/monitor --configfile /etc/dumps/confs/wikidump.conf
  └─python3 ./monitor.py /etc/dumps/confs/wikidump.conf.dumps:monitor

bash
  └─pstree -a dumpsgen

systemd-timer-m /usr/local/bin/systemd-timer-mail-wrapper --subject fulldumps-rest --mail-to root@snapsh
  └─fulldumps.sh /usr/local/bin/fulldumps.sh 01 14 regular full 28 silent
          └─python3 /srv/deployment/dumps/dumps/xmldumps-backup/dumpscheduler.py --slots 28 --commands /etc/
          └─sh -c...
              └─bash ./worker --skipdone --exclusive --log --configfile...
                  └─python3 ./worker.py --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis ...
                      └─2*[{python3}]

With a proper lock:

$ pwd
/mnt/dumpsdata/xmldatadumps/private/frwiki
$ cat lock_20241001 
snapshot1010.eqiad.wmnet 361974

Ok, so I have to point out that I found a lot of the familiar "Rebooting getText infrastructure failed (DB is set and has not been closed by the Load Balancer) Trying to continue anyways" errors. We're used to seeing these for bad/corrupt revisions. But now we're seeing very large numbers for good revisions. From a count in the latest logs, I see:

frwiki: 572359 instances
enwiki: 169580 instances
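
For reproducibility, a sketch of how such a count can be taken, assuming the errors land in the run's dumplog as seen earlier:

# Sketch: count occurrences of the getText error in the run log.
grep -c 'Rebooting getText infrastructure failed' \
    /mnt/dumpsdata/xmldatadumps/private/frwiki/20241001/dumplog.txt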

My opinion is that the dumps, even if they finish, will be corrupt because of these errors. We should stop them and investigate. We need help resourcing that.

frwiki continued to thrash, as per https://groups.google.com/a/wikimedia.org/g/ops-dumps/c/F4M-ecqwHGA/m/CQcjVqxMAAAJ.

I tried multiple times to kill it via the usual dumpadmin.py --kill, but it kept coming back, even though it is the 17th and it should have stopped retrying on the 14th.

It took some coercing, but I was able to kill -9 the frwiki dump by killing the worker python process:

$ pstree -ap dumpsgen
bash,3495475 /srv/deployment/dumps/dumps/xmldumps-backup/monitor --configfile...
  └─sleep,4072146 15

sh,844027 -c...
  └─bash,844028 ./worker --skipdone --exclusive --log --configfile...
      └─python3,3619742 ./worker.py --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --log ...
          ├─php7.4,4071988 /srv/mediawiki/multiversion/MWScript.php dumpTextPass.php --wiki=frwiki...
          │   ├─sh,4072014 -c...
          │   │   └─lbzip2,4072025 -n 1
          │   │       ├─{lbzip2},4072029
          │   │       ├─{lbzip2},4072032
          │   │       └─{lbzip2},4072033
          │   └─sh,4072017 -c...
          │       └─7za,4072022 e -bd -so...
          ├─php7.4,4071989 /srv/mediawiki/multiversion/MWScript.php dumpTextPass.php --wiki=frwiki...
          │   ├─sh,4072013 -c...
          │   │   └─lbzip2,4072020 -n 1
          │   │       ├─{lbzip2},4072026
...
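
For the record, the kill sequence, sketched (worker PID from the pstree above; the match pattern for orphaned writers is an assumption based on that same listing):

# Sketch: stop the worker first so it cannot spawn new jobs, then sweep leftovers.
kill -9 3619742                                            # worker.py
pkill -9 -u dumpsgen -f 'dumpTextPass.php --wiki=frwiki'   # orphaned php writers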

Then remove stale files:

dumpsgen@snapshot1010:/mnt/dumpsdata/xmldatadumps/public/frwiki/20241001$ ls *inprog | wc -l
36
dumpsgen@snapshot1010:/mnt/dumpsdata/xmldatadumps/public/frwiki/20241001$ rm *inprog

Ok, now let's rerun the whole thing on a spare snapshot node. Will do that in a separate comment.

Launched a new run of frwiki on spare node snapshot1014.eqiad.wmnet via:

ssh snapshot1014.eqiad.wmnet
screen -S rerun-frwiki-20241001
sudo -u dumpsgen bash

cd /srv/deployment/dumps/dumps/xmldumps-backup

bash ./worker --date 20241001 --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --wiki frwiki
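
To keep an eye on the manual run without attaching to the screen session, a sketch using the log path from earlier in this task:

# Sketch: follow the run's log from any shell on the host.
tail -f /mnt/dumpsdata/xmldatadumps/private/frwiki/20241001/dumplog.txt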

Do we have any updates on this matter? I see the last log is from yesterday.

Copypasting from Slack:

From me:

The frwiki manual run continues to constantly throw the errors like so:

Rebooting getText infrastructure failed (DB is set and has not been closed by the Load Balancer) Trying to continue anyways
getting/checking text tt:215383 failed (Generic error while obtaining text for id tt:215383) for revision 215383
      865
       (Will retry 2 more times)

It continues to make progress, but it is very (very!) slow.

From @Milimetric

I think the errors mean we don't have good data anyway, even if it finishes. So I'm trying to find time to debug this today

The manual frwiki dump run now has multiple files stuck:

A systemd timer ran the following command:

/bin/bash /usr/local/bin/job_watcher.sh --dumpsbasedir /data/xmldatadumps/public --locksbasedir /data/xmldatadumps/private

Its return value was 0 and it emitted the following output:

PROBLEM: frwiki has file frwiki/20241001/frwiki-20241001-pages-meta-history1.xml-p1210p2359.bz2.inprog at least 4 hours older than lock
PROBLEM: frwiki has file frwiki/20241001/frwiki-20241001-pages-meta-history1.xml-p1p1209.bz2.inprog at least 4 hours older than lock
PROBLEM: frwiki has file frwiki/20241001/frwiki-20241001-pages-meta-history1.xml-p2360p3431.bz2.inprog at least 4 hours older than lock
PROBLEM: frwiki has file frwiki/20241001/frwiki-20241001-pages-meta-history1.xml-p6576p8686.bz2.inprog at least 4 hours older than lock
PROBLEM: frwiki has file frwiki/20241001/frwiki-20241001-pages-meta-history1.xml-p8687p11294.bz2.inprog at least 4 hours older than lock

Since it is running on a spare node, I will let it continue, but it is unlikely to finish.

Thus we must accept the fact that there will be no full frwiki dump this month. At this time I think our best bet is to try to fix the root problem via T377594.

Ok, so I have to point out that I found a lot of the familiar "Rebooting getText infrastructure failed (DB is set and has not been closed by the Load Balancer) Trying to continue anyways" errors. We're used to seeing these for bad/corrupt revisions. But now we're seeing very large numbers for good revisions.

I was asked to take a look at this part. Looking at the code, it is clear that the issue is a symptom of "failed" revisions, and fixing that removes this part too. But on "DB is not closed": I think it's just that MediaWiki, for a really long time now, hasn't let any layer above rdbms actually close the DB connection or say "pick another database". The whole $this->rotateDb(); call is outdated and should go away. It's also putting pressure on the databases for no good reason (maybe it was useful a long time ago, but definitely not anymore).

Ok, so I have to point out that I found a lot of the familiar "Rebooting getText infrastructure failed (DB is set and has not been closed by the Load Balancer) Trying to continue anyways" errors. We're used to seeing these for bad/corrupt revisions. But now we're seeing very large numbers for good revisions.

I was asked to take a look at this part. Looking at the code, it is clear that the issue is a symptom of "failed" revisions, and fixing that removes this part too.

There are more details at T377594#10243166 that present evidence of some of these revisions being good revisions.

But on "db is not closed" I think it's just that mw for really long time now doesn't let any layer above rdbms to really close the db connection or say "pick another database". The whole $this->rotateDb(); is outdated and should go away. It's adding pressure on the databases as well for no real good reason (maybe it was useful long time ago but definitely not anymore).

@aaron has a patch that removes rotateDb() over at https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1078491. @Ladsgroup would you mind reviewing that patch?

Ok, so I have to point out that I found a lot of the familiar "Rebooting getText infrastructure failed (DB is set and has not been closed by the Load Balancer) Trying to continue anyways" errors. We're used to seeing these for bad/corrupt revisions. But now we're seeing very large numbers for good revisions.

I was asked to take a look at this part. Looking at the code, it is clear that the issue is a symptom of "failed" revisions, and fixing that removes this part too.

There are more details at T377594#10243166 that present evidence of some of these revisions being good revisions.

Yup, that's why I put "failed" in scare quotes. The issue just shows up when you have a different problem and causes red herrings.

But on "db is not closed" I think it's just that mw for really long time now doesn't let any layer above rdbms to really close the db connection or say "pick another database". The whole $this->rotateDb(); is outdated and should go away. It's adding pressure on the databases as well for no real good reason (maybe it was useful long time ago but definitely not anymore).

@aaron has a patch that removes rotateDb() over at https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1078491. @Ladsgroup would you mind reviewing that patch?

That patch needed a manual rebase when you commented; now it's fixed. I reviewed it and it can go in soon. Dumps 1.0 breaks a lot of layering in MediaWiki: if a replica is giving you "bad" data, in almost all cases you can't get a better one from another replica.

Makes sense, thank you for your help @Ladsgroup.

As per T377594#10260629, we are going to pause the XML Dumps until we fix the root cause of all these issues.

Thus killing the manual frwiki job that has been running (and thrashing) all along since T376926#10246904:

ssh snapshot1014.eqiad.wmnet

screen -xS rerun-frwiki-20241001

CTRL-C
xcollazo closed this task as Resolved. Edited Oct 25 2024, 2:30 PM

Closing this; root cause finding will happen in T377594.