Page MenuHomePhabricator

cebwiki notification that abstract dumps failed
Closed, ResolvedPublic0 Estimated Story Points

Description

Email was marked 2019-06-26 15:28:21 UTC and here's the log entry via kibana: https://logstash.wikimedia.org/goto/c3f05e0b84a8f4b2e70f828f3648d326

Nothing else of interest in syslogs or anywhere else. No idea why mcrouter should time out. Nothing interesting in SAL around that time. Keeping the command around in case the mcrouter whine is a ruse:

/srv/mediawiki/multiversion/MWScript.php dumpBackup.php --wiki=cebwiki /srv/mediawiki/php-1.34.0-wmf.8 --plugin=AbstractFilter:/srv/mediawiki/php-1.34.0-wmf.8/extensions/ActiveAbstract/includes/AbstractFilter.php --current --report=1000 --namespaces=0 --output=file:/mnt/dumpsdata/xmldatadumps/temp/c/cebwiki/cebwiki-20190620-abstract.xml.gz.inprog_tmp --filter=namespace:NS_MAIN --filter=noredirect --filter=abstract --skip-header --start=7200001 --skip-footer --end 7210001

Event Timeline

ArielGlenn triaged this task as Medium priority.Jun 26 2019, 6:17 PM
ArielGlenn created this task.

Note that the job is rerunning now, presumably it will finish up normally.

Note that it's unclear whether the log error was in any way related to the abstract dumps failure. The logs for the dumps themselves show:

...
2019-06-26 15:19:12: cebwiki (ID 60395) 999 pages (103.5|103.5/sec all|curr), 1000 revs (103.6|103.6/sec all|curr), ETA 2019-06-27 16:32:10 [max 9402052]
2019-06-26 15:19:21: cebwiki (ID 60395) 1999 pages (106.8|220.8/sec all|curr), 2000 revs (106.9|110.5/sec all|curr), ETA 2019-06-27 15:44:54 [max 9402052]
2019-06-26 15:19:30: cebwiki (ID 60395) 2999 pages (108.9|339.3/sec all|curr), 3000 revs (108.9|113.1/sec all|curr), ETA 2019-06-27 15:17:59 [max 9402052]
2019-06-26 15:19:39: cebwiki (ID 60395) 3999 pages (108.3|426.0/sec all|curr), 4000 revs (108.3|106.5/sec all|curr), ETA 2019-06-27 15:26:00 [max 9402052]
2019-06-26 15:19:49: cebwiki (ID 60395) 4999 pages (106.7|504.3/sec all|curr), 5000 revs (106.7|100.9/sec all|curr), ETA 2019-06-27 15:47:16 [max 9402052]
2019-06-26 15:20:00: cebwiki (ID 60395) 5999 pages (104.0|552.5/sec all|curr), 6000 revs (104.0|92.1/sec all|curr), ETA 2019-06-27 16:26:09 [max 9402052]
2019-06-26 15:20:10: cebwiki (ID 60395) 6999 pages (102.5|662.7/sec all|curr), 7000 revs (102.5|94.7/sec all|curr), ETA 2019-06-27 16:47:15 [max 9402052]
2019-06-26 15:20:20: cebwiki (ID 60395) 7999 pages (102.7|832.1/sec all|curr), 8000 revs (102.7|104.0/sec all|curr), ETA 2019-06-27 16:44:32 [max 9402052]
2019-06-26 15:20:30: cebwiki (ID 60395) 8999 pages (102.8|935.7/sec all|curr), 9000 revs (102.9|104.0/sec all|curr), ETA 2019-06-27 16:42:29 [max 9402052]
2019-06-26 15:20:40: cebwiki (ID 60395) 9927 pages (101.9|1002.2/sec all|curr), 9927 revs (101.9|93.6/sec all|curr), ETA 2019-06-27 16:56:35 [max 9402052]
2019-06-26 15:20:51: cebwiki (ID 60758) 999 pages (93.0|93.0/sec all|curr), 1000 revs (93.1|93.1/sec all|curr), ETA 2019-06-27 19:23:44 [max 9402064]
2019-06-26 15:22:06: cebwiki (ID 60971) 999 pages (119.9|119.9/sec all|curr), 1000 revs (120.0|120.0/sec all|curr), ETA 2019-06-27 13:08:05 [max 9402075]
2019-06-26 15:24:18: cebwiki (ID 61388) 999 pages (115.9|115.9/sec all|curr), 1000 revs (116.1|116.1/sec all|curr), ETA 2019-06-27 13:54:22 [max 9402094]
failed job after max retries
Error from command(s): /usr/bin/python3 xmlabstracts.py --config /etc/dumps/confs/wikidump.conf.dumps --wiki cebwiki --outfiles=/mnt/dumpsdata/xmldatadumps/public/cebwiki/20190620/cebwiki-2
0190620-abstract.xml.gz.inprog --variants= 
2019-06-26 15:28:21: cebwiki *** exception! error producing abstract dump
2019-06-26 15:28:21: cebwiki ['Traceback (most recent call last):\n', '  File "/srv/deployment/dumps/dumps-cache/revs/fd6100a5e98e125b8256f2fd4b9d0ac2d259a735/xmldumps-backup/dumps/runner.py", line 402, in do_run_item\n    item.dump(self)\n', '  File "/srv/deployment/dumps/dumps-cache/revs/fd6100a5e98e125b8256f2fd4b9d0ac2d259a735/xmldumps-backup/dumps/jobs.py", line 201, in dump\n    self.run(runner)\n', '  File "/srv/deployment/dumps/dumps-cache/revs/fd6100a5e98e125b8256f2fd4b9d0ac2d259a735/xmldumps-backup/dumps/xmljobs.py", line 434, in run\n    raise BackupError("error producing abstract dump")\n', 'dumps.exceptions.BackupError: error producing abstract dump\n']
2019-06-26 15:28:21: cebwiki Checkdir dir /mnt/dumpsdata/xmldatadumps/public/cebwiki/latest ...
2019-06-26 15:28:21: cebwiki Checkdir dir /mnt/dumpsdata/xmldatadumps/public/cebwiki/latest ...
2019-06-26 15:28:21: cebwiki adding rss feed file /mnt/dumpsdata/xmldatadumps/public/cebwiki/latest/cebwiki-latest-stub-meta-history.xml.gz-rss.xml 
...

which means a failure significantly later than the timeout. Additionally I see that same error for other projects, also from the abstract dumps job, and no corresponding failure of the dump step.

No other projects affected, as shown by a grep through all the dump logs for the June 20th run:

root@dumpsdata1001:/data/xmldatadumps/private# grep -i err */20190620/dumplog.txt | grep -v 'Generic error'
cebwiki/20190620/dumplog.txt:Error from command(s): /usr/bin/python3 xmlstubs.py --config /etc/dumps/confs/wikidump.conf.dumps --wiki cebwiki --articles /mnt/dumpsdata/xmldatadumps/public/cebwiki/20190620/cebwiki-20190620-stub-articles.xml.gz.inprog --history /mnt/dumpsdata/xmldatadumps/public/cebwiki/20190620/cebwiki-20190620-stub-meta-history.xml.gz.inprog --current /mnt/dumpsdata/xmldatadumps/public/cebwiki/20190620/cebwiki-20190620-stub-meta-current.xml.gz.inprog 
cebwiki/20190620/dumplog.txt:2019-06-22 10:49:39: cebwiki *** exception! error producing stub files
cebwiki/20190620/dumplog.txt:2019-06-22 10:49:39: cebwiki ['Traceback (most recent call last):\n', '  File "/srv/deployment/dumps/dumps-cache/revs/fd6100a5e98e125b8256f2fd4b9d0ac2d259a735/xmldumps-backup/dumps/runner.py", line 402, in do_run_item\n    item.dump(self)\n', '  File "/srv/deployment/dumps/dumps-cache/revs/fd6100a5e98e125b8256f2fd4b9d0ac2d259a735/xmldumps-backup/dumps/jobs.py", line 201, in dump\n    self.run(runner)\n', '  File "/srv/deployment/dumps/dumps-cache/revs/fd6100a5e98e125b8256f2fd4b9d0ac2d259a735/xmldumps-backup/dumps/xmljobs.py", line 222, in run\n    raise BackupError("error producing stub files")\n', 'dumps.exceptions.BackupError: error producing stub files\n']
cebwiki/20190620/dumplog.txt:Error from command(s): /usr/bin/python3 xmlabstracts.py --config /etc/dumps/confs/wikidump.conf.dumps --wiki cebwiki --outfiles=/mnt/dumpsdata/xmldatadumps/public/cebwiki/20190620/cebwiki-20190620-abstract.xml.gz.inprog --variants= 
cebwiki/20190620/dumplog.txt:2019-06-26 15:28:21: cebwiki *** exception! error producing abstract dump
cebwiki/20190620/dumplog.txt:2019-06-26 15:28:21: cebwiki ['Traceback (most recent call last):\n', '  File "/srv/deployment/dumps/dumps-cache/revs/fd6100a5e98e125b8256f2fd4b9d0ac2d259a735/xmldumps-backup/dumps/runner.py", line 402, in do_run_item\n    item.dump(self)\n', '  File "/srv/deployment/dumps/dumps-cache/revs/fd6100a5e98e125b8256f2fd4b9d0ac2d259a735/xmldumps-backup/dumps/jobs.py", line 201, in dump\n    self.run(runner)\n', '  File "/srv/deployment/dumps/dumps-cache/revs/fd6100a5e98e125b8256f2fd4b9d0ac2d259a735/xmldumps-backup/dumps/xmljobs.py", line 434, in run\n    raise BackupError("error producing abstract dump")\n', 'dumps.exceptions.BackupError: error producing abstract dump\n']


root@dumpsdata1001:/data/xmldatadumps/private# grep xcept */20190620/dumplog.txt
cebwiki/20190620/dumplog.txt:2019-06-22 10:49:39: cebwiki *** exception! error producing stub files
cebwiki/20190620/dumplog.txt:2019-06-22 10:49:39: cebwiki ['Traceback (most recent call last):\n', '  File "/srv/deployment/dumps/dumps-cache/revs/fd6100a5e98e125b8256f2fd4b9d0ac2d259a735/xmldumps-backup/dumps/runner.py", line 402, in do_run_item\n    item.dump(self)\n', '  File "/srv/deployment/dumps/dumps-cache/revs/fd6100a5e98e125b8256f2fd4b9d0ac2d259a735/xmldumps-backup/dumps/jobs.py", line 201, in dump\n    self.run(runner)\n', '  File "/srv/deployment/dumps/dumps-cache/revs/fd6100a5e98e125b8256f2fd4b9d0ac2d259a735/xmldumps-backup/dumps/xmljobs.py", line 222, in run\n    raise BackupError("error producing stub files")\n', 'dumps.exceptions.BackupError: error producing stub files\n']
cebwiki/20190620/dumplog.txt:2019-06-26 15:28:21: cebwiki *** exception! error producing abstract dump
cebwiki/20190620/dumplog.txt:2019-06-26 15:28:21: cebwiki ['Traceback (most recent call last):\n', '  File "/srv/deployment/dumps/dumps-cache/revs/fd6100a5e98e125b8256f2fd4b9d0ac2d259a735/xmldumps-backup/dumps/runner.py", line 402, in do_run_item\n    item.dump(self)\n', '  File "/srv/deployment/dumps/dumps-cache/revs/fd6100a5e98e125b8256f2fd4b9d0ac2d259a735/xmldumps-backup/dumps/jobs.py", line 201, in dump\n    self.run(runner)\n', '  File "/srv/deployment/dumps/dumps-cache/revs/fd6100a5e98e125b8256f2fd4b9d0ac2d259a735/xmldumps-backup/dumps/xmljobs.py", line 434, in run\n    raise BackupError("error producing abstract dump")\n', 'dumps.exceptions.BackupError: error producing abstract dump\n']

These piecewise jobs, run by calling xmlabstracts.py, have virtually no error logging at the point where an error might be seen. Remedying that now. This will also enable error logging for pagelogs and stub dumps.
Note that this is separate from MediaWiki logs from dumpBackup.php going to kibana.

The repeat run of abstracts finished without incident.

Change 519427 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/dumps@master] tiny bit more error reporting when we do stubs/pagelogs/abstracts

https://gerrit.wikimedia.org/r/519427

Change 519427 merged by ArielGlenn:
[operations/dumps@master] tiny bit more error reporting when we do stubs/pagelogs/abstracts

https://gerrit.wikimedia.org/r/519427

Waiting to deploy until the current run is done, probably in 2 or 3 days.

And I forgot to deploy it, woops. Will do at the end of this run.

However in the meantime there were several new failures for wikidata pages-articles files, with the same sort of limited information available. The jobs in question left intact inprog files around which appeared to be complete; I moved them into place and shot the pages-articles recombine so that it could rerun, which it has done with no further issues.

It's unclear whether these two incidents are cases of the same bug but I'll treat them as such for now.

It turns out that Maintenance.php had this changeset applied: https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/478503/ which would disappear all maintenance script exception stack traces for readers looking for them on stderr. Presumably they will be back now due to https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/523687/

In the meantime the error logging patch has been deployed, and we've not seen another failure of this sort, so I'll close this task.