
data retrieval/write issues via NFS on dumpsdata1001, impacting some dump jobs
Closed, Resolved · Public

Description

Saw some weird behavior when trying to run a couple of dumps recompress jobs just now from the command line on our test host. Went to dumpsdata1001 and saw some 'bash: ls: write error' messages in syslog. Shot the job with the relevant pid (the rolling rsync), ran puppet to restart it, here's the result:

Apr  1 20:06:19 dumpsdata1001 puppet-agent[32018]: Using configured environment 'production'
Apr  1 20:06:19 dumpsdata1001 puppet-agent[32018]: Retrieving pluginfacts
Apr  1 20:06:19 dumpsdata1001 puppet-agent[32018]: Retrieving plugin
Apr  1 20:06:20 dumpsdata1001 puppet-agent[32018]: Loading facts
Apr  1 20:06:27 dumpsdata1001 puppet-agent[32018]: Caching catalog for dumpsdata1001.eqiad.wmnet
Apr  1 20:06:28 dumpsdata1001 puppet-agent[32018]: (/Stage[main]/Base::Environment/Tidy[/var/tmp/core]) Tidying 0 files
Apr  1 20:06:28 dumpsdata1001 puppet-agent[32018]: Applying configuration version '1522613183'
Apr  1 20:06:29 dumpsdata1001 crontab[32246]: (root) LIST (root)
Apr  1 20:06:29 dumpsdata1001 crontab[32248]: (root) LIST (prometheus)
Apr  1 20:06:29 dumpsdata1001 crontab[32250]: (root) LIST (dumpsgen)
Apr  1 20:06:32 dumpsdata1001 systemd[1]: Reloading.
Apr  1 20:06:32 dumpsdata1001 systemd[1]: Started ACPI event daemon.
Apr  1 20:06:32 dumpsdata1001 systemd[1]: Listening on ACPID Listen Socket.
Apr  1 20:06:32 dumpsdata1001 systemd[1]: Mounted /data.
Apr  1 20:06:32 dumpsdata1001 systemd[1]: Mounted /.
Apr  1 20:06:32 dumpsdata1001 systemd[1]: Created slice system-systemd\x2dfsck.slice.
Apr  1 20:06:32 dumpsdata1001 systemd[1]: Started File System Check on /dev/mapper/dumpsdata1001--vg-data.
Apr  1 20:06:32 dumpsdata1001 systemd[1]: Found device /dev/mapper/dumpsdata1001--vg-data.
Apr  1 20:06:32 dumpsdata1001 systemd[1]: Found device /sys/devices/virtual/block/dm-1.
Apr  1 20:06:32 dumpsdata1001 systemd[1]: Found device /dev/dm-1.
Apr  1 20:06:32 dumpsdata1001 systemd[1]: Found device /dev/disk/by-id/dm-name-dumpsdata1001--vg-data.
Apr  1 20:06:32 dumpsdata1001 systemd[1]: Found device /dev/disk/by-id/dm-uuid-LVM-VTvoMeEuXXfcVQYa4SA1qNFIFu1joqDnMoheU3qiU5HXrS778c0gCAo59ZvKeivQ.
Apr  1 20:06:32 dumpsdata1001 systemd[1]: Found device /dev/disk/by-uuid/c99c516f-72cd-4f53-83a9-31591a9d1b4a.
Apr  1 20:06:32 dumpsdata1001 systemd[1]: Found device /dev/dumpsdata1001-vg/data.
Apr  1 20:06:32 dumpsdata1001 systemd[1]: Starting Dumps rsyncer service...
Apr  1 20:06:32 dumpsdata1001 systemd[1]: Failed to reset devices.list on /system.slice: Invalid argument
Apr  1 20:06:32 dumpsdata1001 systemd[1]: Started Dumps rsyncer service.
Apr  1 20:06:32 dumpsdata1001 bash[32427]: ls: write error: Broken pipe
Apr  1 20:06:32 dumpsdata1001 puppet-agent[32018]: (/Stage[main]/Dumps::Generation::Server::Rsyncer/Base::Service_unit[dumps-rsyncer]/Service[dumps-rsyncer]/ensure) ensure changed 'stopped' to 'running'
Apr  1 20:06:32 dumpsdata1001 puppet-agent[32018]: (/Stage[main]/Dumps::Generation::Server::Rsyncer/Base::Service_unit[dumps-rsyncer]/Service[dumps-rsyncer]) Unscheduling refresh on Service[dumps-rsyncer]
Apr  1 20:06:36 dumpsdata1001 puppet-agent[32018]: Applied catalog in 7.60 seconds

I see nothing in dmesg, no error emails either.

Event Timeline

ArielGlenn created this task.

The original error I saw on snapshot1001 was:

2018-04-01 19:56:56: cawiki Checksumming cawiki-20180301-langlinks.sql.gz via sha1
Traceback (most recent call last):
  File "./worker.py", line 515, in <module>
    main()
  File "./worker.py", line 494, in main
    result = runner.run()
  File "/srv/deployment/dumps/dumps-cache/revs/982cebda3bae34041d38eec420d4c1fadc323433/xmldumps-backup/dumps/runner.py", line 783, in run
    prereq_job = self.do_run_item(item)
  File "/srv/deployment/dumps/dumps-cache/revs/982cebda3bae34041d38eec420d4c1fadc323433/xmldumps-backup/dumps/runner.py", line 722, in do_run_item
    self.dumpjobdata.do_after_job(item, self.dump_item_list.dump_items)
  File "/srv/deployment/dumps/dumps-cache/revs/982cebda3bae34041d38eec420d4c1fadc323433/xmldumps-backup/dumps/runnerutils.py", line 1011, in do_after_job
    self.symlinks.cleanup_symlinks()
  File "/srv/deployment/dumps/dumps-cache/revs/982cebda3bae34041d38eec420d4c1fadc323433/xmldumps-backup/dumps/runnerutils.py", line 722, in cleanup_symlinks
    realfilepath = os.readlink(link)
OSError: [Errno 5] Input/output error: '/mnt/dumpsdata/xmldatadumps/public/cawiki/latest/cawiki-latest-imagelinks.sql.gz'

There are no errors in the log that indicate nfs from dumpsdata1001 is slow right now.

I'm pretty out of commission right now so I have no idea if these are related or even if that first error is somehow normal. Will look tomorrow.

Well. The ls errors seem to be consistent, probably emitted at the conclusion of some pipeline in the bash script. They happen even when nothing else is going on on the host, so they can be ignored for now, though it would be nice not to clutter up the logs.
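
For reference, a minimal sketch of how such a message can be produced (a hypothetical pipeline, not the actual rsyncer script): systemd starts service processes with SIGPIPE ignored by default (IgnoreSIGPIPE=yes), so when the reading end of a pipe exits early, ls gets EPIPE and prints a write error instead of being killed silently by the signal.

# bash sketch; needs a listing big enough that ls is still writing when head exits
trap '' PIPE                                    # emulate systemd's IgnoreSIGPIPE=yes
ls /some/directory/with/many/thousands/of/files | head -n 1
# ls: write error: Broken pipe                  (harmless: head closed the pipe after one line)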

The other issue, the 'Input/output error', is not so easy to account for. I neglected to mention that on two runs of the same dump command, the error came from an attempt to read the link of a different file each time. An attempt to run this job just now appears to run ok.

Within the space of one minute I saw:

dumpsgen@snapshot1001:/srv/deployment/dumps/dumps/xmldumps-backup$ ls -l /mnt/dumpsdata/xmldatadumps/public/cawiki/latest/cawiki-latest-imagelinks.sql.gz
ls: cannot read symbolic link /mnt/dumpsdata/xmldatadumps/public/cawiki/latest/cawiki-latest-imagelinks.sql.gz: Input/output error
lrwxrwxrwx 1 dumpsgen dumpsgen 45 Apr  1 17:51 /mnt/dumpsdata/xmldatadumps/public/cawiki/latest/cawiki-latest-imagelinks.sql.gz

and

dumpsgen@snapshot1001:/srv/deployment/dumps/dumps/xmldumps-backup$ ls -l /mnt/dumpsdata/xmldatadumps/public/cawiki/latest/cawiki-latest-imagelinks.sql.gz
lrwxrwxrwx 1 dumpsgen dumpsgen 45 Apr  1 17:51 /mnt/dumpsdata/xmldatadumps/public/cawiki/latest/cawiki-latest-imagelinks.sql.gz -> ../20180401/cawiki-20180401-imagelinks.sql.gz

Separately, zcat of one of the wikidatawiki stub files on dumpsdata1001 shows null bytes right after the siteinfo header:

root@dumpsdata1001:/data/xmldatadumps/private/wikidatawiki/20180401# zcat /data/xmldatadumps/public/wikidatawiki/20180401/wikidatawiki-20180401-stub-meta-history27.xml.gz | cat -vte | more
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http:/
/www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">$
  <siteinfo>$
    <sitename>Wikidata</sitename>$
    <dbname>wikidatawiki</dbname>$
    <base>https://www.wikidata.org/wiki/Wikidata:Main_Page</base>$
    <generator>MediaWiki 1.31.0-wmf.26</generator>$
    <case>first-letter</case>$
    <namespaces>$
      <namespace key="-2" case="first-letter">Media</namespace>$
      <namespace key="-1" case="first-letter">Special</namespace>$
      <namespace key="0" case="first-letter" />$
      <namespace key="1" case="first-letter">Talk</namespace>$
      <namespace key="2" case="first-letter">User</namespace>$
      <namespace key="3" case="first-letter">User talk</namespace>$
      <namespace key="4" case="first-letter">Wikidata</namespace>$
      <namespace key="5" case="first-letter">Wikidata talk</namespace>$
      <namespace key="6" case="first-letter">File</namespace>$
      <namespace key="7" case="first-letter">File talk</namespace>$
      <namespace key="8" case="first-letter">MediaWiki</namespace>$
      <namespace key="9" case="first-letter">MediaWiki talk</namespace>$
      <namespace key="10" case="first-letter">Template</namespace>$
      <namespace key="11" case="first-letter">Template talk</namespace>$
      <namespace key="12" case="first-letter">Help</namespace>$
      <namespace key="13" case="first-letter">Help talk</namespace>$
      <namespace key="14" case="first-letter">Category</namespace>$
      <namespace key="15" case="first-letter">Category talk</namespace>$
      <namespace key="120" case="first-letter">Property</namespace>$
      <namespace key="121" case="first-letter">Property talk</namespace>$
      <namespace key="122" case="first-letter">Query</namespace>$
      <namespace key="123" case="first-letter">Query talk</namespace>$
      <namespace key="828" case="first-letter">Module</namespace>$
      <namespace key="829" case="first-letter">Module talk</namespace>$
      <namespace key="1198" case="first-letter">Translations</namespace>$
      <namespace key="1199" case="first-letter">Translations talk</namespace>$
      <namespace key="2300" case="first-letter">Gadget</namespace>$
      <namespace key="2301" case="first-letter">Gadget talk</namespace>$
      <namespace key="2302" case="case-sensitive">Gadget definition</namespace>$
      <namespace key="2303" case="case-sensitive">Gadget definition talk</namespace>$
      <namespace key="2600" case="first-letter">Topic</namespace>$
    </namespaces>$
  </siteinfo>$
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^
@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^
@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^
@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^

We saw a few of these last month too: garbage output in stub files. I wonder if this is the result of too much stress on the NFS filesystem. atop doesn't show anything too crazy for i/o levels on dumpsdata1001 itself.

Grafana shows network usage is pretty high. We have two interfaces, so we could bring up the second one, but can the arrays/controller handle the extra throughput?

Στιγμιότυπο από 2018-04-02 10-57-35.png (277×866 px, 67 KB)

Στιγμιότυπο από 2018-04-02 10-57-47.png (253×580 px, 48 KB)

We could consider QoS: prioritizing NFS packets or, alternatively, deprioritizing rsync packets, while staying with the single interface active, to see what impact that has. It's fine for rsync to take longer.
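
A minimal sketch of the deprioritize-rsync option using HTB (illustrative only: the interface name, a 1G link, and the assumption that peer rsyncs go to the rsync daemon port 873 are all hypothetical; rsync over ssh would need a different match):

tc qdisc add dev eth0 root handle 1: htb default 10
tc class add dev eth0 parent 1: classid 1:1 htb rate 940mbit
tc class add dev eth0 parent 1:1 classid 1:10 htb rate 800mbit ceil 940mbit    # NFS and everything else
tc class add dev eth0 parent 1:1 classid 1:20 htb rate 140mbit ceil 940mbit    # rsync to peers, can borrow when the link is idle
tc filter add dev eth0 parent 1: protocol ip prio 1 u32 match ip dport 873 0xffff flowid 1:20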

In the meantime I'm going to check which stub files from this run have nulls in them, add a check for an intact header when stub files are written, and get the current run sorted out.

from /data/xmldatadumps/private on dumpsdata1001:

# cat -vte renders NUL bytes as ^@, so this counts stub files whose first 60 lines contain any
list=*wik*
for wikiname in $list; do echo -n "$wikiname:  "; zcat "/data/xmldatadumps/public/${wikiname}/20180401/${wikiname}-20180401-stub-meta-history.xml.gz" | cat -vte | head -60 | grep '@^@' | wc -l; done | grep -v '  0'

A few public wikis turned up with no recombined files; those were commonswiki (I checked by hand, no issues), enwiki, and wikidatawiki.
Checks of enwiki and wikidatawiki show:

root@dumpsdata1001:/data/xmldatadumps/private# for num in `seq 1 27`; do echo -n "wikidata ${num}:  "; zcat "/data/xmldatadumps/public/wikidatawiki/20180401/wikidatawiki-20180401-stub-meta-history${num}.xml.gz"  | cat -vte | head -60 | grep '@^@' | wc -l; done | grep -v '  0'
wikidata 27:  1
root@dumpsdata1001:/data/xmldatadumps/private# for num in `seq 1 27`; do echo -n "enwiki ${num}:  "; zcat "/data/xmldatadumps/public/enwiki/20180401/enwiki-20180401-stub-meta-history${num}.xml.gz"  | cat -vte | head -60 | grep '@^@' | wc -l; done | grep -v '  0'

Repeated this for stub-articles and stub-meta-current; file 27 for wikidata was broken in all three cases, so it's just the one set of files. I've removed the bad files. I'll let the page content files for wikidata complete, then remove any bogus page content files generated from the bad stubs, and mark the page content and stub files as failed so that the missing files get produced later.

ArielGlenn renamed this task from Write issues on dumpsdata1001 to data retrieval/write issues via NFS on dumpsdata1001, impacting some dump jobs.Apr 2 2018, 8:26 AM

Other things we could do, just to get the dumps to run smoothly:

  • move the enwiki or wikidatawiki run to write to dumpsdata1002. This would mean every snapshot host gets two mounts, config files become difficult, and rsyncs between dumpsdata1001 and dumpsdata1002 get complicated.
  • do fewer stubs at a time; cut back either the number of enwiki jobs or the number of wikidatawiki jobs or both. This is fast because it's only a config setting, but we already do only 9 jobs at a time.
  • do fewer stubs at a time by restricting the number of small wikis that run this job at the same time; this could be done by changing the dump scheduler input file.

On an earlier run of the stubs job for wikidata this month, we have:

Traceback (most recent call last):
  File "xmlstubs.py", line 246, in <module>
    main()
  File "xmlstubs.py", line 242, in main
    dostubsbackup(wiki, history_file, current_file, articles_file, wikiconf, start, end, dryrun)
  File "xmlstubs.py", line 138, in dostubsbackup
    5000, 20000, '</page>\n', callback)
  File "/srv/deployment/dumps/dumps-cache/revs/982cebda3bae34041d38eec420d4c1fadc323433/xmldumps-backup/xmlstreams.py", line 73, in do_xml_stream
    do_xml_piece(piece_command, outfiles, ends_with, dryrun)
  File "/srv/deployment/dumps/dumps-cache/revs/982cebda3bae34041d38eec420d4c1fadc323433/xmldumps-backup/xmlstreams.py", line 243, in do_xml_piece
    catfile(outfiles[filetype]['temp'], outfiles[filetype]['compr'])
  File "/srv/deployment/dumps/dumps-cache/revs/982cebda3bae34041d38eec420d4c1fadc323433/xmldumps-backup/xmlstreams.py", line 128, in catfile
    with open(inputfile, "r") as fhandle:
IOError: [Errno 2] No such file or directory: '/mnt/dumpsdata/xmldatadumps/temp/wikidatawiki-20180401-stub-meta-history7.xml.gz.inprog_tmp'

The file just appears to be (temporarily) gone. This looks similar to the readlink issue mentioned in the first comment.

Change 423465 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/dumps@master] check stub files after production to see if they have binary crap

https://gerrit.wikimedia.org/r/423465

Change 423465 merged by ArielGlenn:
[operations/dumps@master] check stub files after production to see if they have binary crap

https://gerrit.wikimedia.org/r/423465

I've rerun the one bad stub and set the dump scheduler on the appropriate host to start up wikidata dumps again in a few minutes; that will do for now as emergency measures. Also needed: start writing stubs as small files that can be catted together (hm, we'll have the header problem if we are not careful), and only rerun the pieces that are missing if the job fails.
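
On the "catted together" point: gzip streams can simply be concatenated and zcat reads them in sequence, so the only trick is making sure the <mediawiki>/<siteinfo> header (and the closing </mediawiki>) appear exactly once across the pieces. A sketch with hypothetical file names:

# only the first piece keeps the header, only the last keeps the closing tag
cat piece1-with-header.xml.gz piece2-bare.xml.gz piece3-bare-with-footer.xml.gz > wikidatawiki-20180401-stub-meta-history1.xml.gz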

After the relevant reading, I'm leaning hard towards this being some sort of NFS cache thing. Need to do some testing to see if I can reproduce semi-at-will; some notes will be appearing here soon.

For people keeping track, I'm trying a couple of scripts that create/read/write/remove large numbers of files in a directory, to see if anything trips. So far, nothing, but these tests have all been on the failover NFS server. Next I guess I'll try them on the active one.
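
The scripts are along these lines (a simplified sketch, not the exact ones; the scratch path is assumed): create, read, symlink, readlink, and remove many files on the NFS mount and report any operation that returns an error.

#!/bin/bash
testdir=/mnt/dumpsdata/xmldatadumps/temp/nfstest     # assumed scratch location on the NFS mount
mkdir -p "$testdir"
for i in $(seq 1 10000); do
    echo "payload $i" > "$testdir/file$i"            || echo "write failed: file$i"
    ln -sf "file$i" "$testdir/link$i"                || echo "symlink failed: link$i"
    readlink "$testdir/link$i" > /dev/null           || echo "readlink failed: link$i"
    cat "$testdir/file$i" > /dev/null                || echo "read failed: file$i"
    rm -f "$testdir/file$i" "$testdir/link$i"
done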

Change 426888 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/dumps@master] show the stacktrace for errors from dump job run in most cases

https://gerrit.wikimedia.org/r/426888

Change 426888 merged by ArielGlenn:
[operations/dumps@master] show the stacktrace for errors from dump job run in most cases

https://gerrit.wikimedia.org/r/426888

I have gone back through the 'no such file' errors for the past 5 months. The vast majority are stubs; a few are page dump files, and a couple are abstract files. See P6995 for the list.

Change 427127 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/dumps@master] log more xmlstream exceptions, process all input files before exit

https://gerrit.wikimedia.org/r/427127

Change 427127 merged by ArielGlenn:
[operations/dumps@master] log more xmlstream exceptions, process all input files before exit

https://gerrit.wikimedia.org/r/427127

Change 427603 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] set actimeo=0 on snapshot1006 nfs mount for the next dump run

https://gerrit.wikimedia.org/r/427603

I'm planning to try disabling the NFS attribute cache for files and directories on one of the snapshot hosts for the Apr 20 run, to see what impact that might have on the issues described here, and to monitor performance. The reports I've seen say that actimeo=0 gives much better throughput than noac.
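
For reference, a minimal sketch of what the client mount might look like (illustrative fstab line, not the actual puppet-managed config; the server export path and option set are assumptions). actimeo=0 sets all the attribute cache timeouts to zero but leaves writes asynchronous, whereas noac additionally forces synchronous writes, which is why it costs more throughput:

dumpsdata1001.eqiad.wmnet:/data  /mnt/dumpsdata  nfs  rw,hard,bg,vers=4,actimeo=0  0  0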

Change 427684 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/dumps@master] keep intact output files from stubs/abstracts/logs around for retries

https://gerrit.wikimedia.org/r/427684

Change 427684 merged by ArielGlenn:
[operations/dumps@master] keep intact output files from stubs/abstracts/logs around for retries

https://gerrit.wikimedia.org/r/427684

Change 427603 merged by ArielGlenn:
[operations/puppet@production] set actimeo=0 on snapshot1006 nfs mount for the next dump run

https://gerrit.wikimedia.org/r/427603

Change 428267 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] no attribute caching on snaphot1005

https://gerrit.wikimedia.org/r/428267

Change 428267 merged by ArielGlenn:
[operations/puppet@production] no attribute caching on snaphot1005

https://gerrit.wikimedia.org/r/428267

Change 429647 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] disable nfs file attr caching on the last of the snapshot hosts

https://gerrit.wikimedia.org/r/429647

Change 429647 merged by ArielGlenn:
[operations/puppet@production] disable nfs file attr caching on the last of the snapshot hosts

https://gerrit.wikimedia.org/r/429647

Change 429761 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] cap dump rsyncs to peers at bandwidth of 40000K per second

https://gerrit.wikimedia.org/r/429761

Change 429761 merged by ArielGlenn:
[operations/puppet@production] cap dump rsyncs to peers at bandwidth of 40000K per second

https://gerrit.wikimedia.org/r/429761
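
For reference, a minimal sketch of the kind of invocation that cap corresponds to (paths and destination module are hypothetical; the real rsync jobs are puppet-managed). rsync's --bwlimit is in KiB per second, so 40000 is roughly 40 MB/s:

rsync -a --bwlimit=40000 /data/xmldatadumps/public/ dumpsdata1002.eqiad.wmnet::data/xmldatadumps/public/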

This month's run looks good: no nulls in stub files, and no other weirdness either, so I'm going to close this.