
nfs-exports.service is failing on labstore1001 often
Closed, Resolved · Public

Description

From today:

daemon.log:Dec 22 03:49:56 labstore1001 systemd[1]: Unit nfs-exports.service entered failed state.
daemon.log:Dec 22 04:44:26 labstore1001 systemd[1]: Unit nfs-exports.service entered failed state.
daemon.log:Dec 22 05:08:57 labstore1001 systemd[1]: Unit nfs-exports.service entered failed state.
daemon.log:Dec 22 14:17:43 labstore1001 systemd[1]: Unit nfs-exports.service entered failed state.
daemon.log:Dec 22 15:03:51 labstore1001 systemd[1]: Unit nfs-exports.service entered failed state.
daemon.log:Dec 22 15:22:14 labstore1001 systemd[1]: Unit nfs-exports.service entered failed state.
daemon.log:Dec 22 15:31:59 labstore1001 systemd[1]: Unit nfs-exports.service entered failed state.
daemon.log:Dec 22 18:36:05 labstore1001 systemd[1]: Unit nfs-exports.service entered failed state.
daemon.log:Dec 22 19:43:30 labstore1001 systemd[1]: Unit nfs-exports.service entered failed state.
daemon.log:Dec 22 21:57:17 labstore1001 systemd[1]: Unit nfs-exports.service entered failed state.

Seems to go back to at least the 14th:

daemon.log.1:Dec 14 09:49:38 labstore1001 systemd[1]: Unit nfs-exports.service entered failed state.
daemon.log.1:Dec 14 14:11:16 labstore1001 systemd[1]: Unit nfs-exports.service entered failed state.
daemon.log.1:Dec 14 18:49:14 labstore1001 systemd[1]: Unit nfs-exports.service entered failed state.

It appears 74 times or so in the logs we have. I ran it with --dry-run and did a bit of debugging. I also ran it with --debug, but that didn't produce much in the way of extra output (and I haven't been able to make it fail on demand). @yuvipanda, I was hoping you would have a good idea.

We are seeing a lot of intermittent weirdness in labs, and I'm wondering if this is related, à la T121998.

Event Timeline

chasemp created this task. Dec 22 2015, 10:22 PM
chasemp assigned this task to yuvipanda.
chasemp raised the priority of this task to High.
chasemp updated the task description.
chasemp added subscribers: chasemp, yuvipanda.
Restricted Application added a subscriber: Aklapper. Dec 22 2015, 10:22 PM
chasemp updated the task description. Dec 22 2015, 10:38 PM
chasemp set Security to None.
yuvipanda lowered the priority of this task from High to Normal. Dec 23 2015, 9:39 AM

I am pretty sure this is all LDAP flakiness, since this script hits LDAP pretty heavily. It's also a low-criticality service: if it fails, new instances in projects that have NFS don't get NFS for a while, so it isn't that bad. I am investigating T121998 with @paravoid now and am pretty sure it's unrelated.

https://gerrit.wikimedia.org/r/#/c/261266/ is up for mitigation, since most of the failures seem to be for the really large tools project (which has more than 100 instances).
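
For illustration, here is a rough sketch of the shape of the loop involved, with the fetch error isolated per project; the names, structure, and exception handling are my assumptions, not the actual daemon code or the content of that patch:

import json
import logging
from urllib.request import urlopen  # urllib2.urlopen on Python 2

# Illustrative only: the per-project novainstances query, as seen in
# the traceback and curl command below.
API = ('https://wikitech.wikimedia.org/w/api.php?action=query'
       '&list=novainstances&niproject=%s&niregion=eqiad&format=json')

def get_instance_ips(project, timeout=5):
    # Shape of the failing call from the traceback; pulling the IPs
    # out of the response is elided.
    return json.loads(urlopen(API % project, timeout=timeout)
                      .read().decode('utf-8'))

def sync_exports(projects):
    for project in projects:
        try:
            instances = get_instance_ips(project)
        except (OSError, ValueError) as exc:
            # socket.timeout is an OSError subclass on Python 3, so a
            # slow project (e.g. tools, with 100+ instances) loses only
            # its own update instead of failing the whole run.
            logging.warning('skipping %s: %s', project, exc)
            continue
        # ... regenerate the NFS exports for this project's instances ...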

chasemp added a comment. (Edited) Dec 28 2015, 9:12 PM

For posterity, I did a bunch of digging on this to arrive at the above:

sudo journalctl -u nfs-exports.service --no-pager > /home/rush/nfsexport.log

Dec 22 15:03:51 labstore1001 nfs-exports-daemon[23045]: Traceback (most recent call last):
Dec 22 15:03:51 labstore1001 nfs-exports-daemon[23045]: File "/usr/local/bin/nfs-exports-daemon", line 84, in get_instance_ips
Dec 22 15:03:51 labstore1001 nfs-exports-daemon[23045]: data = json.loads(urlopen(url, timeout=5).read().decode('utf-8'))
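
That frame is a plain urlopen with a 5-second timeout and nothing catching socket.timeout, so a single slow wikitech response kills the process and systemd records the failed state. One way to provoke the same exception by hand, with an artificially tiny timeout (illustrative, not from the daemon):

from urllib.request import urlopen  # urllib2.urlopen on Python 2

url = ('https://wikitech.wikimedia.org/w/api.php?action=query'
       '&list=novainstances&niproject=tools&niregion=eqiad&format=json')

# An unrealistically small timeout forces the same failure path the
# traceback shows: socket.timeout (or a URLError wrapping it) raised
# out of urlopen()/read().
urlopen(url, timeout=0.01).read()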

So the script seems to have been failing when it can't grab project info from wikitech, and sorting the errors shows they correlate with projects that have lots of instances (that was the thinking, anyway):

labstore1001:/home/rush# grep -i 'Error' nfsexport.log  | awk '{print $16}' | sort | uniq -c
      1 deployment-prep
      1 megacron
      1 openstack
     61 tools

I did a bunch of poking, like manual fetches and digging through wikitech logs for instances of failure, and didn't come up with anything definitive:

curl 'https://wikitech.wikimedia.org/w/api.php?action=query&list=novainstances&niproject=tools&niregion=eqiad&format=json'

But I hope increasing the timeout (an arbitrary 5s at the time) a bit to see if that helps is reasonable for now, as the request seems to take more than 3s for most of my attempts, and some variance on top of that is not unreasonable given wikitech load.
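
A quick way to sanity-check that timing claim independently of the daemon (the repeat count and 15s cap here are arbitrary):

import time
from urllib.request import urlopen  # urllib2.urlopen on Python 2

url = ('https://wikitech.wikimedia.org/w/api.php?action=query'
       '&list=novainstances&niproject=tools&niregion=eqiad&format=json')

# Time a handful of fetches: if typical latency is already over 3s,
# normal variance under load makes a 5s timeout easy to blow through.
for _ in range(5):
    start = time.time()
    urlopen(url, timeout=15).read()
    print('%.2fs' % (time.time() - start))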

chasemp closed this task as Resolved. Jan 12 2016, 5:31 PM

I don't see any of this in January at all now.