
HDFS check topology alert is currently broken
Closed, Resolved (Public)

Description

As noticed by Ben in https://gerrit.wikimedia.org/r/c/operations/puppet/+/728391/, an-worker1133 is not listed correctly in our HDFS topology map, so it ends up in the default rack:

elukey@an-master1001:~$ sudo -u hdfs hdfs dfsadmin -printTopology
[..]

Rack: /eqiad/default/rack
   10.64.36.9:50010 (an-worker1133.eqiad.wmnet)

The strange thing is that we have an alert for this use case in profile::hadoop::master, which should execute the following:

elukey@an-master1001:~$ /usr/bin/sudo /usr/local/bin/kerberos-run-command hdfs /usr/local/lib/nagios/plugins/check_hdfs_topology
CRITICAL: There is at least one node in the default rack.
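For readers unfamiliar with this kind of check, here is a minimal Python sketch of how a topology check could turn `hdfs dfsadmin -printTopology` output into a Nagios-style exit code. It is not the actual check_hdfs_topology plugin: the function names, the `DEFAULT_RACK` constant, and the choice to return UNKNOWN when no racks are found are all assumptions made for illustration.

```python
# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Rack name as it appears in the printTopology output quoted above.
DEFAULT_RACK = "/eqiad/default/rack"


def parse_topology(text):
    """Map each rack name to the list of datanode lines printed under it."""
    racks = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Rack:"):
            current = line.split(":", 1)[1].strip()
            racks[current] = []
        elif line and current is not None:
            racks[current].append(line)
    return racks


def check_topology(text, default_rack=DEFAULT_RACK):
    """Return (exit_code, message) for a printTopology dump."""
    racks = parse_topology(text)
    if not racks:
        # Hypothetical safeguard: empty output most likely means the command
        # failed, so report UNKNOWN instead of silently returning OK.
        return UNKNOWN, "UNKNOWN: no racks found in topology output."
    if racks.get(default_rack):
        return CRITICAL, "CRITICAL: There is at least one node in the default rack."
    return OK, "OK"
```

A real plugin would print the message and call `sys.exit()` with the code, since the monitoring server derives the alert state from the exit status rather than from the text.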

So the local check seems to work, but for some reason Icinga thinks that everything is OK (the alert is green). I tried to force another run from the UI, and this is what the nagios server on an-master1001 logs:

Oct 08 14:03:43 an-master1001 sudo[14391]:   nagios : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/local/bin/kerberos-run-command hdfs /usr/local/bin/check_hdfs_active_namenode
Oct 08 14:03:43 an-master1001 sudo[14391]: pam_unix(sudo:session): session opened for user root by (uid=0)
Oct 08 14:03:43 an-master1001 kerberos-run-command[14392]: User nagios executes as user hdfs the command ['/usr/local/bin/check_hdfs_active_namenode']

That, again, looks good. On the alert1001 side, this should be what gets executed:

elukey@alert1001:~$ /usr/lib/nagios/plugins/check_nrpe -2 -u -H an-master1001.eqiad.wmnet -c check_check_hdfs_topology -t 10
connect to address 2620:0:861:104:10:64:5:26 port 5666: Connection refused
OK
elukey@alert1001:~$ /usr/lib/nagios/plugins/check_nrpe -4 -2 -u -H an-master1001.eqiad.wmnet -c check_check_hdfs_topology -t 10
OK

I see the same output for both on an-master1001's nagios server; no idea why this happens.
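When comparing the local run with the NRPE run, it can help to record the exit status next to the printed text, because Nagios-style plugins signal their state through the exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The helper below is a hypothetical sketch: `run_check` is not part of any existing tooling, and the demonstration uses a stand-in command rather than the real check.

```python
import subprocess
import sys

# Standard Nagios/Icinga plugin exit codes and their state names.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_check(argv, timeout=10):
    """Run a plugin command and return (state_name, exit_code, first_output_line)."""
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    lines = proc.stdout.strip().splitlines()
    first_line = lines[0] if lines else ""
    # Any exit code outside 0-3 is treated as UNKNOWN.
    return STATES.get(proc.returncode, "UNKNOWN"), proc.returncode, first_line


# Stand-in plugin for demonstration: prints a message and exits with status 2.
fake_plugin = [sys.executable, "-c", "import sys; print('CRITICAL: demo'); sys.exit(2)"]
print(run_check(fake_plugin))  # ('CRITICAL', 2, 'CRITICAL: demo')
```

Passing the check_nrpe command line from alert1001 and the local plugin command line from an-master1001 as `argv` would show whether the two sides disagree on the exit code, the text, or both.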

Event Timeline

elukey triaged this task as High priority. Oct 8 2021, 2:05 PM
elukey created this task.

I can confirm this, but I haven't yet found the reason for it.

As an aside, I'm not sure that the logs in the third code block in the description are correct.
The log lines refer to check_hdfs_active_namenode instead of check_hdfs_topology.

I checked /var/log/auth.log on an-master1001 after running /usr/lib/nagios/plugins/check_nrpe -4 -H an-master1001.eqiad.wmnet -c check_check_hdfs_topology on alert1001.

The results in the log file look like this, which seems fine to me.

Oct  8 15:37:24 an-master1001 sudo:   nagios : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/local/bin/kerberos-run-command hdfs /usr/local/lib/nagios/plugins/check_hdfs_topology
Oct  8 15:37:24 an-master1001 sudo: pam_unix(sudo:session): session opened for user root by (uid=0)
Oct  8 15:37:24 an-master1001 sudo:     root : TTY=unknown ; PWD=/ ; USER=hdfs ; COMMAND=/usr/bin/hdfs dfsadmin -printTopology
Oct  8 15:37:24 an-master1001 sudo: pam_unix(sudo:session): session opened for user hdfs by (uid=0)
Oct  8 15:37:26 an-master1001 sudo: pam_unix(sudo:session): session closed for user hdfs
Oct  8 15:37:26 an-master1001 sudo: pam_unix(sudo:session): session closed for user root
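The log lines above show two sudo invocations, one nested inside the other (nagios to root, then root to hdfs). As a rough aid for spotting such chains in auth.log, the following Python sketch extracts the caller, target user, and command from each sudo line. The regular expression and the `sudo_calls` helper are assumptions written for this log format, not an existing tool.

```python
import re

# Matches sudo command lines such as:
#   "... sudo:   nagios : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/path args"
# The optional "[pid]" covers the journald-style "sudo[14391]:" variant.
SUDO_RE = re.compile(
    r"sudo(?:\[\d+\])?:\s+(?P<caller>\S+) : .*?USER=(?P<target>\S+) ; COMMAND=(?P<command>.+)$"
)


def sudo_calls(log_lines):
    """Return (caller, target_user, command) for each sudo invocation line."""
    calls = []
    for line in log_lines:
        match = SUDO_RE.search(line)
        if match:
            calls.append((match["caller"], match["target"], match["command"].strip()))
    return calls
```

Lines from pam_unix (session opened/closed) do not match the pattern and are skipped, so only the actual command invocations are returned.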

Still investigating the cause.

Ah yes, I copied the wrong block, but the commands executed for the HDFS topology check are basically the same! Sorry :)

Not a problem. It's just a bit of a mystery why the correct return code isn't coming back over NRPE.

Change 728562 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] hadoop: remove sudo usage in the check_hdfs_topology

https://gerrit.wikimedia.org/r/728562

Change 728562 merged by Elukey:

[operations/puppet@production] hadoop: remove sudo usage in the check_hdfs_topology

https://gerrit.wikimedia.org/r/728562

The change seems to have worked; we are now seeing an alert for topology. In theory, after merging and deploying https://gerrit.wikimedia.org/r/c/operations/puppet/+/728391 we should see a recovery as well.

Thanks @elukey - I confirm that your fix of removing the double sudo call seems to have fixed the check.
I'll merge and deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/728391 now.

The alert has now been fixed. Deploying the change above, followed by a restart of the Hadoop masters, resulted in the check passing again. Marking this as done.