
boron passive checks aren't being collected
Closed, Resolved · Public

Description

boron appears to be sending passive checks to neon...

Feb 19 19:35:02 boron nagios_nsca[5078]: STDOUT [208.80.154.14] 1 data packet(s) sent to host successfully.

...and they appear to be making it to the nsca collector...

root@neon:~# grep boron /var/lib/nagios/rw/nsca.dump |wc

80     891   11363

...but icinga is reporting them awol. Why?

Event Timeline

Jgreen created this task. Feb 19 2015, 7:41 PM
Jgreen claimed this task.
Jgreen raised the priority of this task to Medium.
Jgreen updated the task description.
Jgreen added a project: acl*sre-team.
Jgreen added a subscriber: Jgreen.
Restricted Application added a subscriber: Aklapper. Feb 19 2015, 7:41 PM

probably relevant -- we recently upgraded boron from precise to trusty, and someone mentioned that nsca may be broken for trusty?

faidon added a subscriber: faidon. Mar 27 2015, 3:52 PM

Notifications for boron have been disabled. Whoever fixes this must re-enable them or we will lose future alerts.

Dzahn added a subscriber: Dzahn. Mar 27 2015, 4:33 PM

> probably relevant -- we recently upgraded boron from precise to trusty, and someone mentioned that nsca may be broken for trusty?

how recent was that? i looked at the logfile as you mentioned and the last entry is:

[1419344402] PROCESS_SERVICE_CHECK_RESULT;boron;check_zombie;0;PROCS OK: 0 processes with STATE = Z

that's a UNIX timestamp that converts to Tue, 23 Dec 2014 14:20:02 GMT

so it looks more like neon stopped receiving them around last Christmas.
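For anyone re-checking the date: a minimal Python sketch that splits a line of the form quoted above and converts its epoch timestamp to UTC. The function name and the returned field names are made up for illustration, and the sketch assumes every line has exactly the shape of the entry quoted above.

```python
from datetime import datetime, timezone


def parse_passive_result(line):
    """Split '[<epoch>] <COMMAND>;<host>;<service>;<code>;<output>' into fields.

    Hypothetical helper; assumes the line format of the entry quoted above.
    """
    stamp, _, rest = line.partition("] ")
    epoch = int(stamp.lstrip("["))
    # The plugin output may itself contain ';', so split at most 4 times.
    command, host, service, code, output = rest.split(";", 4)
    return {
        "when": datetime.fromtimestamp(epoch, tz=timezone.utc),
        "command": command,
        "host": host,
        "service": service,
        "code": int(code),
        "output": output,
    }


entry = parse_passive_result(
    "[1419344402] PROCESS_SERVICE_CHECK_RESULT;boron;check_zombie;0;"
    "PROCS OK: 0 processes with STATE = Z"
)
print(entry["host"], entry["service"], entry["when"].isoformat())
```

Running the same function over every boron line in the dump and taking the largest `when` would give the time of the last result that actually arrived.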

regarding firewalling:

neon has a hole for:

ACCEPT tcp -- 10.64.40.0/24 anywhere tcp dpt:nsca

and boron.frack.eqiad.wmnet has address 10.64.40.66
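A quick way to double-check that boron's address falls inside the range the rule accepts, sketched with Python's standard `ipaddress` module. The helper name is hypothetical; the address and network are the ones quoted above.

```python
import ipaddress


def source_allowed(address, network):
    """Return True if `address` lies inside the CIDR range `network`."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(network)


# boron's address against the ACCEPT rule's source range
print(source_allowed("10.64.40.66", "10.64.40.0/24"))
```

This only confirms the source range matches; it says nothing about whether the packets are actually leaving boron.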

So i ran tcpdump to check for traffic on 5667 (that's nsca) and i see incoming packets from aluminium, bellatrix, payments* and several others (codfw), yet those are not defined hosts in icinga so they won't match anything. the hosts would have to be added to site.pp!?

I don't see packets from boron yet.

regarding boron: i managed to get a shell as unprivileged user, but please remind me how to get root?

nevermind, i found the sudo password for boron.

so i can also confirm outgoing firewall rule seems to be there, destination port 5667 on neon's IP. but indeed, no packets seem to be going out when checking with tcpdump dst port 5667

confirmed send_nsca is installed on boron and can send packets over to neon:

@boron:/etc/cron.d# /usr/sbin/send_nsca -H neon.wikimedia.org

<-->

@neon:~# tcpdump port 5667 | grep boron

also:

@boron:/etc/cron.d# echo -e "boron\tcheck_disk\t0\ttest" | /usr/sbin/send_nsca -H neon.wikimedia.org -c /etc/send_nsca.cfg

1 data packet(s) sent to host successfully.
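For reference, the echo above feeds send_nsca a single tab-separated line: host, service description, return code, plugin output. A rough Python sketch of building that line and piping it to the binary follows. The function names are hypothetical, and the only flags used are the `-H` and `-c` seen in the command above; the default paths are simply the ones from this task.

```python
import subprocess


def format_check_result(host, service, return_code, output):
    """Build one tab-separated passive check line, newline-terminated."""
    return f"{host}\t{service}\t{return_code}\t{output}\n"


def submit_check_result(line, collector,
                        binary="/usr/sbin/send_nsca",
                        config="/etc/send_nsca.cfg"):
    """Pipe `line` to send_nsca on stdin and return whatever it prints.

    Assumes the binary and config live at the paths used on boron above.
    """
    done = subprocess.run(
        [binary, "-H", collector, "-c", config],
        input=line, text=True, capture_output=True, check=True,
    )
    return done.stdout


line = format_check_result("boron", "check_disk", 0, "test")
print(repr(line))
```

Calling `submit_check_result(line, "neon.wikimedia.org")` would then be the equivalent of the shell pipeline above; it is left uncalled here because it needs the real binary and config.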

refs:

https://gerrit.wikimedia.org/r/#/c/520/2
https://rt.wikimedia.org/Ticket/Display.html?id=1151#txn-29893
https://rt.wikimedia.org/Ticket/Display.html?id=1151#txn-30073


also found this:

https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Webrequest_partitions#Monitoring

"Icinga runs on precise, and trusty's send_nsca and precise's Icinga cannot talk to each other"

confirmed it's trusty's version of send_nsca. i could use the one from precise and it worked:

echo -e "boron\tcheck_disk\t0\ttest" | /tmp/send_nsca -H neon.wikimedia.org -c /etc/send_nsca.cfg

[1427501526] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;boron;check_load;0;test
[1427501536] PASSIVE SERVICE CHECK: boron;check_load;0;test
[1427501536] SERVICE ALERT: boron;check_load;OK;HARD;2;test

now the only question is: how should we install the precise version on trusty properly?

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=boron

works again. re-enabled notifications. but it's a hack for now. so don't close yet.

Frack has an internal repo on boron, using reprepro. I guess we build a package and host it there.

Jgreen added a comment. Apr 2 2015, 1:44 PM

hacked around this with puppet in frack: it clobbers the trusty binary with the one from precise, and updates nsca-client.md5sums with the correct md5sum
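Not the actual puppet code, just a Python sketch of what the md5sums half of that hack amounts to: compute the digest of the replacement binary and swap it into the matching line. It assumes the usual dpkg md5sums layout (digest, two spaces, path without a leading slash) and that the entry for the binary is `usr/sbin/send_nsca`; both function names are hypothetical.

```python
import hashlib


def md5_hex(data):
    """Hex MD5 digest of a bytes object (e.g. the replacement binary)."""
    return hashlib.md5(data).hexdigest()


def update_md5sums(md5sums_text, rel_path, new_digest):
    """Return md5sums_text with the digest for rel_path replaced.

    Lines for other paths are passed through unchanged.
    """
    lines = []
    for line in md5sums_text.splitlines():
        _digest, _, path = line.partition("  ")
        if path == rel_path:
            line = f"{new_digest}  {rel_path}"
        lines.append(line)
    return "\n".join(lines) + "\n"


# Illustrative data only: placeholder digests and a stand-in for the binary.
old_text = (
    "0" * 32 + "  usr/sbin/send_nsca\n"
    + "1" * 32 + "  usr/share/doc/nsca-client/copyright\n"
)
new_digest = md5_hex(b"stand-in for the precise send_nsca binary")
new_text = update_md5sums(old_text, "usr/sbin/send_nsca", new_digest)
print(new_text, end="")
```

Keeping the md5sums entry in step with the swapped binary means integrity checks against the package's recorded checksums should not flag the file as modified.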

Jgreen closed this task as Resolved. Apr 2 2015, 1:45 PM