
Fix labstore checks on cloudstore1008/9
Closed, Resolved | Public | 0 Estimated Story Points

Description

All labstore status checks for cloudstore1008/9 are showing "UNKNOWN: No valid datapoints found" due to a configuration problem. This is likely a Diamond vs. Prometheus mismatch.
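For readers unfamiliar with how a metric-backed check ends up in this state, here is a minimal sketch of the general pattern, not the actual labstore check. It assumes a Nagios-style plugin that receives a list of datapoints (with `None` for gaps) and hypothetical `warn`/`crit` thresholds. The only parts taken as given are the standard Nagios exit codes and the message text quoted above.

```python
from typing import Optional, Sequence, Tuple

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(
    datapoints: Sequence[Optional[float]], warn: float, crit: float
) -> Tuple[int, str]:
    """Map a metric series to a Nagios state (hypothetical helper).

    If the metric backend returns nothing usable, for example because the
    collector that should publish the metric is not running, the check can
    only report UNKNOWN.
    """
    valid = [point for point in datapoints if point is not None]
    if not valid:
        return UNKNOWN, "UNKNOWN: No valid datapoints found"

    latest = valid[-1]
    if latest >= crit:
        return CRITICAL, f"CRITICAL: latest value {latest} >= {crit}"
    if latest >= warn:
        return WARNING, f"WARNING: latest value {latest} >= {warn}"
    return OK, f"OK: latest value {latest}"
```

With no collector feeding the metric, every check built this way returns UNKNOWN regardless of thresholds, which matches the symptom of all checks failing at once.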

Event Timeline

Bstorm triaged this task as Medium priority. Jun 6 2019, 11:28 PM
Bstorm created this task.

Change 516967 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: re-enable diamond collectors for monitoring

https://gerrit.wikimedia.org/r/516967

Change 516967 merged by Bstorm:
[operations/puppet@production] cloudstore: re-enable diamond collectors for monitoring

https://gerrit.wikimedia.org/r/516967

OK, so that fixed all the checks except the NFS service one. That one might be a Stretch incompatibility.
Since re-enabling Diamond was the fix, I am now wondering what replacement, if any, exists for paging on a trend. I haven't seen anything like that running against Prometheus here so far, so I'll have to dig deeper.

Nope, the NFS one is purely an issue of firewall rules. I'll have to figure out exactly what to open up in order to make showmount -e work.

So this needs to connect to port 111 over UDP from the public network, which it is currently not allowed to do.
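A quick way to confirm whether a firewall rule like this is in effect is to send an RPC NULL call to rpcbind over UDP and see whether anything comes back. The sketch below is one way to do that, assuming IPv4 and the portmapper program number 100000, version 2; the function names and the default timeout are made up for illustration. It only probes rpcbind on port 111, so a positive result does not by itself guarantee that showmount -e will succeed.

```python
import socket
import struct

PMAP_PROG = 100000   # portmapper/rpcbind RPC program number
PMAP_VERS = 2        # portmapper protocol version
RPCBIND_PORT = 111


def build_null_call(xid: int) -> bytes:
    """Build an ONC RPC v2 CALL for procedure 0 (NULL) with AUTH_NONE."""
    # Fields: xid, msg_type=CALL(0), rpcvers=2, prog, vers, proc=0,
    # cred flavor/length = 0/0, verifier flavor/length = 0/0.
    return struct.pack(">10I", xid, 0, 2, PMAP_PROG, PMAP_VERS, 0, 0, 0, 0, 0)


def is_accepted_reply(data: bytes, xid: int) -> bool:
    """Return True if data is a successful RPC reply matching xid."""
    if len(data) < 24:
        return False
    rxid, mtype, reply_stat, _flavor, verf_len = struct.unpack(">5I", data[:20])
    if rxid != xid or mtype != 1 or reply_stat != 0:  # REPLY, MSG_ACCEPTED
        return False
    offset = 20 + ((verf_len + 3) // 4) * 4  # skip padded verifier body
    if len(data) < offset + 4:
        return False
    (accept_stat,) = struct.unpack(">I", data[offset:offset + 4])
    return accept_stat == 0  # SUCCESS


def rpcbind_udp_reachable(host: str, timeout: float = 3.0,
                          xid: int = 0x1234ABCD) -> bool:
    """Send the NULL call over UDP; False on timeout or network error."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_null_call(xid), (host, RPCBIND_PORT))
            data, _addr = sock.recvfrom(1024)
        except OSError:  # includes socket.timeout
            return False
    return is_accepted_reply(data, xid)
```

Running `rpcbind_udp_reachable("<host>")` from the monitoring side before and after a firewall change gives a simple yes/no signal. Because UDP is connectionless, a dropped packet and a blocked port look the same, so a single False is worth retrying.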

Change 517470 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: move secondary monitoring stuff into profile and fix it

https://gerrit.wikimedia.org/r/517470

Change 517470 merged by Bstorm:
[operations/puppet@production] cloudstore: move secondary monitoring stuff into profile and fix it

https://gerrit.wikimedia.org/r/517470

Change 518079 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: add sudo config for the nagios user

https://gerrit.wikimedia.org/r/518079

Change 518079 merged by Bstorm:
[operations/puppet@production] cloudstore: add sudo config for the nagios user

https://gerrit.wikimedia.org/r/518079

And now that Puppet has run with that change, I see we are green.

Change 524528 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] cloudstore: allow rpc.mountd traffic between hosts

https://gerrit.wikimedia.org/r/524528

That patch ^ fixes the NRPE error "CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds" that appears when the host not supporting the VIP runs showmount.
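To illustrate why a blocked showmount surfaces as an NRPE socket timeout rather than a clear failure, here is a rough sketch of a check wrapper that bounds the command with its own, shorter timeout. This is not the check deployed by the patches above. The function names, the 8-second limit, and the mapping of a non-zero exit to CRITICAL are all assumptions; only `showmount -e` and the Nagios exit codes are taken as given.

```python
import subprocess
import sys
from typing import List, Tuple

# Standard Nagios plugin exit codes used here.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def run_check(cmd: List[str], timeout: float) -> Tuple[int, str]:
    """Run cmd and translate its outcome into a Nagios state and message."""
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        # A hung RPC call (e.g. traffic dropped by a firewall) ends up here.
        return UNKNOWN, f"UNKNOWN: command timed out after {timeout:g} seconds"
    except FileNotFoundError:
        return UNKNOWN, f"UNKNOWN: {cmd[0]} not found"
    if proc.returncode != 0:
        detail = proc.stderr.strip() or f"exit code {proc.returncode}"
        return CRITICAL, f"CRITICAL: {detail}"
    return OK, "OK: command succeeded"


def main(argv: List[str]) -> int:
    """Hypothetical plugin entry point: check_showmount HOST."""
    if len(argv) != 2:
        print("usage: check_showmount HOST", file=sys.stderr)
        return UNKNOWN
    # 8 seconds is an assumed value chosen to stay under NRPE's 10 seconds.
    state, message = run_check(["showmount", "-e", argv[1]], timeout=8.0)
    print(message)
    return state
```

A plugin script would finish with `sys.exit(main(sys.argv))`. Keeping the inner timeout below the NRPE one means the alert text names the command that hung instead of a generic socket timeout.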

Change 524528 merged by Jhedden:
[operations/puppet@production] cloudstore: allow rpc.mountd traffic between hosts

https://gerrit.wikimedia.org/r/524528