
Fix labstore checks on cloudstore1008/9
Closed, Resolved · Public · 0 Story Points

Description

The labstore status checks for cloudstore1008/9 are all showing "UNKNOWN: No valid datapoints found" because of a configuration problem. This is likely a diamond vs. prometheus mismatch.

Event Timeline

Bstorm triaged this task as Normal priority. Jun 6 2019, 11:28 PM
Bstorm created this task.
Restricted Application added a subscriber: Aklapper. Jun 6 2019, 11:28 PM
Bstorm moved this task from Backlog to Shared Storage on the Data-Services board. Jun 6 2019, 11:28 PM
Bstorm moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

Change 516967 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: re-enable diamond collectors for monitoring

https://gerrit.wikimedia.org/r/516967

Change 516967 merged by Bstorm:
[operations/puppet@production] cloudstore: re-enable diamond collectors for monitoring

https://gerrit.wikimedia.org/r/516967

OK, so that fixed all the checks except the NFS service one. That one might be a stretch incompatibility.
Since re-enabling diamond was the fix, I am now wondering what replacement, if any, there is for paging on a trend. I haven't seen anything like that running against prometheus here so far. I'll have to dig deeper.
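For reference, a minimal sketch of what "paging on a trend" could mean, independent of whether the datapoints come from diamond or prometheus. This is not the check that runs on these hosts; the function name `evaluate_trend`, the `min_fraction` parameter, and the message wording other than the UNKNOWN text are assumptions made for illustration.

```python
from typing import Optional, Sequence


def evaluate_trend(
    points: Sequence[Optional[float]],
    threshold: float,
    min_fraction: float = 0.5,
) -> str:
    """Classify a window of datapoints by how many exceed a threshold.

    Hypothetical helper: None stands for a missing sample in the window.
    """
    # Drop missing samples; a window with nothing left cannot be judged.
    valid = [p for p in points if p is not None]
    if not valid:
        return "UNKNOWN: No valid datapoints found"

    # Alert on the share of the window over the threshold, not on one spike.
    over = sum(1 for p in valid if p > threshold)
    fraction = over / len(valid)
    if fraction >= min_fraction:
        return f"CRITICAL: {fraction:.0%} of datapoints over {threshold}"
    return f"OK: {fraction:.0%} of datapoints over {threshold}"
```

A window with no samples at all, which is what a host without the expected collector would produce, falls into the UNKNOWN branch rather than OK or CRITICAL.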

Nope, the NFS one is purely an issue of firewall rules. I'll have to figure out exactly what to open up in order to make showmount -e work.

So this needs to connect to port 111 over UDP from the public network, which it is currently not allowed to do.
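One way to confirm whether UDP port 111 is reachable from a given network is to send an ONC RPC NULL call to the portmapper and wait for a reply. The sketch below is an illustration only, assuming IPv4 and portmapper version 2; the function names, the default timeout, and the fixed transaction id are hypothetical and not taken from the patches on this task.

```python
import socket
import struct

PORTMAP_PROGRAM = 100000  # well-known RPC program number for portmapper/rpcbind
PORTMAP_VERSION = 2
RPCBIND_PORT = 111


def build_null_call(xid: int) -> bytes:
    """Build an RPC v2 CALL for procedure 0 (NULL) with AUTH_NULL credentials."""
    # Fields: xid, msg_type=CALL(0), rpcvers=2, prog, vers, proc=0,
    # cred flavor/length (0, 0), verifier flavor/length (0, 0).
    return struct.pack(
        ">10I", xid, 0, 2, PORTMAP_PROGRAM, PORTMAP_VERSION, 0, 0, 0, 0, 0
    )


def is_accepted_reply(data: bytes, xid: int) -> bool:
    """Return True if data is an accepted, successful reply to our xid."""
    if len(data) < 24:
        return False
    rxid, mtype, reply_stat, _flavor, verf_len = struct.unpack(">5I", data[:20])
    # msg_type 1 is REPLY, reply_stat 0 is MSG_ACCEPTED.
    if rxid != xid or mtype != 1 or reply_stat != 0:
        return False
    # Skip the verifier body, which is padded to a multiple of 4 bytes.
    offset = 20 + ((verf_len + 3) // 4) * 4
    if len(data) < offset + 4:
        return False
    (accept_stat,) = struct.unpack(">I", data[offset:offset + 4])
    return accept_stat == 0  # SUCCESS


def probe_rpcbind_udp(host: str, timeout: float = 3.0, xid: int = 0x1234) -> bool:
    """Send the NULL call over UDP; False on timeout or any socket error."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_null_call(xid), (host, RPCBIND_PORT))
            data, _addr = sock.recvfrom(1024)
        except OSError:
            return False
    return is_accepted_reply(data, xid)
```

A firewall that silently drops the packet shows up here as a timeout, so `probe_rpcbind_udp` returns False without distinguishing that case from a host where rpcbind is not running.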

Change 517470 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: move secondary monitoring stuff into profile and fix it

https://gerrit.wikimedia.org/r/517470

Change 517470 merged by Bstorm:
[operations/puppet@production] cloudstore: move secondary monitoring stuff into profile and fix it

https://gerrit.wikimedia.org/r/517470

Change 518079 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: add sudo config for the nagios user

https://gerrit.wikimedia.org/r/518079

Change 518079 merged by Bstorm:
[operations/puppet@production] cloudstore: add sudo config for the nagios user

https://gerrit.wikimedia.org/r/518079

Bstorm closed this task as Resolved. Jun 20 2019, 5:54 PM

And when that runs on puppet, I see we are green.

Change 524528 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] cloudstore: allow rpc.mountd traffic between hosts

https://gerrit.wikimedia.org/r/524528

That patch ^ fixes the NRPE error "CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds", which occurs when the host not supporting the VIP runs showmount.
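As an aside, a check wrapper can give up on showmount before NRPE's 10 second socket timeout, so a blocked RPC path is reported by the check itself instead of surfacing as a generic NRPE timeout. The sketch below assumes that approach; `check_showmount`, its 8 second default, the `runner` parameter, and the message wording are hypothetical and do not describe the check deployed by the patches here.

```python
import subprocess

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_showmount(host, timeout=8.0, runner=subprocess.run):
    """Run `showmount -e host` and map the outcome to (exit_code, message).

    `runner` is injectable so the logic can be exercised without an NFS server.
    """
    try:
        result = runner(
            ["showmount", "-e", host],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # Typical symptom when rpcbind or rpc.mountd traffic is firewalled.
        return CRITICAL, f"CRITICAL: showmount -e {host} timed out after {timeout}s"
    except OSError as exc:
        # For example, the showmount binary is not installed.
        return UNKNOWN, f"UNKNOWN: could not run showmount: {exc}"

    if result.returncode != 0:
        return CRITICAL, f"CRITICAL: showmount failed: {result.stderr.strip()}"

    # The first output line is a header; the remaining lines are exports.
    exports = result.stdout.splitlines()[1:]
    return OK, f"OK: {len(exports)} export(s) listed by {host}"
```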

Change 524528 merged by Jhedden:
[operations/puppet@production] cloudstore: allow rpc.mountd traffic between hosts

https://gerrit.wikimedia.org/r/524528