
icinga config error for new rows E/R
Closed, ResolvedPublic

Description

When I set up a new host via T302937, Icinga started throwing errors because it does not know "lsw1-f1-eqiad.mgmt.eqiad.wmnet", yet the host attempts to use it as a parent.

Event Timeline

So I had to decom the host overnight so I wouldn't leave Icinga broken. However, I'm not sure how to add lsw1-f1-eqiad.mgmt.eqiad.wmnet so that it works like lsw1-e3-eqiad.mgmt.eqiad.wmnet.

Failed to run Homer on lsw1-f1-eqiad.mgmt.eqiad.wmnet: Command '['/usr/local/bin/homer', 'lsw1-f1-eqiad.mgmt.eqiad.wmnet', 'commit', 'Host decommission - robh@cumin1001 - T302937']' returned non-zero exit status 1.

The decom script also failed; it seems it cannot account for this.

When this host was installed and added to Icinga config by puppet, it broke Icinga config. The error was:

Error: 'lsw1-f1-eqiad.mgmt.eqiad.wmnet' is not a valid parent for host 'dumpsdata1007'

When looking at "parents" in the Icinga config, it is noteworthy that almost none of them use an FQDN; nearly all have short names like:

objects/puppet_hosts.cfg:	parents                        msw1-eqiad
objects/puppet_hosts.cfg:	parents                        asw2-d-eqiad

The new switches, though, use FQDNs and point to full .mgmt. names like:

lsw1-e3-eqiad.mgmt.eqiad.wmnet

This is not the real issue, though, because Icinga is just fine with "lsw1-e3-eqiad.mgmt.eqiad.wmnet" in the config.

What it does NOT like though is "lsw1-f1-eqiad.mgmt.eqiad.wmnet".

This is relatively impactful: if Icinga or alert1001 were restarted for any reason, Icinga would be completely down because of this config error.

Once the host was decom'ed again and puppet had removed all its checks from the Icinga config, Icinga was happy again.
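
For illustration, the failure mode above (a "parents" entry naming a host that has no definition of its own) can be caught before a restart with a small check. This is only a minimal sketch: it assumes Nagios-style `define host { ... }` blocks like those in objects/puppet_hosts.cfg, and the function name and sample config are hypothetical, not part of the actual puppet or Icinga tooling.

```python
import re


def find_undefined_parents(config_text):
    """Return (host, parent) pairs whose parent has no host definition.

    Hypothetical helper: handles only simple 'define host { ... }' blocks
    with one 'key value' directive per line.
    """
    defined = set()
    parent_refs = []  # (host_name, parent) pairs seen in the config

    for block in re.findall(r"define\s+host\s*\{(.*?)\}", config_text, flags=re.DOTALL):
        attrs = {}
        for line in block.splitlines():
            parts = line.strip().split(None, 1)
            if len(parts) == 2:
                attrs[parts[0]] = parts[1].strip()

        name = attrs.get("host_name")
        if name is None:
            continue
        defined.add(name)

        # 'parents' may hold a comma-separated list of host names.
        for parent in attrs.get("parents", "").split(","):
            parent = parent.strip()
            if parent:
                parent_refs.append((name, parent))

    return [(host, parent) for host, parent in parent_refs if parent not in defined]


# Made-up sample mirroring the situation in this task.
sample = """
define host {
    host_name   lsw1-e3-eqiad.mgmt.eqiad.wmnet
}
define host {
    host_name   dumpsdata1007
    parents     lsw1-f1-eqiad.mgmt.eqiad.wmnet
}
"""

print(find_undefined_parents(sample))
```

With the sample above, the only pair reported is dumpsdata1007 with parent lsw1-f1-eqiad.mgmt.eqiad.wmnet, matching the error Icinga raised.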

https://gerrit.wikimedia.org/r/c/operations/puppet/+/764791 should fix the issue.

As for hostname vs. FQDN: the devices use LLDP to learn which switch is their parent. However, Juniper decided to start advertising the FQDN in a more recent version (and it's not configurable)...
I'd prefer to have the short name in there, so maybe the fix would be to add a `.split(".")[0]` somewhere and update the monitoring file.
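
A minimal sketch of that `.split(".")[0]` idea, written in Python purely for illustration; the function name is hypothetical and this is not the code from the actual patch:

```python
def switch_shortname(lldp_name):
    """Return the short name for a switch name learned via LLDP.

    Everything from the first dot onward is dropped, so both FQDN-style
    and already-short names end up in the same form.
    """
    return lldp_name.split(".")[0]


print(switch_shortname("lsw1-f1-eqiad.mgmt.eqiad.wmnet"))  # lsw1-f1-eqiad
print(switch_shortname("asw2-d-eqiad"))                    # asw2-d-eqiad
```

Names without a dot pass through unchanged, so older switches that already advertise a short name are unaffected.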

@RobH apologies for this, I was working on an improved version of the CR Arzhel lists above yesterday. But it should have occurred to me that bringing hosts live in advance of that would throw errors.

I will endeavor to get the improved version merged today before you get online; if not, I will merge the above and we can try again.

thanks.

Change 767774 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Icinga: use parent switch shortname

https://gerrit.wikimedia.org/r/767774

Change 767774 merged by Ayounsi:

[operations/puppet@production] Icinga: use parent switch shortname

https://gerrit.wikimedia.org/r/767774

Change 767835 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Add new eqiad switches to monitoring and align for all L3 switches

https://gerrit.wikimedia.org/r/767835

Change 767835 abandoned by Cathal Mooney:

[operations/puppet@production] Add new eqiad switches to monitoring and align for all L3 switches

Reason:

Arzhel and I both tried to make the same changes at the same time; abandoned in favour of his CR: 767774

https://gerrit.wikimedia.org/r/767835

dumpsdata1007 looks good in Icinga now after being re-added, following the above patches being merged.

Apologies for the oversight here @RobH, hopefully that's the last niggle we hit in the new row!