Shinken keeps alerting about long gone instances
Closed, Resolved · Public

Description

I received an email notification from Shinken about two unreachable instances, but both instances were deleted a while ago. Example:

Notification Type: PROBLEM
Host: integration-slave-docker-1046
State: DOWN
Address: 172.16.1.115
Info: CRITICAL - Host Unreachable (172.16.1.115)

Date/Time: Sun 17 Mar 21:03:27 UTC 2019

integration-slave-docker-1046 was deleted recently.

Ditto for integration-publishing02, deleted on March 19th 2019.

I think Shinken generates its configuration automatically, but I cannot remember from which source. LDAP?

Event Timeline

hashar created this task. Mar 12 2019, 7:27 PM
Restricted Application added a subscriber: Aklapper. Mar 12 2019, 7:27 PM
Peachey88 updated the task description. Mar 13 2019, 9:04 AM
hashar renamed this task from Host DOWN alert for integration-publishing02! to Host DOWN alert for integration-publishing02 and integration-slave-docker-1046. Mar 18 2019, 9:23 AM
hashar updated the task description.
Stashbot added a subscriber: Stashbot.

Mentioned in SAL (#wikimedia-releng) [2019-03-18T09:55:50Z] <hashar> deleting shutdowned instance integration-publisher02 , we do not use it anymore since doc publishing got overhauled ( T137890 ) # T218146

The configuration is generated by a Python script, modules/shinken/files/shinkengen, in operations/puppet. It seems to query OpenStack for the list of instances in the project.
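To make the generation step concrete, here is a minimal sketch of how a list of instances could be turned into host definitions. It is not the actual shinkengen code: the function name `render_hosts`, the input dictionaries, and the `address` line are assumptions for illustration. Only the `host_name` line layout is taken from the generated file quoted in this task, and the `define host { ... }` wrapper is the usual Nagios-style object syntax that Shinken reads.

```python
from typing import Iterable, Mapping


def render_hosts(instances: Iterable[Mapping[str, str]]) -> str:
    """Render one Nagios-style host definition per instance.

    `instances` is a hypothetical list of {"name": ..., "address": ...}
    dictionaries standing in for whatever the OpenStack query returns.
    """
    blocks = []
    # Sort by name so that repeated runs produce identical output.
    for inst in sorted(instances, key=lambda i: i["name"]):
        blocks.append(
            "define host {\n"
            f"    host_name        {inst['name']}\n"
            f"    address          {inst['address']}\n"
            "}\n"
        )
    return "\n".join(blocks)
```

The point of the sketch is that a deleted instance only disappears from monitoring if the file is actually regenerated from a fresh instance list.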

It turns out integration-publishing02 had been shut down because it is no longer used (T137890).

hashar renamed this task from Host DOWN alert for integration-publishing02 and integration-slave-docker-1046 to Host DOWN alert for integration-slave-docker-1046. Mar 18 2019, 9:57 AM
hashar updated the task description.

Nine days after I deleted the instance integration-publishing02, I still get notifications:

Notification Type: PROBLEM
Host: integration-publishing02
State: DOWN
Address: 172.16.4.5
Info: CRITICAL - Host Unreachable (172.16.4.5)

Date/Time: Wed 27 Mar 06:48:02 UTC 2019
hashar renamed this task from Host DOWN alert for integration-slave-docker-1046 to Shinken keeps alerting about long gone instances. Mar 27 2019, 7:36 AM
hashar updated the task description.

@GTirloni, apparently you recently worked on the shinken-02.shinken.eqiad.wmflabs instance. Would you mind checking the config? It seems to be stalled at some arbitrary point in the past.

Supposedly Puppet runs the script /usr/local/bin/shinkengen, which gets the list of instances from OpenStack and generates the configuration. It is executed on each Puppet run unless /usr/local/bin/shinkengen --test-if-up-to-date succeeds, and it then notifies the shinken service.
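The "run unless up to date, then notify" logic can be sketched as follows. This is an illustration of the pattern only, not the real shinkengen implementation: `is_up_to_date` and `write_if_changed` are hypothetical names, and comparing the file content against the freshly rendered text is an assumption about what the up-to-date test does.

```python
from pathlib import Path


def is_up_to_date(path: Path, desired: str) -> bool:
    """Return True when the file on disk already matches the desired text."""
    try:
        return path.read_text() == desired
    except FileNotFoundError:
        # A missing file is never up to date.
        return False


def write_if_changed(path: Path, desired: str) -> bool:
    """Rewrite the file only when needed.

    Returns True when the file was rewritten, which is the case in which
    the caller would reload the monitoring service.
    """
    if is_up_to_date(path, desired):
        return False
    path.write_text(desired)
    return True
```

Note that `write_text` raises an exception if the running user cannot write to the file, so a failed write leaves the old content in place.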

Notably, I still get notified for instances that no longer exist :( integration-publishing02.integration.eqiad.wmflabs and integration-slave-docker-1046.integration.eqiad.wmflabs.

GTirloni triaged this task as Normal priority.
$ grep publishing02 /etc/shinken/generated/integration.cfg 
    host_name        integration-publishing02
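The same check can be done for a whole file rather than one host at a time. The sketch below extracts every `host_name` entry from a generated file and reports those missing from a list of live instances. The helper names are hypothetical, and the caller has to supply the live instance list from OpenStack itself.

```python
import re
from typing import Iterable, List

# Matches lines such as "    host_name        integration-publishing02".
HOST_NAME_RE = re.compile(r"^[ \t]*host_name[ \t]+(\S+)[ \t]*$", re.MULTILINE)


def configured_hosts(cfg_text: str) -> List[str]:
    """Return every host_name value found in a generated config file."""
    return HOST_NAME_RE.findall(cfg_text)


def zombie_hosts(cfg_text: str, live_instances: Iterable[str]) -> List[str]:
    """Return configured hosts that are no longer in the live instance list."""
    live = set(live_instances)
    return [host for host in configured_hosts(cfg_text) if host not in live]
```

For example, `zombie_hosts(open("/etc/shinken/generated/integration.cfg").read(), live_names)` would list the hosts still monitored after deletion, assuming `live_names` holds the current instance names.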

shinkengen apparently stopped updating all projects except for tools on March 11th:

# ls -ltr
total 96
-rw-r--r-- 1 root    root    20860 Mar 11 14:57 deployment-prep.cfg
-rw-r--r-- 1 shinken shinken   711 Mar 11 14:57 extdist.cfg
-rw-r--r-- 1 shinken shinken  5370 Mar 11 14:57 analytics.cfg
-rw-r--r-- 1 shinken shinken  7303 Mar 11 14:57 integration.cfg
-rw-r--r-- 1 shinken shinken   244 Mar 11 14:57 shinken.cfg
-rw-r--r-- 1 shinken shinken   630 Mar 11 14:57 cvn.cfg
-rw-r--r-- 1 shinken shinken   255 Mar 11 14:57 codesearch.cfg
-rw-r--r-- 1 shinken shinken 39509 Mar 27 12:47 tools.cfg
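A stall like the one in this listing could be detected automatically by comparing modification times. The following is a rough sketch under the assumption that all project files are normally regenerated at about the same time. The function name and the threshold parameter are made up for illustration.

```python
from pathlib import Path
from typing import List


def stale_configs(directory: Path, max_lag_seconds: float) -> List[str]:
    """Return names of .cfg files that lag far behind the newest one."""
    mtimes = {p.name: p.stat().st_mtime for p in directory.glob("*.cfg")}
    if not mtimes:
        return []
    newest = max(mtimes.values())
    # Anything older than the newest file by more than the threshold is stale.
    return sorted(
        name for name, mtime in mtimes.items()
        if newest - mtime > max_lag_seconds
    )
```

With a threshold of one day, the listing above would flag every file except tools.cfg.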

The configuration seems okay:

# cat /etc/shinkengen.yaml 
projects: [ 'tools', 'deployment-prep', 'extdist', 'analytics', 'integration', 'shinken', 'cvn', 'codesearch' ]
output_path: '/etc/shinken/generated'
site: eqiad
keystone_host: cloudcontrol1003.wikimedia.org
keystone_port: 5000
puppet_enc_host: labs-puppetmaster.wikimedia.org

./modules/shinken/templates/shinkengen.yaml.erb uses the ::site variable to define the region and that doesn't match our region name anymore (the old OpenStack region was also named eqiad and everything was fine but now the new region is named eqiad1-r).

Change 499516 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] shinken: Convert to role/profile

https://gerrit.wikimedia.org/r/499516

Andrew added a subscriber: Andrew. Mar 27 2019, 7:35 PM

As of just now, it looked to me like shinkengen was running properly but throwing an error due to a permissions problem. My guess is that 'shinkengen' was run as root and generated a new file that could not then be overwritten by the standard shinkengen run (which is prompted by puppet and runs as user 'shinken').

I wiped out the /etc/shinken/generated dir and started over and things seem happy now. This will probably also solve the zombie alerts, but time will tell.
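The ownership guess is consistent with the earlier listing, where deployment-prep.cfg is owned by root while the other files belong to shinken. A small check for that condition might look like the sketch below. The function name is hypothetical, and it only compares the file owner with an expected uid rather than evaluating full write permissions.

```python
from pathlib import Path
from typing import List


def files_not_owned_by(directory: Path, uid: int) -> List[str]:
    """Return names of regular files in `directory` not owned by `uid`."""
    return sorted(
        p.name for p in directory.iterdir()
        if p.is_file() and p.stat().st_uid != uid
    )
```

On the Shinken host, the expected uid could be looked up with `pwd.getpwnam("shinken").pw_uid` from the standard library and passed in together with `Path("/etc/shinken/generated")`.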

Note, by the way, that in theory the multi-region issue was fixed with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/494264/

./modules/shinken/templates/shinkengen.yaml.erb uses the ::site variable to define the region and that doesn't match our region name

I am pretty sure that the 'site' setting in shinkengen.yaml is only used to generate the fqdn, <site>.wmflabs. Since that's eqiad for both regions I think this is moot, although I could be overlooking another use of that var.
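If that reading is right, the role of `site` reduces to something like the sketch below. The function name is hypothetical, and the `<instance>.<project>.<site>.wmflabs` pattern is inferred from the hostnames quoted in this task, not from the shinkengen source.

```python
def instance_fqdn(name: str, project: str, site: str) -> str:
    """Build an instance FQDN of the form <name>.<project>.<site>.wmflabs."""
    return f"{name}.{project}.{site}.wmflabs"
```

Under this assumption the OpenStack region name never enters the hostname, so a region rename would not change the generated FQDNs.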

GTirloni removed GTirloni as the assignee of this task. Apr 3 2019, 10:19 AM
GTirloni removed a subscriber: GTirloni. Apr 3 2019, 10:25 AM
hashar added a comment. Apr 3 2019, 1:35 PM

I think it is solved? At least I have not received ghost notifications this week.

It's still broken; I just received a host down alert for an instance I deleted two hours ago.

I was working on shinken just now, and it was broken/down for a couple of days due to adding a new project without a contact group. So I suspect today's warning wasn't evidence of shinkengen being broken.

Change 499516 merged by Andrew Bogott:
[operations/puppet@production] shinken: Convert to role/profile

https://gerrit.wikimedia.org/r/499516

hashar closed this task as Resolved. May 2 2019, 6:08 AM

I am no longer receiving ghost notifications, so I guess the Shinken configuration has been properly fixed.

Additionally, I once again get the alerts sent to the betacluster-alerts mailing list, which had stopped at some point.