
too many puppet failures (puppet errors on stat hosts)
Closed, Duplicate (Public)

Description

Today the "widespread puppet failures" monitoring alert triggered (https://puppetboard.wikimedia.org/nodes?status=failed).

But it wasn't caused by one change that broke things globally.

It was just that special cases had added up. Different hosts have different, unrelated puppet errors, and the total number is now crossing our alerting threshold.

So of course this _could_ mean the threshold is too low, but more likely it means we should fix the errors we can fix.

I wasn't sure if I should make a single ticket with checkboxes (this has been disliked by some before) or a ticket for each type of server (this has also been disliked before) or just do nothing and wait for the next time someone notices it on IRC.

So instead I picked one group of servers where I saw multiple hosts in the list. If we can fix those, we may be under the threshold again.

But we could also use this ticket to list ALL other failed hosts.

Starting with the stat* hosts:

  • stat1004 - Error: Execution of '/usr/bin/scap deploy-local --repo wikimedia/discovery/analytics -D log_json:False'
  • stat1005 - Error: Execution of '/usr/bin/scap deploy-local --repo analytics/hdfs-tools/deploy -D log_json:False'
  • stat1007 - Error: Execution of '/usr/bin/scap deploy-local --repo performance/asoranking -D log_json:False'
  • stat1008 - Error: Execution of '/usr/bin/scap deploy-local --repo wikimedia/discovery/analytics -D log_json:False'
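To see which deploy repos account for the failures, the list above can be grouped by the `--repo` argument. This is only a rough sketch: the `FAILURES` text is copied from this ticket, while `PATTERN` and `group_by_repo` are hypothetical helpers written for illustration, not part of puppet or scap.

```python
import re
from collections import defaultdict

# Failure lines copied from the list in this ticket.
FAILURES = """\
stat1004 - Error: Execution of '/usr/bin/scap deploy-local --repo wikimedia/discovery/analytics -D log_json:False'
stat1005 - Error: Execution of '/usr/bin/scap deploy-local --repo analytics/hdfs-tools/deploy -D log_json:False'
stat1007 - Error: Execution of '/usr/bin/scap deploy-local --repo performance/asoranking -D log_json:False'
stat1008 - Error: Execution of '/usr/bin/scap deploy-local --repo wikimedia/discovery/analytics -D log_json:False'
"""

# Hypothetical pattern: assumes the "<host> - Error: Execution of '... --repo <repo> ...'" shape shown above.
PATTERN = re.compile(r"^(?P<host>\S+) - Error: Execution of '.*--repo (?P<repo>\S+) ")


def group_by_repo(text):
    """Map each scap repo to the hosts whose failure line mentions it."""
    groups = defaultdict(list)
    for line in text.splitlines():
        match = PATTERN.match(line.strip())
        if match:
            groups[match.group("repo")].append(match.group("host"))
    return dict(groups)


if __name__ == "__main__":
    for repo, hosts in sorted(group_by_repo(FAILURES).items()):
        print(f"{repo}: {', '.join(hosts)}")
```

Grouping this way shows that stat1004 and stat1008 fail on the same repo, so a single fix there may clear two hosts at once.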

Event Timeline

I've noticed some scap deploy-local failures for phatality on logstash hosts too; investigating.

I got this when running as deploy-service:

deploy-service@logstash1032:~$ scap  deploy-local --repo releng/phatality -D log_json:False
Traceback (most recent call last):
  File "/usr/bin/scap", line 44, in <module>
    from scap import cli
ModuleNotFoundError: No module named 'scap'
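A quick way to confirm the same condition on other hosts is to ask the interpreter whether it can locate the module at all. This is a minimal sketch, assuming it is run with the same Python interpreter that `/usr/bin/scap` uses (whichever its shebang points to); `module_available` is a hypothetical helper, not part of scap.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if this interpreter's import system can locate `name`."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Show which interpreter is being checked, since the result depends on it.
    print("interpreter:", sys.executable)
    print("scap importable:", module_available("scap"))
```

If this reports that scap is not importable, the traceback above would be expected on that host for any scap subcommand, not only deploy-local.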

I am getting the same ModuleNotFoundError on arclamp1001 (where scap predictably fails too).

Yes, that's right, I'll follow up there! Thank you.