
a couple longer running icinga alerts to be fixed
Closed, Declined (Public)

Description

A couple of Icinga alerts have been CRIT for a while. They are probably not actually critical, but they create alerting noise, so they are combined into one ticket:

maps related:

  • "Maps - OSM synchronization lag - eqiad" link
  • "Maps - OSM synchronization lag - codfw" link
  • "Maps tiles generation" link

analytics related:

  • an-master1001 - CRITICAL - degraded: The following units failed: hadoop-clean-fairscheduler-event-logs.service link
  • an-test-master1001 - CRITICAL - degraded: The following units failed: hadoop-clean-fairscheduler-event-logs.service link
  • an-tool1005 - Memcached connect to address 10.64.36.117 and port 11211: Connection refused link
  • aqs1008.mgmt - SSH - CRITICAL - Socket timeout after 10 seconds link

cloud related:

  • cloudstore1008 - eno2 reporting no carrier. link
  • cloudstore1009 - CRITICAL - degraded: The following units failed: purge_vm_backup.service link

infra foundation related:

  • deneb - DPKG CRITICAL dpkg reports broken packages link
  • idp-test1002 - CRITICAL - degraded: The following units failed: memcached.service link
  • idp-test1002 - connect to address 208.80.154.72 and port 11000: Connection refused link
  • idp-test2002 - Check no envoy runtime configuration is left persistent - connect to address 127.0.0.1 and port 9631: Connection refused link
  • mirror1001 - CRITICAL - degraded: The following units failed: nginx.service link
  • netbox1002 - CRITICAL - degraded: The following units failed: rq-netbox.service,wmf_auto_restart_rq-netbox.service link

releng/serviceops related:

  • deploy2002 - CRITICAL - degraded: The following units failed: wmf_auto_restart_imagecatalog.service (the imagecatalog service does not exist on the non-active deployment server, but the restart service does) link
  • ganeti1023 - Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] link

dumps related:

  • dumpsdata1002 - CRITICAL - degraded: The following units failed: wmf_auto_restart_rasdaemon.service link
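
Many of the alerts above share the "degraded: The following units failed: ..." message, so grouping hosts by failed unit may help when splitting the list per team. The following is only a minimal sketch: the message shape is assumed from the alert text quoted in this task, and `failed_units` and `hosts_by_unit` are hypothetical helper names, not part of Icinga or any existing tooling.

```python
import re
from collections import defaultdict

# Assumed message shape, copied from the alert text quoted in this task.
UNITS_FAILED = re.compile(r"The following units failed: (?P<units>[\w@.,\-]+)")


def failed_units(alert_text):
    """Return the systemd unit names named in one alert line, or [] if none."""
    match = UNITS_FAILED.search(alert_text)
    if not match:
        return []
    # Units are comma-separated with no spaces; drop empty pieces.
    return [unit for unit in match.group("units").split(",") if unit]


def hosts_by_unit(alerts):
    """Map each failed unit to the hosts reporting it.

    `alerts` is a hypothetical {hostname: alert_text} dict.
    """
    grouped = defaultdict(list)
    for host, text in alerts.items():
        for unit in failed_units(text):
            grouped[unit].append(host)
    return dict(grouped)
```

Alerts that are not about failed units (for example the "Connection refused" checks) simply yield an empty list and are left out of the grouping.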

Event Timeline

We can just fix them, but we can also question whether the checks should or can be removed on non-active hosts (via Puppet changes), whether they should really be CRIT, and so on.

I don't think one big meta task will work out. It will show up on too many workboards (even when some parts are already done), and there is also the issue that a task can only have one assignee. To be practical, I think it would need to be split into subtasks per team, each tagged for the respective team.

I have tried pinging individual IRC channels as well as individual tasks in the past. Additionally, I have pointed out my concern about missing Icinga notifications in different venues. I don't know what the solution is for getting attention on Icinga alerts.