Several Icinga alerts have been CRITICAL for a while. They are probably not actually critical, but they are creating alerting noise. They are combined into one ticket here, grouped by team:
maps related:
- "Maps - OSM synchronization lag - eqiad" link
- "Maps - OSM synchronization lag - codfw" link
- "Maps tiles generation" link
analytics related:
- an-master1001 - CRITICAL - degraded: The following units failed: hadoop-clean-fairscheduler-event-logs.service link
- an-test-master1001 - CRITICAL - degraded: The following units failed: hadoop-clean-fairscheduler-event-logs.service link
- an-tool1005 - Memcached connect to address 10.64.36.117 and port 11211: Connection refused link
- aqs1008.mgmt - SSH - CRITICAL - Socket timeout after 10 seconds link
cloud related:
- cloudstore1008 - eno2 reporting no carrier. link
- cloudstore1009 - CRITICAL - degraded: The following units failed: purge_vm_backup.service link
infra foundation related:
- deneb - DPKG CRITICAL dpkg reports broken packages link
- idp-test1002 - CRITICAL - degraded: The following units failed: memcached.service link
- idp-test1002 - connect to address 208.80.154.72 and port 11000: Connection refused link
- idp-test2002 - Check no envoy runtime configuration is left persistent - connect to address 127.0.0.1 and port 9631: Connection refused link
- mirror1001 - CRITICAL - degraded: The following units failed: nginx.service link
- netbox1002 - CRITICAL - degraded: The following units failed: rq-netbox.service,wmf_auto_restart_rq-netbox.service link
releng/serviceops related:
- deploy2002 - CRITICAL - degraded: The following units failed: wmf_auto_restart_imagecatalog.service (the imagecatalog service does not exist on the non-active deployment server, but the restart service does) link
- ganeti1023 - Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] link
dumps related:
- dumpsdata1002 - CRITICAL - degraded: The following units failed: wmf_auto_restart_rasdaemon.service link