WMCS: hundreds of Phabricator tickets were created for some alerts
Open, High, Public

Description

See screenshot:

image.png (1×1 px, 263 KB)

We would need to:

  • immediately stop whatever task-creation workflow is producing them
  • close or otherwise handle the hundreds of open tasks
  • rethink our use of the alerts-to-Phabricator workflow
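
To help with the second point, the duplicate tasks could first be grouped by alert name and target, so that each group can be reviewed and batch-closed together. The following is only a minimal sketch, assuming task lines of the form "T<id>: <AlertName> [<target>] ..." as they appear under Mentioned In. The function name `group_alert_tasks` and the rule "a target is a lowercase token containing a digit" are assumptions made for illustration and are not part of any Phabricator or Alertmanager tooling.

```python
import re
from collections import defaultdict

# Matches lines such as
#   "T331768: SystemdUnitDownForLong cloudbackup1003:9100 Unit ..."
# The optional third group is treated as the target only if it is a
# lowercase token containing a digit (assumption: hostnames look like that).
TASK_LINE = re.compile(
    r"^(T\d+):\s+(\S+)(?:\s+((?=\S*\d)[a-z0-9:.-]+)(?=\s|$))?"
)


def group_alert_tasks(lines):
    """Group task IDs by (alert name, target); target is None when absent."""
    groups = defaultdict(list)
    for line in lines:
        match = TASK_LINE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a task line
        task_id, alert, target = match.groups()
        groups[(alert, target)].append(task_id)
    return dict(groups)


sample = [
    "T331432: NodeDown",
    "T331433: NodeDown",
    "T331788: SystemdUnitDownForLong cloudbackup1003:9100",
    "T331794: SystemdUnitDownForLong cloudbackup1003:9100",
    "T332376: SystemdUnitDownForLong cloudbackup1004:9100 Unit "
    "backup_vms.service on node cloudbackup1004 has been down for long.",
]

for (alert, target), task_ids in group_alert_tasks(sample).items():
    print(alert, target, task_ids)
```

Groups with many task IDs are the likely duplicates. A task that is not an alert (for example T332347) would end up in a group of its own and should be reviewed by hand rather than closed in bulk.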

Related Objects

Mentioned In
T331425: JobUnavailable
T331432: NodeDown
T331433: NodeDown
T331431: NodeDown
T331430: NodeDown
T331435: NodeDown
T331436: NodeDown
T331437: NodeDown
T331434: NodeDown
T331439: NodeDown
T331438: NodeDown
T331440: NodeDown
T331768: SystemdUnitDownForLong cloudbackup1003:9100 Unit backup_vms.service on node cloudbackup1003 has been down for long.
T331532: SystemdUnitDownForLong cloudbackup1004:9100 Unit backup_vms.service on node cloudbackup1004 has been down for long.
T331779: SystemdUnitDownForLong cloudbackup1003:9100 Unit backup_vms.service on node cloudbackup1003 has been down for long.
T331788: SystemdUnitDownForLong cloudbackup1003:9100
T331794: SystemdUnitDownForLong cloudbackup1003:9100
T331798: SystemdUnitDownForLong cloudbackup1003:9100
T331808: SystemdUnitDownForLong cloudbackup1003:9100
T331802: SystemdUnitDownForLong cloudbackup1003:9100
T331813: SystemdUnitDownForLong cloudbackup1003:9100
T331821: SystemdUnitDownForLong cloudbackup1003:9100
T331825: SystemdUnitDownForLong cloudbackup1003:9100
T331828: SystemdUnitDownForLong cloudbackup1003:9100
T331832: SystemdUnitDownForLong cloudbackup1003:9100
T331846: SystemdUnitDownForLong cloudbackup1003:9100
T331866: SystemdUnitDownForLong cloudbackup1003:9100
T331885: SystemdUnitDownForLong cloudbackup1003:9100
T331918: SystemdUnitDownForLong cloudbackup1003:9100
T331942: SystemdUnitDownForLong cloudbackup1003:9100
T331962: SystemdUnitDownForLong cloudbackup1003:9100
T331958: SystemdUnitDownForLong cloudbackup1003:9100
T332111: SystemdUnitDownForLong cloudbackup1003:9100 Unit backup_vms.service on node cloudbackup1003 has been down for long.
T332130: SystemdUnitDownForLong cloudbackup1003:9100 Unit backup_vms.service on node cloudbackup1003 has been down for long.
T331982: SystemdUnitDownForLong cloudbackup1003:9100
T332001: JobUnavailable The Prometheus job pdns_rec running on cloud@ has been unable to scrape 50% of its targets. Check if the targets are reachable and exporting metrics.
T332237: SystemdUnitDownForLong cloudbackup1003:9100
T332213: SystemdUnitDownForLong cloudbackup1003:9100
T332184: SystemdUnitDownForLong cloudbackup1003:9100
T332251: SystemdUnitDownForLong cloudbackup1003:9100
T332239: SystemdUnitDownForLong cloudbackup1003:9100
T332271: SystemdUnitDownForLong cloudbackup1003:9100
T332303: SystemdUnitDownForLong cloudbackup1003:9100
T332347: Duplicate the maintain-harbor deployment for "tools"
T332360: SystemdUnitDownForLong cloudbackup1003:9100
T332323: SystemdUnitDownForLong cloudbackup1003:9100
T332350: SystemdUnitDownForLong cloudbackup1003:9100
T332376: SystemdUnitDownForLong cloudbackup1004:9100 Unit backup_vms.service on node cloudbackup1004 has been down for long.
T332367: SystemdUnitDownForLong cloudbackup1003:9100
T332372: SystemdUnitDownForLong cloudbackup1003:9100
T332526: NeutronAgentDown cloudvirt-wdqs1002 A Neutron agent is down, VMs will have connectivity issues
T332536: NeutronAgentDown cloudvirt-wdqs1002 A Neutron agent is down, VMs will have connectivity issues
T332531: NeutronAgentDown cloudvirt-wdqs1002 A Neutron agent is down, VMs will have connectivity issues
T332631: NeutronAgentDown cloudvirt-wdqs1002 A Neutron agent is down, VMs will have connectivity issues
T332606: NeutronAgentDown cloudvirt-wdqs1002 A Neutron agent is down, VMs will have connectivity issues
T332559: NeutronAgentDown cloudvirt-wdqs1002 A Neutron agent is down, VMs will have connectivity issues
T332652: NeutronAgentDown cloudvirt-wdqs1002 A Neutron agent is down, VMs will have connectivity issues
T332658: NeutronAgentDown cloudvirt-wdqs1002 A Neutron agent is down, VMs will have connectivity issues
T332668: NeutronAgentDown cloudvirt-wdqs1002 A Neutron agent is down, VMs will have connectivity issues
T332665: NeutronAgentDown cloudvirt-wdqs1002 A Neutron agent is down, VMs will have connectivity issues
T332853: SystemdUnitDownForLong cloudcontrol1005:9100 Unit backup_cinder_volumes.service on node cloudcontrol1005 has been down for long.
T332860: SystemdUnitDownForLong cloudcontrol1005:9100 Unit backup_cinder_volumes.service on node cloudcontrol1005 has been down for long.
T332875: SystemdUnitDownForLong cloudcontrol1005:9100 Unit backup_cinder_volumes.service on node cloudcontrol1005 has been down for long.
T332914: SystemdUnitDownForLong cloudcontrol1005:9100 Unit backup_cinder_volumes.service on node cloudcontrol1005 has been down for long.
T333255: NeutronAgentDown cloudvirt1021 A Neutron agent is down, VMs will have connectivity issues
T333256: NeutronAgentDown cloudvirt1022 A Neutron agent is down, VMs will have connectivity issues
T333253: NeutronAgentDown cloudvirt1017 A Neutron agent is down, VMs will have connectivity issues
T333278: NeutronAgentDown cloudvirt1017 A Neutron agent is down, VMs will have connectivity issues
T333279: NeutronAgentDown cloudvirt1021 A Neutron agent is down, VMs will have connectivity issues
T333280: NeutronAgentDown cloudvirt1022 A Neutron agent is down, VMs will have connectivity issues
T333305: NeutronAgentDown cloudvirt1017 A Neutron agent is down, VMs will have connectivity issues
T333148: SystemdUnitFailed clouddb2002-dev:9100
T333147: SystemdUnitFailed cloudservices2004-dev:9100 ifup@eno1.service Failed on cloudservices2004-dev:9100
T333149: SystemdUnitFailed cloudservices2005-dev:9100 ifup@eno1.service Failed on cloudservices2005-dev:9100
T333151: SystemdUnitFailed cloudcontrol2005-dev:9100 purge_vm_backup.service Failed on cloudcontrol2005-dev:9100
T333150: SystemdUnitFailed cloudcontrol2001-dev:9100 purge_vm_backup.service Failed on cloudcontrol2001-dev:9100
T333152: SystemdUnitFailed cloudcontrol2004-dev:9100
T333154: SystemdUnitFailed cloudvirt1019:9100 export_smart_data_dump.service Failed on cloudvirt1019:9100
T333193: SystemdUnitFailed clouddb2002-dev:9100
T333195: SystemdUnitFailed cloudvirt1019:9100 export_smart_data_dump.service Failed on cloudvirt1019:9100
T333194: SystemdUnitFailed cloudservices2005-dev:9100 ifup@eno1.service Failed on cloudservices2005-dev:9100
T333196: SystemdUnitFailed cloudservices2004-dev:9100 ifup@eno1.service Failed on cloudservices2004-dev:9100
T333197: SystemdUnitFailed cloudcontrol2001-dev:9100 purge_vm_backup.service Failed on cloudcontrol2001-dev:9100
T333198: SystemdUnitFailed cloudcontrol2005-dev:9100 purge_vm_backup.service Failed on cloudcontrol2005-dev:9100
T333202: SystemdUnitFailed cloudweb2002-dev:9100 wikitech_run_jobs.service Failed on cloudweb2002-dev:9100
T333201: SystemdUnitFailed cloudvirt2002-dev:9100 prometheus-node-cloudvirt-libvirt-stats.service Failed on cloudvirt2002-dev:9100
T333203: SystemdUnitFailed clouddb2002-dev:9100
T333199: SystemdUnitFailed cloudcontrol2004-dev:9100
T333229: SystemdUnitFailed cloudservices2005-dev:9100 ifup@eno1.service Failed on cloudservices2005-dev:9100
T333230: SystemdUnitFailed cloudservices2004-dev:9100 ifup@eno1.service Failed on cloudservices2004-dev:9100
T333214: SystemdUnitFailed labstore1004:9100 disable-tool.service Failed on labstore1004:9100
T333231: SystemdUnitFailed cloudvirt1019:9100 export_smart_data_dump.service Failed on cloudvirt1019:9100
T333232: SystemdUnitFailed cloudcontrol2005-dev:9100 purge_vm_backup.service Failed on cloudcontrol2005-dev:9100
T333233: SystemdUnitFailed cloudcontrol2001-dev:9100 purge_vm_backup.service Failed on cloudcontrol2001-dev:9100
T333234: SystemdUnitFailed cloudcontrol2004-dev:9100
T333237: SystemdUnitFailed clouddb2002-dev:9100
T333258: SystemdUnitFailed cloudservices2005-dev:9100 ifup@eno1.service Failed on cloudservices2005-dev:9100
T333259: SystemdUnitFailed cloudcontrol2005-dev:9100 purge_vm_backup.service Failed on cloudcontrol2005-dev:9100
T333260: SystemdUnitFailed cloudvirt1019:9100 export_smart_data_dump.service Failed on cloudvirt1019:9100
T333261: SystemdUnitFailed cloudservices2004-dev:9100 ifup@eno1.service Failed on cloudservices2004-dev:9100
T333262: SystemdUnitFailed cloudcontrol2004-dev:9100
T333263: SystemdUnitFailed cloudcontrol2001-dev:9100 purge_vm_backup.service Failed on cloudcontrol2001-dev:9100
T333265: SystemdUnitFailed clouddb2002-dev:9100
T333274: SystemdUnitFailed labstore1004:9100 disable-tool.service Failed on labstore1004:9100
T333284: SystemdUnitFailed cloudvirt1019:9100 export_smart_data_dump.service Failed on cloudvirt1019:9100
T333286: SystemdUnitFailed cloudcontrol2001-dev:9100 purge_vm_backup.service Failed on cloudcontrol2001-dev:9100
T333287: SystemdUnitFailed clouddb2002-dev:9100
T333293: SystemdUnitFailed cloudweb1004:9100 wikitech_run_jobs.service Failed on cloudweb1004:9100
T333296: SystemdUnitFailed cloudbackup1004:9100 backup_vms.service Failed on cloudbackup1004:9100
T333303: SystemdUnitFailed cloudvirt1045:9100 prometheus-node-cloudvirt-libvirt-stats.service Failed on cloudvirt1045:9100
T333306: SystemdUnitFailed labstore1004:9100 disable-tool.service Failed on labstore1004:9100
T333307: SystemdUnitFailed cloudcontrol2001-dev:9100 purge_vm_backup.service Failed on cloudcontrol2001-dev:9100
T333308: SystemdUnitFailed cloudservices2004-dev:9100 ifup@eno1.service Failed on cloudservices2004-dev:9100
T333309: SystemdUnitFailed cloudvirt1019:9100 export_smart_data_dump.service Failed on cloudvirt1019:9100
T333310: SystemdUnitFailed cloudcontrol2005-dev:9100 purge_vm_backup.service Failed on cloudcontrol2005-dev:9100
T333311: SystemdUnitFailed cloudcontrol2004-dev:9100
T333313: SystemdUnitFailed clouddb2002-dev:9100
Mentioned Here
T331425: JobUnavailable

Event Timeline

Via IRC, @fgiunchedi mentioned that this might be related to recent changes made while migrating some alerts from Icinga to Alertmanager.

aborrero mentioned this in T331440: NodeDown.
aborrero mentioned this in T331439: NodeDown.
aborrero mentioned this in T331438: NodeDown.
aborrero mentioned this in T331436: NodeDown.
aborrero mentioned this in T331437: NodeDown.
aborrero mentioned this in T331434: NodeDown.
aborrero mentioned this in T331435: NodeDown.
aborrero mentioned this in T331432: NodeDown.
aborrero mentioned this in T331433: NodeDown.
aborrero mentioned this in T331431: NodeDown.
aborrero mentioned this in T331430: NodeDown.

Change 903613 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: temp downgrade systemdunitfailed to warning, exclude wmcs

https://gerrit.wikimedia.org/r/903613

I thought it's not a bug but a feature :D

Change 903613 merged by Filippo Giunchedi:

[operations/alerts@master] sre: temp downgrade systemdunitfailed to warning, exclude wmcs

https://gerrit.wikimedia.org/r/903613

Hey @Aklapper, in batch operation 3630 I made a mistake and set the ticket title instead of adding a comment. I'm not sure what to do about that, or whether I should try to revert it somehow. On the other hand, who cares: those tickets shouldn't have been created in the first place.

Change 904559 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] alertmanager: disable phabricator task creation for WMCS alerts

https://gerrit.wikimedia.org/r/904559

Change 904559 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] alertmanager: update phabricator project for WMCS alerts

https://gerrit.wikimedia.org/r/904559

in the batch operation 3630 I made a mistake and set the ticket title instead of adding a comment

@aborrero: As you like. If you think it's really bad I could run a script to revert. But if nobody complains I'd just leave it as is.

Thanks, let's leave it as is.