
VictorOps 'escalator' did not work on 2022-12-03
Closed, Resolved · Public

Description

All three of the pages from this time (over the weekend, so outside business hours) show no escalation to paging, merely a recovery after 5 minutes:

https://portal.victorops.com/ui/wikimedia/incident/3174/timeline

see also T313603

Event Timeline

Had a look into this, and it seems like vo-escalate is stuck; the process has been running for 2 weeks:

alert1001:/var/log# systemctl status vo-escalate.service
● vo-escalate.service - "Escalate VO unpaged incidents"
   Loaded: loaded (/lib/systemd/system/vo-escalate.service; enabled; vendor preset: enabled)
   Active: active (running) since Sun 2022-11-20 18:29:06 UTC; 2 weeks 0 days ago
 Main PID: 9150 (python3)
    Tasks: 1 (limit: 4915)
   Memory: 23.9M
   CGroup: /system.slice/vo-escalate.service
           └─9150 /usr/bin/python3 klaxon/victorops.py escalate_unpaged <redacted>

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
alert1001:/var/log# systemctl status vo-escalate.timer
● vo-escalate.timer - Periodic execution of vo-escalate.service
   Loaded: loaded (/lib/systemd/system/vo-escalate.timer; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2022-08-03 13:46:56 UTC; 4 months 2 days ago
  Trigger: n/a
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

And I can see the process has indeed been running since Nov 20:

klaxon    9150  0.0  0.0  42644 34244 ?        Ss   Nov20   0:00 /usr/bin/python3 klaxon/victorops.py escalate_unpaged <redacted>

I'm tempted to kill this process, but will leave it as-is in the broken state for investigation, since we're still within business hours.

Is there a way to monitor this so we can get an alert if it happens again?

Change 864776 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] vo-escalate: kill process if run time exceeds 10s

https://gerrit.wikimedia.org/r/864776

Change 864787 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/puppet@production] Add a timeout for vo-escalate

https://gerrit.wikimedia.org/r/864787

Both the processes on alert1001 and alert2001 have been stuck for a while.

Both are blocked on read() calls on a TCP socket to the same Cloudflare-owned IPv6 address, on a connection which is supposedly still in state ESTABLISHED.

On alert2001 this has supposedly been the case since Oct 20; on alert1001 since Nov 20.
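
For context on why a read() like this can hang for weeks: a blocking recv() on a TCP connection never returns if the peer silently stops responding without sending a FIN or RST, because the kernel keeps the connection in ESTABLISHED and there is no application-level deadline. A minimal sketch of that failure mode and the settimeout() escape hatch (portal.victorops.com is used here only as a stand-in peer; the stuck connections were to the VictorOps API edge):

import socket

# Open a TCP connection to an HTTPS port and then read without ever sending
# anything; the server is waiting for a ClientHello, so nothing arrives.
sock = socket.create_connection(("portal.victorops.com", 443), timeout=5)

# Without a timeout, the recv() below would block indefinitely, which is the
# same state the vo-escalate processes were stuck in. With one, a stalled
# read raises socket.timeout instead.
sock.settimeout(10)

try:
    sock.recv(4096)
except socket.timeout:
    print("read timed out instead of hanging")
finally:
    sock.close()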

I'll also add some timeouts at the requests library level in the code itself.
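
For illustration, a requests-level timeout looks like the sketch below; the function, endpoint, and headers are illustrative (based on the public VictorOps REST API), not the actual klaxon code:

import requests

# A (connect, read) timeout on every API call, so a hung socket read raises
# requests.exceptions.Timeout instead of blocking the vo-escalate run forever.
TIMEOUT = (5, 30)

def list_unpaged_incidents(api_id: str, api_key: str) -> dict:
    resp = requests.get(
        "https://api.victorops.com/api-public/v1/incidents",
        headers={"X-VO-Api-Id": api_id, "X-VO-Api-Key": api_key},
        timeout=TIMEOUT,
    )
    resp.raise_for_status()
    return resp.json()

Note that requests applies no timeout by default, which is why the process could sit in read() indefinitely.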

Change 864787 merged by CDanis:

[operations/puppet@production] Add a timeout for vo-escalate

https://gerrit.wikimedia.org/r/864787

Change 864776 abandoned by Herron:

[operations/puppet@production] vo-escalate: kill process if run time exceeds 10s

Reason:

https://gerrit.wikimedia.org/r/864776

When I Klaxoned about T327001 it took almost exactly 10 minutes to get a reaction, and at least @MatthewVernon reported that they got an email only instead of a proper page. So that makes me a bit suspicious about whether the fix for this bug works properly.

FWIW:
https://portal.victorops.com/ui/wikimedia/incident/3226/timeline shows the initial page at 09:01 and VO at least attempting to escalate it immediately.

I got emailed at 09:01 and 09:06, so effectively immediately; I didn't get an SMS, but I think that's because of how I have VO configured.

IRC chatter suggests some people did get paged (but were unavailable), but at least some SREs (@Joe) didn't.

So I think this was escalation working as we want, but then there was some issue with how VO sent some of the pages.

Can confirm I was paged; the entries from Splunk show "Jan. 14 - 3:06 AM Trying to contact Leo Mata for #3226, sending SMS" (verifiable on my device).

lmata triaged this task as Medium priority. Jan 16 2023, 5:21 PM
lmata moved this task from Inbox to Backlog on the Observability-Alerting board.

Keeping it open in case there is another recurrence, but am hopeful the lag was primarily due to off-hours paging.

CDanis claimed this task.

> When I Klaxoned about T327001 it took almost exactly 10 minutes to get a reaction, and at least @MatthewVernon reported that they got an email only instead of a proper page. So that makes me a bit suspicious about whether the fix for this bug works properly.

Hi taavi,

Your klaxon was escalated to the whole team within a minute of it being sent :)