
Investigate smsglobal delivery failures from 2015-06-13 weekend
Closed, Declined · Public

Description

During a critical outage this weekend, icinga sent alert pages via smsglobal to virtually all of our operations personnel, since most of us were within our awake_hours windows. I happened to copy one such line to IRC at the time:

[1434221938] SERVICE NOTIFICATION: bblack;text-lb.esams.wikimedia.org;LVS HTTP IPv4;CRITICAL;notify-by-sms-gateway;Connection refused
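For anyone comparing log lines like the one above against provider delivery reports, here is a minimal parsing sketch. It assumes the usual Nagios/Icinga 1.x layout of `[epoch] SERVICE NOTIFICATION: contact;host;service;state;command;output`; the `ServiceNotification` type and `parse_service_notification` function are hypothetical names for illustration, not part of icinga.

```python
import re
from datetime import datetime, timezone
from typing import NamedTuple, Optional


class ServiceNotification(NamedTuple):
    """One SERVICE NOTIFICATION log entry (hypothetical helper type)."""
    when: datetime   # log timestamp, converted to UTC
    contact: str     # who was notified
    host: str
    service: str
    state: str       # e.g. CRITICAL, OK
    command: str     # notification command, e.g. notify-by-sms-gateway
    output: str      # plugin output


# Assumed layout: "[<epoch seconds>] SERVICE NOTIFICATION: <6 ;-separated fields>"
_LINE_RE = re.compile(r"^\[(\d+)\] SERVICE NOTIFICATION: (.*)$")


def parse_service_notification(line: str) -> Optional[ServiceNotification]:
    """Return the parsed entry, or None if the line is not a service notification."""
    match = _LINE_RE.match(line.strip())
    if match is None:
        return None
    # Split at most 5 times so semicolons inside the plugin output are preserved.
    fields = match.group(2).split(";", 5)
    if len(fields) != 6:
        return None
    when = datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)
    return ServiceNotification(when, *fields)
```

Applied to the line above, this yields contact `bblack`, command `notify-by-sms-gateway`, and a timestamp of 2015-06-13 18:58:58 UTC, which is the kind of per-recipient record that can be lined up against what the SMS provider claims to have sent.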

When I looked at icinga.log at the time, I am sure there was a block of matching lines for everyone. However, @Joe, @akosiaris and @faidon confirmed on IRC that they never received these texts, which probably indicates an smsglobal delivery failure specific to our EU ops. @BBlack and @Gage definitely received the pages.

Additionally, all log lines of the paging event, including the one pasted above, are gone from our icinga logs. There seems to be a history gap after rotation between icinga.log.1 and icinga.log, which covers this timeframe. I'll file a separate task about that...
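A rough way to quantify such a gap is to compare the last timestamp in the rotated file with the first timestamp in the current one. The sketch below assumes each log line starts with `[epoch]` and that lines within a file are in chronological order; `timestamp_bounds` and `rotation_gap` are hypothetical helper names.

```python
import re
from typing import Iterable, Optional, Tuple

# Assumed: every relevant log line begins with "[<epoch seconds>]".
_TS_RE = re.compile(r"^\[(\d+)\]")


def timestamp_bounds(lines: Iterable[str]) -> Optional[Tuple[int, int]]:
    """Return (first, last) epoch timestamps seen, or None if no line has one."""
    first = last = None
    for line in lines:
        match = _TS_RE.match(line)
        if match is None:
            continue  # skip lines without a leading timestamp
        ts = int(match.group(1))
        if first is None:
            first = ts
        last = ts
    if first is None:
        return None
    return first, last


def rotation_gap(older_lines: Iterable[str],
                 newer_lines: Iterable[str]) -> Optional[int]:
    """Seconds between the end of the older log and the start of the newer one."""
    older = timestamp_bounds(older_lines)
    newer = timestamp_bounds(newer_lines)
    if older is None or newer is None:
        return None
    return newer[0] - older[1]
```

Passing the open file objects for icinga.log.1 and icinga.log (in that order) returns the size of the gap in seconds. A value far larger than the normal interval between log entries would support the suspicion that entries were lost around rotation.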

Event Timeline

BBlack created this task. Jun 14 2015, 3:56 AM
BBlack raised the priority of this task from to High.
BBlack updated the task description.
BBlack added a project: acl*sre-team.
BBlack added subscribers: BBlack, RobH, Dzahn and 5 others.
Restricted Application added a subscriber: Aklapper. Jun 14 2015, 3:56 AM

FWIW, I never got a page either, and I am not on the smsglobal list; I use the Verizon email gateway instead.

These failures seem to be ongoing. I checked the mail logs on polonium for my number and they show two entries, at 2015-07-16 10:17:14 and 2015-07-16 10:26:34: respectively a page for citoid and its recovery. I received only the recovery SMS.

Restricted Application added a subscriber: Matanya. Jul 21 2015, 1:46 PM
mark assigned this task to RobH. Jul 22 2015, 10:40 AM

Rob, can you take a look at this? SMSGlobal is consistently unreliable for us, and we either need to get it fixed soon or move to another solution...

RobH added a subscriber: Joe. Jul 22 2015, 6:53 PM

Logging into smsglobal and looking at the outgoing reports shows a lot of yellow (sent, but not confirmed received) and red (failed).

Taking @fgiunchedi as an example, the report shows both of those texts as sent (yellow). That he got one and not the other isn't reflected in the report.

We've had a LOT of these issues in the past. @Joe had started to tinker with another provider's API, since switching providers also means we have to enable API use for these notifications rather than simply using an email-to-SMS gateway. (The other providers use an API.)

RobH added a comment. Jul 22 2015, 6:53 PM

@Joe: Do you happen to recall which service you tested and liked the best? I'm still trying to find old tasks from then to find my lists.

RobH added a comment. Jul 22 2015, 7:02 PM

Seems we already had wikimedia.pagerduty.com.

Chase advised that OIT has an account, and that he has his own demo account from a few months ago. He is going to track it down and send me the info.

Emailed to follow up on reactivating.

RobH added a comment. Jul 22 2015, 7:17 PM

We're getting the old demo account reactivated for testing.

RobH changed the task status from Open to Stalled. Jul 22 2015, 7:57 PM

On the issue of smsglobal failures:

I've called and opened a ticket for the issue. However, as their side shows sent, I expect nothing of consequence from our issue request. In the past, they have failed to return calls, and followup on cases has resulted in no further information being provided.

We started to migrate away last fall, but other projects took priority.

As such, I'm going to lower the priority to low and stall it.

Task T106589 will track our migration to pagerduty.

RobH lowered the priority of this task from High to Low. Jul 22 2015, 7:59 PM
RobH added a comment. Jul 26 2015, 3:32 AM

We just had another failure to page me when the LVS IPv6 false alarm went off again. The pagerduty test worked fine, but smsglobal failed.

Krenair updated the task description. Aug 15 2015, 9:27 PM
Krenair set Security to None.
Krenair removed a subscriber: Unknown Object (User).
RobH closed this task as Declined. Sep 29 2015, 4:13 PM

No more investigation, just migrated away from them as a vendor instead.