
db1075 (s3 master) crashed - BBU failure
Closed, Resolved, Public

Description

db1075, the s3 primary master, crashed due to a BBU failure (T233535), which left all the s3 wikis in read-only mode from 18:48 UTC to around 19:15 UTC.
Reads were not affected.

HW logs:

/system1/log1/record13
  Targets
  Properties
    number=13
    severity=Caution
    date=09/22/2019
    time=18:37
    description=Smart Storage Battery failure (Battery 1, service information: 0x0A). Action: Gather AHS log and contact Support
  Verbs
    cd version exit show


</system1/log1>hpiLO-> show record14

status=0
status_tag=COMMAND COMPLETED
Mon Sep 23 05:01:26 2019



/system1/log1/record14
  Targets
  Properties
    number=14
    severity=Caution
    date=09/22/2019
    time=19:00
    description=POST Error: 313-HPE Smart Storage Battery 1 Failure - Battery Shutdown Event Code: 0x0400. Action: Restart system. Contact HPE support if condition persists.
  Verbs
    cd version exit show

I rebooted the host and, after a few checks, started MySQL again.
This host is scheduled to be failed over on Tuesday the 24th - T230783: Switchover s3 primary database master db1075 -> db1123 - 24th Sept @05:00 UTC

A new BBU for this host should be bought: {T233567}

A BBU failure has resulted in hosts crashing entirely before: T225391: db1077 crashed and T231638: db1074 crashed: Broken BBU.
This host is part of a 6-host batch ({T118174}, T128753: Rack and Initial setup db1074-79) and 3 of them have had BBU failures.

Event Timeline

wiki_willy added a subtask: Unknown Object (Task).
wiki_willy added a project: ops-eqdfw.
Marostegui edited projects, added ops-eqiad; removed ops-eqdfw.
Marostegui updated the task description.
Marostegui removed a subscriber: ops-monitoring-bot.
Marostegui renamed this task from db1075 (s3 master) crashed to db1075 (s3 master) crashed - BBU failure.Sep 23 2019, 5:15 AM

I've had a look; it's mostly good, but I must say I'm not convinced by the "Alerts worked fine" conclusion. Icinga noticed the host was down at 18:42:45, we had to ping an op (who just happened to be around) to get a message to you that something was wrong, and SMS pages only began at 18:56? It seems to me that a DB master host going offline should automatically be a paging event.

First of all, I noticed the event before that person texted me on Telegram; I actually didn't see that specific ping until a bit later in the incident.

The alerts worked as they are expected to work; whether that needs changing on the desired thresholds is a different discussion.

The first alert was at 18:42, and HOST DOWN alerts do not page (for any role).
Broken replication does page (for eqiad and core roles), but only after several checks, to avoid false positives, and the first one that was supposed to trigger an SMS was this one:

18:56:30 <+icinga-wm> PROBLEM - MariaDB Slave Lag: s3 #page on db1078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 963.46 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave

That is the page we also received on our phones.

Hope this clarifies why I wrote: Alerts worked fine.
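
To make the "after several checks" behaviour concrete, here is a minimal, purely illustrative Python sketch of "page only after N consecutive bad checks"; the check interval, retry count and lag threshold are assumed values for the example, not the actual production Icinga configuration.

# Minimal sketch of "page only after N consecutive bad checks".
# Interval, retry count and threshold are illustrative assumptions,
# not the production Icinga configuration.
from dataclasses import dataclass

@dataclass
class PagingPolicy:
    check_interval_s: int = 60     # how often the check runs (assumed)
    max_check_attempts: int = 5    # consecutive failures required before paging (assumed)
    lag_threshold_s: float = 300   # replication lag treated as CRITICAL (assumed)

def should_page(lag_samples_s, policy=PagingPolicy()):
    """True once the last max_check_attempts samples all exceed the threshold."""
    recent = lag_samples_s[-policy.max_check_attempts:]
    return (len(recent) == policy.max_check_attempts
            and all(lag > policy.lag_threshold_s for lag in recent))

# Example: lag starts climbing when the master goes down; with a 60 s check
# interval and 5 retries, the first SMS cannot fire until minutes later.
samples = [0, 0, 350, 410, 470, 530, 590, 650, 710]
print(should_page(samples))   # True only after five consecutive bad checks
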

I'm wondering if an entry should be added under "Where did we get lucky?" along the lines of "I/We noticed this incident before SMS paging began".

Was there a deliberate decision that DB master hosts appearing offline individually is not page-worthy, due to the potential for false positives? For most hosts that probably makes sense; I can't think of any other class of non-dev-facing server where an individual host going offline can cause direct user-visible problems (edit: perhaps arguably IRCd or something, but that one really is a different discussion), maybe the poolcounters a few years ago. I do wonder if DB masters should be considered for an exception to this.

whether that needs changing on the desired thresholds is a different discussion.

The director of SRE was the person who decided that at the time, because alerts were too annoying (https://gerrit.wikimedia.org/r/c/operations/puppet/mariadb/+/289825). I don't think we can override his decision.

We don't live in a theocracy or a dictatorship, and things can surely be reconsidered after a few years, if the original reasons are not valid anymore or we think the bad consequences outweigh the potential noise.

Having said that, I think @Krenair was referring to the fact that the master was unresponsive to ping, not that replication was broken (which is just a symptom). A master DB being unreachable by ping should page, IMHO, and it does not.

I'm wondering if an entry should be added under "Where did we get lucky?" along the lines of "I/We noticed this incident before SMS paging began".

Done

Was there a deliberate decision that DB master hosts appearing offline individually is not page-worthy, due to the potential for false positives? For most hosts that probably makes sense; I can't think of any other class of non-dev-facing server where an individual host going offline can cause direct user-visible problems (edit: perhaps arguably IRCd or something, but that one really is a different discussion), maybe the poolcounters a few years ago. I do wonder if DB masters should be considered for an exception to this.

I would assume that paging masters on HOST DOWN was not done at the time because of how Puppet works with Icinga, and/or because it needed a decent amount of refactoring to be able to select which hosts (or roles) can or cannot page when they go down.
Also, whether they should page depends on the DC: as of today we do not want codfw masters to page, but we would need them to if we run active-active.

This also raises the question of what the source of truth for database masters is, Puppet or MediaWiki, and which should feed the other (again, a different discussion).

Having said that, I do agree that a master going down means we are in read-only mode instantly, so they should probably page. I will create a task for this and add it to the actionable list on the IR.

Thanks!
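
To illustrate the kind of refactoring being discussed (selecting which hosts or roles page on HOST DOWN, possibly per datacenter), here is a small, hypothetical Python sketch; the role name, the active-DC value and the per-DC switch are assumptions for the example, not the real Puppet/Icinga data model.

# Illustrative only: a per-role, per-datacenter HOST DOWN paging decision.
# The role name and the active-DC value are assumptions, not the real data model.

PAGING_ROLES = {"mariadb::core::master"}   # roles whose HOST DOWN should page (assumed)
ACTIVE_DC = "eqiad"                        # would come from the source of truth

def host_down_should_page(role: str, datacenter: str,
                          page_inactive_dc: bool = True) -> bool:
    """Decide whether a HOST DOWN event for this host should send an SMS."""
    if role not in PAGING_ROLES:
        return False
    # Either page masters in every datacenter, or only in the active one.
    return page_inactive_dc or datacenter == ACTIVE_DC

print(host_down_should_page("mariadb::core::master", "eqiad"))                          # True
print(host_down_should_page("mariadb::core::master", "codfw", page_inactive_dc=False))  # False
print(host_down_should_page("mariadb::core::replica", "eqiad"))                         # False
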

Let's not have the perfect be the enemy of the good.

It is relatively easy to page on a class of hosts if they go down. It is a bit more complex to make them page only in the currently active MediaWiki datacenter, for which the source of truth is etcd.

We can, however, think of ways to overcome that difficulty in the future, but I would say that paging for a master going down, regardless of the datacenter, seems like a good idea and a net win to me.
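
As a rough sketch of what the active-datacenter lookup could look like, the snippet below reads a key from etcd's v2 HTTP keys API with the requests library; the etcd endpoint and key path are hypothetical placeholders, and the real conftool/etcd layout may differ.

# Rough sketch: look up the active MediaWiki datacenter before deciding whether
# a HOST DOWN page should fire there. The etcd host and key path are hypothetical.
import requests

ETCD_URL = "https://etcd.example.internal:2379/v2/keys/mediawiki/active-dc"  # assumed path

def active_mediawiki_dc(default: str = "eqiad") -> str:
    """Fetch the active DC from etcd's v2 keys API; fall back to a default on error."""
    try:
        resp = requests.get(ETCD_URL, timeout=2)
        resp.raise_for_status()
        return resp.json()["node"]["value"]
    except (requests.RequestException, KeyError, ValueError):
        return default

# Per the comment above, masters page regardless of DC for now; the lookup only
# becomes relevant if paging is later restricted to the active datacenter.
if __name__ == "__main__":
    print("active MediaWiki DC:", active_mediawiki_dc())
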

I agree, we can page for codfw hosts for now too; after all, we don't have many of those. I was talking about the ideal situation desired at the time :-)

The BBU has arrived at the DC; I am trying to coordinate with @Cmjohnson and @Jclark-ctr to see if we can replace it ASAP.

Mentioned in SAL (#wikimedia-operations) [2019-09-25T12:37:37Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1075 for BBU replacement T233534', diff saved to https://phabricator.wikimedia.org/P9176 and previous config saved to /var/cache/conftool/dbconfig/20190925-123736-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2019-09-25T12:37:51Z] <marostegui> Stop MySQL on db1075 for BBU replacement T233534

Mentioned in SAL (#wikimedia-operations) [2019-09-25T12:41:33Z] <marostegui> Shutdown db1075 for onsite maintenance T233534

I can see the battery now that @Jclark-ctr has installed the new one:

Battery/Capacitor Count: 1
Battery/Capacitor Status: OK

Update:

[15:05:52]  <+icinga-wm>	RECOVERY - HP RAID on db1075 is OK: OK: Slot 1: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
Jclark-ctr claimed this task.
Jclark-ctr added a subscriber: Cmjohnson.

Replaced the battery. Resolving ticket.

db1075 is now fully pooled back.
Thanks John!

Marostegui closed subtask Unknown Object (Task) as Resolved.Oct 9 2019, 2:47 PM