
db1075 (s3 master) crashed - BBU failure
Closed, Resolved, Public

Description

db1075, the s3 primary master, crashed due to a BBU failure (T233535), which left all the s3 wikis in read-only mode from 18:48 UTC to around 19:15 UTC.
Reads were not affected.

HW logs:

/system1/log1/record13
  Targets
  Properties
    number=13
    severity=Caution
    date=09/22/2019
    time=18:37
    description=Smart Storage Battery failure (Battery 1, service information: 0x0A). Action: Gather AHS log and contact Support
  Verbs
    cd version exit show


</system1/log1>hpiLO-> show record14

status=0
status_tag=COMMAND COMPLETED
Mon Sep 23 05:01:26 2019



/system1/log1/record14
  Targets
  Properties
    number=14
    severity=Caution
    date=09/22/2019
    time=19:00
    description=POST Error: 313-HPE Smart Storage Battery 1 Failure - Battery Shutdown Event Code: 0x0400. Action: Restart system. Contact HPE support if condition persists.
  Verbs
    cd version exit show

I rebooted the host and, after a few checks, started MySQL again.
This host is scheduled to be failed over on Tuesday the 24th - T230783: Switchover s3 primary database master db1075 -> db1123 - 24th Sept @05:00 UTC

A new BBU for this host should be bought: {T233567}

A BBU failure has resulted in hosts crashing entirely before: T225391: db1077 crashed and T231638: db1074 crashed: Broken BBU.
This host is part of a 6-host batch ({T118174}, T128753: Rack and Initial setup db1074-79) and 3 of them have had BBU failures.

Event Timeline

wiki_willy added a subtask: Unknown Object (Task).
wiki_willy added a project: ops-eqdfw.
Marostegui edited projects, added ops-eqiad; removed ops-eqdfw.
Marostegui updated the task description.
Marostegui removed a subscriber: ops-monitoring-bot.
Marostegui renamed this task from db1075 (s3 master) crashed to db1075 (s3 master) crashed - BBU failure.Sep 23 2019, 5:15 AM

I've had a look; it's mostly good, but I must say I'm not convinced by the "Alerts worked fine" conclusion. Icinga noticed the host was down at 18:42:45, we had to ping an op (who just happened to be around) to get a message to you that something was wrong, and SMS pages only began at 18:56? It seems to me that a DB master host going offline should automatically be a paging event.

First of all, I noticed the event before that person texted me on Telegram; I actually didn't see that specific ping until a bit later in the incident.

The alerts worked as they are expected to work; whether that needs changing on the desired thresholds is a different discussion.

The first alert was at 18:42, and HOST DOWN alerts do not page (for any role).
Broken replication does page (for eqiad and core roles), but only after several checks, to avoid false positives, and the first one that was supposed to trigger an SMS was this one:

18:56:30 <+icinga-wm> PROBLEM - MariaDB Slave Lag: s3 #page on db1078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 963.46 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave

That is the page we also received on our phones.

Hope this clarifies why I wrote: Alerts worked fine.
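
To make the "after several checks" behaviour concrete, here is a minimal, purely illustrative Python sketch of "page only after N consecutive bad checks"; the check interval, retry count and lag threshold are assumed values for the example, not the actual production Icinga configuration.

# Minimal sketch of "page only after N consecutive bad checks".
# Interval, retry count and threshold are illustrative assumptions,
# not the production Icinga configuration.
from dataclasses import dataclass

@dataclass
class PagingPolicy:
    check_interval_s: int = 60     # how often the check runs (assumed)
    max_check_attempts: int = 5    # consecutive failures required before paging (assumed)
    lag_threshold_s: float = 300   # replication lag treated as CRITICAL (assumed)

def should_page(lag_samples_s, policy=PagingPolicy()):
    """True once the last max_check_attempts samples all exceed the threshold."""
    recent = lag_samples_s[-policy.max_check_attempts:]
    return (len(recent) == policy.max_check_attempts
            and all(lag > policy.lag_threshold_s for lag in recent))

# Example: lag starts climbing when the master goes down; with a 60 s check
# interval and 5 retries, the first SMS cannot fire until minutes later.
samples = [0, 0, 350, 410, 470, 530, 590, 650, 710]
print(should_page(samples))   # True only after five consecutive bad checks
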

I'm wondering if an entry should be added under "Where did we get lucky?" along the lines of "I/We noticed this incident before SMS paging began".

Was there a deliberate decision that DB master hosts appearing offline individually is not page-worthy, due to the potential for false positives? For most hosts that probably makes sense; I can't think of any other class of non-dev-facing server where an individual host going offline can cause direct user-visible problems (edit: perhaps arguably IRCd or something, but that one really is a different discussion), maybe the poolcounters a few years ago. I do wonder if DB masters should be considered for an exception to this.

whether that needs changing on the desired thresholds is a different discussion.

The director of SRE was the person who decided that at the time, because alerts were too annoying (https://gerrit.wikimedia.org/r/c/operations/puppet/mariadb/+/289825). I don't think we can override his decision.

We don't live in a theocracy or a dictatorship, and things can surely be reconsidered after a few years, if the original reasons are not valid anymore or we think the bad consequences outweigh the potential noise.

Having said that, I think @Krenair was referring to the fact that the master was unresponsive to ping, not that replication was broken (which is just a symptom). A master DB being unreachable by ping should page, IMHO, and it does not.

I'm wondering if an entry should be added under "Where did we get lucky?" along the lines of "I/We noticed this incident before SMS paging began".

Done

Was there a deliberate decision that DB master hosts appearing offline individually is not page-worthy, due to the potential for false positives? For most hosts that probably makes sense; I can't think of any other class of non-dev-facing server where an individual host going offline can cause direct user-visible problems (edit: perhaps arguably IRCd or something, but that one really is a different discussion), maybe the poolcounters a few years ago. I do wonder if DB masters should be considered for an exception to this.

I would assume that paging masters on HOST DOWN was not done at the time because of how Puppet works with Icinga, and/or because it needed a decent amount of refactoring to be able to select which hosts (or roles) can or cannot page when they go down.
Also, whether they should page depends on the DC: as of today we do not want codfw masters to page, but we would need them to if we run active-active.

This also raises the question of what the source of truth for database masters is, Puppet or MediaWiki, and which should feed the other (again, a different discussion).

Having said that, I do agree that a master going down means we are in read-only mode instantly, so they should probably page. I will create a task for this and add it to the actionable list on the IR.

Thanks!
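
To illustrate the kind of refactoring being discussed (selecting which hosts or roles page on HOST DOWN, possibly per datacenter), here is a small, hypothetical Python sketch; the role name, the active-DC value and the per-DC switch are assumptions for the example, not the real Puppet/Icinga data model.

# Illustrative only: a per-role, per-datacenter HOST DOWN paging decision.
# The role name and the active-DC value are assumptions, not the real data model.

PAGING_ROLES = {"mariadb::core::master"}   # roles whose HOST DOWN should page (assumed)
ACTIVE_DC = "eqiad"                        # would come from the source of truth

def host_down_should_page(role: str, datacenter: str,
                          page_inactive_dc: bool = True) -> bool:
    """Decide whether a HOST DOWN event for this host should send an SMS."""
    if role not in PAGING_ROLES:
        return False
    # Either page masters in every datacenter, or only in the active one.
    return page_inactive_dc or datacenter == ACTIVE_DC

print(host_down_should_page("mariadb::core::master", "eqiad"))                          # True
print(host_down_should_page("mariadb::core::master", "codfw", page_inactive_dc=False))  # False
print(host_down_should_page("mariadb::core::replica", "eqiad"))                         # False
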

Let's not have the perfect be the enemy of the good.

It is relatively easy to page on a class of hosts if they go down. It is a bit more complex to make them page only in the currently active MediaWiki datacenter, for which the source of truth is etcd.

We can, however, think of ways to overcome that difficulty in the future, but I would say that paging for a master going down, regardless of the datacenter, seems like a good idea and a net win to me.
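
As a rough sketch of what the active-datacenter lookup could look like, the snippet below reads a key from etcd's v2 HTTP keys API with the requests library; the etcd endpoint and key path are hypothetical placeholders, and the real conftool/etcd layout may differ.

# Rough sketch: look up the active MediaWiki datacenter before deciding whether
# a HOST DOWN page should fire there. The etcd host and key path are hypothetical.
import requests

ETCD_URL = "https://etcd.example.internal:2379/v2/keys/mediawiki/active-dc"  # assumed path

def active_mediawiki_dc(default: str = "eqiad") -> str:
    """Fetch the active DC from etcd's v2 keys API; fall back to a default on error."""
    try:
        resp = requests.get(ETCD_URL, timeout=2)
        resp.raise_for_status()
        return resp.json()["node"]["value"]
    except (requests.RequestException, KeyError, ValueError):
        return default

# Per the comment above, masters page regardless of DC for now; the lookup only
# becomes relevant if paging is later restricted to the active datacenter.
if __name__ == "__main__":
    print("active MediaWiki DC:", active_mediawiki_dc())
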

I agree, we can page for codfw hosts for now too; after all, we don't have many of those. I was talking about the ideal situation desired at the time :-)

The BBU has arrived at the DC; I am trying to coordinate with @Cmjohnson and @Jclark-ctr to see if we can replace it ASAP.

Mentioned in SAL (#wikimedia-operations) [2019-09-25T12:37:37Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1075 for BBU replacement T233534', diff saved to https://phabricator.wikimedia.org/P9176 and previous config saved to /var/cache/conftool/dbconfig/20190925-123736-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2019-09-25T12:37:51Z] <marostegui> Stop MySQL on db1075 for BBU replacement T233534

Mentioned in SAL (#wikimedia-operations) [2019-09-25T12:41:33Z] <marostegui> Shutdown db1075 for onsite maintenance T233534

I can see the battery now that @Jclark-ctr has installed the new one:

Battery/Capacitor Count: 1
Battery/Capacitor Status: OK

Update:

[15:05:52]  <+icinga-wm>	RECOVERY - HP RAID on db1075 is OK: OK: Slot 1: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
Jclark-ctr claimed this task.
Jclark-ctr added a subscriber: Cmjohnson.

Replaced the battery. Resolving ticket.

db1075 is now fully pooled back.
Thanks John!

Marostegui closed subtask Unknown Object (Task) as Resolved.Oct 9 2019, 2:47 PM