Failed disk / degraded RAID arrays: restbase1018.eqiad.wmnet
Closed, Resolved (Public)

Description

It would seem that a disk (/dev/sdc) has failed, and Cassandra's data directories for all 3 instances are unwritable. Since these instances have already been down for some time, and no ETA for repair/replacement yet exists, the safest choice would be to remove the instances.
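For reference, a minimal sketch of how a dead instance is typically taken out of the ring with nodetool, run from any healthy node in the cluster; the Host ID below is a placeholder, not the actual one:

```
# Hedged sketch; run from any healthy node. The UUID is a placeholder.

# Find the Host ID of the down (DN) instance, e.g. restbase1018-a:
nodetool status | grep DN

# Ask the cluster to remove that instance and re-replicate its token ranges
# from the surviving replicas:
nodetool removenode 11111111-2222-3333-4444-555555555555
```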

Next steps

  • [operations] Shut down node and replace sdc
  • [operations] Re-assemble / re-create RAID0 (a hedged mdadm sketch follows this list)
  • [services] Re-bootstrap Cassandra instances
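For the RAID step, a hedged mdadm sketch; the device names, array number, filesystem, and mount point are assumptions and not necessarily restbase1018's actual layout:

```
# Hedged sketch with placeholder device names; the real partition layout on
# restbase1018 may differ.

# Re-create the RAID0 array across the member disks (including the new sdc):
mdadm --create /dev/md2 --level=0 --raid-devices=3 /dev/sda3 /dev/sdb3 /dev/sdc3

# New filesystem, mounted where the Cassandra instances keep their data:
mkfs.ext4 /dev/md2
mount /dev/md2 /srv

# Record the array so it assembles at boot:
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
update-initramfs -u
```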

See: T163280: Degraded RAID on restbase1018 and T164202: Degraded RAID on restbase1018

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2017-04-19T01:21:44Z] <urandom> T163292: Starting removal of Cassandra instance restbase1018-a.eqiad.wmnet

> Since these instances have already been down for some time, and no ETA for repair/replacement yet exists,

Down for some time? restbase1018 has an uptime of 127 days and looks alive and running, and the RAID incident happened less than 2 hours ago. The linked ticket was auto-created by a monitoring bot. I depooled the server within 6 minutes of the incident.

Adding ops-eqiad to get an ETA.

@Dzahn, @Eevans' concern is about the Cassandra instances, not the stateless RESTBase service itself. While those instances are down, per-DC redundancy for the data stored in them is reduced. We use quorum reads for consistency, so both remaining local replicas need to respond before a read request is deemed successful. This can reduce performance when only two replicas are up.
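To make the arithmetic concrete: with 3-way replication per DC, a local-quorum read needs floor(3/2) + 1 = 2 local replicas to respond, so losing one of the three leaves no headroom at all. A hedged illustration (keyspace, table, and key names are placeholders):

```
# Illustration only; placeholder keyspace/table/key names.
# quorum = floor(RF / 2) + 1 = floor(3 / 2) + 1 = 2 replicas per DC

# See which local replicas are up (UN) vs. down (DN):
nodetool status

# A read at a local-quorum consistency level, as described above:
cqlsh -e "CONSISTENCY LOCAL_QUORUM; SELECT * FROM some_ks.some_table WHERE key = 'x';"
```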

We'll need to re-create the RAID 0 and re-bootstrap the instances in any case. Decommissioning the existing instances prepares for that, and restores 3-way replication in the meantime.

@GWicke @Eevans thanks for the explanation. (And I just saw you removing the instances; the Icinga alerts were followed by recoveries. Looks good.)

> Since these instances have already been down for some time, and no ETA for repair/replacement yet exists,
>
> Down for some time? restbase1018 has an uptime of 127 days and looks alive and running, and the RAID incident happened less than 2 hours ago. The linked ticket was auto-created by a monitoring bot. I depooled the server within 6 minutes of the incident.

Sorry, I should have been more verbose here. What I meant was: I think it's safe to assume these instances won't be back up in a matter of minutes or hours, nor that the data on them will be recoverable when they are. The removenode operations will take a while to run, and we're in a degraded state until they complete. If it's inevitable that we rebuild these instances, then it's better to start that process as soon as possible.
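A hedged sketch of how the long-running removals can be watched from a healthy node while they run; no particular output is implied:

```
# Hedged sketch; run from any healthy node.

# State of the in-flight removal (token ranges still being re-replicated):
nodetool removenode status

# Which nodes are currently streaming data as part of the removal:
nodetool netstats
```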

The error rate is currently quite high, with a lot of timeouts starting at the point of the RAID failure:

https://logstash.wikimedia.org/goto/2030aaa5a032e2491fa54cd0ffc3460b

The removenode-induced load doesn't seem to be helping matters:

https://grafana-admin.wikimedia.org/dashboard/snapshot/FFjhvEEubeRIqVY31316QCrd0syW710D
https://grafana-admin.wikimedia.org/dashboard/snapshot/UDVRURItraxDdZTtGdSavCLJAB27mhYt

NOTE: restbase1009, restbase1014 and restbase1015 are the remaining row/rack 'd' nodes.
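One way to take load off the streaming nodes (a later update notes the removenode was indeed throttled, though the exact mechanism and value are not recorded here) is to lower Cassandra's streaming throughput cap; the value below is purely illustrative:

```
# Hedged example; the value is illustrative only.
# Cap streaming throughput on each node participating in the re-replication
# (same unit as stream_throughput_outbound_megabits_per_sec; 0 = unthrottled):
nodetool setstreamthroughput 30

# Check the current cap:
nodetool getstreamthroughput
```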

Because this is happening in eqiad, where asynchronous updates are processed, I will lower the ChangeProp (CP) processing concurrency to 30 (from 50) to see whether it takes some pressure off the remaining nodes in the rack without a significant impact on update latency.
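A hedged sketch of what that change and its deployment might look like; the config path and 'concurrency' key are hypothetical placeholders, while the review-then-scap flow matches the Gerrit change and SAL entries that follow:

```
# Hedged sketch; scap/templates/config.yaml.erb and the 'concurrency' key are
# hypothetical placeholders for wherever the value actually lives.

# In a checkout of mediawiki/services/change-propagation/deploy:
sed -i 's/concurrency: 50/concurrency: 30/' scap/templates/config.yaml.erb
git commit -a -m "Temporarily lower the concurrency to 30"
git review                      # upload to Gerrit for review

# Once merged, from the deployment host (path is an assumption):
cd /srv/deployment/changeprop/deploy && git pull
scap deploy 'Temporarily lower the concurrency to 30 - T163292'
```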

Change 348900 had a related patch set uploaded (by Mobrovac):
[mediawiki/services/change-propagation/deploy@master] Temporarily lower the concurrency to 30

https://gerrit.wikimedia.org/r/348900

Change 348900 merged by Eevans:
[mediawiki/services/change-propagation/deploy@master] Temporarily lower the concurrency to 30

https://gerrit.wikimedia.org/r/348900

Change 348901 had a related patch set uploaded (by Mobrovac):
[mediawiki/services/change-propagation/deploy@master] Revert "Temporarily lower the concurrency to 30"

https://gerrit.wikimedia.org/r/348901

Change 348901 merged by Mobrovac:
[mediawiki/services/change-propagation/deploy@master] Revert "Temporarily lower the concurrency to 30"

https://gerrit.wikimedia.org/r/348901

Change 348902 had a related patch set uploaded (by Mobrovac):
[mediawiki/services/change-propagation/deploy@master] Temporarily half the transclusion update concurrency

https://gerrit.wikimedia.org/r/348902

Change 348902 merged by Ppchelko:
[mediawiki/services/change-propagation/deploy@master] Temporarily half the transclusion update concurrency

https://gerrit.wikimedia.org/r/348902

Mentioned in SAL (#wikimedia-operations) [2017-04-19T03:31:37Z] <mobrovac@tin> Started deploy [changeprop/deploy@a19ebf8]: Temp: Decrease the transclusion update from 400 to 200 for T163292

Mentioned in SAL (#wikimedia-operations) [2017-04-19T03:32:31Z] <mobrovac@tin> Finished deploy [changeprop/deploy@a19ebf8]: Temp: Decrease the transclusion update from 400 to 200 for T163292 (duration: 00m 53s)

Mentioned in SAL (#wikimedia-operations) [2017-04-19T03:53:43Z] <urandom> T163292: Starting removal of Cassandra instance restbase1018-b.eqiad.wmnet

Update: after throttling the removenode operation, reducing the transclusion update concurrency, and restarting RESTBase, the warnings that remain are primarily connection failures for the restbase1018 instances (which is to be expected).

Eevans triaged this task as High priority. Apr 19 2017, 4:32 AM

Mentioned in SAL (#wikimedia-operations) [2017-04-19T12:38:52Z] <urandom> T163292: Starting removal of Cassandra instance restbase1018-c.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2017-04-19T17:25:20Z] <mobrovac@naos> Started restart [restbase/deploy@1bfada4]: Restart to stop trying to connect to dead restbase1018 Cassandra instances - T163292

Mentioned in SAL (#wikimedia-operations) [2017-04-19T18:19:33Z] <mobrovac> restbase stopping RB and disabling puppet on restbase1018 due to T163292

Change 349322 had a related patch set uploaded (by Ppchelko):
[mediawiki/services/change-propagation/deploy@master] Config: Return the transcludes concurrency to normal

https://gerrit.wikimedia.org/r/349322

Change 349322 abandoned by Ppchelko:
Config: Return the transcludes concurrency to normal

Reason:
In favor of I05b1d27f76fc70a688c4c3b36a2887883fca5694

https://gerrit.wikimedia.org/r/349322

Mentioned in SAL (#wikimedia-operations) [2017-04-25T15:26:50Z] <mobrovac@naos> Started deploy [changeprop/deploy@e0e3684]: Bring back the concurrency level - T163292

Mentioned in SAL (#wikimedia-operations) [2017-04-25T15:26:59Z] <mobrovac@naos> Finished deploy [changeprop/deploy@e0e3684]: Bring back the concurrency level - T163292 (duration: 00m 10s)

Mentioned in SAL (#wikimedia-operations) [2017-04-25T15:35:44Z] <mobrovac@naos> Started deploy [changeprop/deploy@7521b2f]: Bring back the concurrency level - T163292

Mentioned in SAL (#wikimedia-operations) [2017-04-25T15:36:57Z] <mobrovac@naos> Finished deploy [changeprop/deploy@7521b2f]: Bring back the concurrency level - T163292 (duration: 01m 13s)

I just recreated the RAID arrays and rebooted the system with the new disk in place. @Eevans, I'll let you re-start Puppet and attend to Cassandra. Of course, the data in /srv is gone for good.
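A hedged sketch of the handover steps implied here (verify the rebuilt arrays, then re-enable Puppet); the md device name is a placeholder:

```
# Hedged sketch; /dev/md2 is a placeholder.

# Confirm the re-created arrays are assembled and healthy after the reboot:
cat /proc/mdstat
mdadm --detail /dev/md2

# Re-enable and run Puppet so the Cassandra configuration is laid back down:
puppet agent --enable
puppet agent --test
```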

Great; Thanks @Joe, I'll take it from here!

The host is now ready to have the instances re-bootstrapped, but let's postpone doing so until after the Services data-center switchover (scheduled for tomorrow).
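When the time comes, a hedged sketch of re-bootstrapping one instance at a time; the systemd unit names follow the -a/-b/-c instance naming used in this task and are an assumption:

```
# Hedged sketch; unit names (cassandra-a, -b, -c) are assumed from the
# instance naming used elsewhere in this task. Bootstrap one instance at a
# time, and wait for each to finish before starting the next.

# Start the empty instance; with no data on disk it will bootstrap and stream
# its token ranges from the existing replicas:
systemctl start cassandra-a

# Watch the join: the node shows as UJ (Up/Joining) until streaming completes,
# then UN (Up/Normal):
nodetool status
nodetool netstats
```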

Mentioned in SAL (#wikimedia-operations) [2017-05-04T16:55:40Z] <urandom> T163292: Starting bootstrap of restbase1018-a

Mentioned in SAL (#wikimedia-operations) [2017-05-05T01:26:02Z] <urandom> T163292: starting bootstrap of restbase1018-b

Mentioned in SAL (#wikimedia-operations) [2017-05-05T13:28:05Z] <urandom> T163292: bootstrapping Cassandra on restbase1018-c