Failed disk / degraded RAID arrays: restbase1018.eqiad.wmnet
Closed, Resolved (Public)

Description

It would seem that a disk (/dev/sdc) has failed, and Cassandra's data directories for all 3 instances are unwritable. Since these instances have already been down for some time, and no ETA for repair/replacement yet exists, the safest choice would be to remove the instances.
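For reference, a minimal sketch of how a dead instance is typically taken out of the ring with nodetool, run from any healthy node in the cluster; the Host ID below is a placeholder, not the actual one:

```
# Hedged sketch; run from any healthy node. The UUID is a placeholder.

# Find the Host ID of the down (DN) instance, e.g. restbase1018-a:
nodetool status | grep DN

# Ask the cluster to remove that instance and re-replicate its token ranges
# from the surviving replicas:
nodetool removenode 11111111-2222-3333-4444-555555555555
```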

Next steps

  • [operations] Shut down node and replace sdc
  • [operations] Re-assemble / re-create RAID0 (a hedged mdadm sketch follows this list)
  • [services] Re-bootstrap Cassandra instances
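For the RAID step, a hedged mdadm sketch; the device names, array number, filesystem, and mount point are assumptions and not necessarily restbase1018's actual layout:

```
# Hedged sketch with placeholder device names; the real partition layout on
# restbase1018 may differ.

# Re-create the RAID0 array across the member disks (including the new sdc):
mdadm --create /dev/md2 --level=0 --raid-devices=3 /dev/sda3 /dev/sdb3 /dev/sdc3

# New filesystem, mounted where the Cassandra instances keep their data:
mkfs.ext4 /dev/md2
mount /dev/md2 /srv

# Record the array so it assembles at boot:
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
update-initramfs -u
```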

See: T163280: Degraded RAID on restbase1018 and T164202: Degraded RAID on restbase1018

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2017-04-19T01:21:44Z] <urandom> T163292: Starting removal of Cassandra instance restbase1018-a.eqiad.wmnet

> Since these instances have already been down for some time, and no ETA for repair/replacement yet exists,

Down for some time? restbase1018 has an uptime of 127 days and looks alive and running, and the RAID incident happened less than 2 hours ago. The linked ticket was auto-created by a monitoring bot. I depooled the server within 6 minutes of the incident.

Adding ops-eqiad to get an ETA.

@Dzahn, @Eevans' concern is about the Cassandra instances, not the stateless RESTBase service itself. While those instances are down, per-DC redundancy for the data stored in them is reduced. We use quorum reads for consistency, so both remaining local replicas need to respond before a read request is deemed successful. This can reduce performance when only two replicas are up.
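To make the arithmetic concrete: with 3-way replication per DC, a local-quorum read needs floor(3/2) + 1 = 2 local replicas to respond, so losing one of the three leaves no headroom at all. A hedged illustration (keyspace, table, and key names are placeholders):

```
# Illustration only; placeholder keyspace/table/key names.
# quorum = floor(RF / 2) + 1 = floor(3 / 2) + 1 = 2 replicas per DC

# See which local replicas are up (UN) vs. down (DN):
nodetool status

# A read at a local-quorum consistency level, as described above:
cqlsh -e "CONSISTENCY LOCAL_QUORUM; SELECT * FROM some_ks.some_table WHERE key = 'x';"
```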

We'll need to re-create the RAID 0 and re-bootstrap the instances in any case. Decommissioning the existing instances prepares for that, and restores 3-way replication in the meantime.

@GWicke @Eevans thanks for the explanation. (And I just saw you removing the instances; the Icinga alerts were followed by recoveries. Looks good.)

> Since these instances have already been down for some time, and no ETA for repair/replacement yet exists,
>
> Down for some time? restbase1018 has an uptime of 127 days and looks alive and running, and the RAID incident happened less than 2 hours ago. The linked ticket was auto-created by a monitoring bot. I depooled the server within 6 minutes of the incident.

Sorry, I should have been more verbose here. What I meant was: I think it's safe to assume these instances won't be back up in a matter of minutes or hours, nor that the data on them will be recoverable when they are. The removenode operations will take a while to run, and we're in a degraded state until they complete. If it's inevitable that we rebuild these instances, then it's better to start that process as soon as possible.
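A hedged sketch of how the long-running removals can be watched from a healthy node while they run; no particular output is implied:

```
# Hedged sketch; run from any healthy node.

# State of the in-flight removal (token ranges still being re-replicated):
nodetool removenode status

# Which nodes are currently streaming data as part of the removal:
nodetool netstats
```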

The error rate is currently quite high, with a lot of timeouts starting at the point of the RAID failure:

https://logstash.wikimedia.org/goto/2030aaa5a032e2491fa54cd0ffc3460b

The removenode-induced load doesn't seem to be helping matters:

https://grafana-admin.wikimedia.org/dashboard/snapshot/FFjhvEEubeRIqVY31316QCrd0syW710D
https://grafana-admin.wikimedia.org/dashboard/snapshot/UDVRURItraxDdZTtGdSavCLJAB27mhYt

NOTE: restbase1009, restbase1014 and restbase1015 are the remaining row/rack 'd' nodes.
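One way to take load off the streaming nodes (a later update notes the removenode was indeed throttled, though the exact mechanism and value are not recorded here) is to lower Cassandra's streaming throughput cap; the value below is purely illustrative:

```
# Hedged example; the value is illustrative only.
# Cap streaming throughput on each node participating in the re-replication
# (same unit as stream_throughput_outbound_megabits_per_sec; 0 = unthrottled):
nodetool setstreamthroughput 30

# Check the current cap:
nodetool getstreamthroughput
```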

Because this is happening in eqiad, where asynchronous updates are processed, I will lower the ChangeProp (CP) processing concurrency to 30 (from 50) to see whether it takes some pressure off the remaining nodes in the rack without a significant impact on update latency.
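A hedged sketch of what that change and its deployment might look like; the config path and 'concurrency' key are hypothetical placeholders, while the review-then-scap flow matches the Gerrit change and SAL entries that follow:

```
# Hedged sketch; scap/templates/config.yaml.erb and the 'concurrency' key are
# hypothetical placeholders for wherever the value actually lives.

# In a checkout of mediawiki/services/change-propagation/deploy:
sed -i 's/concurrency: 50/concurrency: 30/' scap/templates/config.yaml.erb
git commit -a -m "Temporarily lower the concurrency to 30"
git review                      # upload to Gerrit for review

# Once merged, from the deployment host (path is an assumption):
cd /srv/deployment/changeprop/deploy && git pull
scap deploy 'Temporarily lower the concurrency to 30 - T163292'
```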

Change 348900 had a related patch set uploaded (by Mobrovac):
[mediawiki/services/change-propagation/deploy@master] Temporarily lower the concurrency to 30

https://gerrit.wikimedia.org/r/348900

Change 348900 merged by Eevans:
[mediawiki/services/change-propagation/deploy@master] Temporarily lower the concurrency to 30

https://gerrit.wikimedia.org/r/348900

Change 348901 had a related patch set uploaded (by Mobrovac):
[mediawiki/services/change-propagation/deploy@master] Revert "Temporarily lower the concurrency to 30"

https://gerrit.wikimedia.org/r/348901

Change 348901 merged by Mobrovac:
[mediawiki/services/change-propagation/deploy@master] Revert "Temporarily lower the concurrency to 30"

https://gerrit.wikimedia.org/r/348901

Change 348902 had a related patch set uploaded (by Mobrovac):
[mediawiki/services/change-propagation/deploy@master] Temporarily half the transclusion update concurrency

https://gerrit.wikimedia.org/r/348902

Change 348902 merged by Ppchelko:
[mediawiki/services/change-propagation/deploy@master] Temporarily half the transclusion update concurrency

https://gerrit.wikimedia.org/r/348902

Mentioned in SAL (#wikimedia-operations) [2017-04-19T03:31:37Z] <mobrovac@tin> Started deploy [changeprop/deploy@a19ebf8]: Temp: Decrease the transclusion update from 400 to 200 for T163292

Mentioned in SAL (#wikimedia-operations) [2017-04-19T03:32:31Z] <mobrovac@tin> Finished deploy [changeprop/deploy@a19ebf8]: Temp: Decrease the transclusion update from 400 to 200 for T163292 (duration: 00m 53s)

Mentioned in SAL (#wikimedia-operations) [2017-04-19T03:53:43Z] <urandom> T163292: Starting removal of Cassandra instance restbase1018-b.eqiad.wmnet

Update: after throttling the removenode operation, reducing the transclusion update concurrency, and restarting RESTBase, the warnings that remain are primarily connection failures for the restbase1018 instances (which is to be expected).

Eevans triaged this task as High priority. Apr 19 2017, 4:32 AM

Mentioned in SAL (#wikimedia-operations) [2017-04-19T12:38:52Z] <urandom> T163292: Starting removal of Cassandra instance restbase1018-c.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2017-04-19T17:25:20Z] <mobrovac@naos> Started restart [restbase/deploy@1bfada4]: Restart to stop trying to connect to dead restbase1018 Cassandra instances - T163292

Mentioned in SAL (#wikimedia-operations) [2017-04-19T18:19:33Z] <mobrovac> restbase stopping RB and disabling puppet on restbase1018 due to T163292

Change 349322 had a related patch set uploaded (by Ppchelko):
[mediawiki/services/change-propagation/deploy@master] Config: Return the transcludes concurrency to normal

https://gerrit.wikimedia.org/r/349322

Change 349322 abandoned by Ppchelko:
Config: Return the transcludes concurrency to normal

Reason:
In favor of I05b1d27f76fc70a688c4c3b36a2887883fca5694

https://gerrit.wikimedia.org/r/349322

Mentioned in SAL (#wikimedia-operations) [2017-04-25T15:26:50Z] <mobrovac@naos> Started deploy [changeprop/deploy@e0e3684]: Bring back the concurrency level - T163292

Mentioned in SAL (#wikimedia-operations) [2017-04-25T15:26:59Z] <mobrovac@naos> Finished deploy [changeprop/deploy@e0e3684]: Bring back the concurrency level - T163292 (duration: 00m 10s)

Mentioned in SAL (#wikimedia-operations) [2017-04-25T15:35:44Z] <mobrovac@naos> Started deploy [changeprop/deploy@7521b2f]: Bring back the concurrency level - T163292

Mentioned in SAL (#wikimedia-operations) [2017-04-25T15:36:57Z] <mobrovac@naos> Finished deploy [changeprop/deploy@7521b2f]: Bring back the concurrency level - T163292 (duration: 01m 13s)

I just recreated the RAID arrays and rebooted the system with the new disk in place. @Eevans, I'll let you re-start Puppet and attend to Cassandra. Of course, the data in /srv is gone for good.
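A hedged sketch of the handover steps implied here (verify the rebuilt arrays, then re-enable Puppet); the md device name is a placeholder:

```
# Hedged sketch; /dev/md2 is a placeholder.

# Confirm the re-created arrays are assembled and healthy after the reboot:
cat /proc/mdstat
mdadm --detail /dev/md2

# Re-enable and run Puppet so the Cassandra configuration is laid back down:
puppet agent --enable
puppet agent --test
```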

Great; Thanks @Joe, I'll take it from here!

The host is now ready to have the instances re-bootstrapped, but let's postpone doing so until after the Services data-center switchover (scheduled for tomorrow).
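When the time comes, a hedged sketch of re-bootstrapping one instance at a time; the systemd unit names follow the -a/-b/-c instance naming used in this task and are an assumption:

```
# Hedged sketch; unit names (cassandra-a, -b, -c) are assumed from the
# instance naming used elsewhere in this task. Bootstrap one instance at a
# time, and wait for each to finish before starting the next.

# Start the empty instance; with no data on disk it will bootstrap and stream
# its token ranges from the existing replicas:
systemctl start cassandra-a

# Watch the join: the node shows as UJ (Up/Joining) until streaming completes,
# then UN (Up/Normal):
nodetool status
nodetool netstats
```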

Mentioned in SAL (#wikimedia-operations) [2017-05-04T16:55:40Z] <urandom> T163292: Starting bootstrap of restbase1018-a

Mentioned in SAL (#wikimedia-operations) [2017-05-05T01:26:02Z] <urandom> T163292: starting bootstrap of restbase1018-b

Mentioned in SAL (#wikimedia-operations) [2017-05-05T13:28:05Z] <urandom> T163292: bootstrapping Cassandra on restbase1018-c