
Failed disk / degraded RAID arrays: restbase1018.eqiad.wmnet
Closed, ResolvedPublic

Description

It would seem that a disk (/dev/sdc) has failed, and Cassandra's data directories for all 3 instances are unwritable. Since these instances have already been down for some time, and no ETA for repair/replacement yet exists, the safest choice would be to remove the instances.

Next steps

  • [operations] Shut down the node and replace sdc
  • [operations] Re-assemble / re-create the RAID0 (a command-level sketch follows below)
  • [services] Re-bootstrap the Cassandra instances

See: T163280: Degraded RAID on restbase1018 and T164202: Degraded RAID on restbase1018
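
For reference, a minimal sketch of what the RAID0 re-creation step could look like once the disk is swapped. The device names, member count, and mount point below are illustrative assumptions, not restbase1018's actual layout:

```bash
# Hedged sketch only: device names, member count and mount point are
# illustrative assumptions, not restbase1018's real configuration.

# Inspect the current (degraded) array state.
cat /proc/mdstat
mdadm --detail /dev/md2

# Stop the old array and re-create the RAID0 across the member partitions.
# This destroys whatever was on the array, which is expected here.
mdadm --stop /dev/md2
mdadm --create /dev/md2 --level=0 --raid-devices=4 \
    /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3

# Fresh filesystem and mount point for the Cassandra data directories.
mkfs.ext4 /dev/md2
mount /dev/md2 /srv
```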

Details

Related Gerrit Patches:
[mediawiki/services/change-propagation/deploy@master] Config: Return the transcludes concurrency to normal
[mediawiki/services/change-propagation/deploy@master] Temporarily half the transclusion update concurrency
[mediawiki/services/change-propagation/deploy@master] Revert "Temporarily lower the concurrency to 30"
[mediawiki/services/change-propagation/deploy@master] Temporarily lower the concurrency to 30

Event Timeline

Eevans created this task. Apr 19 2017, 1:19 AM
Restricted Application added a subscriber: Aklapper. Apr 19 2017, 1:19 AM

Mentioned in SAL (#wikimedia-operations) [2017-04-19T01:21:44Z] <urandom> T163292: Starting removal of Cassandra instance restbase1018-a.eqiad.wmnet

Dzahn added a subscriber: Dzahn. Apr 19 2017, 1:25 AM

Since these instances have already been down for some time, and no ETA for repair/replacement yet exists,

down for some time? restbase1018 has an uptime of 127 days, looks alive and running, and the RAID incident happened less than 2 hours ago. The linked ticket was auto-created by a monitoring bot. I depooled the server within 6 minutes of the incident.

Adding ops-eqiad to get an ETA.

GWicke added a subscriber: GWicke. (Edited) Apr 19 2017, 1:35 AM

@Dzahn, @Eevans' concern is about the Cassandra instances, not the stateless RESTBase service itself. While those instances are down, per-DC redundancy for the data stored in them is reduced. We use quorum reads for consistency, so both remaining local replicas need to respond before a read request is deemed successful. This can reduce performance when only two replicas are up.

Since the RAID0 will need to be re-created in any case, these instances will have to be re-bootstrapped from scratch. Decommissioning them now prepares for that, and restores 3-way replication in the meantime.
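
(For context, the quorum arithmetic behind the comment above, assuming the per-DC replication factor of 3 implied by the "3-way replication" mentioned here:)

```latex
% Quorum size for an assumed per-DC replication factor RF = 3:
\[
  \mathrm{quorum} \;=\; \left\lfloor \frac{RF}{2} \right\rfloor + 1
                  \;=\; \left\lfloor \frac{3}{2} \right\rfloor + 1 \;=\; 2
\]
% With the restbase1018 replica down, both of the two remaining local
% replicas must answer every quorum read, leaving no headroom.
```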

Dzahn added a comment. Apr 19 2017, 1:38 AM

@GWicke @Eevans thanks for the explanation (and I just saw you removing it, with Icinga alerts followed by recoveries; looks good)

Eevans added a comment.

Since these instances have already been down for some time, and no ETA for repair/replacement yet exists,

down for some time? restbase1018 has an uptime of 127 days, looks alive and running, and the RAID incident happened less than 2 hours ago. The linked ticket was auto-created by a monitoring bot. I depooled the server within 6 minutes of the incident.

Sorry, I should have been more verbose here. What I meant was: I think it's safe to assume these instances won't be back up in a matter of minutes or hours, nor that the data on them will be recoverable when they are. The removenode operations will take a while to run, and we're in a degraded state until they complete. If rebuilding these instances is inevitable, it's better to start that process as soon as possible.
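
(The removenode flow referred to above, sketched at the command level; the host ID is a placeholder, and the plain `nodetool` invocation is an assumption, since multi-instance hosts may use per-instance wrappers:)

```bash
# Hedged sketch of removing a dead instance; host IDs are placeholders.

# From any live node, find the Host ID of the down (DN) instance.
nodetool status | grep DN

# Stream its token ranges to the surviving replicas and drop it from
# the ring; this is the long-running, load-generating part.
nodetool removenode <host-id-of-restbase1018-a>

# Check on a removal that is already in progress.
nodetool removenode status
```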

The error rate is currently quite high, with a lot of timeouts starting at the point of the RAID failure:

https://logstash.wikimedia.org/goto/2030aaa5a032e2491fa54cd0ffc3460b

The removenode-induced load doesn't seem to be helping matters:

https://grafana-admin.wikimedia.org/dashboard/snapshot/FFjhvEEubeRIqVY31316QCrd0syW710D
https://grafana-admin.wikimedia.org/dashboard/snapshot/UDVRURItraxDdZTtGdSavCLJAB27mhYt

NOTE: restbase10{09,14,15} are the remaining row/rack 'd' nodes

Because this is happening in eqiad, where asynchronous updates are processed, I will lower the ChangeProp (CP) processing concurrency to 30 (from 50) to see whether it takes some pressure off the remaining nodes in the rack without significantly impacting update latency.

Change 348900 had a related patch set uploaded (by Mobrovac):
[mediawiki/services/change-propagation/deploy@master] Temporarily lower the concurrency to 30

https://gerrit.wikimedia.org/r/348900

Change 348900 merged by Eevans:
[mediawiki/services/change-propagation/deploy@master] Temporarily lower the concurrency to 30

https://gerrit.wikimedia.org/r/348900

Change 348901 had a related patch set uploaded (by Mobrovac):
[mediawiki/services/change-propagation/deploy@master] Revert "Temporarily lower the concurrency to 30"

https://gerrit.wikimedia.org/r/348901

Change 348901 merged by Mobrovac:
[mediawiki/services/change-propagation/deploy@master] Revert "Temporarily lower the concurrency to 30"

https://gerrit.wikimedia.org/r/348901

Change 348902 had a related patch set uploaded (by Mobrovac):
[mediawiki/services/change-propagation/deploy@master] Temporarily half the transclusion update concurrency

https://gerrit.wikimedia.org/r/348902

Change 348902 merged by Ppchelko:
[mediawiki/services/change-propagation/deploy@master] Temporarily half the transclusion update concurrency

https://gerrit.wikimedia.org/r/348902

Mentioned in SAL (#wikimedia-operations) [2017-04-19T03:31:37Z] <mobrovac@tin> Started deploy [changeprop/deploy@a19ebf8]: Temp: Decrease the transclusion update from 400 to 200 for T163292

Mentioned in SAL (#wikimedia-operations) [2017-04-19T03:32:31Z] <mobrovac@tin> Finished deploy [changeprop/deploy@a19ebf8]: Temp: Decrease the transclusion update from 400 to 200 for T163292 (duration: 00m 53s)

Mentioned in SAL (#wikimedia-operations) [2017-04-19T03:53:43Z] <urandom> T163292: Starting removal of Cassandra instance restbase1018-b.eqiad.wmnet

Update: After throttling the removenode operation, reducing the transclusion concurrency, and restarting RESTBase, the remaining warnings are primarily connection failures for the (now removed) restbase1018 instances, which is to be expected.
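
(One way the removenode streaming could have been throttled; the value shown is illustrative, as the actual setting used is not recorded in this task:)

```bash
# Hedged sketch: cap outbound streaming throughput while removenode runs.
nodetool setstreamthroughput 20   # MB/s; illustrative value only
nodetool getstreamthroughput      # confirm the cap is in effect
```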

Eevans triaged this task as High priority. Apr 19 2017, 4:32 AM

Mentioned in SAL (#wikimedia-operations) [2017-04-19T12:38:52Z] <urandom> T163292: Starting removal of Cassandra instance restbase1018-c.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2017-04-19T17:25:20Z] <mobrovac@naos> Started restart [restbase/deploy@1bfada4]: Restart to stop trying to connect to dead restbase1018 Cassandra instances - T163292

BioPseudo awarded a token.
BioPseudo added a subscriber: BioPseudo.
mobrovac removed a project: Patch-For-Review.
mobrovac edited subscribers, added: DavidGreens; removed: BioPseudo, gerritbot.
mobrovac removed a subscriber: DavidGreens.

Mentioned in SAL (#wikimedia-operations) [2017-04-19T18:19:33Z] <mobrovac> restbase stopping RB and disabling puppet on restbase1018 due to T163292

GWicke updated the task description. Apr 20 2017, 2:30 PM

Change 349322 had a related patch set uploaded (by Ppchelko):
[mediawiki/services/change-propagation/deploy@master] Config: Return the transcludes concurrency to normal

https://gerrit.wikimedia.org/r/349322

Change 349322 abandoned by Ppchelko:
Config: Return the transcludes concurrency to normal

Reason:
In favor of I05b1d27f76fc70a688c4c3b36a2887883fca5694

https://gerrit.wikimedia.org/r/349322

Eevans moved this task from Backlog to Next on the Cassandra board. Apr 24 2017, 2:47 PM

Mentioned in SAL (#wikimedia-operations) [2017-04-25T15:26:50Z] <mobrovac@naos> Started deploy [changeprop/deploy@e0e3684]: Bring back the concurrency level - T163292

Mentioned in SAL (#wikimedia-operations) [2017-04-25T15:26:59Z] <mobrovac@naos> Finished deploy [changeprop/deploy@e0e3684]: Bring back the concurrency level - T163292 (duration: 00m 10s)

Mentioned in SAL (#wikimedia-operations) [2017-04-25T15:35:44Z] <mobrovac@naos> Started deploy [changeprop/deploy@7521b2f]: Bring back the concurrency level - T163292

Mentioned in SAL (#wikimedia-operations) [2017-04-25T15:36:57Z] <mobrovac@naos> Finished deploy [changeprop/deploy@7521b2f]: Bring back the concurrency level - T163292 (duration: 01m 13s)

Eevans updated the task description. May 2 2017, 4:37 PM
Joe added a subscriber: Joe. May 3 2017, 8:58 AM

I just re-created the RAID arrays and rebooted the system with the new disk in place. @Eevans, I'll let you re-start Puppet and attend to Cassandra. Of course, the data in /srv is gone for good.

Eevans added a comment. May 3 2017, 1:55 PM

I just re-created the RAID arrays and rebooted the system with the new disk in place. @Eevans, I'll let you re-start Puppet and attend to Cassandra. Of course, the data in /srv is gone for good.

Great; Thanks @Joe, I'll take it from here!

Eevans updated the task description. May 3 2017, 2:50 PM
Eevans added a comment. May 3 2017, 2:52 PM

The host is now ready to have the instances re-bootstrapped, but let's postpone doing so until after the Services data-center switchover (scheduled for tomorrow).
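
(A hedged sketch of what re-bootstrapping one instance involves; the service unit and data-directory names follow the a/b/c instance naming used on this host but are assumptions, not the exact Puppet-managed paths:)

```bash
# Hedged sketch for one instance; unit and directory names are assumptions.

# Make sure the instance starts with an empty data directory so it
# bootstraps (streams its ranges) from the rest of the cluster.
ls /srv/cassandra-a/data

# Start the instance and watch the bootstrap streaming progress.
systemctl start cassandra-a
nodetool netstats
nodetool status   # the new instance shows UJ (joining) until it finishes
```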

Mentioned in SAL (#wikimedia-operations) [2017-05-04T16:55:40Z] <urandom> T163292: Starting bootstrap of restbase1018-a

Mentioned in SAL (#wikimedia-operations) [2017-05-05T01:26:02Z] <urandom> T163292: starting bootstrap of restbase1018-b

Mentioned in SAL (#wikimedia-operations) [2017-05-05T13:28:05Z] <urandom> T163292: bootstrapping Cassandra on restbase1018-c

Eevans closed this task as Resolved. May 5 2017, 9:19 PM