
migrate elasticsearch cirrus cluster to RAID0
Closed, Resolved (Public)

Description

The elasticsearch cirrus cluster is getting low on disk space. We are currently using RAID1, except on older servers with smaller disks, where we are already using RAID0.

Given that we have enough redundancy at the cluster level (we have at least 3 copies of any shard) and that recovery is automatic (if we lose one server, the lost shards will be recreated on a different node), it seems like a good idea to move the full cluster to RAID0. This acknowledges that losing one server is a non-event.
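As a rough illustration of the target layout, here is a minimal sketch using Linux software RAID (mdadm); the device names, partition layout, filesystem, and mount point are assumptions, and in practice the array is created by the installer's partitioning recipe during reimage rather than by hand:

```
# Hypothetical example (not the actual partman recipe): stripe two disks
# into a RAID0 array and mount it where Elasticsearch keeps its data.
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda1 /dev/sdb1
mkfs.ext4 /dev/md0
mount /dev/md0 /srv/elasticsearch
```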

Note: if we reimage those servers, it might make sense to migrate to Stretch at the same time (T193649).

Event Timeline

herron triaged this task as Medium priority. Jun 28 2018, 4:59 PM
herron subscribed.

What are your thoughts about RAID10, RAID5(0), or even exposing each individual disk to ES as an option for expansion? I am leery of RAID0, since the odds of failure are greater than for a single drive, and eventually an array will fail and degrade the cluster. Understood that cluster recovery should be automatic, but at the same time this could lead to increased urgency to obtain a replacement, rebuild the filesystem, and rebalance.
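(For a rough sense of that risk, assuming independent drive failures with annual failure probability p, an n-disk RAID0 array loses data with probability 1 - (1 - p)^n ≈ np, so a two-disk stripe fails roughly twice as often as a single drive, whereas RAID1 only loses data if both disks fail before the first is replaced.)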

For the elasticsearch cluster, we could probably lose 3 or 4 machines before there was any thought of potential urgency. Elasticsearch can also handle being given the disks as a list of data directories, so if we prefer, we can go that way as well.
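A minimal sketch of that JBOD-style alternative, assuming hypothetical mount points (the real paths and config would be managed by puppet); Elasticsearch accepts multiple data paths and places shards across them:

```
# Hypothetical sketch: point Elasticsearch at the individual disks instead
# of a single RAID device. The mount points below are assumptions.
cat >> /etc/elasticsearch/elasticsearch.yml <<'EOF'
# One entry per physical disk; Elasticsearch allocates each shard's files
# to one of the listed paths.
path.data:
  - /srv/elasticsearch/disk0
  - /srv/elasticsearch/disk1
EOF
```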

As an example, during cluster restarts, my standard procedure is to restart 3 nodes at a time. So we have strong evidence that losing 3 nodes is a non-issue.

About RAID10 / RAID5(0): in both cases it would require adding more disks to those servers, which is something we are trying to avoid. The JBOD approach could work and would increase reliability a bit, at the cost of a slightly more complicated configuration. I prefer to have the storage completely abstracted from elasticsearch, but I'm open to other opinions.

Change 450062 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] elasticsearch: migrate codfw cluster to Stretch and RAID0

https://gerrit.wikimedia.org/r/450062

Change 450064 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] elasticsearch: migrate eqiad cluster to Stretch and RAID0

https://gerrit.wikimedia.org/r/450064

Change 450062 merged by Gehel:
[operations/puppet@production] elasticsearch: migrate codfw cluster to Stretch and RAID0

https://gerrit.wikimedia.org/r/450062

Mentioned in SAL (#wikimedia-operations) [2018-08-07T08:29:05Z] <gehel> start reimaging of elasticsearch / cirrus / codfw cluster (RAID0 / Stretch) - T193649 / T198391

Change 450064 merged by Gehel:
[operations/puppet@production] elasticsearch: migrate eqiad cluster to Stretch and RAID0

https://gerrit.wikimedia.org/r/450064

Mentioned in SAL (#wikimedia-operations) [2018-08-13T12:56:27Z] <gehel> reimaging of elasticsearch / cirrus / codfw cluster (RAID0 / Stretch) completed - T193649 / T198391

Mentioned in SAL (#wikimedia-operations) [2018-08-13T12:56:45Z] <gehel> start reimaging of elasticsearch / cirrus / eqiad cluster (RAID0 / Stretch) - T193649 / T198391

Change 453094 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] elasticsearch: storage device name changed with new partitioning scheme

https://gerrit.wikimedia.org/r/453094

Mentioned in SAL (#wikimedia-operations) [2018-08-16T09:46:34Z] <gehel> all elasticsearch nodes reimaged (except elastic1029, waiting on memory issue) - T198391 / T193649 / T201991

Mentioned in SAL (#wikimedia-operations) [2018-08-16T18:45:05Z] <gehel> reimage of elasticsearch eqiad completed - T198391 / T193649

Change 453094 merged by Gehel:
[operations/puppet@production] elasticsearch: storage device name changed with new partitioning scheme

https://gerrit.wikimedia.org/r/453094