Reprovision legacy Cassandra nodes into new cluster
Closed, Resolved, Public

Description

Once the remaining use-case is migrated, and restbase has been fully provisioned, the 12 Cassandra nodes of the legacy cluster (6 each in eqiad and codfw) can be forcibly decommissioned (shut down and wiped), re-imaged, and bootstrapped into the new cluster.

  • restbase1011.eqiad.wmnet
  • restbase1016.eqiad.wmnet
  • restbase1013.eqiad.wmnet
  • restbase1017.eqiad.wmnet
  • restbase1015.eqiad.wmnet
  • restbase1018.eqiad.wmnet
  • restbase2008.codfw.wmnet
  • restbase2011.codfw.wmnet
  • restbase2007.codfw.wmnet
  • restbase2010.codfw.wmnet
  • restbase2009.codfw.wmnet
  • restbase2012.codfw.wmnet

In the course of T178177: Investigate aberrant Cassandra columnfamily read latency of restbase101{0,2,4}, it was determined that the performance of HP nodes configured for JBOD-like access via the smartarray controllers suffers (in comparison to the Dell hosts). We would like to test a configuration with the controller in HBA mode, to determine if this fares better. I propose the following sequence for re-imaging/provisioning:

  1. restbase1011.eqiad.wmnet (rack a, HP, to be configured in HBA mode)
  2. restbase1017.eqiad.wmnet (rack b, Dell)
  3. restbase1018.eqiad.wmnet (rack d, Dell)
  4. restbase2007.codfw.wmnet (rack b, HP, to be configured according to the outcome of #1)
  5. restbase2008.codfw.wmnet (rack c, HP, to be configured according to the outcome of #1)
  6. restbase2009.codfw.wmnet (rack d, HP, to be configured according to the outcome of #1)
  7. restbase1016.eqiad.wmnet (rack a, Dell)
  8. restbase1013.eqiad.wmnet (rack b, HP, to be configured according to the outcome of #1)
  9. restbase1015.eqiad.wmnet (rack d, HP, to be configured according to the outcome of #1)
  10. restbase2010.codfw.wmnet (rack b, Dell)
  11. restbase2011.codfw.wmnet (rack c, Dell)
  12. restbase2012.codfw.wmnet (rack d, Dell)
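
The ordering above can also be written down as data, which makes it easy to see which hosts are waiting on the outcome of the HBA test on restbase1011. This is only an illustrative sketch: the `Host` type and the helper names are hypothetical, while the host names, racks, and vendors are copied from the list.

```python
from typing import NamedTuple


class Host(NamedTuple):
    name: str
    rack: str
    vendor: str


# Proposed re-image order, copied from the task description.
SEQUENCE = [
    Host("restbase1011.eqiad.wmnet", "a", "HP"),
    Host("restbase1017.eqiad.wmnet", "b", "Dell"),
    Host("restbase1018.eqiad.wmnet", "d", "Dell"),
    Host("restbase2007.codfw.wmnet", "b", "HP"),
    Host("restbase2008.codfw.wmnet", "c", "HP"),
    Host("restbase2009.codfw.wmnet", "d", "HP"),
    Host("restbase1016.eqiad.wmnet", "a", "Dell"),
    Host("restbase1013.eqiad.wmnet", "b", "HP"),
    Host("restbase1015.eqiad.wmnet", "d", "HP"),
    Host("restbase2010.codfw.wmnet", "b", "Dell"),
    Host("restbase2011.codfw.wmnet", "c", "Dell"),
    Host("restbase2012.codfw.wmnet", "d", "Dell"),
]


def datacenter(host):
    """Return the datacenter label embedded in the FQDN (eqiad or codfw)."""
    return host.name.split(".")[1]


def awaiting_hba_outcome(sequence):
    """HP hosts after the first entry, whose config depends on the test host."""
    return [h.name for h in sequence[1:] if h.vendor == "HP"]


if __name__ == "__main__":
    for name in awaiting_hba_outcome(SEQUENCE):
        print(name)
```

Running it prints the five HP hosts whose controller configuration is decided by step 1.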

Details

Related Gerrit Patches:
[mediawiki/services/restbase/deploy@master] Scap: Bring back all target nodes
[operations/puppet@production] restbase: reprovision restbase200[789]
[operations/puppet@production] install_server: swap restbase2009 disk layout
[operations/puppet@production] install_server: reprovision restbase200[789]
[mediawiki/services/restbase/deploy@master] Scap: Add back `restbase201[012]`
[mediawiki/services/restbase/deploy@master] Scap: Add back `restbase101[678]`
[operations/puppet@production] restbase: reprovision restbase201[012]
[operations/puppet@production] restbase: reprovision restbase101[35]
[operations/puppet@production] restbase: reprovision restbase1016
[operations/puppet@production] install_server: switch remaining cassandra hosts to jbod
[operations/puppet@production] restbase: reprovision restbase1018
[operations/puppet@production] hieradata: enable remaining restbase1017 instances
[operations/puppet@production] restbase: don't consider sde for restbase1017
[operations/puppet@production] restbase: reprovision restbase1017
[mediawiki/services/restbase/deploy@master] Scap: Add back restbase1011 to targets
[operations/puppet@production] restbase: reimage restbase1011 as cassandra 3
[operations/puppet@production] decom legacy restbase/cassandra 2 cluster
[mediawiki/services/restbase/deploy@master] Scap: Remove legacy cluster nodes from deployment targets
[operations/puppet@production] RESTBase: Do not manage Cassandra 2 in the legacy cluster
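
The patch titles above use a bracket shorthand such as restbase200[789], and the description uses a brace form, restbase101{0,2,4}. For readers unfamiliar with it, here is a minimal sketch of how such shorthand expands into host names. The `expand` helper is hypothetical and handles only these two forms (no ranges such as [0-9]).

```python
import itertools
import re


def expand(pattern):
    """Expand 'restbase200[789]' or 'restbase101{0,2,4}' into host names."""
    # Split into literal text and [..] / {..} groups, keeping the groups.
    parts = re.split(r"(\[[^\]]+\]|\{[^}]+\})", pattern)
    choices = []
    for part in parts:
        if part.startswith("["):
            # [789] means one character from the set.
            choices.append(list(part[1:-1]))
        elif part.startswith("{"):
            # {0,2,4} means one of the comma-separated alternatives.
            choices.append(part[1:-1].split(","))
        else:
            choices.append([part])
    return ["".join(combo) for combo in itertools.product(*choices)]


if __name__ == "__main__":
    print(expand("restbase200[789]"))
    print(expand("restbase101{0,2,4}"))
```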

Event Timeline

Change 403122 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[operations/puppet@production] RESTBase: Do not manage Cassandra 2 in the legacy cluster

https://gerrit.wikimedia.org/r/403122

fgiunchedi added a subscriber: fgiunchedi. Jan 9 2018, 4:45 PM

Next steps before starting with the reimages:

  • Depool the machines above from LVS (cassandra 3 restbase nodes now run restbase)
  • Stop puppet, downtime in icinga
  • Stop restbase
  • Make sure no clients are accessing cassandra on those nodes
  • Stop cassandra
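
The order of these steps matters (restbase is stopped before Cassandra), so it can help to generate a per-host checklist from them. The sketch below is a dry-run planner only: the step strings paraphrase the list above and are not real commands, and `checklist` is a hypothetical helper. It assumes each step is completed on every host before the next step starts, which is how the SAL entries in this task read.

```python
# Step descriptions paraphrased from the comment above; not real commands.
PRE_REIMAGE_STEPS = [
    "depool from LVS",
    "stop puppet and set icinga downtime",
    "stop restbase",
    "confirm no clients are accessing cassandra",
    "stop cassandra",
]


def checklist(hosts):
    """Yield (step, host) pairs, finishing each step on all hosts first."""
    for step in PRE_REIMAGE_STEPS:
        for host in hosts:
            yield step, host


if __name__ == "__main__":
    hosts = ["restbase1011.eqiad.wmnet", "restbase1017.eqiad.wmnet"]
    for step, host in checklist(hosts):
        print(f"[ ] {host}: {step}")
```
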
Eevans added a comment. Jan 9 2018, 5:12 PM

LGTM

Mentioned in SAL (#wikimedia-operations) [2018-01-09T17:15:29Z] <godog> depool restbase cassandra 2 nodes - T184100

Change 403122 abandoned by Mobrovac:
RESTBase: Do not manage Cassandra 2 in the legacy cluster

Reason:
No need to complicate things, we'll just nuke these nodes.

https://gerrit.wikimedia.org/r/403122

Change 403208 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] decom legacy restbase/cassandra 2 cluster

https://gerrit.wikimedia.org/r/403208

Change 403211 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/services/restbase/deploy@master] Scap: Remove legacy cluster nodes from deployment targets

https://gerrit.wikimedia.org/r/403211

Change 403211 merged by Mobrovac:
[mediawiki/services/restbase/deploy@master] Scap: Remove legacy cluster nodes from deployment targets

https://gerrit.wikimedia.org/r/403211

Mentioned in SAL (#wikimedia-operations) [2018-01-10T09:27:26Z] <godog> stop restbase on cassandra 2 nodes - T184100

Mentioned in SAL (#wikimedia-operations) [2018-01-10T09:50:07Z] <godog> shut cassandra 2 on restbase legacy nodes - T184100

Change 403208 merged by Filippo Giunchedi:
[operations/puppet@production] decom legacy restbase/cassandra 2 cluster

https://gerrit.wikimedia.org/r/403208

Mentioned in SAL (#wikimedia-operations) [2018-01-10T10:29:40Z] <godog> reimage restbase1011 to test HBA mode - T184100

Change 403389 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] restbase: reimage restbase1011 as cassandra 3

https://gerrit.wikimedia.org/r/403389

Change 403389 merged by Filippo Giunchedi:
[operations/puppet@production] restbase: reimage restbase1011 as cassandra 3

https://gerrit.wikimedia.org/r/403389

Change 403402 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/services/restbase/deploy@master] Scap: Add back restbase1011 to targets

https://gerrit.wikimedia.org/r/403402

Change 403402 merged by Mobrovac:
[mediawiki/services/restbase/deploy@master] Scap: Add back restbase1011 to targets

https://gerrit.wikimedia.org/r/403402

Mentioned in SAL (#wikimedia-operations) [2018-01-10T14:51:48Z] <godog> start cassandra-a on restbase1011 - T184100

Mentioned in SAL (#wikimedia-operations) [2018-01-10T19:32:47Z] <urandom> bootstrapping restbase1011-b -- T184100

Mentioned in SAL (#wikimedia-operations) [2018-01-11T00:57:22Z] <urandom> bootstrapping restbase1011-c -- T184100

Change 404262 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] restbase: reprovision restbase1017

https://gerrit.wikimedia.org/r/404262

Change 404262 merged by Filippo Giunchedi:
[operations/puppet@production] restbase: reprovision restbase1017

https://gerrit.wikimedia.org/r/404262

Change 404277 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] restbase: don't consider sde for restbase1017

https://gerrit.wikimedia.org/r/404277

Change 404277 merged by Filippo Giunchedi:
[operations/puppet@production] restbase: don't consider sde for restbase1017

https://gerrit.wikimedia.org/r/404277

Change 404300 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: enable remaining restbase1017 instances

https://gerrit.wikimedia.org/r/404300

Change 404300 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: enable remaining restbase1017 instances

https://gerrit.wikimedia.org/r/404300

Change 404325 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] restbase: reprovision restbase1018

https://gerrit.wikimedia.org/r/404325

Change 404325 merged by Filippo Giunchedi:
[operations/puppet@production] restbase: reprovision restbase1018

https://gerrit.wikimedia.org/r/404325

Mentioned in SAL (#wikimedia-operations) [2018-01-16T10:13:06Z] <godog> start cassandra-a on restbase1018 - T184100

fgiunchedi updated the task description. Jan 16 2018, 10:13 AM
fgiunchedi moved this task from Backlog to Doing on the User-fgiunchedi board. Jan 16 2018, 11:17 AM

Change 404455 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] install_server: switch remaining cassandra hosts to jbod

https://gerrit.wikimedia.org/r/404455

Change 404455 merged by Filippo Giunchedi:
[operations/puppet@production] install_server: switch remaining cassandra hosts to jbod

https://gerrit.wikimedia.org/r/404455

Here are the nodes of rack a for the previous 2 days. 1007 is a Dell, and 1010 and 1011 are HPs (where the latter has been configured in HBA mode).

[Graph: CPU]

[Graph: Table read latency]

While still not as performant as the Dell (1007), the HBA-configured node (1011) seems to fare considerably better than the JBOD-of-RAID configuration (1010).

Eevans moved this task from Backlog to In-Progress on the User-Eevans board. Jan 16 2018, 5:32 PM

It does seem pretty obvious that with HBA mode on there's a performance improvement in terms of latency. I think it makes sense at this point to turn HBA on for the HPs left to be reimaged and proceed. This of course raises the question of what to do with the machines with HBA off, which we'll need to reimage at some point (gathering more data on the two configurations in the meantime).

fgiunchedi updated the task description. Jan 16 2018, 5:35 PM

+1

The difference is significant, even if not equivalent to the Dells. On the other hand, I'm not insensitive to the amount of additional work this creates for you (namely the reimage of existing HPs to ensure consistency).

For posterity's sake, there are 9 machines that would need to be re-imaged at some point:

  • restbase1010.eqiad.wmnet
  • restbase1012.eqiad.wmnet
  • restbase1014.eqiad.wmnet
  • restbase2003.codfw.wmnet
  • restbase2004.codfw.wmnet
  • restbase2001.codfw.wmnet
  • restbase2002.codfw.wmnet
  • restbase2005.codfw.wmnet
  • restbase2006.codfw.wmnet

Change 404638 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] restbase: reprovision restbase1016

https://gerrit.wikimedia.org/r/404638

Change 404638 merged by Filippo Giunchedi:
[operations/puppet@production] restbase: reprovision restbase1016

https://gerrit.wikimedia.org/r/404638

Mentioned in SAL (#wikimedia-operations) [2018-01-17T09:14:35Z] <godog> reimage restbase1016 - T184100

Change 404652 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] restbase: reprovision restbase201[012]

https://gerrit.wikimedia.org/r/404652

Change 404675 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] restbase: reprovision restbase101[35]

https://gerrit.wikimedia.org/r/404675

Change 404675 merged by Filippo Giunchedi:
[operations/puppet@production] restbase: reprovision restbase101[35]

https://gerrit.wikimedia.org/r/404675

fgiunchedi updated the task description. Jan 17 2018, 4:26 PM

Mentioned in SAL (#wikimedia-operations) [2018-01-17T22:54:42Z] <urandom> bootstrapping restbase1013-b - T184100

Change 404652 merged by Filippo Giunchedi:
[operations/puppet@production] restbase: reprovision restbase201[012]

https://gerrit.wikimedia.org/r/404652

Change 404950 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/services/restbase/deploy@master] Scap: Add back restbase101[678]

https://gerrit.wikimedia.org/r/404950

Change 404950 merged by Mobrovac:
[mediawiki/services/restbase/deploy@master] Scap: Add back restbase101[678]

https://gerrit.wikimedia.org/r/404950

Change 404952 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/services/restbase/deploy@master] Scap: Add back restbase201[012]

https://gerrit.wikimedia.org/r/404952

Change 404952 merged by Mobrovac:
[mediawiki/services/restbase/deploy@master] Scap: Add back restbase201[012]

https://gerrit.wikimedia.org/r/404952

Change 405008 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] install_server: reprovision restbase200[789]

https://gerrit.wikimedia.org/r/405008

Change 405008 merged by Filippo Giunchedi:
[operations/puppet@production] install_server: reprovision restbase200[789]

https://gerrit.wikimedia.org/r/405008

fgiunchedi updated the task description. Jan 18 2018, 4:35 PM

Change 405019 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] install_server: swap restbase2009 disk layout

https://gerrit.wikimedia.org/r/405019

Change 405019 merged by Filippo Giunchedi:
[operations/puppet@production] install_server: swap restbase2009 disk layout

https://gerrit.wikimedia.org/r/405019

Mentioned in SAL (#wikimedia-operations) [2018-01-18T23:11:52Z] <urandom> bootstrapping restbase1015-b -- T184100

Mentioned in SAL (#wikimedia-operations) [2018-01-19T09:08:50Z] <godog> start cassandra-a on restbase1015 - T184100

fgiunchedi updated the task description. Jan 19 2018, 3:12 PM

Mentioned in SAL (#wikimedia-operations) [2018-01-19T15:23:34Z] <godog> bootstrap cassandra-a on restbase2010 - T184100

Change 405312 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] restbase: reprovision restbase200[789]

https://gerrit.wikimedia.org/r/405312

Change 405312 merged by Filippo Giunchedi:
[operations/puppet@production] restbase: reprovision restbase200[789]

https://gerrit.wikimedia.org/r/405312

Mentioned in SAL (#wikimedia-operations) [2018-01-19T18:58:27Z] <urandom> bootstrapping restbase2010-b - T184100

Mentioned in SAL (#wikimedia-operations) [2018-01-19T21:28:26Z] <urandom> bootstrapping restbase2010-c - T184100

Eevans updated the task description. Jan 20 2018, 3:31 AM

Mentioned in SAL (#wikimedia-operations) [2018-01-20T03:32:52Z] <urandom> bootstrapping restbase2011-a - T184100

Mentioned in SAL (#wikimedia-operations) [2018-01-20T12:53:50Z] <urandom> bootstrapping restbase2011-b - T184100

Mentioned in SAL (#wikimedia-operations) [2018-01-20T16:57:20Z] <urandom> bootstrapping restbase2011-c - T184100

Eevans updated the task description. Jan 20 2018, 11:19 PM

Mentioned in SAL (#wikimedia-operations) [2018-01-20T23:20:34Z] <urandom> bootstrapping restbase2012-a - T184100

Mentioned in SAL (#wikimedia-operations) [2018-01-21T02:35:53Z] <urandom> bootstrapping restbase2012-b - T184100

Eevans updated the task description. Jan 24 2018, 12:06 AM

Mentioned in SAL (#wikimedia-operations) [2018-01-24T00:08:26Z] <urandom> bootstrapping restbase2007-a - T184100

Mentioned in SAL (#wikimedia-operations) [2018-01-24T06:26:20Z] <urandom> bootstrapping restbase2007-b - T184100

Mentioned in SAL (#wikimedia-operations) [2018-01-24T18:32:47Z] <urandom> bootstrapping restbase2007-c - T184100

Change 406113 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/services/restbase/deploy@master] Scap: Bring back all target nodes

https://gerrit.wikimedia.org/r/406113

Change 406113 merged by Mobrovac:
[mediawiki/services/restbase/deploy@master] Scap: Bring back all target nodes

https://gerrit.wikimedia.org/r/406113

Eevans updated the task description. Jan 25 2018, 12:20 AM

Mentioned in SAL (#wikimedia-operations) [2018-01-25T00:24:54Z] <urandom> bootstrapping restbase2008-a - T184100

Mentioned in SAL (#wikimedia-operations) [2018-01-25T07:44:23Z] <urandom> bootstrapping restbase2008-b - T184100

Mentioned in SAL (#wikimedia-operations) [2018-01-25T14:52:42Z] <urandom> bootstrapping restbase2008-c - T184100

Eevans updated the task description. Jan 25 2018, 10:04 PM

Mentioned in SAL (#wikimedia-operations) [2018-01-25T22:06:29Z] <urandom> bootstrapping restbase2009-a - T184100

Mentioned in SAL (#wikimedia-operations) [2018-01-26T02:32:42Z] <urandom> bootstrapping restbase2009-b - T184100

Mentioned in SAL (#wikimedia-operations) [2018-01-26T04:31:49Z] <urandom> bootstrapping restbase2009-c - T184100

Eevans updated the task description. Jan 26 2018, 2:47 PM

All of the nodes are now bootstrapped into the cluster. All that remains is to execute cleanups.

Eevans closed this task as Resolved. Jan 28 2018, 3:50 PM
Eevans edited projects, added Services (done); removed Services (next).

Done!