
CloudVPS: rework codfw deployments
Closed, Resolved, Public

Description

It's time to rework the codfw Cloud VPS (OpenStack) deployments. Our current deployments are listed at: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Deployments

We will be keeping codfw1dev (AKA labtestn) as a mirror setup of eqiad1.

List of affected servers and their plans:

Hosts with few changes (or no changes at all):

After these changes, this is what the deployments in codfw will look like:

codfw1dev
cloudcontrol2001-dev.codfw.wmnet (was spare)
cloudcontrol2003-dev.codfw.wmnet (was labtestcontrol2003.wikimedia.org)

cloudnet2002-dev.codfw.wmnet
cloudnet2003-dev.codfw.wmnet (was labtestnet2003.codfw.wmnet)

cloudservices2002-dev.wikimedia.org (was labtestservices2002.wikimedia.org)

cloudweb2001-dev.wikimedia.org (was labtestnet2002.codfw.wmnet)
clouddb2001-dev.codfw.wmnet (was labtestmetal2001.codfw.wmnet)

cloudvirt2001-dev.codfw.wmnet (was spare)
cloudvirt2002-dev.codfw.wmnet (was spare)
cloudvirt2003-dev.codfw.wmnet (was spare)
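
When updating references to the old hostnames, the "(was ...)" notes in the list above can be turned into an old-to-new lookup. The following is only a sketch: the function name `parse_renames` and the parsing approach are assumptions made for illustration, not part of any WMCS tooling, and only a subset of the list is included as sample data.

```python
import re

# A subset of the codfw1dev list above, in "new name (was old name)" form.
CODFW1DEV = """
cloudcontrol2001-dev.codfw.wmnet (was spare)
cloudcontrol2003-dev.codfw.wmnet (was labtestcontrol2003.wikimedia.org)
cloudservices2002-dev.wikimedia.org (was labtestservices2002.wikimedia.org)
cloudweb2001-dev.wikimedia.org (was labtestnet2002.codfw.wmnet)
cloudvirt2001-dev.codfw.wmnet (was spare)
"""

# One list entry: the new FQDN followed by "(was <previous name>)".
ENTRY = re.compile(r"^(?P<new>\S+)\s+\(was (?P<old>[^)]+)\)$")


def parse_renames(text):
    """Return {old_fqdn: new_fqdn}, skipping hosts taken from the spare pool."""
    renames = {}
    for line in text.splitlines():
        match = ENTRY.match(line.strip())
        if match and match["old"] != "spare":
            renames[match["old"]] = match["new"]
    return renames


if __name__ == "__main__":
    for old, new in parse_renames(CODFW1DEV).items():
        print(f"{old} -> {new}")
```

Hosts marked "(was spare)" are left out on purpose, since they have no previous name to look up.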

Related Objects

Status      Assigned    Task
Resolved    aborrero
Resolved    aborrero
Resolved    Papaul
Resolved    Papaul
Resolved    Andrew
Resolved    Andrew
Resolved    Andrew
Resolved    Papaul
Resolved    Papaul
Resolved    Papaul
Resolved    aborrero
Resolved    aborrero
Declined    None
Open        None
Declined    aborrero
Resolved    aborrero
Resolved    Andrew
Resolved    aborrero
Resolved    aborrero
Resolved    Papaul
Resolved    Papaul
Resolved    aborrero
Resolved    aborrero
Declined    aborrero
Duplicate   aborrero
Resolved    aborrero
Resolved    aborrero
Resolved    aborrero
Open        aborrero

Event Timeline


I answered @arturo that this is technically possible and we intend to support it, but we are not ready at the moment. Two main blockers:

  • Lack of hardware (to be purchased in Q4), mainly proxies
  • Lack of configuration management (multi-instance proxies, multi-instance misc hosts, active-passive management)

So this is possible in the future, but it needs work; in particular, no work has been done yet to support this use case. Misc codfw was being tracked at: T156937

Regarding codfw2dev and the proposal in this task, we had a conversation in our WMCS team meeting today:

  • we will focus on decommissioning the old hardware
  • once that is done, we will work on getting codfw1dev in shape as close as possible to eqiad1 (including T218029: CloudVPS: evaluate convenience of having codfw openstack DBs in proper DB hosts)
  • we will also evaluate how much hardware we need to build a proper and complete codfw2dev deployment, so we can properly include this in the budget for next fiscal year
  • we will delay the building of codfw2dev until we have proper/enough hardware.
aborrero updated the task description. Mar 12 2019, 6:37 PM
aborrero updated the task description. Mar 18 2019, 1:35 PM
aborrero updated the task description. Mar 18 2019, 1:51 PM
aborrero updated the task description. Mar 18 2019, 2:01 PM
aborrero changed the status of subtask T218024: decommmision: labtestweb2001.wikimedia.org from Open to Stalled. Mar 21 2019, 4:47 PM
aborrero updated the task description.
aborrero updated the task description. Mar 21 2019, 5:10 PM
aborrero updated the task description. Mar 25 2019, 12:02 PM
aborrero updated the task description. Mar 29 2019, 12:41 PM
aborrero updated the task description. Apr 1 2019, 11:03 AM
aborrero updated the task description. Apr 4 2019, 12:34 PM
aborrero updated the task description. Apr 4 2019, 12:55 PM
aborrero updated the task description. Apr 4 2019, 1:02 PM
aborrero updated the task description. Apr 4 2019, 4:57 PM
aborrero updated the task description. Apr 5 2019, 1:24 PM
aborrero reassigned this task from aborrero to Andrew. Apr 5 2019, 1:27 PM
aborrero added a subscriber: bd808.

We may end up with 3 cloudnet servers:

Before I move on with T220203, I would like to evaluate another option, which is to simply return labtestnet2002 to the spare pool.
I'm trying to avoid a future in which we require another server in codfw and we decide to reuse one of those cloudnets.

So there are 2 options:

For now, I will leave the server with role::spare in stretch, which is a common intermediate state for both options.

@bd808 and @Andrew, please comment (assigning the task to Andrew).

Andrew added a comment. Apr 5 2019, 3:24 PM

I don't think we need three network nodes, until/unless we start testing for a new region. So yeah, marking labtestnet2002 as a spare seems like the best option for now.

Per conversation with @Andrew on IRC:

  • labtestmetal2001 -> clouddb2001-dev
  • labtestnet2002 -> cloudweb2001-dev
  • labtestweb2001 -> decom
aborrero updated the task description. Apr 8 2019, 4:48 PM
aborrero updated the task description. Apr 8 2019, 4:54 PM
aborrero updated the task description. Apr 9 2019, 10:07 AM
aborrero updated the task description. Apr 9 2019, 10:10 AM
aborrero updated the task description. Apr 9 2019, 12:52 PM
aborrero updated the task description.
aborrero updated the task description. Apr 12 2019, 11:01 AM
Gilles added a subscriber: Gilles. Apr 12 2019, 11:46 AM

FYI, deployed a mediawiki config change just now and got this:

11:39:41 Started sync-apaches
11:44:06 ['/usr/bin/scap', 'pull', '--no-update-l10n', '--include', 'wmf-config', '--include', 'wmf-config/InitialiseSettings.php', 'mw1268.eqiad.wmnet', 'mw1314.eqiad.wmnet', 'mw2255.codfw.wmnet', 'mw2290.codfw.wmnet', 'mw2216.codfw.wmnet', 'mw1251.eqiad.wmnet', 'mw2188.codfw.wmnet', 'mw1320.eqiad.wmnet', 'mw1285.eqiad.wmnet'] on cloudweb2001-dev.wikimedia.org returned [255]: ssh: connect to host cloudweb2001-dev.wikimedia.org port 22: Connection timed out

sync-apaches: 100% (ok: 262; fail: 1; left: 0)
aborrero updated the task description. Apr 16 2019, 12:23 PM

This should be T220931: deploy1001 cannot reach cloudweb2001-dev.wikimedia.org when running scap
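
A timeout like the one in the scap output can be confirmed from the deployment host with a plain TCP check against port 22. The snippet below is a minimal sketch for illustration only: the helper name `ssh_port_reachable` and the 5-second default timeout are assumptions, and it is not part of scap.

```python
import socket


def ssh_port_reachable(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Timeouts, refused connections and DNS failures all end up here.
        return False
```

Calling `ssh_port_reachable("cloudweb2001-dev.wikimedia.org")` from the deployment host would be expected to return False while the connectivity problem persists. It only tests that the port accepts connections, not that SSH authentication works.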

aborrero updated the task description. Apr 16 2019, 3:19 PM
aborrero updated the task description. Apr 22 2019, 12:25 PM
aborrero updated the task description. Thu, Apr 25, 11:17 AM
aborrero updated the task description. Mon, Apr 29, 9:39 AM
aborrero updated the task description. Mon, Apr 29, 10:15 AM
aborrero updated the task description.
aborrero closed subtask Unknown Object (Task) as Resolved. Mon, May 20, 10:32 AM
aborrero closed this task as Resolved.

Closing this task now. The only important remaining subtask is T222061: labtestpuppetmaster2001.wikimedia.org: use proper codfw1dev role, but that doesn't really affect the rework of the codfw deployments, which is now complete.