
CloudVPS: rework codfw deployments
Closed, Resolved, Public

Description

It's time to rework codfw Cloud VPS (openstack) deployments. View of our current deployments: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Deployments

We will be keeping codfw1dev (AKA labtestn) as a mirror setup of eqiad1.

List of affected servers and their plans:

Hosts with few changes (or no changes at all):

After these changes, this is what the deployments in codfw will look like:

codfw1dev
cloudcontrol2001-dev.codfw.wmnet (was spare)
cloudcontrol2003-dev.codfw.wmnet (was labtestcontrol2003.wikimedia.org)

cloudnet2002-dev.codfw.wmnet
cloudnet2003-dev.codfw.wmnet (was labtestnet2003.codfw.wmnet)

cloudservices2002-dev.wikimedia.org (was labtestservices2002.wikimedia.org)

cloudweb2001-dev.wikimedia.org (was labtestnet2002.codfw.wmnet)
clouddb2001-dev.codfw.wmnet (was labtestmetal2001.codfw.wmnet)

cloudvirt2001-dev.codfw.wmnet (was spare)
cloudvirt2002-dev.codfw.wmnet (was spare)
cloudvirt2003-dev.codfw.wmnet (was spare)

Related Objects

Status     Assigned   Task
Resolved   aborrero
Resolved   aborrero
Resolved   Papaul
Resolved   Papaul
Resolved   Andrew
Resolved   Andrew
Resolved   Andrew
Resolved   Papaul
Resolved   Papaul
Resolved   Papaul
Resolved   aborrero
Resolved   aborrero
Declined   None
Open       None
Declined   aborrero
Resolved   aborrero
Resolved   Andrew
Open       None
Resolved   aborrero
Resolved   aborrero
Resolved   Papaul
Resolved   Papaul
Resolved   aborrero
Resolved   aborrero
Declined   aborrero
Duplicate  aborrero
Resolved   aborrero
Resolved   aborrero
Resolved   aborrero
Open       aborrero
Open       None
Open       None
Open       None

Event Timeline

aborrero updated the task description. Mar 18 2019, 1:35 PM
aborrero updated the task description. Mar 18 2019, 1:51 PM
aborrero updated the task description. Mar 18 2019, 2:01 PM
aborrero changed the status of subtask T218024: decommmision: labtestweb2001.wikimedia.org from Open to Stalled. Mar 21 2019, 4:47 PM
aborrero updated the task description.
aborrero updated the task description. Mar 21 2019, 5:10 PM
aborrero updated the task description. Mar 25 2019, 12:02 PM
aborrero updated the task description. Mar 29 2019, 12:41 PM
aborrero updated the task description. Apr 1 2019, 11:03 AM
aborrero updated the task description. Apr 4 2019, 12:34 PM
aborrero updated the task description. Apr 4 2019, 12:55 PM
aborrero updated the task description. Apr 4 2019, 1:02 PM
aborrero updated the task description. Apr 4 2019, 4:57 PM
aborrero updated the task description. Apr 5 2019, 1:24 PM
aborrero reassigned this task from aborrero to Andrew. Edited Apr 5 2019, 1:27 PM
aborrero added a subscriber: bd808.

We may end up with 3 cloudnet servers:

Before I move on with T220203, I would like to evaluate another option, which is to simply return labtestnet2002 to the spare pool.
I'm trying to avoid a future in which we require another server in codfw and we decide to reuse one of those cloudnets.

So there are two options:

For now, I will leave the server with role::spare on stretch, which is a common intermediate state for both options.

@bd808 and @Andrew, please comment (assigning the task to Andrew).

Andrew added a comment. Apr 5 2019, 3:24 PM

I don't think we need three network nodes, until/unless we start testing for a new region. So yeah, marking labtestnet2002 as a spare seems like the best option for now.

Per conversation with @Andrew on IRC:

  • labtestmetal2001 -> clouddb2001-dev
  • labtestnet2002 -> cloudweb2001-dev
  • labtestweb2001 -> decom
aborrero updated the task description. Apr 8 2019, 4:48 PM
aborrero updated the task description. Apr 8 2019, 4:54 PM
aborrero updated the task description. Apr 9 2019, 10:07 AM
aborrero updated the task description. Apr 9 2019, 10:10 AM
aborrero updated the task description. Apr 9 2019, 12:52 PM
aborrero updated the task description.
aborrero updated the task description. Apr 12 2019, 11:01 AM
Gilles added a subscriber: Gilles. Apr 12 2019, 11:46 AM

FYI, deployed a mediawiki config change just now and got this:

11:39:41 Started sync-apaches
11:44:06 ['/usr/bin/scap', 'pull', '--no-update-l10n', '--include', 'wmf-config', '--include', 'wmf-config/InitialiseSettings.php', 'mw1268.eqiad.wmnet', 'mw1314.eqiad.wmnet', 'mw2255.codfw.wmnet', 'mw2290.codfw.wmnet', 'mw2216.codfw.wmnet', 'mw1251.eqiad.wmnet', 'mw2188.codfw.wmnet', 'mw1320.eqiad.wmnet', 'mw1285.eqiad.wmnet'] on cloudweb2001-dev.wikimedia.org returned [255]: ssh: connect to host cloudweb2001-dev.wikimedia.org port 22: Connection timed out

sync-apaches: 100% (ok: 262; fail: 1; left: 0)
aborrero updated the task description. Apr 16 2019, 12:23 PM

This should be T220931: deploy1001 cannot reach cloudweb2001-dev.wikimedia.org when running scap
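
For anyone triaging a similar report, here is a minimal Python sketch of two checks suggested by the scap output quoted in this task: reading the counters from the sync-apaches summary line, and testing whether a host accepts TCP connections on port 22. The helper names `parse_sync_summary` and `can_connect` are hypothetical and are not part of scap; the summary format is assumed from the single line quoted above.

```python
import re
import socket


def parse_sync_summary(line: str) -> dict:
    """Extract the counters from a summary line such as
    'sync-apaches: 100% (ok: 262; fail: 1; left: 0)'.

    Assumption: the counters always sit inside one pair of parentheses,
    as in the line quoted in this task.
    """
    match = re.search(r"\(([^)]*)\)", line)
    if match is None:
        raise ValueError(f"no counters found in: {line!r}")
    pairs = re.findall(r"(\w+):\s*(\d+)", match.group(1))
    return {name: int(value) for name, value in pairs}


def can_connect(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and DNS failures alike.
        return False


summary = parse_sync_summary("sync-apaches: 100% (ok: 262; fail: 1; left: 0)")
print(summary["fail"])  # 1

# Example (not run here, it needs network access from the deploy host):
# can_connect("cloudweb2001-dev.wikimedia.org")
```

Note that `can_connect` only tells you whether the TCP handshake completes from wherever it is run. It does not say whether the cause is a firewall rule, routing, or the host being down.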

aborrero updated the task description. Apr 16 2019, 3:19 PM
aborrero updated the task description. Apr 22 2019, 12:25 PM
aborrero updated the task description. Apr 25 2019, 11:17 AM
aborrero updated the task description. Apr 29 2019, 9:39 AM
aborrero updated the task description. Apr 29 2019, 10:15 AM
aborrero updated the task description.
aborrero closed subtask Unknown Object (Task) as Resolved. May 20 2019, 10:32 AM
aborrero closed this task as Resolved. May 20 2019, 10:34 AM

Closing this task now. The only important remaining subtask is T222061: labtestpuppetmaster2001.wikimedia.org: use proper codfw1dev role, and it doesn't really affect the rework of the codfw deployments, which is now complete.