
CloudVPS: rework codfw deployments
Closed, Resolved (Public)

Description

It's time to rework codfw Cloud VPS (openstack) deployments. View of our current deployments: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Deployments

We will be keeping codfw1dev (AKA labtestn) as a mirror setup of eqiad1.

List of affected servers and their plans:

Hosts with little changes (or no changes at all):

After these changes, this is what the deployments in codfw will look like:

codfw1dev
cloudcontrol2001-dev.codfw.wmnet (was spare)
cloudcontrol2003-dev.codfw.wmnet (was labtestcontrol2003.wikimedia.org)

cloudnet2002-dev.codfw.wmnet
cloudnet2003-dev.codfw.wmnet (was labtestnet2003.codfw.wmnet)

cloudservices2002-dev.wikimedia.org (was labtestservices2002.wikimedia.org)

cloudweb2001-dev.wikimedia.org (was labtestnet2002.codfw.wmnet)
clouddb2001-dev.codfw.wmnet (was labtestmetal2001.codfw.wmnet)

cloudvirt2001-dev.codfw.wmnet (was spare)
cloudvirt2002-dev.codfw.wmnet (was spare)
cloudvirt2003-dev.codfw.wmnet (was spare)

Event Timeline

aborrero added a subscriber: bd808.

We may end up with 3 cloudnet servers:

Before I move on with T220203, I would like to evaluate another option, which is to simply return labtestnet2002 to the spare pool.
I'm trying to avoid a future in which we require another server in codfw and we decide to reuse one of those cloudnets.

So there are 2 options:

For now, I will leave the server with role::spare in stretch, which is a common intermediate state for both options.

@bd808 and @Andrew, please comment (assigning the task to Andrew).

I don't think we need three network nodes, until/unless we start testing for a new region. So yeah, marking labtestnet2002 as a spare seems like the best option for now.

Per conversation with @Andrew on IRC:

  • labtestmetal2001 -> clouddb2001-dev
  • labtestnet2002 -> cloudweb2001-dev
  • labtestweb2001 -> decom
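For anyone scripting against the old names, the three decisions above can be written down as a small lookup table. This is only a sketch: the `RENAMES` dict and the `plan_for` helper are hypothetical names made up for illustration, and only the three hosts listed in this comment are covered.

```python
# Decisions from the IRC conversation above.
# None marks a host that is decommissioned rather than renamed.
RENAMES = {
    "labtestmetal2001": "clouddb2001-dev",
    "labtestnet2002": "cloudweb2001-dev",
    "labtestweb2001": None,  # decom
}


def plan_for(old_host):
    """Return a one-line description of what happens to old_host."""
    if old_host not in RENAMES:
        return f"{old_host}: no change recorded here"
    new_host = RENAMES[old_host]
    if new_host is None:
        return f"{old_host}: decommission"
    return f"{old_host}: rename to {new_host}"


if __name__ == "__main__":
    for host in RENAMES:
        print(plan_for(host))
```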

FYI, deployed a mediawiki config change just now and got this:

11:39:41 Started sync-apaches
11:44:06 ['/usr/bin/scap', 'pull', '--no-update-l10n', '--include', 'wmf-config', '--include', 'wmf-config/InitialiseSettings.php', 'mw1268.eqiad.wmnet', 'mw1314.eqiad.wmnet', 'mw2255.codfw.wmnet', 'mw2290.codfw.wmnet', 'mw2216.codfw.wmnet', 'mw1251.eqiad.wmnet', 'mw2188.codfw.wmnet', 'mw1320.eqiad.wmnet', 'mw1285.eqiad.wmnet'] on cloudweb2001-dev.wikimedia.org returned [255]: ssh: connect to host cloudweb2001-dev.wikimedia.org port 22: Connection timed out

sync-apaches: 100% (ok: 262; fail: 1; left: 0)
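A failure like this is easy to miss because the stage still reports 100%. Below is a minimal sketch for pulling the counters out of the summary line so a wrapper could flag a non-zero `fail` count. The line format is inferred only from the single line pasted above, not from scap's documentation, and `parse_summary` is a hypothetical helper name.

```python
import re

# Pattern inferred from the one summary line in this comment (an assumption,
# not a documented scap output format).
SUMMARY_RE = re.compile(
    r"^(?P<stage>[\w-]+): (?P<percent>\d+)% "
    r"\(ok: (?P<ok>\d+); fail: (?P<fail>\d+); left: (?P<left>\d+)\)$"
)


def parse_summary(line):
    """Parse a summary line into a dict, or return None if it does not match."""
    match = SUMMARY_RE.match(line.strip())
    if match is None:
        return None
    result = {"stage": match.group("stage")}
    for key in ("percent", "ok", "fail", "left"):
        result[key] = int(match.group(key))
    return result


if __name__ == "__main__":
    summary = parse_summary("sync-apaches: 100% (ok: 262; fail: 1; left: 0)")
    if summary and summary["fail"] > 0:
        print(f"{summary['stage']}: {summary['fail']} host(s) failed")
```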

This should be T220931: deploy1001 cannot reach cloudweb2001-dev.wikimedia.org when running scap

aborrero updated the task description.

Closing this task now. The only important subtask is T222061: labtestpuppetmaster2001.wikimedia.org: use proper codfw1dev role, but that one doesn't really affect the rework of the codfw deployments, which is now complete.