Move cloudcephmon1001, 1002 and 1003 to the cloud-hosts1-b-eqiad vlan
Closed, Resolved (Public)

Description

Arzhel and the wmcs team decided to make this move a few days ago; the full plan is documented here:

https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Ceph#Network

Event Timeline

Change 616150 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/dns@master] Add ips in cloud-hosts1-b-eqiad for cloudcephmon nodes

https://gerrit.wikimedia.org/r/616150

Change 616151 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/dns@master] Remove public IP addresses for cloudcephmons nodes

https://gerrit.wikimedia.org/r/616151

Change 616152 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/dns@master] Remove eth1 addresses for cloudcephmon hosts

https://gerrit.wikimedia.org/r/616152

Change 616153 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Move cloudcephmon hosts from .wikimedia.org to .eqiad.wmnet

https://gerrit.wikimedia.org/r/616153

Change 616150 merged by Andrew Bogott:
[operations/dns@master] Add ips in cloud-hosts1-b-eqiad for cloudcephmon nodes

https://gerrit.wikimedia.org/r/616150

Change 616156 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Move cloudcephmon1002 from .wikimedia.org to .eqiad.wmnet

https://gerrit.wikimedia.org/r/616156

Change 616157 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Move cloudcephmon1001 from .wikimedia.org to .eqiad.wmnet

https://gerrit.wikimedia.org/r/616157

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

cloudcephmon1003.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202007242007_andrew_19879_cloudcephmon1003_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudcephmon1003.eqiad.wmnet']

Of which those FAILED:

['cloudcephmon1003.eqiad.wmnet']

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

cloudcephmon1003.wikimedia.org

The log can be found in /var/log/wmf-auto-reimage/202007242008_andrew_20298_cloudcephmon1003_wikimedia_org.log.

Change 616153 merged by Andrew Bogott:
[operations/puppet@production] Move cloudcephmon1003 from .wikimedia.org to .eqiad.wmnet

https://gerrit.wikimedia.org/r/616153

Completed auto-reimage of hosts:

['cloudcephmon1003.wikimedia.org']

Of which those FAILED:

['cloudcephmon1003.wikimedia.org']

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

cloudcephmon1003.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202007242024_andrew_2972_cloudcephmon1003_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudcephmon1003.eqiad.wmnet']

Of which those FAILED:

['cloudcephmon1003.eqiad.wmnet']

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

cloudcephmon1003.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202007242024_andrew_3206_cloudcephmon1003_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudcephmon1003.eqiad.wmnet']

Of which those FAILED:

['cloudcephmon1003.eqiad.wmnet']

Change 616168 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Add site.pp entries for cloudcephmon1001-1003.eqiad.wmnet

https://gerrit.wikimedia.org/r/616168

Change 616168 merged by Andrew Bogott:
[operations/puppet@production] Add site.pp entries for cloudcephmon1001-1003.eqiad.wmnet

https://gerrit.wikimedia.org/r/616168

Presumably this move requires a switch config change. I can only move one host at a time to avoid split-brain, so I will need to coordinate with @ayounsi or another network engineer.
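The one-host-at-a-time constraint follows from how Ceph monitors form a quorum: a strict majority of them must stay reachable. Below is a minimal sketch of that arithmetic, assuming the three monitors named in this task. The function names are hypothetical and nothing here queries a real cluster.

```python
def has_quorum(total_monitors: int, monitors_up: int) -> bool:
    """Return True if the monitors still up form a strict majority."""
    return monitors_up > total_monitors // 2


def safe_to_take_down(total_monitors: int, monitors_up: int, taking_down: int) -> bool:
    """Return True if quorum survives after removing `taking_down` monitors."""
    return has_quorum(total_monitors, monitors_up - taking_down)


if __name__ == "__main__":
    # Three monitors, all up: moving one at a time keeps 2 of 3 in quorum.
    print(safe_to_take_down(3, 3, 1))
    # Moving two at once would leave 1 of 3, which is not a majority.
    print(safe_to_take_down(3, 3, 2))
```

With three monitors, only one can be offline at any moment, so each host would need to rejoin the quorum before the next one is moved.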

Change 616172 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Move cloudcephmon1003 from .wikimedia.org to .eqiad.wmnet

https://gerrit.wikimedia.org/r/616172

Change 616172 merged by Andrew Bogott:
[operations/puppet@production] Move cloudcephmon1003 from .wikimedia.org to .eqiad.wmnet

https://gerrit.wikimedia.org/r/616172

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

cloudcephmon1003.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202007271247_andrew_27892_cloudcephmon1003_eqiad_wmnet.log.

Change 616519 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Ceph: temporarily hack the public network to include both old and new

https://gerrit.wikimedia.org/r/616519

Change 616519 merged by Andrew Bogott:
[operations/puppet@production] Ceph: temporarily hack the public network to include both old and new

https://gerrit.wikimedia.org/r/616519

Completed auto-reimage of hosts:

['cloudcephmon1003.eqiad.wmnet']

and were ALL successful.

Change 616520 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Ceph: update ip for cloudcephmon1003

https://gerrit.wikimedia.org/r/616520

Change 616520 merged by Andrew Bogott:
[operations/puppet@production] Ceph: update ip for cloudcephmon1003

https://gerrit.wikimedia.org/r/616520

Change 616156 merged by Andrew Bogott:
[operations/puppet@production] Move cloudcephmon1002 from .wikimedia.org to .eqiad.wmnet

https://gerrit.wikimedia.org/r/616156

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

cloudcephmon1002.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202007271405_andrew_6394_cloudcephmon1002_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudcephmon1002.eqiad.wmnet']

and were ALL successful.

Change 616157 merged by Andrew Bogott:
[operations/puppet@production] Move cloudcephmon1001 from .wikimedia.org to .eqiad.wmnet

https://gerrit.wikimedia.org/r/616157

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

cloudcephmon1001.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202007271441_andrew_6728_cloudcephmon1001_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudcephmon1001.eqiad.wmnet']

and were ALL successful.

Change 616151 merged by Andrew Bogott:
[operations/dns@master] Remove public IP addresses for cloudcephmons nodes

https://gerrit.wikimedia.org/r/616151

Change 616152 merged by Andrew Bogott:
[operations/dns@master] Remove eth1 addresses for cloudcephmon hosts

https://gerrit.wikimedia.org/r/616152

All three are moved; I created a new ceph-backed VM and it came up and worked.

Andrew reassigned this task from Andrew to ayounsi.

There is a new wrinkle here -- lvs!

These hosts are an lvs pool, which means they need to be able to talk to e.g. lvs1015. Currently lvs1015 can't even ping them. Is this a simple ACL change, or a huge wrench in our network plans?

Looking at the LVS config, it seems like it's only configured for Prometheus monitoring. As it's a bit of a surprising setup, is it a hard requirement? Is it possible to know more about it?

If we want to stretch our LVS to a new vlan or realm, we would need Traffic approval, and I'm not sure we should, as it extends the fate sharing for the LVS.

Change 616817 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/dns@master] Move cloudceph.svc.eqiad.wmnet service name to cloudceph.eqiad.wmnet

https://gerrit.wikimedia.org/r/616817

Change 616818 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] cloudceph: don't use lvs for prometheus monitoring

https://gerrit.wikimedia.org/r/616818

I've added a couple of patches that move this out from behind lvs. A side effect is that the service name cloudceph.svc.eqiad.wmnet will move to cloudceph.eqiad.wmnet, so we need to update whatever is currently monitoring the old .svc name. Grafana dashboards, probably?
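One way to find what still points at the old .svc name is to scan exported configuration text for it. This is a rough sketch only: the sample dashboard snippet and the function name are hypothetical, and it assumes the dashboards or monitoring configs are available as plain text.

```python
OLD_NAME = "cloudceph.svc.eqiad.wmnet"


def find_stale_references(text: str, old_name: str = OLD_NAME) -> list[tuple[int, str]]:
    """Return (line_number, stripped_line) for every line mentioning old_name."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if old_name in line
    ]


if __name__ == "__main__":
    # Hypothetical exported dashboard fragment, for illustration only.
    sample = """\
{
  "title": "Ceph cluster",
  "target": "cloudceph.svc.eqiad.wmnet:9283",
  "fallback": "cloudceph.eqiad.wmnet:9283"
}
"""
    for number, line in find_stale_references(sample):
        print(f"line {number}: {line}")
```

The sketch only reports matches instead of rewriting them, so each consumer of the old name can be reviewed by hand.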

Change 616818 merged by Andrew Bogott:
[operations/puppet@production] cloudceph: don't use lvs for prometheus monitoring

https://gerrit.wikimedia.org/r/616818

Change 616852 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Remove lvs from cloudcephmon nodes

https://gerrit.wikimedia.org/r/616852

Change 616852 merged by Andrew Bogott:
[operations/puppet@production] Remove lvs from cloudcephmon nodes

https://gerrit.wikimedia.org/r/616852

Change 616817 merged by Andrew Bogott:
[operations/dns@master] Remove cloudceph.svc.eqiad.wmnet service name

https://gerrit.wikimedia.org/r/616817

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

cloudcephmon1003.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202007281534_andrew_27220_cloudcephmon1003_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudcephmon1003.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

cloudcephmon1002.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202007281609_andrew_31018_cloudcephmon1002_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudcephmon1002.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

cloudcephmon1001.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202007281633_andrew_19753_cloudcephmon1001_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudcephmon1001.eqiad.wmnet']

and were ALL successful.

Andrew claimed this task.

lvs is all cleaned up now.

cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for hosts: cloudcephosd1001.wikimedia.org

  • cloudcephosd1001.wikimedia.org (FAIL)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Unable to connect to the host, wipe of bootloaders will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for hosts: cloudcephosd1002.wikimedia.org

  • cloudcephosd1002.wikimedia.org (FAIL)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Unable to connect to the host, wipe of bootloaders will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for hosts: cloudcephosd1003.wikimedia.org

  • cloudcephosd1003.wikimedia.org (FAIL)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Unable to connect to the host, wipe of bootloaders will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudcephosd1001.eqiad.wmnet', 'cloudcephosd1002.eqiad.wmnet', 'cloudcephosd1003.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202007281934_andrew_18746.log.

Completed auto-reimage of hosts:

['cloudcephosd1001.eqiad.wmnet', 'cloudcephosd1003.eqiad.wmnet', 'cloudcephosd1002.eqiad.wmnet']

and were ALL successful.

@Andrew is the DNS record:

10.in-addr.arpa:51  1H  IN PTR  cloudceph.svc.eqiad.wmnet.

a leftover that can be removed?

yep!

Change 623843 had a related patch set uploaded (by Volans; owner: Volans):
[operations/dns@master] Cleanup leftover record cloudceph.svc.eqiad.wmnet

https://gerrit.wikimedia.org/r/623843

Change 623843 merged by Volans:
[operations/dns@master] Cleanup leftover record cloudceph.svc.eqiad.wmnet

https://gerrit.wikimedia.org/r/623843