cloudservices2005-dev: reimage into new network setup
Closed, Resolved (Public)

Description

The cloudservices2005-dev server got a new network setup.

We should:

  • drop wikimedia.org domain in favor of .codfw.wmnet.
  • drop connection to asw
  • keep private.codfw.wikimedia.cloud address

Following procedure at https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging
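
As a rough illustration of the renaming listed above, the sketch below derives the names the host is expected to drop, adopt, and keep. The helper `expected_names` is hypothetical, and the assumption that the cloud-private record has the form `<hostname>.private.codfw.wikimedia.cloud` is inferred from the bullet list rather than confirmed in this task.

```python
# Hypothetical helper: the domains come from the task description,
# the record layout (<hostname>.<domain>) is an assumption.
OLD_DOMAIN = "wikimedia.org"
NEW_DOMAIN = "codfw.wmnet"
CLOUD_PRIVATE_DOMAIN = "private.codfw.wikimedia.cloud"


def expected_names(hostname: str) -> dict[str, str]:
    """Return the FQDNs to drop, adopt, and keep for a short hostname."""
    if "." in hostname:
        raise ValueError("expected a short hostname, not an FQDN")
    return {
        "drop": f"{hostname}.{OLD_DOMAIN}",
        "adopt": f"{hostname}.{NEW_DOMAIN}",
        "keep": f"{hostname}.{CLOUD_PRIVATE_DOMAIN}",
    }


names = expected_names("cloudservices2005-dev")
```

A list like this can serve as a checklist when grepping the puppet and DNS repositories for leftover references to the old name.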

Event Timeline

aborrero triaged this task as Medium priority. Jun 12 2023, 11:11 AM
aborrero created this task.

Change 929687 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: codfw1dev: services: disable slapd mirror mode

https://gerrit.wikimedia.org/r/929687

Change 929687 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: codfw1dev: services: disable slapd mirror mode

https://gerrit.wikimedia.org/r/929687

Change 931291 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: codfw1dev: refresh ldap hosts

https://gerrit.wikimedia.org/r/931291

Change 931589 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/dns@master] wikimediacloud.org: refresh A for ns1.openstack.codfw1dev.wikimediacloud.org

https://gerrit.wikimedia.org/r/931589

Change 931287 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloud: codfw1dev: fix labsldapconfig to use newer server

https://gerrit.wikimedia.org/r/931287

Change 931589 merged by Arturo Borrero Gonzalez:

[operations/dns@master] wikimediacloud.org: refresh A for ns1.openstack.codfw1dev.wikimediacloud.org

https://gerrit.wikimedia.org/r/931589

Change 931900 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloud: drop references to cloudservices2005-dev.wikimedia.org

https://gerrit.wikimedia.org/r/931900

Change 931900 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloud: drop references to cloudservices2005-dev.wikimedia.org

https://gerrit.wikimedia.org/r/931900

cookbooks.sre.hosts.decommission executed by aborrero@cumin2002 for hosts: cloudservices2005-dev

  • cloudservices2005-dev (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • COMMON_STEPS (FAIL)
    • Failed to run the sre.dns.netbox cookbook, run it manually

ERROR: some step on some host failed, check the bolded items above
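
The report above marks failed steps in bold, which does not survive as plain text. As a minimal sketch, assuming that problematic steps start with "Failed to" or "Unable to" (a heuristic drawn from this report only, not from the cookbook itself), the steps needing manual follow-up could be collected like this. The function name `problem_steps` and the shortened sample report are illustrative.

```python
import re

# Shortened copy of the report above, used as sample input.
REPORT = """\
  • cloudservices2005-dev (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
  • COMMON_STEPS (FAIL)
    • Failed to run the sre.dns.netbox cookbook, run it manually
"""

# Assumed markers of a step that did not complete (heuristic, see lead-in).
PROBLEM_PREFIXES = ("Failed to", "Unable to")
# A section header looks like "<name> (<STATUS>)".
SECTION_RE = re.compile(r"^(?P<name>\S+) \((?P<status>[A-Z]+)\)$")


def problem_steps(report: str) -> dict[str, list[str]]:
    """Group problematic steps under the section (host or COMMON_STEPS) they belong to."""
    problems: dict[str, list[str]] = {}
    section = None
    for raw in report.splitlines():
        # Drop indentation and the leading bullet character.
        text = raw.strip().lstrip("•").strip()
        if not text:
            continue
        match = SECTION_RE.match(text)
        if match:
            section = match.group("name")
            continue
        if section is not None and text.startswith(PROBLEM_PREFIXES):
            problems.setdefault(section, []).append(text)
    return problems


followups = problem_steps(REPORT)
```

Here the `COMMON_STEPS` entry points at the `sre.dns.netbox` cookbook that the report asks to be run manually.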

aborrero changed the task status from Open to In Progress. Jun 21 2023, 2:48 PM
aborrero updated Other Assignee, added: Papaul.
aborrero added a project: ops-codfw.
aborrero added subscribers: Jhancock.wm, Papaul.

Hey @Papaul (or @Jhancock.wm), we need this server cabled the same way as cloudservices2004-dev:

  • primary interface connected to cloudsw (whatever port) -- relocate connection from server's eno2
  • no connection to asw

@aborrero I've physically removed the connection from asw. The cloudsw connection is now on port ge-0/0/12 <-> eno1.

Thanks!

Could you please update Netbox so it reflects the new cabling? https://netbox.wikimedia.org/dcim/devices/4144/interfaces/

Change 932800 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudservices2005-dev: give it proper role and name.

https://gerrit.wikimedia.org/r/932800

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin2002 for host cloudservices2005-dev.codfw.wmnet with OS bullseye

Change 932800 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudservices2005-dev: give it proper role and name

https://gerrit.wikimedia.org/r/932800

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin2002 for host cloudservices2005-dev.codfw.wmnet with OS bullseye executed with errors:

  • cloudservices2005-dev (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin2002 for host cloudservices2005-dev.codfw.wmnet with OS bullseye

Change 932832 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudservices2005-dev: drop cloud-private base interface override

https://gerrit.wikimedia.org/r/932832

Change 932833 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] acme_chief: allow cloudservices2005-dev to access ldap-codfw1dev cert

https://gerrit.wikimedia.org/r/932833

Change 932832 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudservices2005-dev: drop cloud-private base interface override

https://gerrit.wikimedia.org/r/932832

Change 932833 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] acme_chief: allow cloudservices2005-dev to access ldap-codfw1dev cert

https://gerrit.wikimedia.org/r/932833

Change 932834 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] acme_chief: extend ldap-codfw1dev with cloudservices2005-dev SNI

https://gerrit.wikimedia.org/r/932834

Change 932834 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] acme_chief: extend ldap-codfw1dev with cloudservices2005-dev SNI

https://gerrit.wikimedia.org/r/932834

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin2002 for host cloudservices2005-dev.codfw.wmnet with OS bullseye completed:

  • cloudservices2005-dev (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202306260940_aborrero_2574582_cloudservices2005-dev.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Failed to run the sre.puppet.sync-netbox-hiera cookbook, run it manually

aborrero updated the task description.

I'm closing this ticket: I believe the reimage is complete and the server is mostly working, barring any last-minute firewall changes, which we are tracking in other tickets.

Change 934404 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] designate: distinguish between designate IPs and pdns IPs in pools.yaml

https://gerrit.wikimedia.org/r/934404

Change 934405 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloud-vps pdns: include designate hosts as allowed api clients

https://gerrit.wikimedia.org/r/934405

Change 934404 merged by Andrew Bogott:

[operations/puppet@production] designate: distinguish between designate IPs and pdns IPs in pools.yaml

https://gerrit.wikimedia.org/r/934404

Change 934405 merged by Andrew Bogott:

[operations/puppet@production] cloud-vps pdns: include designate hosts as allowed api clients

https://gerrit.wikimedia.org/r/934405

Change 934427 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Designate/pdns: allow designate hosts to access the pdns rest api

https://gerrit.wikimedia.org/r/934427

Change 934427 merged by Andrew Bogott:

[operations/puppet@production] Designate/pdns: allow designate hosts to access the pdns rest api

https://gerrit.wikimedia.org/r/934427

Change 934430 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Designate/pdns: allow designate and pdns hosts to access mdns for axfr

https://gerrit.wikimedia.org/r/934430

Change 934430 merged by Andrew Bogott:

[operations/puppet@production] Designate/pdns: allow designate and pdns hosts to access mdns for axfr

https://gerrit.wikimedia.org/r/934430