
Rename dbstore1004 to db1183 and place it on m5
Open, Stalled, High, Public

Description

dbstore1004 is no longer in use; it was replaced by dbstore1007.
dbstore1004 needs to be renamed and reimaged to db1183 and placed into m5 (https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging).

Note: once the network maintenance at T286032 is done, move db1183 to s7.

Event Timeline

Marostegui triaged this task as Medium priority. Jun 9 2021, 4:39 AM
Marostegui moved this task from Triage to Ready on the DBA board.
Marostegui raised the priority of this task from Medium to High. Fri, Jul 2, 12:41 PM

@Kormat let's give this higher priority, as we might be able to use db1183 to replace one of the systems in T286032.

Marostegui renamed this task from "Rename dbstore1004 to db1183 and place it on s7" to "Rename dbstore1004 to db1183 and place it on m5". Fri, Jul 2, 1:12 PM
Marostegui updated the task description.

Please use this host (once reimaged) to replace db1128 in m5: T286032#7193722

cookbooks.sre.hosts.decommission executed by kormat@cumin1001 for hosts: dbstore1004.eqiad.wmnet

  • dbstore1004.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • COMMON_STEPS (FAIL)
    • Failed to run Homer on asw2-b-eqiad.mgmt.eqiad.wmnet: Command '['/usr/local/bin/homer', 'asw2-b-eqiad.mgmt.eqiad.wmnet', 'commit', 'Host decommission - kormat@cumin1001 - T284622']' returned non-zero exit status 1.

ERROR: some step on some host failed, check the bolded items above
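Since the bolding that the error message refers to is lost in plain text, a small script can pick the failed entries out of a summary like the one above. This is only a sketch: the `failed_targets` helper and its regular expression are hypothetical, written against the line shapes seen in this task, and are not part of the cookbook itself.

```python
import re

# Matches top-level result lines such as "• dbstore1004.eqiad.wmnet (PASS)".
# The pattern is an assumption based on the output pasted in this task.
RESULT_LINE = re.compile(r"^\s*•\s+(?P<target>\S+)\s+\((?P<status>PASS|FAIL)\)\s*$")


def failed_targets(summary: str) -> list[str]:
    """Return the targets whose summary line is marked FAIL."""
    failed = []
    for line in summary.splitlines():
        match = RESULT_LINE.match(line)
        if match and match.group("status") == "FAIL":
            failed.append(match.group("target"))
    return failed


# Abbreviated copy of the first decommission run reported above.
SUMMARY = """\
  • dbstore1004.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Powered off
  • COMMON_STEPS (FAIL)
    • Failed to run Homer on asw2-b-eqiad.mgmt.eqiad.wmnet
"""

print(failed_targets(SUMMARY))  # ['COMMON_STEPS']
```

For the first run this reports only COMMON_STEPS, which matches the log: the host steps passed and the Homer commit on the switch failed.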

cookbooks.sre.hosts.decommission executed by kormat@cumin1001 for hosts: dbstore1004.eqiad.wmnet

  • dbstore1004.eqiad.wmnet (FAIL)
    • Failed downtime host on Icinga (likely already removed)
    • Host steps raised exception: Invalid management FQDN dbstore1004.mgmt.eqiad.wmnet for dbstore1004.eqiad.wmnet

ERROR: some step on some host failed, check the bolded items above

Change 702988 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/puppet@production] dbstore1004: Rename to db1183, prep for m5.

https://gerrit.wikimedia.org/r/702988

Change 702988 merged by Kormat:

[operations/puppet@production] dbstore1004: Rename to db1183, prep for m5.

https://gerrit.wikimedia.org/r/702988

Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts:

db1183.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202107021618_kormat_30648_db1183_eqiad_wmnet.log.
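The log file name appears to encode when the run started, who started it, and which host it targeted. The sketch below splits a name of that shape into fields. The field meanings are assumptions read off the two file names in this task (in particular, the numeric third field is guessed to be a process ID), and `parse_reimage_log_name` is a hypothetical helper, not part of wmf-auto-reimage.

```python
from datetime import datetime
from typing import NamedTuple


class ReimageLog(NamedTuple):
    started: datetime  # assumed format: YYYYMMDDHHMM
    user: str
    run_id: str        # assumed to be a process ID; the task does not say
    host: str


def parse_reimage_log_name(filename: str) -> ReimageLog:
    """Split '<stamp>_<user>_<id>_<host with underscores>.log' into fields."""
    stem = filename.removesuffix(".log")
    # Split on the first three underscores only; the rest is the host name.
    # A user name containing an underscore would break this assumption.
    stamp, user, run_id, host = stem.split("_", 3)
    return ReimageLog(
        started=datetime.strptime(stamp, "%Y%m%d%H%M"),
        user=user,
        run_id=run_id,
        host=host.replace("_", "."),
    )


info = parse_reimage_log_name("202107021618_kormat_30648_db1183_eqiad_wmnet.log")
print(info.user, info.host, info.started.isoformat())
# kormat db1183.eqiad.wmnet 2021-07-02T16:18:00
```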

Completed auto-reimage of hosts:

['db1183.eqiad.wmnet']

Of which those FAILED:

['db1183.eqiad.wmnet']

Current status:

  • Host is renamed
  • Needs a partitioning scheme configured for the reimaging
  • Needs a role assigned, and hiera host vars set.

And then these final steps need to be run:

  • Run puppet on the install servers: cumin 'A:installserver' 'run-puppet-agent -q'
  • Run the wmf-auto-reimage-host script for the host with the new name and with the --new option (see the Reimage section above)
  • Edit the device page on Netbox, set its status from PLANNED to STAGED.
  • Get the physical re-labeling done (open a task for dc-ops): T286468: Relabel dbstore1004 to db1183
  • Run Homer (again) against the switch the device is connected to, in order to update the port's description with the interface name assigned to the host during the reimage/install.
  • Once the host is back in production update its status in Netbox from STAGED to ACTIVE.
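The Netbox status changes named in the steps above (PLANNED to STAGED, then STAGED to ACTIVE) can be sketched as a tiny transition check. This only models the two transitions mentioned in this task; it is not the Netbox API or its full status lifecycle, and `ALLOWED` and `advance` are hypothetical names.

```python
# Status changes named in this task only; not a complete Netbox lifecycle.
ALLOWED = {
    "PLANNED": {"STAGED"},  # after the reimage under the new name
    "STAGED": {"ACTIVE"},   # once the host is back in production
}


def advance(current: str, target: str) -> str:
    """Return target if the task's steps allow current -> target, else raise."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"unexpected status change: {current} -> {target}")
    return target


status = "PLANNED"
status = advance(status, "STAGED")
status = advance(status, "ACTIVE")
print(status)  # ACTIVE
```

Skipping a step, for example going straight from PLANNED to ACTIVE, raises a ValueError in this sketch.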

I have removed dbstore1004:331* from zarcillo and tendril.

Change 704072 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/puppet@production] install_server: Set db1183 to be partitioned on install.

https://gerrit.wikimedia.org/r/704072

Change 704072 merged by Kormat:

[operations/puppet@production] install_server: Set db1183 to be partitioned on install.

https://gerrit.wikimedia.org/r/704072

Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts:

db1183.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202107120858_kormat_6790_db1183_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['db1183.eqiad.wmnet']

and were ALL successful.

Change 704079 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/puppet@production] db1183: Assign role, configure hiera values.

https://gerrit.wikimedia.org/r/704079

Change 704079 merged by Kormat:

[operations/puppet@production] db1183: Assign role, configure hiera values.

https://gerrit.wikimedia.org/r/704079

Mentioned in SAL (#wikimedia-operations) [2021-07-13T12:53:37Z] <kormat@cumin1001> START - Cookbook sre.hosts.downtime for 4:00:00 on db1117.eqiad.wmnet with reason: Copy m5 from db1117 to db1183 T284622

Mentioned in SAL (#wikimedia-operations) [2021-07-13T12:53:43Z] <kormat@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1117.eqiad.wmnet with reason: Copy m5 from db1117 to db1183 T284622

Mentioned in SAL (#wikimedia-operations) [2021-07-13T12:53:53Z] <kormat> stopping replication on db1117:3325 T284622

Mentioned in SAL (#wikimedia-operations) [2021-07-13T13:14:06Z] <kormat> restarted replication on db1117:3325 T284622
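The SAL timestamps above give the length of the replication pause on db1117:3325. A minimal sketch of that arithmetic follows; the `elapsed` helper is hypothetical and the timestamps are copied from the two SAL entries.

```python
from datetime import datetime, timedelta

# Timestamp layout used in the SAL entries above, e.g. 2021-07-13T12:53:53Z.
SAL_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def elapsed(start: str, end: str) -> timedelta:
    """Return the time between two SAL timestamps."""
    return datetime.strptime(end, SAL_FORMAT) - datetime.strptime(start, SAL_FORMAT)


stopped = "2021-07-13T12:53:53Z"    # "stopping replication on db1117:3325"
restarted = "2021-07-13T13:14:06Z"  # "restarted replication on db1117:3325"

pause = elapsed(stopped, restarted)
print(pause)  # 0:20:13

# The pause fits well inside the 4:00:00 downtime set on db1117.
print(pause < timedelta(hours=4))  # True
```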

Change 704341 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/puppet@production] db1183: Enable notifications.

https://gerrit.wikimedia.org/r/704341

Change 704341 merged by Kormat:

[operations/puppet@production] db1183: Enable notifications.

https://gerrit.wikimedia.org/r/704341

Kormat changed the task status from Open to Stalled. Tue, Jul 13, 2:19 PM
Kormat moved this task from Ready to Blocked on the DBA board.

Stalling until T286032 (Switch buffer re-partition - Eqiad Row A) is done; db1183 will then be moved to s7.