Q3:rack/setup/install db1207-db1225
Open, Medium, Public

Description

This task will track the racking, setup, and OS installation of db1207-db1225

Hostname / Racking / Installation Details

Hostnames: db1207-db1225
Racking Proposal: Whatever is easiest for DCOps; we have no preference as long as no more than 2 hosts are placed in the same rack.
Networking Setup: # of Connections: 1, Speed: 1G, Vlan: Private, AAAA records: N
Partitioning/Raid: HW Raid: Y, Partman recipe and/or desired Raid Level: RAID10 (partman recipe already done in puppet by @Marostegui )
OS Distro: Bullseye
Sub-team Technical Contact: @Marostegui

Per host setup checklist

db1207:
  • - receive in system on procurement task T325209 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and role::insetup::data_persistence
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
db1208:
  • - receive in system on procurement task T325209 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and role::insetup::data_persistence
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
db1209:
  • - receive in system on procurement task T325209 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and role::insetup::data_persistence
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
db1212:
  • - receive in system on procurement task T325209 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and role::insetup::data_persistence
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
db121:
  • - receive in system on procurement task T325209 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and role::insetup::data_persistence
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
db1213:
  • - receive in system on procurement task T325209 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and role::insetup::data_persistence
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
db1214:
  • - receive in system on procurement task T325209 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and role::insetup::data_persistence
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
db1215:
  • - receive in system on procurement task T325209 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and role::insetup::data_persistence
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
db1216:
  • - receive in system on procurement task T325209 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and role::insetup::data_persistence
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
db1217:
  • - receive in system on procurement task T325209 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and role::insetup::data_persistence
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
db1218:
  • - receive in system on procurement task T325209 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and role::insetup::data_persistence
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
db1219:
  • - receive in system on procurement task T325209 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and role::insetup::data_persistence
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
db1220:
  • - receive in system on procurement task T325209 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and role::insetup::data_persistence
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
db1221:
  • - receive in system on procurement task T325209 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and role::insetup::data_persistence
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
db1222:
  • - receive in system on procurement task T325209 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and role::insetup::data_persistence
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
db1223:
  • - receive in system on procurement task T325209 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and role::insetup::data_persistence
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
db1224:
  • - receive in system on procurement task T325209 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and role::insetup::data_persistence
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
db1225:
  • - receive in system on procurement task T325209 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and role::insetup::data_persistence
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.

Event Timeline

RobH added a parent task: Unknown Object (Task).
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH removed a subscriber: RobH.
Marostegui renamed this task from Q3:rack/setup/install db1207-db1225 to Q3:rack/setup/install db1207-db1229. Jan 10 2023, 5:48 PM
Marostegui updated the task description.
RobH renamed this task from Q3:rack/setup/install db1207-db1229 to Q3:rack/setup/install db1207-db1225. Jan 10 2023, 6:14 PM
RobH updated the task description.

Change 878182 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Adjust new eqiad hosts

https://gerrit.wikimedia.org/r/878182

Change 878182 merged by Marostegui:

[operations/puppet@production] mariadb: Adjust new eqiad hosts

https://gerrit.wikimedia.org/r/878182

Any ETA to get these (or some) racked and installed? Thanks!

I am in the process of racking them right now and will have them racked and cabled in the next day or so.

db1207 a5 u22 Port 40 Cableid 2570
db1208 a5 u23 Port 41 Cableid 1880
db1209 a6 u24 Port 36 Cableid 1918
db1210 a6 u25 Port 41 Cableid 1949
db1211 b5 u27 Port 16 Cableid 3283
db1212 b5 u28 Port 17 Cableid 3282
db1213 b6 u37 Port 42 Cableid 23000001
db1214 b6 u38 Port 41 Cableid 23000012
db1215 b3 u38 Port 27 Cableid 1944
db1216 b3 u39 Port 14 Cableid 5236
db1217 c5 u39 Port 29 Cableid 4011
db1218 c5 u40 Port 28 Cableid 4010
db1219 c6 u29 Port 29 Cableid 1946
db1220 c6 u30 Port 26 Cableid 3248
db1221 d1 u34 Port 20 Cableid 3613
db1222 d3 u26 Port 26 Cableid 1961
db1223 d3 u27 Port 39 Cableid 5089
db1224 d6 u37 Port 37 Cableid 23000046
db1225 d6 u38 Port 38 Cableid 23000007
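
The racking proposal asked for no more than 2 of these hosts in the same rack. A minimal sketch of how the list above could be checked against that limit, assuming the second column of each line is the rack; the function names are illustrative and not part of any existing tooling:

```python
from collections import Counter

# Racking notes copied from the list above: hostname, rack, unit, port, cable ID.
RACKING = """\
db1207 a5 u22 Port 40 Cableid 2570
db1208 a5 u23 Port 41 Cableid 1880
db1209 a6 u24 Port 36 Cableid 1918
db1210 a6 u25 Port 41 Cableid 1949
db1211 b5 u27 Port 16 Cableid 3283
db1212 b5 u28 Port 17 Cableid 3282
db1213 b6 u37 Port 42 Cableid 23000001
db1214 b6 u38 Port 41 Cableid 23000012
db1215 b3 u38 Port 27 Cableid 1944
db1216 b3 u39 Port 14 Cableid 5236
db1217 c5 u39 Port 29 Cableid 4011
db1218 c5 u40 Port 28 Cableid 4010
db1219 c6 u29 Port 29 Cableid 1946
db1220 c6 u30 Port 26 Cableid 3248
db1221 d1 u34 Port 20 Cableid 3613
db1222 d3 u26 Port 26 Cableid 1961
db1223 d3 u27 Port 39 Cableid 5089
db1224 d6 u37 Port 37 Cableid 23000046
db1225 d6 u38 Port 38 Cableid 23000007
"""

MAX_PER_RACK = 2  # limit stated in the racking proposal


def hosts_per_rack(text):
    """Count how many hosts are assigned to each rack."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        counts[fields[1]] += 1
    return counts


def overfull_racks(text, limit=MAX_PER_RACK):
    """Return the racks that hold more hosts than the limit allows."""
    return {rack: n for rack, n in hosts_per_rack(text).items() if n > limit}


if __name__ == "__main__":
    print(overfull_racks(RACKING) or "no rack exceeds the limit")
```

With the placement listed above, no rack holds more than 2 of the new hosts.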

Jclark-ctr updated the task description.
Jclark-ctr added subscribers: Cmjohnson, Jclark-ctr.

@Cmjohnson can you assist with next steps of these?

While running the provision cookbook on 2 of the db nodes (db1207 and db1208) and on gerrit1003, I am getting the following error:

Raised while handling: The `choices` argument is empty and no custom validator was provided.
Failed to run cookbooks.sre.hosts.provision.ProvisionRunner._config: The `choices` argument is empty and no custom validator was provided.

@Papaul please do not reimage db1206, that host is already in production. We bought it in advance to test the raid controller as it's a new one. So it's serving traffic.

@Marostegui it is db1207 and db1208, not db1206.

Great, as you mentioned db1206 earlier I got scared :)

@Volans

100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/usr/local/sbin/...cludes -r commit'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1207.mgmt.eqiad.wmnet with reboot policy FORCED
pt1979@cumin2002:~$ sudo cookbook sre.hosts.provision db1207 --no-dhcp --no-users
Management Password:
Testing Redfish API connection to cumin2002 (10.193.0.139)
==> Are you sure to proceed to apply BIOS/iDRAC settings for host db1207.mgmt.eqiad.wmnet with reboot policy FORCED?
Type "go" to proceed or "abort" to interrupt the execution
> go
User input is: "go"
START - Cookbook sre.hosts.provision for host db1207.mgmt.eqiad.wmnet with reboot policy FORCED
Testing Redfish API connection to db1207 (10.65.1.14)
==> Detected Hardware RAID. Please configure the RAID at this point (the password is still DELL default one). Once done select "modified" if the RAID was modified or "untouched" if it was not touched. If the RAID was modified the host will be rebooted to make sure the changes are applied.
> modified
User input is: "modified"
Rebooting the host with policy ChassisResetPolicy.FORCE_RESTART and waiting for 3 minutes
Resetting chassis power status for db1207 to ForceRestart
Testing Redfish API connection to db1207 (10.65.1.14)
[IDRAC.2.7.SYS057] Exporting Server Configuration Profile.
[1/30, retrying in 30.00s] Polling task: JID_800971819368 not completed yet: status=OK, state=Running, completed=10%
First attempt to load the new configuration failed, auto-retrying once
Testing Redfish API connection to db1207 (10.65.1.14)
[IDRAC.2.7.SYS057] Exporting Server Configuration Profile.
[1/30, retrying in 30.00s] Polling task: JID_800972141788 not completed yet: status=OK, state=Running, completed=10%
Raised while handling: The `choices` argument is empty and no custom validator was provided.
Failed to run cookbooks.sre.hosts.provision.ProvisionRunner._config: The `choices` argument is empty and no custom validator was provided.
==> What do you want to do? "retry" the last command, manually fix the issue and "skip" the last command to continue the execution or completely "abort" the execution.

Change 904201 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.hosts.provision: handle the case of no NICs

https://gerrit.wikimedia.org/r/904201

Change 904201 merged by jenkins-bot:

[operations/cookbooks@master] sre.hosts.provision: handle the case of no NICs

https://gerrit.wikimedia.org/r/904201
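
The patch title suggests the failing prompt was reached with an empty list of NICs, which matches the "`choices` argument is empty" error in the log above. A minimal sketch of that kind of guard follows. It is not the actual change: `ask_choice` and `pick_nic` are hypothetical stand-ins, not the cookbook's real helpers, and the behaviour chosen for the no-NIC case is an assumption.

```python
def ask_choice(message, choices, reader=input):
    """Prompt until the reply is one of `choices`.

    Hypothetical stand-in for a prompt helper, not the cookbook's real code.
    """
    if not choices:
        # Same failure mode as in the log: no reply could ever be valid.
        raise ValueError("The `choices` argument is empty")
    while True:
        answer = reader(f"{message} {list(choices)}: ").strip()
        if answer in choices:
            return answer


def pick_nic(nics, reader=input):
    """Return the NIC to configure, or None when the host reports no NICs."""
    if not nics:
        return None  # assumption: the caller decides how to continue
    if len(nics) == 1:
        return nics[0]  # nothing to ask when there is only one candidate
    return ask_choice("Which NIC should be used?", nics, reader)
```

Checking for the empty list before prompting keeps the run from stopping on a question that has no valid answer.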

Change 904235 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.hosts.provision: fix NIC link detection

https://gerrit.wikimedia.org/r/904235

Change 904235 merged by jenkins-bot:

[operations/cookbooks@master] sre.hosts.provision: fix NIC link detection

https://gerrit.wikimedia.org/r/904235

After the switch configuration step I get the output below:

Testing Redfish API connection to db1209 (10.65.1.88)
Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1ba0fcf7f0>, 'Connection to 10.65.1.88 timed out. (connect timeout=10)')': /redfish
Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1ba0fcf130>, 'Connection to 10.65.1.88 timed out. (connect timeout=10)')': /redfish
Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1ba0fcf070>, 'Connection to 10.65.1.88 timed out. (connect timeout=10)')': /redfish
Failed to run cookbooks.sre.hosts.provision.ProvisionRunner.run.<locals>.check_connection: Unable to connect to the Redfish API of db1209. Follow https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation#Troubleshooting_2
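
The retries above all fail at the connection stage, before any Redfish request is answered. A minimal sketch of a quick reachability check that could be run from the same host before retrying the cookbook, assuming the management interface serves HTTPS on the default port 443 (the log shows an HTTPS connection but not the port); `mgmt_port_reachable` is an illustrative name, not an existing tool:

```python
import socket


def mgmt_port_reachable(host, port=443, timeout=10.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and unroutable addresses.
        return False
```

For example, `mgmt_port_reachable("10.65.1.88")` uses the management address from the log. A `False` result points at cabling, switch port or iDRAC network configuration rather than at the Redfish API itself.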