(Need By: 2020-11-29) rack/setup/install db11[51-76]
Closed, Resolved, Public

Description

This task will track the racking, setup, and OS installation of new db servers.

Please rack the 3 new x2 service hosts as high priority and the others as normal priority.

Hostname / Racking / Installation Details

new x2 service racking details:
Hostnames: db1151, db1152, db1153
Racking Proposal: Any 1G rack that works for DC-Ops as long as we have one host per row
Networking/Subnet/VLAN/IP: 1G private VLAN as any other database
Partitioning/Raid: RAID10 with a 256 KB stripe size and writeback, as documented at https://wikitech.wikimedia.org/wiki/Raid_and_MegaCli#Raid_setup_at_Wikimedia
OS Distro: Buster

refresh racking details:
Hostnames: db1154 db1155 db1156 db1157 db1158 db1159 db1160 db1161 db1162 db1163 db1164 db1165 db1166 db1167 db1168 db1169 db1170 db1171 db1172 db1173 db1174 db1175 db1176
Racking Proposal: 6 hosts in row A, 5 hosts in row B, 6 hosts in row C, 5 hosts in row D. We don't mind which racks are used, as long as the hosts in a given row are not all placed in the same rack.
Networking/Subnet/VLAN/IP: 1G private VLAN as any other database
Partitioning/Raid: RAID10 with a 256 KB stripe size and writeback, as documented at https://wikitech.wikimedia.org/wiki/Raid_and_MegaCli#Raid_setup_at_Wikimedia
OS Distro: Buster
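The racking constraints above can be checked mechanically once a placement exists. The following is a minimal Python sketch, not a DC-Ops tool: the placement dictionaries are hypothetical, and it assumes rack names follow the row-letter-plus-number form (A1 ... D8) used in the Netbox rack list later in this task.

```python
from collections import Counter, defaultdict

# Hypothetical x2 placement (host -> rack); NOT the actual racking.
X2_PLAN = {"db1151": "A2", "db1152": "B4", "db1153": "C7"}

# Per-row host counts from the refresh racking proposal.
REFRESH_ROW_TARGETS = {"A": 6, "B": 5, "C": 6, "D": 5}


def row_of(rack: str) -> str:
    """The row is the leading letter of the rack name, e.g. "D3" -> "D"."""
    return rack[0]


def one_host_per_row(plan: dict[str, str]) -> bool:
    """x2 rule: no two hosts share a row."""
    rows = [row_of(rack) for rack in plan.values()]
    return len(rows) == len(set(rows))


def row_counts_match(plan: dict[str, str], targets: dict[str, int]) -> bool:
    """Refresh rule: the number of hosts per row matches the proposal."""
    return Counter(row_of(rack) for rack in plan.values()) == Counter(targets)


def not_all_in_one_rack(plan: dict[str, str]) -> bool:
    """Refresh rule: every row holding 2+ hosts spreads them over 2+ racks."""
    racks_per_row = defaultdict(set)
    hosts_per_row = Counter()
    for rack in plan.values():
        racks_per_row[row_of(rack)].add(rack)
        hosts_per_row[row_of(rack)] += 1
    return all(len(racks_per_row[row]) > 1 for row, n in hosts_per_row.items() if n > 1)


print(one_host_per_row(X2_PLAN))  # True
```

Keeping each rule in its own function makes it easy to report which constraint a proposed placement breaks.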

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

db1151:

  • - receive in system on procurement task T264584 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible) - partially done by T267043#6608037, but MAC addresses NOT entered by that initial patchset and still require update (please do not check the box until ALL installation required updates are completed.) - mac address added via T267043#6667490
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

db1152:

  • - receive in system on procurement task T264584 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible) - partially done by T267043#6608037, but MAC addresses NOT entered by that initial patchset and still require update (please do not check the box until ALL installation required updates are completed.) - mac address added via T267043#6667490
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

db1153:

  • - receive in system on procurement task T264584 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible) - partially done by T267043#6608037, but MAC addresses NOT entered by that initial patchset and still require update (please do not check the box until ALL installation required updates are completed.) - mac address added via T267043#6667490
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

db1154:

  • - receive in system on procurement task T264584 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible) - partially done by T267043#6608037, but MAC addresses NOT entered by that initial patchset and still require update (please do not check the box until ALL installation required updates are completed.) - mac address added via T267043#6667490
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

db1155:

  • - receive in system on procurement task T264584 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible) - partially done by T267043#6608037, but MAC addresses NOT entered by that initial patchset and still require update (please do not check the box until ALL installation required updates are completed.) - mac address added via T267043#6667490
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

db1156:

  • - receive in system on procurement task T264584 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - Updated idrac password
  • - idrac and bios firmware updated
  • - raid setup
  • - operations/puppet update - this should include updates to install_server DHCP and netboot, and to site.pp with role(insetup) (cp systems use role(insetup::nofirm) instead).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

db1157:

  • - receive in system on procurement task T264584 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - Updated idrac password
  • - idrac and bios firmware updated
  • - raid setup
  • - operations/puppet update - this should include updates to install_server DHCP and netboot, and to site.pp with role(insetup) (cp systems use role(insetup::nofirm) instead).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

db1158:

  • - receive in system on procurement task T264584 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - Updated idrac password
  • - idrac and bios firmware updated
  • - raid setup
  • - operations/puppet update - this should include updates to install_server DHCP and netboot, and to site.pp with role(insetup) (cp systems use role(insetup::nofirm) instead).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

db1159:

  • - receive in system on procurement task T264584 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - Updated idrac password
  • - idrac and bios firmware updated
  • - raid setup
  • - operations/puppet update - this should include updates to install_server DHCP and netboot, and to site.pp with role(insetup) (cp systems use role(insetup::nofirm) instead).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

db1160:

  • - receive in system on procurement task T264584 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - Updated idrac password
  • - idrac and bios firmware updated
  • - raid setup
  • - operations/puppet update - this should include updates to install_server DHCP and netboot, and to site.pp with role(insetup) (cp systems use role(insetup::nofirm) instead).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

db1161:

  • - receive in system on procurement task T264584 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - Updated idrac password
  • - idrac and bios firmware updated
  • - raid setup
  • - operations/puppet update - this should include updates to install_server DHCP and netboot, and to site.pp with role(insetup) (cp systems use role(insetup::nofirm) instead).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

db1162:

  • - receive in system on procurement task T264584 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - Updated idrac password
  • - idrac and bios firmware updated
  • - raid setup
  • - operations/puppet update - this should include updates to install_server DHCP and netboot, and to site.pp with role(insetup) (cp systems use role(insetup::nofirm) instead).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

db1163:

  • - receive in system on procurement task T264584 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - Updated idrac password
  • - idrac and bios firmware updated
  • - raid setup
  • - operations/puppet update - this should include updates to install_server DHCP and netboot, and to site.pp with role(insetup) (cp systems use role(insetup::nofirm) instead).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

db1164:

  • - receive in system on procurement task T264584 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - Updated idrac password
  • - idrac and bios firmware updated
  • - raid setup
  • - operations/puppet update - this should include updates to install_server DHCP and netboot, and to site.pp with role(insetup) (cp systems use role(insetup::nofirm) instead).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

db1165:

  • - receive in system on procurement task T264584 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - Updated idrac password
  • - idrac and bios firmware updated
  • - raid setup
  • - operations/puppet update - this should include updates to install_server DHCP and netboot, and to site.pp with role(insetup) (cp systems use role(insetup::nofirm) instead).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

db1166:

  • - receive in system on procurement task T264584 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - Updated idrac password
  • - idrac and bios firmware updated
  • - raid setup
  • - operations/puppet update - this should include updates to install_server DHCP and netboot, and to site.pp with role(insetup) (cp systems use role(insetup::nofirm) instead).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

db1167:

  • - receive in system on procurement task T264584 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - Updated idrac password
  • - idrac and bios firmware updated
  • - raid setup
  • - operations/puppet update - this should include updates to install_server DHCP and netboot, and to site.pp with role(insetup) (cp systems use role(insetup::nofirm) instead).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

db1168:

  • - receive in system on procurement task T264584 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - Updated idrac password
  • - idrac and bios firmware updated
  • - raid setup
  • - operations/puppet update - this should include updates to install_server DHCP and netboot, and to site.pp with role(insetup) (cp systems use role(insetup::nofirm) instead).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

db1169:

  • - receive in system on procurement task T264584 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - Updated idrac password
  • - idrac and bios firmware updated
  • - raid setup
  • - operations/puppet update - this should include updates to install_server DHCP and netboot, and to site.pp with role(insetup) (cp systems use role(insetup::nofirm) instead).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

db1170:

  • - receive in system on procurement task T264584 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - Updated idrac password
  • - idrac and bios firmware updated
  • - raid setup
  • - operations/puppet update - this should include updates to install_server DHCP and netboot, and to site.pp with role(insetup) (cp systems use role(insetup::nofirm) instead).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

db1171:

  • - receive in system on procurement task T264584 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - Updated idrac password
  • - idrac and bios firmware updated
  • - raid setup
  • - operations/puppet update - this should include updates to install_server DHCP and netboot, and to site.pp with role(insetup) (cp systems use role(insetup::nofirm) instead).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

db1172:

  • - receive in system on procurement task T264584 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - Updated idrac password
  • - idrac and bios firmware updated
  • - raid setup
  • - operations/puppet update - this should include updates to install_server DHCP and netboot, and to site.pp with role(insetup) (cp systems use role(insetup::nofirm) instead).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

db1173:

  • - receive in system on procurement task T264584 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - Updated idrac password
  • - idrac and bios firmware updated
  • - raid setup
  • - operations/puppet update - this should include updates to install_server DHCP and netboot, and to site.pp with role(insetup) (cp systems use role(insetup::nofirm) instead).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

db1174:

  • - receive in system on procurement task T264584 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - Updated idrac password
  • - idrac and bios firmware updated
  • - raid setup
  • - operations/puppet update - this should include updates to install_server DHCP and netboot, and to site.pp with role(insetup) (cp systems use role(insetup::nofirm) instead).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

db1175:

  • - receive in system on procurement task T264584 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - Updated idrac password
  • - idrac and bios firmware updated
  • - raid setup
  • - operations/puppet update - this should include updates to install_server DHCP and netboot, and to site.pp with role(insetup) (cp systems use role(insetup::nofirm) instead).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

Once the system(s) above have had all checkbox steps completed, this task can be resolved.
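The resolution rule (every checkbox done for every host) can be expressed as a small check. This is a hypothetical Python sketch: the step names are shortened paraphrases of the two checklist variants in this description (db1151-db1155 use the first, db1156 onward the second), and nothing here reads from Phabricator.

```python
# Steps shared by both checklist variants (end of on-site steps).
ONSITE = [
    "receive in system",
    "rack system and update netbox",
    "bios/drac/serial setup/testing",
    "mgmt dns entries",
    "network port setup",
]

# Checklist variant used for db1151-db1155.
EARLY_STEPS = ONSITE + [
    "production dns entries",
    "operations/puppet update",
    "OS installation",
    "puppet accept/initial run",
    "netbox state staged",
]

# Checklist variant used for db1156 onward.
LATE_STEPS = ONSITE + [
    "production dns entries",
    "idrac password",
    "idrac and bios firmware",
    "raid setup",
    "operations/puppet update",
    "OS installation and initial puppet run",
    "netbox state staged",
]


def steps_for(host: str) -> list[str]:
    """Pick the checklist variant from the numeric part of e.g. "db1157"."""
    return EARLY_STEPS if int(host.removeprefix("db")) <= 1155 else LATE_STEPS


def missing_steps(done: dict[str, set[str]]) -> dict[str, list[str]]:
    """Map each host that is not finished to the steps it still lacks."""
    return {
        host: [step for step in steps_for(host) if step not in steps]
        for host, steps in done.items()
        if not set(steps_for(host)) <= steps
    }


def can_resolve(done: dict[str, set[str]]) -> bool:
    """The task is resolvable once at least one host is tracked and none lack steps."""
    return bool(done) and not missing_steps(done)
```

Returning the missing steps per host, rather than a bare boolean, mirrors how the checklists were actually used in the timeline below: to see what is still outstanding.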

Event Timeline

@Cmjohnson es1011, es1012 and es1014 are now ready for you to unrack. They are in different racks (and they are also 2U), so we can use those three racks for the x2 hosts.
Thanks!

@Cmjohnson, would it be possible to plan for racking 5 instead of 3 of the new hosts in one go? It would help us prepare for the Sanitarium host Buster/10.4 upgrades, which are currently at risk due to issues tracked in https://phabricator.wikimedia.org/T268742. Any 1G rack locations are fine for the 2 additional hosts.

Change 645130 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] adding in db115[1-5] mac address entries

https://gerrit.wikimedia.org/r/645130

Change 645130 merged by RobH:
[operations/puppet@production] adding in db115[1-5] mac address entries

https://gerrit.wikimedia.org/r/645130

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

db1151.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012031956_robh_24972_db1151_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['db1151.eqiad.wmnet']

Of which those FAILED:

['db1151.eqiad.wmnet']

20:19:50 | db1151.eqiad.wmnet | Unable to run wmf-auto-reimage-host: could not convert string to float: "Warning: Permanently added the ECDSA host key for IP address '2620:0:861:101:10:64:0:6' to the list of known hosts.\n1607026730"

This was a script error that volans has already fixed, so I continued. After the failures of the first run below, all five hosts were reimaged a second time.
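The failure message above appears to come from passing the whole captured output of a remote command to Python's float() while an SSH host-key warning sat in front of the number (1607026730 looks like a Unix timestamp). The actual fix volans made is not shown in this task; the sketch below illustrates one defensive approach, assuming the value is always on the last non-empty line. The function name is hypothetical and not part of wmf-auto-reimage.

```python
def parse_numeric_output(output: str) -> float:
    """Return the number on the last non-empty line of a command's output.

    Earlier lines, such as an SSH "Warning: Permanently added ..." notice
    captured in the same stream, are ignored.
    """
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    if not lines:
        raise ValueError("command produced no output")
    return float(lines[-1])


# The exact string from the failed run above.
raw = (
    "Warning: Permanently added the ECDSA host key for IP address "
    "'2620:0:861:101:10:64:0:6' to the list of known hosts.\n1607026730"
)
print(parse_numeric_output(raw))  # 1607026730.0
```

An alternative with the same effect is to keep SSH warnings out of the captured stream in the first place (for example by capturing stdout and stderr separately); which approach the real tool uses is not documented here.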

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

['db1152.eqiad.wmnet', 'db1153.eqiad.wmnet', 'db1154.eqiad.wmnet', 'db1155.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202012032025_robh_11817.log.

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

['db1151.eqiad.wmnet', 'db1152.eqiad.wmnet', 'db1153.eqiad.wmnet', 'db1154.eqiad.wmnet', 'db1155.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202012032107_robh_28658.log.

Completed auto-reimage of hosts:

['db1155.eqiad.wmnet', 'db1151.eqiad.wmnet', 'db1153.eqiad.wmnet', 'db1152.eqiad.wmnet', 'db1154.eqiad.wmnet']

and were ALL successful.

RobH updated the task description.

Chris got db115[1-5] racked and remotely accessible so I could do the initial imaging of the hosts. These first 5 have had their checklists updated and are now ready for the DBA team to put into service as needed.

This task remains open for the completion of the remainder of the hosts.

@Cmjohnson and @RobH - per our conversation on IRC, just a heads-up to avoid installing the remaining db hosts with IPv6 (see T270101 for the reasoning). Thanks, Willy

Cmjohnson added a subscriber: Jclark-ctr.

These are racked and Netbox is updated, but they are not connected to the switches yet. @Jclark-ctr, could you please cable these and send me the network port information? I also noticed that one server is still missing; it could be in the data center, mixed up with something else. The service tag is on the packing slip, but I am not sure where the server is. @Jclark-ctr, can you look for this server? The ST is DXV8773.

@Jclark-ctr @Cmjohnson, what's a realistic ETA for completing the work on these servers? It would help us plan the next steps for this quarter.

@LSobanski I will be working on these this week. As long as nothing urgent comes up, it should not take long.

Reminder: do not add IPv6 entries for these hosts (T267043#6692741)

Quick update: all the servers are cabled. Next they need to be added to Netbox, and then the iDRACs set up. These will be ready to be handed over tomorrow.

I ran into some issues along the way, so getting the iDRACs set up is taking a little longer. I am sorry for the delay.

@RobH @Cmjohnson DYV8773 is the ST that is not in Netbox right now.

Thanks @Jclark-ctr, I just sent an email to Dell to figure out what's going on.

Dell provided some docs that show DYV8773 should be onsite, and John confirmed all 25 were received. @Cmjohnson - it probably got mixed in with one of the other install tasks, but it could also just be a Netbox discrepancy. Looking at the Netbox errors, I see that db1156 (in Rack A1) seems to have the incorrect serial number in Netbox. The S/N listed in Netbox wasn't one of the servers listed in the invoice.

https://netbox.wikimedia.org/dcim/devices/2967/

Can you double-check the S/N on the host to see if it's the missing one? If not, the only alternative may be physically counting the R440s in each rack to see whether the count matches how many R440s Netbox lists for that rack. If any rack is off by one, that's probably where the missing server is. Here's a list of how many R440s are in each rack, based on Netbox:

A1 - 6
A2 - 5
A3 - 10
A4 - 4
A5 - 22
A6 - 8
A7 - 3
A8 - 4
B1 - 10
B2 - 9
B3 - 13
B4 - 3
B5 - 18
B6 - 3
B7 - 7
B8 - 17
C1 - 10
C2 - 2
C3 - 18
C4 - 3
C5 - 13
C6 - 6
C7 - 6
C8 - 20
D1 - 27
D2 - 6
D3 - 16
D4 - 3
D5 - 12
D6 - 27
D7 - 5
D8 - 5

Hope this helps track it down.

Thanks,
Willy
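The rack-by-rack comparison suggested above amounts to diffing two per-rack tallies. The Python sketch below shows that comparison; the Netbox numbers are copied from the list above, while the physical tally is entirely hypothetical.

```python
def count_mismatches(netbox: dict[str, int], physical: dict[str, int]) -> dict[str, int]:
    """Return rack -> (physical - netbox) for every rack whose counts differ.

    A positive value means the rack holds more R440s than Netbox records,
    which is where an unrecorded server would be.
    """
    racks = sorted(set(netbox) | set(physical))
    return {
        rack: physical.get(rack, 0) - netbox.get(rack, 0)
        for rack in racks
        if physical.get(rack, 0) != netbox.get(rack, 0)
    }


# Netbox R440 counts for a few racks, copied from the list above.
NETBOX = {"A1": 6, "A2": 5, "B4": 3, "C2": 2}
# Hypothetical on-site tally: one more R440 in B4 than Netbox knows about.
PHYSICAL = {"A1": 6, "A2": 5, "B4": 4, "C2": 2}

print(count_mismatches(NETBOX, PHYSICAL))  # {'B4': 1}
```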

DYT7773 is the correct ST for db1156, which is the last DB server racked, in D3 U12.

The issue I ran into is that db1169 was created in Netbox without a mgmt IP. I didn't notice until I went through and assigned mgmt IPs, so now everything is off by one; I am fixing it now. Good news: @Jclark-ctr found the missing db host.

Cmjohnson updated the task description.
Cmjohnson added a subscriber: RobH.

All of the servers are in the racks, and the iDRACs are set up, including db1169 and db1175. Outstanding items that @RobH will do:

  • raid
  • password
  • dhcpd/netboot.cfg
  • site.pp
  • db1166-db1176 (exceptions: db117[01]) have all had their default passwords changed to the idrac mgmt password.
  • Chris is going to check out db117[01] tomorrow and see what is up with them.

next steps:

  • update idrac firmware
  • update bios firmware
  • setup raid arrays
  • double check bios/raid/idrac settings (post firmware and setup)
  • update puppet with dhcp and netboot info
  • image hosts
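This task writes host groups in a compact bracket notation (db11[56-65], db115[1-5], db117[01]). Judging from how it is used here, `[a-b]` is an inclusive numeric range and `[01]` lists single-digit alternatives (db1170 and db1171). The Python sketch below expands that informal notation under exactly those assumptions and reproduces the password note above (db1166-db1176 minus db117[01]). It is not a reimplementation of cumin or ClusterShell host selection, which may interpret the same strings differently.

```python
import re


def expand(pattern: str) -> list[str]:
    """Expand a single bracket group in a host pattern.

    Assumed convention (informal, as used in this task): "[a-b]" is an
    inclusive numeric range zero-padded to len(a); any other bracket
    content is a set of one-character alternatives.
    """
    match = re.fullmatch(r"([^\[]*)\[([^\]]+)\](.*)", pattern)
    if match is None:
        return [pattern]  # no brackets: a plain host name
    prefix, body, suffix = match.groups()
    if "-" in body:
        start, end = body.split("-")
        values = [str(n).zfill(len(start)) for n in range(int(start), int(end) + 1)]
    else:
        values = list(body)
    return [f"{prefix}{value}{suffix}" for value in values]


# "db1166-db1176 (exceptions: db117[01])" from the password note above.
exceptions = set(expand("db117[01]"))
changed = [host for host in expand("db11[66-76]") if host not in exceptions]
print(changed)  # 9 hosts: db1166-db1169 and db1172-db1176
```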

John was onsite and fixed db117[01] for me, they are now online.

db11[56-65] have had BIOS and iDRAC firmware updates and RAID setup. I've updated the task description checkboxes to account for the need to update the firmware, set up RAID, etc.

I'll continue updating these tomorrow.

Change 657710 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] dhcp entries for new db systems

https://gerrit.wikimedia.org/r/657710

Change 657710 merged by RobH:
[operations/puppet@production] dhcp entries for new db systems

https://gerrit.wikimedia.org/r/657710

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

['db1156.eqiad.wmnet', 'db1157.eqiad.wmnet', 'db1158.eqiad.wmnet', 'db1159.eqiad.wmnet', 'db1160.eqiad.wmnet', 'db1161.eqiad.wmnet', 'db1162.eqiad.wmnet', 'db1163.eqiad.wmnet', 'db1164.eqiad.wmnet', 'db1165.eqiad.wmnet', 'db1166.eqiad.wmnet', 'db1167.eqiad.wmnet', 'db1168.eqiad.wmnet', 'db1169.eqiad.wmnet', 'db1170.eqiad.wmnet', 'db1171.eqiad.wmnet', 'db1172.eqiad.wmnet', 'db1173.eqiad.wmnet', 'db1174.eqiad.wmnet', 'db1175.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202101220030_robh_7835.log.

Completed auto-reimage of hosts:

['db1166.eqiad.wmnet', 'db1164.eqiad.wmnet', 'db1170.eqiad.wmnet', 'db1162.eqiad.wmnet', 'db1160.eqiad.wmnet', 'db1163.eqiad.wmnet', 'db1165.eqiad.wmnet', 'db1169.eqiad.wmnet', 'db1167.eqiad.wmnet', 'db1168.eqiad.wmnet', 'db1174.eqiad.wmnet', 'db1157.eqiad.wmnet', 'db1158.eqiad.wmnet', 'db1156.eqiad.wmnet', 'db1161.eqiad.wmnet']

Of which those FAILED:

['db1159.eqiad.wmnet', 'db1171.eqiad.wmnet', 'db1172.eqiad.wmnet', 'db1173.eqiad.wmnet', 'db1175.eqiad.wmnet']

I've updated the task description checkboxes; the 5 hosts listed above need to have their reimage failures investigated and re-run.
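For a large batch like the one above, it helps to cross-check that every launched host appears in exactly one of the completed or failed lists. This Python sketch just re-types the three lists from the log output above, using short hostnames without the .eqiad.wmnet suffix; it does not parse wmf-auto-reimage logs.

```python
# Hosts passed to the 2021-01-22 batch run: db1156 through db1175.
launched = {f"db{n}" for n in range(1156, 1176)}

completed = {
    "db1166", "db1164", "db1170", "db1162", "db1160", "db1163", "db1165",
    "db1169", "db1167", "db1168", "db1174", "db1157", "db1158", "db1156",
    "db1161",
}
failed = {"db1159", "db1171", "db1172", "db1173", "db1175"}

# Any host in neither list would need a separate look.
unaccounted = launched - completed - failed
print(len(completed), len(failed), sorted(unaccounted))  # 15 5 []
```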

@RobH it looks like db1163 has RAID0 instead of RAID10:

root@db1163:~# megacli -LdPdInfo -a0

Adapter #0

Number of Virtual Disks: 1
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-0, Secondary-0, RAID Level Qualifier-0
Size                : 17.458 TB
Sector Size         : 512
Is VD emulated      : Yes
Parity Size         : 0
State               : Optimal
Strip Size          : 256 KB
Number Of Drives    : 10
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Disk's Default
Encryption Type     : None
Default Power Savings Policy: Controller Defined
Current Power Savings Policy: None
Can spin up in 1 minute: No
LD has drives that support T10 power conditions: No
LD's IO profile supports MAX power savings with cached writes: No
Bad Blocks Exist: No
Is VD Cached: No
Number of Spans: 1
Span: 0 - Number of PDs: 10

See the difference with db1162:

root@db1162:~# megacli -LdPdInfo -a0

Adapter #0

Number of Virtual Disks: 1
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 8.729 TB
Sector Size         : 512
Is VD emulated      : Yes
Mirror Data         : 8.729 TB
State               : Optimal
Strip Size          : 256 KB
Number Of Drives    : 10
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Disk's Default
Encryption Type     : None
Default Power Savings Policy: Controller Defined
Current Power Savings Policy: None
Can spin up in 1 minute: No
LD has drives that support T10 power conditions: No
LD's IO profile supports MAX power savings with cached writes: No
Bad Blocks Exist: No
Is VD Cached: No
Number of Spans: 1
Span: 0 - Number of PDs: 10
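To catch a mis-set RAID level like db1163's before hosts are handed over, the relevant megacli fields can be checked programmatically. This Python sketch parses only the "Key : value" layout shown above, and the inline samples are trimmed copies of those two outputs. The heuristic (Primary-1 across more than two drives means mirrored and striped) is inferred from these two hosts, where db1162 reports exactly half of db1163's 17.458 TB; it is an assumption, not documented megacli semantics.

```python
import re


def parse_ld_info(text: str) -> dict[str, str]:
    """Split "Key : value" lines of `megacli -LdPdInfo` output into a dict.

    Only the first colon on a line separates key from value, so values such
    as "0 (Target Id: 0)" are kept whole. The first occurrence of a key wins.
    """
    fields: dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields.setdefault(key.strip(), value.strip())
    return fields


def size_tb(fields: dict[str, str]) -> float:
    """Parse a size such as "17.458 TB" (only TB is handled in this sketch)."""
    number, unit = fields["Size"].split()
    if unit != "TB":
        raise ValueError(f"unexpected unit: {unit}")
    return float(number)


def looks_like_raid10(fields: dict[str, str]) -> bool:
    """Heuristic from the two outputs above: Primary-1 over more than 2 drives."""
    match = re.match(r"Primary-(\d+)", fields["RAID Level"])
    if match is None:
        raise ValueError("unrecognized RAID Level field")
    return int(match.group(1)) == 1 and int(fields["Number Of Drives"]) > 2


DB1163 = """RAID Level          : Primary-0, Secondary-0, RAID Level Qualifier-0
Size                : 17.458 TB
Number Of Drives    : 10"""
DB1162 = """RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 8.729 TB
Number Of Drives    : 10"""

for name, text in [("db1163", DB1163), ("db1162", DB1162)]:
    info = parse_ld_info(text)
    print(name, looks_like_raid10(info), size_tb(info))
```

Checking the reported size against half the raw capacity is a useful second signal, since a RAID0 array with the same drives reports roughly twice the usable space.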

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

db1159.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202101281831_robh_28091_db1159_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['db1159.eqiad.wmnet']

and were ALL successful.

Acknowledged, fix and reimage in progress!

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

db1163.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202101281907_robh_5300_db1163_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['db1163.eqiad.wmnet']

and were ALL successful.

This is now reimaged with RAID10. I had set up the wrong RAID level initially in my batch of setups; apologies.

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

db1171.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202101281942_robh_9483_db1171_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['db1171.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

['db1172.eqiad.wmnet', 'db1173.eqiad.wmnet', 'db1175.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202101282115_robh_13325.log.

Completed auto-reimage of hosts:

['db1175.eqiad.wmnet', 'db1172.eqiad.wmnet', 'db1173.eqiad.wmnet']

and were ALL successful.

RobH updated the task description.

Thank you Rob - they all look good!