
(Need By: 2021-03-31) rack/setup/install snapshot101[1-5]
Closed, Resolved (Public)

Description

This task will track the racking, setup, and OS installation of snapshot101[1-5]

Hostname / Racking / Installation Details

Hostnames: snapshot101[1-5]
Racking Proposal: Please try to avoid putting them in the same racks with other snapshot hosts. But they can go in the same rack with a dumpsdata host or a labstore host. snapshot100[567] will go away once these are fully online, so they can share racks with those hosts on a 1:1 basis.
Networking/Subnet/VLAN/IP: These are all single-port 1G connections.
Partitioning/Raid: We can set these up like snapshot1008-1010, with software RAID 1 across two 480GB SSDs, using the same partman recipe combo: partman/standard.cfg partman/raid1-2dev.cfg
OS Distro: Please put buster on them.:-)
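The bracket shorthand snapshot101[1-5] used throughout this task stands for five hosts. As a minimal sketch, assuming only a single one-digit range per expression, it can be expanded like this. The helper name expand_hosts is hypothetical and is not part of any WMF tooling; the .eqiad.wmnet suffix is taken from the reimage log entries in the timeline.

```python
import re


def expand_hosts(expr: str) -> list[str]:
    """Expand one bracket range such as 'snapshot101[1-5]' into hostnames.

    Hypothetical helper for illustration: it handles a single
    one-digit range only and returns anything else unchanged.
    """
    match = re.fullmatch(
        r"(?P<prefix>[^\[\]]*)\[(?P<lo>\d)-(?P<hi>\d)\](?P<suffix>[^\[\]]*)",
        expr,
    )
    if match is None:
        return [expr]  # no bracket range: treat as a single hostname
    lo, hi = int(match["lo"]), int(match["hi"])
    return [f"{match['prefix']}{n}{match['suffix']}" for n in range(lo, hi + 1)]


if __name__ == "__main__":
    for host in expand_hosts("snapshot101[1-5].eqiad.wmnet"):
        print(host)
```

The five names it produces match the host list passed to wmf-auto-reimage later in this task.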

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

snapshot1011:

  • - receive in system on procurement task T271167 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware updates
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm). https://gerrit.wikimedia.org/r/682765
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

snapshot1012:

  • - receive in system on procurement task T271167 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware updates
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm). https://gerrit.wikimedia.org/r/682765
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

snapshot1013:

  • - receive in system on procurement task T271167 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware updates
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm). https://gerrit.wikimedia.org/r/682765
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

snapshot1014:

  • - receive in system on procurement task T271167 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware updates
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm). https://gerrit.wikimedia.org/r/682765
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

snapshot1015:

  • - receive in system on procurement task T271167 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware updates
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm). https://gerrit.wikimedia.org/r/682765
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

Once the system(s) above have had all checkbox steps completed, this task can be resolved.

Event Timeline

RobH added a parent task: Unknown Object (Task).
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH removed a subscriber: RobH.
RobH mentioned this in Unknown Object (Task). Jan 20 2021, 5:12 PM

It is extremely likely you'll be able to install directly with buster once these arrive. I should know within a week. The first reimaged host will be running dumps starting tomorrow, and assuming that all goes well, I'll update this task to reflect the change.

We have many wiki dump runs completed without problems. So please do go ahead with buster on these new servers. Thanks!

Jclark-ctr added a subscriber: Jclark-ctr.

snapshot1011 A1 U7 PORT19 ID 1852
snapshot1012 A3 U36 PORT30 ID 1932
snapshot1013 B3 U24 PORT11 ID 2614
snapshot1014 C5 U19 PORT25 ID 1007
snapshot1015 D3 U39 PORT17 ID 3901
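The racking proposal asks that snapshot hosts not share a rack. A minimal sketch of checking the placements above against that request follows. The function name racks_with_multiple_hosts is hypothetical, and the check covers only the five new hosts, since the racks of the existing snapshot hosts are not listed in this task.

```python
from collections import Counter

# Host and rack columns copied from the racking comment above.
PLACEMENTS = """\
snapshot1011 A1
snapshot1012 A3
snapshot1013 B3
snapshot1014 C5
snapshot1015 D3
"""


def racks_with_multiple_hosts(lines):
    """Return the sorted racks that hold more than one of the listed hosts."""
    racks = Counter(line.split()[1] for line in lines if line.strip())
    return sorted(rack for rack, count in racks.items() if count > 1)


if __name__ == "__main__":
    shared = racks_with_multiple_hosts(PLACEMENTS.splitlines())
    print("shared racks:", shared or "none")
```

Only the first two whitespace-separated fields of each line are read, so the unit, port, and ID columns can be left on the input lines without changing the result.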

@Cmjohnson - can you provide an update on this one? This is one of the priority installs. Thanks, Willy

Cmjohnson updated the task description.
Cmjohnson added subscribers: RobH, Cmjohnson.

Assigning to @RobH for installs

Change 682765 had a related patch set uploaded (by RobH; author: RobH):

[operations/puppet@production] snapshot10[1-5] setup info

https://gerrit.wikimedia.org/r/682765

Change 682765 merged by RobH:

[operations/puppet@production] snapshot10[1-5] setup info

https://gerrit.wikimedia.org/r/682765

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

['snapshot1011.eqiad.wmnet', 'snapshot1012.eqiad.wmnet', 'snapshot1013.eqiad.wmnet', 'snapshot1014.eqiad.wmnet', 'snapshot1015.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202104262225_robh_2387.log.

Hey, this looks almost done, am I reading that right? :-) :-)

These are failing to partition correctly during the initial imaging. I ran out of bandwidth troubleshooting this yesterday evening, and will return to it today.

Puppet is updated, just something in netboot isn't pattern matching these correctly.
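One quick way to rule a hostname pattern in or out is to test it against the five hostnames directly. The sketch below uses Python's fnmatch module as a stand-in for shell-style glob matching. It assumes the netboot configuration selects partman recipes with glob patterns of this kind, and the patterns shown are examples taken from hostnames in this task, not the contents of the actual file.

```python
from fnmatch import fnmatchcase

NEW_HOSTS = [f"snapshot101{n}" for n in range(1, 6)]


def unmatched(pattern, hosts):
    """Return the hosts that the glob pattern fails to match."""
    return [host for host in hosts if not fnmatchcase(host, pattern)]


if __name__ == "__main__":
    # Example patterns only; the real netboot entries are not quoted in this task.
    for pattern in ("snapshot101[1-5]", "snapshot100[567]"):
        missing = unmatched(pattern, NEW_HOSTS)
        print(pattern, "misses:", missing or "nothing")
```

A pattern that misses any of the new hosts would leave those hosts without the intended recipe, which is consistent with an installer stopping at a manual partitioning menu.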

Ah ok! I didn't mean to be hasty, just saw the reimaging script runs and got excited :-)

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

['snapshot1011.eqiad.wmnet', 'snapshot1012.eqiad.wmnet', 'snapshot1013.eqiad.wmnet', 'snapshot1014.eqiad.wmnet', 'snapshot1015.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202104272338_robh_7476.log.

So these are chugging along just fine, and didn't fall to the manual partition menu. I suspect my change didn't hit apt server, just install host, so dhcp was updated but not netboot to live servers. I went away after an hour (a day really) and now it works.

Completed auto-reimage of hosts:

['snapshot1011.eqiad.wmnet', 'snapshot1012.eqiad.wmnet', 'snapshot1013.eqiad.wmnet', 'snapshot1014.eqiad.wmnet', 'snapshot1015.eqiad.wmnet']

and were ALL successful.

RobH updated the task description.

@ArielGlenn

These are all yours!

Volans reopened this task as Open. Edited May 24 2021, 7:44 PM
Volans raised the priority of this task from Medium to High.
Volans added a subscriber: Volans.

Re-opening because snapshot1014 and snapshot1015 don't have any role applied and Puppet has been broken for 25 days.

They have been on hold pending the availability of a colleague who would be learning about their setup. I hope to find out more about that person's availability this week.

RobH closed this task as Resolved. Edited May 24 2021, 9:41 PM

Keeping a racking task in DC ops open for this isn't going to work, as this counts against our SLA metrics as part of our OKRs.

> Re-opening because snapshot1014 and snapshot1015 don't have any role applied and Puppet has been broken for 25 days.

> They have been on hold pending the availability of a colleague who would be learning about their setup. I hope to find out more about that person's availability this week.

Keeping a racking task open in DC-Ops after it's done isn't an option for us, as we have OKR-specific SLAs that track how long racking tasks are open. For that same reason, removing the projects from a racking task breaks/hurts our metrics as well. Racking tasks basically need to stay just that from creation to resolution: DC-Ops-specific racking tasks. Anything outside DC-Ops that comes after racking/setup needs to be handled via a different task.

Once a server is racked, role(insetup) is typically applied, and that was done for these via https://gerrit.wikimedia.org/r/c/operations/puppet/+/682765/. At some point since then, it seems the role was removed and/or the hosts were reimaged, breaking the Puppet runs.

I've opened T283545 to track the error, and I've resolved this task.