
(Need By: ASAP) rack/setup/install clouddb10[13-20]
Closed, Resolved (Public)

Description

This task will track the racking, setup, and OS installation of clouddb10[13-20]
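The bracket notation in the task title is shorthand for eight consecutive hostnames. As a minimal sketch of how that range expands (the helper name `expand_host_range` is hypothetical and is not part of any ops tooling; the `eqiad.wmnet` domain is the one that appears in the reimage logs in this task):

```python
import re


def expand_host_range(pattern: str, domain: str = "") -> list[str]:
    """Expand a pattern such as 'clouddb10[13-20]' into individual hostnames.

    Hypothetical helper for illustration; it supports exactly one
    [start-end] group and preserves zero padding of the start value.
    """
    match = re.fullmatch(
        r"(?P<prefix>[^\[\]]*)\[(?P<start>\d+)-(?P<end>\d+)\](?P<suffix>[^\[\]]*)",
        pattern,
    )
    if match is None:
        # No bracket group: treat the pattern as a single hostname.
        names = [pattern]
    else:
        start, end = int(match["start"]), int(match["end"])
        if start > end:
            raise ValueError(f"empty range in {pattern!r}")
        width = len(match["start"])
        names = [
            f"{match['prefix']}{n:0{width}d}{match['suffix']}"
            for n in range(start, end + 1)
        ]
    # Append the domain only when one was requested.
    return [f"{name}.{domain}" if domain else name for name in names]


hosts = expand_host_range("clouddb10[13-20]", domain="eqiad.wmnet")
```

Here `hosts` holds the eight fully qualified names from `clouddb1013.eqiad.wmnet` through `clouddb1020.eqiad.wmnet`.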

Need By: A date of ASAP due to Q1 OKRs was stated on parent task, but has not been reviewed/approved/agreed to by the ops-eqiad engineers at the time of this task creation.

Hostname / Racking / Installation Details

Hostnames: clouddb1013, clouddb1014, clouddb1015
Racking Proposal: These should be spread out away from other systems as much as is practical along with the other 5 systems from T258088. There may be limitations to how much they can be spread out because there are 8 of them.
Networking/Subnet/VLAN/IP: 10GbE on the private vlan (with the same access as the existing labsdb1009-11 hosts)
Partitioning/Raid: RAID10 + 256k stripe size, using the normal db raid recipe. @Marostegui will take care of the puppet part (or @Bstorm will copy his work).
OS Distro: Buster

Hostnames: clouddb1016, clouddb1017, clouddb1018, clouddb1019, clouddb1020
Racking Proposal: These should be spread out away from other systems as much as is practical along with the other 3 systems from T257987. There may be limitations to how much they can be spread out because there are 8 of these, total.
Networking/Subnet/VLAN/IP: 10GbE on the private vlan (with the same access as the existing labsdb1009-11 hosts)
Partitioning/Raid: RAID10 + 256k stripe size, using the normal db raid recipe. @Marostegui will take care of the puppet part (or @Bstorm will copy his work).
OS Distro: Buster

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

clouddb1013:

  • - receive in system on procurement task T257987 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible) https://gerrit.wikimedia.org/r/c/operations/puppet/+/620529
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

clouddb1014:

  • - receive in system on procurement task T257987 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible) https://gerrit.wikimedia.org/r/c/operations/puppet/+/620529
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

clouddb1015:

  • - receive in system on procurement task T257987 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible) https://gerrit.wikimedia.org/r/c/operations/puppet/+/620529
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

clouddb1016:

  • - receive in system on procurement task T257987 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible) https://gerrit.wikimedia.org/r/c/operations/puppet/+/620529
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

clouddb1017:

  • - receive in system on procurement task T257987 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible) https://gerrit.wikimedia.org/r/c/operations/puppet/+/620529
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

clouddb1018:

  • - receive in system on procurement task T257987 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible) https://gerrit.wikimedia.org/r/c/operations/puppet/+/620529
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

clouddb1019:

  • - receive in system on procurement task T257987 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible) https://gerrit.wikimedia.org/r/c/operations/puppet/+/620529
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

clouddb1020:

  • - receive in system on procurement task T257987 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible) https://gerrit.wikimedia.org/r/c/operations/puppet/+/620529
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

Once the system(s) above have had all checkbox steps completed, this task can be resolved.
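The resolution rule above (every checkbox ticked on every host) can be sketched as a small check. This is only an illustration: the step names are abbreviated from the checklist in this description, the function names are hypothetical, and the example state is made up to show the behaviour rather than taken from the task.

```python
# Step names abbreviated from the per-host checklist in the task description.
CHECKLIST_STEPS = [
    "receive in system",
    "rack system and update netbox",
    "bios/drac/serial setup/testing",
    "mgmt dns entries",
    "network port setup",
    "production dns entries",
    "operations/puppet update",
    "OS installation",
    "puppet accept/initial run",
    "netbox state set to staged",
]


def incomplete_steps(progress: dict[str, set[str]]) -> dict[str, list[str]]:
    """Return, per host, the steps not yet ticked, in checklist order."""
    pending = {}
    for host, done in progress.items():
        missing = [step for step in CHECKLIST_STEPS if step not in done]
        if missing:
            pending[host] = missing
    return pending


def can_resolve(progress: dict[str, set[str]]) -> bool:
    """The task can be resolved only when no host has a pending step."""
    return bool(progress) and not incomplete_steps(progress)


# Illustrative state only: every host finished except clouddb1020,
# which still needs its OS install and the steps that follow it.
progress = {f"clouddb10{n}": set(CHECKLIST_STEPS) for n in range(13, 21)}
progress["clouddb1020"] -= {
    "OS installation",
    "puppet accept/initial run",
    "netbox state set to staged",
}
```

With this example state, `can_resolve(progress)` is false and `incomplete_steps(progress)` lists the three remaining steps for clouddb1020 only.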

Event Timeline

RobH added a parent task: Unknown Object (Task). Aug 14 2020, 3:45 PM
RobH removed a subscriber: RobH.

Change 620529 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Allow the installation of clouddb hosts

https://gerrit.wikimedia.org/r/620529

Change 620529 merged by Marostegui:
[operations/puppet@production] mariadb: Allow the installation of clouddb hosts

https://gerrit.wikimedia.org/r/620529

I have merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/620529, so the new hosts will get installed with notifications disabled, RAID10 and spare role.
From the DC-Ops side, the pending commits are the usual DNS and DHCP entries.

@Bstorm once the hosts are installed we can create a separate task to talk about the final puppet roles. I guess it should be role(mariadb::dbstore_multiinstance), which allows multi-instance mysql deployment, and from there we can see what we need to migrate to it from role(labs::db::wikireplica_web) and role(labs::db::wikireplica_analytics).
This can be discussed in a different task.

Sounds good. I'll make the task! That role will probably get us pretty far for the initial install, I imagine.

@Marostegui and @Bstorm I will be racking these in the next few days. Can you please review your racking plan and confirm that it is correct? Thanks!

Things look right to me, at least. @Marostegui?

Great news, thank you @Cmjohnson!
The racking plan looks good to me. We don't have many requirements other than spreading them across racks and rows as much as possible and installing them with Buster.
Just as a reminder, I left all the puppet recipes and initial roles (spare) ready in puppet, so from the DC-Ops side the only puppet change needed is the one for the dhcp entries.

Sorry for the delay, these are in progress. I expect to have them completed by the end of next week.

Change 636971 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] Add mac addresses to dhcp for clouddb1013-1020

https://gerrit.wikimedia.org/r/636971

Change 636971 merged by Cmjohnson:
[operations/puppet@production] Add mac addresses to dhcp for clouddb1013-1020

https://gerrit.wikimedia.org/r/636971

Cmjohnson added subscribers: RobH, Cmjohnson.

@RobH these are ready for install, the raid configuration has been completed. Just need to do the final OS install.

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

['clouddb1013.eqiad.wmnet', 'clouddb1014.eqiad.wmnet', 'clouddb1015.eqiad.wmnet', 'clouddb1016.eqiad.wmnet', 'clouddb1017.eqiad.wmnet', 'clouddb1018.eqiad.wmnet', 'clouddb1019.eqiad.wmnet', 'clouddb1020.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202010302248_robh_26788.log.

Completed auto-reimage of hosts:

['clouddb1020.eqiad.wmnet']

Of which those FAILED:

['clouddb1020.eqiad.wmnet']

All but clouddb1020 are set to staged in netbox and calling into puppet. I'll investigate what's up with clouddb1020.
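The wmf-auto-reimage comments in this timeline follow a fixed shape: a "Completed auto-reimage of hosts:" list, followed either by an "Of which those FAILED:" list or by "and were ALL successful." As a rough sketch, assuming only the layout seen in these comments (the function name is hypothetical and this is not part of wmf-auto-reimage itself), the hosts that need another attempt can be pulled out of a pasted report like this:

```python
import ast


def parse_reimage_report(text: str) -> tuple[list[str], list[str]]:
    """Return (completed, failed) host lists from a pasted report.

    Assumes each header line is followed by a Python-style list literal,
    as in the comments on this task. A report without a FAILED section
    yields an empty failed list.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    completed: list[str] = []
    failed: list[str] = []
    # Pair each line with the one after it, so a header always has a payload.
    for header, payload in zip(lines, lines[1:]):
        if header.startswith("Completed auto-reimage of hosts"):
            completed = ast.literal_eval(payload)
        elif header.startswith("Of which those FAILED"):
            failed = ast.literal_eval(payload)
    return completed, failed


report = """
Completed auto-reimage of hosts:

['clouddb1020.eqiad.wmnet']

Of which those FAILED:

['clouddb1020.eqiad.wmnet']
"""
completed, failed = parse_reimage_report(report)
```

For the report quoted here, `failed` contains only `clouddb1020.eqiad.wmnet`, which is the host that gets reimaged again in the following comments.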

Change 638205 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] clouddb1020 mac address update

https://gerrit.wikimedia.org/r/638205

Change 638205 merged by RobH:
[operations/puppet@production] clouddb1020 mac address update

https://gerrit.wikimedia.org/r/638205

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

clouddb1020.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202011030023_robh_25838_clouddb1020_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['clouddb1020.eqiad.wmnet']

Of which those FAILED:

['clouddb1020.eqiad.wmnet']

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

clouddb1020.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202011030026_robh_26206_clouddb1020_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['clouddb1020.eqiad.wmnet']

and were ALL successful.

RobH removed a project: Patch-For-Review.
RobH updated the task description.

all hosts installed, calling into puppet, staged in netbox.

Thanks.
RAID (level and stripe size), memory and CPU look good.