
(Need By: ASAP) rack/setup/install clouddb10[13-20]
Closed, Resolved · Public

Description

This task will track the racking, setup, and OS installation of clouddb10[13-20]
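
The bracketed range in the task title is shorthand for eight hostnames. As a minimal sketch (the helper name `expand_host_range` is an illustration only, not part of any WMF tooling), the notation can be expanded like this:

```python
import re


def expand_host_range(pattern: str) -> list[str]:
    """Expand a single 'prefix[start-end]suffix' range into hostnames.

    Hypothetical helper for illustration; handles exactly one numeric range.
    """
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        return [pattern]  # no range present: return the name unchanged
    prefix, start, end, suffix = match.groups()
    width = len(start)  # preserve zero padding such as [08-12]
    return [
        f"{prefix}{n:0{width}d}{suffix}"
        for n in range(int(start), int(end) + 1)
    ]


hosts = expand_host_range("clouddb10[13-20]")
print(hosts)
```

The range is inclusive on both ends, so the result runs from clouddb1013 through clouddb1020.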

Need By: A date of ASAP due to Q1 OKRs was stated on the parent task, but had not been reviewed, approved, or agreed to by the ops-eqiad engineers at the time this task was created.

Hostname / Racking / Installation Details

Hostnames: clouddb1013, clouddb1014, clouddb1015
Racking Proposal: These should be spread out away from other systems as much as is practical along with the other 5 systems from T258088. There may be limitations to how much they can be spread out because there are 8 of them.
Networking/Subnet/VLAN/IP: 10GbE on the private vlan (with the same access as the existing labsdb1009-11 hosts)
Partitioning/Raid: RAID10 + 256k stripe size, using the normal db raid recipe. @Marostegui will take care of the puppet part (or @Bstorm will copy his work).
OS Distro: Buster

Hostnames: clouddb1016, clouddb1017, clouddb1018, clouddb1019, clouddb1020
Racking Proposal: These should be spread out away from other systems as much as is practical along with the other 3 systems from T257987. There may be limitations to how much they can be spread out because there are 8 of these, total.
Networking/Subnet/VLAN/IP: 10GbE on the private vlan (with the same access as the existing labsdb1009-11 hosts)
Partitioning/Raid: RAID10 + 256k stripe size, using the normal db raid recipe. @Marostegui will take care of the puppet part (or @Bstorm will copy his work).
OS Distro: Buster
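
The racking proposals above ask for the eight hosts to be spread out as much as is practical. One simple way to reason about that is round-robin assignment across the available locations. The sketch below assumes four placeholder rows named `row-A` to `row-D`; these names and the `spread_hosts` helper are hypothetical and do not reflect the actual eqiad layout or the plan that was used.

```python
from collections import defaultdict
from itertools import cycle


def spread_hosts(hosts, locations):
    """Assign hosts to locations round-robin so no location fills up first."""
    placement = defaultdict(list)
    for host, location in zip(hosts, cycle(locations)):
        placement[location].append(host)
    return dict(placement)


hosts = [f"clouddb10{n}" for n in range(13, 21)]  # clouddb1013..clouddb1020
rows = ["row-A", "row-B", "row-C", "row-D"]  # hypothetical row names
plan = spread_hosts(hosts, rows)

for row, assigned in plan.items():
    print(row, assigned)
```

With eight hosts and four locations, each location receives two hosts, which illustrates the limit mentioned above: the hosts cannot be spread further than the number of available locations allows.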

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

clouddb1013:

  • - receive in system on procurement task T257987 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible) https://gerrit.wikimedia.org/r/c/operations/puppet/+/620529
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

clouddb1014:

  • - receive in system on procurement task T257987 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible) https://gerrit.wikimedia.org/r/c/operations/puppet/+/620529
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

clouddb1015:

  • - receive in system on procurement task T257987 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible) https://gerrit.wikimedia.org/r/c/operations/puppet/+/620529
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

clouddb1016:

  • - receive in system on procurement task T257987 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible) https://gerrit.wikimedia.org/r/c/operations/puppet/+/620529
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

clouddb1017:

  • - receive in system on procurement task T257987 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible) https://gerrit.wikimedia.org/r/c/operations/puppet/+/620529
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

clouddb1018:

  • - receive in system on procurement task T257987 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible) https://gerrit.wikimedia.org/r/c/operations/puppet/+/620529
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

clouddb1019:

  • - receive in system on procurement task T257987 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible) https://gerrit.wikimedia.org/r/c/operations/puppet/+/620529
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

clouddb1020:

  • - receive in system on procurement task T257987 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible) https://gerrit.wikimedia.org/r/c/operations/puppet/+/620529
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

Once the system(s) above have had all checkbox steps completed, this task can be resolved.
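
Because every host uses the same checklist, the copy-and-paste step can also be scripted. The following is a rough sketch only: `render_checklist` is a hypothetical helper, and the step texts are shortened paraphrases of the checklist items above rather than the exact wording.

```python
# Shortened paraphrases of the per-host steps listed in this task.
CHECKLIST_STEPS = [
    "receive in system on procurement task & in coupa",
    "rack system with proposed racking plan & update netbox",
    "bios/drac/serial setup/testing",
    "mgmt dns entries added for both asset tag and hostname",
    "network port setup (description, enable, vlan)",
    "production dns entries added",
    "operations/puppet update (install_server at minimum)",
    "OS installation",
    "puppet accept/initial run (with role:spare)",
    "host state in netbox set to staged",
]


def render_checklist(host, steps=CHECKLIST_STEPS):
    """Return the plain-text checklist for one host."""
    lines = [f"{host}:", ""]
    lines += [f"  • - {step}" for step in steps]
    return "\n".join(lines)


hosts = [f"clouddb10{n}" for n in range(13, 21)]
checklists = "\n\n".join(render_checklist(host) for host in hosts)
print(checklists)
```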

Event Timeline

RobH created this task. Aug 14 2020, 3:45 PM
Restricted Application added a project: SRE. Aug 14 2020, 3:45 PM
RobH added a parent task: Unknown Object (Task). Aug 14 2020, 3:45 PM
RobH removed a subscriber: RobH.
Marostegui updated the task description. Aug 17 2020, 5:53 AM

Change 620529 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Allow the installation of clouddb hosts

https://gerrit.wikimedia.org/r/620529

Change 620529 merged by Marostegui:
[operations/puppet@production] mariadb: Allow the installation of clouddb hosts

https://gerrit.wikimedia.org/r/620529

I have merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/620529, so the new hosts will be installed with notifications disabled, RAID10, and the spare role.
From the DC-Ops side, the pending commits are the usual DNS and DHCP entries.

@Bstorm once the hosts are installed we can create a different task to talk about the final puppet roles. I guess it should be role(mariadb::dbstore_multiinstance), which allows multi-instance mysql deployment, and we can take it from there to see what we need to migrate to it from role(labs::db::wikireplica_web) and role(labs::db::wikireplica_analytics).
This can be discussed in a different task.

Marostegui updated the task description. Aug 18 2020, 8:01 AM

Sounds good. I'll make the task! That role will probably get us pretty far for the initial install, I imagine.

wiki_willy moved this task from Backlog to Racking Tasks on the ops-eqiad board. Aug 19 2020, 6:11 PM
Cmjohnson updated the task description. Oct 1 2020, 6:09 PM

@Marostegui and @Bstorm I will be racking these in the next few days. Can you please review your racking plan and confirm that it is correct? Thanks!

Bstorm added a comment. Oct 1 2020, 9:16 PM

Things look right to me, at least. @Marostegui?

Great news, thank you @Cmjohnson!
The racking plan looks good to me. We don't have many requirements other than spreading them across racks and rows as much as possible and installing them with Buster.
Just as a reminder, I left all the puppet recipes and initial roles (spare) ready in puppet, so from the DC-Ops side the only puppet change needed is the DHCP entries.

Sorry for the delay, these are in progress. I expect to have them completed by the end of next week.

Cmjohnson updated the task description. Oct 26 2020, 2:47 PM

Change 636971 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] Add mac addresses to dhcp for clouddb1013-1020

https://gerrit.wikimedia.org/r/636971

Cmjohnson updated the task description. Oct 28 2020, 4:05 PM

Change 636971 merged by Cmjohnson:
[operations/puppet@production] Add mac addresses to dhcp for clouddb1013-1020

https://gerrit.wikimedia.org/r/636971

Cmjohnson reassigned this task from Cmjohnson to RobH. Oct 28 2020, 4:14 PM
Cmjohnson added subscribers: RobH, Cmjohnson.

@RobH these are ready for install; the RAID configuration has been completed. They just need the final OS install.

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

['clouddb1013.eqiad.wmnet', 'clouddb1014.eqiad.wmnet', 'clouddb1015.eqiad.wmnet', 'clouddb1016.eqiad.wmnet', 'clouddb1017.eqiad.wmnet', 'clouddb1018.eqiad.wmnet', 'clouddb1019.eqiad.wmnet', 'clouddb1020.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202010302248_robh_26788.log.

Completed auto-reimage of hosts:

['clouddb1020.eqiad.wmnet']

Of which those FAILED:

['clouddb1020.eqiad.wmnet']
RobH updated the task description. Oct 30 2020, 11:58 PM

All but clouddb1020 are set to staged in netbox and are calling into puppet. I'll investigate what's up with clouddb1020.

Change 638205 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] clouddb1020 mac address update

https://gerrit.wikimedia.org/r/638205

Change 638205 merged by RobH:
[operations/puppet@production] clouddb1020 mac address update

https://gerrit.wikimedia.org/r/638205

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

clouddb1020.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202011030023_robh_25838_clouddb1020_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['clouddb1020.eqiad.wmnet']

Of which those FAILED:

['clouddb1020.eqiad.wmnet']

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

clouddb1020.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202011030026_robh_26206_clouddb1020_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['clouddb1020.eqiad.wmnet']

and were ALL successful.

RobH closed this task as Resolved. Nov 3 2020, 12:51 AM
RobH removed a project: Patch-For-Review.
RobH updated the task description.

All hosts are installed, calling into puppet, and staged in netbox.

Thanks.
RAID (level and stripe size), memory, and CPU look good.