
rack/setup/install (4) CI ganeti nodes
Closed, Resolved | Public | 4294967296 Estimated Story Points

Description

Please note that we ordered 14 total ganeti nodes via procurement task T214088.

The breakdown is as follows: eqiad: (4) ganeti refresh + (4) ganeti nodes for CI (releng) + (6) eqiad: ganeti nodes for expansion

This breakdown makes it seem like 10 of these will go into the general ganeti pool, and 4 will go for CI/release engineering use.

This task will track the racking and setup of the 4 CI/releng ganeti nodes.

Hostname Proposal: <no clue what to call these, but the infrastructure naming conventions wikitech page will need updating if it's a new hostname>

Racking Proposal: Unclear what the racking/service redundancy will be. Do these need to be in 4 different rows or just 4 different racks?

ganeti1019 checklist:

  • - receive in system on procurement task T214088
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned) D3/u36
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

ganeti1020 checklist:

  • - receive in system on procurement task T214088
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned) D5/u39
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

ganeti1021 checklist:

  • - receive in system on procurement task T214088
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned) D8/u39
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

ganeti1022 checklist:

  • - receive in system on procurement task T214088
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned) D8/u38
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox
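
Each checklist above moves a host through three netbox states: planned (at racking), staged (after the initial puppet run), and active (set by the service implementer). A minimal sketch of that order follows. The names NEXT_STATUS and advance are hypothetical and exist only for this illustration; this models the checklist, not the netbox API.

```python
# Status order taken from the checklists above: planned -> staged -> active.
# NEXT_STATUS and advance() are hypothetical helpers, not part of netbox.
NEXT_STATUS = {"planned": "staged", "staged": "active"}


def advance(status: str) -> str:
    """Return the status that follows `status` in the checklist order."""
    if status not in NEXT_STATUS:
        raise ValueError(f"no checklist step follows status {status!r}")
    return NEXT_STATUS[status]


status = "planned"        # set when the host is racked
status = advance(status)  # after OS installation and the initial puppet run
status = advance(status)  # done by the service implementer
print(status)
```

Calling advance on "active" raises ValueError, which mirrors the fact that the checklist ends there.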

Event Timeline

RobH triaged this task as Medium priority. Jul 24 2019, 7:03 PM
RobH created this task.

@akosiaris,

Are you involved in this project, and if so would you be the one to provide details for this? Please comment and assign back to me for followup, thanks!

RobH added a parent task: Unknown Object (Task). Jul 24 2019, 7:04 PM
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.

@RobH, @Cmjohnson

Despite the designation as CI, we will be treating these uniformly as far as ganeti goes (we will be handling the capacity allocations within ganeti), so:

  • Single rack row and one that is not ganeti populated yet. Preferably row D (especially since the nodes in T228924 go to row B) and spread out across the 1G racks.
  • Naming: ganeti1019-ganeti1022

The partman recipe should be partman/ganeti-raid5.cfg

@Jclark-ctr Please rack 4 of the servers from the same ganeti stack in row D and label them as ganeti1019-1022. Please update netbox, and provide access switch port info.

@Jclark-ctr

ganeti1019 10.65.5.114
ganeti1020 10.65.5.115
ganeti1021 10.65.5.116
ganeti1022 10.65.5.117

entered ip addresses in IDRAC and set password ganeti10[19...22]
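
The four management addresses listed above are consecutive, so the mapping can be reproduced with Python's standard ipaddress module. This is only a sketch: the helper name assign_sequential is hypothetical, and the consecutive layout is an observation about this comment, not a stated allocation rule.

```python
import ipaddress


def assign_sequential(first_ip: str, hostnames: list[str]) -> dict[str, str]:
    """Pair each hostname with the next consecutive IPv4 address."""
    base = ipaddress.IPv4Address(first_ip)
    return {name: str(base + offset) for offset, name in enumerate(hostnames)}


# Hostnames ganeti1019 .. ganeti1022, as used in this task.
names = [f"ganeti{n}" for n in range(1019, 1023)]
mgmt = assign_sequential("10.65.5.114", names)

for name, ip in mgmt.items():
    print(name, ip)
```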

Cmjohnson updated the task description.

@Jclark-ctr Can you get the network ports and add them to the task? Thanks.

Cmjohnson updated the task description.

The mgmt passwords have been updated.

Change 566606 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] Adding mac addresses and partman for ganeti1019-1022

https://gerrit.wikimedia.org/r/566606

Change 566606 merged by Cmjohnson:
[operations/puppet@production] Adding mac addresses and partman for ganeti1019-1022

https://gerrit.wikimedia.org/r/566606

The OS has been installed on these 4; the initial puppet run has not been done.

Cmjohnson updated the task description.
Cmjohnson removed a project: ops-eqiad.

@akosiaris I am assigning this to you; the initial puppet run has been completed. I removed the ops-eqiad tag.

Icinga downtime for 1 day, 0:00:00 set by akosiaris@cumin1001 on 14 host(s) and their services with reason: enable VT

ganeti[1009-1022].eqiad.wmnet
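
The bracket expression above is shorthand for a contiguous host range, and the "14 host(s)" count can be checked by expanding it. The following is a minimal sketch of such an expansion; expand_host_range is a hypothetical helper written for this illustration, handles only a single [START-END] group, and is not the parser the tooling itself uses.

```python
import re


def expand_host_range(expr: str) -> list[str]:
    """Expand one 'prefix[START-END]suffix' expression into hostnames."""
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", expr)
    if match is None:
        # No bracket range present: treat the expression as a single host.
        return [expr]
    prefix, start, end, suffix = match.groups()
    width = len(start)  # preserve zero padding, if any
    return [
        f"{prefix}{number:0{width}d}{suffix}"
        for number in range(int(start), int(end) + 1)
    ]


hosts = expand_host_range("ganeti[1009-1022].eqiad.wmnet")
print(len(hosts))
print(hosts[0], hosts[-1])
```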

Change 581018 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: add CI ganeti nodes

https://gerrit.wikimedia.org/r/581018

Change 581018 merged by Dzahn:
[operations/puppet@production] site: add CI ganeti nodes

https://gerrit.wikimedia.org/r/581018

Dzahn added a subscriber: Dzahn.

Enabled remote IPMI on these machines; it was disabled but is needed (see the wikitech how-to).

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

ganeti1019.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202005200910_dzahn_257748_ganeti1019_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

ganeti1020.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202005200911_dzahn_257903_ganeti1020_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

ganeti1021.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202005200912_dzahn_258005_ganeti1021_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

ganeti1022.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202005200914_dzahn_258218_ganeti1022_eqiad_wmnet.log.
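
The reimage log names above appear to follow a fixed pattern: a 12-digit timestamp, the user, a number, and the hostname with dots replaced by underscores. The sketch below splits a name on that assumption. The function parse_log_name and the field name run_id are hypothetical; what the number actually represents is not stated in this task, and the host part is treated as optional because a later log in this timeline omits it.

```python
import re
from datetime import datetime

# Assumed layout: <YYYYMMDDHHMM>_<user>_<number>[_<host with '_' for '.'>].log
LOG_NAME = re.compile(
    r"(?P<ts>\d{12})_(?P<user>[^_]+)_(?P<num>\d+)(?:_(?P<host>.+))?\.log"
)


def parse_log_name(name: str) -> dict:
    """Split a wmf-auto-reimage log file name into its apparent fields."""
    match = LOG_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognised log name: {name!r}")
    host = match["host"].replace("_", ".") if match["host"] else None
    return {
        "started": datetime.strptime(match["ts"], "%Y%m%d%H%M"),
        "user": match["user"],
        "run_id": int(match["num"]),  # meaning of this number is an assumption
        "host": host,
    }


info = parse_log_name("202005200910_dzahn_257748_ganeti1019_eqiad_wmnet.log")
print(info["started"], info["user"], info["host"])
```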

Completed auto-reimage of hosts:

['ganeti1019.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['ganeti1021.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['ganeti1020.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['ganeti1022.eqiad.wmnet']

and were ALL successful.

These 4 hosts have been reimaged and now have RAID5 instead of RAID1 after gerrit:597261 (grep raid5 /proc/mdstat)
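
The grep check mentioned above can also be done programmatically by reading the RAID level of each md array from /proc/mdstat. The sketch below parses text in that format. The SAMPLE contents (device names, block counts) are made up for illustration and are not output from these hosts; raid_levels is a hypothetical helper.

```python
# Hypothetical /proc/mdstat contents, for illustration only.
SAMPLE = """\
Personalities : [raid1] [raid6] [raid5] [raid4]
md0 : active raid5 sda2[0] sdb2[1] sdc2[2] sdd2[3]
      2929293312 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]

unused devices: <none>
"""


def raid_levels(mdstat_text: str) -> dict[str, str]:
    """Map each md array name to its RAID personality (e.g. 'raid5')."""
    levels = {}
    for line in mdstat_text.splitlines():
        parts = line.split()
        # Array lines look like: "md0 : active raid5 sda2[0] sdb2[1] ..."
        if len(parts) >= 3 and parts[0].startswith("md") and parts[1] == ":":
            level = next((p for p in parts[2:] if p.startswith("raid")), "unknown")
            levels[parts[0]] = level
    return levels


print(raid_levels(SAMPLE))
```

On a real host the same function could be fed the text read from /proc/mdstat instead of SAMPLE.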

Script wmf-auto-reimage was launched by akosiaris on cumin1001.eqiad.wmnet for hosts:

['ganeti1019.eqiad.wmnet', 'ganeti1020.eqiad.wmnet', 'ganeti1021.eqiad.wmnet', 'ganeti1022.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202006040851_akosiaris_249195.log.

Completed auto-reimage of hosts:

['ganeti1021.eqiad.wmnet', 'ganeti1019.eqiad.wmnet', 'ganeti1022.eqiad.wmnet', 'ganeti1020.eqiad.wmnet']

and were ALL successful.

Change 602364 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Assign role::ganeti to new ganeti expansion hosts

https://gerrit.wikimedia.org/r/602364

Change 602364 merged by Alexandros Kosiaris:
[operations/puppet@production] Assign role::ganeti to new ganeti expansion hosts

https://gerrit.wikimedia.org/r/602364

Change 602379 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] ganeti: ganeti[12]0{09..24}.eqiad|codfw.wmnet to hieradata

https://gerrit.wikimedia.org/r/602379

Change 602379 merged by Alexandros Kosiaris:
[operations/puppet@production] ganeti: ganeti[12]0{09..24}.eqiad|codfw.wmnet to hieradata

https://gerrit.wikimedia.org/r/602379

akosiaris updated the task description.
akosiaris changed the point value for this task from 0 to 4294967296.

The hardware machines are now in full production mode, ready to receive VMs. Finally resolving.

Change 603932 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/software/spicerack@master] Add new eqiad/codfw Ganeti rows

https://gerrit.wikimedia.org/r/603932

Change 603932 merged by Ayounsi:
[operations/software/spicerack@master] Add new eqiad/codfw Ganeti rows

https://gerrit.wikimedia.org/r/603932