rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet
Closed, Resolved | Public | 4294967296 Estimated Story Points

Description

Please note that we ordered 14 total ganeti nodes via procurement task T214088.

The breakdown is as follows: eqiad: (4) ganeti refresh + (4) ganeti nodes for CI (releng) + (6) eqiad: ganeti nodes for expansion

This breakdown makes it seem like 10 of these will go into the general ganeti pool, and 4 will go for CI/release engineering use.

This task will track the racking and setup of the 10 general ganeti nodes. For this, 4 of them are replacing ganeti100[1-4], while the rest merely expand the service group.

Hostname Proposal: ganeti1009+
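
As an illustration of the proposed naming, the ten general-pool hosts can be enumerated and checked programmatically. This is only a sketch: the helper name `general_pool_hosts` and the pattern `HOST_RE` are made up for the example, and the 1009 to 1018 range and the `eqiad.wmnet` suffix are taken from this task.

```python
import re

# Domain suffix as used in the task title.
DOMAIN = "eqiad.wmnet"


def general_pool_hosts(first=1009, last=1018):
    """Return the FQDNs of the sequentially numbered general-pool hosts."""
    return [f"ganeti{n}.{DOMAIN}" for n in range(first, last + 1)]


# Hypothetical strict pattern for exactly ganeti1009 through ganeti1018.
HOST_RE = re.compile(r"ganeti10(09|1[0-8])\.eqiad\.wmnet")

if __name__ == "__main__":
    hosts = general_pool_hosts()
    # Every generated name should satisfy the strict pattern.
    assert all(HOST_RE.fullmatch(h) for h in hosts)
    for host in hosts:
        print(host)
```

A strict pattern like this can help catch a host outside the intended range (for example the CI nodes numbered from 1019) before it ends up in a selector.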

Racking Proposal: 4 of these nodes can share racks with ganeti100[1-4], which are in C4 and C7 (10G racks with 1G hosts, so best to avoid 10G racks for these new 1G hosts as well). We also have existing ganeti nodes in A4, A3 (2), A6, and A5. So place the new nodes in any 1G rack that isn't A3, A4, A5, or A6. Ideally we spread some of these out to rows C and D.
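
The exclusion rule above can be checked mechanically against a placement. The sketch below uses the rack locations recorded in the per-host checklists of this task; the function names are invented for the example, and it does not check whether a rack is 1G or 10G, since that information is not listed here.

```python
from collections import Counter

# Racks that already hold ganeti nodes, per the racking proposal.
EXCLUDED_RACKS = {"A3", "A4", "A5", "A6"}

# Rack per host, copied from the per-host checklists in this task.
PLACEMENT = {
    "ganeti1009": "C3",
    "ganeti1010": "C5",
    "ganeti1011": "C6",
    "ganeti1012": "C8",
    "ganeti1013": "B3",
    "ganeti1014": "B3",
    "ganeti1015": "B5",
    "ganeti1016": "B5",
    "ganeti1017": "B6",
    "ganeti1018": "B8",
}


def violations(placement, excluded):
    """Return the hosts that were placed in an excluded rack."""
    return sorted(host for host, rack in placement.items() if rack in excluded)


def hosts_per_row(placement):
    """Count hosts per row, where the row is the first letter of the rack."""
    return Counter(rack[0] for rack in placement.values())


if __name__ == "__main__":
    print("violations:", violations(PLACEMENT, EXCLUDED_RACKS))
    print("per row:", dict(hosts_per_row(PLACEMENT)))
```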

ganeti1009:

  • - receive in system on procurement task T214088
  • - rack system C3/u28 & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing - ipmi over lan is not enabled, fix this!
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

ganeti1010:

  • - receive in system on procurement task T214088
  • - rack system C5/u8 & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

ganeti1011:

  • - receive in system on procurement task T214088
  • - rack system C6/u37 & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

ganeti1012:

  • - receive in system on procurement task T214088
  • - rack system C8/u20 & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

ganeti1013:

  • - receive in system on procurement task T214088
  • - rack system B3/u14 & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing - THE MGMT PASSWORD DID NOT WORK, MUST BE FIXED
  • - update the netbox entry to show the SERVICE TAG for the serial, not the express service code. This is important!
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

ganeti1014:

  • - receive in system on procurement task T214088
  • - rack system B3/u15 & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing - THE MGMT PASSWORD DID NOT WORK, MUST BE FIXED
  • - update the netbox entry to show the SERVICE TAG for the serial, not the express service code. This is important!
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

ganeti1015:

  • - receive in system on procurement task T214088
  • - rack system B5/u21 & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing - THE MGMT PASSWORD DID NOT WORK, MUST BE FIXED
  • - update the netbox entry to show the SERVICE TAG for the serial, not the express service code. This is important!
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

ganeti1016:

  • - receive in system on procurement task T214088
  • - rack system B5/u22 & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing - THE MGMT PASSWORD DID NOT WORK, MUST BE FIXED
  • - update the netbox entry to show the SERVICE TAG for the serial, not the express service code. This is important!
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

ganeti1017:

  • - receive in system on procurement task T214088
  • - rack system B6/u31 & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing - THE MGMT PASSWORD DID NOT WORK, MUST BE FIXED
  • - update the netbox entry to show the SERVICE TAG for the serial, not the express service code. This is important!
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

ganeti1018:

  • - receive in system on procurement task T214088
  • - rack system B8/u29 & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing - THE MGMT PASSWORD DID NOT WORK, MUST BE FIXED
  • - update the netbox entry to show the SERVICE TAG for the serial, not the express service code. This is important!
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

Event Timeline

Entered IP addresses in iDRAC and set password on ganeti10([09]|1[0-8])

@Cmjohnson Idrac and bios settings finished

host_name port
ganeti1009 23
ganeti1010 7
ganeti1011 41
ganeti1012 1
ganeti1013 14
ganeti1014 15
ganeti1015 28
ganeti1016 29
ganeti1017 30
ganeti1018 28

Ok, there are a few issues here.

Please START UPDATING THE CHECKLISTS AS YOU SET UP SERVERS. I don't want to keep having to repeat this across multiple tasks, but if you set up a system, CHECK OFF THE BOXES.

Having to parse every single comment to see who did what gets cumbersome and pointless over time.

There are a few issues with the racking and entering of these into netbox:

  • These were entered using the Express Service Code, not the Service Tag, so every single one is generating accounting report errors.
    • @Jclark-ctr: In the future, use the SERVICE TAG for the Serial Number on Dell systems in netbox.
  • These were racked and set up using the old mgmt password, but this was before the password shift.
    • I've updated ganeti1009-1012.
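
For reference, the two identifiers are related: to my understanding, a Dell Express Service Code is the Service Tag read as a base-36 number and written in decimal. The sketch below converts between the two under that assumption. The function names and the tag `ABC1234` are hypothetical, and any converted value should be verified against the system itself before netbox is updated.

```python
import string

# Base-36 digits: 0-9 followed by A-Z.
ALPHABET = string.digits + string.ascii_uppercase


def service_tag_to_express_code(tag: str) -> int:
    """Interpret a Service Tag as a base-36 number (assumed Dell scheme)."""
    return int(tag.strip().upper(), 36)


def express_code_to_service_tag(code: int, width: int = 0) -> str:
    """Convert a decimal Express Service Code back to base 36.

    Leading zeros cannot be recovered from the number alone, so pass
    `width` (for example 7) to left-pad the result when the tag length
    is known.
    """
    if code < 0:
        raise ValueError("express service code must be non-negative")
    chars = []
    while code:
        code, remainder = divmod(code, 36)
        chars.append(ALPHABET[remainder])
    tag = "".join(reversed(chars)) or "0"
    return tag.rjust(width, "0")


if __name__ == "__main__":
    example_tag = "ABC1234"  # hypothetical tag, not a real system
    code = service_tag_to_express_code(example_tag)
    print(example_tag, "->", code, "->", express_code_to_service_tag(code))
```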

I am unable to log in as root to the mgmt interface on the following hosts (both the old and the new mgmt password fail). I need these to be manually rebooted via crash cart into the iDRAC BIOS and have the passwords set a second time (it seems they were not set correctly).

  • ganeti1013
  • ganeti1014
  • ganeti1015
  • ganeti1016
  • ganeti1017
  • ganeti1018

I've updated the checklists for EVERY SINGLE SERVER above.

@Jclark-ctr or @Cmjohnson:

  • The service tag issue has to be resolved on ganeti101[3-8]. Since I cannot remotely connect, I cannot poll their service tags and update netbox. Please do this ASAP as it causes issues on reports.
  • The password needs to be set and TESTED on every single one of these systems when they are racked. Then checked off. Since ganeti101[3-8] aren't working (no passwords work to log in), they need to be crash carted and have their passwords set to the new mgmt password. Do not use the old one!
  • As you set the mgmt info, you need to test it to ensure it works.

RobH updated the task description.

Change 552812 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding dns entries new ganeti hosts

https://gerrit.wikimedia.org/r/552812

Change 552812 merged by Cmjohnson:
[operations/dns@master] Adding dns entries new ganeti hosts

https://gerrit.wikimedia.org/r/552812

@Jclark-ctr can you fix the mgmt password for these please?

The row_A ganeti group is running low on memory capacity (please see T239151#5707691). Should we allocate a few of these new hosts to expand the existing row_A ganeti group?

taavi renamed this task from rack/setup/install ganeti10([09]|1[0-8[).eqiad.wmnet to rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet. Jan 8 2020, 4:18 PM
Cmjohnson added a subscriber: Jclark-ctr.

The mgmt passwords have been updated. Expect these to be ready this week

The row_A ganeti group is running low on memory capacity (please see T239151#5707691). Should we allocate a few of these new hosts to expand the existing row_A ganeti group?

Not necessarily. With row_C being refreshed with a 3x increase in disk capacity, and row_B or row_D being added, we should be able to rebalance a few of the workloads across the cluster (e.g. etcd hosts, which should be better spread across our rack rows) and still have some room in the new row D for temporary requests such as T239151#5707691. Per Chris's comment, we should be able to pull that off next week.

Change 568652 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] Add ganeti1009|101[1-8] dhcpd file and netboot.cfg

https://gerrit.wikimedia.org/r/568652

Change 568652 merged by Cmjohnson:
[operations/puppet@production] Add ganeti1009|101[1-8] dhcpd file and netboot.cfg

https://gerrit.wikimedia.org/r/568652

All but ganeti1017 are ready for handoff. I am not sure what is going on with this server; I cannot get any output on the console. This needs to be checked on-site.

Mentioned in SAL (#wikimedia-operations) [2020-02-03T13:31:32Z] <moritzm> rebooting ganeti1009 - ganeti1022 to pick up microcode update T228924

Mentioned in SAL (#wikimedia-operations) [2020-02-03T22:13:31Z] <mutante> rebooting ganeti1010, ganeti1011 and other new ganeti machines to pickup microcode mitigations, for some reason the previous reboots did not do it. rescheduled service check on icinga for ganeti1010 and now it recovered (T228924)

For some reason the previous reboots did not fix it, but the second attempt did. The microcode alerts have recovered now after rebooting the hosts.

Cmjohnson removed a project: ops-eqiad.

@Dzahn ganeti1017 is now ready. I am assigning this to you and removing ops-eqiad tag.

Thanks @Cmjohnson !

I rebooted ganeti1017 one more time because that fixes the microcode mitigation icinga alerts

Change 570390 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: add new ganeti hosts for refresh/expansion with spare role

https://gerrit.wikimedia.org/r/570390

Change 570390 merged by Alexandros Kosiaris:
[operations/puppet@production] site: add new ganeti hosts for refresh/expansion with spare role

https://gerrit.wikimedia.org/r/570390

Change 576406 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] add ganeti role to new eqiad ganeti expansion servers

https://gerrit.wikimedia.org/r/576406

Change 576887 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] netboot/partman: add ganeti101[3-8] and fix typo in selector

https://gerrit.wikimedia.org/r/576887

Change 579017 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: let new ganeti nodes and logstash1008 use role(insetup)

https://gerrit.wikimedia.org/r/579017

Change 579017 merged by Dzahn:
[operations/puppet@production] site: let new ganeti nodes and logstash1008 use role(insetup)

https://gerrit.wikimedia.org/r/579017

ganeti1009 is set as Staged in Netbox and missing in PuppetDB, so it's reported by the Netbox report.
What should be the correct state for now?

ganeti1009 is set as Staged in Netbox and missing in PuppetDB, so it's reported by the Netbox report.
What should be the correct state for now?

I've just rerun puppet over there (I had it disabled for testing). It should be OK now.

Change 576887 abandoned by Dzahn:
netboot/partman: add new ganeti servers and fix typo in selector

Reason:
already covered by existing regexes

https://gerrit.wikimedia.org/r/576887

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

ganeti1009.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202005191327_dzahn_97521_ganeti1009_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['ganeti1009.eqiad.wmnet']

Of which those FAILED:

['ganeti1009.eqiad.wmnet']

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

ganeti1010.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202005200804_dzahn_237896_ganeti1010_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

ganeti1011.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202005200818_dzahn_240643_ganeti1011_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

ganeti1012.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202005200819_dzahn_240835_ganeti1012_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['ganeti1010.eqiad.wmnet']

and were ALL successful.

@RobH Remote IPMI was disabled on these hosts, which popped up when I tried to run the reimage cookbook (to change the software RAID level from 1 to 5) and it failed.

The command sudo ipmi-config --section=Lan_Channel --key-pair="Lan_Channel:Volatile_Access_Mode=Always_Available" --key-pair="Lan_Channel:Non_Volatile_Access_Mode=Always_Available" --diff showed it.

The command sudo ipmi-config --section=Lan_Channel --key-pair="Lan_Channel:Volatile_Access_Mode=Always_Available" --key-pair="Lan_Channel:Non_Volatile_Access_Mode=Always_Available" --commit fixed it.

I got these from https://wikitech.wikimedia.org/wiki/Management_Interfaces#Is_remote_IPMI_enabled?
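
The check-then-fix sequence above could be wrapped in a small helper so the same key pairs are used for both steps. This is only a sketch: the helper names are made up, the `ipmi-config` options are copied from the two commands quoted in this comment, and nothing is executed unless `run_ipmi_lan` is called on a host that has `ipmi-config` and sudo available.

```python
import subprocess

# Key pairs copied from the commands quoted in this comment.
KEY_PAIRS = [
    "Lan_Channel:Volatile_Access_Mode=Always_Available",
    "Lan_Channel:Non_Volatile_Access_Mode=Always_Available",
]


def ipmi_lan_command(action):
    """Build the argv for the 'diff' (check) or 'commit' (fix) step."""
    if action not in ("diff", "commit"):
        raise ValueError("action must be 'diff' or 'commit'")
    argv = ["sudo", "ipmi-config", "--section=Lan_Channel"]
    argv += [f"--key-pair={pair}" for pair in KEY_PAIRS]
    argv.append(f"--{action}")
    return argv


def run_ipmi_lan(action):
    """Run the command and return the completed process (hypothetical helper)."""
    return subprocess.run(ipmi_lan_command(action), capture_output=True, text=True)


if __name__ == "__main__":
    # Only print the commands here; review them before running anything.
    for step in ("diff", "commit"):
        print(" ".join(ipmi_lan_command(step)))
```

Passing the arguments as a list avoids shell quoting issues, which is why the double quotes from the original commands are not needed here.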

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

ganeti1013.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202005200829_dzahn_243184_ganeti1013_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['ganeti1011.eqiad.wmnet']

Of which those FAILED:

['ganeti1011.eqiad.wmnet']

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

ganeti1014.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202005200837_dzahn_246536_ganeti1014_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['ganeti1012.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

ganeti1015.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202005200843_dzahn_247233_ganeti1015_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

ganeti1016.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202005200846_dzahn_247522_ganeti1016_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['ganeti1013.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

ganeti1017.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202005200851_dzahn_249383_ganeti1017_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

ganeti1018.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202005200858_dzahn_251423_ganeti1018_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['ganeti1014.eqiad.wmnet']

and were ALL successful.

@RobH @Cmjohnson I noticed by chance there are more ganeti machines beyond ganeti1018. ganeti1019-ganeti1022 are in netbox but I don't see a racking ticket for them. Should there be one?

Edit: never mind, it is T228926, the special nodes for CI.

Completed auto-reimage of hosts:

['ganeti1015.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['ganeti1016.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['ganeti1017.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['ganeti1018.eqiad.wmnet']

and were ALL successful.

@akosiaris All of these hosts have RAID5 now:

===== NODE GROUP =====                                                                
(10) ganeti[1009-1018].eqiad.wmnet                                                    
----- OUTPUT of 'grep active /pro...| cut -d " " -f4' -----                           
raid5

Handing back over for the next "init" command steps you have mentioned are needed next.
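
The check shown in the output above (filter lines containing "active", then take the fourth space-separated field) can be sketched in Python for input that has the layout of Linux software-RAID (md) status text. The sample text and the function name are hypothetical, and the file path from the original command is elided there, so it is not reproduced here.

```python
def raid_levels(status_text):
    """Return the set of RAID levels found on lines containing 'active'.

    Mirrors `grep active ... | cut -d " " -f4`: the fourth field of a
    line such as 'md0 : active raid5 sda2[0] ...' is the RAID level.
    """
    levels = set()
    for line in status_text.splitlines():
        if "active" not in line:
            continue
        fields = line.split(" ")
        if len(fields) >= 4:
            levels.add(fields[3])
    return levels


# Hypothetical sample in the usual md status layout.
SAMPLE = """Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdc2[2] sdb2[1] sda2[0]
      1951133696 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
unused devices: <none>
"""

if __name__ == "__main__":
    print(raid_levels(SAMPLE))
```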

Script wmf-auto-reimage was launched by akosiaris on cumin1001.eqiad.wmnet for hosts:

['ganeti1009.eqiad.wmnet', 'ganeti1010.eqiad.wmnet', 'ganeti1011.eqiad.wmnet', 'ganeti1012.eqiad.wmnet', 'ganeti1013.eqiad.wmnet', 'ganeti1014.eqiad.wmnet', 'ganeti1015.eqiad.wmnet', 'ganeti1016.eqiad.wmnet', 'ganeti1017.eqiad.wmnet', 'ganeti1018.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202006040846_akosiaris_248215.log.

Was another reimage needed? I already did these. Something wrong with RAID still?

Completed auto-reimage of hosts:

['ganeti1011.eqiad.wmnet', 'ganeti1017.eqiad.wmnet', 'ganeti1012.eqiad.wmnet', 'ganeti1013.eqiad.wmnet', 'ganeti1009.eqiad.wmnet', 'ganeti1015.eqiad.wmnet', 'ganeti1018.eqiad.wmnet', 'ganeti1016.eqiad.wmnet', 'ganeti1010.eqiad.wmnet', 'ganeti1014.eqiad.wmnet']

and were ALL successful.

Was another reimage needed? I already did these. Something wrong with RAID still?

buster vs stretch. The current clusters are stretch and I'd rather not stall/jeopardize this further by coupling the distribution upgrade with the capacity increase.

buster vs stretch. The current clusters are stretch

Oh yea, that makes a lot of sense. gotcha, thanks.

Change 602350 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] ganeti: Add a ganeti_init.sh script

https://gerrit.wikimedia.org/r/602350

Change 602364 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Assign role::ganeti to new ganeti expansion hosts

https://gerrit.wikimedia.org/r/602364

Change 602364 merged by Alexandros Kosiaris:
[operations/puppet@production] Assign role::ganeti to new ganeti expansion hosts

https://gerrit.wikimedia.org/r/602364

Change 602379 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] ganeti: ganeti[12]0{09..24}.eqiad|codfw.wmnet to hieradata

https://gerrit.wikimedia.org/r/602379

Change 602379 merged by Alexandros Kosiaris:
[operations/puppet@production] ganeti: ganeti[12]0{09..24}.eqiad|codfw.wmnet to hieradata

https://gerrit.wikimedia.org/r/602379

Change 576406 abandoned by Dzahn:
site: add ganeti role to all new ganeti servers

Reason:
duplicate

https://gerrit.wikimedia.org/r/576406

akosiaris updated the task description.
akosiaris changed the point value for this task from 0 to 4294967296.

The hardware machines are now in full production mode, ready to receive VMs. In fact, the row C machines already have VMs as the old ganeti1001-ganeti1004 are almost empty now (the exception being 1 etcd VM for the old kubernetes cluster). The master has been failed over as well.

Change 602350 merged by Alexandros Kosiaris:
[operations/puppet@production] ganeti: Add a ganeti_init.sh script

https://gerrit.wikimedia.org/r/602350