rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet
Open, Medium, Public

Description

Please note that we ordered 14 total ganeti nodes via procurement task T214088.

The breakdown is as follows: eqiad: (4) ganeti refresh + (4) ganeti nodes for CI (releng) + (6) eqiad: ganeti nodes for expansion

This breakdown makes it seem like 10 of these will go into the general ganeti pool, and 4 will go for CI/release engineering use.

This task will track the racking and setup of the 10 general ganeti nodes. For this, 4 of them are replacing ganeti100[1-4], while the rest merely expand the service group.

Hostname Proposal: ganeti1009+

Racking Proposal: 4 of these nodes could share racks with ganeti100[1-4], which are in C4 and C7. Those are 10G racks holding 1G hosts, so it is best to avoid 10G racks for these new 1G hosts as well. We also have existing ganeti nodes in A4, A3 (2), A6, and A5. So place the new nodes in any 1G rack that isn't A3, A4, A5, or A6. Ideally we spread some of these out to rows C and D.
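
The rack constraint above can be checked mechanically against the locations recorded in the per-host checklists below. The following is only a sketch: the helper names are hypothetical, and it does not know which racks are 1G or 10G, so it only verifies the "not A3, A4, A5, or A6" rule and counts hosts per row.

```python
from collections import Counter

# Racks that already hold ganeti nodes, per the racking proposal.
EXCLUDED_RACKS = {"A3", "A4", "A5", "A6"}

# Rack/unit locations copied from the per-host checklists in this task.
PLACEMENTS = {
    "ganeti1009": "C3/u28",
    "ganeti1010": "C5/u8",
    "ganeti1011": "C6/u37",
    "ganeti1012": "C8/u20",
    "ganeti1013": "B3/u14",
    "ganeti1014": "B3/u15",
    "ganeti1015": "B5/u21",
    "ganeti1016": "B5/u22",
    "ganeti1017": "B6/u31",
    "ganeti1018": "B8/u29",
}


def rack_of(location):
    """Return the rack part of a 'rack/unit' string, e.g. 'C3' from 'C3/u28'."""
    return location.split("/")[0]


def check_placements(placements, excluded=EXCLUDED_RACKS):
    """Return (hosts in an excluded rack, number of hosts per row)."""
    violations = sorted(
        host for host, loc in placements.items() if rack_of(loc) in excluded
    )
    per_row = Counter(rack_of(loc)[0] for loc in placements.values())
    return violations, dict(per_row)


violations, per_row = check_placements(PLACEMENTS)
print(violations, per_row)
```

With the locations listed in this task, no host lands in an excluded rack, and the split is four hosts in row C and six in row B.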

ganeti1009:

  • - receive in system on procurement task T214088
  • - rack system C3/u28 & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing - ipmi over lan is not enabled, fix this!
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

ganeti1010:

  • - receive in system on procurement task T214088
  • - rack system C5/u8 & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

ganeti1011:

  • - receive in system on procurement task T214088
  • - rack system C6/u37 & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

ganeti1012:

  • - receive in system on procurement task T214088
  • - rack system C8/u20 & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

ganeti1013:

  • - receive in system on procurement task T214088
  • - rack system B3/u14 & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing - THE MGMT PASSWORD DID NOT WORK, MUST BE FIXED
  • - update the netbox entry to show the SERVICE TAG for the serial, not the express service code. This is important!
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

ganeti1014:

  • - receive in system on procurement task T214088
  • - rack system B3/u15 & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing - THE MGMT PASSWORD DID NOT WORK, MUST BE FIXED
  • - update the netbox entry to show the SERVICE TAG for the serial, not the express service code. This is important!
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

ganeti1015:

  • - receive in system on procurement task T214088
  • - rack system B5/u21 & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing - THE MGMT PASSWORD DID NOT WORK, MUST BE FIXED
  • - update the netbox entry to show the SERVICE TAG for the serial, not the express service code. This is important!
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

ganeti1016:

  • - receive in system on procurement task T214088
  • - rack system B5/u22 & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing - THE MGMT PASSWORD DID NOT WORK, MUST BE FIXED
  • - update the netbox entry to show the SERVICE TAG for the serial, not the express service code. This is important!
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

ganeti1017:

  • - receive in system on procurement task T214088
  • - rack system B6/u31 & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing - THE MGMT PASSWORD DID NOT WORK, MUST BE FIXED
  • - update the netbox entry to show the SERVICE TAG for the serial, not the express service code. This is important!
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

ganeti1018:

  • - receive in system on procurement task T214088
  • - rack system B8/u29 & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing - THE MGMT PASSWORD DID NOT WORK, MUST BE FIXED
  • - update the netbox entry to show the SERVICE TAG for the serial, not the express service code. This is important!
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox
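
Each checklist above moves a host through the same netbox states: planned at racking, staged after the first puppet run, and active once the service implementer takes over. A minimal local model of that progression might look like the sketch below. The `Status` enum and `advance` helper are hypothetical illustrations and do not call the Netbox API.

```python
from enum import Enum


class Status(Enum):
    PLANNED = "planned"
    STAGED = "staged"
    ACTIVE = "active"


# Transitions implied by the checklists: planned -> staged -> active.
ALLOWED = {
    Status.PLANNED: {Status.STAGED},
    Status.STAGED: {Status.ACTIVE},
    Status.ACTIVE: set(),
}


def advance(current, target):
    """Return the new status, or raise if the checklist order is violated."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    return target


state = Status.PLANNED
state = advance(state, Status.STAGED)   # after puppet accept/initial run
state = advance(state, Status.ACTIVE)   # done by the service implementer
print(state.value)
```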

Event Timeline

RobH triaged this task as Medium priority. Jul 24 2019, 6:59 PM
RobH created this task.
Restricted Application added a subscriber: Aklapper. Jul 24 2019, 6:59 PM
RobH assigned this task to akosiaris. Jul 24 2019, 7:00 PM
RobH added subscribers: Cmjohnson, akosiaris.

@akosiaris,

Can I get your sign-off on the racking proposal and planning for these 10 ganeti nodes? 4 were refresh, while 6 were expansion from last year's budget. If these differ from normal ganeti nodes in any way, please note it and assign to @Cmjohnson for racking/followup.

Thanks!

RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.

@RobH, @Cmjohnson

Indeed the refreshes are for ganeti100[1-4] so row C it is. Try to spread them across 1G racks.

However, the 6 ganeti nodes of the expansion should go either to row B or D, with a preference for B as we want to expand to a 3rd rack row to provide HA for services that need to be spread across an odd number of rows. The "try to spread them across 1G racks" holds true here as well.

The partman recipe should be partman/ganeti-raid5.cfg

Hostnames LGTM, i.e. ganeti1009-ganeti1018
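
For scripting against the agreed range ganeti1009-ganeti1018, the host list can be generated and matched as sketched below. Note the assumptions: `STRICT` is a pattern introduced here for illustration, and the comparison with the task title only holds if the title is read as a Python regular expression, which may not be the syntax the title intends.

```python
import re


def expected_hosts(first=1009, last=1018, domain="eqiad.wmnet"):
    """Fully qualified names for the agreed hostname range."""
    return [f"ganeti{n}.{domain}" for n in range(first, last + 1)]


# Pattern exactly as written in the task title.
TITLE = re.compile(r"ganeti10([09]|1[0-8]).eqiad.wmnet")

# Assumed stricter pattern (not from the task): literal dots, and "09"
# as two characters rather than the character class [09].
STRICT = re.compile(r"ganeti10(09|1[0-8])\.eqiad\.wmnet")

hosts = expected_hosts()
missed_by_title = [h for h in hosts if not TITLE.fullmatch(h)]
missed_by_strict = [h for h in hosts if not STRICT.fullmatch(h)]
print(missed_by_title, missed_by_strict)
```

Read as a strict regex, `[09]` matches a single character, so the title pattern does not match ganeti1009.eqiad.wmnet, while the stricter variant covers all ten hosts.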

RobH added a parent task: Unknown Object (Task). Jul 30 2019, 5:26 PM
Cmjohnson updated the task description. Aug 13 2019, 2:36 PM
Cmjohnson reassigned this task from akosiaris to Jclark-ctr. Aug 13 2019, 2:41 PM

Please rack, label and cable these servers at the racking locations above. Add them to netbox, making sure status is set to planned and asset tag/SN is ALL CAPS. Please update the task with the access switch port each server is attached to.

Change 530132 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding mgmt dns ganeti1009-1022

https://gerrit.wikimedia.org/r/530132

Change 530132 merged by Cmjohnson:
[operations/dns@master] Adding mgmt dns ganeti1009-1022

https://gerrit.wikimedia.org/r/530132

@Jclark-ctr Mgmt IPs that need to be set up on the iDRAC:

Instructions for setup: https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/Dell_PowerEdge_RN30#Initial_System_Setup

ganeti1009 10.65.5.104
ganeti1010 10.65.5.105
ganeti1011 10.65.5.106
ganeti1012 10.65.5.107
ganeti1013 10.65.5.108
ganeti1014 10.65.5.109
ganeti1015 10.65.5.110
ganeti1016 10.65.5.111
ganeti1017 10.65.5.112
ganeti1018 10.65.5.113
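
The ten mgmt addresses above are consecutive, starting at 10.65.5.104 for ganeti1009. A short sketch that reproduces the list, using the hypothetical helper name `assign_mgmt_ips`, can help cross-check what was typed into each iDRAC:

```python
import ipaddress


def assign_mgmt_ips(hosts, first_ip):
    """Map each host, in order, to consecutive IPv4 addresses."""
    start = ipaddress.IPv4Address(first_ip)
    return {host: str(start + offset) for offset, host in enumerate(hosts)}


hosts = [f"ganeti{n}" for n in range(1009, 1019)]
mgmt = assign_mgmt_ips(hosts, "10.65.5.104")

for host, ip in mgmt.items():
    print(host, ip)
```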

Jclark-ctr updated the task description. Aug 14 2019, 1:46 PM

Entered IP addresses in iDRAC and set the password on ganeti10([09]|1[0-8]).

Jclark-ctr added a comment. Edited Sep 4 2019, 10:12 PM

@Cmjohnson iDRAC and BIOS settings finished.

host_name port
ganeti1009 23
ganeti1010 7
ganeti1011 41
ganeti1012 1
ganeti1013 14
ganeti1014 15
ganeti1015 28
ganeti1016 29
ganeti1017 30
ganeti1018 28
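
Port 28 appears twice in the table above (ganeti1015 and ganeti1018), which is only a problem if both hosts sit on the same switch. The sketch below assumes one access switch per rack, which this task does not state, and takes the rack locations from the task description. It flags any rack and port pair claimed by more than one host.

```python
from collections import defaultdict

# Rack per host, from the checklists in the task description.
RACKS = {
    "ganeti1009": "C3", "ganeti1010": "C5", "ganeti1011": "C6",
    "ganeti1012": "C8", "ganeti1013": "B3", "ganeti1014": "B3",
    "ganeti1015": "B5", "ganeti1016": "B5", "ganeti1017": "B6",
    "ganeti1018": "B8",
}

# Switch port per host, from the table above.
PORTS = {
    "ganeti1009": 23, "ganeti1010": 7, "ganeti1011": 41,
    "ganeti1012": 1, "ganeti1013": 14, "ganeti1014": 15,
    "ganeti1015": 28, "ganeti1016": 29, "ganeti1017": 30,
    "ganeti1018": 28,
}


def port_conflicts(racks, ports):
    """Return {(rack, port): [hosts]} for ports used more than once in a rack."""
    seen = defaultdict(list)
    for host, port in ports.items():
        seen[(racks[host], port)].append(host)
    return {key: hosts for key, hosts in seen.items() if len(hosts) > 1}


conflicts = port_conflicts(RACKS, PORTS)
print(conflicts)
```

Under that assumption the table is consistent: ganeti1015 is in B5 and ganeti1018 is in B8, so their shared port number does not collide.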

RobH added a comment. Edited Sep 25 2019, 10:22 PM

Ok, there are a few issues here.

Please START UPDATING THE CHECKLISTS AS YOU SET UP SERVERS. I don't want to keep repeating this across multiple tasks: if you set up a system, CHECK OFF THE BOXES.

Having to parse every single comment to see who did what gets cumbersome and pointless over time.

There are a few issues with the racking and entering of these into netbox:

  • These were entered using the Express Service Code, not the Service Tag, so every single one is generating accounting report errors.
    • @Jclark-ctr: In the future, use the SERVICE TAG for the Serial Number on Dell systems in netbox.
  • These were racked and set up with the old mgmt password, but that was before the password shift.
    • I've updated ganeti1009-1012.
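
On Dell systems the Express Service Code is the Service Tag read as a base-36 number and written in decimal, so a serial entered the wrong way can be converted back without touching the host. The sketch below is illustrative only: the function names are hypothetical, the tag "ABC1234" is made up, and the digits-only check is a heuristic, not a netbox validation rule.

```python
import string

# Base-36 alphabet: 0-9 followed by A-Z.
ALPHABET = string.digits + string.ascii_uppercase


def to_express_code(service_tag):
    """Service Tag (base 36) -> Express Service Code (decimal integer)."""
    return int(service_tag.upper(), 36)


def to_service_tag(express_code):
    """Express Service Code (decimal) -> Service Tag (base 36, upper case).

    Leading zeros of the original tag, if any, are not restored.
    """
    number = int(express_code)
    tag = ""
    while number:
        number, remainder = divmod(number, 36)
        tag = ALPHABET[remainder] + tag
    return tag or "0"


def looks_like_express_code(serial):
    """Heuristic: all digits and longer than a typical 7-character tag."""
    return serial.isdigit() and len(serial) > 7


tag = "ABC1234"  # made-up example tag, not a real asset
code = to_express_code(tag)
print(code, to_service_tag(code), looks_like_express_code(str(code)))
```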

I am unable to log in as root to the mgmt interface on the following hosts (both the old and the new mgmt password fail). I need these to be manually rebooted via crash cart into the iDRAC BIOS and have the passwords set a second time, as it seems they were not set correctly.

  • ganeti1013
  • ganeti1014
  • ganeti1015
  • ganeti1016
  • ganeti1017
  • ganeti1018

I've updated the checklists for EVERY SINGLE SERVER above.

@Jclark-ctr or @Cmjohnson:

  • The service tag issue has to be resolved on ganeti101[3-8]. Since I cannot connect remotely, I cannot poll their service tags and update netbox. Please do this ASAP, as it causes issues on reports.
  • The password needs to be set and TESTED on every single one of these systems when they are racked, and then checked off. Since ganeti101[3-8] aren't working (no password works to log in), they need to be crash carted and have their passwords set to the new mgmt password. Do not use the old one!
  • As you set the mgmt info, you need to test it to ensure it works.
RobH updated the task description. Sep 25 2019, 10:23 PM
RobH updated the task description. Sep 25 2019, 10:27 PM
RobH updated the task description.
Cmjohnson updated the task description. Sep 26 2019, 6:13 PM

Change 552812 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding dns entries new ganeti hosts

https://gerrit.wikimedia.org/r/552812

Change 552812 merged by Cmjohnson:
[operations/dns@master] Adding dns entries new ganeti hosts

https://gerrit.wikimedia.org/r/552812

@Jclark-ctr can you fix the mgmt password for these, please?

herron added a subscriber: herron. Jan 8 2020, 4:10 PM

The row_A ganeti group is running low on memory capacity (please see T239151#5707691). Should we allocate a few of these new hosts to expand the existing row_A ganeti group?

Majavah renamed this task from rack/setup/install ganeti10([09]|1[0-8[).eqiad.wmnet to rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet. Jan 8 2020, 4:18 PM
Cmjohnson added a subscriber: Jclark-ctr.

The mgmt passwords have been updated. Expect these to be ready this week.

The row_A ganeti group is running low on memory capacity (please see T239151#5707691). Should we allocate a few of these new hosts to expand the existing row_A ganeti group?

Not necessarily. With row_C being refreshed with a 3x increase in disk capacity, and row_B or D being added, we should be able to rebalance a few of the workloads across the cluster (e.g. etcd hosts, which should be better spread across our rack rows) and still have some room for temporary requests such as T239151#5707691 in the new row D. Per Chris's comment, we should be able to pull that off next week.

Cmjohnson updated the task description. Wed, Jan 29, 10:18 PM

Change 568652 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] Add ganeti1009|101[1-8] dhcpd file and netboot.cfg

https://gerrit.wikimedia.org/r/568652

Change 568652 merged by Cmjohnson:
[operations/puppet@production] Add ganeti1009|101[1-8] dhcpd file and netboot.cfg

https://gerrit.wikimedia.org/r/568652

RobH removed a subscriber: RobH. Thu, Jan 30, 5:57 PM
Cmjohnson updated the task description. Thu, Jan 30, 11:51 PM

All but ganeti1017 are ready for handoff. I am not sure what is going on with this server: I cannot get any output on the console. This needs to be checked on-site.

Mentioned in SAL (#wikimedia-operations) [2020-02-03T13:31:32Z] <moritzm> rebooting ganeti1009 - ganeti1022 to pick up microcode update T228924

Mentioned in SAL (#wikimedia-operations) [2020-02-03T22:13:31Z] <mutante> rebooting ganeti1010, ganeti1011 and other new ganeti machines to pickup microcode mitigations, for some reason the previous reboots did not do it. rescheduled service check on icinga for ganeti1010 and now it recovered (T228924)

Dzahn added a subscriber: Dzahn. Mon, Feb 3, 11:01 PM

For some reason the previous reboots did not pick up the microcode update, but the second attempt did. The microcode alerts have now recovered after rebooting the hosts.

Cmjohnson updated the task description. Tue, Feb 4, 3:39 PM
Cmjohnson reassigned this task from Cmjohnson to Dzahn. Tue, Feb 4, 3:52 PM
Cmjohnson removed a project: ops-eqiad.

@Dzahn ganeti1017 is now ready. I am assigning this to you and removing the ops-eqiad tag.

Dzahn added a comment. Wed, Feb 5, 6:09 PM

Thanks @Cmjohnson!

I rebooted ganeti1017 one more time because that fixes the microcode mitigation icinga alerts.

Change 570390 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: add new ganeti hosts for refresh/expansion with spare role

https://gerrit.wikimedia.org/r/570390

Change 570390 merged by Alexandros Kosiaris:
[operations/puppet@production] site: add new ganeti hosts for refresh/expansion with spare role

https://gerrit.wikimedia.org/r/570390

Dzahn edited projects, added serviceops; removed vm-requests. Sat, Feb 8, 12:20 AM