
rack/setup/install ganeti400[123]
Closed, Resolved | Public | 0 Estimated Story Points

Description

This task will track the racking, setup, and installation of ganeti400[123], 3 new R440 systems in ulsfo.

Racking Plan: All the odd-numbered servers are in rack .22 and all the even-numbered servers are in rack .23, so I (@RobH) followed this scheme and racked ganeti400[13] in .22 and ganeti4002 in .23.
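For illustration only, the odd/even racking rule can be sketched in Python. The rack names .22 and .23 come from the plan above; the helper names `expand_hosts` and `rack_for` are hypothetical and are not part of any existing tooling.

```python
import re

# Rack names come from the racking plan; everything else here is illustrative.
ODD_RACK = ".22"
EVEN_RACK = ".23"


def expand_hosts(pattern: str) -> list[str]:
    """Expand one bracketed digit set, e.g. 'ganeti400[123]'."""
    match = re.fullmatch(r"(.*)\[(\d+)\](.*)", pattern)
    if match is None:
        return [pattern]  # no bracket set: treat as a single hostname
    prefix, digits, suffix = match.groups()
    return [f"{prefix}{digit}{suffix}" for digit in digits]


def rack_for(hostname: str) -> str:
    """Pick the rack from the parity of the trailing host number."""
    match = re.search(r"(\d+)$", hostname)
    if match is None:
        raise ValueError(f"no trailing number in {hostname!r}")
    return ODD_RACK if int(match.group(1)) % 2 else EVEN_RACK


plan = {host: rack_for(host) for host in expand_hosts("ganeti400[123]")}
print(plan)
```

With the three hosts in this task, the mapping puts ganeti4001 and ganeti4003 in .22 and ganeti4002 in .23, matching the plan.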

ganeti4001:

  • - receive in system on procurement task T214098
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

ganeti4002:

  • - receive in system on procurement task T214098
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

ganeti4003:

  • - receive in system on procurement task T214098
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox
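
As a rough sketch, the Netbox status changes named in the checklists (planned, then staged, then active) can be modelled as a small transition table. The status names come from this task, including 'failed', which a later comment mentions for hardware faults. The `ALLOWED` table and the `change_status` helper are assumptions made for illustration; they do not describe what Netbox itself enforces.

```python
# Status names are taken from this task; the ALLOWED table is an
# illustrative assumption, not a rule enforced by Netbox.
ALLOWED = {
    "planned": {"staged", "failed"},
    "staged": {"active", "planned", "failed"},
    "active": {"failed"},
    "failed": {"planned", "staged"},
}


def change_status(current: str, new: str) -> str:
    """Return the new status, or raise if the move is not in ALLOWED."""
    if new not in ALLOWED.get(current, set()):
        raise ValueError(f"unexpected status change: {current} -> {new}")
    return new


status = "planned"                        # set when the host is racked
status = change_status(status, "staged")  # after the initial puppet run
status = change_status(status, "active")  # set by the service implementer
print(status)
```

Under these assumed rules, jumping straight from planned to active raises an error, which mirrors the checklist order of staging a host before handing it off.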

Event Timeline


Change 518831 merged by RobH:
[operations/puppet@production] adding install params for ganeti400[123]

https://gerrit.wikimedia.org/r/518831

This comment was removed by RobH.

Change 518832 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] fix ganeti host entry

https://gerrit.wikimedia.org/r/518832

Change 518832 merged by RobH:
[operations/puppet@production] fix ganeti host entry

https://gerrit.wikimedia.org/r/518832

Change 518834 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] setting ganeti400[123] production dns

https://gerrit.wikimedia.org/r/518834

RobH removed a project: Patch-For-Review.
RobH updated the task description.

I set these to internal IP/vlan since other ganeti hosts are that way.

So while I can ssh into any of the other site switch stacks, or into the mgmt on the servers in ulsfo, I cannot ssh into asw-ulsfo.mgmt.ulsfo.wmnet.

@ayounsi: please ping me when you have a moment to help me debug

asw-ulsfo.mgmt.ulsfo.wmnet was the old stack, asw2-ulsfo.mgmt.ulsfo.wmnet is the way to go.

I set these to internal IP/vlan since other ganeti hosts are that way.

Yup, that's correct. Thanks!

Alex is working on a new partman for these hosts, so I'm assigning this task to him, as it's blocked on that for the OS installation. The hosts are remotely accessible, and their network ports are live for internal vlan/subnet.

Actually, there was a bit of confusion, which we cleared up in IRC chat. These do NOT use the same partman config he is working on, but I cannot install these until puppet is re-enabled for me to work on them.

As it's not urgent, I'll just wait and work on this in the afternoon, when Alex is done with his partman work for the day.

Change 518834 merged by RobH:
[operations/dns@master] setting ganeti400[123] production dns

https://gerrit.wikimedia.org/r/518834

Change 519245 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] update dhcp file for ganeti400[123]

https://gerrit.wikimedia.org/r/519245

Change 519245 merged by RobH:
[operations/puppet@production] update dhcp file for ganeti400[123]

https://gerrit.wikimedia.org/r/519245

Change 519252 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] testing out partman for new ganeti hosts

https://gerrit.wikimedia.org/r/519252

Change 519252 merged by RobH:
[operations/puppet@production] testing out partman for new ganeti hosts

https://gerrit.wikimedia.org/r/519252

RobH removed RobH as the assignee of this task. (Jun 26 2019, 6:44 PM)
RobH assigned this task to akosiaris.
RobH removed a project: ops-ulsfo.
RobH updated the task description.

Assigning to Alex for ganeti setup. Not 100% sure whether this is an Alex project (due to ganeti) or a Traffic project (due to being in our caching centers), but going with Alex first.

BBlack added a project: Traffic.

I don't think anyone's 100% sure how we're handling this project, but Traffic will probably figure out the setup for these and ask Alex if we need help. We probably won't get around to it very quickly, so they can stay in role::spare for now until we get to it.

akosiaris changed the task status from Open to Stalled. (Jun 27 2019, 9:44 AM)
akosiaris lowered the priority of this task from Medium to Low.

OK, good to know. Moving to Low priority and Stalled status until then.

These hosts showed up as special cases as part of T147074 because commands over remote IPMI did not work. ssh login works. Feels like IPMI over LAN might be disabled in BIOS.

^ Fixed by @Papaul. Confirmed that changing the password remotely via IPMI works now.

Will this Ganeti cluster use vlan tagged interfaces, or will separate physical interfaces connect to both public and private vlans? If tagging, are the switchports configured for that yet?

Using eqiad as an example, looks like the native vlan is private with public using tagged interfaces. IMO makes sense to continue this model.

bridge name	bridge id		STP enabled	interfaces
private		8000.f01fafe8c5a3	no		eno1
public		8000.f01fafe8c5a3	no		eno1.1003
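
To make the eqiad model above concrete, here is a small Python sketch that reads the bridge table and reports which bridge sits on a tagged interface. It assumes the common Linux naming convention where an interface called `eno1.1003` is VLAN 1003 on parent `eno1`. The function name `parse_bridges` is hypothetical, and the sample rows are the ones quoted above.

```python
# Sample rows copied from the table above; columns are whitespace-separated.
BRIDGE_TABLE = """\
bridge name     bridge id               STP enabled     interfaces
private         8000.f01fafe8c5a3       no              eno1
public          8000.f01fafe8c5a3       no              eno1.1003
"""


def parse_bridges(text: str) -> dict[str, dict]:
    """Map bridge name -> parent interface and VLAN ID (None if untagged)."""
    bridges = {}
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) != 4:
            continue  # ignore blank or multi-interface continuation rows
        name, _bridge_id, _stp, iface = fields
        # Assumption: a 'parent.NNNN' interface name means VLAN tag NNNN.
        parent, _, vlan = iface.partition(".")
        bridges[name] = {"parent": parent, "vlan": int(vlan) if vlan else None}
    return bridges


for name, info in parse_bridges(BRIDGE_TABLE).items():
    kind = f"tagged VLAN {info['vlan']}" if info["vlan"] else "native (untagged)"
    print(f"{name}: {info['parent']} {kind}")
```

For the two rows shown, this reports the private bridge on untagged eno1 and the public bridge on VLAN 1003, which is the native-private, tagged-public layout described above.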

I think we'll keep them private-vlan only and no tagging, and for the rare cases of "public" service instances we'll use LVS to route the traffic (same for all the edge-site ganeti).

Ok, for my own edification, how would the private-only LVS model work if we wanted to stand up a public-facing non-HTTP(S) service in a VM at one or more of these sites?

Please note that ganeti4002 and ganeti4003 are showing as 'staged' in netbox but not in puppetdb, and throwing report errors on https://netbox.wikimedia.org/extras/reports/puppetdb.PuppetDB/

"missing physical device in PuppetDB: state Staged in Netbox"

Are these mid-reinstall, or will they be offline for a while? If offline for a while, they need to change to 'planned' if they weren't deployed, or 'failed' if they are in a hardware failure state and need repair.
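
The kind of check behind that report error can be sketched as a set comparison. This is not the actual Netbox report code: the function name, the `EXPECT_IN_PUPPETDB` set, and the sample data (including the entry for ganeti4001) are assumptions for illustration. Only the error wording and the staged state of ganeti4002 and ganeti4003 come from the comment above.

```python
# Hypothetical sample data; a real check would query Netbox and PuppetDB.
netbox_status = {
    "ganeti4001": "staged",  # invented for the example
    "ganeti4002": "staged",
    "ganeti4003": "staged",
}
puppetdb_hosts = {"ganeti4001"}  # invented for the example

# Assumption: hosts in these states are expected to be known to PuppetDB.
EXPECT_IN_PUPPETDB = {"staged", "active"}


def report_missing(netbox: dict[str, str], puppetdb: set[str]) -> list[str]:
    """List hosts whose Netbox state implies a PuppetDB entry that is absent."""
    errors = []
    for host, status in sorted(netbox.items()):
        if status in EXPECT_IN_PUPPETDB and host not in puppetdb:
            errors.append(
                f"{host}: missing physical device in PuppetDB: "
                f"state {status.capitalize()} in Netbox"
            )
    return errors


for error in report_missing(netbox_status, puppetdb_hosts):
    print(error)
```

Setting a host to 'planned' or 'failed' in the sample data removes it from the output, which is the remedy suggested above for hosts that will stay offline.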

Change 554961 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] install_server: switch ganeti[345]* to raid1 layout

https://gerrit.wikimedia.org/r/554961

Change 554961 merged by Herron:
[operations/puppet@production] install_server: switch ganeti[345]* to raid1 layout

https://gerrit.wikimedia.org/r/554961

Change 555761 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] ganeti: assign ganeti400[123] role::ganeti

https://gerrit.wikimedia.org/r/555761

Change 555763 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] ganeti: disable alerts on ganeti400[123] during setup

https://gerrit.wikimedia.org/r/555763

Change 555763 merged by Herron:
[operations/puppet@production] ganeti: disable alerts on ganeti400[123] during setup

https://gerrit.wikimedia.org/r/555763

Change 558185 had a related patch set uploaded (by Herron; owner: Herron):
[labs/private@master] add dummy ulsfo ganeti RAPI key to pacify PCC

https://gerrit.wikimedia.org/r/558185

Change 558185 merged by Herron:
[labs/private@master] add dummy ulsfo ganeti RAPI key to pacify PCC

https://gerrit.wikimedia.org/r/558185

Change 555761 merged by Herron:
[operations/puppet@production] ganeti: assign ganeti400[123] role::ganeti

https://gerrit.wikimedia.org/r/555761

Change 558214 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] ganeti: use 'drbd-utils' package on buster and beyond

https://gerrit.wikimedia.org/r/558214

Change 558247 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] Add cluster defs for edge ganetis

https://gerrit.wikimedia.org/r/558247

Change 558247 merged by BBlack:
[operations/puppet@production] Add cluster defs for edge ganetis

https://gerrit.wikimedia.org/r/558247

Change 558214 merged by Muehlenhoff:
[operations/puppet@production] ganeti: use 'drbd-utils' package

https://gerrit.wikimedia.org/r/558214

Change 558633 had a related patch set uploaded (by Herron; owner: Herron):
[operations/dns@master] dns: add ganeti01.svc.ulsfo.wmnet cluster service address

https://gerrit.wikimedia.org/r/558633

Change 558633 merged by Herron:
[operations/dns@master] dns: add ganeti01.svc.ulsfo.wmnet cluster service address

https://gerrit.wikimedia.org/r/558633

Change 559166 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] ganeti: ensure package ganeti-instance-debootstrap installed

https://gerrit.wikimedia.org/r/559166

herron changed the task status from Stalled to Open. (Dec 19 2019, 2:44 AM)

The ulsfo buster ganeti cluster is up and running now, and netflow4001 has been created there as a first VM.

This VM can currently be accessed from the puppetmaster install_console or the ganeti console. We'll need to merge something like https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/559155/ for puppet to complete the first run on this VM. Likewise, we will need to update things like the sites available to various clusters, which I presume will happen as services are built out.

I kept notes on the process and added them to a Google doc here: https://docs.google.com/document/d/1NjqfdCIY1ClGgDNnq_WRwCz3sPa52qa41DkrM24JjiU/edit# There are some areas where we could improve our documentation, but first I'd like to review and ensure that what's outlined there is sane before documenting it as the procedure to follow and automating more of the steps.

Some next steps before resolving this task:

  • Review the ulsfo ganeti cluster setup notes https://docs.google.com/document/d/1NjqfdCIY1ClGgDNnq_WRwCz3sPa52qa41DkrM24JjiU/edit#
  • Do some failover testing and make sure the cluster stays stable
  • Review the ganeti group name (currently "default") and rename as desired
  • Make sure ulsfo ganeti is syncing with netbox (or maybe just do this once when all clusters are online)
  • Re-enable alerts in icinga and set to active in netbox

For sure, but it's a work in progress currently. Basically I'd like a sanity check that the manual steps make sense and aren't already automated, or better handled, in a way that I'm not aware of.

I think it will be easier to work through comments on a google doc, and then move finished instructions to wikitech. Like we do with incident reports, for example.

Change 559330 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] add misc cluster to eqsin and ulsfo

https://gerrit.wikimedia.org/r/559330

Actually, since netflow4001 is not yet puppetized, the instance has been shut down. https://gerrit.wikimedia.org/r/559330 should unblock the first puppet run, and the instance can be restarted after it's merged.

I did a reinstall of netflow4001 (I had missed this task update and thought it was a botched install) and tested migrations/draining a node, a master failover and a rebalance (which came to the correct solution that with one node nothing needs to be done). This all looks fine!

Change 559462 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Re-enable notifications for ganeti/ulsfo

https://gerrit.wikimedia.org/r/559462

Change 559166 merged by Herron:
[operations/puppet@production] ganeti: ensure package ganeti-instance-debootstrap installed

https://gerrit.wikimedia.org/r/559166

Change 559330 merged by Herron:
[operations/puppet@production] add misc cluster to eqsin and ulsfo

https://gerrit.wikimedia.org/r/559330

Change 559462 merged by Muehlenhoff:
[operations/puppet@production] Re-enable notifications for ganeti/ulsfo

https://gerrit.wikimedia.org/r/559462

Mentioned in SAL (#wikimedia-operations) [2019-12-20T09:31:58Z] <moritzm> applied Ganeti cluster setting to pass through CPU flags for MDS/SSBD to esams/ulsfo clusters T226444 T236216

MoritzMuehlenhoff claimed this task.
MoritzMuehlenhoff updated the task description.
MoritzMuehlenhoff added a subscriber: RobH.

I tested a failover and an instance migration successfully. I also changed the cluster setting so that CPU vulnerability flags are passed through. Netbox status was set to "Active". Closing!