rack/setup/install ganeti300[123]
Closed, Resolved (Public)

Description

This task will track the racking/setup/installation of ganeti300[123] ordered on task T230620.
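The shorthand ganeti300[123] in the title covers three hosts. As a minimal sketch, assuming the brackets are a shell-glob-style character class (one character per alternative) and that the hosts live under esams.wmnet as in the reimage log entries on this task, the expansion could look like the following. The helper name expand_hosts is hypothetical and is not part of any WMF tooling.

```python
import re
from itertools import product


def expand_hosts(pattern: str, domain: str = "esams.wmnet") -> list[str]:
    """Expand a pattern like 'ganeti300[123]' into fully qualified hostnames.

    Assumption: each [...] group is a character class, so '[123]' means
    '1' or '2' or '3'. Ranges such as '[1-3]' are NOT handled here.
    """
    # re.split with one capture group alternates literal text (even
    # indices) with the contents of each bracket group (odd indices).
    parts = re.split(r"\[([^\]]+)\]", pattern)
    choices = [[p] if i % 2 == 0 else list(p) for i, p in enumerate(parts)]
    return [f"{''.join(combo)}.{domain}" for combo in product(*choices)]


if __name__ == "__main__":
    for host in expand_hosts("ganeti300[123]"):
        print(host)
```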

Racking Proposal: Setup via the google sheet rack layout.

ganeti3001:

  • - receive in system on procurement task T230620
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - disable embedded NIC
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes status from 'staged' to 'active' in netbox

ganeti3002:

  • - receive in system on procurement task T230620
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - disable embedded NIC
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes status from 'staged' to 'active' in netbox

ganeti3003:

  • - receive in system on procurement task T230620
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - disable embedded NIC
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes status from 'staged' to 'active' in netbox
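The netbox steps in the checklists above imply a fixed order of host states: planned at racking, staged after the initial puppet run, and active once the service implementer takes over. A minimal sketch of that ordering follows. The names Status, NEXT_STATUS and advance are hypothetical, and the code only models the order of steps; it does not talk to the Netbox API.

```python
from enum import Enum


class Status(Enum):
    PLANNED = "planned"  # set when the system is racked
    STAGED = "staged"    # set after OS install and the initial puppet run
    ACTIVE = "active"    # set by the service implementer


# Each status may only move to the next one in the checklist.
NEXT_STATUS = {
    Status.PLANNED: Status.STAGED,
    Status.STAGED: Status.ACTIVE,
}


def advance(current: Status) -> Status:
    """Return the status that follows `current` in the checklist order."""
    if current not in NEXT_STATUS:
        raise ValueError(f"no further transition from {current.value!r}")
    return NEXT_STATUS[current]


if __name__ == "__main__":
    state = Status.PLANNED
    while state in NEXT_STATUS:
        nxt = advance(state)
        print(f"{state.value} -> {nxt.value}")
        state = nxt
```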

Details

Related Gerrit Patches:

Event Timeline

RobH removed a subscriber: RobH. Oct 23 2019, 6:02 PM

Change 545658 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] Basic install for new esams hosts

https://gerrit.wikimedia.org/r/545658

Change 545662 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/dns@master] basic DNS entries for new esams hosts

https://gerrit.wikimedia.org/r/545662

Change 545662 merged by BBlack:
[operations/dns@master] basic DNS entries for new esams hosts

https://gerrit.wikimedia.org/r/545662

Change 545658 merged by BBlack:
[operations/puppet@production] Basic install for new esams hosts

https://gerrit.wikimedia.org/r/545658

Vgutierrez updated the task description. Oct 24 2019, 5:58 AM
Vgutierrez updated the task description. Oct 24 2019, 6:03 AM

Change 545701 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] install_server: Fix MAC addresses for new esams boxes

https://gerrit.wikimedia.org/r/545701

Change 545701 merged by Vgutierrez:
[operations/puppet@production] install_server: Fix MAC addresses for new esams boxes

https://gerrit.wikimedia.org/r/545701

ganeti3003 switch information

xe-6/0/13

Change 545880 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/dns@master] esams: mgmt dns for rack 16

https://gerrit.wikimedia.org/r/545880

Change 545880 merged by BBlack:
[operations/dns@master] esams: mgmt dns for rack 16

https://gerrit.wikimedia.org/r/545880

Change 545893 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] Add dhcp macaddrs for esams rack 16 hosts

https://gerrit.wikimedia.org/r/545893

Change 545893 merged by BBlack:
[operations/puppet@production] Add dhcp macaddrs for esams rack 16 hosts

https://gerrit.wikimedia.org/r/545893

Change 545895 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/dns@master] esams mgmt dns for rack 14

https://gerrit.wikimedia.org/r/545895

Change 545895 merged by BBlack:
[operations/dns@master] esams mgmt dns for rack 14

https://gerrit.wikimedia.org/r/545895

Change 545908 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] esams: macaddrs for all new rack 14 hosts

https://gerrit.wikimedia.org/r/545908

Change 545908 merged by BBlack:
[operations/puppet@production] esams: macaddrs for all new rack 14 hosts

https://gerrit.wikimedia.org/r/545908

Dzahn updated the task description. Oct 24 2019, 6:39 PM
Dzahn updated the task description.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

ganeti3001.esams.wmnet

The log can be found in /var/log/wmf-auto-reimage/201910241840_dzahn_45143_ganeti3001_esams_wmnet.log.

Completed auto-reimage of hosts:

['ganeti3001.esams.wmnet']

Of which those FAILED:

['ganeti3001.esams.wmnet']

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

ganeti3001.esams.wmnet

The log can be found in /var/log/wmf-auto-reimage/201910241841_dzahn_45283_ganeti3001_esams_wmnet.log.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

ganeti3002.esams.wmnet

The log can be found in /var/log/wmf-auto-reimage/201910241842_dzahn_45458_ganeti3002_esams_wmnet.log.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

ganeti3003.esams.wmnet

The log can be found in /var/log/wmf-auto-reimage/201910241843_dzahn_45635_ganeti3003_esams_wmnet.log.

Completed auto-reimage of hosts:

['ganeti3001.esams.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['ganeti3003.esams.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['ganeti3002.esams.wmnet']

and were ALL successful.

Dzahn updated the task description. Oct 24 2019, 7:26 PM
Dzahn added a subscriber: Dzahn.
  • OS installed
  • in puppet with spare role
  • set to "staged" in netbox

Are all previous boxes already done? If so, then it can be handed over to Filippo, I think.

Dzahn changed the task status from Open to Stalled. Oct 24 2019, 8:01 PM

These need to stay on role(spare) for now, per bblack "we haven't figured out the edge DC ganeti cluster configs yet"

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

ganeti3003.esams.wmnet

The log can be found in /var/log/wmf-auto-reimage/201910250429_dzahn_168265_ganeti3003_esams_wmnet.log.

Completed auto-reimage of hosts:

['ganeti3003.esams.wmnet']

Of which those FAILED:

['ganeti3003.esams.wmnet']
RobH moved this task from Backlog to Racking Tasks on the ops-esams board. Oct 25 2019, 3:48 PM
Papaul updated the task description. Oct 29 2019, 8:12 PM

@BBlack, what were your plans here? Can others in SRE help with some of that perhaps?

(If this is more than a simple "rack/setup/install", let's file a separate task about that, and resolve the DC-Ops specific one.)

Papaul removed Papaul as the assignee of this task. Nov 22 2019, 4:26 AM
Papaul added a subscriber: Papaul.
BBlack updated the task description. Nov 22 2019, 2:34 PM

IMPORTANT NOTE: ganeti3003 is temporarily repurposed as a critical authdns server and is in live production use for that role (see also: T236479). Do not reimage or touch ganeti3003. The other two (ganeti3001 and ganeti3002) are free to image and set up as a 2-node ganeti cluster, with the third node to join later when its temporary duties are complete.

Change 553046 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Switch Ganeti servers in esams/ulsfo to Buster

https://gerrit.wikimedia.org/r/553046

Change 553046 merged by Muehlenhoff:
[operations/puppet@production] Switch Ganeti servers in esams/ulsfo to Buster

https://gerrit.wikimedia.org/r/553046

Change 554533 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] ganeti3003: buster installer

https://gerrit.wikimedia.org/r/554533

Change 554533 merged by BBlack:
[operations/puppet@production] ganeti3003: buster installer

https://gerrit.wikimedia.org/r/554533

BBlack updated the task description. Dec 4 2019, 3:36 PM

With T236479 closed, ganeti3003 is no longer special and everyone can ignore the IMPORTANT NOTE earlier.

Dzahn changed the task status from Stalled to Open. Dec 5 2019, 6:14 PM

Change 558247 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] Add cluster defs for edge ganetis

https://gerrit.wikimedia.org/r/558247

Change 558247 merged by BBlack:
[operations/puppet@production] Add cluster defs for edge ganetis

https://gerrit.wikimedia.org/r/558247

Change 559297 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] ganeti300[123] disable notifications during setup

https://gerrit.wikimedia.org/r/559297

Change 559297 merged by Herron:
[operations/puppet@production] ganeti300[123] disable notifications during setup

https://gerrit.wikimedia.org/r/559297

Change 559313 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] ganeti: assign ganeti300[123] role::ganeti

https://gerrit.wikimedia.org/r/559313

Change 559315 had a related patch set uploaded (by Herron; owner: Herron):
[labs/private@master] add dummy esams and eqsin ganeti keys to pacify PCC

https://gerrit.wikimedia.org/r/559315

Change 559315 merged by Herron:
[labs/private@master] add dummy esams and eqsin ganeti keys to pacify PCC

https://gerrit.wikimedia.org/r/559315

Change 559324 had a related patch set uploaded (by Herron; owner: Herron):
[operations/dns@master] add ganeti01.svc.esams.wmnet forward/reverse ipv4 records

https://gerrit.wikimedia.org/r/559324

herron added a subscriber: herron. Edited Dec 19 2019, 5:58 AM

These hosts have been reimaged with buster, certs created, and patches uploaded to enable ganeti.

After a +1 on https://gerrit.wikimedia.org/r/559313 we will be ready to proceed with ganeti setup in esams.

Change 559313 merged by Muehlenhoff:
[operations/puppet@production] ganeti: assign ganeti300[123] role::ganeti

https://gerrit.wikimedia.org/r/559313

Change 559324 merged by Herron:
[operations/dns@master] add ganeti01.svc.esams.wmnet forward/reverse ipv4 records

https://gerrit.wikimedia.org/r/559324

The esams ganeti cluster is now up and running, and netflow3001 has been created there as a first VM.

After @MoritzMuehlenhoff has a chance to double check that all looks good, and we've re-enabled alerting, we should be in good shape to resolve this task.

Dzahn added a comment. Dec 20 2019, 4:30 AM

@herron netflow3001 and netflow4001 are popping up in Icinga because of the microcode mitigations. I think a reboot would fix that though.

Mentioned in SAL (#wikimedia-operations) [2019-12-20T09:31:58Z] <moritzm> applied Ganeti cluster setting to pass through CPU flags for MDS/SSBD to esams/ulsfo clusters T226444 T236216

Change 559710 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Re-enable notifications for ganeti3*, setup complete

https://gerrit.wikimedia.org/r/559710

Change 559710 merged by Muehlenhoff:
[operations/puppet@production] Re-enable notifications for ganeti3*, setup complete

https://gerrit.wikimedia.org/r/559710

MoritzMuehlenhoff closed this task as Resolved. Dec 20 2019, 9:50 AM
MoritzMuehlenhoff claimed this task.
MoritzMuehlenhoff updated the task description.

The esams ganeti cluster is now up and running, and netflow3001 has been created there as a first VM.
After @MoritzMuehlenhoff has a chance to double check that all looks good, and we've re-enabled alerting, we should be in good shape to resolve this task.

I tested a failover and an instance migration successfully. I also changed the cluster setting so that CPU vulnerability flags are passed through. Notifications in Icinga have been re-enabled and Netbox status was set to "Active". Closing!

Mentioned in SAL (#wikimedia-operations) [2020-01-10T10:24:20Z] <moritzm> rename Ganeti group for esams from "default" to "row_OE" T236216