(Need By: 2020-06-20) rack/setup/install cloudvirt10[31-39]eqiad.wmnet
Closed, Resolved (Public)

Description

This task will track the racking, setup, and OS installation of cloudvirt103[1-9].eqiad.wmnet

Hostname / Racking / Installation Details

To be named cloudvirt103[1-9].eqiad.wmnet.

Will be racked in row B, with two 10Gb network connections per server. Network connections will be the same as on existing hosts, e.g. cloudvirt1030.

All we need is raid10 for system drives -- probably no need for hardware raid here.

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

cloudvirt1031.eqiad.wmnet:

  • - receive in system on procurement task T243471
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

cloudvirt1032.eqiad.wmnet:

  • - receive in system on procurement task T243471
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

cloudvirt1033.eqiad.wmnet:

  • - receive in system on procurement task T243471
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

cloudvirt1034.eqiad.wmnet:

  • - receive in system on procurement task T243471
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

cloudvirt1035.eqiad.wmnet:

  • - receive in system on procurement task T243471
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

cloudvirt1036.eqiad.wmnet:

  • - receive in system on procurement task T243471
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

cloudvirt1037.eqiad.wmnet:

  • - receive in system on procurement task T243471
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

cloudvirt1038.eqiad.wmnet:

  • - receive in system on procurement task T243471
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

cloudvirt1039.eqiad.wmnet:

  • - receive in system on procurement task T243471
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

Once the system(s) above have had all checkbox steps completed, this task can be resolved.
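The per-host checklists above are identical copies for each hostname in the range cloudvirt103[1-9]. As a minimal sketch of how such a list could be generated rather than pasted, the Python below expands a bracket range and emits one checklist per host. The helper names (`expand_hosts`, `render_checklist`) are hypothetical and not part of any Wikimedia tooling, and the step names are abbreviated from the checklist above.

```python
import re

# Hypothetical helper pattern: one numeric bracket range inside a hostname,
# e.g. "cloudvirt103[1-9].eqiad.wmnet" or "cloudvirt10[31-39].eqiad.wmnet".
RANGE = re.compile(
    r"^(?P<prefix>[^\[\]]*)\[(?P<start>\d+)-(?P<end>\d+)\](?P<suffix>[^\[\]]*)$"
)

# Step names abbreviated from the checklist in the task description.
STEPS = [
    "receive in system on procurement task T243471",
    "rack system with proposed racking plan & update netbox",
    "bios/drac/serial setup/testing",
    "mgmt dns entries added for both asset tag and hostname",
    "network port setup (description, enable, vlan)",
    "production dns entries added",
    "operations/puppet update",
    "OS installation",
    "puppet accept/initial run (with role:spare)",
    "host state in netbox set to staged",
]


def expand_hosts(pattern: str) -> list[str]:
    """Expand a single [start-end] range; return the input unchanged if none."""
    match = RANGE.match(pattern)
    if match is None:
        return [pattern]
    start, end = match["start"], match["end"]
    width = len(start)  # preserve zero padding such as [01-12]
    return [
        f"{match['prefix']}{number:0{width}d}{match['suffix']}"
        for number in range(int(start), int(end) + 1)
    ]


def render_checklist(pattern: str) -> str:
    """Build one plain-text checklist block per expanded hostname."""
    blocks = []
    for host in expand_hosts(pattern):
        lines = [f"{host}:", ""] + [f"  - {step}" for step in STEPS]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


if __name__ == "__main__":
    print(render_checklist("cloudvirt103[1-9].eqiad.wmnet"))
```

Both spellings used in this task, cloudvirt103[1-9] and cloudvirt10[31-39], expand to the same nine hostnames under this sketch.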

Event Timeline

RobH added a parent task: Unknown Object (Task). May 1 2020, 5:38 PM
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH unsubscribed.
Jclark-ctr renamed this task from (Need By: TDB) rack/setup/install cloudvirt103[1-4].eqiad.wmnet to (Need By: TDB) rack/setup/install cloudvirt10[31-39]eqiad.wmnet. May 21 2020, 2:44 PM

We did research in T248425: Test using trunked interfaces for cloudvirts and found that we could reduce the number of 10G ports from 2 to 1, but I'm not sure that is needed following T251632: (Need By: 2020-06-12) rack/setup/install WMCS 10G switches. I think we are probably at the point with this where @Andrew and @Bstorm need to sync up with the DCOps folks on the big picture plan for getting these 9 cloudvirts and the 12 cloudosd hosts from T251619: (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org all stuffed into the racks.

Regardless of whether or not we move existing cloudvirts from 2 ports to 1, we can definitely rack these new servers with only one 10g connection if we take the vlan steps described in T248425.

Yes, but we also will have 96 10G ports with the new switches, so if they can be provisioned first we should be in good shape for ports.

wiki_willy renamed this task from (Need By: TDB) rack/setup/install cloudvirt10[31-39]eqiad.wmnet to (Need By: 2020-06-20) rack/setup/install cloudvirt10[31-39]eqiad.wmnet. Jun 8 2020, 8:24 PM

host           rack  switchport  asset tag
cloudvirt1031  C8    7           WMF4817
cloudvirt1032  C8    8           WMF4816
cloudvirt1033  C8    9           WMF4815
cloudvirt1034  C8    10          WMF4814
cloudvirt1035  C8    11          WMF4813
cloudvirt1036  D5    12          WMF4812
cloudvirt1037  D5    13          WMF4811
cloudvirt1038  D5    14          WMF4810
cloudvirt1039  D5    15          WMF4832

Change 617502 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding mgmt dns for cloudvirt servers, netbox script has already been run

https://gerrit.wikimedia.org/r/617502

Change 617502 merged by Cmjohnson:
[operations/dns@master] Adding mgmt dns for cloudvirt servers, netbox script has already been run

https://gerrit.wikimedia.org/r/617502

Change 617557 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] Update dhcpd file with mac addresses for cloudvirt hosts

https://gerrit.wikimedia.org/r/617557

Change 617558 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/dns@master] Add records for cloudvirt103[1-9]

https://gerrit.wikimedia.org/r/617558

Change 617559 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Add production dns for cloudvirt1031-1039

https://gerrit.wikimedia.org/r/617559

Change 617559 merged by Cmjohnson:
[operations/dns@master] Add production dns for cloudvirt1031-1039

https://gerrit.wikimedia.org/r/617559

Change 617557 merged by Cmjohnson:
[operations/puppet@production] Update and remove tabs in dhcpd file with mac addresses for cloudvirt hosts

https://gerrit.wikimedia.org/r/617557

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudvirt1031.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202007310013_andrew_10787.log.

Completed auto-reimage of hosts:

['cloudvirt1031.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1031.eqiad.wmnet']

Change 617590 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] cloudvirt103[1-9]: use a simple one-drive raid config

https://gerrit.wikimedia.org/r/617590

Change 617558 abandoned by Andrew Bogott:
[operations/dns@master] Add records for cloudvirt103[1-9]

Reason:
Chris beat me to it!

https://gerrit.wikimedia.org/r/617558

Change 617590 merged by Andrew Bogott:
[operations/puppet@production] cloudvirt103[1-9]: use a simple one-volume raid config

https://gerrit.wikimedia.org/r/617590

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudvirt1031.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202007310238_andrew_30326.log.

Completed auto-reimage of hosts:

['cloudvirt1031.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1031.eqiad.wmnet']

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudvirt1032.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202007310256_andrew_31961.log.

Completed auto-reimage of hosts:

['cloudvirt1032.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1032.eqiad.wmnet']

Change 617593 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] cloudvirt103[1-9]: puppetize as thinvirts

https://gerrit.wikimedia.org/r/617593

Change 617593 merged by Andrew Bogott:
[operations/puppet@production] cloudvirt103[1-9]: puppetize as thinvirts

https://gerrit.wikimedia.org/r/617593

Change 617595 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] cloudvirt103[1-9] -> debian stretch

https://gerrit.wikimedia.org/r/617595

Change 617595 merged by Andrew Bogott:
[operations/puppet@production] cloudvirt103[1-9] -> debian stretch

https://gerrit.wikimedia.org/r/617595

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudvirt1031.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202007310334_andrew_4333.log.

Completed auto-reimage of hosts:

['cloudvirt1031.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1031.eqiad.wmnet']

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudvirt1031.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202007310338_andrew_6188.log.

Completed auto-reimage of hosts:

['cloudvirt1031.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1031.eqiad.wmnet']

Change 617596 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] nova-compute: Remove a reference to a (now-not-always-present) mountpoint

https://gerrit.wikimedia.org/r/617596

Change 617596 merged by Andrew Bogott:
[operations/puppet@production] nova-compute: Remove a reference to a (now-not-always-present) mountpoint

https://gerrit.wikimedia.org/r/617596

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudvirt1031.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202007310404_andrew_8596.log.

Completed auto-reimage of hosts:

['cloudvirt1031.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1031.eqiad.wmnet']

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudvirt1032.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202007310439_andrew_15860.log.

Completed auto-reimage of hosts:

['cloudvirt1032.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1032.eqiad.wmnet']

Change 617686 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] cloudvirt103[1-9]: move to insetup until I can figure out what's happening

https://gerrit.wikimedia.org/r/617686

Change 617686 merged by Andrew Bogott:
[operations/puppet@production] cloudvirt103[1-9]: move to insetup until I can figure out what's happening

https://gerrit.wikimedia.org/r/617686

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudvirt1031.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202007311515_andrew_7228.log.

Completed auto-reimage of hosts:

['cloudvirt1031.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudvirt1032.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202007311604_andrew_14877.log.

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudvirt1031.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202007311607_andrew_16020.log.

Completed auto-reimage of hosts:

['cloudvirt1032.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['cloudvirt1031.eqiad.wmnet']

and were ALL successful.

Change 617742 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Update role for cloudvirt1031 and 1032

https://gerrit.wikimedia.org/r/617742

Change 617742 merged by Andrew Bogott:
[operations/puppet@production] Update role for cloudvirt1031 and 1032

https://gerrit.wikimedia.org/r/617742

Change 617748 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] cloudvirt103[1-9]: rename nics for Stretch

https://gerrit.wikimedia.org/r/617748

Change 617748 merged by Andrew Bogott:
[operations/puppet@production] cloudvirt103[1-9]: rename nics for Stretch

https://gerrit.wikimedia.org/r/617748

Andrew added a subscriber: ayounsi.

I have cloudvirt1031 and 1032 running nova-compute, and things look right from the host OS.

Guest VMs can't access the network at all. They show eth0 as being up but can't reach any of the neutron infra (e.g. for dhcp). That makes me think there's still a rule missing in the vlan setup, which I think is a question for @ayounsi (or possibly for Arturo, but he's out for several weeks).

Indeed, a misconfiguration on my side: the vlan was configured as access instead of trunk.

[edit interfaces interface-range vlan-cloud-instances2-eqiad unit 0 family ethernet-switching]
-      interface-mode access;
+      interface-mode trunk;

replace pattern vlan-cloud-instances2-eqiad with cloud-instances2-eqiad-trunk

Should be good now!
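As a rough illustration of why the access/trunk difference would cut guest VMs off while the host itself looked fine, here is a toy Python model. It is a simplification and not a description of Junos internals: it assumes, in line with the trunked-interface approach from T248425, that instance traffic arrives at the switch port carrying an 802.1Q tag, and it ignores native-vlan handling. The VLAN ID 100 is a hypothetical placeholder.

```python
from dataclasses import dataclass
from typing import Optional

INSTANCES_VLAN = 100  # hypothetical VLAN ID, for illustration only


@dataclass(frozen=True)
class Port:
    mode: str          # "access" or "trunk"
    vlans: frozenset   # VLAN IDs the port is a member of


def accepts(port: Port, tag: Optional[int]) -> bool:
    """Return True if the port forwards a frame with 802.1Q tag `tag`.

    `tag=None` stands for an untagged frame.
    """
    if port.mode == "access":
        # Simplified: an access port only handles untagged frames.
        return tag is None
    if port.mode == "trunk":
        # Simplified: a trunk forwards tagged frames for its member VLANs
        # (native-vlan handling of untagged frames is ignored here).
        return tag is not None and tag in port.vlans
    raise ValueError(f"unknown port mode: {port.mode}")


if __name__ == "__main__":
    before = Port(mode="access", vlans=frozenset({INSTANCES_VLAN}))
    after = Port(mode="trunk", vlans=frozenset({INSTANCES_VLAN}))
    print("tagged guest frame, access port:", accepts(before, INSTANCES_VLAN))
    print("tagged guest frame, trunk port: ", accepts(after, INSTANCES_VLAN))
```

Under these assumptions, tagged guest frames are dropped on the access port and forwarded on the trunk port, which is consistent with the symptom reported above.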

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudvirt1031.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202008031405_andrew_32611.log.

Completed auto-reimage of hosts:

['cloudvirt1031.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1031.eqiad.wmnet']

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudvirt1032.eqiad.wmnet', 'cloudvirt1033.eqiad.wmnet', 'cloudvirt1034.eqiad.wmnet', 'cloudvirt1035.eqiad.wmnet', 'cloudvirt1036.eqiad.wmnet', 'cloudvirt1037.eqiad.wmnet', 'cloudvirt1038.eqiad.wmnet', 'cloudvirt1039.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202008031454_andrew_12610.log.

Networking looks better -- thanks @ayounsi

Completed auto-reimage of hosts:

['cloudvirt1032.eqiad.wmnet', 'cloudvirt1038.eqiad.wmnet', 'cloudvirt1033.eqiad.wmnet', 'cloudvirt1039.eqiad.wmnet', 'cloudvirt1036.eqiad.wmnet', 'cloudvirt1035.eqiad.wmnet', 'cloudvirt1034.eqiad.wmnet', 'cloudvirt1037.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1032.eqiad.wmnet', 'cloudvirt1038.eqiad.wmnet', 'cloudvirt1033.eqiad.wmnet', 'cloudvirt1039.eqiad.wmnet', 'cloudvirt1036.eqiad.wmnet', 'cloudvirt1035.eqiad.wmnet', 'cloudvirt1034.eqiad.wmnet', 'cloudvirt1037.eqiad.wmnet']

Andrew updated the task description.

All hosts are up and running canary VMs. I've marked them as 'active' in netbox.
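The wmf-auto-reimage completion messages in the timeline follow a fixed shape: a list of completed hosts, followed either by "and were ALL successful." or by "Of which those FAILED:" and a second list. A minimal sketch of turning one such message into succeeded and failed host lists is shown below. The function name `summarize` is hypothetical, and the parsing relies only on the message format as it appears in this task, not on any documented output contract of the script.

```python
import ast


def summarize(report: str) -> dict[str, list[str]]:
    """Split one completion message into succeeded and failed host lists."""
    # Host lists appear as Python-style list literals on their own lines.
    host_lists = [
        ast.literal_eval(line.strip())
        for line in report.splitlines()
        if line.strip().startswith("[")
    ]
    if not host_lists:
        raise ValueError("no host list found in report")
    completed = host_lists[0]
    # A second list is only present after "Of which those FAILED:".
    failed = host_lists[1] if len(host_lists) > 1 else []
    succeeded = [host for host in completed if host not in failed]
    return {"succeeded": succeeded, "failed": failed}


# Messages copied from the timeline above.
FAILED_REPORT = """Completed auto-reimage of hosts:

['cloudvirt1031.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1031.eqiad.wmnet']
"""

OK_REPORT = """Completed auto-reimage of hosts:

['cloudvirt1031.eqiad.wmnet']

and were ALL successful.
"""

if __name__ == "__main__":
    print(summarize(FAILED_REPORT))
    print(summarize(OK_REPORT))
```

A FAILED entry in these messages does not say which step failed; the log file named in the corresponding launch message is the place to look for that.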