rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev
Closed, Resolved, Public

Description

This task will cover the racking, setup, and installation of 4 new cloud-dev hosts purchased on T210781.

Hostname Proposal:

This should be confirmed by cloud-services-team!

T210781 states that the order for it covers: 1 x cloudcontrol2xxx-dev & 1-3 x cloudvirt2xxx-dev. There are NO cloudanything-dev in codfw at present, so these will have the following hostnames:

cloudcontrol2001-dev
cloudvirt2001-dev
cloudvirt2002-dev
cloudvirt2003-dev
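
The task title and several comments use bracket shorthand such as cloudvirt200[123]-dev and cloudvirt200[1-3]-dev for these hosts. As a minimal sketch (the helper name expand_hosts is made up for illustration and is not part of any tooling mentioned on this task), the shorthand can be expanded into the individual hostnames like this:

```python
import re
from itertools import product
from typing import List


def _expand_set(body: str) -> List[str]:
    """Expand the inside of a bracket set: '123' or '1-3' both give 1, 2, 3."""
    out: List[str] = []
    for match in re.finditer(r"(\d)-(\d)|(\d)", body):
        if match.group(3) is not None:
            # A single digit listed on its own.
            out.append(match.group(3))
        else:
            # An inclusive single-digit range such as 1-3.
            low, high = int(match.group(1)), int(match.group(2))
            out.extend(str(n) for n in range(low, high + 1))
    return out


def expand_hosts(pattern: str) -> List[str]:
    """Expand bracket shorthand in a hostname pattern into full hostnames."""
    parts = re.split(r"(\[[^\]]+\])", pattern)
    choices = []
    for part in parts:
        if part.startswith("[") and part.endswith("]"):
            choices.append(_expand_set(part[1:-1]))
        else:
            choices.append([part])
    return ["".join(combo) for combo in product(*choices)]
```

Both spellings used on this task expand to the same three cloudvirt hostnames, and a name without brackets is returned unchanged.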

Racking Proposal:
This should be confirmed by netops and/or cloud-services-team!

@RobH is making a number of assumptions here, and will need both netops and cloud-services-team to confirm things before @Papaul spends time racking the cloudvirt200[1-3]-dev hosts.

cloudcontrol2001-dev: cloudcontrol100[34] in eqiad are in the public vlan, since it appears they need to interact with both the cloudvirt hosts and the rest of the world. cloudcontrol2001-dev will likewise be placed in the public vlan, so it can be racked in any 1G rack in ANY row. Don't rack it in c1-codfw, since that rack holds the system that will be its counterpart.

cloudvirt200[1-3]-dev: cloudvirts have to be in a row that has both the cloud-hosts1 and cloud-instances[12] vlans for proper networking support. Row B in eqiad has those vlans on the switch and they are in use (while they aren't in use on other rows), so it seems row B is the row to put all cloudvirts on in codfw.
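
The row selection rule above (a cloudvirt row must carry every required vlan) can be sketched as a simple set check. The inventory below is entirely hypothetical and only illustrates the shape of the check; the real vlan layout has to come from netops.

```python
from typing import Dict, Iterable, List, Set


def rows_with_vlans(row_vlans: Dict[str, Set[str]], required: Iterable[str]) -> List[str]:
    """Return the rows whose switch carries every vlan in `required`."""
    needed = set(required)
    return sorted(row for row, vlans in row_vlans.items() if needed <= vlans)


# HYPOTHETICAL inventory, for illustration only (not real switch data).
EXAMPLE_ROW_VLANS = {
    "A": {"public1", "private1"},
    "B": {"public1", "private1", "cloud-hosts1", "cloud-instances1", "cloud-instances2"},
    "C": {"public1", "private1"},
    "D": {"public1", "private1", "cloud-hosts1"},
}

candidate_rows = rows_with_vlans(
    EXAMPLE_ROW_VLANS, ["cloud-hosts1", "cloud-instances2"]
)
```

With this made-up inventory only row B qualifies, which matches the reasoning in the proposal, but the data itself is an assumption.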

cloudcontrol2001-dev:

  • - receive in system on procurement task T210781
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location) (B1)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible) (raid1.cfg)
  • - OS installation
  • - puppet accept/initial run
  • - handoff for service implementation

cloudvirt2001-dev:

  • - receive in system on procurement task T210781
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location) (B3)
  • - please ensure BOTH 1G interfaces are hooked up. cloudvirts use the first interface for the OS/host and the second interface for instance traffic.
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible) (raid1.cfg)
  • - OS installation
  • - puppet accept/initial run
  • - handoff for service implementation

cloudvirt2002-dev:

  • - receive in system on procurement task T210781
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location) (B5)
  • - please ensure BOTH 1G interfaces are hooked up. cloudvirts use the first interface for the OS/host and the second interface for instance traffic.
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible) (raid1.cfg)
  • - OS installation
  • - puppet accept/initial run
  • - handoff for service implementation

cloudvirt2003-dev:

  • - receive in system on procurement task T210781
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location) (B8)
  • - please ensure BOTH 1G interfaces are hooked up. cloudvirts use the first interface for the OS/host and the second interface for instance traffic.
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible) (raid1.cfg)
  • - OS installation
  • - puppet accept/initial run
  • - handoff for service implementation

Event Timeline

RobH triaged this task as Medium priority. Jan 23 2019, 12:08 AM

Ok, just confirmed with @ayounsi that row B is the row for cloudvirt hosts!

@Papaul: Unless the cloud-services-team states differently, I think you can move ahead on racking with what I put in the task description above!

The racking plan from @RobH looks right to me. cloudcontrol2001-dev will replace labtestcontrol2001, which is currently in rack B5, but I think the new server can go anywhere other than C1, which is where labtestcontrol2003 (which will become cloudcontrol2002-dev when we reimage it) lives today.

Change 486391 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/dns@master] DNS: Add mgmt DNS entries for cloudcontrol2001-dev and cloudvirt200[123]-dev

https://gerrit.wikimedia.org/r/486391

Change 486391 merged by Dzahn:
[operations/dns@master] DNS: Add mgmt DNS entries for cloudcontrol2001-dev and cloudvirt200[123]-dev

https://gerrit.wikimedia.org/r/486391

I find it really confusing that we are reusing numbering for these servers, even with the renaming for the new naming scheme.

It doesn't bother me, but I think you should feel free to rename/renumber things as you see fit. As I understand it, it's not a big deal for @Papaul to update labels.

I think I was able to identify and describe what was confusing me: T214499#4909226

@Andrew, should I use the partman recipe labvirt-ssd.cfg for all of these new servers?

@Andrew, can you also specify on this task which VLAN eth1 needs to be in for cloudvirt200[1-3]? Thanks.

vlan 2105.

Note that these new cloudvirt servers should be imaged with Debian Stretch, so the second interface might not be named eth1 (it may show up as eno50, eno2, or similar).
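
Since the interface name is not predictable across releases, a script that needs the second port should not hard-code eth1. A minimal sketch follows, assuming the instance-traffic port is simply the second wired NIC in sort order (that ordering is a heuristic of mine, not something confirmed on this task, and it sorts names as plain strings). The function names are made up for illustration.

```python
import socket
from typing import Iterable, List, Optional


def wired_candidates(names: Iterable[str]) -> List[str]:
    """Keep names that look like wired NICs (legacy ethN or predictable en*),
    skipping VLAN sub-interfaces such as eno2.2105."""
    return sorted(
        n for n in names
        if n.startswith(("eth", "en")) and "." not in n
    )


def second_nic(names: Iterable[str]) -> Optional[str]:
    """Return the second wired NIC in sort order, or None if there is only one."""
    candidates = wired_candidates(names)
    return candidates[1] if len(candidates) > 1 else None


def live_interface_names() -> List[str]:
    """Names of the interfaces present on the running host."""
    return [name for _index, name in socket.if_nameindex()]
```

On a host, second_nic(live_interface_names()) would give the candidate to check against the switch port configuration before relying on it.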

Change 486504 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/dns@master] DNS: Add production DNS enties for cloudcontrol2001-dev and cloudvirt200[123]-dev

https://gerrit.wikimedia.org/r/486504

@Andrew, should I use the partman recipe labvirt-ssd.cfg for all of these new servers?

It depends on what raid controller we have. That recipe expects to see one big raid10 configured in hardware; if that's possible then go ahead and use that partman recipe; if not then I'll dig out a better one.

@Andrew, there is no RAID controller on the new servers. They all have 2x200GB SSDs.

ok -- let's just use partman/raid1.cfg then, for now at least. Thanks!

Change 486700 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/puppet@production] Add DHCP MAC addrese and partman for cloudcontrol2001-dev and cloudvirt200[123]-dev

https://gerrit.wikimedia.org/r/486700

Change 486700 merged by Dzahn:
[operations/puppet@production] DHCP/partman: add cloudcontrol2001-dev and cloudvirt200[123]-dev

https://gerrit.wikimedia.org/r/486700

Change 486504 merged by Arturo Borrero Gonzalez:
[operations/dns@master] DNS: Add production DNS enties for cloudcontrol2001-dev and cloudvirt200[123]-dev

https://gerrit.wikimedia.org/r/486504

Change 488496 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/puppet@production] DHCP: Fix fixed-address name

https://gerrit.wikimedia.org/r/488496

Change 488496 merged by Papaul:
[operations/puppet@production] DHCP: Fix fixed-address name

https://gerrit.wikimedia.org/r/488496

second NIC configuration

cloudvirt2001-dev

Logical          Vlan                      TAG    MAC       STP          Logical           Tagging
interface        members                          limit     state        interface flags
ge-3/0/23.0                                       294912                                   tagged
                 cloud-instances2-b-codfw  2105   294912    Forwarding                     tagged

cloudvirt2002-dev

Logical          Vlan                      TAG    MAC       STP          Logical           Tagging
interface        members                          limit     state        interface flags
ge-5/0/22.0                                       294912                                   tagged
                 cloud-instances2-b-codfw  2105   294912    Forwarding                     tagged

cloudvirt2003-dev

Logical          Vlan                      TAG    MAC       STP          Logical           Tagging
interface        members                          limit     state        interface flags
ge-8/0/6.0                                        294912                                   tagged
                 cloud-instances2-b-codfw  2105   294912    Forwarding                     tagged
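
For anyone who wants to double-check the pasted switch output against the requested vlan (2105), here is a rough parsing sketch. It only assumes the column order visible in the output above (interface line first, then one line per vlan member with name, tag, MAC limit, and STP state); the names VlanMember and parse_members are made up for illustration.

```python
import re
from typing import List, NamedTuple, Optional


class VlanMember(NamedTuple):
    interface: str
    vlan: str
    tag: int
    state: str


# Matches logical interface names such as ge-3/0/23.0
_IFACE = re.compile(r"^[a-z]{2}-\d+/\d+/\d+\.\d+$")


def parse_members(output: str) -> List[VlanMember]:
    """Extract (interface, vlan, tag, STP state) from the pasted switch output."""
    members: List[VlanMember] = []
    current: Optional[str] = None
    for line in output.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if _IFACE.match(tokens[0]):
            # Start of a new logical interface block.
            current = tokens[0]
        elif current and len(tokens) >= 4 and tokens[1].isdigit():
            # Vlan member line: name, tag, MAC limit, STP state, ...
            members.append(VlanMember(current, tokens[0], int(tokens[1]), tokens[3]))
    return members


SAMPLE = """\
Logical          Vlan          TAG     MAC         STP         Logical           Tagging
interface        members               limit       state       interface flags
ge-3/0/23.0                            294912                                     tagged
                 cloud-instances2-b-codfw 2105 294912 Forwarding                  tagged
"""

parsed = parse_members(SAMPLE)
```

The header lines are skipped because they neither look like an interface name nor carry a numeric tag in the second column.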

@aborrero @Andrew all yours. Let me know if you have any questions.

Change 488896 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/dns@master] cloudcontrol2001-dev: fix PTR record

https://gerrit.wikimedia.org/r/488896

Change 488896 merged by Arturo Borrero Gonzalez:
[operations/dns@master] cloudcontrol2001-dev: fix PTR record

https://gerrit.wikimedia.org/r/488896

Change 488897 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloudcontrol2001-dev: spare system for now

https://gerrit.wikimedia.org/r/488897

Change 488897 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloudcontrol2001-dev: spare system for now

https://gerrit.wikimedia.org/r/488897

Change 488899 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloudvirt200X-dev: add roles in codfw1dev

https://gerrit.wikimedia.org/r/488899

Change 488899 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloudvirt200X-dev: add roles in codfw1dev

https://gerrit.wikimedia.org/r/488899

Change 488902 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] hiera: cloudvirt200X-dev: add hosts overrides

https://gerrit.wikimedia.org/r/488902

Change 488902 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] hiera: cloudvirt200X-dev: add hosts overrides

https://gerrit.wikimedia.org/r/488902

Change 488905 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] hiera: cloudvirt200X-dev: fix wrong hiera keys names

https://gerrit.wikimedia.org/r/488905

Change 488905 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] hiera: cloudvirt200X-dev: fix wrong hiera keys names

https://gerrit.wikimedia.org/r/488905

@Andrew, there is no RAID controller on the new servers. They all have 2x200GB SSDs.

ok -- let's just use partman/raid1.cfg then, for now at least. Thanks!

I will be switching the cloudvirts to partman/raid1-lvm.cfg. We need LVM for nova, at least on cloudvirt servers. I will leave cloudcontrol2001-dev as is.

Error: /Stage[main]/Profile::Openstack::Base::Nova::Compute::Service/Mount[/var/lib/nova/instances]: Could not evaluate: Execution of '/bin/mount /var/lib/nova/instances' returned 32: mount: special device /dev/mapper/tank-data does not exist
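
The path in that error follows the usual device-mapper naming for LVM, where the volume group and logical volume names are joined with a hyphen and any hyphen inside a name is doubled. Reading /dev/mapper/tank-data as volume group "tank" plus logical volume "data" is my inference from the path, not something stated on this task. A small sketch for building such paths and spotting missing ones (helper names are made up):

```python
import os
from typing import Iterable, List


def mapper_path(vg: str, lv: str) -> str:
    """Build the /dev/mapper path for an LVM volume.

    device-mapper joins VG and LV with '-', doubling any hyphen inside a name.
    """
    return "/dev/mapper/{}-{}".format(vg.replace("-", "--"), lv.replace("-", "--"))


def missing_devices(paths: Iterable[str]) -> List[str]:
    """Return the device paths that do not exist on this host."""
    return [p for p in paths if not os.path.exists(p)]
```

Running missing_devices([mapper_path("tank", "data")]) on a freshly imaged host shows whether the partman recipe actually created the volume that the puppet mount expects.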

Change 488914 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloudvirt200[123]-dev: use partman/raid1-lvm-xfs-nova.cfg

https://gerrit.wikimedia.org/r/488914

Actually, I think partman/raid1-lvm-xfs-nova.cfg is suitable. It is already in use by other virt servers.

Change 488914 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloudvirt200[123]-dev: use partman/raid1-lvm-xfs-nova.cfg

https://gerrit.wikimedia.org/r/488914
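
To summarise the recipe discussion on this task: labvirt-ssd.cfg expects one big hardware RAID10, hosts without a RAID controller fall back to raid1.cfg, and cloudvirts that need LVM for nova use raid1-lvm-xfs-nova.cfg. A minimal sketch of that decision, with a made-up function name and boolean inputs that are my own simplification:

```python
def choose_partman_recipe(has_hw_raid: bool, needs_lvm: bool) -> str:
    """Pick a partman recipe file name following the reasoning in this thread."""
    if has_hw_raid:
        # Recipe that expects a single hardware RAID10 volume.
        return "labvirt-ssd.cfg"
    if needs_lvm:
        # Software RAID1 plus LVM, as chosen for the cloudvirts.
        return "raid1-lvm-xfs-nova.cfg"
    # Plain software RAID1, as kept for cloudcontrol2001-dev.
    return "raid1.cfg"
```

For these hosts that gives raid1.cfg for cloudcontrol2001-dev and raid1-lvm-xfs-nova.cfg for cloudvirt200[1-3]-dev, none of which have a RAID controller.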

Script wmf-auto-reimage was launched by aborrero on cumin2001.codfw.wmnet for hosts:

['cloudvirt2001-dev.codfw.wmnet', 'cloudvirt2002-dev.codfw.wmnet', 'cloudvirt2003-dev.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201902071202_aborrero_1966.log.

Mentioned in SAL (#wikimedia-operations) [2019-02-07T12:03:16Z] <arturo> T214448 reimaging again cloudvirt200[1-3]-dev.codfw.wmnet

Change 488918 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] hiera: cloudvirt200[1-3]-dev: fix extra LVM volume name

https://gerrit.wikimedia.org/r/488918

Change 488918 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] hiera: cloudvirt200[1-3]-dev: fix extra LVM volume name

https://gerrit.wikimedia.org/r/488918

I'm seeing this in cloudvirt2003-dev:

[   13.270987] kvm: disabled by bios
[   13.729525] kvm: disabled by bios
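
That kernel message usually means hardware virtualization is switched off in the firmware settings. As far as I know the CPU can still advertise the vmx or svm flag in /proc/cpuinfo in that state, so checking the kernel log is the more direct test. A small sketch for checking both from captured text (helper names are made up; feeding in real dmesg or cpuinfo output is left to the caller):

```python
from typing import Set


def cpu_virt_flags(cpuinfo_text: str) -> Set[str]:
    """Return which of the vmx/svm flags appear in /proc/cpuinfo-style text."""
    flags: Set[str] = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            flags.update(line.split(":", 1)[1].split())
    return flags & {"vmx", "svm"}


def kvm_disabled_by_bios(kernel_log: str) -> bool:
    """True if the kernel log contains the 'kvm: disabled by bios' message."""
    return any("kvm: disabled by bios" in line for line in kernel_log.splitlines())


SAMPLE_LOG = """\
[   13.270987] kvm: disabled by bios
[   13.729525] kvm: disabled by bios
"""

needs_bios_fix = kvm_disabled_by_bios(SAMPLE_LOG)
```

If the message is present, the fix is on the firmware side (enabling virtualization in the BIOS setup), not in puppet.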

Change 488926 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] hiera: cloudvirt200[1-3]-dev: fix again instance_dev hiera key

https://gerrit.wikimedia.org/r/488926

Change 488926 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] hiera: cloudvirt200[1-3]-dev: fix again instance_dev hiera key

https://gerrit.wikimedia.org/r/488926

Script wmf-auto-reimage was launched by aborrero on cumin2001.codfw.wmnet for hosts:

['cloudvirt2003-dev.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201902071344_aborrero_25895.log.

Completed auto-reimage of hosts:

['cloudvirt2003-dev.codfw.wmnet']

and were ALL successful.

On cloudcontrol2001-dev:

Server Error: Invalid relationship: Class[Profile::Openstack::Base::Keystone::Service] { require => Class[Profile::Openstack::Base::Keystone::Db] }, because Class[Profile::Openstack::Base::Keystone::Db] doesn't seem to be in the catalog

aborrero updated the task description.

This should be done now.

This is solved already. Thanks for reporting (I fixed it because I saw this report :-P)

FYI:

"CRITICAL - degraded: The system is operational but one or more units failed."
"CRITICAL: Status of the systemd unit glance_rsync_images"

I think these are resolved now (I just reinstalled some packages; not sure what went wrong originally.)

aborrero removed aborrero as the assignee of this task.

All notifications for these hosts are (permanently?) disabled. Wondering if that is desired or maybe they should just not be in monitoring in the first place if by definition they are "dev" hosts.

Currently we have the following alerts, but nobody gets notifications about them because notifications are disabled:

cloudcontrol2001-dev - systemd state
cloudnet2002-dev - systemd state
cloudnet2003-dev - Check whether microcode mitigations for CPU vulnerabilities are applied, DPKG state
cloudvirt2001-dev - DPKG
cloudvirt2002-dev - DPKG
cloudvirt2003-dev - systemd state

ACKed to handle Icinga alerts