
CloudVPS: drain and rebuild labvirt1009 as cloudvirt1009
Closed, Resolved · Public

Description

We are a bit short on capacity in eqiad1. Let's reallocate this virt server and rename it while we are at it.

root@labcontrol1001:~# nova list --all-tenants --host labvirt1009
+--------------------------------------+-----------------------+------------------+--------+------------+-------------+-------------------------------------+
| ID                                   | Name                  | Tenant ID        | Status | Task State | Power State | Networks                            |
+--------------------------------------+-----------------------+------------------+--------+------------+-------------+-------------------------------------+
| b95898d3-2d2c-4289-b400-2e81ef31aea4 | fullstackd-1550230187 | admin-monitoring | ACTIVE | -          | Running     | public=10.68.23.13                  |
| 0e82f3c8-af65-433a-89dc-0f3425e7f585 | maps-tiles2           | maps             | ACTIVE | -          | Running     | public=10.68.17.110                 |
| 09f60ae1-78aa-4d17-8ff7-6d5fb29006dc | tools-bastion-03      | tools            | ACTIVE | -          | Running     | public=10.68.23.58, 208.80.155.163  |
| 2c0cf363-c7c3-42ad-94bd-e586f2492321 | tools-exec-1420       | tools            | ACTIVE | -          | Running     | public=10.68.21.42, 208.80.155.148  |
| 5fe2782a-13fa-4b10-baf1-f577dc698d7d | tools-exec-1427       | tools            | ACTIVE | -          | Running     | public=10.68.16.94, 208.80.155.200  |
| 52ac2af1-fea2-4793-8d08-c850a186acf5 | tools-exec-1428       | tools            | ACTIVE | -          | Running     | public=10.68.20.54, 208.80.155.195  |
| 5a41a2b1-5bdd-4d52-ba1c-72273b4fe6f3 | tools-exec-1440       | tools            | ACTIVE | -          | Running     | public=10.68.22.236, 208.80.155.215 |
+--------------------------------------+-----------------------+------------------+--------+------------+-------------+-------------------------------------+
  • drain labvirt1009
  • check RAID configuration to create spare disks
  • disable puppet in labvirt1009
  • merge puppet patch to rename the host, get the new Debian installer working and disable notifications (rename hieradata/hosts/labvirt1009.yaml to cloudvirt1009.yaml and add "profile::base::notifications: disabled" temporarily)
  • merge dns patch to add the new FQDNs (partial, the old mgmt names still remain)
  • run the wmf-auto-reimage-host script (used old-school method)
  • merge DNS cleanup patch
  • merge puppet patch to re-enable notifications (remove "profile::base::notifications")
  • netbox update https://netbox.wikimedia.org/dcim/devices/1453/
  • update docs https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Deployments
  • physical relabeling and switch port description T216281
  • done

Details

Related Gerrit Patches:
[operations/dns@master] purge references to labvirt1009
[operations/puppet@production] cloudvirt1009 - Network config fixes
[operations/puppet@production] cloudvirt1009 - Define correct role
[operations/puppet@production] cloudvirt1009 - Define partman recipe
[operations/puppet@production] wmcs: Rename labvirt1009 to cloudvirt1009
[operations/dns@master] wmcs: Rename labvirt1009 to cloudvirt1009
[operations/puppet@production] cloudvps: main: depool labvirt1009

Event Timeline

Restricted Application added a subscriber: Aklapper. · Feb 15 2019, 11:43 AM
aborrero renamed this task from CloudVPS: Drain and rebuild labvirt1009 as cloudvirt10 to CloudVPS: drain and rebuild labvirt1009 as cloudvirt10. · Feb 15 2019, 11:43 AM
aborrero renamed this task from CloudVPS: drain and rebuild labvirt1009 as cloudvirt10 to CloudVPS: drain and rebuild labvirt1009 as cloudvirt1009.
aborrero triaged this task as High priority.
aborrero moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

Mentioned in SAL (#wikimedia-cloud) [2019-02-15T12:22:17Z] <arturo> T216239 draining labvirt1009 with a command like this: root@cloudcontrol1004:~# wmcs-cold-migrate --region eqiad --nova-db nova 2c0cf363-c7c3-42ad-94bd-e586f2492321 labvirt1001
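
The drain step logged above runs one wmcs-cold-migrate invocation per instance. A minimal sketch of generating those command lines from `nova list` output follows. It only prints commands and executes nothing. The helper names (`parse_instances`, `drain_commands`) are hypothetical, the flags are only the ones visible in the SAL entry, and sending every instance to labvirt1001 is an assumption made for illustration.

```python
import shlex

# Three rows copied from the `nova list` output in the task description.
NOVA_LIST = """\
+--------------------------------------+------------------+--------+
| ID | Name | Tenant ID | Status | Task State | Power State | Networks |
+--------------------------------------+------------------+--------+
| 0e82f3c8-af65-433a-89dc-0f3425e7f585 | maps-tiles2 | maps | ACTIVE | - | Running | public=10.68.17.110 |
| 09f60ae1-78aa-4d17-8ff7-6d5fb29006dc | tools-bastion-03 | tools | ACTIVE | - | Running | public=10.68.23.58, 208.80.155.163 |
| 2c0cf363-c7c3-42ad-94bd-e586f2492321 | tools-exec-1420 | tools | ACTIVE | - | Running | public=10.68.21.42, 208.80.155.148 |
+--------------------------------------+------------------+--------+
"""


def parse_instances(table):
    """Return (instance_id, name) pairs from a nova-style ASCII table."""
    rows = []
    for line in table.splitlines():
        if not line.startswith("|"):
            continue  # skip the +-----+ separator lines
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) < 2 or cells[0] == "ID":
            continue  # skip the header row
        rows.append((cells[0], cells[1]))
    return rows


def drain_commands(table, target, region="eqiad", nova_db="nova"):
    """Build one wmcs-cold-migrate command line per instance (dry run)."""
    return [
        shlex.join(
            ["wmcs-cold-migrate", "--region", region,
             "--nova-db", nova_db, instance_id, target]
        )
        for instance_id, _name in parse_instances(table)
    ]


# Assumption: a single target host for every instance.
for command in drain_commands(NOVA_LIST, "labvirt1001"):
    print(command)
```

In practice the target host would be chosen per instance according to available capacity; the sketch only shows how the per-instance commands relate to the listing.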

Mentioned in SAL (#wikimedia-operations) [2019-02-15T12:30:31Z] <arturo> T216239 schedule 1week of icinga downtime for labvirt1019

Mentioned in SAL (#wikimedia-cloud) [2019-02-15T13:10:38Z] <arturo> T216239 labvirt1019 has been drained

Mentioned in SAL (#wikimedia-operations) [2019-02-15T13:11:02Z] <arturo> T216239 labvirt1019 has been drained of any workload

aborrero updated the task description. · Feb 15 2019, 1:11 PM

Change 490845 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloudvps: main: depool labvirt1009

https://gerrit.wikimedia.org/r/490845

Change 490845 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloudvps: main: depool labvirt1009

https://gerrit.wikimedia.org/r/490845

aborrero added a comment. · Edited · Feb 15 2019, 1:45 PM

For the record, the disk usage before any operation on the RAID:

aborrero@labvirt1009:~ 4s $ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            189G   12K  189G   1% /dev
tmpfs            38G  1.1M   38G   1% /run
/dev/sda1        92G  5.5G   82G   7% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
none            5.0M     0  5.0M   0% /run/lock
none            189G  4.0K  189G   1% /run/shm
none            100M     0  100M   0% /run/user
/dev/sdb1       2.2T   86G  2.1T   4% /var/lib/nova/instances
aborrero@labvirt1009:~ $ df
Filesystem      1K-blocks     Used  Available Use% Mounted on
udev            198104956       12  198104944   1% /dev
tmpfs            39623524     1104   39622420   1% /run
/dev/sda1        95990796  5701700   85389912   7% /
none                    4        0          4   0% /sys/fs/cgroup
none                 5120        0       5120   0% /run/lock
none            198117616        4  198117612   1% /run/shm
none               102400        0     102400   0% /run/user
/dev/sdb1      2340340992 89305988 2251035004   4% /var/lib/nova/instances
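
The two listings above report the same sizes, once in 1K blocks and once in human-readable form. As a rough cross-check, the sketch below converts 1K-block counts to the `df -h` style. It assumes power-of-1024 units and round-up behaviour (one decimal below 10, whole numbers otherwise), which matches the figures shown here. `humanize_kib` is a hypothetical helper, not part of any tool used in this task.

```python
import math

UNITS = ["K", "M", "G", "T", "P"]


def humanize_kib(blocks_1k):
    """Format a count of 1 KiB blocks the way the `df -h` listing does."""
    if blocks_1k == 0:
        return "0"
    value = float(blocks_1k)
    unit = 0
    # Step up through K, M, G, T while the value is 1024 or more.
    while value >= 1024 and unit < len(UNITS) - 1:
        value /= 1024
        unit += 1
    if value < 10:
        # Below 10: keep one decimal place, always rounding up.
        tenths = math.ceil(value * 10) / 10
        if tenths < 10:
            return f"{tenths:.1f}{UNITS[unit]}"
    # 10 and above: round up to a whole number.
    return f"{math.ceil(value)}{UNITS[unit]}"


# Figures taken from the /dev/sdb1 and /dev/sda1 rows above.
for label, blocks in [
    ("sdb1 size", 2340340992),
    ("sdb1 used", 89305988),
    ("sda1 size", 95990796),
    ("sda1 used", 5701700),
]:
    print(label, humanize_kib(blocks))
```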

And:

14:52 <arturo> +--------------------------Available Logical Drives----------------------------+
14:52 <arturo> | # 1,       RAID 1,  146.78 GB,                     OK                        |
14:52 <arturo> | # 2,     RAID 1+0,    2.39 TB,                     OK                        |
14:52 <arturo> +------------------------------------------------------------------------------+
aborrero updated the task description. · Feb 15 2019, 1:50 PM
GTirloni updated the task description. · Feb 15 2019, 2:54 PM

Change 490858 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] wmcs: Rename labvirt1009 to cloudvirt1009

https://gerrit.wikimedia.org/r/490858

Change 490860 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/dns@master] wmcs: Rename labvirt1009 to cloudvirt1009

https://gerrit.wikimedia.org/r/490860

Change 490860 merged by GTirloni:
[operations/dns@master] wmcs: Rename labvirt1009 to cloudvirt1009

https://gerrit.wikimedia.org/r/490860

Change 490858 merged by GTirloni:
[operations/puppet@production] wmcs: Rename labvirt1009 to cloudvirt1009

https://gerrit.wikimedia.org/r/490858

Script wmf-auto-reimage was launched by gtirloni on cumin1001.eqiad.wmnet for hosts:

labvirt1009.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201902151519_gtirloni_81470_labvirt1009_eqiad_wmnet.log.

Change 490871 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] cloudvirt1009 - Define partman recipe

https://gerrit.wikimedia.org/r/490871

Change 490871 merged by GTirloni:
[operations/puppet@production] cloudvirt1009 - Define partman recipe

https://gerrit.wikimedia.org/r/490871

Completed auto-reimage of hosts:

['cloudvirt1009.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1009.eqiad.wmnet']

Script wmf-auto-reimage was launched by gtirloni on cumin1001.eqiad.wmnet for hosts:

cloudvirt1009.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201902151627_gtirloni_94981_cloudvirt1009_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudvirt1009.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1009.eqiad.wmnet']

Script wmf-auto-reimage was launched by gtirloni on cumin1001.eqiad.wmnet for hosts:

cloudvirt1009.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201902151627_gtirloni_95079_cloudvirt1009_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudvirt1009.eqiad.wmnet']

and were ALL successful.

GTirloni updated the task description. · Feb 15 2019, 5:14 PM

Change 490889 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] cloudvirt1009 - Define correct role

https://gerrit.wikimedia.org/r/490889

Change 490889 merged by GTirloni:
[operations/puppet@production] cloudvirt1009 - Define correct role

https://gerrit.wikimedia.org/r/490889

aborrero updated the task description. · Feb 15 2019, 6:37 PM

Change 490902 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] cloudvirt1009 - Network config fixes

https://gerrit.wikimedia.org/r/490902

Change 490902 merged by GTirloni:
[operations/puppet@production] cloudvirt1009 - Network config fixes

https://gerrit.wikimedia.org/r/490902

Change 490930 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/dns@master] purge references to labvirt1009

https://gerrit.wikimedia.org/r/490930

Change 490930 merged by Andrew Bogott:
[operations/dns@master] purge references to labvirt1009

https://gerrit.wikimedia.org/r/490930

Andrew updated the task description. · Feb 15 2019, 9:07 PM
Andrew updated the task description.

This is moved and running a canary. It's good to have in case of emergencies but ideally we'll update it with 10Gb ethernet before it takes on production VMs.

aborrero lowered the priority of this task from High to Medium. · Mar 8 2019, 12:58 PM

Lowering priority. @Andrew are you sure this host has 10Gb hardware NICs available?

Unless it is somehow disabled in the BIOS, I only see this:

aborrero@cloudvirt1009:~ $ lspci | grep -i ethernet
0a:00.0 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0a:00.1 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0a:00.2 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0a:00.3 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
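
The check above is done by eye. A minimal sketch of the same check in code follows, assuming the `lspci` description string is enough to tell a Gigabit controller from a 10G one. That is a heuristic, and a NIC disabled in firmware would not appear in the listing at all. `ethernet_controllers` and `looks_10g` are hypothetical helper names.

```python
import re

# Output copied from the lspci listing above.
LSPCI = """\
0a:00.0 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0a:00.1 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0a:00.2 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0a:00.3 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
"""


def ethernet_controllers(text):
    """Return (pci_address, description) pairs for Ethernet controllers."""
    found = []
    for line in text.splitlines():
        match = re.match(r"^(\S+) Ethernet controller: (.+)$", line)
        if match:
            found.append((match.group(1), match.group(2)))
    return found


def looks_10g(description):
    """Heuristic: the description mentions 10G, 10Gb, 10-Gigabit or similar."""
    return re.search(r"\b10[\s-]?G", description, re.IGNORECASE) is not None


controllers = ethernet_controllers(LSPCI)
ten_gig = [addr for addr, desc in controllers if looks_10g(desc)]
print(len(controllers), "Ethernet controllers visible,", len(ten_gig), "look like 10G")
```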
Andrew added a comment. · Mar 8 2019, 2:11 PM

Unless it is somehow disabled in the BIOS, I only see this:

10Gb NICs are disabled in the BIOS for every one of our cloudvirts that isn't already running 10Gb. Details in https://phabricator.wikimedia.org/T216195

aborrero removed aborrero as the assignee of this task. · Mar 11 2019, 1:12 PM
aborrero lowered the priority of this task from Medium to Low.
aborrero moved this task from Doing to Important on the cloud-services-team (Kanban) board.

The NIC is being tracked in T216324: relocate/reimage cloudvirt1009 with 10G interfaces. This task can be closed once the NIC is in place and the relabeling has been done in T216281: Update label and switch to rename labvirt1009 to cloudvirt1009

GTirloni removed a subscriber: GTirloni. · Mar 21 2019, 9:11 PM
Andrew closed this task as Resolved. · May 16 2019, 9:11 PM
Andrew updated the task description.