Maniphest T208733

Rename labvirt1017 to cloudvirt1017, move to eqiad1
Closed, Resolved · Public

Assigned To

	aborrero

Authored By

	Andrew
	Nov 5 2018, 3:59 PM

Description

After quarry is migrated later today, this host can be moved to the new region.

Reimage + rename this server to the new naming scheme.

Timeline would be:

disable puppet on labvirt1017
merge puppet patch to rename and get the new debian installer working
merge DNS patch to add the new FQDNs (partial, the old mgmt names still remain)
run the wmf-auto-reimage-host script (failed; used the old-school method instead)
merge DNS cleanup patch
netbox update https://netbox.wikimedia.org/dcim/devices/1357/
update docs https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Deployments
physical relabeling and switch port description (T208945)
done
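The rename step in the timeline above follows a simple pattern: the host prefix changes from labvirt to cloudvirt, and both the host FQDN and the mgmt FQDN are derived from it. A minimal sketch of that mapping, assuming the mgmt name is always `<host>.mgmt.<domain>` as in this task (the function name `renamed_fqdns` is hypothetical and not part of any WMF tooling):

```python
def renamed_fqdns(old_fqdn, old_prefix="labvirt", new_prefix="cloudvirt"):
    """Return (new host FQDN, new mgmt FQDN) for a name like labvirt1017.eqiad.wmnet."""
    host, sep, domain = old_fqdn.partition(".")
    if not sep or not host.startswith(old_prefix):
        raise ValueError(f"unexpected host name: {old_fqdn!r}")
    # Keep the numeric suffix, swap only the prefix.
    new_host = new_prefix + host[len(old_prefix):]
    return f"{new_host}.{domain}", f"{new_host}.mgmt.{domain}"


new_fqdn, new_mgmt = renamed_fqdns("labvirt1017.eqiad.wmnet")
print(new_fqdn)
print(new_mgmt)
```

For labvirt1017.eqiad.wmnet this yields the two names that the DNS and puppet patches introduce.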

Details

Subject	Repo	Branch	Lines +/-
Nova: add cloudvirt1017 to the scheduler pool	operations/puppet	production	+2 -1
cloudvps: hieradata for cloudvirt1017	operations/puppet	production	+0 -0
cloudvps: rename+reimage labvirt1017 as cloudvirt1017	operations/dns	master	+5 -4
cloudvps: reimage+rename labvirt1017 as cloudvirt1017	operations/puppet	production	+11 -11


Related Objects

		Status	Subtype	Assigned	Task
		Resolved		aborrero	T208733 Rename labvirt1017 to cloudvirt1017, move to eqiad1
		Resolved		Cmjohnson	T208945 Relabel labvirt1017.eqiad.wmnet as cloudvirt1017.eqiad.wmnet

Event Timeline

Andrew created this task. Nov 5 2018, 3:59 PM

Restricted Application added a subscriber: Aklapper. Nov 5 2018, 3:59 PM

I will handle this task with @GTirloni (if he is interested).

Sure, let's do it!

GTirloni updated the task description. Nov 7 2018, 12:59 PM

GTirloni removed a subscriber: Aklapper.

aborrero updated the task description. Nov 7 2018, 1:00 PM

GTirloni mentioned this in T208945: Relabel labvirt1017.eqiad.wmnet as cloudvirt1017.eqiad.wmnet. Nov 7 2018, 1:04 PM

GTirloni updated the task description. Nov 7 2018, 1:07 PM

Change 472140 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] cloudvps: reimage+rename labvirt1017 as cloudvirt1017

https://gerrit.wikimedia.org/r/472140

gerritbot added a project: Patch-For-Review. Nov 7 2018, 1:12 PM

aborrero updated the task description. Nov 7 2018, 1:16 PM

Change 472142 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/dns@master] cloudvps: rename+reimage labvirt1017 as cloudvirt1017

https://gerrit.wikimedia.org/r/472142

aborrero updated the task description. Nov 7 2018, 1:20 PM

aborrero updated the task description. Nov 7 2018, 1:23 PM

Change 472140 merged by GTirloni:
[operations/puppet@production] cloudvps: reimage+rename labvirt1017 as cloudvirt1017

https://gerrit.wikimedia.org/r/472140

Change 472142 merged by GTirloni:
[operations/dns@master] cloudvps: rename+reimage labvirt1017 as cloudvirt1017

https://gerrit.wikimedia.org/r/472142

Mentioned in SAL (#wikimedia-cloud) [2018-11-07T13:49:39Z] <arturo> T208733 moving labvirt1017 from main deployment to eqiad1 and renaming it to cloudvirt1017

aborrero updated the task description. Nov 7 2018, 1:53 PM

Script wmf-auto-reimage was launched by gtirloni on neodymium.eqiad.wmnet for hosts:

labvirt1017.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201811071411_gtirloni_21761_labvirt1017_eqiad_wmnet.log.

wmf-auto-reimage-host -p T208733 --rename cloudvirt1017.eqiad.wmnet --rename-mgmt cloudvirt1017.mgmt.eqiad.wmnet labvirt1017.eqiad.wmnet labvirt1017.mgmt.eqiad.wmnet

Completed auto-reimage of hosts:

['cloudvirt1017.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1017.eqiad.wmnet']

Script wmf-auto-reimage was launched by gtirloni on neodymium.eqiad.wmnet for hosts:

labvirt1017.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201811071422_gtirloni_24260_labvirt1017_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['labvirt1017.eqiad.wmnet']

Of which those FAILED:

['labvirt1017.eqiad.wmnet']

Script wmf-auto-reimage was launched by gtirloni on neodymium.eqiad.wmnet for hosts:

labvirt1017.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201811071429_gtirloni_25942_labvirt1017_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudvirt1017.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1017.eqiad.wmnet']

Script wmf-auto-reimage was launched by gtirloni on neodymium.eqiad.wmnet for hosts:

cloudvirt1017.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201811071549_gtirloni_14020_cloudvirt1017_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudvirt1017.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1017.eqiad.wmnet']

Script wmf-auto-reimage was launched by gtirloni on neodymium.eqiad.wmnet for hosts:

cloudvirt1017.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201811071549_gtirloni_14176_cloudvirt1017_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudvirt1017.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1017.eqiad.wmnet']

Script wmf-auto-reimage was launched by gtirloni on neodymium.eqiad.wmnet for hosts:

cloudvirt1017.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201811071615_gtirloni_21777_cloudvirt1017_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudvirt1017.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1017.eqiad.wmnet']

Script wmf-auto-reimage was launched by gtirloni on neodymium.eqiad.wmnet for hosts:

cloudvirt1017.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201811071621_gtirloni_23337_cloudvirt1017_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudvirt1017.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1017.eqiad.wmnet']

Script wmf-auto-reimage was launched by gtirloni on neodymium.eqiad.wmnet for hosts:

cloudvirt1017.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201811071622_gtirloni_24537_cloudvirt1017_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudvirt1017.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1017.eqiad.wmnet']

Script wmf-auto-reimage was launched by gtirloni on neodymium.eqiad.wmnet for hosts:

cloudvirt1017.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201811071639_gtirloni_29369_cloudvirt1017_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudvirt1017.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1017.eqiad.wmnet']

Script wmf-auto-reimage was launched by gtirloni on neodymium.eqiad.wmnet for hosts:

cloudvirt1017.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201811071713_gtirloni_7282_cloudvirt1017_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudvirt1017.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1017.eqiad.wmnet']

Script wmf-auto-reimage was launched by gtirloni on neodymium.eqiad.wmnet for hosts:

cloudvirt1017.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201811071714_gtirloni_7523_cloudvirt1017_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudvirt1017.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1017.eqiad.wmnet']
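The log paths in the reports above appear to share one naming pattern: a 12-digit start timestamp, the user, a number, and the host name with dots replaced by underscores. A minimal sketch for splitting such a path into its parts, assuming that pattern holds; the field name `run_id` is a guess at what the number means (it may be a process ID), and `parse_log_name` is a hypothetical helper, not part of wmf-auto-reimage:

```python
import re
from datetime import datetime

# Assumed layout: <YYYYmmddHHMM>_<user>_<number>_<host with underscores>.log
LOG_NAME_RE = re.compile(
    r"(?P<stamp>\d{12})_(?P<user>[^_]+)_(?P<num>\d+)_(?P<host>.+)\.log$"
)


def parse_log_name(path):
    """Split a wmf-auto-reimage log path into start time, user, number and host."""
    match = LOG_NAME_RE.search(path)
    if match is None:
        raise ValueError(f"unrecognised log name: {path!r}")
    return {
        "started": datetime.strptime(match.group("stamp"), "%Y%m%d%H%M"),
        "user": match.group("user"),
        "run_id": int(match.group("num")),  # meaning of this field is an assumption
        "host": match.group("host").replace("_", "."),
    }


info = parse_log_name(
    "/var/log/wmf-auto-reimage/201811071411_gtirloni_21761_labvirt1017_eqiad_wmnet.log"
)
print(info["started"], info["user"], info["host"])
```

Sorting the parsed entries by `started` gives the order of the reimage attempts, which makes it easier to match each report to its log.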

I ran into too many issues with the reimage script. This is where it last failed:

# wmf-auto-reimage-host --no-downtime -p T208733 cloudvirt1017.eqiad.wmnet cloudvirt1017.mgmt.eqiad.wmnet
17:14:55 | cloudvirt1017.eqiad.wmnet | REIMAGE START | To monitor the full log and cumin output:
sudo tail -F /var/log/wmf-auto-reimage/201811071714_gtirloni_7523_cloudvirt1017_eqiad_wmnet.log
sudo tail -F /var/log/wmf-auto-reimage/201811071714_gtirloni_7523_cloudvirt1017_eqiad_wmnet_cumin.out
IPMI Password: 
17:15:03 | cloudvirt1017.eqiad.wmnet | Validated host
17:15:09 | cloudvirt1017.eqiad.wmnet | Removed from Puppet
17:15:09 | cloudvirt1017.eqiad.wmnet | WARNING: Unable to remove from Debmonitor, got: 404
17:15:09 | cloudvirt1017.eqiad.wmnet | Set Boot Device to pxe
17:15:09 | cloudvirt1017.eqiad.wmnet | Power cycling
17:15:09 | cloudvirt1017.eqiad.wmnet | Chassis Power Control: Cycle
17:19:16 | cloudvirt1017.eqiad.wmnet | Still waiting for reboot after 5.0 minutes
17:19:16 | cloudvirt1017.eqiad.wmnet | Uptime checked
17:19:16 | cloudvirt1017.eqiad.wmnet | Host up (Debian installer)
17:23:46 | cloudvirt1017.eqiad.wmnet | Still waiting for reboot after 5.0 minutes
17:28:47 | cloudvirt1017.eqiad.wmnet | Still waiting for reboot after 10.0 minutes
17:30:52 | cloudvirt1017.eqiad.wmnet | Uptime checked
17:30:52 | cloudvirt1017.eqiad.wmnet | Host up
17:30:57 | cloudvirt1017.eqiad.wmnet | Puppet CSR generated, fingerprint is: A6:AC:55:2B:EC:FD:0F:4D:FA:8E:ED:72:3B:DA:E1:4D:9D:FA:43:EC:D1:15:E1:94:81:93:97:38:39:E3:CA:82
17:30:57 | cloudvirt1017.eqiad.wmnet | Polling until a Puppet sign request appears
17:31:01 | cloudvirt1017.eqiad.wmnet | Signed Puppet cert
17:31:03 | cloudvirt1017.eqiad.wmnet | Validated host
17:31:03 | cloudvirt1017.eqiad.wmnet | Scheduled delayed downtime on Icinga
17:31:03 | cloudvirt1017.eqiad.wmnet | Started first puppet run (sit back, relax, and enjoy the wait)
17:31:03 | cloudvirt1017.eqiad.wmnet | Sleeping for 120 seconds
17:31:22 | cloudvirt1017.eqiad.wmnet | Unable to run wmf-auto-reimage-host: Failed to puppet_first_run
17:31:22 | cloudvirt1017.eqiad.wmnet | REIMAGE END | retcode=2
17:33:03 | cloudvirt1017.eqiad.wmnet | Running Puppet on the Icinga server
17:37:17 | einsteinium.wikimedia.org | Puppet run completed
ERROR:cumin.transports.clustershell.AsyncEventHandler:100.0% (1/1) of nodes failed to execute command 'icinga-downtime ...ab_task=T208733'': einsteinium.wikimedia.org
CRITICAL:cumin.transports.clustershell.AsyncEventHandler:0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
17:37:18 | cloudvirt1017.eqiad.wmnet | Unable to run wmf-downtime-host: Failed to icinga_downtime
ERROR:wmf-downtime-host:Unable to run wmf-downtime-host
Traceback (most recent call last):
  File "/usr/local/sbin/wmf-downtime-host", line 67, in main
    lib.icinga_downtime(args.host, user, args.phab_task_id, title='wmf-downtime-host')
  File "/usr/local/lib/python3.4/dist-packages/wmf_auto_reimage_lib.py", line 536, in icinga_downtime
    run_cumin('icinga_downtime', icinga_host, [command])
  File "/usr/local/lib/python3.4/dist-packages/wmf_auto_reimage_lib.py", line 469, in run_cumin
    raise RuntimeError('Failed to {label}'.format(label=label))
RuntimeError: Failed to icinga_downtime
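The status lines in the output above follow a `HH:MM:SS | host | message` layout, so the failing steps can be pulled out mechanically. A minimal sketch, assuming that layout and that failures are the lines whose message starts with "Unable to run"; the helper names are hypothetical and the sample is a short excerpt of the log above:

```python
import re
from datetime import datetime

LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}) \| (\S+) \| (.*)$")

SAMPLE = """\
17:31:03 | cloudvirt1017.eqiad.wmnet | Started first puppet run (sit back, relax, and enjoy the wait)
17:31:03 | cloudvirt1017.eqiad.wmnet | Sleeping for 120 seconds
17:31:22 | cloudvirt1017.eqiad.wmnet | Unable to run wmf-auto-reimage-host: Failed to puppet_first_run
17:31:22 | cloudvirt1017.eqiad.wmnet | REIMAGE END | retcode=2
17:37:18 | cloudvirt1017.eqiad.wmnet | Unable to run wmf-downtime-host: Failed to icinga_downtime
"""


def parse_status_lines(text):
    """Return (time, host, message) tuples; lines in other formats are skipped."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            stamp, host, message = match.groups()
            events.append((datetime.strptime(stamp, "%H:%M:%S"), host, message))
    return events


events = parse_status_lines(SAMPLE)
failed = [e for e in events if e[2].startswith("Unable to run")]
started = next(e for e in events if e[2].startswith("Started first puppet run"))
elapsed = failed[0][0] - started[0]

for when, host, message in failed:
    print(when.strftime("%H:%M:%S"), host, message)
print("first failure after", int(elapsed.total_seconds()), "seconds")
```

On this excerpt the first puppet run fails 19 seconds after it starts, well before the announced 120-second sleep would have finished, which suggests looking at the puppet output itself rather than at a timeout.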

Change 472223 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] cloudvps: hieradata for cloudvirt1017

https://gerrit.wikimedia.org/r/472223

Change 472223 merged by GTirloni:
[operations/puppet@production] cloudvps: hieradata for cloudvirt1017

https://gerrit.wikimedia.org/r/472223

GTirloni updated the task description. Nov 7 2018, 6:33 PM

cloudvirt1017 is now up and running after using the old-school method (manually making the server boot through PXE and using install-console).

Change 472253 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Nova: add cloudvirt1017 to the scheduler pool

https://gerrit.wikimedia.org/r/472253

Change 472253 merged by GTirloni:
[operations/puppet@production] Nova: add cloudvirt1017 to the scheduler pool

https://gerrit.wikimedia.org/r/472253

VMs are being successfully spawned on cloudvirt1017 (2 test VMs and one from a user project so far).

GTirloni updated the task description. Nov 8 2018, 11:18 AM

All pending stuff is blocked by T208945.

aborrero changed the task status from Open to Stalled. Nov 12 2018, 1:08 PM

Cmjohnson closed subtask T208945: Relabel labvirt1017.eqiad.wmnet as cloudvirt1017.eqiad.wmnet as Resolved. Nov 26 2018, 5:20 PM

Andrew closed this task as Resolved. Nov 26 2018, 7:26 PM

Andrew updated the task description.
