
Rename labvirt1017 to cloudvirt1017, move to eqiad1
Closed, Resolved (Public)

Description

After quarry is migrated later today, this host can be moved to the new region.

Reimage + rename this server to the new naming scheme.

Timeline would be:

Event Timeline

aborrero triaged this task as Medium priority.
aborrero moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.
aborrero added subscribers: GTirloni, aborrero.

I will handle this task with @GTirloni (if he is interested).

Change 472140 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] cloudvps: reimage+rename labvirt1017 as cloudvirt1017

https://gerrit.wikimedia.org/r/472140

Change 472142 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/dns@master] cloudvps: rename+reimage labvirt1017 as cloudvirt1017

https://gerrit.wikimedia.org/r/472142

Change 472140 merged by GTirloni:
[operations/puppet@production] cloudvps: reimage+rename labvirt1017 as cloudvirt1017

https://gerrit.wikimedia.org/r/472140

Change 472142 merged by GTirloni:
[operations/dns@master] cloudvps: rename+reimage labvirt1017 as cloudvirt1017

https://gerrit.wikimedia.org/r/472142

Mentioned in SAL (#wikimedia-cloud) [2018-11-07T13:49:39Z] <arturo> T208733 moving labvirt1017 from main deployment to eqiad1 and renaming it to cloudvirt1017

Script wmf-auto-reimage was launched by gtirloni on neodymium.eqiad.wmnet for hosts:

labvirt1017.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201811071411_gtirloni_21761_labvirt1017_eqiad_wmnet.log.

wmf-auto-reimage-host -p T208733 --rename cloudvirt1017.eqiad.wmnet --rename-mgmt cloudvirt1017.mgmt.eqiad.wmnet labvirt1017.eqiad.wmnet labvirt1017.mgmt.eqiad.wmnet

Completed auto-reimage of hosts:

['cloudvirt1017.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1017.eqiad.wmnet']

Script wmf-auto-reimage was launched by gtirloni on neodymium.eqiad.wmnet for hosts:

labvirt1017.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201811071422_gtirloni_24260_labvirt1017_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['labvirt1017.eqiad.wmnet']

Of which those FAILED:

['labvirt1017.eqiad.wmnet']

Script wmf-auto-reimage was launched by gtirloni on neodymium.eqiad.wmnet for hosts:

labvirt1017.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201811071429_gtirloni_25942_labvirt1017_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudvirt1017.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1017.eqiad.wmnet']

Script wmf-auto-reimage was launched by gtirloni on neodymium.eqiad.wmnet for hosts:

cloudvirt1017.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201811071549_gtirloni_14020_cloudvirt1017_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudvirt1017.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1017.eqiad.wmnet']

Script wmf-auto-reimage was launched by gtirloni on neodymium.eqiad.wmnet for hosts:

cloudvirt1017.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201811071549_gtirloni_14176_cloudvirt1017_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudvirt1017.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1017.eqiad.wmnet']

Script wmf-auto-reimage was launched by gtirloni on neodymium.eqiad.wmnet for hosts:

cloudvirt1017.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201811071615_gtirloni_21777_cloudvirt1017_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudvirt1017.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1017.eqiad.wmnet']

Script wmf-auto-reimage was launched by gtirloni on neodymium.eqiad.wmnet for hosts:

cloudvirt1017.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201811071621_gtirloni_23337_cloudvirt1017_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudvirt1017.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1017.eqiad.wmnet']

Script wmf-auto-reimage was launched by gtirloni on neodymium.eqiad.wmnet for hosts:

cloudvirt1017.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201811071622_gtirloni_24537_cloudvirt1017_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudvirt1017.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1017.eqiad.wmnet']

Script wmf-auto-reimage was launched by gtirloni on neodymium.eqiad.wmnet for hosts:

cloudvirt1017.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201811071639_gtirloni_29369_cloudvirt1017_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudvirt1017.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1017.eqiad.wmnet']

Script wmf-auto-reimage was launched by gtirloni on neodymium.eqiad.wmnet for hosts:

cloudvirt1017.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201811071713_gtirloni_7282_cloudvirt1017_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudvirt1017.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1017.eqiad.wmnet']

Script wmf-auto-reimage was launched by gtirloni on neodymium.eqiad.wmnet for hosts:

cloudvirt1017.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201811071714_gtirloni_7523_cloudvirt1017_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudvirt1017.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1017.eqiad.wmnet']
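
For anyone reconstructing the sequence of attempts from the log paths quoted above, a minimal Python sketch follows. It assumes the file names follow the pattern `<YYYYMMDDHHMM>_<user>_<number>_<host with dots replaced by underscores>.log`, which is inferred only from the paths in this task and not from the script's documentation. The `Attempt` type, the `run_id` field name, and `parse_log_name` are hypothetical helpers, and what the number actually represents is not stated here.

```python
import re
from datetime import datetime
from typing import NamedTuple

# Assumed naming pattern, inferred from the paths quoted in this task:
#   201811071411_gtirloni_21761_labvirt1017_eqiad_wmnet.log
NAME_RE = re.compile(r"^(\d{12})_([^_]+)_(\d+)_(.+)\.log$")


class Attempt(NamedTuple):
    started: datetime  # timestamp embedded in the file name (timezone not stated)
    user: str          # who launched the script
    run_id: int        # numeric field; its meaning is an assumption
    host: str          # host name with underscores mapped back to dots


def parse_log_name(path: str) -> Attempt:
    """Split a wmf-auto-reimage log path into its apparent components."""
    name = path.rsplit("/", 1)[-1]
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError("unexpected log file name: " + name)
    stamp, user, run_id, host = match.groups()
    return Attempt(
        started=datetime.strptime(stamp, "%Y%m%d%H%M"),
        user=user,
        run_id=int(run_id),
        host=host.replace("_", "."),
    )


if __name__ == "__main__":
    first = parse_log_name(
        "/var/log/wmf-auto-reimage/"
        "201811071411_gtirloni_21761_labvirt1017_eqiad_wmnet.log"
    )
    print(first.started.isoformat(), first.host)
```

Sorting the parsed attempts by `started` gives the order in which the reimage runs were launched, which is handy when matching them against the failure reports.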

I ran into too many issues with the reimage script. This is where the latest attempt fails:

# wmf-auto-reimage-host --no-downtime -p T208733 cloudvirt1017.eqiad.wmnet cloudvirt1017.mgmt.eqiad.wmnet
17:14:55 | cloudvirt1017.eqiad.wmnet | REIMAGE START | To monitor the full log and cumin output:
sudo tail -F /var/log/wmf-auto-reimage/201811071714_gtirloni_7523_cloudvirt1017_eqiad_wmnet.log
sudo tail -F /var/log/wmf-auto-reimage/201811071714_gtirloni_7523_cloudvirt1017_eqiad_wmnet_cumin.out
IPMI Password: 
17:15:03 | cloudvirt1017.eqiad.wmnet | Validated host
17:15:09 | cloudvirt1017.eqiad.wmnet | Removed from Puppet
17:15:09 | cloudvirt1017.eqiad.wmnet | WARNING: Unable to remove from Debmonitor, got: 404
17:15:09 | cloudvirt1017.eqiad.wmnet | Set Boot Device to pxe
17:15:09 | cloudvirt1017.eqiad.wmnet | Power cycling
17:15:09 | cloudvirt1017.eqiad.wmnet | Chassis Power Control: Cycle
17:19:16 | cloudvirt1017.eqiad.wmnet | Still waiting for reboot after 5.0 minutes
17:19:16 | cloudvirt1017.eqiad.wmnet | Uptime checked
17:19:16 | cloudvirt1017.eqiad.wmnet | Host up (Debian installer)
17:23:46 | cloudvirt1017.eqiad.wmnet | Still waiting for reboot after 5.0 minutes
17:28:47 | cloudvirt1017.eqiad.wmnet | Still waiting for reboot after 10.0 minutes
17:30:52 | cloudvirt1017.eqiad.wmnet | Uptime checked
17:30:52 | cloudvirt1017.eqiad.wmnet | Host up
17:30:57 | cloudvirt1017.eqiad.wmnet | Puppet CSR generated, fingerprint is: A6:AC:55:2B:EC:FD:0F:4D:FA:8E:ED:72:3B:DA:E1:4D:9D:FA:43:EC:D1:15:E1:94:81:93:97:38:39:E3:CA:82
17:30:57 | cloudvirt1017.eqiad.wmnet | Polling until a Puppet sign request appears
17:31:01 | cloudvirt1017.eqiad.wmnet | Signed Puppet cert
17:31:03 | cloudvirt1017.eqiad.wmnet | Validated host
17:31:03 | cloudvirt1017.eqiad.wmnet | Scheduled delayed downtime on Icinga
17:31:03 | cloudvirt1017.eqiad.wmnet | Started first puppet run (sit back, relax, and enjoy the wait)
17:31:03 | cloudvirt1017.eqiad.wmnet | Sleeping for 120 seconds
17:31:22 | cloudvirt1017.eqiad.wmnet | Unable to run wmf-auto-reimage-host: Failed to puppet_first_run
17:31:22 | cloudvirt1017.eqiad.wmnet | REIMAGE END | retcode=2
17:33:03 | cloudvirt1017.eqiad.wmnet | Running Puppet on the Icinga server
17:37:17 | einsteinium.wikimedia.org | Puppet run completed
ERROR:cumin.transports.clustershell.AsyncEventHandler:100.0% (1/1) of nodes failed to execute command 'icinga-downtime ...ab_task=T208733'': einsteinium.wikimedia.org
CRITICAL:cumin.transports.clustershell.AsyncEventHandler:0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
17:37:18 | cloudvirt1017.eqiad.wmnet | Unable to run wmf-downtime-host: Failed to icinga_downtime
ERROR:wmf-downtime-host:Unable to run wmf-downtime-host
Traceback (most recent call last):
  File "/usr/local/sbin/wmf-downtime-host", line 67, in main
    lib.icinga_downtime(args.host, user, args.phab_task_id, title='wmf-downtime-host')
  File "/usr/local/lib/python3.4/dist-packages/wmf_auto_reimage_lib.py", line 536, in icinga_downtime
    run_cumin('icinga_downtime', icinga_host, [command])
  File "/usr/local/lib/python3.4/dist-packages/wmf_auto_reimage_lib.py", line 469, in run_cumin
    raise RuntimeError('Failed to {label}'.format(label=label))
RuntimeError: Failed to icinga_downtime
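
To pick the failing steps out of a transcript like the one above, a small Python sketch can help. It assumes each status line has the shape `HH:MM:SS | <host> | <message>`, which is inferred only from this output, and that failed steps are the ones whose message starts with "Unable to run". The names `Entry`, `parse_reimage_output`, and `failed_steps` are hypothetical and not part of wmf-auto-reimage.

```python
import re
from typing import List, NamedTuple

# Assumed line shape, inferred only from the transcript in this task:
#   HH:MM:SS | <host> | <message>
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}) \| (\S+) \| (.*)$")


class Entry(NamedTuple):
    time: str
    host: str
    message: str


def parse_reimage_output(text: str) -> List[Entry]:
    """Return the structured status lines; skip tracebacks, prompts, etc."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            entries.append(Entry(*match.groups()))
    return entries


def failed_steps(entries: List[Entry]) -> List[Entry]:
    """Entries whose message reports a step that could not be run."""
    return [e for e in entries if e.message.startswith("Unable to run")]


# A few lines copied from the transcript above.
sample = """\
17:31:03 | cloudvirt1017.eqiad.wmnet | Sleeping for 120 seconds
17:31:22 | cloudvirt1017.eqiad.wmnet | Unable to run wmf-auto-reimage-host: Failed to puppet_first_run
17:31:22 | cloudvirt1017.eqiad.wmnet | REIMAGE END | retcode=2
17:37:18 | cloudvirt1017.eqiad.wmnet | Unable to run wmf-downtime-host: Failed to icinga_downtime
"""

if __name__ == "__main__":
    for entry in failed_steps(parse_reimage_output(sample)):
        print(entry.time, entry.message)
```

On the excerpt, this reports two failures: the first Puppet run (`puppet_first_run`) and, a few minutes later, the Icinga downtime step (`icinga_downtime`).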

Change 472223 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] cloudvps: hieradata for cloudvirt1017

https://gerrit.wikimedia.org/r/472223

Change 472223 merged by GTirloni:
[operations/puppet@production] cloudvps: hieradata for cloudvirt1017

https://gerrit.wikimedia.org/r/472223

cloudvirt1017 is now up and running, installed the old-school way (manually making the server boot through PXE and using install-console).

Change 472253 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Nova: add cloudvirt1017 to the scheduler pool

https://gerrit.wikimedia.org/r/472253

Change 472253 merged by GTirloni:
[operations/puppet@production] Nova: add cloudvirt1017 to the scheduler pool

https://gerrit.wikimedia.org/r/472253

VMs are being spawned successfully on cloudvirt1017 (2 test VMs and one from a user project so far).

aborrero moved this task from Doing to Blocked on the cloud-services-team (Kanban) board.

All remaining work is blocked by T208945.

aborrero changed the task status from Open to Stalled. (Nov 12 2018, 1:08 PM)
Andrew updated the task description.