Unable to ssh to VM backend.wikicommunityhealth.eqiad1.wikimedia.cloud
Closed, Resolved (Public)

Description

From irc:

User CristianCantoro is unable to ssh to the VM backend.wikicommunityhealth.eqiad1.wikimedia.cloud from the bastion:

$ ssh backend.wikicommunityhealth.eqiad1.wikimedia.cloud
ssh: connect to host backend.wikicommunityhealth.eqiad1.wikimedia.cloud port 22: Connection refused

Event Timeline

dcaro triaged this task as High priority. Aug 4 2021, 11:02 AM
dcaro created this task.

The machine seems to have failed to set up the local disks when running cloud-init:

...
[*     ] A start job is running for /dev/dis…e76f05e1d30b (1min 27s / 1min 30s)
[**    ] A start job is running for /dev/dis…e76f05e1d30b (1min 28s / 1min 30s)
[***   ] A start job is running for /dev/dis…e76f05e1d30b (1min 28s / 1min 30s)
[ ***  ] A start job is running for /dev/dis…e76f05e1d30b (1min 29s / 1min 30s)
[  *** ] A start job is running for /dev/dis…e76f05e1d30b (1min 29s / 1min 30s)
[   ***] A start job is running for /dev/dis…e76f05e1d30b (1min 30s / 1min 30s)
[ TIME ] Timed out waiting for device …8-86fc-4552-9e41-e76f05e1d30b.
[DEPEND] Dependency failed for /mnt/backdata.
[DEPEND] Dependency failed for Local File Systems.
...

The virsh console is unresponsive. I asked @CristianCantoro whether it's OK to reboot the VM, and whether they were ever
able to ssh to it (maybe it failed to provision).
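One way to narrow down a boot that stalls on "Timed out waiting for device" is to compare the UUID= entries in the fstab against the devices the system actually sees. The snippet below is only a rough sketch of that check, not what was run on this VM: the function name `missing_fstab_devices` and its two arguments are made up for illustration, and it assumes the stuck mount is declared by UUID in an fstab-style file. Mounts without the `nofail` option are the ones that can pull down "Local File Systems" when their device is absent.

```shell
#!/bin/sh
# Hypothetical helper: report fstab mounts whose UUID= device is not present.
#   $1  fstab file to read       (default: /etc/fstab)
#   $2  directory of UUID links  (default: /dev/disk/by-uuid)
missing_fstab_devices() {
    fstab="${1:-/etc/fstab}"
    uuid_dir="${2:-/dev/disk/by-uuid}"
    # Drop comment lines; fields are: device, mount point, type, options, ...
    grep -v '^[[:space:]]*#' "$fstab" | while read -r dev mnt fstype opts _; do
        case "$dev" in
            UUID=*) uuid="${dev#UUID=}" ;;
            *) continue ;;   # blank lines and non-UUID devices are skipped
        esac
        if [ ! -e "$uuid_dir/$uuid" ]; then
            # Wrap the option list in commas so "nofail" matches as a whole word
            case ",$opts," in
                *,nofail,*) echo "$mnt: device $uuid missing (nofail set)" ;;
                *)          echo "$mnt: device $uuid missing (no nofail, boot may block)" ;;
            esac
        fi
    done
}
```

Called with no arguments from a rescue shell it reads the live `/etc/fstab`; passing explicit paths lets it run against a copy of the file taken from the instance's disk.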

Hi @dcaro,

The machine seems to have failed to set up the local disks when running cloud-init:

[...]

The virsh console is unresponsive. I asked @CristianCantoro whether it's OK to reboot the VM, and whether they were ever
able to ssh to it (maybe it failed to provision).

Strange, I am sure I was able to log in, and I have installed some libraries and tools (e.g. docker).

Reboot at will, if needed :-)

Thanks for your assistance!

C

Mentioned in SAL (#wikimedia-cloud) [2021-08-04T12:19:17Z] <dcaro> rebooting backend instance (T288069)

Mentioned in SAL (#wikimedia-cloud) [2021-08-04T12:38:03Z] <dcaro> stopping backend instance (T288069)

Mentioned in SAL (#wikimedia-cloud) [2021-08-04T12:42:11Z] <dcaro> the server seems to run cloud-init every time it boots (T288069)

Mentioned in SAL (#wikimedia-cloud) [2021-08-04T13:16:31Z] <dcaro> migrated the backend vm to cloudvirt1040, same host as frontend, still getting stuck at boot (T288069)

@CristianCantoro, having spent some time trying to debug the issue, it looks like it will be tricky to continue. Would you mind trying to rebuild the instance (if you don't mind losing what's there now)?
Thanks!

Hi @dcaro,

@CristianCantoro, having spent some time trying to debug the issue, it looks like it will be tricky to continue. Would you mind trying to rebuild the instance (if you don't mind losing what's there now)?

No problem at all, the volumes are empty and the instances are new as well. I have only installed docker and a few other things; otherwise, they are quite vanilla.

Should I rebuild it with the "Rebuild instance" command from horizon?

C

Should I rebuild it with the "Rebuild instance" command from horizon?

It seems that I am still unable to SSH into the instance even after the rebuild:

$ ssh -J cristiancantoro@bastion.wmcloud.org cristiancantoro@backend.wikicommunityhealth.eqiad1.wikimedia.cloud
channel 0: open failed: connect failed: Connection refused
stdio forwarding failed
ssh_exchange_identification: Connection closed by remote host
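For reference, the "Connection refused" in that output generally means the target address answered but nothing accepted the connection on port 22, which is different from a timeout or an authentication problem. A small sketch for sorting ssh failures into those buckets is below; the function name `classify_ssh_failure` and its labels are hypothetical, and it only pattern-matches the usual OpenSSH client error strings.

```shell
#!/bin/sh
# Hypothetical helper: classify the error text of a failed ssh attempt.
# Reads the captured ssh output on stdin and prints one label.
classify_ssh_failure() {
    err=$(cat)
    case "$err" in
        *"Connection refused"*)         echo "refused" ;;  # reachable, port closed
        *"timed out"*)                  echo "timeout" ;;  # no answer at all
        *"Permission denied"*)          echo "auth"    ;;  # sshd answered, login rejected
        *"Could not resolve hostname"*) echo "dns"     ;;  # name lookup failed
        *)                              echo "unknown" ;;
    esac
}
```

It can be fed from a non-interactive attempt, for example `ssh -o BatchMode=yes -J cristiancantoro@bastion.wmcloud.org cristiancantoro@backend.wikicommunityhealth.eqiad1.wikimedia.cloud true 2>&1 | classify_ssh_failure`.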

Mentioned in SAL (#wikimedia-cloud) [2021-08-04T16:00:32Z] <dcaro> rebuilding backend instance to debug initialization process (T288069)

Mentioned in SAL (#wikimedia-cloud) [2021-08-04T16:06:56Z] <dcaro> rebuilt backend instance without the attached volume, and the instance is up and reachable, will try with the volume (T288069)

Mentioned in SAL (#wikimedia-cloud) [2021-08-04T16:11:41Z] <dcaro> rebooted the VM and it's back up, with prompt on virsh, and reachable through ssh, CristianCantoro can you try and confirm?(T288069)

I confirm I can now log into the machine with SSH.