
Multiple CloudVPS instances lost their IPs (unreachable)
Closed, Resolved, Public

Description

IRC report that the tools DB was down. Many things seem to be alerting as down in #wikimedia-cloud-feed.

Incident report: https://wikitech.wikimedia.org/wiki/Incidents/2023-09-29_CloudVPS_vms_losing_network_connectivity

Script being used to bring the VMs back online (we might want to reuse the console-handling parts):

import subprocess
import time

import glanceclient
from keystoneauth1 import session
from keystoneauth1.identity import v3
from keystoneclient.v3 import client as keystoneclient
from novaclient import client as novaclient


def keystone_session(project="cloudinfra"):
    """Build a keystone session for the given project, using the
    read-only novaobserver account."""
    auth = v3.Password(
        auth_url="https://openstack.eqiad1.wikimediacloud.org:25000/v3",
        username="novaobserver",
        password="Fs6Dq2RtG8KwmM2Z",  # https://gerrit.wikimedia.org/g/operations/puppet/+/16906c693da99eacdf7be557cc19e110a30c96f1/hieradata/cloud/eqiad1.yaml#36
        user_domain_name="Default",
        project_domain_name="Default",
        project_name=project,
    )
    return session.Session(auth=auth)


def fix_instance(vm_hypervisor, vm_libvirt_id, ip_address):
    """Attach to the VM's serial console through its hypervisor and, if
    the VM has lost its network configuration, re-add it by hand."""
    # If the VM can already reach 208.80.154.224 through the 172.16.0.1
    # gateway, do nothing; otherwise bring ens3 up and restore its
    # address and default route.
    one_liner = (
        "( ip route get 208.80.154.224 | grep 172.16.0.1 ) || "
        f"( ip link set ens3 up; ip addr add {ip_address}/21 dev ens3; "
        "ip route add default via 172.16.0.1 )"
    )
    p = subprocess.Popen(
        [
            "/usr/bin/ssh",
            # "-v",
            "-tt",
            # "-F/etc/cumin/ssh_config",
            vm_hypervisor,
            f"sudo virsh console {vm_libvirt_id} --force",
        ],
        bufsize=0,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
        # env={"SSH_AUTH_SOCK": "/run/keyholder/proxy.sock"}
    )

    # With text=True, readline() yields str, so the EOF sentinel must be
    # "" (the original b'' could never match, and the loop would not
    # stop at EOF).
    for line in iter(p.stdout.readline, ""):
        if "Escape character" in line:
            # The console attached; send a newline to get a prompt.
            time.sleep(0.5)
            p.stdin.write("\r\n")
            p.stdin.flush()
        elif "root@" in line:
            # A root prompt appeared; type the one-liner and detach.
            time.sleep(0.2)
            p.stdin.write(one_liner)
            p.stdin.flush()
            time.sleep(0.2)
            p.stdin.write("\r\n")
            p.stdin.flush()
            time.sleep(0.2)
            p.kill()
            break


def fix_project(project, broken_images):
    """Fix every ACTIVE VM in the project that runs a broken image."""
    nova = novaclient.Client("2.0", session=keystone_session(project))
    servers = nova.servers.list(
        detailed=True,
        sort_keys=["display_name"],
        sort_dirs=["asc"],
    )
    for server in servers:
        # Volume-backed servers have an empty image field.
        if not server.image or server.image["id"] not in broken_images:
            continue
        if server.status != "ACTIVE":
            print(server.name, project, server.status, "SKIP")
            continue
        vm_info = server._info
        vm_hypervisor = vm_info["OS-EXT-SRV-ATTR:hypervisor_hostname"]
        vm_libvirt_id = vm_info["OS-EXT-SRV-ATTR:instance_name"]
        for sdn, interfaces in server.addresses.items():
            for interface in interfaces:
                if interface["addr"].startswith("172.16."):
                    print(server.name, project, interface["addr"], vm_hypervisor, vm_libvirt_id)
                    fix_instance(vm_hypervisor, vm_libvirt_id, interface["addr"])


def get_broken_images():
    """Return the ids of all affected glance images (every bullseye one)."""
    glance = glanceclient.Client(
        version="2",
        session=keystone_session(),
        interface="public",
    )

    broken = []
    for image in glance.images.list():
        if "bullseye" in image["name"]:
            broken.append(image["id"])
    return broken


def main():
    broken_images = get_broken_images()
    keystone = keystoneclient.Client(
        session=keystone_session(),
        interface="public",
        timeout=2,
    )

    for project in keystone.projects.list(enabled=True, domain="default"):
        # Projects starting with a, b or c were presumably already
        # handled by an earlier run.
        if project.name.startswith(("a", "b", "c")):
            continue
        print("*", project.name)
        fix_project(project.name, broken_images)


if __name__ == "__main__":
    main()
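
Since the description mentions reusing the console-handling parts, here is a minimal sketch of that logic factored into a standalone helper. The name run_via_console and the prompt parameter are illustrative assumptions, not an existing API:

import subprocess
import time


def run_via_console(hypervisor, libvirt_id, command, prompt="root@"):
    """Attach to a VM's serial console through its hypervisor over SSH
    and type `command` once a shell prompt shows up. Mirrors the console
    handling of fix_instance() above, with the command parameterised."""
    proc = subprocess.Popen(
        [
            "/usr/bin/ssh",
            "-tt",
            hypervisor,
            f"sudo virsh console {libvirt_id} --force",
        ],
        bufsize=0,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    try:
        for line in iter(proc.stdout.readline, ""):
            if "Escape character" in line:
                # Console attached; send a newline to coax out a prompt.
                time.sleep(0.5)
                proc.stdin.write("\r\n")
                proc.stdin.flush()
            elif prompt in line:
                # Prompt is up; type the command, then detach.
                time.sleep(0.2)
                proc.stdin.write(command + "\r\n")
                proc.stdin.flush()
                time.sleep(0.2)
                break
    finally:
        proc.kill()

With such a helper, fix_instance() above reduces to building its one-liner and calling run_via_console(vm_hypervisor, vm_libvirt_id, one_liner).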

Event Timeline

RhinosF1 triaged this task as Unbreak Now! priority. Sep 29 2023, 6:57 AM
dcaro changed the task status from Open to In Progress. Sep 29 2023, 7:06 AM
dcaro claimed this task.
dcaro moved this task from To refine to Doing on the User-dcaro board.
dcaro added a project: User-dcaro.

The main issue is that the CloudVPS proxy is down, because it no longer has a DHCP client to renew its public IP.

This seems to be a result of:
https://gerrit.wikimedia.org/r/c/operations/puppet/+/961005

We are manually reinstalling the package to restore service; the patch will be reverted if needed.
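
For reference, the manual fix boils down to reinstalling the purged DHCP client and renewing the lease. A minimal sketch, assuming the node is still reachable over SSH (the hostname below is a hypothetical placeholder; a fully cut-off instance would need the virsh console approach from the script above):

import subprocess

# Hypothetical hostname, for illustration only.
host = "proxy-01.project-proxy.eqiad1.wikimedia.cloud"

# Put the purged DHCP client back and renew the lease on the primary NIC.
remote_cmd = "sudo apt-get install -y isc-dhcp-client && sudo dhclient -v ens3"
subprocess.run(["/usr/bin/ssh", host, remote_cmd], check=True)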

Statanalyser seems to be affected, as it did not run this morning.

dcaro renamed this task from Multiple CloudVPS instances down to Multiple CloudVPS instances lost their IPs (unreachable). Sep 29 2023, 7:43 AM

https://en.wikipedia.beta.wmflabs.org/ is down as well. The RESTBase tests don't work. I guess I'll do something else with my day, then.

Mentioned in SAL (#wikimedia-cloud) [2023-09-29T08:36:19Z] <taavi> start script to fix networking on broken bullseye instances T347665

Change 961986 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] wmcs: instance: install isc-dhcp-client

https://gerrit.wikimedia.org/r/961986

Change 961986 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] wmcs: instance: install isc-dhcp-client

https://gerrit.wikimedia.org/r/961986

aborrero lowered the priority of this task from Unbreak Now! to High. Sep 29 2023, 8:50 AM
aborrero subscribed.

Lowering the priority, as most of the cloud infra has already been recovered.

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-29T09:04:37Z] <wm-bot2> dcaro@urcuchillay START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-72 (T347665)

XTools is down: https://xtools.wmcloud.org/ec/enwiki/ I assume it's related to this? cc @MusikAnimal

Yes, and it's back now.

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-29T09:06:18Z] <wm-bot2> dcaro@urcuchillay END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-72 (T347665)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-29T09:06:36Z] <wm-bot2> dcaro@urcuchillay START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-76 (T347665)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-29T09:15:16Z] <wm-bot2> dcaro@urcuchillay END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-76 (T347665)

My tool just went down ( http://vector-dark.toolforge.org/ ) - the address is reachable but browser is loading forever. And I'm unable to ssh into it (asks for password and then hangs).

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-29T10:48:30Z] <wm-bot2> dcaro@urcuchillay END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for all workers (T347665)

> My tool just went down ( http://vector-dark.toolforge.org/ ) - the address is reachable but browser is loading forever. And I'm unable to ssh into it (asks for password and then hangs).

It works for me now, can you retry?

I'll close this for now; there are a few things we can improve for next time, but those will be addressed in their own tasks.

Thanks!

Yeah, works for me too.

Change 971345 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] pontoon: Set additional_purged_packages to be empty

https://gerrit.wikimedia.org/r/971345

Change 971345 merged by Andrea Denisse:

[operations/puppet@production] pontoon: Set additional_purged_packages to be empty

https://gerrit.wikimedia.org/r/971345