
Multiple CloudVPS instances lost their IPs (unreachable)
Closed, Resolved, Public

Description

IRC report that the tools DB was down. Many things seem to be alerting as down in #wikimedia-cloud-feed.

Incident report: https://wikitech.wikimedia.org/wiki/Incidents/2023-09-29_CloudVPS_vms_losing_network_connectivity

Script being used to bring the VMs back online (we might want to reuse the console-handling parts):

import subprocess
import time

import glanceclient
from keystoneauth1 import session
from keystoneauth1.identity import v3
from keystoneclient.v3 import client as keystoneclient
from novaclient import client as novaclient


def keystone_session(project="cloudinfra"):
    """Build a keystone session for the given project, using the
    read-only novaobserver account."""
    auth = v3.Password(
        auth_url="https://openstack.eqiad1.wikimediacloud.org:25000/v3",
        username="novaobserver",
        password="Fs6Dq2RtG8KwmM2Z",  # https://gerrit.wikimedia.org/g/operations/puppet/+/16906c693da99eacdf7be557cc19e110a30c96f1/hieradata/cloud/eqiad1.yaml#36
        user_domain_name="Default",
        project_domain_name="Default",
        project_name=project,
    )
    return session.Session(auth=auth)


def fix_instance(vm_hypervisor, vm_libvirt_id, ip_address):
    """Attach to the VM's serial console through its hypervisor and, if
    the VM has lost its network configuration, re-add it by hand."""
    # If the VM can already reach 208.80.154.224 through the 172.16.0.1
    # gateway, do nothing; otherwise bring ens3 up and restore its
    # address and default route.
    one_liner = (
        "( ip route get 208.80.154.224 | grep 172.16.0.1 ) || "
        f"( ip link set ens3 up; ip addr add {ip_address}/21 dev ens3; "
        "ip route add default via 172.16.0.1 )"
    )
    p = subprocess.Popen(
        [
            "/usr/bin/ssh",
            # "-v",
            "-tt",
            # "-F/etc/cumin/ssh_config",
            vm_hypervisor,
            f"sudo virsh console {vm_libvirt_id} --force",
        ],
        bufsize=0,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
        # env={"SSH_AUTH_SOCK": "/run/keyholder/proxy.sock"}
    )

    # With text=True, readline() yields str, so the EOF sentinel must be
    # "" (the original b'' could never match, and the loop would not
    # stop at EOF).
    for line in iter(p.stdout.readline, ""):
        if "Escape character" in line:
            # The console attached; send a newline to get a prompt.
            time.sleep(0.5)
            p.stdin.write("\r\n")
            p.stdin.flush()
        elif "root@" in line:
            # A root prompt appeared; type the one-liner and detach.
            time.sleep(0.2)
            p.stdin.write(one_liner)
            p.stdin.flush()
            time.sleep(0.2)
            p.stdin.write("\r\n")
            p.stdin.flush()
            time.sleep(0.2)
            p.kill()
            break


def fix_project(project, broken_images):
    """Fix every ACTIVE VM in the project that runs a broken image."""
    nova = novaclient.Client("2.0", session=keystone_session(project))
    servers = nova.servers.list(
        detailed=True,
        sort_keys=["display_name"],
        sort_dirs=["asc"],
    )
    for server in servers:
        # Volume-backed servers have an empty image field.
        if not server.image or server.image["id"] not in broken_images:
            continue
        if server.status != "ACTIVE":
            print(server.name, project, server.status, "SKIP")
            continue
        vm_info = server._info
        vm_hypervisor = vm_info["OS-EXT-SRV-ATTR:hypervisor_hostname"]
        vm_libvirt_id = vm_info["OS-EXT-SRV-ATTR:instance_name"]
        for sdn, interfaces in server.addresses.items():
            for interface in interfaces:
                if interface["addr"].startswith("172.16."):
                    print(server.name, project, interface["addr"], vm_hypervisor, vm_libvirt_id)
                    fix_instance(vm_hypervisor, vm_libvirt_id, interface["addr"])


def get_broken_images():
    """Return the ids of all affected glance images (every bullseye one)."""
    glance = glanceclient.Client(
        version="2",
        session=keystone_session(),
        interface="public",
    )

    broken = []
    for image in glance.images.list():
        if "bullseye" in image["name"]:
            broken.append(image["id"])
    return broken


def main():
    broken_images = get_broken_images()
    keystone = keystoneclient.Client(
        session=keystone_session(),
        interface="public",
        timeout=2,
    )

    for project in keystone.projects.list(enabled=True, domain="default"):
        # Projects starting with a, b or c were presumably already
        # handled by an earlier run.
        if project.name.startswith(("a", "b", "c")):
            continue
        print("*", project.name)
        fix_project(project.name, broken_images)


if __name__ == "__main__":
    main()
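
Since the description mentions reusing the console-handling parts, here is a minimal sketch of that logic factored into a standalone helper. The name run_via_console and the prompt parameter are illustrative assumptions, not an existing API:

import subprocess
import time


def run_via_console(hypervisor, libvirt_id, command, prompt="root@"):
    """Attach to a VM's serial console through its hypervisor over SSH
    and type `command` once a shell prompt shows up. Mirrors the console
    handling of fix_instance() above, with the command parameterised."""
    proc = subprocess.Popen(
        [
            "/usr/bin/ssh",
            "-tt",
            hypervisor,
            f"sudo virsh console {libvirt_id} --force",
        ],
        bufsize=0,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    try:
        for line in iter(proc.stdout.readline, ""):
            if "Escape character" in line:
                # Console attached; send a newline to coax out a prompt.
                time.sleep(0.5)
                proc.stdin.write("\r\n")
                proc.stdin.flush()
            elif prompt in line:
                # Prompt is up; type the command, then detach.
                time.sleep(0.2)
                proc.stdin.write(command + "\r\n")
                proc.stdin.flush()
                time.sleep(0.2)
                break
    finally:
        proc.kill()

With such a helper, fix_instance() above reduces to building its one-liner and calling run_via_console(vm_hypervisor, vm_libvirt_id, one_liner).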

Event Timeline

RhinosF1 triaged this task as Unbreak Now! priority. Sep 29 2023, 6:57 AM
dcaro changed the task status from Open to In Progress. Sep 29 2023, 7:06 AM
dcaro claimed this task.
dcaro moved this task from To refine to Doing on the User-dcaro board.
dcaro added a project: User-dcaro.

The main issue is that the CloudVPS proxy is down, because it no longer has a DHCP client to renew its public IP.

This seems to be a result of:
https://gerrit.wikimedia.org/r/c/operations/puppet/+/961005

We are manually reinstalling the package to restore service; the patch will be reverted if needed.
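
For reference, the manual fix boils down to reinstalling the purged DHCP client and renewing the lease. A minimal sketch, assuming the node is still reachable over SSH (the hostname below is a hypothetical placeholder; a fully cut-off instance would need the virsh console approach from the script above):

import subprocess

# Hypothetical hostname, for illustration only.
host = "proxy-01.project-proxy.eqiad1.wikimedia.cloud"

# Put the purged DHCP client back and renew the lease on the primary NIC.
remote_cmd = "sudo apt-get install -y isc-dhcp-client && sudo dhclient -v ens3"
subprocess.run(["/usr/bin/ssh", host, remote_cmd], check=True)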

Statanalyser seems to be affected, as it did not run this morning.

dcaro renamed this task from Multiple CloudVPS instances down to Multiple CloudVPS instances lost their IPs (unreachable). Sep 29 2023, 7:43 AM

https://en.wikipedia.beta.wmflabs.org/ is down as well. The RESTBase tests don't work. I guess I'll do something else with my day, then.

Mentioned in SAL (#wikimedia-cloud) [2023-09-29T08:36:19Z] <taavi> start script to fix networking on broken bullseye instances T347665

Change 961986 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] wmcs: instance: install isc-dhcp-client

https://gerrit.wikimedia.org/r/961986

Change 961986 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] wmcs: instance: install isc-dhcp-client

https://gerrit.wikimedia.org/r/961986

aborrero lowered the priority of this task from Unbreak Now! to High. Sep 29 2023, 8:50 AM
aborrero subscribed.

Lowering the priority, as most of the cloud infra has already been recovered.

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-29T09:04:37Z] <wm-bot2> dcaro@urcuchillay START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-72 (T347665)

XTools is down: https://xtools.wmcloud.org/ec/enwiki/ I assume it's related to this? cc @MusikAnimal

Yes, and it's back now.

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-29T09:06:18Z] <wm-bot2> dcaro@urcuchillay END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-72 (T347665)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-29T09:06:36Z] <wm-bot2> dcaro@urcuchillay START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-76 (T347665)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-29T09:15:16Z] <wm-bot2> dcaro@urcuchillay END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-76 (T347665)

My tool just went down ( http://vector-dark.toolforge.org/ ) - the address is reachable but browser is loading forever. And I'm unable to ssh into it (asks for password and then hangs).

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-29T10:48:30Z] <wm-bot2> dcaro@urcuchillay END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for all workers (T347665)

> My tool just went down ( http://vector-dark.toolforge.org/ ) - the address is reachable but browser is loading forever. And I'm unable to ssh into it (asks for password and then hangs).

It works for me now, can you retry?

I'll close this for now; there are a few things we can improve for next time, but those will be addressed in their own tasks.

Thanks!

Yeah, works for me too.

Change 971345 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] pontoon: Set additional_purged_packages to be empty

https://gerrit.wikimedia.org/r/971345

Change 971345 merged by Andrea Denisse:

[operations/puppet@production] pontoon: Set additional_purged_packages to be empty

https://gerrit.wikimedia.org/r/971345