Drain C8 rack
Closed, Resolved (Public)

Description

Hosts in the rack:

  • cloudbackup1003 - no need
  • cloudcephmon1003 - no need
  • cloudcephmon1004 - no need
  • cloudcephosd1006 - need drain
  • cloudcephosd1007 - need drain
  • cloudcephosd1008 - need drain
  • cloudcephosd1009 - need drain
  • cloudcephosd1016 - need drain
  • cloudcephosd1017 - need drain
  • cloudcephosd1018 - need drain
  • cloudcephosd1021 - need drain
  • cloudcephosd1022 - done
  • cloudcephosd1035 - need drain
  • cloudcontrol1005 - no need
  • cloudgw1001 - no need
  • cloudlb1001 - no need
  • cloudnet1005 - no need
  • cloudrabbit1001 - no need
  • cloudservices1006 - no need
  • cloudvirt1031 - drained
  • cloudvirt1032 - drained
  • cloudvirt1033 - drained
  • cloudvirt1034 - drained
  • cloudvirt1035 - drained

For the drain I used a few handy scripts. I'm leaving them here with a view to moving them into cookbooks soon-ish:

#!/usr/bin/env python3
# Lower the reweight of every OSD that ceph reports as near full by 0.05.
import json
import subprocess


health_json = subprocess.check_output(["ceph", "health", "detail", "-f=json"])
health_data = json.loads(health_json)

# First pass: only print what would change.
# Raises KeyError if ceph reports no OSD_NEARFULL check.
for message_data in health_data["checks"]["OSD_NEARFULL"]["detail"]:
    message = message_data["message"]
    # The first word of the message is the OSD name.
    osd = message.split(" ", 1)[0]
    print(f"Reweighting osd {osd}")
    weight_json = subprocess.check_output(["ceph", "osd", "df", "-f=json", osd])
    weight_data = json.loads(weight_json)
    cur_weight = weight_data["nodes"][0]["reweight"]
    print(f"    cur_weight={cur_weight}")
    new_weight = cur_weight - 0.05
    if new_weight < 0.5:
        print(f"WARNING: weight of {osd} will go under 0.5 -> {new_weight}")
    print(f"    new_weight={new_weight}")


print("If that's ok, hit enter")
input()

# Second pass: re-read the current weight and apply the change.
for message_data in health_data["checks"]["OSD_NEARFULL"]["detail"]:
    message = message_data["message"]
    osd = message.split(" ", 1)[0]
    print(f"Reweighting osd {osd}")
    weight_json = subprocess.check_output(["ceph", "osd", "df", "-f=json", osd])
    weight_data = json.loads(weight_json)
    cur_weight = weight_data["nodes"][0]["reweight"]
    print(f"    cur_weight={cur_weight}")
    new_weight = cur_weight - 0.05
    print(f"    new_weight={new_weight}")
    subprocess.check_call(["ceph", "osd", "reweight", osd, str(new_weight)])
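
The reweight step used in the script above can be factored into a small pure function, which makes the 0.05 decrement and the 0.5 warning threshold testable without a live cluster. This is only a sketch and not part of the original scripts: the names `next_reweight`, `Reweight`, `STEP` and `WARN_BELOW` are hypothetical, and rounding to two decimals is an assumption added here to avoid float noise.

```python
from typing import NamedTuple

STEP = 0.05        # decrement applied per run, as in the script above
WARN_BELOW = 0.5   # threshold under which the script prints a warning


class Reweight(NamedTuple):
    new_weight: float
    warn: bool


def next_reweight(
    cur_weight: float, step: float = STEP, warn_below: float = WARN_BELOW
) -> Reweight:
    """Return the lowered weight and whether it falls under the warning threshold."""
    # Rounding is an assumption of this sketch; the original scripts do not round.
    new_weight = round(cur_weight - step, 2)
    return Reweight(new_weight=new_weight, warn=new_weight < warn_below)
```

A caller would pass the `reweight` value read from `ceph osd df` and hand `str(result.new_weight)` to `ceph osd reweight`, printing the warning when `result.warn` is true.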

#!/usr/bin/env python3
# Lower by 0.05 the reweight of every OSD in RACK whose usage variance is 1.1 or more.
import json
import subprocess

import click


@click.command()
@click.argument("RACK")
def main(rack: str) -> None:
    rack_json = subprocess.check_output(["ceph", "osd", "df", rack, "-f=json"])
    rack_data = json.loads(rack_json)
    to_reweight: list[tuple[str, float]] = []

    # Collect the planned changes first so they can be reviewed before applying.
    for node_data in rack_data["nodes"]:
        osd = node_data["name"]
        osd_variance = node_data["var"]
        if osd_variance < 1.1:
            print(f"Skipping osd {osd}, not too full (var={osd_variance})")
            continue

        print(f"Reweighting osd {osd}")
        weight_json = subprocess.check_output(["ceph", "osd", "df", "-f=json", osd])
        weight_data = json.loads(weight_json)
        cur_weight = weight_data["nodes"][0]["reweight"]
        print(f"    cur_weight={cur_weight}")
        new_weight = cur_weight - 0.05
        if new_weight < 0.5:
            print(f"WARNING: weight of {osd} will go under 0.5 -> {new_weight}")
        print(f"    new_weight={new_weight}")
        to_reweight.append((osd, new_weight))

    print(f"Reweighting a total of {len(to_reweight)} osds")
    print("If that's ok, hit enter")
    input()
    for osd, new_weight in to_reweight:
        print(f"Reweighting osd {osd}")
        subprocess.check_call(["ceph", "osd", "reweight", osd, str(new_weight)])


if __name__ == "__main__":
    main()

#!/usr/bin/env python3
# Reset the reweight of every OSD in the cluster back to 1.
import json
import subprocess


tree_json = subprocess.check_output(["ceph", "osd", "tree", "-f=json"])
tree_data = json.loads(tree_json)

osds = []
new_weight = 1

for node in tree_data["nodes"]:
    # The tree also lists racks and hosts; only OSD entries are of interest.
    if not node["name"].startswith("osd."):
        continue

    osd = node["name"]
    print(f"Reweighting osd {osd}")
    weight_json = subprocess.check_output(["ceph", "osd", "df", "-f=json", osd])
    weight_data = json.loads(weight_json)
    cur_weight = weight_data["nodes"][0]["reweight"]
    print(f"    cur_weight={cur_weight}")
    if cur_weight == 0:
        print("    SKIPPING as it's out of the cluster")
        continue
    elif cur_weight == 1:
        print("    SKIPPING as it's already reset")
        # The original was missing this continue, so the OSD was queued anyway.
        continue

    print(f"    new_weight={new_weight}")
    osds.append(osd)


print("If that's ok, hit enter")
input()
for osd in osds:
    subprocess.check_call(["ceph", "osd", "reweight", osd, str(new_weight)])
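
The skip rules that the reset script prints (leave alone OSDs that are out of the cluster, and OSDs already at weight 1) can be checked in isolation with a pure function. This is a sketch only: `osds_to_reset` is a hypothetical helper, and the sample weights are made up for illustration rather than taken from the cluster.

```python
def osds_to_reset(reweights: dict[str, float]) -> list[str]:
    """Return the OSD names whose reweight should be set back to 1."""
    selected = []
    for osd, cur_weight in reweights.items():
        if cur_weight == 0:
            continue  # out of the cluster, leave it alone
        if cur_weight == 1:
            continue  # already at the default, nothing to do
        selected.append(osd)
    return selected


# Made-up sample values, for illustration only.
sample = {"osd.1": 1.0, "osd.2": 0.85, "osd.3": 0.0, "osd.4": 0.95}
print(osds_to_reset(sample))
```

With the mapping built from the `reweight` field of `ceph osd df`, the returned list is what the final loop would pass to `ceph osd reweight`.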

Event Timeline

@aborrero, @fnegri, @Andrew can you take a look at the list and note whether each host needs draining?

We should drain the osds and cloudvirts. The few others should be fine.

Yeah, the services provided by cloudnet, cloudservices, cloudlb and cloudgw should survive a switch outage.

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-05T16:21:01Z] <andrew@cloudcumin1001> START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1031.eqiad.wmnet' (T374043)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-05T16:42:37Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1031.eqiad.wmnet' (T374043)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-05T16:43:30Z] <andrew@cloudcumin1001> START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1032.eqiad.wmnet' (T374043)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-05T17:08:13Z] <andrew@cloudcumin1001> END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) on host 'cloudvirt1032.eqiad.wmnet' (T374043)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-05T17:08:34Z] <andrew@cloudcumin1001> START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1033.eqiad.wmnet' (T374043)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-05T17:22:43Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1033.eqiad.wmnet' (T374043)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-05T18:05:54Z] <andrew@cloudcumin1001> START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1034.eqiad.wmnet' (T374043)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-05T18:29:03Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1034.eqiad.wmnet' (T374043)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-05T18:45:31Z] <andrew@cloudcumin1001> START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1035.eqiad.wmnet' (T374043)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-05T18:58:43Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1035.eqiad.wmnet' (T374043)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-17T16:11:27Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.ceph.osd.undrain_node (T374043)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-17T16:11:38Z] <wmbot~dcaro@urcuchillay> END (ERROR) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=97) (T374043)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-17T16:24:14Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.ceph.osd.undrain_node (T374043)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-17T21:24:55Z] <wmbot~dcaro@urcuchillay> END (FAIL) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=99) (T374043)

dcaro triaged this task as Medium priority. (Sep 18 2024, 2:15 PM)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-19T09:55:41Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.ceph.osd.undrain_node (T374043)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-19T09:56:31Z] <wmbot~dcaro@urcuchillay> END (ERROR) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=97) (T374043)

No, this is done