
Productionize new clouddb* hosts (clouddb1022-1033)
Open, Medium, Public

Description

The following hosts need to be productionized:

  • clouddb1022 expansion s3, x3 - cloned
  • clouddb1023 expansion s3, x3 - cloned
  • clouddb1024 expansion x4, x1 - (s4 cloned, pending the renaming to x4 and puppet changes)
  • clouddb1025 s4 (cloned), s6 (cloned) [temp replacement host for clouddb1019, later will host expansion x4, x1]
  • clouddb1026 refresh for clouddb1013 (s1)
  • clouddb1027 refresh for clouddb1014 (s2, s7)
  • clouddb1028 refresh for clouddb1015 (s4, s6)
  • clouddb1029 refresh for clouddb1016 (s8, s5)
  • clouddb1030 refresh for clouddb1017 (s1)
  • clouddb1031 refresh for clouddb1018 (s2, s7)
  • clouddb1032 refresh for clouddb1019 (s4, s6)
  • clouddb1033 refresh for clouddb1020 (s8, s5)
  • Check all hosts are in zarcillo

Event Timeline


Change #1204364 merged by Marostegui:

[operations/puppet@production] clouddb1022: Initial puppet run

https://gerrit.wikimedia.org/r/1204364

Change #1204749 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] check_private_data_report: Add new hosts

https://gerrit.wikimedia.org/r/1204749

cloning clouddb1022:x3 from clouddb1016

Change #1204749 merged by Marostegui:

[operations/puppet@production] check_private_data_report: Add new hosts

https://gerrit.wikimedia.org/r/1204749

Change #1204812 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Productionize clouddb1023

https://gerrit.wikimedia.org/r/1204812

Change #1204812 merged by Marostegui:

[operations/puppet@production] mariadb: Productionize clouddb1023

https://gerrit.wikimedia.org/r/1204812

Change #1205001 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Productionize clouddb1024

https://gerrit.wikimedia.org/r/1205001

Change #1205001 merged by Marostegui:

[operations/puppet@production] mariadb: Productionize clouddb1024

https://gerrit.wikimedia.org/r/1205001

Change #1205003 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] clouddb1024: Change s4 with x4

https://gerrit.wikimedia.org/r/1205003

Change #1205003 merged by Marostegui:

[operations/puppet@production] clouddb1024: Change s4 with x4

https://gerrit.wikimedia.org/r/1205003

I think for now, the hosts with x4 will be left as s4, otherwise things will get weird with just clouddb* belonging to x4 (which will also show up in orchestrator as a new section with just those two hosts).
Once x4 is fully ready to be split from s4, we'll migrate the clouddb* hosts to that new name.

Change #1207125 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Productionize clouddb1025

https://gerrit.wikimedia.org/r/1207125

Change #1207125 merged by Marostegui:

[operations/puppet@production] mariadb: Productionize clouddb1025

https://gerrit.wikimedia.org/r/1207125

Change #1207757 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] check_private_data_report: Add clouddb102[45]

https://gerrit.wikimedia.org/r/1207757

Change #1207757 merged by Marostegui:

[operations/puppet@production] check_private_data_report: Add clouddb102[45]

https://gerrit.wikimedia.org/r/1207757

Change #1210944 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] clouddb1024: Enable notifications

https://gerrit.wikimedia.org/r/1210944

Change #1210944 merged by Marostegui:

[operations/puppet@production] clouddb1024: Enable notifications

https://gerrit.wikimedia.org/r/1210944

Change #1211044 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] clouddb1025: Enable notifications

https://gerrit.wikimedia.org/r/1211044

Change #1211044 merged by Marostegui:

[operations/puppet@production] clouddb1025: Enable notifications

https://gerrit.wikimedia.org/r/1211044

@fnegri @taavi clouddb1024:s4 and clouddb1025:s4 can be added to the load balancer and start receiving traffic whenever you want. Those hosts will later become x4 (in a few months, but nothing is required for now).

Change #1211083 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] conftool-data: Add clouddb1024/5 as x4

https://gerrit.wikimedia.org/r/1211083

Mentioned in SAL (#wikimedia-operations) [2025-11-25T15:29:48Z] <marostegui> Add clouddb1023 (s3,x3) to zarcillo T409557

Change #1211083 merged by Majavah:

[operations/puppet@production] conftool-data: Add clouddb1024/5 as x4

https://gerrit.wikimedia.org/r/1211083

Change #1212343 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] clouddb102[23]: Enable notifications

https://gerrit.wikimedia.org/r/1212343

Change #1212343 merged by Marostegui:

[operations/puppet@production] clouddb102[23]: Enable notifications

https://gerrit.wikimedia.org/r/1212343

@taavi @fnegri clouddb1022 and clouddb1023 (both x3 and s3) are ready to be added to the LB.

Change #1256417 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] conftool-data: move s3, x3 to new hosts

https://gerrit.wikimedia.org/r/1256417

Change #1259113 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] conftool-data: move s3, x3 to new hosts (part 2)

https://gerrit.wikimedia.org/r/1259113

Change #1256417 merged by FNegri:

[operations/puppet@production] conftool-data: move s3, x3 to new hosts (part 1)

https://gerrit.wikimedia.org/r/1256417

s3 and x3 traffic is now routed to clouddb102[23]:

fnegri@cumin1003:~$ sudo confctl select service=s3 get
{"clouddb1017.eqiad.wmnet": {"weight": 100, "pooled": "no"}, "tags": "dc=eqiad,cluster=wikireplica-db-analytics,service=s3"}
{"clouddb1023.eqiad.wmnet": {"weight": 100, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wikireplica-db-analytics,service=s3"}
{"clouddb1013.eqiad.wmnet": {"weight": 100, "pooled": "no"}, "tags": "dc=eqiad,cluster=wikireplica-db-web,service=s3"}
{"clouddb1022.eqiad.wmnet": {"weight": 100, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wikireplica-db-web,service=s3"}
fnegri@cumin1003:~$ sudo confctl select service=x3 get
{"clouddb1020.eqiad.wmnet": {"weight": 100, "pooled": "no"}, "tags": "dc=eqiad,cluster=wikireplica-db-analytics,service=x3"}
{"clouddb1023.eqiad.wmnet": {"weight": 100, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wikireplica-db-analytics,service=x3"}
{"clouddb1016.eqiad.wmnet": {"weight": 100, "pooled": "no"}, "tags": "dc=eqiad,cluster=wikireplica-db-web,service=x3"}
{"clouddb1022.eqiad.wmnet": {"weight": 100, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wikireplica-db-web,service=x3"}
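
For anyone who wants to double-check output like the above without reading every line, here is a minimal Python sketch that summarizes which hosts are pooled per cluster. It assumes each output line is a standalone JSON object shaped like the ones pasted here; the function name `pooled_hosts` and the `SAMPLE` variable are made up for illustration and are not part of conftool.

```python
import json


def pooled_hosts(confctl_output):
    """Map each cluster tag to the hosts reported as pooled ("yes")."""
    result = {}
    for line in confctl_output.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip shell prompts and blank lines
        record = json.loads(line)
        # "tags" is a comma-separated list of key=value pairs.
        tags = dict(pair.split("=", 1) for pair in record.pop("tags").split(","))
        # Every remaining key is a host name mapped to its state.
        for host, state in record.items():
            if state.get("pooled") == "yes":
                result.setdefault(tags["cluster"], []).append(host)
    return result


# The s3 output from this comment, copied verbatim.
SAMPLE = """
{"clouddb1017.eqiad.wmnet": {"weight": 100, "pooled": "no"}, "tags": "dc=eqiad,cluster=wikireplica-db-analytics,service=s3"}
{"clouddb1023.eqiad.wmnet": {"weight": 100, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wikireplica-db-analytics,service=s3"}
{"clouddb1013.eqiad.wmnet": {"weight": 100, "pooled": "no"}, "tags": "dc=eqiad,cluster=wikireplica-db-web,service=s3"}
{"clouddb1022.eqiad.wmnet": {"weight": 100, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wikireplica-db-web,service=s3"}
"""

for cluster, hosts in sorted(pooled_hosts(SAMPLE).items()):
    print(cluster, hosts)
```

With the s3 output this reports clouddb1023 as the only pooled analytics host and clouddb1022 as the only pooled web host, matching the statement above.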

@Marostegui given clouddb1019 is dead (T422813: clouddb1019 down) and we won't be getting its replacement clouddb1032 for a while, I was wondering whether we should (temporarily) re-evaluate the allocation of sections for clouddb102[2-5] so that we have redundancy for s4 and s6. At the moment s4 and s6 are only running on clouddb1015. Do you think it's feasible to clone another copy of them to e.g. clouddb102[23]? This would only be a temporary measure until clouddb1032 is available.

I can do that but I'll have to stop that host to clone the new one, would that be okay for the service?

> I can do that but I'll have to stop that host to clone the new one, would that be okay for the service?

Good point, I forgot about that! :) How long does it take more or less? We could schedule it for some time next week and I can send a notification out to cloud-announce.

I think it may take around 5-6 hours. But to be on the safe side in case of issues, I'd suggest we announce a possible 24h degradation window, stating that it is likely to take far less.
I'd also do one at a time, so we limit the impact to one section at a time.
Let me know what you think

@Marostegui for s6, let's proceed with announcing a 24h maintenance window. Let me know when you have time to do it and I'll send out the announcement.

For s4, I have an idea but I'm not sure if it's doable: given that s4 is already cloned to clouddb102[45] (in preparation for the x4 split), could you clone another copy of s4 from one of those instead of cloning from clouddb1015? This would avoid downtime for s4 replica users.

> @Marostegui for s6, let's proceed with announcing a 24h maintenance window. Let me know when you have time to do it and I'll send out the announcement.

I can work on it on Tuesday. So I can get s4 done on Monday.

> For s4, I have an idea but I'm not sure if it's doable: given that s4 is already cloned to clouddb102[45] (in preparation for the x4 split), could you clone another copy of s4 from that one instead of cloning from clouddb1015? This would avoid downtime for s4 replica users.

Yeah, that's a very good idea! We could also simply add that one to the LB too, but then we'd end up with clouddb1032 with s6 and clouddb1024 with s4; better to have both on the same host.

Change #1273769 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Productionize clouddb1032

https://gerrit.wikimedia.org/r/1273769

Change #1273769 abandoned by Marostegui:

[operations/puppet@production] mariadb: Productionize clouddb1032

Reason:

Wrong patch

https://gerrit.wikimedia.org/r/1273769

Change #1273777 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] site.pp: Move clouddb1024 to analytics

https://gerrit.wikimedia.org/r/1273777

Change #1273777 merged by Marostegui:

[operations/puppet@production] site.pp: Move clouddb1024 to analytics

https://gerrit.wikimedia.org/r/1273777

Change #1273785 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] cloudb1024: Add s6

https://gerrit.wikimedia.org/r/1273785

Change #1275286 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] eqiad.yaml: Add clouddb1024

https://gerrit.wikimedia.org/r/1275286

Change #1275286 merged by Marostegui:

[operations/puppet@production] eqiad.yaml: Add clouddb1025 to s4

https://gerrit.wikimedia.org/r/1275286

Change #1259113 merged by FNegri:

[operations/puppet@production] conftool-data: move s3, x3 to new hosts (part 2)

https://gerrit.wikimedia.org/r/1259113

Change #1273785 merged by Marostegui:

[operations/puppet@production] cloudb1025: Add s6

https://gerrit.wikimedia.org/r/1273785

Change #1275747 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] clouddb1025: Disable notifications

https://gerrit.wikimedia.org/r/1275747

Change #1275748 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] eqiad.yaml: Add clouddb1025 to s6

https://gerrit.wikimedia.org/r/1275748

Change #1275748 merged by Marostegui:

[operations/puppet@production] eqiad.yaml: Add clouddb1025 to s6

https://gerrit.wikimedia.org/r/1275748

Downtime for s6 was limited to less than 30 mins, from 05:30 UTC to 05:58 UTC:

root@cloudlb1001:~# journalctl --since -6h -g "Server wikireplica-db-web-s6"
Apr 21 05:30:03 cloudlb1001 haproxy[2659098]: [WARNING]  (2659098) : Server wikireplica-db-web-s6/clouddb1015.eqiad.wmnet is DOWN. 0 active and 0 backup servers left. 0 sessions active, 0 req>
Apr 21 05:58:31 cloudlb1001 haproxy[2659098]: [WARNING]  (2659098) : Server wikireplica-db-web-s6/clouddb1015.eqiad.wmnet is UP. 1 active and 0 backup servers online. 0 sessions requeued, 0 t>
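
To confirm the "less than 30 mins" figure from the two log lines above, here is a rough Python sketch that pairs each DOWN event with the next UP event for the same server and reports the gap. The regular expression is an assumption based only on the two lines pasted here, and `downtime_windows` and `SAMPLE` are illustrative names. The journal lines carry no year, so a placeholder year is passed in; it only matters for windows that cross a year boundary or Feb 29.

```python
import re
from datetime import datetime, timedelta

# Matches lines like: "Apr 21 05:30:03 host haproxy[...]: ... Server <name> is DOWN."
LINE_RE = re.compile(r"^(\w{3} +\d+ \d\d:\d\d:\d\d) .*Server (\S+) is (DOWN|UP)")


def downtime_windows(log_text, year=2000):
    """Return (server, duration) for every DOWN -> UP pair found in the log."""
    windows = []
    went_down = {}  # server name -> time of the first unmatched DOWN event
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        stamp, server, state = match.groups()
        # Placeholder year: the log lines themselves do not include one.
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        if state == "DOWN":
            went_down.setdefault(server, when)
        elif server in went_down:
            windows.append((server, when - went_down.pop(server)))
    return windows


# The two haproxy lines from this comment, truncated exactly as pasted.
SAMPLE = """\
Apr 21 05:30:03 cloudlb1001 haproxy[2659098]: [WARNING]  (2659098) : Server wikireplica-db-web-s6/clouddb1015.eqiad.wmnet is DOWN. 0 active and 0 backup servers left. 0 sessions active, 0 req>
Apr 21 05:58:31 cloudlb1001 haproxy[2659098]: [WARNING]  (2659098) : Server wikireplica-db-web-s6/clouddb1015.eqiad.wmnet is UP. 1 active and 0 backup servers online. 0 sessions requeued, 0 t>
"""

for server, duration in downtime_windows(SAMPLE):
    print(server, duration)
```

For the lines above this gives a single window of 28 minutes and 28 seconds for wikireplica-db-web-s6/clouddb1015.eqiad.wmnet.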

s4 and s6 are now pooled in clouddb1025:

fnegri@cumin1003:~$ sudo confctl select name=clouddb1025.eqiad.wmnet get
{"clouddb1025.eqiad.wmnet": {"weight": 100, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wikireplica-db-analytics,service=s4"}
{"clouddb1025.eqiad.wmnet": {"weight": 100, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wikireplica-db-analytics,service=s6"}