
Backups for x3
Closed, Resolved, Public

Description

We are slowly splitting s8 into the main section (for all other tables) and x3 (the term store tables, which start with wbt_). Of the 1.6TB currently in s8, these tables will be moved out:

root@db1172:/srv/sqldata/wikidatawiki# ls -Ssh | grep -i wbt_
249G wbt_item_terms.ibd
 51G wbt_text_in_lang.ibd
 47G wbt_term_in_lang.ibd
 25G wbt_text.ibd
 31M wbt_property_terms.ibd
 64K wbt_type.ibd

which roughly translates to 370GB. These are all derived data, but rebuilding them would take a long time (slightly outdated data would still be fine). This also means the s8 main cluster will shrink to 1.2TB. It could probably sit next to the s8 backup sources in a separate mariadb daemon, but that is for @jcrespo to decide.
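
For reference, a minimal sketch of how that figure follows from the tablespace files (the one-liner is illustrative and assumes the same datadir as above; it was not run as part of this task):

# sum the wbt_ tablespace sizes in bytes; field 5 of `ls -l` is the file size
ls -l wbt_*.ibd | awk '{s += $5} END {printf "%.0f GiB\n", s / 1024^3}'
# prints roughly "370 GiB" given the sizes listed above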

  • Expected date: 2025-04-21

Event Timeline

Thanks, could you also add a date for when this should be up and running? (I am asking you to guesstimate; we can change it later, but that way I can prepare best.)

The code changes to allow the switch are mostly ready, but the hardware ran into problems. The eqiad request got lost (T379752#10453412) and the codfw Supermicro one had issues (I don't know whether that's resolved or not). So my guess would be something along the lines of three months until they get racked, productionized, and pooled into production (and then we can slowly start the switch).

I am trying to find out what is going on with the HW. The codfw Dell hosts are in place (they've been there for a few months now). The Supermicro hosts in both DCs aren't missing in action. The Dell hosts in eqiad were received in November but never racked, as far as I know.

jcrespo triaged this task as High priority.

Change #1148278 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Setup backups for x3

https://gerrit.wikimedia.org/r/1148278

Icinga downtime and Alertmanager silence (ID=3e611667-66d6-4637-a866-27b93dc8803e) set by jynus@cumin1002 for 4:00:00 on 2 host(s) and their services with reason: Move s8 to s3

db2200.codfw.wmnet,db1216.eqiad.wmnet

Change #1148278 merged by Jcrespo:

[operations/puppet@production] dbbackups: Setup backups for x3

https://gerrit.wikimedia.org/r/1148278

@Marostegui @Ladsgroup The backups are set up, however:

  • db1216 and db2200 are not yet replicating from the x3 masters; this needs changing, but
  • There is a blocker in how the ports are defined. The multiinstance ports for x2, and consequently x3, were badly chosen: there is a conflict between x1's additional main port (3320 + 20) and x3's (3340), which is why the latter's port was chosen as 3350 instead. x2 should be something like 3390 or 3391, and thus x3 3391 or 3400, so that there are no conflicts (see the sketch below). Right now, if db1216 is restarted, x3 will fail to start. Happy to contemplate other options (like not using port + 20 everywhere).
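
A quick illustration of the clash under the current port + 20 convention (the port numbers are the ones quoted above; the echo is just arithmetic, not a command anyone ran):

# x1's main port is 3320, so its additional port is 3320 + 20:
echo $((3320 + 20))   # 3340, the very port assigned to x3
# hence a restart of db1216 cannot bring up x3 while x1's extra port holds 3340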

^ This last part is something only you can decide (but I will be happy to implement and deploy whatever you decide).

I don't really have any strong opinions on the ports; we rarely use them anyway, so we can change them as you wish.
I am fine with x2 being 3390 and x3 3400.
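
With that mapping, the + 20 extras land on 3410 and 3420 and no longer collide with anything. A hypothetical pre-flight check before restarting (the command is an assumption, not taken from this task):

# confirm nothing on the host already listens on the new ports or their extras
ss -tln | grep -E ':(3390|3400|3410|3420)\b' || echo "ports free"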

Ok, let me send a patch and I will ask for a +1 from both of you on the proposal, so I will be able to restart db1216 without issues. I think it should only affect db1216 and db2200 and they will be the only multiinstance x3 hosts (I don't think there is any multiinstance x2 host).

Correct - we do not have multi-instance x2 hosts in production anywhere.

Change #1148822 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mariadb: Change x2 and x3 ports to avoid conflicts with extra port

https://gerrit.wikimedia.org/r/1148822

I am now going to merge and restart the instances.
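
For context, a sketch of what such a restart might look like on one of these multiinstance hosts, assuming the usual mariadb@<section> unit naming (the unit name is an assumption, not a command recorded here):

# restart only the x3 instance so the other sections keep serving
sudo systemctl restart mariadb@x3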

Icinga downtime and Alertmanager silence (ID=0aaaf16c-1af9-4d2a-91a4-ea1a68d943e2) set by jynus@cumin1002 for 4:00:00 on 2 host(s) and their services with reason: Restart x3

db2200.codfw.wmnet,db1216.eqiad.wmnet

Change #1148822 merged by Jcrespo:

[operations/puppet@production] mariadb: Change x2 and x3 ports to avoid conflicts with extra port

https://gerrit.wikimedia.org/r/1148822

Mentioned in SAL (#wikimedia-operations) [2025-05-21T15:25:07Z] <jynus> forgetting 4 old instances @ orchestrator-web T384274

The backups are being generated correctly, as shown here: T384274#10844543, and the hosts are replicating from the hosts mentioned at T351820#10844786. The only thing I didn't do was upgrade to 10.11, as that would be complex right now without upgrading more sections. I declare this resolved, and defer pending maintenance operations for the x3 split of these 2 instances to @Ladsgroup.
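
A sketch of how the per-instance replication state can be checked on the backup sources (the socket path follows the usual per-section multiinstance naming and is an assumption, as is the field filter):

# on db1216 / db2200, query the x3 instance through its own socket
sudo mysql -S /run/mysqld/mysqld.x3.sock -e "SHOW SLAVE STATUS\G" \
    | grep -E 'Master_Host|Seconds_Behind_Master'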

I've updated zarcillo to monitor the new section.

Other than that, I consider this resolved.

Change #1149329 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Check for newly created x3 backups at icinga

https://gerrit.wikimedia.org/r/1149329

Change #1149329 merged by Jcrespo:

[operations/puppet@production] dbbackups: Check for newly created x3 backups at icinga

https://gerrit.wikimedia.org/r/1149329

I've added monitoring to the new backups, on both icinga and backupmon:

(screenshot: the new x3 backup checks in icinga and backupmon)

Change #1149338 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Update backup ports for x3

https://gerrit.wikimedia.org/r/1149338

Change #1149338 merged by Jcrespo:

[operations/puppet@production] dbbackups: Update backup ports for x3

https://gerrit.wikimedia.org/r/1149338