
Decide how to use the new clouddb hosts (clouddb102[2-5])
Closed, Resolved · Public

Description

4 new clouddb hosts (clouddb1022 to clouddb1025) have just been racked in T393733: Q4:rack/setup/install clouddb102[2-5].

The existing 8 clouddb hosts (clouddb1013 to clouddb1020) are due for replacement soon (2025 according to this spreadsheet).

My understanding is that in this FY we're going to order 8 additional hosts for a total of 12 clouddbs.

We need to define how we want to spread the database sections now that we have more hosts. This might allow us to also replicate additional sections, as discussed in T395881: Set up x1 replication to Wiki Replicas.

Event Timeline

In our discussions back then during budgeting, the ideas for these new hosts were:

  • Adding one instance (not host) for x1
  • Moving s4 and s8 to their own instances
    • I think this doesn't make much sense anymore given that we have split s8 into x3 and will split s4 into x4. This also means we will need to set up replication for x4, and we probably need to move x3 off the host that's currently supporting both s8 and x3 (and other sections too)

IIRC, the idea was to move out s4 and s8 to dedicated hosts and use the free space to give to x1, x3 and x4

Obviously, the map of which section to where was not defined.

fnegri triaged this task as Medium priority. Aug 20 2025, 2:30 PM
akosiaris changed the task status from Open to Stalled. Sep 16 2025, 2:01 PM

Setting to stalled, while we figure out the exact details of this one.

So my proposal would be the following, considering the 8 existing hosts plus the 4 expansion hosts already racked.
Considerations

  • Leave s1 alone
  • Add x1, x4 and of course maintain x3
  • Two sections per host (excluding s1) now that we have enough hosts

This is what we currently have:

clouddb1013: s1, s3
clouddb1014: s2, s7
clouddb1015: s4, s6
clouddb1016: s8, x3, s5
clouddb1017: s1, s3
clouddb1018: s2, s7
clouddb1019: s4, s6
clouddb1020: s8, x3, s5

My proposal would be:

clouddb1013 : s1           (s3 removed)
clouddb1014 : s2, s7       (unchanged)
clouddb1015 : s4, s6       (unchanged)
clouddb1016 : s8, s5       (x3 removed)
clouddb1017 : s1           (s3 removed)
clouddb1018 : s2, s7       (unchanged)
clouddb1019 : s4, s6       (unchanged)
clouddb1020 : s8, s5       (x3 removed)

### New hosts ###

clouddb1022 : s3, x3
clouddb1023 : s3, x3
clouddb1024 : x4, x1
clouddb1025 : x4, x1

x4 is still a bit far from being ready (T398709 T404715) but we should just plan for it.
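As a sanity check, the proposed layout can be expressed as data and verified programmatically: every section should end up with exactly two replicas, and s1 should sit alone on its hosts. This is a hypothetical sketch, not part of any existing tooling:

```python
# Hypothetical sketch: verify the proposed section-to-host layout.
# Every section should have exactly two replicas, and s1 should be
# the only section on its hosts.
from collections import Counter

proposed = {
    "clouddb1013": ["s1"],
    "clouddb1014": ["s2", "s7"],
    "clouddb1015": ["s4", "s6"],
    "clouddb1016": ["s8", "s5"],
    "clouddb1017": ["s1"],
    "clouddb1018": ["s2", "s7"],
    "clouddb1019": ["s4", "s6"],
    "clouddb1020": ["s8", "s5"],
    "clouddb1022": ["s3", "x3"],
    "clouddb1023": ["s3", "x3"],
    "clouddb1024": ["x4", "x1"],
    "clouddb1025": ["x4", "x1"],
}

replicas = Counter(s for secs in proposed.values() for s in secs)
assert all(n == 2 for n in replicas.values()), replicas
assert all(secs == ["s1"] for secs in proposed.values() if "s1" in secs)
print(f"{len(replicas)} sections, 2 replicas each")
```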

Thoughts?

Are you leaving s1 alone based on its size, or expected traffic? Are some sections growing in size much faster than others?

If size allows, I think it might be interesting to consider having 3 copies of (all|some) sections, on 3 different hosts. The reason I'm suggesting this is that we have only one copy with a high query timeout (the "analytics" replica with timeout at 10800 seconds), so when we have to depool that copy, we're effectively degrading the user experience because we're redirecting traffic to the "web" replica with a lower timeout of 300 seconds.

I have to say I'm not entirely sure we need to have a special host with a lower timeout, we could also experiment with having the same value of 10800 on all hosts, in that case 2 copies would be enough and we could load-balance queries between the 2 copies.

It would also be nice to gather some more data on which sections get more traffic from users.
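The depool concern above can be sketched with a toy model: with only one replica at each timeout tier, removing the analytics copy leaves long-running queries with nowhere to run. The 10800 s and 300 s values are from this thread; the routing function itself is purely illustrative, not the real load balancer:

```python
# Toy model of the two replica tiers discussed above. Timeout values
# are from the thread; the routing logic is illustrative only.
TIMEOUTS = {"analytics": 10800, "web": 300}

def usable_replicas(runtime_s, depooled=()):
    """Replicas whose query timeout can accommodate a query of runtime_s."""
    return [name for name, limit in TIMEOUTS.items()
            if name not in depooled and runtime_s <= limit]

# A one-hour query only fits on the analytics replica...
print(usable_replicas(3600))                           # ['analytics']
# ...so depooling that copy leaves the query with nowhere to go.
print(usable_replicas(3600, depooled=("analytics",)))  # []
```

With a second analytics copy (or a uniform 10800 s timeout everywhere), the second call would still find a usable replica.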

fnegri changed the task status from Stalled to In Progress.Oct 23 2025, 2:53 PM

> Are you leaving s1 alone based on its size, or expected traffic? Are some sections growing in size much faster than others?

> If size allows, I think it might be interesting to consider having 3 copies of (all|some) sections, on 3 different hosts. The reason I'm suggesting this is that we have only one copy with a high query timeout (the "analytics" replica with timeout at 10800 seconds), so when we have to depool that copy, we're effectively degrading the user experience because we're redirecting traffic to the "web" replica with a lower timeout of 300 seconds.

It could cause a lot of issues. I'm not sure a host can take three sections at the same time (in terms of writes, reads, etc.) [1]. I think it could even hit storage limits (especially when we need to do maintenance work and large ALTER TABLEs). Most importantly, by splitting into three, each section gets a much smaller portion of the memory for the InnoDB buffer pool, which degrades query performance for everyone at all times. We could look into buying more hosts so we could still have two sections per host but with three copies, but that would be a bit expensive (I think around $100K for the expansion).

[1] Except s5, s6, x1, x3. They are tiny.
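The buffer-pool argument is easy to quantify with back-of-the-envelope numbers. The figures below (512 GiB of RAM, 75% given to InnoDB) are assumptions for illustration only, not the actual clouddb configuration:

```python
# Illustration of how the per-section InnoDB buffer pool shrinks as
# more sections share a host. RAM size and the 75% buffer-pool share
# are hypothetical, not the real clouddb settings.
RAM_GIB = 512
POOL_GIB = RAM_GIB * 0.75  # common rule of thumb for a dedicated DB host

for n in (1, 2, 3):
    print(f"{n} section(s) per host: ~{POOL_GIB / n:.0f} GiB of buffer pool each")
```

Going from two sections to three cuts each section's share of the pool by a third, which is the performance regression described above.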

> So my proposal would be to do this considering 8 existing hosts + 4 expansion hosts already racked. [full proposal quoted above]

s2 and s7 are both rather big, so I hoped to find a tiny section to swap, but I can't find another home for them (x1 and x4 seemed the most likely, but x4 will be around 1 TB too, so no). In other words, the arrangement looks good to me.

> Are you leaving s1 alone based on its size, or expected traffic? Are some sections growing in size much faster than others?

We are ok in terms of disk space, but s1 is the busiest section in production and may have replication issues if not isolated.

> If size allows, I think it might be interesting to consider having 3 copies of (all|some) sections, on 3 different hosts. The reason I'm suggesting this is that we have only one copy with a high query timeout (the "analytics" replica with timeout at 10800 seconds), so when we have to depool that copy, we're effectively degrading the user experience because we're redirecting traffic to the "web" replica with a lower timeout of 300 seconds.

Having 3 copies can potentially bring lots of performance issues. While disk space is OK-ish for now, it could make us see lag, because we'd have to split the memory buffers into three rather than two. I don't think this is a very good approach.

I don't have any figures in mind, but I don't think we live with a host depooled very often (or for very long), so I think we are trying to solve an issue that doesn't arise often by risking a general performance problem.

> s1 is the busiest section in production and may have replication issues if not isolated.

Did we ever have replication issues on s1 now that it's coexisting with s3? I'm also confused because we have a host with 8 sections (an-redacteddb1001) and replication seems to work fine there, so I assumed we could easily handle 3 sections if there's enough disk space.

> I think we are trying to solve an issue that doesn't arise often

I agree the issue does not arise often and is not a serious issue, so if you are worried about performance issues I'm ok with sticking to your plan above, where s1 is isolated and other hosts have 2 sections each.

>> s1 is the busiest section in production and may have replication issues if not isolated.

> Did we ever have replication issues on s1 now that it's coexisting with s3? I'm also confused because we have a host with 8 sections (an-redacteddb1001) and replication seems to work fine there, so I assumed we could easily handle 3 sections if there's enough disk space.

an-redacteddb1001 doesn't get any reads, except once a month to be sqooped, so it's hard to compare the two (whereas in WMCS it's constantly getting a lot of reads). We used to have more sections per host for WMCS and it was a pain. Also note that that host has much bigger disks.

>> s1 is the busiest section in production and may have replication issues if not isolated.

> Did we ever have replication issues on s1 now that it's coexisting with s3? I'm also confused because we have a host with 8 sections (an-redacteddb1001) and replication seems to work fine there, so I assumed we could easily handle 3 sections if there's enough disk space.

The workload between both hosts is very different and shouldn't be compared. Essentially an-redacteddb1001 has no reads other than a few days during the month.

>> I think we are trying to solve an issue that doesn't arise often

> I agree the issue does not arise often and is not a serious issue, so if you are worried about performance issues I'm ok with sticking to your plan above, where s1 is isolated and other hosts have 2 sections each.

I'd rather stick to this plan now that we've got enough HW to do so.
In any case, maybe it is time to re-evaluate the query killer timers, but that's something you probably know better, as you have more contact with what the community may or may not need in terms of query times.

fnegri closed this task as Resolved (edited). Oct 29 2025, 2:04 PM
fnegri moved this task from In progress to Done on the cloud-services-team (FY2025/2026-Q1-Q2) board.

> in wmcs it's constantly getting a lot of reads

Some sections are idle most of the time, and I think some clouddb hosts could withstand a much higher load than what they currently receive, but I don't have good stats to prove it. I will do more research on this as part of T381587: [wikireplicas] Gather usage stats.

> maybe it is time to re-evaluate the query killer timers

Yes I think so, I'll open a separate task for that.

I will mark this one as Resolved as I think we can proceed with the proposal above.

I created T408692: Set up replication on new hosts clouddb102[2-5] to track the implementation work.

>> in wmcs it's constantly getting a lot of reads

> Some sections are idle most of the time, and I think clouddb hosts could withstand a much higher load than what they currently receive, but I don't have good stats to prove it. I will do more research on this as part of T381587: [wikireplicas] Gather usage stats.

Being idle at the OS level is different from getting no reads (the data may simply be served from the buffer pool because we have enough RAM for that).

https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=now-24h&to=now&timezone=utc&var-job=$__all&var-server=clouddb1013&var-port=13311&refresh=1m&viewPanel=panel-3

And just a quick look right now at the enwiki one: queries with zero optimization, running right now:

| 324421741 | x       | 10.64.151.2:55462    | enwiki_p | Query     |        5 | Sending data                                           | select count(*) from linter inner join page on page.page_id=linter.linter_page where linter_cat=4 and page.page_namespace=0                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |    0.000 |
| 324421749 | x       | 10.64.151.2:55536    | enwiki_p | Execute   |        9 | Sending data                                           | SET STATEMENT max_statement_time=10 FOR SELECT COUNT(*) AS `user_editcount` FROM `revision_userindex` WHERE `rev_actor` = (SELECT actor_id FROM actor WHERE `actor_name` = ?)                                                                                                                                                                                                                                                                                                                                                                                                                                                               |    0.000 |
| 324421876 | x       | 10.64.151.2:56772    | enwiki_p | Execute   |        4 | Sending data                                           | SET STATEMENT max_statement_time=10 FOR SELECT COUNT(DISTINCT rev_page) AS count FROM `revision_userindex` WHERE `rev_actor` = (SELECT actor_id FROM actor WHERE `actor_name` = ?)                                                                                                                                                                                                                                                                                                                                                                                                                                                          |    0.000 |
| 324421981 | x       | 10.64.151.2:55372    | enwiki_p | Execute   |        0 | Sending data                                           | SET STATEMENT max_statement_time=10 FOR SELECT COUNT(DISTINCT rev_page) AS count FROM `revision_userindex` WHERE `rev_actor` = (SELECT actor_id FROM actor WHERE `actor_name` = ?)                                                                                                                                                                                                                                                                                                                                                                                                                                                          |    0.000 |
| 324421985 | x       | 10.64.151.2:55412    | enwiki_p | Query     |        0 | Sending data                                           | SELECT *,(SELECT `actor_user` FROM `actor` WHERE `actor`.`actor_id` = `recentchanges`.`rc_actor`) AS `userid`,(SELECT `actor_name` FROM `actor` WHERE `actor`.`actor_id` = `recentchanges`.`rc_actor`) AS `user_name`,(SELECT `comment_text` FROM `comment` WHERE `comment`.`comment_id` = `recentchanges`.`rc_comment_id`) AS `comment`,(SELECT `comment_data` FROM `comment` WHERE `comment`.`comment_id` = `recentchanges`.`rc_comment_id`) AS `comment_data` FROM `recentchanges` WHERE rc_namespace=0 AND rc_timestamp>=20251024192244 AND rc_this_oldid>'0' AND rc_timestamp<=20251029141323 ORDER BY `rc_this_oldid` ASC LIMIT 10000 |    0.000 |

And this is a production host in s1 as well

https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=now-24h&to=now&timezone=utc&var-job=$__all&var-server=db1169&var-port=9104&refresh=1m&viewPanel=panel-3

So as you can see the host is definitely getting queries and expensive ones.

>> maybe it is time to re-evaluate the query killer timers

> Yes I think so, I'll open a separate task for that.

Please be aware that those have been working fine for years, and before we had them the infra was pretty much down all the time. So please consider carefully what the changes may be. Again, I don't think we have a host depooled most of the time, so I am not sure this is worth investing time in.
You do own the infra, so of course this is entirely up to you, but I am advising to proceed with caution.

> And just a quick look right now on the enwiki one

Sorry, I did not explain myself clearly: what I meant to say was that it looks like some sections are much more loaded than others. I think ideally we would try to distribute the load as evenly as possible across the hosts we have, and the current load seems quite uneven. I think that's partly because we don't have enough data on which sections users are hitting more often, which is what I want to improve with T381587: [wikireplicas] Gather usage stats.

s1 is indeed getting a fair amount of queries, but if you look e.g. at clouddb1014 the picture is much different: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=now-24h&to=now&timezone=utc&var-job=$__all&var-server=clouddb1014&var-port=13312&refresh=1m&viewPanel=panel-3

> Please be aware that those have been working fine for years, and before getting them the infra was pretty much down all the time.

Yes I am aware that the current setup is the result of many previous iterations, and I did not mean to say the current setup is "wrong" (apologies if my comment may have sounded that way).

All that I'm saying is that given we are defining a new setup, going from 8 to 12 hosts, I think this is a good opportunity to think on how we can further improve the current setup and make sure that we use all the hosts as evenly and efficiently as possible.

>> And just a quick look right now on the enwiki one

> Sorry, I did not explain myself clearly: what I meant to say was that it looks like some sections are much more loaded than others. I think ideally we would try to distribute the load as evenly as possible across the hosts we have, and the current load seems quite uneven. I think that's partly because we don't have enough data on which sections users are hitting more often, which is what I want to improve with T381587: [wikireplicas] Gather usage stats.

Having more data would be very welcome.
At the time, we split them based on production usage, as we didn't have data (the old clouddb* infra was constantly broken, with data drifts pretty much everywhere, so not reliable at all).

> s1 is indeed getting a fair amount of queries, but if you look e.g. at clouddb1014 the picture is much different: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=now-24h&to=now&timezone=utc&var-job=$__all&var-server=clouddb1014&var-port=13312&refresh=1m&viewPanel=panel-3

>> Please be aware that those have been working fine for years, and before getting them the infra was pretty much down all the time.

> Yes I am aware that the current setup is the result of many previous iterations, and I did not mean to say the current setup is "wrong" (apologies if my comment may have sounded that way).

> All that I'm saying is that given we are defining a new setup, going from 8 to 12 hosts, I think this is a good opportunity to think on how we can further improve the current setup and make sure that we use all the hosts as evenly and efficiently as possible.

Yeah, absolutely. We have to keep in mind the usage from the users AND what comes from production (writes), which can potentially cause lag (spikes, or unbounded lag) if we are too aggressive placing sections and not using the RAM efficiently. Also keep in mind that we need room to grow: as the wikis grow, the data grows, and the share of it we can fit in memory will shrink, so we'll have to read more from disk (IO). Hence my earlier comments about placing 3 sections per host.
I'd rather be conservative here.

My point is also: what is the problem we are trying to solve? If it is just the fact that the service is degraded when we depool a host, I'd gather stats on how often we run into that situation over the months (I'd say it is mostly when we do upgrades, and it doesn't take much more than 15-30 minutes at a time). If it is when a host has issues... fortunately we've not had that situation many times (famous last words), and if the service is degraded (a bit) when we do, I think that's something we can live with if we are able to ensure future stability (no lag, basically) by having sections isolated and sharing fewer resources.

On the other hand, I think one of the easiest things to double-check (to test and roll back if needed) is the query killer timeouts, especially on the web service. If we start moving data around, that's going to take days to do and days to roll back if we have to (and days or weeks to see performance stabilise).

Thanks @Marostegui for the additional considerations. Your comment about RAM usage convinced me that placing 3 sections on a single host is too much of a risk for the benefit it would give us.

> My point is also, what is the problem we are trying to solve? [...] If it is when a host has issues

Yes, I think it would be nice to improve (if possible) our resilience to a host being depooled for a long time. This hasn't happened so far, but it might in the case of a hardware failure. It could potentially be quite painful, and I would like to test how realistic it is for the remaining host to withstand all the traffic (web+analytics), and what issues we would see in such a scenario (lag? slower queries? locks?).

I will gather more query stats, and also do some experimentation with the query killer timeouts.