
ToolsDB upgrade => Bullseye, MariaDB 10.4
Open, Needs Triage, Public

Description

We can't really just keep pushing this into the future. ToolsDB is still on Stretch + MariaDB 10.1 and needs to be upgraded to more modern things, which I believe currently means Bullseye and MariaDB 10.4.

ToolsDB is currently running on two dedicated localdisk hypervisors. Can we temporarily add a third one to make the migration easier? I was thinking of maybe using the replica (once recovered from T301951) to clone a new 10.4 instance without causing any service impact to the primary.

We have two upgrades that need to happen here: the hypervisors (cloudvirt1019 and 1020) need to be upgraded to Bullseye, and the VMs (clouddb100[1-4]) need to be upgraded to at least Buster and, ideally, also Bullseye.

Clouddb1001 and 1004 are on cloudvirt1019
Clouddb1002 and 1003 are on cloudvirt1020

Clouddb1001 is the (partial) toolsdb database, with (partial) replica clouddb1002
Clouddb1004 and Clouddb1003 are postgres servers and I'm not entirely sure what's on them.

Event Timeline

Does toolsdb have to be upgraded before we implement an alternative solution? For example, if toolsdb is migrated to trove databases, do we need to upgrade it beforehand?

I am not suggesting we migrate instead of upgrade, but I wanted to clarify all options we had.


My understanding is that replacing the current ToolsDB with Trove is blocked on trove-backups (aka Swift), which was delayed by 6-12 months late last year. (Feel free to correct me on that; this is all based on what I've understood from various tickets and #-cloud-admin.) In addition to that being available, there's still a non-trivial amount of engineering work to integrate everything into Striker, plus other related work to make it easy to use for tool maintainers. The migration itself will also take a while due to the number of tools, with varying levels of active maintenance, currently using ToolsDB.

So we're realistically not talking about being able to fully decommission ToolsDB for at least a year, maybe longer. I don't think we should delay migrating off Stretch/10.1 that far, although I'm usually one of the first cloud admins pushing us to upgrade things running old software :-) Curious to hear others' thoughts about that.

cloudvirt1019, 1020 and 1028 are our local-storage hypervisors. Right now 1028 is only hosting a couple of etcd nodes, so it has plenty of space for shuffling and rebuilding.

I don't have a quick answer to the trove question yet, but the above is the easy answer to taavi's original question.

Involved mysql hosts:
clouddb1001 -> master
clouddb1002 -> slave (with replication filters for Replicate_Wild_Ignore_Table: s51412\_\_data.%,s51071\_\_templatetiger\_p.%,s52721\_\_pagecount\_stats\_p.%,s51290\_\_dpl\_p.%)

If I recall correctly, those filters were applied because those tools did a massive amount of writes and would create lots of lag on the replicas. Those users were warned that their data was not replicated and told that they'd need to keep a backup themselves (or accept that their data could be lost in case of a master failure).
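For reference, a minimal sketch of how such wild-ignore filters can be checked on the replica, and what the equivalent static configuration would look like (the my.cnf section shown in the comments is an assumption, not taken from the actual puppet config):

    # Show the filters currently active on the replica (clouddb1002):
    sudo mysql -e 'SHOW SLAVE STATUS\G' | grep -i Replicate_Wild_Ignore_Table

    # Equivalent static configuration on the replica (illustrative only):
    #   [mysqld]
    #   replicate-wild-ignore-table=s51412\_\_data.%
    #   replicate-wild-ignore-table=s51071\_\_templatetiger\_p.%
    #   replicate-wild-ignore-table=s52721\_\_pagecount\_stats\_p.%
    #   replicate-wild-ignore-table=s51290\_\_dpl\_p.%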

I'm not a DBA per se, but how can I help you move this forward?

For toolsdb (clouddb1001/1002) I think the first steps would be something like this (not necessarily in order):

After that we can safely decommission clouddb1002, and after figuring out what the postgres instances are for, we can also re-image cloudvirt1020

Unrelated but worth revisiting now: why is toolsdb not in the tools project?

[0]: I really don't have a clue how to clone mariadb instances this big, ideally without read-only time on the primary; any help would be welcome

My fear is that the answer to [0] is that we can't.

I'm checking in with Brooke about whether she thinks we can move these VMs to Ceph; her answer to that will help decide whether or not we should just do an in-place upgrade for this.


We may be able to clone from the current secondary without causing read-only time on the primary. In any case, this is definitely something where feedback from the DBAs would be most welcome.
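One possible shape of that, as a sketch only: the datadir path is taken from the slow-log path mentioned later in this task, the service name and remote root rsync are assumptions, and the target is the new VM that appears later in the thread.

    # On the secondary (clouddb1002): record the primary's coordinates, then stop
    # MariaDB so the data directory is consistent for copying.
    sudo mysql -e 'STOP SLAVE; SHOW SLAVE STATUS\G' | \
        grep -E 'Relay_Master_Log_File|Exec_Master_Log_Pos'
    sudo systemctl stop mariadb

    # Copy the data directory to the new Bullseye / MariaDB 10.4 VM.
    sudo rsync -a /srv/labsdb/data/ tools-db-1.tools.eqiad1.wikimedia.cloud:/srv/labsdb/data/

    # Bring clouddb1002 back and let it catch up; the primary never went read-only.
    sudo systemctl start mariadb
    sudo mysql -e 'START SLAVE;'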

@Ladsgroup, thank you for the offer! I spoke with Manuel about this on IRC but wound up still uncertain about the path forward.

My favorite thing would be to build new, third replicas on modern platforms (as Taavi suggests), migrate the data over, and then cut the service over. My impression is that that's not realistic, but I don't entirely understand whether that's right or wrong.

If migration requires extensive user downtime then we probably want to do in-place upgrades, in which case I could use a DBA (or similar) to help me design the upgrade plan and sit with me during implementation in case something goes awry. It seems vaguely better to do the hypervisor upgrade at the same time since it will also result in downtime, but those two things aren't necessarily coupled.

Update: Brooke thinks that Ceph is worth a try, and so do I. BUT it will consume a ton of Ceph space, which I'm on the fence about. @Ladsgroup, if you have thoughts about the build-and-migrate vs the in-place pathways, that will help decide (since moving to Ceph would definitely require the build-and-migrate path).

What is left here? I thought we were done with it after the upgrade we did recently.

We upgraded the hypervisor, not the VM itself. All of my previous questions remain re: upgrading the actual DB and VM.

oh my bad. Sorry.

So having a replica on bullseye makes much more sense. You set it up, fail over to it, and then you can easily upgrade the old primary.

I understand making the new replica is hard, especially with the huge amount of writes coming in, but as long as you keep the binlog, you can replay it. The problem is that it might not be able to catch up; for that I suggest looking at large write queries in toolsdb and either asking the tool owner(s) to reduce the writes or pause them for a while. The thing is that toolsdb writes are really, really large: the flow is larger than s8's, and wikidata (s8) has 100 edits per second. I can spend some time to hunt the problematic writes.
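A small sketch of the checks that implies (nothing here is specific to the current ToolsDB configuration; binlog names and intervals are placeholders):

    # On the primary: confirm binlogs are retained long enough for the new replica
    # to be built and to catch up.
    sudo mysql -e 'SHOW VARIABLES LIKE "expire_logs_days"; SHOW BINARY LOGS;'

    # On the new replica: check whether it is actually gaining on the primary.
    watch -n 60 "sudo mysql -e 'SHOW SLAVE STATUS\G' | grep -E 'Seconds_Behind_Master|Slave_SQL_Running'"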

Sounds like we sort of agree that we should at least try building new instances instead of trying in-place upgrades. That gives us a path forward but leaves us (me?) with some questions. Notably:

  • @Andrew, what flavor should we use for the new VM? If we're going with Ceph, I'm guessing something with enough RAM/CPU and a massive Cinder volume? Otherwise we're going to need some local flavor.
  • I see Phabricator activity suggesting that MariaDB 10.6 might also be an option. Is it? Should we wait for it to be? Or is a 10.1 -> 10.6 jump too risky?

Also I'd like to explore moving ToolsDB into the tools openstack project or at least change the naming scheme to something more distinctive.

I can spend some time to hunt the problematic writes.

<3

Don't go for 10.6 yet, I am still testing it. 10.4 is what we have in production.

In T301949#7906471, @Majavah wrote:
  • @Andrew, what flavor should we use for the new VM? If we're going with Ceph, I'm guessing something with enough RAM/CPU and a massive Cinder volume? Otherwise we're going to need some local flavor.

We should definitely give Cinder/Ceph a try for the database volume. It may turn out not to be fast enough to meet our needs, but I'm really hoping it'll work fine. The current flavor used for clouddb1001 and clouddb1002 is 16 cores and 64GB of RAM, so I've created a new flavor to hold the new VMs, named 'g3.cores16.ram64.disk20.10xiops', which should support much faster disk access.
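For the record, roughly what that corresponds to on the OpenStack side. This is an illustration only: the flavor was already created by Andrew, and the IOPS limits, volume size and server name below are made-up placeholders.

    # Flavor with 16 vCPUs, 64G RAM, 20G root disk and raised disk IOPS limits.
    openstack flavor create --vcpus 16 --ram 65536 --disk 20 g3.cores16.ram64.disk20.10xiops
    openstack flavor set g3.cores16.ram64.disk20.10xiops \
        --property quota:disk_read_iops_sec=10000 \
        --property quota:disk_write_iops_sec=10000

    # The database data itself would live on a Cinder (Ceph-backed) volume.
    openstack volume create --size 4000 toolsdb-data
    openstack server add volume tools-db-1 toolsdb-data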


93% of the writes come from two tools: heritage (the majority) and yetkin. A double check of these two tools and their write patterns would be amazing.
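A rough way to reproduce this kind of per-tool attribution from the binlog (a sketch; the binlog file name is a placeholder and row-based replication is assumed):

    # Count row events per database in one binlog file; ToolsDB databases are
    # named after the tool's credential user (e.g. s51412__data), so the database
    # name maps back to a tool.
    sudo mysqlbinlog --base64-output=decode-rows /srv/labsdb/data/log.000123 \
        | grep -oP '(?<=Table_map: `)[^`]+' \
        | sort | uniq -c | sort -rn | head -20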

And the slowest write queries come from two other tools: terminator and commonsdelinquent, with write queries quite commonly taking north of 5 minutes. You can see them in /srv/labsdb/data/clouddb1001-slow.log; I enabled slow query logging for an hour or so. MediaWiki kills any write query taking more than 3 seconds. You could do a similar thing and introduce a query killer for anything taking longer than 10 seconds (the threshold for write queries should be much smaller than for read queries).
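The slow-log part below mirrors what's described above; the pt-kill line is only one possible way to implement the suggested write-query killer, not what is (or was) actually deployed, and as noted a couple of comments down, killing writes needs real care.

    # Temporarily enable the slow query log (thresholds illustrative).
    sudo mysql -e "SET GLOBAL slow_query_log=1; SET GLOBAL long_query_time=3;"
    # ...collect for a while, then turn it off again:
    sudo mysql -e "SET GLOBAL slow_query_log=0;"

    # Print (not kill) write queries running longer than 10 seconds; adding --kill
    # would actually terminate them, so see the caution about rollbacks below.
    pt-kill --busy-time 10 --match-info '(?i)^(insert|update|delete|replace)' \
        --victims all --interval 5 --print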

@Ladsgroup Thank you very much for these insights! Please feel free to share any other thoughts on how to improve the service performance or our ability to maintain it.

I suggest starting by implementing a query killer for write queries. It should be easy, as you already have a query killer IIRC. Will dig more and let you know!

Informed people, created tickets and added indexes. Fingers crossed it gets better.


I would be _very_ careful when trying to kill writes. You can have lots of unexpected surprises there with all the rollbacks and such.

Fair, I assumed tools have less complexity, like transactions and such, but yeah :/


They also are likely to have no error handling whatsoever and to just fatally crash on a failed write. The vast majority of volunteer-developed code falls under the umbrella of "quick hack that worked the day it was written". Professional system administrators who are used to working with professional software developers sometimes forget these things.

Mentioned in SAL (#wikimedia-cloud) [2022-06-28T13:03:36Z] <taavi> grant the tools project access to the g3.cores16.ram64.disk20.10xiops flavor T301949

Change 809213 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:wmcs: toolsdb: support mariadb 10.4 on bullseye

https://gerrit.wikimedia.org/r/809213

Change 809213 merged by David Caro:

[operations/puppet@production] P:wmcs: toolsdb: support mariadb 10.4 on bullseye

https://gerrit.wikimedia.org/r/809213

Mentioned in SAL (#wikimedia-cloud) [2022-06-28T16:30:53Z] <taavi> stopped mariadb on clouddb1002, starting rsync clouddb1002->tools-db-1 on a root screen session on tools-db-1 T301949
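The step that presumably follows the rsync, sketched under assumptions: the primary's address, replication user, password and binlog coordinates below are placeholders, and the real coordinates would come from clouddb1002's SHOW SLAVE STATUS taken before it was stopped.

    # On tools-db-1: let 10.4 upgrade the copied 10.1 system tables, then point
    # the new instance at the current primary and start replicating.
    sudo mysql_upgrade
    sudo mysql -e "CHANGE MASTER TO
          MASTER_HOST='<clouddb1001 address>',
          MASTER_USER='repl', MASTER_PASSWORD='********',
          MASTER_LOG_FILE='log.000123', MASTER_LOG_POS=4;
        START SLAVE;"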

tools-db-1.tools.eqiad1.wikimedia.cloud is now replicating. The only problem so far is that users and grants don't seem to have been copied over, for some reason.
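One way the accounts could be carried over afterwards (a sketch; assumes pt-show-grants from Percona Toolkit is available on the old host, and a plain mysqldump of the mysql schema would be an alternative):

    # On the old primary (clouddb1001): dump every account and its grants as SQL.
    pt-show-grants --ask-pass > /tmp/toolsdb-grants.sql

    # Copy the file over and replay it on the new instance (tools-db-1).
    sudo mysql < /tmp/toolsdb-grants.sql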