
[toolsdb] Upgrade to MariaDB 10.6
Closed, ResolvedPublic

Description

ToolsDB is currently running on MariaDB 10.4, which was maintained until 18 June 2024.

10.6 is the next LTS and is already used in some WMF production hosts, so we should upgrade to it as soon as feasible.

We should also upgrade the OS from Debian Bullseye (11) to Bookworm (12).

Upgrading might also help with the memory issues we had in T349695 and T353093, and will provide more monitoring options like memory_summary_global_by_event_name.
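For example, once on 10.6 that Performance Schema table can be queried to find the biggest memory consumers. A sketch, assuming performance_schema is enabled on the server (it is not on by default):

```shell
# Top 10 memory consumers by instrument (requires performance_schema=ON)
sudo mysql -e "
  SELECT event_name, current_number_of_bytes_used
  FROM performance_schema.memory_summary_global_by_event_name
  ORDER BY current_number_of_bytes_used DESC
  LIMIT 10;"
```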

The procedure to upgrade from 10.4 to 10.6 was described by @Marostegui in this comment:

  • Stop mariadb
  • Remove 10.4 package
  • Install 10.6 package
  • Start mariadb
  • Run mysql_upgrade
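A minimal sketch of those steps as shell commands. The package names are the wmf-mariadb ones mentioned later in this task; exact package names and binary paths may differ on the ToolsDB hosts:

```shell
# 1. Stop the running server
sudo systemctl stop mariadb
# 2-3. Swap the WMF-built packages (names assumed from later in this task)
sudo apt-get remove wmf-mariadb104
sudo apt-get install wmf-mariadb106
# 4. Start the new version on the existing data dir
sudo systemctl start mariadb
# 5. Upgrade system tables in place
sudo mysql_upgrade
```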

Full upgrade procedure for ToolsDB: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/ToolsDB#Upgrading_to_a_newer_MariaDB_version

Details

Related Changes in Gerrit:
Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
Update tools-db DNS import id | repos/cloud/cloud-vps/tofu-infra!143 | fnegri | T352206-2 | main
Move toolsdb primary host to tools-db-4 | repos/cloud/cloud-vps/tofu-infra!142 | fnegri | T352206 | main

Event Timeline


I would encourage migrating to Bookworm too :)

@Marostegui I was wondering if we could take the chance and upgrade to an even more recent version. ToolsDB could potentially run a newer version than production, WDYT?

Edit: the only other available LTS version is 10.11, so we could consider upgrading to that one instead of 10.6.

@fnegri that is really up to your team - in production, such big jumps between releases need to be carefully studied and benchmarked, which is why we took more conservative approaches. The optimizer tends to introduce quite big changes between major versions, and going from 10.4 to 10.11 is probably a no-go for us. We would need to study what the regressions might be, especially on query time and/or IO wait.

In the past we did 10.1 -> 10.4 and that took quite some time; same with 10.4 -> 10.6, where I reported a number of bugs with the optimizer.

I've spent time testing 11.0 while it was RC and GA, after talking to Monty in Finland a couple of years ago. There are massive changes in the optimizer in 11.0, and even though it looks fine, we won't go from 10.6 to 11.

In regards to your question, it depends on how much your service can afford possible regressions in the query plans or InnoDB. I'd advise you capture and replay some of the traffic against that 10.11 and test if it is acceptable.

I'd advise you capture and replay some of the traffic against that 10.11 and test if it is acceptable.

This sounds promising. Is there any tool that you've used and you would recommend to do capture/replay?

You can use a combination of:
https://docs.percona.com/percona-toolkit/pt-query-digest.html
https://docs.percona.com/percona-toolkit/pt-upgrade.html

You need to enable the slow query log. You might want to first capture only the very slow queries and see how they'd behave, then gradually lower the time threshold to capture all queries for a given timeframe, and analyze those to see if there are important regressions.
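A sketch of that capture/replay workflow with the two Percona tools linked above. The thresholds, log path, and host names here are illustrative placeholders, not the actual ToolsDB values:

```shell
# 1. Enable the slow query log on the current primary, starting with a
#    high threshold to capture only the very slow queries
sudo mysql -e "SET GLOBAL slow_query_log = ON; SET GLOBAL long_query_time = 10;"
# ...later, lower the threshold to capture (almost) all queries for a window
sudo mysql -e "SET GLOBAL long_query_time = 0.1;"

# 2. Summarize the captured workload
pt-query-digest /var/lib/mysql/slow.log > digest.txt

# 3. Replay the captured queries against both versions and compare
#    query plans, results, warnings, and timings
pt-upgrade /var/lib/mysql/slow.log h=old-host h=new-host > upgrade-report.txt
```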

The latest version of Django (5.1.1) only supports MariaDB 10.5 or higher, and ToolsDB is currently running 10.4.29, so that's a hurdle I encountered while keeping EditGroup's dependencies up to date. If others encounter the same issue, the latest version of Django that works with MariaDB 10.4.29 seems to be 5.0.8.
It would be great to upgrade :)

fnegri changed the task status from Open to In Progress.Nov 11 2024, 2:21 PM

Edit: the only other available LTS version is 10.11, so we could consider upgrading to that one instead of 10.6.

11.4 is now the newest LTS, but given we're already late on this upgrade, I will follow the easiest path and upgrade to 10.6, so we're back to a supported version. We can explore upgrading to 10.11 or 11.4 in a separate task.

My plan is:

  • create a new replica tools-db-4 (running Debian Bookworm and MariaDB 10.4) and configure it to replicate from the current primary
  • upgrade tools-db-4 to MariaDB 10.6 and leave it running for a few days
  • If there are no issues, promote tools-db-4 to become the new primary
  • Create a new replica tools-db-5 replicating from tools-db-4
  • Delete old primary (tools-db-1) and old replica (tools-db-3)
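The first step of the plan (pointing a fresh replica at an existing host using GTID-based replication) might look roughly like this. Host name, credentials, and the GTID value are placeholders; the real value of gtid_current_pos must be recorded when the snapshot is taken:

```shell
# Sketch: start replication on the new replica from a restored snapshot.
# '0-1-1' stands in for the gtid_current_pos saved at snapshot time.
sudo mysql -e "
  SET GLOBAL gtid_slave_pos = '0-1-1';
  CHANGE MASTER TO
    MASTER_HOST='tools-db-1.tools.eqiad1.wikimedia.cloud',
    MASTER_USER='repl',
    MASTER_PASSWORD='xxxxxx',
    MASTER_USE_GTID=slave_pos;
  START SLAVE;"
```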

I will document all steps in detail so that we can use this documentation to write a set of cookbooks in the future.

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-11T14:37:40Z] <fnegri@cloudcumin1001> START - Cookbook wmcs.vps.create_instance_with_prefix with prefix 'tools-db' (T352206)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-11T14:41:23Z] <fnegri@cloudcumin1001> END (PASS) - Cookbook wmcs.vps.create_instance_with_prefix (exit_code=0) with prefix 'tools-db' (T352206)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-11T14:42:48Z] <fnegri@cloudcumin1001> START - Cookbook wmcs.vps.refresh_puppet_certs on tools-db-4.tools.eqiad1.wikimedia.cloud (T352206)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-11T14:44:43Z] <fnegri@cloudcumin1001> END (FAIL) - Cookbook wmcs.vps.refresh_puppet_certs (exit_code=99) on tools-db-4.tools.eqiad1.wikimedia.cloud (T352206)

Change #1089806 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] toolsdb: apply pinning to all debian versions

https://gerrit.wikimedia.org/r/1089806

Change #1089806 merged by FNegri:

[operations/puppet@production] toolsdb: apply pinning to all debian versions

https://gerrit.wikimedia.org/r/1089806

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-11T15:58:44Z] <fnegri@cloudcumin1001> START - Cookbook wmcs.vps.refresh_puppet_certs on tools-db-4.tools.eqiad1.wikimedia.cloud (T352206)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-11T16:02:57Z] <fnegri@cloudcumin1001> END (PASS) - Cookbook wmcs.vps.refresh_puppet_certs (exit_code=0) on tools-db-4.tools.eqiad1.wikimedia.cloud (T352206)

I'm creating a new replica tools-db-4 following the procedure at https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/ToolsDB#Creating_a_new_replica_host

I already made two mistakes (both could have been prevented by a cookbook that automates creating a new replica):

  • I created a snapshot and forgot to save the value of gtid_current_pos, which is needed to start replication on the new host. That meant I had to throw away that snapshot and take a new one.
  • I didn't realize that the wmf-mariadb104 package is only available for Bullseye, and the new server I created is running Bookworm, where only wmf-mariadb106 is available. This is fine, as I want to upgrade to that version anyway, but the upgrade doesn't work if the data dir is dirty, i.e. it comes from a snapshot taken with BACKUP STAGE BLOCK_COMMIT; but without running systemctl stop mariadb. To recover such a snapshot, the same MariaDB version (10.4) must be used, and that version cannot be installed on the new host.

I took a third snapshot of tools-db-3, this time after doing systemctl stop mariadb, so the data dir is not "dirty" and MariaDB 10.6 can use it. I'm currently copying this third snapshot to the new host (the copy takes a few hours).

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-15T13:49:52Z] <fnegri@cloudcumin1001> START - Cookbook wmcs.vps.create_instance_with_prefix with prefix 'tools-db' (T352206)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-15T13:50:17Z] <fnegri@cloudcumin1001> END (FAIL) - Cookbook wmcs.vps.create_instance_with_prefix (exit_code=99) with prefix 'tools-db' (T352206)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-15T13:57:31Z] <fnegri@cloudcumin1001> START - Cookbook wmcs.openstack.quota_increase (T352206)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-15T13:57:39Z] <fnegri@cloudcumin1001> END (PASS) - Cookbook wmcs.openstack.quota_increase (exit_code=0) (T352206)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-15T13:57:44Z] <fnegri@cloudcumin1001> START - Cookbook wmcs.vps.create_instance_with_prefix with prefix 'tools-db' (T352206)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-15T14:03:33Z] <fnegri@cloudcumin1001> END (PASS) - Cookbook wmcs.vps.create_instance_with_prefix (exit_code=0) with prefix 'tools-db' (T352206)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-15T14:03:56Z] <fnegri@cloudcumin1001> START - Cookbook wmcs.vps.refresh_puppet_certs on tools-db-5.tools.eqiad1.wikimedia.cloud (T352206)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-15T14:05:13Z] <fnegri@cloudcumin1001> END (PASS) - Cookbook wmcs.vps.refresh_puppet_certs (exit_code=0) on tools-db-5.tools.eqiad1.wikimedia.cloud (T352206)

Current situation:

  • tools-db-1: primary, running MariaDB 10.4
  • tools-db-3: replica, running MariaDB 10.4
  • tools-db-4: new server running MariaDB 10.6 and replicating from tools-db-3
  • tools-db-5: new server, currently copying the data from the snapshot

Current situation:

  • tools-db-1: primary, running MariaDB 10.4
  • tools-db-3: replica, running MariaDB 10.4
  • tools-db-4: new server running MariaDB 10.6 and replicating from tools-db-3
  • tools-db-5: new server, running MariaDB 10.6 and replicating from tools-db-4

I plan to fail over the primary from tools-db-1 to tools-db-4 next Monday, 2024-11-25.

Edit: the only other available LTS version is 10.11, so we could consider upgrading to that one instead of 10.6.

11.4 is now the newest LTS, but given we're already late on this upgrade, I will follow the easiest path and upgrade to 10.6, so we're back to a supported version. We can explore upgrading to 10.11 or 11.4 in a separate task.

I would recommend you go to 10.6 and later (a lot later) to 10.11 (which is what we are starting to test in production). 11.x has lots of changes to the optimizer, so I'd suggest waiting a bit until we've done some good testing.

The upgrade is complete. tools-db-4 is the new primary and tools-db-5 is the new replica. I shut off the old hosts tools-db-1 and tools-db-3.

There was an unforeseen issue with DNS that caused a longer downtime than expected. ToolsDB was unreachable for about 30 minutes between approximately 13:30 UTC and 14:00 UTC.

I will write a more detailed report tomorrow before resolving this task.

Below is a summary of the actions performed during the upgrade on 2024-11-25 (based on the procedure on Wikitech).

A screencast recording is available (for WMF Staff only) at https://drive.google.com/drive/folders/1EqtpOZzHIv57KysIkSpyzT_jRGtxeXsn

  • FLUSH TABLES WITH READ LOCK; on tools-db-1
  • Verified that all 4 hosts (tools-db-1, tools-db-3, tools-db-4, tools-db-5) were in sync with GTID 2886731301-2886731301-9248262892
  • Merged and applied the DNS Tofu change (merge-requests/142)
  • STOP SLAVE; on tools-db-3 and tools-db-4, leaving it running on tools-db-5 (as tools-db-5 is already in the desired final state, replicating from tools-db-4)
  • systemctl stop mariadb on tools-db-1 to close all active connections and force all clients to reconnect to the new host
  • Here we encountered the first issue: DNS queries for tools.db.svc.wikimedia.cloud were not resolving ("host not found")
    • Verified that openstack recordset list was showing the right value: 172.16.0.168. Something was wrong in Designate, maybe because the MR above changed the DNS record type from CNAME to A.
    • A page was triggered because ToolsDB was down.
    • @dcaro tried deleting and recreating the DNS record manually from Horizon
    • Updating the IRC topic with !status failed for unknown reasons (permissions?)
    • Restarted Designate with cookbook wmcs.openstack.restart_openstack --designate --cluster-name eqiad1
    • Verified that from my laptop the DNS resolves correctly to the new IP, but from login.toolforge.org it does not resolve
    • @aborrero found this error message in pdns logs: got a CNAME referral (from cache) that causes a loop
    • @dcaro deleted the legacy record `tools.db.svc.wmflabs`
    • DNS queries then resolved with the old value (the one before the change in the Tofu MR)
    • @aborrero restarted the DNS recursor. This fixed the DNS issue and the domains started resolving correctly.
  • RESET SLAVE ALL; on tools-db-4 (the new primary)
  • Second issue: checking GTID showed it was now at 2886731301-2886731301-9248273355 on all 4 hosts, which was different from the value recorded before.
    • Restarted mariadb on tools-db-1 to check what was going on.
    • The GTID started slowly increasing again, even if SELECT @@read_only; was showing 1.
    • SHOW PROCESSLIST showed the only process was using the heartbeat database.
    • For some reason pt-heartbeat was able to write to the database even though read_only was set to 1.
    • sudo disable-puppet "upgrading mariadb" and systemctl stop pt-heartbeat-wikimedia
    • GTID in tools-db-1 is now static at 2886731301-2886731301-9248274392
    • I restarted replication on all hosts to bring them all back in sync
      • In tools-db-3, START SLAVE; then STOP SLAVE; after a few seconds
      • In tools-db-4 (where I had previously run RESET SLAVE ALL;) I had to use this instead: CHANGE MASTER TO MASTER_HOST='tools-db-3.tools.eqiad1.wikimedia.cloud', MASTER_USER='repl', MASTER_PASSWORD='xxxxxx', MASTER_USE_GTID=slave_pos;
      • tools-db-5 was already replicating (I never stopped replication in that host)
      • Verified all 4 hosts were in sync with GTID 2886731301-2886731301-9248274392
      • In tools-db-4, STOP SLAVE; and RESET SLAVE ALL;
      • systemctl stop mariadb on tools-db-1
  • In tools-db-4, finally SET GLOBAL read_only=OFF;
  • Transactions started growing on tools-db-4, with a new GTID prefix: 2886729896-2886729896-116, 2886731301-2886731301-9248274392
  • Changed the static IP 185.15.56.15 to point to tools-db-4
  • Updated the hiera key profile::wmcs::services::toolsdb::primary_server: (used by the pt-heartbeat process)
  • run-puppet-agent in tools-db-4 and checked the heartbeat table was getting updated
    • DELETE FROM heartbeat WHERE server_id=XXXX to get rid of the old line that will not be updated anymore
  • Verified there were no firing alerts about ToolsDB
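Several steps above hinge on verifying that all hosts report the same gtid_current_pos. A small helper like this could automate that comparison; it is a hypothetical sketch, not part of the actual procedure, and in real use each value would come from `mysql -h <host> -Nse "SELECT @@gtid_current_pos"`:

```shell
# Print "in sync" when every host reports an identical GTID position.
gtid_in_sync() {
  # args: one gtid_current_pos value per host
  if [ "$(printf '%s\n' "$@" | sort -u | wc -l)" -eq 1 ]; then
    echo "in sync"
  else
    echo "OUT OF SYNC"
  fi
}

# Example with the position recorded during this failover
gtid_in_sync 2886731301-2886731301-9248274392 \
             2886731301-2886731301-9248274392 \
             2886731301-2886731301-9248274392 \
             2886731301-2886731301-9248274392
# prints: in sync
```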

Things to change in the upgrade procedure:

  • mention to stop the old primary to close all active connections
  • mention the possible DNS issue and how to fix it
  • add a silence at the start of the procedure to prevent pages, and remove it at the end
  • update the topic of #wikimedia-cloud before starting the procedure, and again at the end
  • mention the floating IP that must be updated (used by maintain-db-users)
  • mention that the heartbeat table will contain a new line, and we can delete the old one
  • mention that pt-heartbeat can write even when read-only is set to true
  • I only realized this now, while writing these notes: always shut down the old primary with SHUTDOWN; as recommended by the official docs. Otherwise, quitting the session where you typed FLUSH TABLES WITH READ LOCK; in order to run systemctl stop mariadb will release the read lock and commit pending transactions (tools will have many pending transactions after the read lock is issued). This actually happened during this upgrade, but luckily I then restarted replication and no transactions were lost.
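The last point above can be sketched as a single mysql invocation: because both statements run in the same session, the read lock is still held when SHUTDOWN fires, so no pending transactions slip in between. This is a sketch of the idea, not the exact command used during the upgrade:

```shell
# Lock out writes and stop the server from within the same session,
# so the read lock is never released before shutdown.
sudo mysql -e "FLUSH TABLES WITH READ LOCK; SHUTDOWN;"
```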

Mentioned in SAL (#wikimedia-cloud-feed) [2024-12-06T17:23:12Z] <fnegri@cloudcumin1001> START - Cookbook wmcs.vps.remove_instance for instance tools-db-3 (T352206)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-12-06T17:24:08Z] <fnegri@cloudcumin1001> END (PASS) - Cookbook wmcs.vps.remove_instance (exit_code=0) for instance tools-db-3 (T352206)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-12-06T17:25:19Z] <fnegri@cloudcumin1001> START - Cookbook wmcs.vps.remove_instance for instance tools-db-1 (T352206)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-12-06T17:26:12Z] <fnegri@cloudcumin1001> END (PASS) - Cookbook wmcs.vps.remove_instance (exit_code=0) for instance tools-db-1 (T352206)

fnegri moved this task from In progress to Done on the cloud-services-team (FY2024/2025-Q1-Q2) board.
fnegri moved this task from In Progress to Done on the Toolforge (Toolforge iteration 16) board.