
[toolsdb] Upgrade from 10.6.20 to 10.6.21
Closed, Resolved · Public

Description

The APT package was already upgraded to 10.6.21 by unattended-upgrades, but the running process is still on 10.6.20. Unattended-upgrades should not have upgraded this package, but there was an issue with the APT config, which was fixed in T385885: [toolsdb] Remove apt pinning and upgrade to latest version.

We need to run systemctl restart mariadb on both tools-db-4 and tools-db-5 for them to pick up the new binary.

This hasn't caused problems for 2 months, so I'm gonna schedule a proper maintenance window for this restart. I will restart both servers on Monday, April 28th at 13:00 UTC.

Side note: the reason we need a maintenance window is that restarting the primary can cause a few minutes of downtime, and failing over from primary to replica also creates some downtime because it requires a DNS change. Yes, we should think of a way to have zero-downtime failovers, maybe using keepalived.

Details of version mismatch
root@tools-db-5:~# zcat /var/log/unattended-upgrades/unattended-upgrades-dpkg.log.1.gz |head -4
Log started: 2025-03-01  06:05:52
(Reading database ... 47892 files and directories currently installed.)
Preparing to unpack .../wmf-mariadb106_10.6.21+deb12u1_amd64.deb ...
Unpacking wmf-mariadb106 (10.6.21+deb12u1) over (10.6.20+deb12u1) ...
root@tools-db-4:~# mariadb -e "SELECT VERSION()"
+---------------------+
| VERSION()           |
+---------------------+
| 10.6.20-MariaDB-log |
+---------------------+
root@tools-db-5:~# mariadb -e "SELECT VERSION()"
+---------------------+
| VERSION()           |
+---------------------+
| 10.6.20-MariaDB-log |
+---------------------+
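The mismatch above can also be checked programmatically. The following is a minimal Python sketch, not part of the actual ToolsDB tooling: it assumes the dpkg log text and the output of SELECT VERSION() have already been collected as strings, and the function names (upstream_version, installed_version, needs_restart) are made up for illustration.

```python
import re
from typing import Optional

# Matches dpkg log lines such as:
#   Unpacking wmf-mariadb106 (10.6.21+deb12u1) over (10.6.20+deb12u1) ...
UNPACK_RE = re.compile(r"^Unpacking (\S+) \(([^)]+)\) over \(([^)]+)\)")


def upstream_version(version: str) -> str:
    """Reduce '10.6.21+deb12u1' or '10.6.20-MariaDB-log' to '10.6.21' / '10.6.20'."""
    # Drop a Debian epoch prefix ("1:...") if present, then keep the dotted digits.
    without_epoch = version.split(":", 1)[-1]
    match = re.match(r"\d+(?:\.\d+)*", without_epoch)
    if match is None:
        raise ValueError(f"no numeric version found in {version!r}")
    return match.group(0)


def installed_version(dpkg_log: str, package: str) -> Optional[str]:
    """Return the version most recently unpacked for `package`, or None."""
    latest = None
    for line in dpkg_log.splitlines():
        found = UNPACK_RE.match(line.strip())
        if found and found.group(1) == package:
            latest = found.group(2)
    return latest


def needs_restart(dpkg_log: str, package: str, running_version: str) -> bool:
    """True if the unpacked package version differs from the running server."""
    new = installed_version(dpkg_log, package)
    if new is None:
        return False  # the log does not mention this package
    return upstream_version(new) != upstream_version(running_version)


if __name__ == "__main__":
    log = (
        "Log started: 2025-03-01  06:05:52\n"
        "Unpacking wmf-mariadb106 (10.6.21+deb12u1) over (10.6.20+deb12u1) ...\n"
    )
    print(needs_restart(log, "wmf-mariadb106", "10.6.20-MariaDB-log"))  # True
```

With the values shown in this task, the check reports that a restart is pending on both hosts.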

Details

Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
toolsdb: Failover primary | repos/cloud/cloud-vps/tofu-infra!220 | fnegri | T392596 | main

Event Timeline

Restricted Application added a subscriber: Aklapper.
fnegri changed the task status from Open to In Progress. (Apr 28 2025, 12:39 PM)

Mentioned in SAL (#wikimedia-cloud) [2025-04-28T13:06:52Z] <dhinus> tools-db-5: systemctl stop mariadb && systemctl start mariadb T392596

Mentioned in SAL (#wikimedia-cloud) [2025-04-28T13:07:16Z] <dhinus> tools-db-4: systemctl stop mariadb && systemctl start mariadb T392596

Restarting mariadb on tools-db-5 was very fast (just a few seconds).

On tools-db-4, the shutdown took about 50 minutes. It's not clear why, as the logs show that the previous shutdowns took just 4 minutes (on 2025-02-06) and 8 minutes (on 2025-02-24).

Full mariadb logs during this shutdown:

Apr 28 13:07:22 tools-db-4 systemd[1]: Stopping mariadb.service - mariadb database server...
Apr 28 13:07:22 tools-db-4 mysqld[992820]: 2025-04-28 13:07:22 0 [Note] /opt/wmf-mariadb106/bin/mysqld (initiated by: unknown): Normal shutdown
Apr 28 13:07:22 tools-db-4 mysqld[992820]: 2025-04-28 13:07:22 0 [Note] Event Scheduler: Killing the scheduler thread, thread id 2
Apr 28 13:07:22 tools-db-4 mysqld[992820]: 2025-04-28 13:07:22 0 [Note] Event Scheduler: Waiting for the scheduler thread to reply
Apr 28 13:07:22 tools-db-4 mysqld[992820]: 2025-04-28 13:07:22 0 [Note] Event Scheduler: Stopped
Apr 28 13:07:32 tools-db-4 mysqld[992820]: 2025-04-28 13:07:32 0 [Note] InnoDB: FTS optimize thread exiting.
Apr 28 13:57:33 tools-db-4 mysqld[992820]: 2025-04-28 13:57:33 0 [Note] InnoDB: Starting shutdown...
Apr 28 13:57:33 tools-db-4 mysqld[992820]: 2025-04-28 13:57:33 0 [Note] InnoDB: Dumping buffer pool(s) to /srv/labsdb/data/ib_buffer_pool
Apr 28 13:57:33 tools-db-4 mysqld[992820]: 2025-04-28 13:57:33 0 [Note] InnoDB: Restricted to 502944 pages due to innodb_buf_pool_dump_pct=25
Apr 28 13:57:33 tools-db-4 mysqld[992820]: 2025-04-28 13:57:33 0 [Note] InnoDB: Buffer pool(s) dump completed at 250428 13:57:33
Apr 28 13:57:38 tools-db-4 mysqld[992820]: 2025-04-28 13:57:38 0 [Note] InnoDB: Removed temporary tablespace data file: "./ibtmp1"
Apr 28 13:57:38 tools-db-4 mysqld[992820]: 2025-04-28 13:57:38 0 [Note] InnoDB: Shutdown completed; log sequence number 113033160448471; transaction id 149012358807
Apr 28 13:57:39 tools-db-4 mysqld[992820]: 2025-04-28 13:57:39 0 [Note] /opt/wmf-mariadb106/bin/mysqld: Shutdown complete
Apr 28 13:57:39 tools-db-4 mysqld[992820]: [long run of "%" characters omitted]
Apr 28 13:57:39 tools-db-4 systemd[1]: mariadb.service: Deactivated successfully.
Apr 28 13:57:39 tools-db-4 systemd[1]: Stopped mariadb.service - mariadb database server.
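To see where the time went in a log like the one above, one option is to look for the largest pause between consecutive timestamped lines. This is a rough Python sketch under the assumption that each mysqld line carries a "YYYY-MM-DD HH:MM:SS" timestamp, as in this log; the function name longest_gap is hypothetical, and lines without such a timestamp (the systemd ones) are simply skipped.

```python
import re
from datetime import datetime

# The timestamp mysqld itself writes, e.g. "2025-04-28 13:07:32".
TS_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\b")


def longest_gap(lines):
    """Return (gap_seconds, line_before, line_after) for the largest pause, or None."""
    previous = None  # (timestamp, line) of the last timestamped line seen
    best = None
    for line in lines:
        found = TS_RE.search(line)
        if found is None:
            continue  # no mysqld-style timestamp on this line
        ts = datetime.strptime(found.group(1), "%Y-%m-%d %H:%M:%S")
        if previous is not None:
            gap = (ts - previous[0]).total_seconds()
            if best is None or gap > best[0]:
                best = (gap, previous[1], line)
        previous = (ts, line)
    return best


if __name__ == "__main__":
    sample = [
        "Apr 28 13:07:22 tools-db-4 mysqld[992820]: 2025-04-28 13:07:22 0 [Note] Event Scheduler: Stopped",
        "Apr 28 13:07:32 tools-db-4 mysqld[992820]: 2025-04-28 13:07:32 0 [Note] InnoDB: FTS optimize thread exiting.",
        "Apr 28 13:57:33 tools-db-4 mysqld[992820]: 2025-04-28 13:57:33 0 [Note] InnoDB: Starting shutdown...",
    ]
    gap, before, after = longest_gap(sample)
    print(gap)  # 3001.0
```

On this log, almost all of the roughly 50 minutes falls between "FTS optimize thread exiting" (13:07:32) and "InnoDB: Starting shutdown..." (13:57:33).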

While the shutdown was in progress, there was constant write activity on disk (hovering between 10 and 20 Mbps). Some MariaDB threads occasionally entered the D state (seen in htop), but no thread was stuck in D state for more than a few seconds. CPU usage was about 10%, so the disk appeared to be the only bottleneck.
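The D-state observation was made by eye in htop. For a future slow shutdown, the same information could be sampled from /proc. The sketch below is only an illustration, not something that was run during this maintenance: it assumes a Linux host, and the helper names parse_stat and threads_in_d_state are hypothetical.

```python
from pathlib import Path


def parse_stat(stat_line: str):
    """Return (thread_id, name, state) from one /proc/<pid>/task/<tid>/stat line."""
    # The name is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    thread_id = int(stat_line[:open_paren].strip())
    name = stat_line[open_paren + 1:close_paren]
    state = stat_line[close_paren + 1:].split()[0]
    return thread_id, name, state


def threads_in_d_state(pid: int):
    """List (thread_id, name) for threads of `pid` in uninterruptible sleep (D)."""
    blocked = []
    for stat_file in Path(f"/proc/{pid}/task").glob("*/stat"):
        try:
            thread_id, name, state = parse_stat(stat_file.read_text())
        except (OSError, ValueError, IndexError):
            continue  # the thread exited or the line was unreadable
        if state == "D":
            blocked.append((thread_id, name))
    return blocked
```

Calling threads_in_d_state with the mysqld PID every second or so and logging the result would show whether any thread stays blocked on I/O for long, which htop only shows momentarily.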

This Grafana chart may be just a Prometheus glitch, but it could also indicate a problem writing to Ceph:

Screenshot 2025-04-28 at 16.52.13.png (544×1 px, 84 KB)

fnegri moved this task from In progress to Done on the cloud-services-team (FY2024/2025-Q3-Q4) board.

To track the issue with slow shutdowns, I created T392828: [toolsdb] MariaDB sometimes takes very long to shut down.

I also changed the upgrade procedure in Wikitech to mention that we should always fail over to the replica, to work around the risk of slow shutdowns.