Page MenuHomePhabricator

[toolsdb] set up tools-db-1 to replicate from clouddb1001
Closed, ResolvedPublic

Assigned To
Authored By
Feb 17 2023, 6:10 PM
Referenced Files
F36932354: Screenshot 2023-03-29 at 19.04.48.png
Mar 29 2023, 5:05 PM
F36899182: Screenshot 2023-03-09 at 14.10.51.png
Mar 9 2023, 12:11 PM
F36898020: mysql_upgrade.log
Mar 8 2023, 2:51 PM
F36898019: mariadb_upgrade.log
Mar 8 2023, 2:51 PM
F36886904: Screenshot 2023-03-01 at 16.00.36.png
Mar 1 2023, 3:02 PM
F36886902: Screenshot 2023-03-01 at 15.58.20.png
Mar 1 2023, 3:02 PM
F36877278: Screenshot 2023-02-28 at 18.41.33.png
Feb 28 2023, 5:42 PM
F36877271: Screenshot 2023-02-28 at 18.39.26.png
Feb 28 2023, 5:40 PM


As detailed in T301949, tools-db-1 is a new VM with Ceph storage that will become the new primary ToolsDB database.

This task is to set up tools-db-1 as a MariaDB replica that will replicate from the current primary clouddb1001. I already did that once (T301949#8554308) but after the outage in T329535 the replica is no longer in sync and needs to be recreated from scratch.

The required steps are detailed below, based on the experience of my previous attempt. EDIT: Reusing the same steps from T301949#8554308 did not work this time. I have updated the strategy, and the steps below, to use a libvirt snapshot instead of an LVM snapshot. See the comments below for more details.

Step 2 will set MariaDB in read-only mode, and step 3 will stop it completely, so remember to write a message in IRC (#wikimedia-cloud) and send an email to to inform people of the brief ToolsDB outage (expected to last about 2 minutes in total).

  1. In the Horizon web interface (project: clouddb-services), create a new 4TB Cinder volume (type: high-iops) and attach it to clouddb1001. Create a partition and an ext4 filesystem.
fdisk /dev/vdb # use "g" to create the partition table, then "n" to create a single partition
mkfs.ext4 /dev/vdb1
  1. Flush data to disk and take note of the current GTID. If the "FLUSH TABLES" command hangs for more than a few seconds, open a separate mysql session, and manually kill (using KILL {id}) any running query (find the queries to kill using SELECT * FROM INFORMATION_SCHEMA.PROCESSLIST WHERE command != 'Sleep' AND state not like 'Waiting%';).
SELECT @@global.gtid_binlog_pos; #store the result somewhere safe, it will be needed in step 9
  1. Stop MariaDB on clouddb1001 to ensure the snapshot we are going to take in the next step is "clean". I tried taking a snapshot without stopping MariaDB and that meant I could not load it cleanly on a new instance running a newer version of MariaDB (see T301949#8565151).
systemctl stop mariadb
  1. From the hypervisor cloudvirt1019, create a libvirt snapshot of the virtual disk attached to clouddb1001 (instance name i-000045d4) and mount it to /mnt-snapshot.
virsh domblklist i-000045d4 #take note of the virtual disk filename
virsh snapshot-create-as --disk-only --domain  i-000045d4
mkdir /mnt-snapshot
apt install libguestfs-tools
guestmount -a {disk_filename} -i --ro /mnt-snapshot
  1. Restart mariadb on clouddb1001, and enable write mode (this instance of mariadb is configured to start in read-only mode for extra safety).
systemctl start mariadb

systemctl start prometheus-mysqld-exporter #for some reason this must be restarted manually

SET GLOBAL read_only = 0;
  1. Detach the destination Cinder volume (tools-db-1) from clouddb1001 and attach it directly to the hypervisor using rbd map, then mount it to /mnt-cinder.
rbd map eqiad1-cinder/volume-9583ba88-b836-4233-8e51-6213d403f9c7
mkdir /mnt-cinder
mount /dev/rbd0p1 /mnt-cinder/
  1. Copy the snapshot to the attached Cinder volume.
# make sure to run this in a screen/tmux session as it will take a few hours
rsync -a --info=progress2 /mnt-snapshot/srv/labsdb/ /mnt-cinder/
  1. Unmount both the snapshot and the Cinder volume.
guestunmount /mnt-snapshot
umount /mnt-cinder
rbd unmap /dev/rbd0
  1. From a shell in a cloudcontrol* host, transfer the volume to the tools project.
export OS_PROJECT_ID=clouddb-services
openstack volume list
openstack volume transfer request create {volume-id}
export OS_PROJECT_ID=tools
openstack volume transfer request accept --auth-key {auth-key} {transfer-id}
  1. From the Horizon web interface (project: tools), create a Cinder volume snapshot (for extra safety in case something goes wrong in the following steps) and attach the volume to tools-db-1. Then from a shell in tools-db-1 make sure that MariaDB is not running, mount the volume, and start MariaDB.
systemctl stop mariadb
mount /dev/sdb1 /srv/labsdb
chown -R mysql:mysql /srv/labsdb/
systemctl start mariadb
  1. Run mysql_upgrade to upgrade the data format from MariaDB 10.1 to 10.4
  1. Check with journalctl -u mariadb -f that there is no error. Then start the replication, excluding two databases: linkwatcher (that will be migrated separately in T328691) and mw (that will be migrated separately in T328693).
SET GLOBAL replicate_wild_ignore_table="s51230__linkwatcher.%,s54518__mw.%";

SET GLOBAL gtid_slave_pos = "{the value from step 2}"

SET GLOBAL innodb_flush_log_at_trx_commit=2;

 MASTER_PASSWORD='{the password for the repl user},

  1. Using SHOW SLAVE STATUS\G or the ToolsDB Grafana dashboard, check that the replication lag (Seconds_Behind_Master) decreases to zero.

Event Timeline

fnegri updated the task description. (Show Details)

Mentioned in SAL (#wikimedia-cloud) [2023-02-20T11:16:29Z] <dhinus> stopping mariadb on clouddb1001 for about 2 minutes (T329970)

The first attempt failed because the LVM snapshot space filled up in only 45 minutes (much faster than expected) when only 157GB had been copied from the LVM snapshot to the attached Cinder volume.

This was probably caused by an increase in write activity to the database, just around the time when I was copying data from the snapshot to the Cinder volume.

Screenshot 2023-02-20 at 16.59.09.png (1×1 px, 204 KB)

I will retry the same procedure in one hour, at 17:00 UTC. This will cause another short downtime of ToolsDB (expected: 5 mins).

Mentioned in SAL (#wikimedia-cloud) [2023-02-20T17:07:20Z] <dhinus> stopping mariadb on clouddb1001 for about 5 minutes (T329970)

The second attempt failed as well, with 858G copied (out of 1.6T) in 1 hour and 45 minutes, before the snapshot space filled up.

fnegri changed the task status from Open to In Progress.Feb 21 2023, 3:49 PM
fnegri triaged this task as High priority.
fnegri moved this task from Backlog to In progress on the cloud-services-team (FY2022/2023-Q3) board.

@dcaro suggested an interesting alternative strategy: instead of creating an LVM snapshot, create a libvirt snapshot (specifically an external disk snapshot).

This is the procedure that I'm thinking about, to be performed in cloudvirt1019:

  • stop mariadb
  • virsh snapshot-create-as --disk-only --domain i-xxxx (this will create a new QCOW2 file in /var/lib/nova/instances/)
  • apt install guestmount
  • guestmount -a {path_to_old_disk_file} -i --ro /mnt
  • restart mariadb

If this works as I hope, we'll have a mountpoint to the disk file that was "frozen" when the snapshot was created. All new changes are saved by libvirt to a new disk layer, and the old one should not change anymore so it should be possible to copy all the MariaDB data files out and into a Cinder volume (to be resolved: how do I mount a Cinder volume in a cloudvirt*?)

I am testing this procedure in cloudvirt1020, first on toolsbeta-test-k8s-etcd-19 and if I'm successful there, on the replica instance clouddb1002. If both tests are successful, I will do the same on the primary db.

Note: apt install guestmount worked but returned some unexpected errors, not sure if it's something we should worry about. I'm also not sure why it did generate a new initrd file:

root@cloudvirt1020:~# apt install libguestfs-tools

su: warning: cannot change directory to /nonexistent: No such file or directory
Setting up mdadm (4.1-11) ...
Generating mdadm.conf... done.
update-initramfs: deferring update (trigger activated)

update-initramfs: Generating /boot/initrd.img-5.10.0-21-amd64
W: Possible missing firmware /lib/firmware/bnx2x/bnx2x-e2- for module bnx2x
W: Possible missing firmware /lib/firmware/bnx2x/bnx2x-e1h- for module bnx2x
W: Possible missing firmware /lib/firmware/bnx2x/bnx2x-e1- for module bnx2x
W: Possible missing firmware /lib/firmware/tigon/tg3_tso5.bin for module tg3
W: Possible missing firmware /lib/firmware/tigon/tg3_tso.bin for module tg3
W: Possible missing firmware /lib/firmware/tigon/tg3.bin for module tg3

Tested on toolsbeta-test-k8s-etcd-19 and it worked:

root@cloudvirt1020:~# virsh domblklist     i-00038d77
 Target   Source
 sda      /var/lib/nova/instances/7e53b91f-d40b-45c1-b3dd-fcb66cfaee75/disk

root@cloudvirt1020:~# virsh snapshot-create-as --disk-only --domain  i-00038d77
Domain snapshot 1677097361 created
root@cloudvirt1020:~# virsh snapshot-list --domain  i-00038d77
 Name         Creation Time               State
 1677097361   2023-02-22 20:22:41 +0000   disk-snapshot

root@cloudvirt1020:~# virsh domblklist     i-00038d77
 Target   Source
 sda      /var/lib/nova/instances/7e53b91f-d40b-45c1-b3dd-fcb66cfaee75/disk.1677097361

root@cloudvirt1020:~# guestmount -a /var/lib/nova/instances/7e53b91f-d40b-45c1-b3dd-fcb66cfaee75/disk -i --ro /mnt

root@cloudvirt1020:~# ls /mnt/
bin   dev  home        initrd.img.old  lib32  libx32      media  opt   root  sbin  sys  usr  vmlinuz
boot  etc  initrd.img  lib             lib64  lost+found  mnt    proc  run   srv   tmp  var  vmlinuz.old

Doing the same on clouddb1002 threw an error during guestmount:

root@cloudvirt1020:~# virsh domblklist     i-00004996
 Target   Source
 vda      /var/lib/nova/instances/7498566a-c160-4d7d-90eb-03470e5d80f3/disk

root@cloudvirt1020:~# virsh snapshot-create-as --disk-only --domain     i-00004996
Domain snapshot 1677097630 created
root@cloudvirt1020:~# virsh domblklist     i-00004996
 Target   Source
 vda      /var/lib/nova/instances/7498566a-c160-4d7d-90eb-03470e5d80f3/disk.1677097630

root@cloudvirt1020:~# ls -lh  /var/lib/nova/instances/7498566a-c160-4d7d-90eb-03470e5d80f3/
total 3.3T
-rw------- 1 nova         nova          38K Dec  7 22:03 console.log
-rw-r--r-- 1 nova         nova         3.3T Feb 22 20:27 disk
-rw------- 1 libvirt-qemu libvirt-qemu  71M Feb 22 20:27 disk.1677097630
-rw-r--r-- 1 nova         nova           79 Feb 21  2019
-rw-r--r-- 1 nova         nova         2.7K Feb 21  2019 libvirt.xml

root@cloudvirt1020:~# guestmount -a /var/lib/nova/instances/7498566a-c160-4d7d-90eb-03470e5d80f3 -i --ro /mnt
libguestfs: error: qemu-img: /tmp/libguestfsk1JZCt/overlay1.qcow2: qemu-img exited with error status 1.
To see full error messages you may need to enable debugging.
and run the command again.  For further information, read:
You can also run 'libguestfs-test-tool' and post the *complete* output
into a bug report or message to the libguestfs mailing list.

That was a stupid typo on my side (the guestmount command was missing the final /disk). Adding that, it worked, and I can access the database files:

root@cloudvirt1020:~# guestmount -a /var/lib/nova/instances/7498566a-c160-4d7d-90eb-03470e5d80f3/disk -i --ro /mnt
root@cloudvirt1020:~# ls /mnt
bin  boot  dev  etc  home  initrd.img  initrd.img.old  lib  lib64  lost+found  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var  vmlinuz  vmlinuz.old
root@cloudvirt1020:~# ls /mnt/srv/labsdb/
backup1         backup_test_T301949  data                    s51541_sulwatcher__logging__T191001.sql.gz  s53220__quickstatements_p  t_s_all_dabs.frm
backup_T326261  binlogs              s51412__data_ibds_back  s52004__hocr.sql                            tmp

Now I need to find out how/if I can mount a Cinder volume in cloudvirt1020, so that I can test copying those files out of the "guestmounted" snapshot.

With some great help from @dcaro and @aborrero I managed to mount the volume and start a copy from the libvirt snapshot to the "tools-db-1" Cinder volume:

root@cloudvirt1020:~# rbd map eqiad1-cinder/volume-9583ba88-b836-4233-8e51-6213d403f9c7

root@cloudvirt1020:~# fdisk -l /dev/rbd0
Disk /dev/rbd0: 3.91 TiB, 4294967296000 bytes, 8388608000 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 65536 bytes / 65536 bytes
Disklabel type: gpt
Disk identifier: 40AC4E59-FA94-43DE-96C3-63734D1A93F4

Device      Start        End    Sectors  Size Type
/dev/rbd0p1  2048 8388607966 8388605919  3.9T Linux filesystem

root@cloudvirt1020:~# mkdir /mnt-cinder
root@cloudvirt1020:~# mount /dev/rbd0p1 /mnt-cinder/

Note that for rbd map to work, I had to manually create /etc/ceph/ceph.client.admin.keyring and insert the content I found in the same file in cloudvirt1021. I checked with Cumin and for some reason that file is only present in 6 out of 50 cloudvirts:

===== NODE GROUP =====
(3) cloudvirt[2001-2003]-dev.codfw.wmnet
----- OUTPUT of 'ls /etc/ceph/ceph.client*' -----
===== NODE GROUP =====
(6) cloudvirt[1021-1022,1024-1027].eqiad.wmnet
----- OUTPUT of 'ls /etc/ceph/ceph.client*' -----
===== NODE GROUP =====
(41) cloudvirt[1017,1019-1020,1023,1028-1061].eqiad.wmnet,cloudvirt-wdqs[1001-1003].eqiad.wmnet
----- OUTPUT of 'ls /etc/ceph/ceph.client*' -----

The experiment with clouddb1002 worked: I re-attached the volume to tools-db-1, ran mysql_upgrade, and then started the replication:

MariaDB [(none)]> SET GLOBAL innodb_flush_log_at_trx_commit=2;
MariaDB [(none)]> RESET SLAVE;
MariaDB [(none)]> START SLAVE;

I will now do the same but taking a snapshot of the primary clouddb1001, which is running in cloudvirt1019.

fnegri updated the task description. (Show Details)

Replication is slow because the write activity on the primary has been higher than usual for the last couple of days.

Replication lag on clouddb1002 and tools-db-1:

Screenshot 2023-02-28 at 18.41.33.png (1×2 px, 203 KB)

Write activity on the primary:

Screenshot 2023-02-28 at 18.39.26.png (1×2 px, 284 KB)

I'll wait to see if that goes back down before taking the snapshot of the primary.

I'll wait to see if that goes back down before taking the snapshot of the primary.

[17:49]  <    taavi> hm, something happened recently which caused toolsdb free disk space to drop significantly:
[17:50]  <    taavi> maybe the timing seems to match
[17:57]  <    bd808> *sigh* we really need to figure out DB quotas
[17:59]  <    bd808> "Might one day allow for looking up the past GlobalUsage of a Commons file." sounds liek the description of a tool that could be paused if it causing issues.
[18:03]  <andrewbogott> taavi: that corresponds with dhinus finding new toolsdb replag
[18:04]  <    bd808> Code is at They are trying to load a lot of dumps data into ToolsDB and this is certainly going to end badly.
[18:09]  <    taavi> yeah, that is not going to end well
[18:10]  <andrewbogott> Shall I reach out to them or is one of you already on top of that?
[18:11]  <    taavi> I am not
[18:11]  <    bd808> andrewbogott: go for it. :)
[18:17]  <andrewbogott> I emailed and told them I'd kill the job in a few hours if they don't kill it first.

Write activity on the primary is back to normal:

Screenshot 2023-03-01 at 15.58.20.png (1×2 px, 286 KB)

And both replicas did catch up:

Screenshot 2023-03-01 at 16.00.36.png (1×2 px, 207 KB)

I'm gonna take a libvirt snapshot of the primary database clouddb1001 today at 15:30 UTC. This will involve a short downtime of ToolsDB (expected: 5 minutes). I have updated the description of this task with the procedure I will follow.

Mentioned in SAL (#wikimedia-cloud) [2023-03-06T15:33:12Z] <dhinus> stopping mariadb in clouddb1001 for a few minutes T329970

Mentioned in SAL (#wikimedia-cloud) [2023-03-06T15:38:19Z] <dhinus> mariadb is up again on clouddb1001 T329970

Snapshot created, currently copying from the snapshot to the Cinder volume.

root@cloudvirt1019:~# virsh snapshot-create-as --disk-only --domain  i-000045d4

root@cloudvirt1019:~# apt install libguestfs-tools
root@cloudvirt1019:~# guestmount -a /var/lib/nova/instances/c01a0492-bff0-4ce2-9395-833b406eee7b/disk -i --ro /mnt-snapshot

root@cloudvirt1019:~# rbd map eqiad1-cinder/volume-9583ba88-b836-4233-8e51-6213d403f9c7

root@cloudvirt1019:~# rsync -a --info=progress2 /mnt-snapshot/srv/labsdb/ /mnt-cinder/

rsync completed after 7 hours, I attached the volume to tools-db-1 and tried 3 times starting to replicate from the primary but it always fails with

[ERROR] Slave SQL: Could not execute Update_rows_v1 event on table s51138__heritage_p.prox_search; Can't find record in 'prox_search', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log log.316058, end_log_pos 670, Gtid 2886731673-2886731673-4328983117, Internal MariaDB error code: 1032

It looks like the GTID position is incorrect, but I'm manually setting it to the one I copied when I took the snapshot.

This is what I did yesterday on the primary before taking the snapshot:

Query OK, 0 rows affected (26.52 sec)

MariaDB [(none)]> SELECT @@global.gtid_binlog_pos;
| @@global.gtid_binlog_pos                                  |
| 0-2886731673-33522724637,2886731673-2886731673-4488544230 |
1 row in set (0.00 sec)

And this is what I'm doing now on the replica:

SET GLOBAL gtid_slave_pos = "0-2886731673-33522724637,2886731673-2886731673-4488544230";

For some reason, the binlog position I recorded before taking the snapshot was wrong. Maybe shutting down mariadb with systemctl stop mariadb actually adds some transactions to the binlog? I found the right value to use inspecting the file log.state in the snapshot, which contains the correct binlog position to use:

root@cloudvirt1019:/mnt-snapshot/srv/labsdb/binlogs# cat log.state

In tools-db-1:

SET GLOBAL gtid_slave_pos = "0-2886731673-33522724637,2886731673-2886731673-4488548509";

Replication is now working correctly, and tools-db-1 is slowly catching up with the primary.

To be extra safe (since I played a bit with the replication settings before finding the right value for gtid_slave_pos) I restarted from a clean state.

I deleted the entire data volume, recreated a new data volume from the Cinder snapshot I took yesterday, and started the replication using the correct gtid_slave_pos.

I also saved the output of both MariaDB logs and mysql_upgrade, they contain a few warnings and errors, but nothing that looks too scary. I'm attaching both files.

The replica caught up with the primary and is staying in sync. This task is complete. 🎉

Screenshot 2023-03-09 at 14.10.51.png (1×3 px, 336 KB)

fnegri reopened this task as In Progress.Mar 9 2023, 1:55 PM

Ouch, I stopped and restarted mariadb in tools-db-1, and did not realize that would reset the config replicate_wild_ignore_table to the value included in modules/role/templates/mariadb/mysqld_config/, which excludes other tables from replication and is not what I want. :/

I'm creating a Puppet patch to set that value differently in tools-db-1 and tools-db-2, while maintaining the current value in clouddb1002.

Change 896101 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] [toolsdb] Update config file but keep old one

Change 896101 merged by FNegri:

[operations/puppet@production] [toolsdb] Update config file but keep old one

Thankfully I had taken a Cinder snapshot before stopping and starting mariadb, so I could restart from that point now that the config file is fixed (after merging

tools-db-1 is again replicating and is back in sync with clouddb1001.

Replication has failed after 5 days (on Mar, 14 21:09 UTC) with an error I had not seen before:

[ERROR] Slave SQL: Error executing row event: ''mysql.user' is not of type 'BASE TABLE'', Gtid 2886731673-2886731673-4588079609, Internal MariaDB error code: 1347

In both tools-db-1 and tools-db-2, mysql.user is indeed of type VIEW, while in the old replica clouddb1002 it is of type BASE TABLE. I'm investigating on why this is the case.

This is a change introduced in MariaDB 10.4:

All user accounts, passwords, and global privileges are now stored in the mysql.global_priv table. The mysql.user table still exists and has exactly the same set of columns as before, but it’s now a view that references the mysql.global_priv table. Tools that analyze the mysql.user table should continue to work as before. From MariaDB 10.4.13, the dedicated mariadb.sys user is created as the definer of this view. Previously root was the definer, which resulted in privilege problems when this username was changed.

This is the transaction that the replica failed to apply:

root@clouddb1001:~# mysqlbinlog --no-defaults -j 36001860 --base64-output=decode-rows --verbose log.318649 > ~/log_extract


#230314 21:09:18 server id 2886731673  end_log_pos 36002072     Table_map: `mysql`.`user` mapped to number 6677173
# at 36002072
#230314 21:09:18 server id 2886731673  end_log_pos 36002227     Delete_rows: table id 6677173 flags: STMT_END_F
### DELETE FROM `mysql`.`user`
###   @1='%'
###   @2='s55337'


I manually deleted the user in tools-db-1 using the correct table for MariaDB 10.4 (mysql.global_priv instead of mysql_user), skipped the transaction and resumed replication:

MariaDB [(none)]> DELETE FROM mysql.global_priv WHERE user = 's55337';
Query OK, 1 row affected (0.015 sec)

MariaDB [(none)]> START SLAVE;

After 2 days, tools-db-1 is again in sync with the primary.

Screenshot 2023-03-29 at 19.04.48.png (952×3 px, 188 KB)

JJMC89 moved this task from Backlog to ToolsDB on the Data-Services board.