
ToolsDB upgrade => Bullseye, MariaDB 10.4
Closed, Resolved · Public

Description

We can't really keep pushing this into the future: ToolsDB is still on Stretch + MariaDB 10.1 and needs to be upgraded to something more modern, which I believe currently means Bullseye and MariaDB 10.4.

ToolsDB is currently running on two dedicated localdisk hypervisors. Can we temporarily add a third one to make the migration easier? I was thinking of maybe using the replica (once recovered from T301951) to clone a new 10.4 instance without causing any service impact to the primary.

We have two upgrades that need to happen here: the hypervisors (cloudvirt1019 and 1020) need to be upgraded to Bullseye, and the VMs (clouddb100[1-4]) need to be upgraded to at least Buster and, ideally, also Bullseye.

Clouddb1001 and 1004 are on cloudvirt1019
Clouddb1002 and 1003 are on cloudvirt1020

Clouddb1001 hosts the (partial) ToolsDB database, with a (partial) replica on clouddb1002

Clouddb1004 and Clouddb1003 are PostgreSQL servers and I'm not entirely sure what's on them. For those two servers, refer to T323159 instead.

Related Objects

Event Timeline


Failed again, probably for a similar reason but on a different database and different table:

230117 12:17:41 [01] Copying ./s52861__secWatch/temp_pagelinks.ibd to /mnt/tools-db/backup-clouddb1001-20220117/s52861__secWatch/temp_pagelinks.ibd
2023-01-17 12:17:41 140580105484032 [ERROR] InnoDB: Tried to read 147456 bytes at offset 0. Was only able to read 114688.
[01] mariabackup: Error: xtrabackup_copy_datafile() failed.
[01] mariabackup: Error: failed to copy datafile.

About 1.8T had been backed up when it failed (yesterday's attempt reached 2.1T).

Trying one last time; if it fails again I'll use a different approach. I started a new backup at 7:45 UTC:

root@clouddb1001:~# mariabackup --backup --target-dir=/mnt/tools-db/backup-clouddb1001-20230118 --user root

This time it failed almost immediately:

230118 07:58:42 [01] Copying ./s51138__heritage_p/monuments_az_@0028az@0029.ibd to /mnt/tools-db/backup-clouddb1001-20230118/s51138__heritage_p/monuments_az_@0028az@0029.ibd
2023-01-18  7:58:42 139850929927936 [ERROR] InnoDB: Tried to read 10485760 bytes at offset 0. Was only able to read 131072.
[01] mariabackup: Error: xtrabackup_copy_datafile() failed.
[01] mariabackup: Error: failed to copy datafile.

Trying once more (restarted at 9:09 UTC).

Fourth time lucky! The backup completed successfully at 14:16 UTC, taking about 5 hours in total.

...

230118 14:16:34 Finished backing up non-InnoDB tables and files
230118 14:16:34 [01] Copying aria_log.00001172 to /mnt/tools-db/backup-clouddb1001-20230118/aria_log.00001172
230118 14:16:35 >> log scanned up to (71220802880838)
230118 14:16:36 >> log scanned up to (71220802880838)
230118 14:16:36 [01]        ...done
230118 14:16:36 [01] Copying aria_log_control to /mnt/tools-db/backup-clouddb1001-20230118/aria_log_control
230118 14:16:36 [01]        ...done
230118 14:16:36 [00] Writing xtrabackup_binlog_info
230118 14:16:36 [00]        ...done
230118 14:16:36 Executing FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS...
mariabackup: The latest check point (for incremental): '71220802880838'
230118 14:16:36 >> log scanned up to (71220802880838)
mariabackup: Stopping log copying thread.

230118 14:16:36 Executing UNLOCK TABLES
230118 14:16:36 All tables unlocked
230118 14:16:36 [00] Copying ib_buffer_pool to /mnt/tools-db/backup-clouddb1001-20230118/ib_buffer_pool
230118 14:16:36 [00]        ...done
230118 14:16:36 Backup created in directory '/mnt/tools-db/backup-clouddb1001-20230118/'
MySQL binlog position: filename 'log.310398', position '18665008', GTID of the last change '0-2886731673-33522724637,2886731673-2886731673-3486875456'
230118 14:16:36 [00] Writing backup-my.cnf
230118 14:16:36 [00]        ...done
230118 14:16:36 [00] Writing xtrabackup_info
230118 14:16:36 [00]        ...done
mariabackup: Transaction log of lsn (71217244365711) to (71220802880838) was copied.
230118 14:16:36 completed OK!

As expected, mariabackup locked the database during the last phase of the backup, causing a page at 14:05 UTC that resolved at 14:16 UTC when the backup completed. I expect the database was read-only during that time, but I did not verify it.

14:05 PROBLEM - toolschecker: toolsdb read/write on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out - string OK not found on http://checker.tools.wmflabs.org:80/db/toolsdb - 340 bytes in 60.010 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker

14:16 RECOVERY - toolschecker: toolsdb read/write on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 20.310 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
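
A hedged way to confirm this next time (assuming shell access to the primary while mariabackup is in its final, locked phase) would be to look for sessions blocked on the global read lock:

# Hypothetical check: during the FLUSH TABLES WITH READ LOCK phase, blocked
# client sessions show the state "Waiting for global read lock".
mysql --user=root -e "SHOW PROCESSLIST" | grep -i "waiting for.*read lock"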

Now I'm running into another issue: when executing mariabackup --prepare on the generated backup, I get this error:

root@clouddb1001:~# mariabackup --prepare --target-dir=/mnt/tools-db/backup-clouddb1001-20230118
mariabackup based on MariaDB server 10.1.39-MariaDB Linux (x86_64)
mariabackup: cd to /mnt/tools-db/backup-clouddb1001-20230118/
mariabackup: This target seems to be not prepared yet.
mariabackup: xtrabackup_logfile detected: size=4003332096, start_lsn=(71217244365711)
mariabackup: using the following InnoDB configuration for recovery:
mariabackup:   innodb_data_home_dir = ./
mariabackup:   innodb_data_file_path = ibdata1:12M:autoextend
mariabackup:   innodb_log_group_home_dir = ./
mariabackup:   innodb_log_files_in_group = 1
mariabackup:   innodb_log_file_size = 4003332096
mariabackup: using the following InnoDB configuration for recovery:
mariabackup:   innodb_data_home_dir = ./
mariabackup:   innodb_data_file_path = ibdata1:12M:autoextend
mariabackup:   innodb_log_group_home_dir = ./
mariabackup:   innodb_log_files_in_group = 1
mariabackup:   innodb_log_file_size = 4003332096
mariabackup: Starting InnoDB instance for recovery.
mariabackup: Using 104857600 bytes for buffer pool (set by --use-memory parameter)
2023-01-18 14:57:21 140398223053504 [Note] InnoDB: Using mutexes to ref count buffer pool pages
2023-01-18 14:57:21 140398223053504 [Note] InnoDB: The InnoDB memory heap is disabled
2023-01-18 14:57:21 140398223053504 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2023-01-18 14:57:21 140398223053504 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
2023-01-18 14:57:21 140398223053504 [Note] InnoDB: Compressed tables use zlib 1.2.11
2023-01-18 14:57:21 140398223053504 [Note] InnoDB: Using SSE crc32 instructions
2023-01-18 14:57:21 140398223053504 [Note] InnoDB: Initializing buffer pool, size = 100.0M
2023-01-18 14:57:21 140398223053504 [Note] InnoDB: Completed initialization of buffer pool
2023-01-18 14:57:21 140398223053504 [Note] InnoDB: Highest supported file format is Barracuda.
2023-01-18 14:57:21 140398223053504 [Note] InnoDB: Starting crash recovery from checkpoint LSN=71217244365711
2023-01-18 14:57:33 140398223053504 [Note] InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
2023-01-18 14:57:34 140398223053504 [Note] InnoDB: Starting a batch to recover 29260 pages from redo log
2023-01-18 14:57:35 7fb0f633c700  InnoDB: Assertion failure in thread 140398021560064 in file log0recv.cc line 1172
InnoDB: Failing assertion: !page || (ibool)!!page_is_comp(page) == dict_table_is_comp(index->table)
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
2023-01-18 14:57:35 7fb0f533a700  InnoDB: Assertion failure in thread 140398004774656 in file log0recv.cc line 1172
InnoDB: Failing assertion: !page || (ibool)!!page_is_comp(page) == dict_table_is_comp(index->table)
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
Aborted

I discussed this again with @jcrespo on IRC: maybe a concurrent operation did not prevent the backup from completing, but is now surfacing during the --prepare phase. In any case, the backup generated yesterday is probably not good enough for a full restore.

We discussed a few alternative options that I will explore tomorrow:

  1. create an LVM snapshot of the primary database, copy it to /mnt/tools, then delete the snapshot. Try to restore this snapshot to a new host.
  2. use mydumper to create logical dumps, potentially one db at a time (see the sketch after this list)
  3. if none of the above work, attempt to back up the primary excluding some problematic databases/tables, then back up those separately and restore them to a different instance (effectively splitting the main ToolsDB, which is something we want to do anyway)
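
For option 2, a rough sketch of the mydumper/myloader approach (paths are placeholders, and flags should be double-checked against the mydumper version available on the host):

# Dump one database at a time, compressed, to the Cinder volume:
mydumper --user=root --password=xxxxxxxxxx --database=s51138__heritage_p \
  --outputdir=/mnt/tools-db/mydumper/s51138__heritage_p --threads=4 --compress
# Restore it on the new 10.4 instance:
myloader --user=root --password=xxxxxxxxxx --directory=/mnt/tools-db/mydumper/s51138__heritage_p \
  --threads=4 --overwrite-tables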

Mentioned in SAL (#wikimedia-cloud) [2023-01-24T15:46:06Z] <dhinus> clouddb1001 flush tables with read lock; (T301949)

Mentioned in SAL (#wikimedia-cloud) [2023-01-24T16:01:18Z] <dhinus> clouddb1001 unlock tables; (T301949)

Today I explored the "LVM snapshot" backup strategy, mostly following this guide. I had to manually kill a couple of queries that were preventing flush tables with read lock from completing; then I was able to create an LVM snapshot, unlock the database, and start copying the data from the snapshot to the attached Cinder volume.

root@clouddb1001:~# lvcreate -L13G -s -n data_lvm_backup /dev/vd/data
root@clouddb1001:~# mkdir /mnt/data_lvm_backup
root@clouddb1001:~# mount /dev/vd/data_lvm_backup /mnt/data_lvm_backup -onouuid,ro
root@clouddb1001:~# cp -a /mnt/data_lvm_backup/labsdb/data /mnt/tools-db/lvm_backup_20230124/

The problem is that, as expected, the LVM snapshot has a limited size (I created it with -L13G, as that was the free space available in the current LVM setup; see below). LVM snapshots are copy-on-write and consume space to store the diff between the snapshot and the current state of the snapshotted volume. When the entire snapshot space is consumed, the snapshot becomes invalid. Unfortunately the copy of the data from the snapshot volume to the Cinder volume was progressing at about 200MB/s and would have required about 4 hours. The snapshot space was filling much faster, and was 50% used after about 1 hour.
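
A simple way to keep an eye on the snapshot while the copy runs (a sketch, assuming the snapshot name used above):

# data_percent shows how much of the 13G copy-on-write space is used;
# the snapshot becomes invalid once it reaches 100%.
watch -n 60 lvs -o lv_name,lv_size,data_percent /dev/vd/data_lvm_backup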

I stopped the copy, unmounted the volume, and deleted the snapshot with lvremove. I think I can work around this problem by shrinking the main LVM volume to create more space for the snapshot volume. This is the current situation:

Total PE              886143
Alloc PE / Size       882724 / 3.37 TiB
Free  PE / Size       3419 / 13.36 GiB

We need less than 3T for the main database volume (/dev/vd/data), so I should be able to shrink the main volume and increase the Free PE. I will try this on the replica host first.

Unfortunately shrinking the main LVM volume /dev/vd/data is not possible, because XFS filesystems can't be shrunk. We could potentially get some extra space for the snapshot volume by deactivating swap and removing the swap volume /dev/vd/swap (7.5G) but it's probably not going to be enough.
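
For reference, reclaiming the swap volume would look roughly like this (a sketch only; the VG name is an assumption, and as noted the ~7.5G it frees is probably not enough):

swapoff /dev/vd/swap     # stop using the swap LV
lvremove /dev/vd/swap    # return its extents to the volume group (remember to update /etc/fstab)
vgs vd                   # confirm the extra Free PE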

I will try a different strategy: creating a 13G snapshot like yesterday, but excluding the /srv/labsdb/data/s51230__linkwatcher directory (1.1TB) from the copy, so hopefully the copy will complete before the snapshot volume fills up. We can then migrate that database separately (already tracked in T224154).

Mentioned in SAL (#wikimedia-cloud) [2023-01-25T14:34:24Z] <dhinus> clouddb1001 flush tables with read lock; (T301949)

Mentioned in SAL (#wikimedia-cloud) [2023-01-25T14:37:09Z] <dhinus> clouddb1001 unlock tables; (T301949)

I created a new snapshot and tried using rsync to copy its contents to the Cinder volume, excluding the directory s51230__linkwatcher. Unfortunately rsync did not complete in time: after 4 hours the snapshot volume was almost out of space and rsync had only transferred 915G out of 1.5T.

root@clouddb1001:~# lvcreate -L13G -s -n data_lvm_snapshot /dev/vd/data
root@clouddb1001:~# mkdir /mnt/data_lvm_snapshot
root@clouddb1001:~# mount /dev/vd/data_lvm_snapshot /mnt/data_lvm_snapshot -onouuid,ro
root@clouddb1001:~# rsync -a --info=stats2,progress2 --exclude 's51230__linkwatcher' /mnt/data_lvm_snapshot/ /mnt/tools-db/lvm_backup_20230125/

For some reason, the copy with rsync seems to be significantly slower than what I was seeing yesterday using cp (confirmed by the I/O charts in Grafana, see image below). I will do some more benchmarking to see if there is a way to make rsync faster, or I will try using cp with some bash magic to exclude one directory from the copy.
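
A quick way to compare the three tools on the same data before committing to a full copy (a rough sketch; SAMPLE_DB is a placeholder for any medium-sized database directory):

mkdir -p /mnt/tools-db/bench_cp /mnt/tools-db/bench_rsync /mnt/tools-db/bench_tar
cd /mnt/data_lvm_snapshot/labsdb/data
time cp -a SAMPLE_DB /mnt/tools-db/bench_cp/
time rsync -a SAMPLE_DB/ /mnt/tools-db/bench_rsync/
time tar cf - SAMPLE_DB | (cd /mnt/tools-db/bench_tar && tar xf -)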

Screenshot 2023-01-25 at 19.54.23.png (1×1 px, 219 KB)

In production we use transfer.py, which gets us a consistent 300 MB/s on 1G networks thanks to compression: https://grafana.wikimedia.org/goto/fVRedQTVz?orgId=1 It only uses netcat underneath (plus hashing for error checking).

transfer.py does not seem to have an easy way to exclude a directory, am I right?

I tested cp, rsync, and tar again, and based on my quick benchmarks, I think this command might complete the copy before the snapshot space fills up:

root@clouddb1001:/mnt/data_lvm_snapshot/labsdb# tar cf - --exclude='./backup1' --exclude='./data/s51230__linkwatcher' . | (cd /mnt/tools-db && tar xvf -)

I will run this tomorrow morning so I can keep an eye on the progress.

Mentioned in SAL (#wikimedia-cloud) [2023-01-27T10:33:53Z] <dhinus> clouddb1001 flush tables with read lock; (T301949)

Mentioned in SAL (#wikimedia-cloud) [2023-01-27T10:35:44Z] <dhinus> clouddb1001 unlock tables; (T301949)

Copying with tar cf as in the comment above worked, and completed just before the snapshot space filled up.

I have transferred the Cinder volume to the tools project, and attached it to tools-db-1, only to find out that you cannot do a MariaDB version upgrade on data files that were not shut down cleanly (I took the LVM snapshot while the DB was running):

Jan 27 15:03:21 tools-db-1 mysqld[3180577]: 2023-01-27 15:03:21 0 [ERROR] InnoDB: Upgrade after a crash is not supported. This redo log was created before MariaDB 10.2.2.

I need to attach the Cinder volume to a host running MariaDB 10.1 to "clean up" those data files, then shut down MariaDB cleanly, reattach the volume to tools-db-1 which is running 10.4, and this time it should start correctly.

The package wmf-mariadb101 does not install on Bullseye, but it might work on a Buster VM (haven't tried yet). Otherwise I need to find a way to start a Stretch VM in Cloud VPS.
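
The rough plan for whichever 10.1 host ends up working (a sketch; the device name, mount point, and service name are assumptions):

mount /dev/sdb1 /srv/labsdb       # attach the Cinder volume where the 10.1 datadir is expected
systemctl start mariadb           # 10.1 runs InnoDB crash recovery on the copied data files
journalctl -fu mariadb            # wait until recovery finishes and the server is up
mysqladmin --user=root shutdown   # clean shutdown, so 10.4 can upgrade the files later
umount /srv/labsdb                # detach and reattach the volume to tools-db-1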

I will continue this work next Monday.

I created a temporary Stretch VM called tools-db-temp, and installed MariaDB 10.1 there. I attached the backup volume, started MariaDB, and dropped the database s51230__linkwatcher, which had been excluded from the backup and was causing some warnings in the logs. At this point I stopped MariaDB, unmounted the volume, and reattached it to tools-db-1.

MariaDB 10.4 on tools-db-1 started successfully, but the logs show many errors like the following one on many different databases:

Jan 30 17:14:51 tools-db-1 mysqld[3436392]: 2023-01-30 17:14:51 0 [ERROR] InnoDB: Operating system error number 2 in a file operation.
Jan 30 17:14:51 tools-db-1 mysqld[3436392]: 2023-01-30 17:14:51 0 [ERROR] InnoDB: The error means the system cannot find the path specified.
Jan 30 17:14:51 tools-db-1 mysqld[3436392]: 2023-01-30 17:14:51 0 [ERROR] InnoDB: If you are installing InnoDB, remember that you must create directories yourself, InnoDB does not create them.
Jan 30 17:14:51 tools-db-1 mysqld[3436392]: 2023-01-30 17:14:51 0 [ERROR] InnoDB: Cannot open datafile for read-only: './s51114_enwp10/automatedselection.ibd' OS error: 71
Jan 30 17:14:51 tools-db-1 mysqld[3436392]: 2023-01-30 17:14:51 0 [Note] InnoDB: The InnoDB data dictionary table SYS_DATAFILES for tablespace ID 133050 was updated to use file ./s51114__enwp10/automatedselection.ibd.

And then another bunch like this one:

Jan 30 17:14:52 tools-db-1 mysqld[3436392]: 2023-01-30 17:14:52 0 [ERROR] InnoDB: Refusing to load './s52481__scratch_en/articles.ibd' (id=2185422, flags=0); dictionary contains id=1926730, flags=0
Jan 30 17:14:52 tools-db-1 mysqld[3436392]: 2023-01-30 17:14:52 0 [ERROR] InnoDB: A tablespace for ``s52481__citationhunt_en`.`articles`` has been found in multiple places;
Jan 30 17:14:52 tools-db-1 mysqld[3436392]: 2023-01-30 17:14:52 0 [ERROR] InnoDB: Default location: ./s52481__citationhunt_en/articles.ibd, Space ID=1926730, Flags=0
Jan 30 17:14:52 tools-db-1 mysqld[3436392]: 2023-01-30 17:14:52 0 [ERROR] InnoDB: Dictionary location: ./s52481__scratch_en/articles.ibd, Space ID=2185422, Flags=0
Jan 30 17:14:52 tools-db-1 mysqld[3436392]: 2023-01-30 17:14:52 0 [Note] InnoDB: The InnoDB data dictionary table SYS_DATAFILES for tablespace ID 1926730 was updated to use file ./s52481__citationhunt_en/articles.ibd.

Finally, there was a long list of lines like:

Jan 30 17:14:57 tools-db-1 mysqld[3436392]: 2023-01-30 17:14:57 0 [Warning] InnoDB: Table mysql/innodb_table_stats has length mismatch in the column name table_name.  Please run mysql_upgrade

And one single error mentioning an "Incorrect definition":

Jan 30 17:14:58 tools-db-1 mysqld[3436392]: 2023-01-30 17:14:58 0 [ERROR] Incorrect definition of table mysql.event: expected column 'sql_mode' at position 14 to have type set('REAL_AS_FLOAT','PIPES_AS_CONCAT','ANSI_QUOTES','IGNORE_SPACE','IGNORE_BAD_TABLE_OPTIONS','ONLY_FULL_GROUP_BY','NO_UNSIGNED_SUBTRACTION','NO_D
IR_IN_CREATE','POSTGRESQL','ORACLE','MSSQL','DB2','MAXDB','NO_KEY_OPTIONS','NO_TABLE_OPTIONS','NO_FIELD_OPTIONS','MYSQL323','MYSQL40','ANSI','NO_AUTO_VALUE_ON_ZERO','NO_BACKSLASH_ESCAPES','STRICT_TRANS_TABLES','STRICT_ALL_TABLES','NO_ZERO_IN_DATE','NO_ZERO_DATE','INVALID_DATES','ERROR_FOR_DIVISION_BY_ZERO','TRADITION
AL','NO_AUTO_CREATE_USER','HIGH_NOT_PRECEDENCE','NO_ENGINE_SUBSTITUTION','PAD_CHAR_TO_FULL_LENGTH','EMPTY_STRING_IS_NULL','SIMULTANEOUS_ASSIGNMENT'), found type set('REAL_AS_FLOAT','PIPES_AS_CONCAT','ANSI_QUOTES','IGNORE_SPACE','IGNORE_BAD_TABLE_OPTIONS','ONLY_FULL_GROUP_BY','NO_UNSIGNED_SUBTRACTION','NO_DIR_IN_CREAT
E','POSTGRESQL','ORACLE','MSSQL','DB2','MAXDB','NO_KEY_OPTIONS','NO_TABLE_OPTIONS','NO_FIELD_OPTIONS','MYSQL323','MYSQL40','ANSI','NO_AUTO_VALU
Jan 30 17:14:58 tools-db-1 mysqld[3436392]: 2023-01-30 17:14:58 0 [ERROR] mysqld: Event Scheduler: An error occurred when initializing system tables. Disabling the Event Scheduler.

I am now running mysql_upgrade as recommended and it's repairing/rebuilding tables one by one, taking quite a long time.
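
For the record, a typical invocation looks like this (a sketch; --verbose just prints the per-table progress):

# Run inside screen/tmux, as it can take many hours on a dataset this size.
mysql_upgrade --user=root --verbose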

Whilst I do appreciate all the efforts you are making, are you sure you're going to trust all this data, with all the errors it's giving every time a migration is tried?
I assume it's quite a lot of data, but isn't it safer to take a logical mysqldump (or mydumper) and import it?
This binary migration is proving to be very hard, and the data might already be in a very corrupted state.

OS error: 71 means there could be some kind of permission error (at the filesystem level). Check that all the files are writable and owned by the right user. The other errors are "expected": you should do a restart after mysql_upgrade / rebuilds, and most of them should disappear. E.g. the events definition should correct itself, but if not, it is easy to fix. Things like GTID coordinates may need a reset, etc.

Check also the upgrade notes from 10.1 to 10.4 in production on Wikitech; some grants may need to be changed in a batch job.

After that, most of those issues can be corrected by exporting the data, dropping the tables, and importing them again (or re-exporting from the source). Those data dictionary corruptions may have caused your original issues.
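
A hedged sketch of that export/drop/reimport cycle for one of the affected tables (table name taken from the log above; each case would need to be checked individually):

mysqldump s52481__scratch_en articles > /root/articles.sql    # export the data
mysql -e "DROP TABLE s52481__scratch_en.articles"             # drop the table and its tablespace
mysql s52481__scratch_en < /root/articles.sql                 # recreate and reload it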

Sorry for the very long comment posted before, I have now trimmed it to a more reasonable length.

Once mysql_upgrade completes I will try to set this new db to replicate from the old primary, and see if it manages to catch up and stay in sync or not.

@Marostegui @jcrespo thanks for the feedback, I will trust your experience here on whether it's reasonable to promote this new db to primary, or if I should give up and dump everything with mydumper and re-import to a clean db. I imagine mydumper will have its own set of complications...

mysql_upgrade completed after more than 24 hours with no visible errors in the output.

I started replication from clouddb1001 to the new tools-db-1, excluding the linkwatcher database:

SET GLOBAL replicate_wild_ignore_table="s51230__linkwatcher.%";

SET GLOBAL gtid_slave_pos = "0-2886731673-33522724637,2886731673-2886731673-3611909387";

CHANGE MASTER TO
 MASTER_HOST='clouddb1001.clouddb-services.eqiad1.wikimedia.cloud',
 MASTER_USER='repl',
 MASTER_PASSWORD='xxxxxxxxxx',
 MASTER_USE_GTID=slave_pos;

START SLAVE;

SET GLOBAL innodb_flush_log_at_trx_commit=2;

Replication is ongoing, and Seconds_Behind_Master is slowly decreasing from the expected initial value of 431865 seconds (about 5 days, i.e. when the LVM snapshot was taken).
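
A quick way to watch the catch-up progress (a sketch):

# Slave_IO_Running and Slave_SQL_Running should both be "Yes",
# and Seconds_Behind_Master should trend towards 0.
mysql -e "SHOW SLAVE STATUS\G" | grep -E "Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master"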

First replication error:

Could not execute Update_rows_v1 event on table s54518__mw.online; Can't find record in 'online', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND;

I added s54518__mw.% to replicate_wild_ignore_table and resumed the replication:

SET GLOBAL replicate_wild_ignore_table="s51230__linkwatcher.%,s54518__mw.%";
START SLAVE;

The same database is also excluded from replication in clouddb1002 because it caused replication errors in T266587 and T322039.

The new host caught up with the primary and is staying in sync, though the lag increased up to 4 hours at 16:00 UTC, before catching up again. As can be seen in the graph below, the other replica clouddb1002 also struggled to stay in sync at that time, but was able to catch up faster, probably thanks to the faster disk.

Screenshot 2023-02-02 at 19.25.16.png (1×1 px, 227 KB)

Before we can promote tools-db-1 to become the new primary and point the DNS to it, we need to sort out the two databases that I had to exclude from replication.

I have created two sub-tasks to try and move them to dedicated Trove instances:

  • T328691 Migrate linkwatcher db to Trove
  • T328693 Migrate "mw" db to Trove

After 10 days, tools-db-1 is still in sync with the primary, with no replication errors, though there was a spike in replication lag on Feb 10:

Screenshot 2023-02-13 at 15.14.43.png (1×3 px, 485 KB)

I'm not too worried, but it might indicate we're likely to see some database bottleneck/slowdown after we promote tools-db-1 to become the new primary.

I will proceed anyway with T329521 (Set up tools-db-2 to replicate tools-db-1).

The Ceph outage 2 days ago (T329535) caused some data corruption in tools-db-1. Looking at the logs, the main events are the ones below.

MariaDB crashed at 14:32 (2 hours before the outage, probably unrelated?) and restarted automatically at 14:35.

Feb 13 14:32:40 tools-db-1 mysqld[3589396]: 2023-02-13 14:30:55 0 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
Feb 13 14:32:40 tools-db-1 mysqld[3589396]: 230213 14:30:55 [ERROR] mysqld got signal 6 ;
Feb 13 14:32:40 tools-db-1 mysqld[3589396]: This could be because you hit a bug. It is also possible that this binary
Feb 13 14:32:40 tools-db-1 mysqld[3589396]: or one of the libraries it was linked against is corrupt, improperly built,
Feb 13 14:32:40 tools-db-1 mysqld[3589396]: or misconfigured. This error can also be caused by malfunctioning hardware.

[...]

Feb 13 14:35:31 tools-db-1 systemd[1]: mariadb.service: Scheduled restart job, restart counter is at 1.
Feb 13 14:35:31 tools-db-1 systemd[1]: Stopped mariadb database server.
Feb 13 14:35:31 tools-db-1 systemd[1]: mariadb.service: Consumed 5d 3h 2min 35.613s CPU time.
Feb 13 14:35:31 tools-db-1 systemd[1]: Starting mariadb database server...
Feb 13 14:35:32 tools-db-1 mysqld[414272]: 2023-02-13 14:35:32 0 [Note] /opt/wmf-mariadb104/bin/mysqld (mysqld 10.4.26-MariaDB-log) starting as process 414272 ...
Feb 13 14:35:32 tools-db-1 mysqld[414272]: 2023-02-13 14:35:32 0 [ERROR] mysqld: Plugin 'unix_socket' already installed
Feb 13 14:35:32 tools-db-1 mysqld[414272]: 2023-02-13 14:35:32 0 [ERROR] Couldn't load plugin 'unix_socket' from 'auth_socket.so'.
Feb 13 14:35:32 tools-db-1 mysqld[414272]: 2023-02-13 14:35:32 0 [Note] mysqld: Aria engine: starting recovery
Feb 13 14:35:32 tools-db-1 mysqld[414272]: recovered pages: 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% (0.0 seconds); tables to flush: 12 11 10 9 8 7 6 5 4 3 2 1 0
Feb 13 14:35:32 tools-db-1 mysqld[414272]:  (0.3 seconds);
Feb 13 14:35:32 tools-db-1 mysqld[414272]: 2023-02-13 14:35:32 0 [Note] mysqld: Aria engine: recovery done

The server was shut down cleanly at 15:59 (not sure why?), but did restart at 16:00 logging a bunch of lines saying "Table is fixed" (I'm only pasting one table, but the same message was logged for 12 different tables):

Feb 13 15:59:10 tools-db-1 systemd[1]: Stopping mariadb database server...
Feb 13 15:59:10 tools-db-1 mysqld[414272]: 2023-02-13 15:59:10 0 [Note] /opt/wmf-mariadb104/bin/mysqld (initiated by: unknown): Normal shutdown

[...]

Feb 13 16:00:14 tools-db-1 systemd[1]: Starting mariadb database server...

[...]

Feb 13 16:00:42 tools-db-1 mysqld[419395]: 2023-02-13 16:00:42 14 [ERROR] mysqld: Table './s51138__heritage_p/image' is marked as crashed and should be repaired
Feb 13 16:00:42 tools-db-1 mysqld[419395]: 2023-02-13 16:00:42 14 [Warning] Checking table:   './s51138__heritage_p/image'
Feb 13 16:00:42 tools-db-1 mysqld[419395]: 2023-02-13 16:00:42 14 [Warning] s51138__heritage_p.image: 1 client is using or hasn't closed the table properly
Feb 13 16:00:49 tools-db-1 mysqld[419395]: 2023-02-13 16:00:49 14 [Note] s51138__heritage_p.image: Table is fixed

A few minutes later the log stops abruptly (presumably when the hypervisor was powered down as part of the outage remediation in T329535), and resumes a few hours later when the hypervisor was restarted:

Feb 13 16:13:16 tools-db-1 mysqld[419395]: 2023-02-13 16:13:16 14 [Note] s51138__heritage_p.prox_search: Table is fixed
-- Boot 38072bc55dbd4b56b27319b926027fce --
Feb 13 18:55:59 tools-db-1 systemd[1]: mariadb database server is not active.
Feb 13 19:07:07 tools-db-1 systemd[1]: Starting mariadb database server...

There are a few more "Table is fixed" messages after the restart (this time for 21 different tables; I'm only pasting one):

Feb 13 19:09:51 tools-db-1 mysqld[2292]: 2023-02-13 19:09:51 12 [ERROR] mysqld: Table './s51138__heritage_p/prox_search' is marked as crashed and should be repaired
Feb 13 19:09:51 tools-db-1 mysqld[2292]: 2023-02-13 19:09:51 12 [Warning] Checking table:   './s51138__heritage_p/prox_search'
Feb 13 19:09:51 tools-db-1 mysqld[2292]: 2023-02-13 19:09:51 12 [Warning] s51138__heritage_p.prox_search: 1 client is using or hasn't closed the table properly
Feb 13 19:09:59 tools-db-1 mysqld[2292]: 2023-02-13 19:09:59 12 [Warning] s51138__heritage_p.prox_search: Auto-increment value: 1988455865 is smaller than max used value: 1988455867
Feb 13 19:10:04 tools-db-1 mysqld[2292]: 2023-02-13 19:10:04 12 [Note] s51138__heritage_p.prox_search: Table is fixed

I logged in to the server around that time, and restarted replication:

MariaDB [(none)]> SET GLOBAL replicate_wild_ignore_table="s51230__linkwatcher.%,s54518__mw.%";
MariaDB [(none)]> SET GLOBAL innodb_flush_log_at_trx_commit=2;
MariaDB [(none)]> START SLAVE;

Replication worked correctly for a few hours, then suddenly failed with an error on the table s51138__heritage_p/prox_search:

Feb 14 15:49:36 tools-db-1 mysqld[2292]: 2023-02-14 15:49:36 24 [ERROR] Slave SQL: Could not execute Update_rows_v1 event on table s51138__heritage_p.prox_search; Index for table './s51138__heritage_p/prox_search.MYI' is corrupt; try to repair it, Error_code: 126; Index for table './s51138__heritage_p/prox_search.MYI' is corrupt; try to repair it, Error_code: 126; handler error HA_ERR_CRASHED; the event's master log log.314364, end_log_pos 71514849, Gtid 2886731673-2886731673-3880764139, Internal MariaDB error code: 126

I tried running REPAIR TABLE manually, which seemed to repair the table correctly:

MariaDB [s51138__heritage_p]> CHECK TABLE prox_search;
+--------------------------------+-------+----------+---------------------------------------------------------+
| Table                          | Op    | Msg_type | Msg_text                                                |
+--------------------------------+-------+----------+---------------------------------------------------------+
| s51138__heritage_p.prox_search | check | warning  | Table is marked as crashed                              |
| s51138__heritage_p.prox_search | check | error    | Checksum for key:  1 doesn't match checksum for records |
| s51138__heritage_p.prox_search | check | error    | Checksum for key:  3 doesn't match checksum for records |
| s51138__heritage_p.prox_search | check | error    | Checksum for key:  4 doesn't match checksum for records |
| s51138__heritage_p.prox_search | check | error    | Checksum for key:  5 doesn't match checksum for records |
| s51138__heritage_p.prox_search | check | error    | Checksum for key:  6 doesn't match checksum for records |
| s51138__heritage_p.prox_search | check | error    | Corrupt                                                 |
+--------------------------------+-------+----------+---------------------------------------------------------+
7 rows in set (2.032 sec)

MariaDB [s51138__heritage_p]> REPAIR TABLE prox_search;
+--------------------------------+--------+----------+----------+
| Table                          | Op     | Msg_type | Msg_text |
+--------------------------------+--------+----------+----------+
| s51138__heritage_p.prox_search | repair | status   | OK       |
+--------------------------------+--------+----------+----------+
1 row in set (9.704 sec)

But replication immediately failed with an error on the same table:

Feb 15 10:10:23 tools-db-1 mysqld[2292]: 2023-02-15 10:10:23 136874 [ERROR] Slave SQL: Could not execute Update_rows_v1 event on table s51138__heritage_p.prox_search; Can't find record in 'prox_search', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log log.314364, end_log_pos 71514849, Gtid 2886731673-2886731673-3880764139, Internal MariaDB error code: 1032
Feb 15 10:10:23 tools-db-1 mysqld[2292]: 2023-02-15 10:10:23 136874 [Warning] Slave: Can't find record in 'prox_search' Error_code: 1032
Feb 15 10:10:23 tools-db-1 mysqld[2292]: 2023-02-15 10:10:23 136874 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'log.314364' position 71514591; GTID position '0-2886731673-33522724637,2886731673-2886731673-3880764138'
Feb 15 10:10:23 tools-db-1 mysqld[2292]: 2023-02-15 10:10:23 136874 [Note] Slave SQL thread exiting, replication stopped in log 'log.314364' at position 71514591; GTID position '0-2886731673-33522724637,2886731673-2886731673-3880764138'

It looks like the Ceph outage did cause some data corruption in the tools-db-1 data volume. I think the safest way to proceed is to create a new LVM snapshot from the primary (which was not affected by the outage, as it's not using Ceph).

I do wonder if this is something we should worry about once tools-db-1 becomes the new primary. If a similar Ceph outage were to happen in the future, it might cause data corruption in the database that would have to be handled manually.

It's worth noting that all the tables where corruption was found after the outage seem to be MyISAM tables. We should prioritize T306455, but even if all our tables were InnoDB, a similar Ceph outage would still mean losing some data, because we use innodb_flush_log_at_trx_commit=2, and that setting means you can lose up to one second of transactions in case of a crash.
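
To get a sense of how much data is still on MyISAM (relevant for T306455), something like this should work (a sketch):

# Lists user tables still using the MyISAM engine, largest first.
mysql -e "SELECT table_schema, table_name, ROUND(data_length/1024/1024) AS size_mb
          FROM information_schema.tables
          WHERE engine='MyISAM'
            AND table_schema NOT IN ('mysql', 'information_schema', 'performance_schema')
          ORDER BY data_length DESC"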

Based on what happened on tools-db-1, I expect a similar Ceph outage after we promote tools-db-1 to become the new primary would cause some table corruption (MyISAM) and some lost transactions (InnoDB), but the server would restart without any unrecoverable error; we would only lose some data that was being inserted/updated right at the moment of the crash. The replica server tools-db-2 might have to be rebuilt from scratch in such a scenario, because it could end up out of sync with the primary.

I believe this is an acceptable risk for ToolsDB and we can continue with the plan of making tools-db-1 (with Ceph storage) the new primary. As noted above, I will have to rebuild it from scratch because it's no longer in sync with the current primary clouddb1001.

To document this procedure (and also because I do not trust myself to repeat all the steps without doing some stupid mistake) I created T329970. I will work on that task on Monday.

After some hiccups you can read about in T329970, tools-db-1 and tools-db-2 are in sync with the primary.

We plan to promote tools-db-1 to become the new primary and point all tools to use that host on Apr 6, 2023 at 17:00 UTC.

I created T333471 to discuss all the steps we need to perform on that day.

fnegri changed the task status from Open to In Progress. Apr 7 2023, 5:42 PM

Yesterday we promoted tools-db-1 to become the new primary (T333471), and tools are successfully using the new host. 🎉

I'm not resolving this tracking task yet, because there are a few related activities that must be completed (at least T328691, and checking that monitoring is working correctly).

Mentioned in SAL (#wikimedia-cloud) [2023-04-14T09:21:44Z] <dhinus> shutdown old ToolsDB VMs (clouddb1001, clouddb1002) T301949

fnegri changed the subtype of this task from "Task" to "Goal". Apr 14 2023, 3:00 PM
fnegri changed the subtype of this task from "Goal" to "Task". Apr 19 2023, 2:43 PM

I removed T328691 from the subtasks, as we're handling the linkwatcher migration independently.

All the other subtasks are now complete, so I'm finally marking this task as Resolved.

Heartfelt thanks to everyone who chimed in and offered ideas and support. It was lovely to see so many people helping out, including different WMF teams and several volunteers.

This would probably make a great blog post, I will try to draft one in the coming weeks.

Thats_all_folks.svg.png (480×637 px, 160 KB)