
[toolsdb] Migrate linkwatcher db to Trove
Closed, Resolved · Public

Description

The database s51230__linkwatcher is the largest database in ToolsDB; running it in Trove would make things much easier operationally:

  • ToolsDB would become much smaller and easier to backup/upgrade/migrate
  • A separate instance for Linkwatcher could be replicated/migrated/upgraded without impacting other dbs

I'm not sure if an extended read-only time would be acceptable while we migrate data from the current ToolsDB host (clouddb1001) to a new Trove instance. I will try to take a logical dump of the database with mydumper to estimate how much read-only time we would need.

There's also a small risk that the performance in Trove would be worse than the current performance in ToolsDB, but in any case we don't plan on maintaining the current setup for much longer, and a separate Trove instance is likely to be more performant than a Cinder-based VM shared with other dbs.

Event Timeline

fnegri triaged this task as High priority. Mar 6 2023, 8:57 AM

Sent today:

Hello!

Your linkwatcher tool is on our short list of tools that have outgrown our one-size-fits-all toolforge database solution. It's big enough, and busy enough, that replicating it is proving difficult, and there are some 'noisy neighbor' issues that degrade service for other smaller users of toolforge.

Cloud VPS provides an alternate database-as-a-service feature, Trove, [0] and we would like to move your database to a dedicated server running on Trove. Your tool itself will run in the same place as before but use a different connection string to access the new database.

Ideally we'd just issue you a project and trove quota and let you do all the setup and administration of this new server, but if you're not feeling up to that Francesco will set it up for you. Either way you'll ultimately be responsible for the (light) long-term administration of the service; such is the price of success 😄

Please confirm that you're getting this email and are available to coordinate, and also how much of the DIY process you're up for.

Thank you!

-Andrew



[0] https://wikitech.wikimedia.org/wiki/Help:Trove_database_user_guide

I'm testing how long it takes to back up the s51230__linkwatcher database using mydumper, running on clouddb1002 and writing to an attached Cinder volume:

root@clouddb1002:~# apt install mydumper
root@clouddb1002:~# mydumper --database s51230__linkwatcher -o /mnt/linkwatcher-backup/backup_linkwatcher_test1

mydumper completed in 2 hours and 30 minutes. The Cinder volume was not of type "high-iops", so it's possible that time could be even lower. I will test another run using a "high-iops" volume, and then test the import time into a Trove db.
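(For reference, the dump itself could probably be parallelised further; a rough, untested variant with more threads and row-chunking would look something like the following, output path illustrative:)

root@clouddb1002:~# mydumper --database s51230__linkwatcher --threads 8 --rows 1000000 -o /mnt/linkwatcher-backup/backup_linkwatcher_chunked

Chunking with --rows would also let myloader load the one huge table from several files in parallel later on.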

fnegri changed the task status from Open to In Progress. Mar 13 2023, 6:15 PM
fnegri moved this task from Backlog to In progress on the cloud-services-team (FY2022/2023-Q3) board.

Same time using the high-iops volume: 2 hours and 30 minutes for a full sql dump with mydumper.

Setting up a Trove db with a size of 1500 GB is failing with a timeout. @Andrew tried to increase the timeout in https://gerrit.wikimedia.org/r/c/operations/puppet/+/898811 but that didn't work: creation is still failing with "Polling request timed out" after exactly 30 minutes.

(Screenshot attached: Screenshot 2023-03-15 at 16.35.16.png, 572×736 px, 62 KB)

The Trove database was created successfully.

I'm trying a test import from the "mydumper" backup (see comment above):

root@clouddb1002:~# myloader -h g7j5kstuvp4.svc.trove.eqiad1.wikimedia.cloud -u root -p XXXXXX -d /mnt/linkwatcher-backup/backup_linkwatcher_test2/

I cancelled the import as it was very slow: it only imported 6 GB in 1 hour, which means the full database would take about 180 hours.

I tried running SET GLOBAL innodb_flush_log_at_trx_commit=2; but that did not improve things.

After discussing with @Andrew in IRC, I modified the volume type used by Trove, changing it from standard to high-iops.

After changing the volume type, I tried running DROP DATABASE s51230__linkwatcher; and that is also taking a long time. The first attempt yielded a lot of "[Warning] InnoDB: A long semaphore wait:" messages in the logs (docker logs database on linkwatcher-db.trove.eqiad1.wikimedia.cloud), and also one instance of "[ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung."

top shows that mysqld is using 99% of the CPU and 34% of memory while the DROP DATABASE query is running.

Output of SHOW PROCESSLIST while the DROP DATABASE query is running:

MariaDB [(none)]> SHOW PROCESSLIST;
+----+----------+-----------------------------------------------------------+------+---------+------+----------------------------------+-----------------------------------+----------+
| Id | User     | Host                                                      | db   | Command | Time | State                            | Info                              | Progress |
+----+----------+-----------------------------------------------------------+------+---------+------+----------------------------------+-----------------------------------+----------+
|  6 | root     | clouddb1002.clouddb-services.eqiad1.wikimedia.cloud:33128 | NULL | Killed  | 1343 | closing tables                   | DROP DATABASE s51230__linkwatcher |    0.000 |
| 20 | root     | clouddb1002.clouddb-services.eqiad1.wikimedia.cloud:33146 | NULL | Query   |  480 | Waiting for schema metadata lock | DROP DATABASE s51230__linkwatcher |    0.000 |
| 25 | os_admin | localhost                                                 | NULL | Sleep   |   33 |                                  | NULL                              |    0.000 |
| 26 | root     | clouddb1002.clouddb-services.eqiad1.wikimedia.cloud:33164 | NULL | Query   |    0 | starting                         | SHOW PROCESSLIST                  |    0.000 |
+----+----------+-----------------------------------------------------------+------+---------+------+----------------------------------+-----------------------------------+----------+
4 rows in set (0.00 sec)

I restarted the Trove instance (sudo reboot), then restarted it again from Horizon. After that, the logs showed "InnoDB: To roll back: 2 transactions, 1930918 rows". I waited a few minutes, then tried running DROP DATABASE s51230__linkwatcher; and this time it took less than a second.

I restarted the myloader import, and this time (with the high-iops volume) the speed is around 60 MB/sec, which looks quite reasonable. The import should complete in about 5 hours.

I did the maths wrong 🤦🏻: the speed was actually about 6 MB/sec, not 60. It also slowed down a bit over time, and it only imported 12 GB in 1 hour, so we're looking at about 90 hours to complete the import, which is too slow.

New idea: I'll try attaching the Cinder volume containing the db dump directly to the Trove VM, then run myloader there.
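(Roughly, that means something along these lines; server/volume names, the device name, and the way to reach the local MariaDB are illustrative — later comments use the Docker bridge IP 172.17.0.2 for the database container.)

root@cloudcontrol1005:~# openstack server add volume linkwatcher-db linkwatcher-backup

root@linkwatcher-db:~# mkdir -p /mnt/linkwatcher-backup && mount /dev/sdb /mnt/linkwatcher-backup
root@linkwatcher-db:~# myloader -h 172.17.0.2 -u root -p XXXXXX -d /mnt/linkwatcher-backup/backup_linkwatcher_test2/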

Interestingly, attaching the volume containing the db dump to the host linkwatcher-db.trove.eqiad1.wikimedia.cloud and running myloader from there (no network involved) yields a similar speed of about 6MB/sec 🤔

I suspect the bottleneck could be the speed at which myloader can execute the SQL statements contained in the dump.
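(One way to probe that would be to give myloader more parallel threads, along these lines; note that mydumper writes one file per table by default, so the single huge table only benefits from extra loader threads if the dump was chunked with --rows.)

myloader -h g7j5kstuvp4.svc.trove.eqiad1.wikimedia.cloud -u root -p XXXXXX -t 8 -d /mnt/linkwatcher-backup/backup_linkwatcher_test2/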

I tried upgrading the Trove instance from 1 to 4 CPU cores to see if that made a difference, but apparently in doing so I borked the attached Cinder volume (linkwatcher-backup), which weirdly is no longer showing any mountable partitions, which means I lost my backup... (not true: see the comment below). The Trove instance is also now showing status "error" and still reports the old flavor with 1 CPU.

The backup volume is actually ok but the names got swapped (/dev/sdb was pointing to the backup volume, and /dev/sdc to the data volume). I detached and reattached both volumes, and the names are now correct, but the Trove data volume shows as non-formatted (no partitions). I wonder if changing the instance flavor caused Trove to somehow mess up the data volume? Anyway, I can just create a new Trove instance from scratch, but I will do that on Monday. :)

I deleted the misbehaving Trove instance and created a new one, this time with 2 CPUs. Instead of using mydumper/myloader, I'll try doing a partial backup and restore with Mariabackup.

Created a new backup from clouddb1002:

root@clouddb1002:~# mariabackup --backup --target-dir=/mnt/linkwatcher-backup/mariabackup --databases='s51230__linkwatcher' --user root
root@clouddb1002:~# mariabackup --prepare --export --target-dir=/mnt/linkwatcher-backup/mariabackup

Now attempting to restore it in the Trove instance using the IMPORT TABLESPACE method.

The "IMPORT TABLESPACE" worked on all tables _except_ on the biggest one (linkwatcher_linklog).

On all other tables, these steps worked successfully:

CREATE TABLE `table_name` ...; -- create an empty table using the schema from the old db

ALTER TABLE `table_name` ROW_FORMAT=COMPACT;
ALTER TABLE `table_name` DISCARD TABLESPACE;

-- Copy the .ibd and .cfg files from the backup

ALTER TABLE `table_name` IMPORT TABLESPACE;

Setting ROW_FORMAT=COMPACT is required because MariaDB 10.5 defaults to a different ROW_FORMAT, as explained here.
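(For reference, one way to double-check the row format on either side is something like this; SHOW TABLE STATUS reports it in the Row_format field.)

MariaDB [s51230__linkwatcher]> SHOW TABLE STATUS LIKE 'linkwatcher_linklog'\G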

The above commands let me import all tables successfully, except for linkwatcher_linklog, where the IMPORT TABLESPACE command failed with:

ERROR 1808 (HY000): Schema mismatch (Column timestamp precise type mismatch, it's 0X80707 in the table and 0X707 in the tablespace meta file)

Some investigation showed that the error is caused by the old db using a different timestamp type, and I could work around it by running SET GLOBAL mysql56_temporal_format = OFF;

After that, the IMPORT TABLESPACE no longer failed immediately, but after some time it failed with a different error:

ERROR 1815 (HY000): Internal error: Cannot reset LSNs in table `s51230__linkwatcher`.`linkwatcher_linklog` : Data structure corruption

The logs show additional details:

2023-04-05 11:47:58 566 [Warning] InnoDB: ./s51230__linkwatcher/linkwatcher_linklog.ibd: Page 8002688 at offset 131116040192 looks corrupted.
2023-04-05 11:47:58 566 [Note] InnoDB: Discarding tablespace of table `s51230__linkwatcher`.`linkwatcher_linklog`: Data structure corruption
2023-04-05 11:47:58 566 [ERROR] mysqld: Index for table 'linkwatcher_linklog' is corrupt; try to repair it

I believe this is also caused by the different MariaDB version on the new server (10.5 instead of 10.1), but fixing it seems more difficult...

@Beetstra @TheresNoTime as you can see from the comments in this task, I have spent a significant amount of time trying to migrate the Linkwatcher database in a painless way, but I am almost out of ideas and we can no longer postpone the migration of all the other databases that are currently on the same host (clouddb1001).

The migration of the other databases is planned for today at 17:00 UTC (T333471). I am doing some last minute attempts to migrate Linkwatcher data to Trove, but be prepared for a Linkwatcher outage. The old database with all the Linkwatcher data will still be available but will be set to read-only. It could take several days before I get all the Linkwatcher data migrated to Trove in the "slow way", using myloader as described in previous comments.

I tried creating a new backup using mariabackup, this time from the primary clouddb1001. Then I started a MariaDB 10.5 Docker container using the mariabackup folder as the data folder, and tried doing IMPORT TABLESPACE on the linkwatcher_linklog table. The operation failed with exactly the same error I saw yesterday:

2023-04-06 13:44:37 19 [Warning] InnoDB: ./s51230__linkwatcher/linkwatcher_linklog.ibd: Page 8002688 at offset 131116040192 looks corrupted.
2023-04-06 13:44:37 19 [Note] InnoDB: Discarding tablespace of table `s51230__linkwatcher`.`linkwatcher_linklog`: Data structure corruption
2023-04-06 13:44:39 19 [ERROR] mysqld: Index for table 'linkwatcher_linklog' is corrupt; try to repair it

I suspect there is data corruption on the primary, only affecting an index that is not used often so it never causes an error on the primary.

I considered dropping and recreating indexes, but even dropping takes forever (it triggers a COPY operation). I also tried running OPTIMIZE TABLE but that also takes forever (multiple days).
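(One way to confirm that the corruption is limited to a secondary index would be a full table check along these lines, though on a table this size it would itself run for a very long time.)

MariaDB [s51230__linkwatcher]> CHECK TABLE linkwatcher_linklog EXTENDED;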

One last option: creating a non-Trove Nova instance in the linkwatcher project and starting a new copy of MariaDB using the mariabackup folder as the data folder. That seems to yield a usable database, although it doesn't fix the data corruption in the index; the indexes will still need to be rebuilt at some point.

I will try this as a final attempt to avoid a multiple-day read-only time for Linkwatcher.

After further discussion with @Andrew and @Ladsgroup, I decided to discard the idea of using the mariabackup folder without running IMPORT TABLESPACE, and I am instead trying to restore a mydumper backup to Trove without the indexes (they can be created later). That should make the import with myloader much faster (but still probably >24 hours).

I have started a new mydumper backup from clouddb1002:

root@clouddb1002:~# mydumper --database s51230__linkwatcher -o /mnt/linkwatcher-backup/linkwatcher_mydumper_20230406

This should complete overnight, I'll try loading that backup into Trove tomorrow.

I'm importing the backup into Trove; based on the current speed I expect it to take at least 10 hours, but hopefully less than 24.

I created a script import.sh that runs mysql -h 172.17.0.2 -p $PASSWORD s51230__linkwatcher < $FILENAME for each of the files generated by mydumper (a rough sketch of the script is below). I'm running this script from a temporary Docker container I started in the Trove instance with:

docker run -it --rm -v /mnt/linkwatcher-backup:/linkwatcher-backup quay.io/wikimedia-cloud-services/mariadb:10.5.10 bash
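(The script is roughly along these lines; the file ordering, paths, and password handling are illustrative rather than the exact script.)

#!/bin/bash
# import.sh -- sketch of the import loop (illustrative)
set -e
BACKUP_DIR=/linkwatcher-backup/linkwatcher_mydumper_20230406

# Load the table definitions first (with the indexes commented out, see below)...
for FILENAME in "$BACKUP_DIR"/s51230__linkwatcher.*-schema.sql; do
    echo "Importing $FILENAME"
    mysql -h 172.17.0.2 -p"$PASSWORD" s51230__linkwatcher < "$FILENAME"
done

# ...then the data files produced by mydumper.
for FILENAME in "$BACKUP_DIR"/s51230__linkwatcher.*.sql; do
    [[ "$FILENAME" == *-schema.sql ]] && continue
    echo "Importing $FILENAME"
    mysql -h 172.17.0.2 -p"$PASSWORD" s51230__linkwatcher < "$FILENAME"
done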

I manually modified the schema file s51230__linkwatcher.linkwatcher_linklog-schema.sql generated by mydumper, commenting out the indexes to make the import faster:

/*!40101 SET NAMES binary*/;
/*!40014 SET FOREIGN_KEY_CHECKS=0*/;

CREATE TABLE `linkwatcher_linklog` (
  `ID` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `edit_id` int(11) DEFAULT NULL,
  `lang` tinytext,
  `pagename` text CHARACTER SET utf8 COLLATE utf8_bin,
  `namespace` tinytext,
  `diff` text,
  `revid` bigint(20) unsigned DEFAULT NULL,
  `oldid` bigint(20) unsigned DEFAULT NULL,
  `wikidomain` tinytext,
  `user` text CHARACTER SET utf8 COLLATE utf8_bin,
  `fullurl` text CHARACTER SET utf8 COLLATE utf8_bin,
  `domain` text CHARACTER SET utf8 COLLATE utf8_bin,
  `resolved` tinytext,
  `ip` int(11) DEFAULT NULL,
  `date` date DEFAULT NULL,
  `time` time DEFAULT NULL,
  UNIQUE KEY `ID` (`ID`)/*,
  KEY `user` (`user`(128)),
  KEY `resolved` (`resolved`(128)),
  KEY `pagename` (`pagename`(128)),
  KEY `revid` (`revid`),
  KEY `diff` (`diff`(255)),
  KEY `wikidomain` (`wikidomain`(10)),
  KEY `domain` (`domain`(128))*/
) ENGINE=InnoDB AUTO_INCREMENT=1602350507 DEFAULT CHARSET=latin1;

Looking at the indexes, they clearly need a redesign. Depending on what the queries look like, you might consolidate some into one index and drop some others for good. Just saying.

The import completed on 8 Apr at 00:40 UTC. It should be possible to point the Linkwatcher tool to use the new Trove database, though until the indexes are added it might be very slow. I would discuss this with the tool maintainers; as @Ladsgroup pointed out, maybe we don't need all of those indexes?

My suggestion would be to start the tool first, then identify the queries that are too slow and discuss together which indexes should be added. /cc @Beetstra @TheresNoTime

Instructions on how to connect to Trove are available here and should work both from Toolforge and Cloud VPS. Let me know if you need help, or ask in #wikimedia-cloud in IRC. :)
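(For example, from Toolforge or a Cloud VPS instance, a connection test would look roughly like this, using the Trove instance hostname and the tool's credentials; the hostname below is a placeholder and the user is taken from the tool's existing connection string.)

mysql -h <trove-instance>.svc.trove.eqiad1.wikimedia.cloud -u s51229 -p s51230__linkwatcher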

Suggestion: before bringing this db back into rotation, drop the diff column and set fullurl to varchar(511).

Dropping the diff column may cause problems with the current code, which I do not have time to change at the moment.

If I can see the source code, I might be able to make a patch, but I don't know much about it, so it might be a lot of work.

Change 909695 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] ToolsDB: remove replication filters

https://gerrit.wikimedia.org/r/909695

I checked what's happening in the new database and I see lots of queries taking a long time to complete (or not completing at all):

SELECT COUNT(*) as total_record FROM linkwatcher_linklog WHERE user='....'
SELECT COUNT(*) as total_record FROM linkwatcher_linklog WHERE domain='....'

Adding indexes on those two columns (user and domain) might improve things, even if it will take a long time to create the index.

It also looks like no new information is being added to https://meta.wikimedia.org/wiki/User:COIBot/LinkReports since Apr 5 (when we disabled the old ToolsDB server).

If that's the only condition (there is nothing after WHERE user='....'), sure, sounds good to me.

If the source code is available somewhere, I can take a look at fixing the diff column part, depending on how complicated it's going to be.

@Ladsgroup I found this but I'm not sure it's up to date: https://meta.wikimedia.org/wiki/User:COIBot/COIBot

It also doesn't seem to include any query like the ones I posted above (SELECT COUNT(*) as total_record...).

If you ssh into login.toolforge.org and look at /data/project/linkwatcher/ there are two Perl files that include similar queries (LinkAnalyser2.pl and LinkReporter2.pl).

@Beetstra is the source code available on any public website?

I set a global query timeout of 10 seconds to prevent the database from getting overloaded by queries that take a very long time to complete.

MariaDB [s51230__linkwatcher]> SET GLOBAL max_statement_time=10;

Then I started a CREATE INDEX command on the user column, disabling the timeout only for that session:

MariaDB [s51230__linkwatcher]> SET SESSION max_statement_time=0;
MariaDB [s51230__linkwatcher]> CREATE INDEX user ON linkwatcher_linklog(user);
Stage: 1 of 2 'copy to tmp table'  0.135% of stage done

This is running in a root tmux session in linkwatcher-db.trove.eqiad1.wikimedia.cloud and will take at least a few hours.

Change 909695 merged by FNegri:

[operations/puppet@production] ToolsDB: remove replication filters

https://gerrit.wikimedia.org/r/909695

The index creation failed with the following error:

ERROR 1206 (HY000): The total number of locks exceeds the lock table size

I cannot understand why creating an index would lead to that error... That error is usually linked to innodb_buffer_pool_size being too small, but it's already set quite high at 1.25 GB:

| innodb_buffer_pool_size             | 1342177280     |
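(In MariaDB 10.5 the buffer pool can be resized at runtime, so one untested option, if the instance has memory to spare, would be to bump it before retrying:)

MariaDB [(none)]> SET GLOBAL innodb_buffer_pool_size = 2 * 1024 * 1024 * 1024; -- 2 GB; the resize happens in the background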

I'm trying one more time with the same CREATE INDEX command, to check if I get the same error.

An alternative might be using pt-online-schema-change.

The problem is reproducible: it failed again with the same error:

ERROR 1206 (HY000): The total number of locks exceeds the lock table size

I am now trying pt-online-schema-change:

apt install percona-toolkit
pt-online-schema-change --execute -h localhost --ask-pass --alter "ADD INDEX user(user),ADD INDEX domain(domain)" D=s51230__linkwatcher,t=linkwatcher_linklog
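# pt-online-schema-change copies the table in chunks into a new table
# (_linkwatcher_linklog_new) that already has the new indexes, keeps it in
# sync with triggers, and swaps the two tables at the end -- avoiding the
# lock pressure that made the plain CREATE INDEX fail, at the cost of a full copy.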

pt-online-schema-change ran for several days, but eventually failed with:

...

Copying `s51230__linkwatcher`.`linkwatcher_linklog`:  78% 23:24:05 remain
Copying `s51230__linkwatcher`.`linkwatcher_linklog`:  78% 23:23:57 remain
Copying `s51230__linkwatcher`.`linkwatcher_linklog`:  78% 23:23:58 remain
2023-05-01T21:20:12 Dropping triggers...
2023-05-01T21:20:12 Dropped triggers OK.
2023-05-01T21:20:12 Dropping new table...
2023-05-01T21:22:12 Dropped new table OK.
`s51230__linkwatcher`.`linkwatcher_linklog` was not altered.
2023-05-01T21:20:12 Error copying rows from `s51230__linkwatcher`.`linkwatcher_linklog` to `s51230__linkwatcher`.`_linkwatcher_linklog_new`: Error executing EXPLAIN SELECT /*!40001 SQL_NO_CACHE */ `id` FROM `s51230__linkwatcher`.`linkwatcher_linklog` FORCE INDEX(`id`) WHERE ((`id` >= ?)) ORDER BY `id` LIMIT ?, 2 /*next chunk boundary*/: DBD::mysql::st execute failed: Query execution was interrupted (max_statement_time exceeded) [for Statement "EXPLAIN SELECT /*!40001 SQL_NO_CACHE */ `id` FROM `s51230__linkwatcher`.`linkwatcher_linklog` FORCE INDEX(`id`) WHERE ((`id` >= ?)) ORDER BY `id` LIMIT ?, 2 /*next chunk boundary*/" with ParamValues: 0=1162334414, 1=115] at /usr/bin/pt-online-schema-change line 11581, <STDIN> line 1.

I'm now retrying with an additional option --set-vars max_statement_time=0 to bypass the global variable that I set at 10 (see previous comments):

root@linkwatcher-db:/var/lib/mysql/data# pt-online-schema-change --execute -h localhost --ask-pass --set-vars max_statement_time=0 --alter "ADD INDEX user(user),ADD INDEX domain(domain)" D=s51230__linkwatcher,t=linkwatcher_linklog

Some good news: insertion of new links seems to be working fine; new records are constantly being added to the linkwatcher_linklog table:

MariaDB [s51230__linkwatcher]> SELECT * FROM linkwatcher_linklog ORDER BY ID DESC LIMIT 1\G
*************************** 1. row ***************************
        ID: 1603752677
 timestamp: 2023-05-02 12:02:28
...

After one week, pt-online-schema-change is still running. The output has been stuck at this point since yesterday:

...

Copying `s51230__linkwatcher`.`linkwatcher_linklog`:  99% 00:58 remain
Copying `s51230__linkwatcher`.`linkwatcher_linklog`:  99% 00:43 remain
Copying `s51230__linkwatcher`.`linkwatcher_linklog`:  99% 00:26 remain
Copying `s51230__linkwatcher`.`linkwatcher_linklog`:  99% 00:09 remain

But the ibd file keeps on growing, so I think the index creation is still in progress:

root@linkwatcher-db:~# ls -lhSr /var/lib/mysql/data/s51230__linkwatcher/ |tail -n 2
-rw-rw---- 1 database database 489G May  9 12:21 linkwatcher_linklog.ibd
-rw-rw---- 1 database database 614G May  9 12:21 _linkwatcher_linklog_new.ibd

Aaaaaand.... the index creation completed successfully:

...
Copying `s51230__linkwatcher`.`linkwatcher_linklog`:  99% 00:09 remain
2023-05-09T13:35:46 Copied rows OK.
2023-05-09T13:35:46 Analyzing new table...
2023-05-09T13:35:47 Swapping tables...
2023-05-09T13:35:55 Swapped original and new tables OK.
2023-05-09T13:35:55 Dropping old table...
2023-05-09T13:37:17 Dropped old table `s51230__linkwatcher`.`_linkwatcher_linklog_old` OK.
2023-05-09T13:37:17 Dropping triggers...
2023-05-09T13:37:17 Dropped triggers OK.
Successfully altered `s51230__linkwatcher`.`linkwatcher_linklog`.

The ibd file with the indexes is using 616G, compared to 489G with no indexes, and compared to 1.1T with all the indexes we had before the migration.

We now have indexes on the user and domain columns. We need to determine whether any other indexes should be added.

A quick glance at SHOW PROCESSLIST indicates that the only frequent queries timing out (taking more than 10 seconds to complete) are of this kind:

SELECT COUNT(*) as total_record FROM linkwatcher_linklog WHERE resolved='18.165.98.65'

So we might need to add another index on the resolved column.

It looks like, now that the table has been rewritten, MariaDB is able to add a new index "online" without resorting to the "COPY" method, so I don't think I need to use pt-online-schema-change anymore.

I have started this query instead, to add the additional index on the resolved column:

CREATE INDEX resolved ON linkwatcher_linklog(resolved)

I expect it to take a long time (hours, maybe days) but I hope it will be faster than pt-online-schema-change.
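(To be sure MariaDB really uses the in-place algorithm rather than silently falling back to a table copy, the index could also be added with the algorithm stated explicitly; this is just a sketch of the equivalent statement, the plain CREATE INDEX above is what is actually running.)

MariaDB [s51230__linkwatcher]> ALTER TABLE linkwatcher_linklog ADD INDEX resolved (resolved), ALGORITHM=INPLACE, LOCK=NONE;
-- errors out instead of falling back to ALGORITHM=COPY if an in-place build isn't possible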

The query failed because MariaDB is using /var/tmp as the temporary directory, and it doesn't have enough space.

MariaDB [s51230__linkwatcher]> CREATE INDEX resolved ON linkwatcher_linklog(resolved);
ERROR 1878 (HY000): Temporary file write failure

MariaDB [s51230__linkwatcher]> SHOW VARIABLES LIKE 'tmpdir';
+---------------+----------+
| Variable_name | Value    |
+---------------+----------+
| tmpdir        | /var/tmp |
+---------------+----------+

I'll try again after changing the tmpdir.

I modified innodb_tmpdir because it can be set dynamically, while tmpdir can only be set at start time.

SET SESSION innodb_tmpdir='/var/lib/mysql/tmp';
CREATE INDEX resolved ON linkwatcher_linklog(resolved);

It's probably a good idea to find a way to override tmpdir in the default settings for all Trove databases, so I created T336285.

This was very fast compared to the previous index creation:

MariaDB [s51230__linkwatcher]> CREATE INDEX resolved ON linkwatcher_linklog(resolved);

Query OK, 0 rows affected (3 hours 53 min 33.781 sec)
Records: 0  Duplicates: 0  Warnings: 0

I no longer see any queries timing out. @Beetstra @TheresNoTime can you confirm if the tool is now working correctly or not?

fnegri changed the task status from In Progress to Stalled. May 12 2023, 12:03 PM
fnegri moved this task from In progress to Blocked on the cloud-services-team (FY2022/2023-Q4) board.


All working afaict (now that T334491: Update Linkwatcher tool to use new Trove database is resolved)

Now that @TheresNoTime has updated the connection string in LinkReporter.pl (T334491) I see some new queries timing out (max_statement_time is still set to 10 seconds to prevent any long query from hogging the server). The queries timing out are all similar to the following one:

SELECT * FROM linkwatcher_linklog WHERE diff='https://commons.wikimedia.org/w/index.php?diff=260677260&oldid=260653921&rcid=965761270';

Adding an index to the diff column doesn't seem like the best way to go; modifying the WHERE to use the revid and wikidomain columns (and adding indexes to those) would be much more efficient.
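(As an illustration, the query above could become something along these lines, assuming wikidomain stores the wiki's host and revid the revision id from the diff= parameter; the exact stored formats would need to be checked.)

SELECT * FROM linkwatcher_linklog WHERE wikidomain='commons.wikimedia.org' AND revid=260677260;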

I understand that we don't have many people who are interested in modifying the current Perl scripts, so unless I hear from anyone in the next week, I will try adding an index to the diff column to see if that can get the current queries under 10 seconds, at the cost of using a lot more disk space.

Someone can rework the query not to use diff = 'X'. It can be turned into wiki='foo' and so on. The data already exists in other columns. I might take a look, but it's Perl and I have never done anything in Perl.

Today seeing:

2023-6-08 @ 13:49:06 - ------- COIBot starting ----------------
* Startup routinesDBI connect('s51230__linkwatcher;aw3ejyogbmc.svc.trove.eqiad1.wikimedia.cloud','s51229',...) failed:
  Can't connect to MySQL server on 'aw3ejyogbmc.svc.trove.eqiad1.wikimedia.cloud'

Has anything changed?

The Trove VM apparently crashed after some OOM errors. I can no longer SSH to it; I'll try rebooting it.

[5356951.128880] Out of memory: Killed process 343724 (mysqld) total-vm:3545328kB, anon-rss:2185976kB, file-rss:0kB, shmem-rss:0kB, UID:1001 pgtables:4568kB oom_score_adj:0
[5443708.006181] Out of memory: Killed process 343963 (mysqld) total-vm:3554720kB, anon-rss:2184752kB, file-rss:0kB, shmem-rss:0kB, UID:1001 pgtables:4580kB oom_score_adj:0
[5493238.127064] Out of memory: Killed process 345284 (mysqld) total-vm:3538328kB, anon-rss:2184340kB, file-rss:0kB, shmem-rss:0kB, UID:1001 pgtables:4564kB oom_score_adj:0
[5493279.323856] Out of memory: Killed process 345769 (mysqld) total-vm:3538328kB, anon-rss:2184340kB, file-rss:0kB, shmem-rss:0kB, UID:1001 pgtables:4564kB oom_score_adj:0
[5493335.464370] Out of memory: Killed process 345770 (mysqld) total-vm:3538328kB, anon-rss:2184340kB, file-rss:0kB, shmem-rss:0kB, UID:1001 pgtables:4564kB oom_score_adj:0
[5493382.208297] Out of memory: Killed process 345771 (mysqld) total-vm:3538328kB, anon-rss:2184340kB, file-rss:0kB, shmem-rss:0kB, UID:1001 pgtables:4564kB oom_score_adj:0
[5493387.810932] Out of memory: Killed process 290663 (mysqld) total-vm:3538328kB, anon-rss:2184684kB, file-rss:4kB, shmem-rss:0kB, UID:1001 pgtables:4580kB oom_score_adj:0
[5557289.099027] Out of memory: Killed process 586 (guest-agent) total-vm:3897664kB, anon-rss:3761344kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:7524kB oom_score_adj:0

Rebooted the VM, but then the Docker container was not restarting correctly:

2023-06-08 14:57:59 0 [ERROR] Can't start server : Bind on unix socket: Permission denied
2023-06-08 14:57:59 0 [ERROR] Do you already have another mysqld server running on socket: /var/run/mysqld/mysqld.sock ?

I ran docker stop database, then restarted the container by clicking "Restart instance" on the Database page in Horizon.

It looks like it's now running correctly again. I manually reset max_statement_time to 10 seconds:

MariaDB [(none)]> SET GLOBAL max_statement_time=10;


Confirmed working, thank you :-)

FYI, the thing with Docker not starting is the upstream bug https://storyboard.openstack.org/#!/story/2010599, which could use a comment or two in support.

Seeing some:

DBD::mysql::st bind_columns failed: Statement has no result columns to bind (perhaps you need to successfully call execute first, or again) at LinkSaver.pl line 1775.
DBD::mysql::st execute failed: Query execution was interrupted (max_statement_time exceeded) at LinkSaver.pl line 1764.
DBD::mysql::st bind_columns failed: Statement has no result columns to bind (perhaps you need to successfully call execute first, or again) at LinkSaver.pl line 1775.
DBD::mysql::st execute failed: Query execution was interrupted (max_statement_time exceeded) at LinkSaver.pl line 1764.
DBD::mysql::st bind_columns failed: Statement has no result columns to bind (perhaps you need to successfully call execute first, or again) at LinkSaver.pl line 1775.

I'm going to try to get my head around Perl a bit more, but to be honest this looks like more of a SQL (query) issue?

In the MariaDB logs I see a bunch of:

2023-06-13 13:44:52 10778 [Warning] Sort aborted, host: coibot.linkwatcher.eqiad1.wikimedia.cloud, user: s51229, thread: 10778, query: SELECT * FROM linkwatcher_linklog WHERE diff='https://ru.wikipedia.org/w/index.php?diff=119877578&oldid=119061739' ORDER BY `timestamp` DESC

but the same errors were present last week, and they relate to the missing index on the diff column that we discussed above... I don't see anything different compared to last week. When did the LinkSaver errors start to appear?

Looking again, I see the errors in LinkSaver.pl clearly point to max_statement_time exceeded, so I think they also relate to the missing index on the diff column that is making those queries slow...

If nobody finds a way to modify the queries in the Perl code, I can try adding the index on diff; it's ugly, but it might fix things for a while.

As nobody seems to be able to find their way through the Perl code, I am adding the index on the diff column, hoping that it will resolve the remaining issue with queries timing out at 10 seconds:

MariaDB [s51230__linkwatcher]> SET SESSION max_statement_time=0;
MariaDB [s51230__linkwatcher]> CREATE INDEX diff ON linkwatcher_linklog(diff);

This is running in a root user tmux session and will take a few hours.

The index creation failed with

ERROR 1878 (HY000): Temporary file write failure

This is again T336285 and I forgot the fix I did for the other index (see this comment).

Retrying with a similar fix:

MariaDB [s51230__linkwatcher]> SET innodb_tmpdir='/var/lib/mysql/tmp';
MariaDB [s51230__linkwatcher]> CREATE INDEX diff ON linkwatcher_linklog(diff);

The index was created successfully:

MariaDB [s51230__linkwatcher]> CREATE INDEX diff ON linkwatcher_linklog(diff);
Query OK, 0 rows affected, 1 warning (8 hours 42 min 28.207 sec)
Records: 0  Duplicates: 0  Warnings: 1

That seems to have fixed the query timeouts, but I now see lots of warnings like the following one:

2023-07-03 12:12:32 57224 [Warning] Aborted connection 57224 to db: 's51230__linkwatcher' user: 's51229' host: 'coibot.linkwatcher.eqiad1.wikimedia.cloud' (Got timeout reading communication packets)

Using SHOW PROCESSLIST I verified that connections get terminated when they remain idle for exactly 120 seconds, which matches the value of wait_timeout in /etc/mysql/my.cnf. For some reason Trove defaults to 120 instead of the MariaDB default which is 28800 (8 hours). I have increased the value to 28800:

MariaDB [s51230__linkwatcher]> SET GLOBAL wait_timeout=28800;

I will add that value and also max_statement_time to a custom Trove database configuration to persist them even in case of a server restart.

Unfortunately, Trove doesn't let me set max_statement_time in a configuration group (I think it's a Trove bug/limitation):

root@cloudcontrol1005:~# openstack database configuration create linkwatcher-db '{"max_statement_time": 10, "wait_timeout": 28800}' --datastore mariadb --datastore-version "10.5.10"
The configuration parameter max_statement_time is not supported for this datastore: mariadb 10.5.10. (HTTP 422)

At least I managed to set wait_timeout:

root@cloudcontrol1005:~# openstack database configuration create linkwatcher-db '{"wait_timeout": 28800}' --datastore mariadb --datastore-version "10.5.10"
root@cloudcontrol1005:~# openstack database configuration attach 7dd3189c-2554-414a-8b3b-a1b220edaaea 2a526b92-8540-4907-99c8-091535a8e2b9

Found another (hopefully the last!) batch of slow queries, on a different table:

SELECT (webbug) FROM linkwatcher_webbug WHERE domain='.....'

I added an index there:

MariaDB [s51230__linkwatcher]> SET SESSION max_statement_time=0;
MariaDB [s51230__linkwatcher]> SET innodb_tmpdir='/var/lib/mysql/tmp';
MariaDB [s51230__linkwatcher]> CREATE INDEX domain ON linkwatcher_webbug(domain);

Query OK, 0 rows affected, 1 warning (4 min 20.570 sec)
Records: 0  Duplicates: 0  Warnings: 1

I have enabled slow query logging with a threshold of 3 seconds:

MariaDB [s51230__linkwatcher]> SET GLOBAL slow_query_log=1;
MariaDB [s51230__linkwatcher]> SET GLOBAL long_query_time=3.0;

If I don't see any slow queries being logged in the next 12 hours, I will finally mark this task as "Resolved".

A few slow queries were logged, this is the output of mariadb-dumpslow: https://phabricator.wikimedia.org/P49500

However, the slowest ones are using columns where an index is already present, so I think that's the best performance we can get with the current huge table size.

Some queries are still taking more than 10 seconds and getting killed because of max_statement_time=10. Looking at /var/lib/mysql/data/742b1e8ff15c-slow.log we have on average 10 queries timing out per hour.

max_statement_time could be increased from 10 to 20 or more... but at the risk of slowing down other operations on the server, so I will leave it at 10 for now. Please note that I haven't found a way to set this value permanently in Trove, so a server restart would reset it to 0 (no timeout).

Performance is likely to degrade further if the table keeps growing, so I'm resolving this ticket, but T224154 is still very much valid.