
[toolsdb] test creating a new replica host
Closed, Resolved (Public)

Description

I wrote some docs on how to create a new replica host from scratch in https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Toolsdb#Creating_a_new_replica_host

I based it on the info in T329521, but some details are missing and we should really test it by creating a new replica and replacing the current one.

Event Timeline

JJMC89 moved this task from Backlog to ToolsDB on the Data-Services board.
fnegri renamed this task from toolsdb: test creating a new replica host to [toolsdb] test creating a new replica host. Aug 22 2023, 3:17 PM
fnegri changed the task status from Open to In Progress. Sep 5 2023, 1:48 PM

I'd like to also resolve T334929: [toolsdb] merge primary and secondary puppet profiles as that slightly simplifies the creation of a new replica.

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-21T15:23:19Z] <fnegri@cloudcumin1001> START - Cookbook wmcs.vps.create_instance_with_prefix with prefix 'tools-db' (T344717)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-21T15:24:05Z] <fnegri@cloudcumin1001> END (FAIL) - Cookbook wmcs.vps.create_instance_with_prefix (exit_code=99) with prefix 'tools-db' (T344717)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-21T15:45:38Z] <fnegri@cloudcumin1001> START - Cookbook wmcs.vps.create_instance_with_prefix with prefix 'tools-db' (T344717)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-21T15:46:40Z] <fnegri@cloudcumin1001> END (FAIL) - Cookbook wmcs.vps.create_instance_with_prefix (exit_code=99) with prefix 'tools-db' (T344717)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-21T16:03:03Z] <fnegri@cloudcumin1001> START - Cookbook wmcs.vps.create_instance_with_prefix with prefix 'tools-db' (T344717)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-21T16:16:14Z] <fnegri@cloudcumin1001> END (PASS) - Cookbook wmcs.vps.create_instance_with_prefix (exit_code=0) with prefix 'tools-db' (T344717)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-01-17T18:13:59Z] <fnegri@cloudcumin1001> START - Cookbook wmcs.openstack.quota_increase (T344717)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-01-17T18:14:04Z] <fnegri@cloudcumin1001> END (FAIL) - Cookbook wmcs.openstack.quota_increase (exit_code=99) (T344717)

Mentioned in SAL (#wikimedia-cloud) [2024-01-17T18:16:43Z] <dhinus> increase volume quotas for toolsdb T344717

I'm setting up tools-db-3 to be a new replica, and improving the wiki as I do it.

The procedure worked (with some adjustments I already added to the wiki), and tools-db-3 started replicating from tools-db-1!

To be overly cautious as I was testing the procedure, I created tools-db-3 using a snapshot of the existing replica (tools-db-2) instead of using a snapshot of the primary. I was worried that generating a Cinder snapshot of the primary might cause some issues (indeed there was an issue with snapshots: T356904: [cinder] [toolsdb] Deleting snapshot does not work).

Now that I proved the procedure works, I have stopped MariaDB on tools-db-3 and I can repeat the procedure with a snapshot from the primary (tools-db-1). Using a snapshot from the primary, I will also be able to clean the list of tables that are excluded from replication (the table s51698__yetkin.wanted_items is currently not being replicated, see T344420).

Taking a Cinder snapshot while MariaDB is running seems to work (MariaDB will fix corrupted tables when restoring the snapshot), but the official MariaDB docs say that "During the snapshot, the table must be locked."

It's probably wiser to follow that advice and either use FLUSH TABLES WITH READ LOCK or BACKUP STAGE before taking the Cinder snapshot. If we had an existing replica host that we trust (we currently don't, as the replica is missing one table), the best procedure would be to create the snapshot from the replica host, where we can stop MariaDB completely before taking the Cinder snapshot.
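A minimal sketch of the FLUSH TABLES WITH READ LOCK approach, assuming a Python DB-API connection to MariaDB. The `take_snapshot` callable is hypothetical and stands in for whatever triggers the Cinder snapshot; it is not an existing cookbook or API. The lock belongs to the session that took it, so the same connection has to stay open until the snapshot has been created. Recording `gtid_binlog_pos` while the lock is held is an illustrative extra, not a step confirmed by the wiki procedure.

```python
from contextlib import contextmanager


@contextmanager
def read_locked(conn):
    """Hold a global read lock on this session for the duration of the block."""
    cur = conn.cursor()
    cur.execute("FLUSH TABLES WITH READ LOCK")
    try:
        yield cur
    finally:
        # Always release the lock, even if the snapshot step raised.
        cur.execute("UNLOCK TABLES")
        cur.close()


def snapshot_with_lock(conn, take_snapshot):
    """Take a snapshot while writes are blocked.

    `take_snapshot` is a hypothetical callable that creates the volume
    snapshot and returns its identifier.
    """
    with read_locked(conn) as cur:
        # Assumption: note the GTID position the snapshot corresponds to.
        cur.execute("SELECT @@global.gtid_binlog_pos")
        gtid_pos = cur.fetchone()[0]
        snapshot_id = take_snapshot()
    return gtid_pos, snapshot_id
```

Writes on the primary are blocked for as long as the lock is held, so the snapshot call should return quickly; stopping MariaDB on a trusted replica avoids that constraint entirely.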

I also realized that we're currently changing the value of gtid_domain_id every time we create a new tools-db host. We can probably simplify that: T357341: [toolsdb] set gtid_domain_id to 0.

fnegri changed the task status from In Progress to Stalled. Mar 5 2024, 4:25 PM
fnegri changed the status of subtask T356904: [cinder] [toolsdb] Deleting snapshot does not work from In Progress to Stalled.
fnegri changed the task status from Stalled to In Progress. Mar 26 2024, 2:13 PM

Change #1015580 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] R:wmcs::db::toolsdb: remove unnecessary config

https://gerrit.wikimedia.org/r/1015580

After a few attempts, the procedure at https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Toolsdb#Creating_a_new_replica_host should now list all the required steps. I have used it to create tools-db-3, which is currently replicating from tools-db-1.

The new replica is currently lagging behind the primary because it started from a snapshot I took yesterday. I expect it to catch up in a few hours.
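One way to check whether a replica has caught up is to inspect the row returned by SHOW SLAVE STATUS. The sketch below assumes that row has already been fetched as a dict (for example with a dict-style cursor); the function name and the `max_lag_seconds` threshold are illustrative and not part of the wiki procedure. ToolsDB monitoring may measure lag differently.

```python
def replica_caught_up(status, max_lag_seconds=0):
    """Return True if a SHOW SLAVE STATUS row looks healthy and caught up.

    `status` is a dict keyed by the column names of SHOW SLAVE STATUS.
    """
    threads_ok = (
        status.get("Slave_IO_Running") == "Yes"
        and status.get("Slave_SQL_Running") == "Yes"
    )
    # Seconds_Behind_Master is NULL (None) when replication is not running.
    lag = status.get("Seconds_Behind_Master")
    return threads_ok and lag is not None and int(lag) <= max_lag_seconds
```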

Mentioned in SAL (#wikimedia-cloud-feed) [2024-04-02T12:38:44Z] <fnegri@cloudcumin1001> START - Cookbook wmcs.vps.remove_instance for instance tools-db-2 (T344717)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-04-02T12:39:34Z] <fnegri@cloudcumin1001> END (PASS) - Cookbook wmcs.vps.remove_instance (exit_code=0) for instance tools-db-2 (T344717)

Change #1015580 merged by FNegri:

[operations/puppet@production] R:wmcs::db::toolsdb: remove unnecessary config

https://gerrit.wikimedia.org/r/1015580

The new replica tools-db-3 is now in sync with the primary. I deleted the old replica tools-db-2.