Page MenuHomePhabricator

Move all tools from clouddb1001 to tools-db-1
Closed, ResolvedPublic

Description

tools-db-1.tools.eqiad1.wikimedia.cloud will become the new ToolsDB primary, replacing clouddb1001.clouddb-services.eqiad1.wikimedia.cloud.

We plan to make the switch on Apr, 6 2023 at 17:00 UTC.

The following checklist is a work-in-progress and contains all the steps that need to be performed. We can use the comments below to discuss.

  • Make clouddb1001 read-only
SET SESSION sql_log_bin=0;
FLUSH TABLES WITH READ LOCK;
SET GLOBAL read_only = 1;
  • Point the DNS tools.db.svc.eqiad.wmflabs to tools-db-1 (via wikireplica_dns.yaml)
$ # Merge  https://gerrit.wikimedia.org/r/c/operations/puppet/+/906053
$ ssh cloudcontrol1005.wikimedia.org
$ source <(sudo cat ~root/novaenv.sh)
$ wmcs-wikireplica-dns
  • Check that tools are starting to connect to the new host
  • Promote tools-db-1 to primary, by stopping replication and enabling read-write mode (official docs)
STOP ALL SLAVES;
RESET SLAVE ALL;
SHOW MASTER STATUS;
SET GLOBAL read_only = 0;
  • move floating ip (185.15.56.15) to tools-db-1 OR update the floating IP in puppet (used by maintain-dbusers) (Moving IPs between projects isn't supported, so my current plan is to release the ip from clouddb-services and then allocate a whole lot of new ones in 'tools' and hope to get lucky.)
  • Stop mariadb on clouddb1001 to close all remaining connections to the old db
  • Change puppet class in horizon from role::wmcs::db::toolsdb_secondary to role::wmcs::db::toolsdb_primary (this doesn't change much as the two roles are almost identical and should probably be merged into one).

Event Timeline

JJMC89 moved this task from Backlog to ToolsDB on the Data-Services board.

tools.db.svc.eqiad.wmflabs and tools-db.tools.eqiad.wmflabs should be CNAMEs for tools.db.svc.wikimedia.cloud to support legacy naming schemes. It appears that currently tools.db.svc.eqiad.wmflabs is an A record based on my dig tools.db.svc.eqiad.wmflabs lookup. I think this might be fixable by changing the wikireplica_dns.yaml config in Puppet and running wmcs-wikireplica-dns.

tools.db.svc.eqiad.wmflabs and tools-db.tools.eqiad.wmflabs should be CNAMEs for tools.db.svc.wikimedia.cloud to support legacy naming schemes. It appears that currently tools.db.svc.eqiad.wmflabs is an A record based on my dig tools.db.svc.eqiad.wmflabs lookup. I think this might be fixable by changing the wikireplica_dns.yaml config in Puppet and running wmcs-wikireplica-dns.

The canonical tools.db.svc.wikimedia.cloud name does not seem to be managed by wmcs-wikireplica-dns, but that script does manage the one-version older tools.db.svc.eqiad.wmflabs A record and the tools-db.tools.eqiad.wmflabs CNAME that points to it.

Change 905760 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] wikireplica_dns.yaml: make legacy tools-db names cnames for the wmcloud domain

https://gerrit.wikimedia.org/r/905760

Weirdly, tools.db.svc.wikimedia.cloud is a cname pointing to tools.db.svc.eqiad.wmflabs. I'd love to untangle this spaghetti but I'm unclear on the relationship between wikireplica_dns.yaml and designate.

My instinct is to make an A record for tools.db.svc.wikimedia.cloud in designate/horizon, then update the yaml to define the .wmflabs domains as cnames pointing to tools.db.svc.wikimedia.cloud. That would still leave us with the A record in the clouddb-services project, though, a project that is soon to be defunct. So maybe the db.svc.wikimedia.cloud domain should move to cloudinfra.... I have no real opinion here.

The good news is that everything is already funneled through the one A record, so rearranging all of this is not at all required for the migration.

A record in the clouddb-services project, though, a project that is soon to be defunct.

This is probably incorrect since there are e.g. replica proxies in that project that are likely to live on.

Also, note that tools.db.svc.eqiad.wmflabs points to 172.16.7.153. That is the static IP for clouddb1001, not the floating IP (which is 185.15.56.15). Maintain-dbusers uses the floating IP, does anything at all use the floating IP?

Change 906053 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] wikireplica_dns.yaml: move toolsdb DNS to new server in 'tools' project

https://gerrit.wikimedia.org/r/906053

Andrew updated the task description. (Show Details)

Also, note that tools.db.svc.eqiad.wmflabs points to 172.16.7.153. That is the static IP for clouddb1001, not the floating IP (which is 185.15.56.15). Maintain-dbusers uses the floating IP, does anything at all use the floating IP?

I think the floating IP was there so that maintain-dbusers could talk to ToolsDB from labstore1004.

Off the top of my head I can't think of any other production origin traffic that should end up at ToolsDB. We certainly should not be exposing MariaDB ports to the open internet.

Change 906053 merged by Andrew Bogott:

[operations/puppet@production] wikireplica_dns.yaml: move toolsdb DNS to new server in 'tools' project

https://gerrit.wikimedia.org/r/906053

tools-db-1 is now the new primary and tools are successfully writing to it.

The MariaDB logs in tools-db-1 do not show any warning or error, just a message confirming the replication stopped:

Apr 06 17:09:31 tools-db-1 mysqld[1945486]: 2023-04-06 17:09:31 974499 [Note] Slave SQL thread exiting, replication stopped in log 'log.321910' at position 59975043; GTID position '0-2886731673-33522724637,2886731673-2886731673-4887243158'
Apr 06 17:09:31 tools-db-1 mysqld[1945486]: 2023-04-06 17:09:31 974499 [Note] master was clouddb1001.clouddb-services.eqiad1.wikimedia.cloud:3306
Apr 06 17:09:31 tools-db-1 mysqld[1945486]: 2023-04-06 17:09:31 11 [Note] Slave I/O thread exiting, read up to log 'log.321910', position 59975043; GTID position 0-2886731673-33522724637,2886731673-2886731673-4887243158
Apr 06 17:09:31 tools-db-1 mysqld[1945486]: 2023-04-06 17:09:31 11 [Note] master was clouddb1001.clouddb-services.eqiad1.wikimedia.cloud:3306
fnegri updated the task description. (Show Details)

Something to note -- apparently tools-db-1 now isn't as accepting of formats for datetime columns. CopyPatrol used to INSERT using i.e. 2023-04-06T22:53:08+00:00 but now that throws SQLSTATE[22007]: Invalid datetime format: 1292 Incorrect datetime value. Using the format 2023-04-06 22:53 fixed it. Easy fix but this may effect other Toolforge tools.

Something to note -- apparently tools-db-1 now isn't as accepting of formats for datetime columns. CopyPatrol used to INSERT using i.e. 2023-04-06T22:53:08+00:00 but now that throws SQLSTATE[22007]: Invalid datetime format: 1292 Incorrect datetime value. Using the format 2023-04-06 22:53 fixed it. Easy fix but this may effect other Toolforge tools.

The old server was running MariaDB 10.1.44. The new server is MariaDB 10.4.28. The datetime literal format that copypatrol was using is not documented on https://mariadb.com/kb/en/date-and-time-literals/. Searching for "mariadb ISO 8601 datetime" leads me to believe that the big difference is likely that STRICT_TRANS_TABLES (strict mode, default from MariaDB 10.2.4) is turned on in the new server. I think it is quite likely that the old server was recording a warning for the non-standard datetime format, and the new server is turning that warning into an error. See https://mariadb.com/kb/en/sql-mode/#strict-mode for more info on strict mode.

@doctaxon I have restarted the tool with the commands below, and it's working again

ssh login.toolforge.org
become stimmberechtigung
webservice stop
webservice start

Change 905760 abandoned by Andrew Bogott:

[operations/puppet@production] wikireplica_dns.yaml: make legacy tools-db names cnames for the wmcloud domain

Reason:

https://gerrit.wikimedia.org/r/905760