The following shards need to be imported and sanitized into db1095 (sanitarium2) and labsdb1009, 1010 and 1011:
- s2
- s4
- s5
- s6
- s7
Status | Subtype | Assigned | Task
--- | --- | --- | ---
Resolved | | jcrespo | T140788 Labs databases rearchitecture (tracking)
Resolved | | jcrespo | T153058 LabsDB infrastructure pending work
Resolved | | None | T159423 Meta ticket: Migrate multi-source database hosts to multi-instance
Resolved | | Marostegui | T153743 Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts
Resolved | | Marostegui | T157931 s5: db1070 not using file per table
Resolved | | Marostegui | T168021 setup dewiki and wikidatawiki on the labsdb1009, 1010 and 1011
Change 363137 merged by jenkins-bot:
[operations/software@master] s2.hosts: Add db1102
Change 363138 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] redact_sanitarium: Add db1102 to the allowed hosts
Change 363138 merged by Marostegui:
[operations/puppet@production] redact_sanitarium: Add db1102 to the allowed hosts
Mentioned in SAL (#wikimedia-operations) [2017-07-04T08:54:30Z] <marostegui> Run redact_sanitarium on db1102 (sanitarium3) - T153743
Mentioned in SAL (#wikimedia-operations) [2017-07-04T09:54:37Z] <marostegui> Stop replication on db1095 for maintenance - T153743
Mentioned in SAL (#wikimedia-operations) [2017-07-04T09:58:51Z] <marostegui> Move labsdb1009 main general replication thread to a named replication thread called db1095 - T153743
Mentioned in SAL (#wikimedia-operations) [2017-07-04T10:33:17Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1060 - T153743 (duration: 02m 49s)
Change 363155 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1085
Mentioned in SAL (#wikimedia-operations) [2017-07-04T10:40:28Z] <marostegui> Stop replication on db1102 (sanitarium3) on s2 shard for maintenance - T153743
Change 363155 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1085
Mentioned in SAL (#wikimedia-operations) [2017-07-05T09:17:30Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1085 - T153743 (duration: 02m 50s)
Change 363299 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1085.yaml: Add ROW as binlog format
Mentioned in SAL (#wikimedia-operations) [2017-07-05T09:27:01Z] <marostegui> Stop MySQL on db1085 for maintenance - T153743
Change 363299 merged by Marostegui:
[operations/puppet@production] db1085.yaml: Add ROW as binlog format
Mentioned in SAL (#wikimedia-operations) [2017-07-05T10:41:56Z] <marostegui> Run redact_sanitarium on s6 databases db1102 - T153743
Change 363335 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Repool db1085
Change 363335 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Repool db1085
Mentioned in SAL (#wikimedia-operations) [2017-07-05T12:32:59Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1085 - T153743 (duration: 02m 49s)
Mentioned in SAL (#wikimedia-operations) [2017-07-05T12:33:09Z] <marostegui> Stop all replication threads on db1095 for maintenance - T153743
Mentioned in SAL (#wikimedia-operations) [2017-07-05T12:36:45Z] <marostegui> Move labsdb1010 main general replication thread to a named replication thread called db1095 - T153743
Change 363339 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/software@master] s6.hosts: Add db1102
Change 363339 merged by jenkins-bot:
[operations/software@master] s6.hosts: Add db1102
I thought I would update with the latest news from this task:
db1102 is the new sanitarium3, running multi-instance with GTID and SSL enabled (jcrespo, feel free to test the new role there if you want; apart from the replication threads, which can be stopped at any time, there is no other process running).
s2 and s6 are now sanitized there and replicating, with the replication filters in place.
labsdb1009 and labsdb1010 had their replication thread renamed: the general (unnamed) replication thread has been moved to a named replication thread called 'db1095' (as that is the master for the thread that replicates s1, s3, s4 and s5).
The reason is that they will also need to replicate s2, s6 and s7 from db1102, so they will end up with two threads, as sketched below.
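For reference, this is roughly what the named (multi-source) replication connections look like in MariaDB; the host names, replication user and GTID mode below are illustrative assumptions, not the exact commands that were run:

```sql
-- On labsdb1009/labsdb1010: replace the old unnamed connection with a named one
-- ('db1095'), so that a second connection from db1102 can be added alongside it.
STOP SLAVE;
RESET SLAVE ALL;                          -- drop the unnamed connection definition
CHANGE MASTER 'db1095' TO
    MASTER_HOST = 'db1095.eqiad.wmnet',   -- illustrative host name
    MASTER_USER = 'repl',                 -- hypothetical replication user
    MASTER_USE_GTID = slave_pos;
START SLAVE 'db1095';

-- Later, the second connection replicating s2/s6/s7 from db1102:
CHANGE MASTER 'db1102' TO
    MASTER_HOST = 'db1102.eqiad.wmnet',
    MASTER_USER = 'repl',
    MASTER_USE_GTID = slave_pos;
START SLAVE 'db1102';

-- Both connections should now show up:
SHOW ALL SLAVES STATUS\G
```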
Change 363815 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: db1079 as sanitarium3 master for s7
Before starting any replication for s2 or s6 on the labs servers, we need to either upgrade db1102 to 10.1 (which will require a bit of work, as it is running the multi-instance sanitarium role, and that one doesn't have 10.1) or temporarily change the binlog format on the intermediate masters of s2 and s6 to STATEMENT. Otherwise the triggers will not fire on db1102.
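In rough terms, the two options boil down to something like the following (a sketch only; whether these values end up set via puppet or at runtime, and on which hosts, is not shown here):

```sql
-- Option 1: on db1102, once it runs MariaDB 10.1, make the replica fire triggers
-- for row-based events, so the redaction triggers keep working under ROW.
SET GLOBAL slave_run_triggers_for_rbr = YES;

-- Option 2: temporarily, on the intermediate masters of s2 and s6, switch to
-- statement-based binlogs, which do fire triggers on the replica.
SET GLOBAL binlog_format = 'STATEMENT';
```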
Mentioned in SAL (#wikimedia-operations) [2017-07-10T11:36:13Z] <marostegui> Stop MySQL on db1102 for maintenance - T153743
Mentioned in SAL (#wikimedia-operations) [2017-07-10T12:02:54Z] <marostegui> Upgrade db1102 to 10.1 and enable rbr triggers - T153743
Change 363815 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: db1079 as sanitarium3 master for s7
Mentioned in SAL (#wikimedia-operations) [2017-07-10T12:26:11Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: db1079 to become master for sanitarium3 - T153743 (duration: 00m 41s)
Mentioned in SAL (#wikimedia-operations) [2017-07-10T12:58:28Z] <marostegui> Run redact_sanitarium on s2 and s6 - db1102 - T153743
Mentioned in SAL (#wikimedia-operations) [2017-07-10T13:28:08Z] <marostegui> Disable puppet on db1102 to run check_private_data - T153743
I have upgraded db1102 to 10.1, so we are now using RBR triggers there.
After sanitizing s2 and s6, I ran the check_private_data script and it reported nothing, so that looks good (I ran that process twice).
I am going to leave both shards just replicating and I will run the check_private_data script again tomorrow to make sure the triggers are working fine. I will also manually check the user table.
If it all goes fine, I will import s7 into sanitarium3 tomorrow too.
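For context, the redaction triggers and the manual check look roughly like this; the trigger name, wiki and column list are hypothetical examples (the real ones are generated by redact_sanitarium and verified by check_private_data against the private-data list):

```sql
-- Hypothetical shape of a redaction trigger on a sanitarium instance:
-- blank out private columns as rows replicate in.
CREATE TRIGGER user_insert_redact
BEFORE INSERT ON somewiki.user
FOR EACH ROW
SET NEW.user_password = '',
    NEW.user_newpassword = '',
    NEW.user_email = '',
    NEW.user_token = '';

-- Manual spot-check of the user table: anything non-zero here would mean
-- private data slipped through.
SELECT COUNT(*)
FROM somewiki.user
WHERE user_password <> '' OR user_email <> '' OR user_token <> '';
```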
Change 364247 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1079.yaml: Specify ROW as binlog format
Change 364372 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1079
Change 364372 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1079
Mentioned in SAL (#wikimedia-operations) [2017-07-11T07:15:28Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1079 - T153743 (duration: 00m 41s)
Mentioned in SAL (#wikimedia-operations) [2017-07-11T07:26:26Z] <marostegui> Stop MySQL on db1079 for maintenance - T153743
Change 364247 merged by Marostegui:
[operations/puppet@production] db1079.yaml: Specify ROW as binlog format
Mentioned in SAL (#wikimedia-operations) [2017-07-11T07:38:25Z] <marostegui> Stop MySQL db1102 for maintenance - T153743
Mentioned in SAL (#wikimedia-operations) [2017-07-11T09:40:14Z] <marostegui> Stop slave s6 on db1102 for exporting its content - T153743
Change 364393 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/software@master] s7.hosts: db1102 now replicates s7
Change 364393 merged by jenkins-bot:
[operations/software@master] s7.hosts: db1102 now replicates s7
Change 364395 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Repool db1079
Change 364395 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Repool db1079
Mentioned in SAL (#wikimedia-operations) [2017-07-11T09:54:00Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1079 with low weight - T153743 (duration: 00m 42s)
Mentioned in SAL (#wikimedia-operations) [2017-07-11T15:25:06Z] <marostegui> Stop replication labsdb1009 for maintenance - T153743
Mentioned in SAL (#wikimedia-operations) [2017-07-11T15:29:31Z] <marostegui> Stop replication labsdb1010 for maintenance - T153743
Change 364659 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] sanitarium3.my.cnf: Save binlogs 30 days
Change 364659 merged by Marostegui:
[operations/puppet@production] sanitarium3.my.cnf: Save binlogs 30 days
Change 365036 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/software@master] s6.hosts: Add labsdb1009 and labsdb1010
Change 365036 merged by jenkins-bot:
[operations/software@master] s6.hosts: Add labsdb1009 and labsdb1010
Mentioned in SAL (#wikimedia-operations) [2017-07-14T07:22:45Z] <marostegui> Stop replication on labsdb1011 for maintenance - T153743
Change 365233 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/software@master] s6.hosts: Add labsdb1011
Mentioned in SAL (#wikimedia-operations) [2017-07-17T06:50:32Z] <marostegui> Stop replication on db1095 for maintenance - T153743
Mentioned in SAL (#wikimedia-operations) [2017-07-17T07:00:16Z] <marostegui> Rename labsdb1011 main replication thread to an specific one - T153743
Change 365233 merged by jenkins-bot:
[operations/software@master] s6.hosts: Add labsdb1011
Mentioned in SAL (#wikimedia-operations) [2017-07-17T07:12:39Z] <marostegui> Stop slave s2 on db1102 for maintenance - T153743
Mentioned in SAL (#wikimedia-operations) [2017-07-17T09:05:07Z] <marostegui> Disable puppet on labsdb1009 for maintenance - T153743
Mentioned in SAL (#wikimedia-operations) [2017-07-17T09:22:19Z] <marostegui> Stop replication on labsdb1009 and labsdb1010 for maintenance - T153743
Mentioned in SAL (#wikimedia-operations) [2017-07-17T09:24:38Z] <marostegui> Disable puppet on labsdb1010 for maintenance - T153743
Mentioned in SAL (#wikimedia-operations) [2017-07-17T14:34:31Z] <marostegui> Run maintain-views on labsdb1009,10 and 11 for s6 - T153743
s6 is now available on the new labs servers. I have created the views too.
s2 is being imported now
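For those not familiar with maintain-views: it (re)creates the public `_p` view databases on the labs replicas, exposing only whitelisted columns of the underlying tables. A very simplified sketch of one generated view follows; the wiki name and column list are illustrative, not the real definitions:

```sql
-- Illustrative only: the real definitions, definer and grants come from
-- maintain-views' own templates.
CREATE DATABASE IF NOT EXISTS somewiki_p;
CREATE OR REPLACE VIEW somewiki_p.user AS
    SELECT user_id, user_name, user_registration, user_editcount
    FROM somewiki.user;
```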
Mentioned in SAL (#wikimedia-operations) [2017-07-20T05:05:18Z] <marostegui> Configure replication for s2 on labsdb1009 and labsdb1010 - T153743
Change 366510 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/software@master] s2.hosts: Add labsdb1009 and labsdb1010
s2 has been imported to labsdb1009 and labsdb1010. I will start with labsdb1011 in a bit.
The reason I don't do all the hosts at the same time is basically safety: if for whatever reason the import makes one of the hosts fail, we'd still have labsdb1011 as an intact clone source to reclone the other two.
Views will only be created once the import is done on all the hosts and once another sanitize check has been run.
Mentioned in SAL (#wikimedia-operations) [2017-07-20T07:55:44Z] <marostegui> Start importing s2 into labsdb1011 - T153743
Mentioned in SAL (#wikimedia-operations) [2017-07-20T08:01:47Z] <marostegui> Stop replication on labsdb1011 for maintenance - T153743
Change 366510 merged by jenkins-bot:
[operations/software@master] s2.hosts: Add labsdb1009 and labsdb1010
Mentioned in SAL (#wikimedia-operations) [2017-07-24T05:40:34Z] <marostegui> Configure and start s2 replication on labsdb1011 - T153743
Change 367357 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/software@master] s2.hosts: Add labsdb1011 to s2 list of hosts
Change 367357 merged by jenkins-bot:
[operations/software@master] s2.hosts: Add labsdb1011 to s2 list of hosts
Mentioned in SAL (#wikimedia-operations) [2017-07-24T14:29:50Z] <marostegui> Run maintain-views on labsdb1009, labsdb1010 and labsdb1011 for s2 wikis - T153743
s2 has been imported on labsdb1009, labsdb1010 and labsdb1011. Views have been created so these wikis are now fully available.
labsdb1011 is catching up on replication; it is around 2 days behind, as replication was stopped while the import was running.
Since around 06:00 UTC it has gone from 6 days behind to 2, so I reckon it should be up to date later today.
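For anyone watching the catch-up, the lag can be followed per replication connection with MariaDB's multi-source status command, which reports Seconds_Behind_Master separately for each named connection:

```sql
SHOW ALL SLAVES STATUS\G
```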
Change 367650 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] check_private_data.py: Add socket parameter
Change 367650 merged by Marostegui:
[operations/puppet@production] check_private_data.py: Add socket parameter
Change 368391 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] labsdb-replicas: Update new labsdb hosts to stretch/systemd
Change 368391 merged by Jcrespo:
[operations/puppet@production] labsdb-replicas: Update new labsdb hosts to stretch/systemd
Change 368408 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] labsdb: Rename sanitarium2 to sanitarium multisource
Change 368408 merged by Jcrespo:
[operations/puppet@production] labsdb: Rename sanitarium2 to sanitarium multisource
Mentioned in SAL (#wikimedia-operations) [2017-07-31T07:17:35Z] <marostegui> Stop replication on s7 on db1102 for maintenance - T153743
Mentioned in SAL (#wikimedia-operations) [2017-07-31T09:28:49Z] <marostegui> Stop replication on labsdb1009 and labsdb1010 for maintenance - T153743
Mentioned in SAL (#wikimedia-operations) [2017-07-31T11:12:18Z] <marostegui> Compress s6 on db1102 - T153743
Mentioned in SAL (#wikimedia-operations) [2017-08-03T05:17:48Z] <marostegui> Stop replication on labsdb1011 for maintenance - T153743
Hi,
So the last pending shard (s7) has now been imported on the new labs hosts, which means they now hold all production shards!
The views have been created and the grants added, so this is all finally done!
cloud-services-team: the new hosts are now ready (and so are the views). We might still find issues (e.g. some missing grants) or the like, but we'll only find those once more and more users start using them.
We'll probably also need to keep an eye on how they perform and see if we have to set up query killers and similar measures, as we have on the old labs servers.
Keep in mind that these hosts are read-only and do not hold user databases; how to address that is a pending discussion that still needs to happen.
Thanks Jaime, cloud-services-team and everyone for all the help and work to finally get this milestone achieved!