Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts
Closed, Resolved (Public)

Description

The following shards need to be imported and sanitized into db1095 (sanitarium2) and labsdb1009, labsdb1010 and labsdb1011:

  • s2
  • s4
  • s5
  • s6
  • s7

Mentioned in SAL (#wikimedia-operations) [2017-07-04T05:47:11Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1060 - T153743 (duration: 02m 51s)

Mentioned in SAL (#wikimedia-operations) [2017-07-04T06:05:52Z] <marostegui> Stop MySQL on db1060 for maintenance - T153743

Change 363122 merged by Marostegui:
[operations/puppet@production] db1060.yaml: Change to ROW binlog format

https://gerrit.wikimedia.org/r/363122

Change 363137 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/software@master] s2.hosts: Add db1102

https://gerrit.wikimedia.org/r/363137

Change 363137 merged by jenkins-bot:
[operations/software@master] s2.hosts: Add db1102

https://gerrit.wikimedia.org/r/363137

Change 363138 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] redact_sanitarium: Add db1102 to the allowed hosts

https://gerrit.wikimedia.org/r/363138

Change 363138 merged by Marostegui:
[operations/puppet@production] redact_sanitarium: Add db1102 to the allowed hosts

https://gerrit.wikimedia.org/r/363138

Mentioned in SAL (#wikimedia-operations) [2017-07-04T08:54:30Z] <marostegui> Run redact_sanitarium on db1102 (sanitarium3) - T153743

Mentioned in SAL (#wikimedia-operations) [2017-07-04T09:54:37Z] <marostegui> Stop replication on db1095 for maintenance - T153743

Mentioned in SAL (#wikimedia-operations) [2017-07-04T09:58:51Z] <marostegui> Move labsdb1009 main general replication thread to a named replication thread called db1095 - T153743

Mentioned in SAL (#wikimedia-operations) [2017-07-04T10:33:17Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1060 - T153743 (duration: 02m 49s)

Change 363155 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1085

https://gerrit.wikimedia.org/r/363155

Mentioned in SAL (#wikimedia-operations) [2017-07-04T10:40:28Z] <marostegui> Stop replication on db1102 (sanitarium3) on s2 shard for maintenance - T153743

Change 363155 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1085

https://gerrit.wikimedia.org/r/363155

Mentioned in SAL (#wikimedia-operations) [2017-07-05T09:17:30Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1085 - T153743 (duration: 02m 50s)

Change 363299 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1085.yaml: Add ROW as binlog format

https://gerrit.wikimedia.org/r/363299

Mentioned in SAL (#wikimedia-operations) [2017-07-05T09:27:01Z] <marostegui> Stop MySQL on db1085 for maintenance - T153743

Change 363299 merged by Marostegui:
[operations/puppet@production] db1085.yaml: Add ROW as binlog format

https://gerrit.wikimedia.org/r/363299

Mentioned in SAL (#wikimedia-operations) [2017-07-05T10:41:56Z] <marostegui> Run redact_sanitarium on s6 databases db1102 - T153743

Change 363335 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Repool db1085

https://gerrit.wikimedia.org/r/363335

Change 363335 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Repool db1085

https://gerrit.wikimedia.org/r/363335

Mentioned in SAL (#wikimedia-operations) [2017-07-05T12:32:59Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1085 - T153743 (duration: 02m 49s)

Mentioned in SAL (#wikimedia-operations) [2017-07-05T12:33:09Z] <marostegui> Stop all replication threads on db1095 for maintenance - T153743

Mentioned in SAL (#wikimedia-operations) [2017-07-05T12:36:45Z] <marostegui> Move labsdb1010 main general replication thread to a named replication thread called db1095 - T153743

Change 363339 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/software@master] s6.hosts: Add db1102

https://gerrit.wikimedia.org/r/363339

Change 363339 merged by jenkins-bot:
[operations/software@master] s6.hosts: Add db1102

https://gerrit.wikimedia.org/r/363339

Marostegui added a comment. Edited Jul 5 2017, 2:12 PM

I thought I would update with the latest news from this task:

db1102 is the new sanitarium3, running multi-instance with GTID and SSL enabled (jcrespo, feel free to test the new role there if you want; nothing else is running on it apart from the replication threads, which can be stopped at any time).
s2 and s6 are now sanitized there and replicating with the replication filters in place.

labsdb1009 and labsdb1010 had their replication thread renamed: the default (unnamed) replication thread has been moved to a named replication thread called 'db1095' (as that is the master for that thread, which replicates s1, s3, s4 and s5).
The reason is that they will also need to replicate s2, s6 and s7 from db1102, so they will have two threads.
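
The two-thread layout described above can be sketched as follows. This is only an illustration, assuming MariaDB multi-source replication syntax (`CHANGE MASTER 'name' TO ...`, `START SLAVE 'name'`). The shard-to-source mapping comes from this comment; the function name and the idea of naming the second thread 'db1102' are assumptions, and port, credentials and replication coordinates are deliberately left out. These are not the commands that were actually run.

```python
# Shard -> upstream sanitarium host, as described in the comment above.
SHARD_SOURCE = {
    "s1": "db1095", "s3": "db1095", "s4": "db1095", "s5": "db1095",
    "s2": "db1102", "s6": "db1102", "s7": "db1102",
}


def connection_statements(shard_source):
    """Return one CHANGE MASTER / START SLAVE pair per named connection.

    The connection name is simply the upstream host name, mirroring the
    'db1095' thread name from the comment ('db1102' is an assumed name).
    Port, credentials and coordinates are intentionally omitted.
    """
    statements = []
    for source in sorted(set(shard_source.values())):
        statements.append(f"CHANGE MASTER '{source}' TO MASTER_HOST='{source}';")
        statements.append(f"START SLAVE '{source}';")
    return statements


if __name__ == "__main__":
    for line in connection_statements(SHARD_SOURCE):
        print(line)
```

Keeping one named connection per upstream host means either thread can be stopped or reconfigured without touching the other.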

Change 363815 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: db1079 as sanitarium3 master for s7

https://gerrit.wikimedia.org/r/363815

Before starting any replication for s2 or s6 on the labs servers, we need to either upgrade db1102 to 10.1 (which will require a bit of work, as it is running the sanitarium role for multi-instance, and that role doesn't have 10.1 yet) or temporarily change the binlog format on the intermediate masters of s2 and s6 to STATEMENT. Otherwise the triggers will not run on db1102.
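
The constraint can be summarised as a small decision function. This is a sketch under assumptions: it relies on the MariaDB behaviour that replica-side triggers run for statement-based events but, for row-based events, only when the replica is 10.1 or later and `slave_run_triggers_for_rbr` is enabled. The function and parameter names are hypothetical.

```python
def triggers_run_on_replica(master_binlog_format, replica_version,
                            run_triggers_for_rbr=False):
    """Return True if replica-side triggers would run for replicated events.

    master_binlog_format: "STATEMENT" or "ROW" (MIXED is not modelled here).
    replica_version: (major, minor) tuple, e.g. (10, 0) or (10, 1).
    run_triggers_for_rbr: whether slave_run_triggers_for_rbr is enabled.
    """
    fmt = master_binlog_format.upper()
    if fmt == "STATEMENT":
        # Statements are re-executed on the replica, so its triggers run.
        return True
    if fmt == "ROW":
        # Row events bypass triggers unless the replica supports and enables
        # slave_run_triggers_for_rbr (assumed available from 10.1 onwards).
        return replica_version >= (10, 1) and run_triggers_for_rbr
    raise ValueError(f"unsupported binlog format: {master_binlog_format}")
```

With a ROW master and a 10.0 replica the function returns False, which is the situation the comment warns about; either of the two options listed makes it return True.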

Mentioned in SAL (#wikimedia-operations) [2017-07-10T11:36:13Z] <marostegui> Stop MySQL on db1102 for maintenance - T153743

Mentioned in SAL (#wikimedia-operations) [2017-07-10T12:02:54Z] <marostegui> Upgrade db1102 to 10.1 and enable rbr triggers - T153743

Change 363815 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: db1079 as sanitarium3 master for s7

https://gerrit.wikimedia.org/r/363815

Mentioned in SAL (#wikimedia-operations) [2017-07-10T12:26:11Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: db1079 to become master for sanitarium3 - T153743 (duration: 00m 41s)

Mentioned in SAL (#wikimedia-operations) [2017-07-10T12:58:28Z] <marostegui> Run redact_sanitarium on s2 and s6 - db1102 - T153743

Mentioned in SAL (#wikimedia-operations) [2017-07-10T13:28:08Z] <marostegui> Disable puppet on db1102 to run check_private_data - T153743

I have upgraded db1102 to 10.1, so we are now using rbr triggers there.
After sanitizing s2 and s6, I ran the check_private_data script and it reported nothing, so that looks good (I ran that process twice).
I am going to leave both shards just replicating, and I will run the check_private_data script again tomorrow to make sure the triggers are working fine. I will also manually check the users table.

If all goes well, I will import s7 into sanitarium3 tomorrow too.
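
The kind of check described here can be illustrated with a short sketch. This is not the check_private_data script itself; it only assumes that a sanitized row should have blank values in its private columns. The column list and the function name are hypothetical and chosen for illustration.

```python
PRIVATE_COLUMNS = ("user_password", "user_email")  # illustrative subset only


def find_unsanitized(rows, private_columns=PRIVATE_COLUMNS):
    """Return (row_index, column) pairs whose private value is not blank."""
    leaks = []
    for index, row in enumerate(rows):
        for column in private_columns:
            if row.get(column) not in (None, "", b""):
                leaks.append((index, column))
    return leaks


if __name__ == "__main__":
    sample = [
        {"user_name": "A", "user_password": "", "user_email": ""},
        {"user_name": "B", "user_password": "hash", "user_email": None},
    ]
    print(find_unsanitized(sample))
```

An empty result corresponds to the script "reporting nothing"; running it again after a day of replication checks that the triggers keep new rows redacted as well.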

Change 364247 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1079.yaml: Specify ROW as binlog format

https://gerrit.wikimedia.org/r/364247

Change 364372 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1079

https://gerrit.wikimedia.org/r/364372

Change 364372 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1079

https://gerrit.wikimedia.org/r/364372

Mentioned in SAL (#wikimedia-operations) [2017-07-11T07:15:28Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1079 - T153743 (duration: 00m 41s)

Mentioned in SAL (#wikimedia-operations) [2017-07-11T07:26:26Z] <marostegui> Stop MySQL on db1079 for maintenance - T153743

Change 364247 merged by Marostegui:
[operations/puppet@production] db1079.yaml: Specify ROW as binlog format

https://gerrit.wikimedia.org/r/364247

Mentioned in SAL (#wikimedia-operations) [2017-07-11T07:38:25Z] <marostegui> Stop MySQL db1102 for maintenance - T153743

Mentioned in SAL (#wikimedia-operations) [2017-07-11T09:40:14Z] <marostegui> Stop slave s6 on db1102 for exporting its content - T153743

Change 364393 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/software@master] s7.hosts: db1102 now replicates s7

https://gerrit.wikimedia.org/r/364393

Change 364393 merged by jenkins-bot:
[operations/software@master] s7.hosts: db1102 now replicates s7

https://gerrit.wikimedia.org/r/364393

Change 364395 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Repool db1079

https://gerrit.wikimedia.org/r/364395

Change 364395 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Repool db1079

https://gerrit.wikimedia.org/r/364395

Mentioned in SAL (#wikimedia-operations) [2017-07-11T09:54:00Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1079 with low weight - T153743 (duration: 00m 42s)

Mentioned in SAL (#wikimedia-operations) [2017-07-11T15:25:06Z] <marostegui> Stop replication labsdb1009 for maintenance - T153743

Mentioned in SAL (#wikimedia-operations) [2017-07-11T15:29:31Z] <marostegui> Stop replication labsdb1010 for maintenance - T153743

Change 364659 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] sanitarium3.my.cnf: Save binlogs 30 days

https://gerrit.wikimedia.org/r/364659

Change 364659 merged by Marostegui:
[operations/puppet@production] sanitarium3.my.cnf: Save binlogs 30 days

https://gerrit.wikimedia.org/r/364659

Change 365036 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/software@master] s6.hosts: Add labsdb1009 and labsdb1010

https://gerrit.wikimedia.org/r/365036

Change 365036 merged by jenkins-bot:
[operations/software@master] s6.hosts: Add labsdb1009 and labsdb1010

https://gerrit.wikimedia.org/r/365036

Mentioned in SAL (#wikimedia-operations) [2017-07-14T07:22:45Z] <marostegui> Stop replication on labsdb1011 for maintenance - T153743

Change 365233 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/software@master] s6.hosts: Add labsdb1011

https://gerrit.wikimedia.org/r/365233

Mentioned in SAL (#wikimedia-operations) [2017-07-17T06:50:32Z] <marostegui> Stop replication on db1095 for maintenance - T153743

Mentioned in SAL (#wikimedia-operations) [2017-07-17T07:00:16Z] <marostegui> Rename labsdb1011 main replication thread to an specific one - T153743

Change 365233 merged by jenkins-bot:
[operations/software@master] s6.hosts: Add labsdb1011

https://gerrit.wikimedia.org/r/365233

Mentioned in SAL (#wikimedia-operations) [2017-07-17T07:12:39Z] <marostegui> Stop slave s2 on db1102 for maintenance - T153743

Mentioned in SAL (#wikimedia-operations) [2017-07-17T09:05:07Z] <marostegui> Disable puppet on labsdb1009 for maintenance - T153743

Mentioned in SAL (#wikimedia-operations) [2017-07-17T09:22:19Z] <marostegui> Stop replication on labsdb1009 and labsdb1010 for maintenance - T153743

Mentioned in SAL (#wikimedia-operations) [2017-07-17T09:24:38Z] <marostegui> Disable puppet on labsdb1010 for maintenance - T153743

Mentioned in SAL (#wikimedia-operations) [2017-07-17T14:34:31Z] <marostegui> Run maintain-views on labsdb1009,10 and 11 for s6 - T153743

s6 is now available on the new labs servers. I have created the views too.

s2 is being imported now

Marostegui updated the task description. Jul 17 2017, 2:37 PM

Mentioned in SAL (#wikimedia-operations) [2017-07-20T05:05:18Z] <marostegui> Configure replication for s2 on labsdb1009 and labsdb1010 - T153743

Change 366510 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/software@master] s2.hosts: Add labsdb1009 and labsdb1010

https://gerrit.wikimedia.org/r/366510

s2 has been imported to labsdb1009 and labsdb1010. I will start with labsdb1011 in a bit.
The reason I don't do all the hosts at the same time is basically safety. If for whatever reason the import makes one of the hosts fail, we'd still have labsdb1011 as an intact clone source to reclone the other two.

Views will only be created once the import is done on all the hosts and once another sanitize check has been run.
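
The staged approach above can be sketched as two small helpers. This is a minimal sketch with hypothetical function names; it only encodes the two rules stated in the comment (hold one host back as a clone source, and create views only after every host is imported and the sanitize check has passed).

```python
def plan_import(hosts, keep_back=1):
    """Split hosts into a first batch and a held-back batch of clone sources."""
    if not 0 < keep_back < len(hosts):
        raise ValueError("keep_back must leave at least one host in each batch")
    return hosts[:-keep_back], hosts[-keep_back:]


def views_allowed(imported, all_hosts, sanitize_check_passed):
    """Views are created only after every host is imported and the check passed."""
    return set(imported) >= set(all_hosts) and sanitize_check_passed


if __name__ == "__main__":
    hosts = ["labsdb1009", "labsdb1010", "labsdb1011"]
    first, held_back = plan_import(hosts)
    print(first, held_back)
    print(views_allowed(first, hosts, sanitize_check_passed=True))
```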

Mentioned in SAL (#wikimedia-operations) [2017-07-20T07:55:44Z] <marostegui> Start importing s2 into labsdb1011 - T153743

Mentioned in SAL (#wikimedia-operations) [2017-07-20T08:01:47Z] <marostegui> Stop replication on labsdb1011 for maintenance - T153743

Change 366510 merged by jenkins-bot:
[operations/software@master] s2.hosts: Add labsdb1009 and labsdb1010

https://gerrit.wikimedia.org/r/366510

Mentioned in SAL (#wikimedia-operations) [2017-07-24T05:40:34Z] <marostegui> Configure and start s2 replication on labsdb1011 - T153743

Change 367357 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/software@master] s2.hosts: Add labsdb1011 to s2 list of hosts

https://gerrit.wikimedia.org/r/367357

Change 367357 merged by jenkins-bot:
[operations/software@master] s2.hosts: Add labsdb1011 to s2 list of hosts

https://gerrit.wikimedia.org/r/367357

Mentioned in SAL (#wikimedia-operations) [2017-07-24T14:29:50Z] <marostegui> Run maintain-views on labsdb1009, labsdb1010 and labsdb1011 for s2 wikis - T153743

s2 has been imported on labsdb1009, labsdb1010 and labsdb1011. Views have been created so these wikis are now fully available.
labsdb1011 is still catching up with replication; it is around 2 days delayed, as it had replication stopped while the import was running.
Since around 6AM UTC it has gone from 6 days delayed to 2 days, so I reckon it should be up to date later today.
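
The "later today" estimate can be reproduced with simple arithmetic. The sketch below assumes the catch-up rate stays constant, which real replication does not guarantee. The 8 elapsed hours are an assumption, since the comment only says "around 6AM UTC"; the function name is hypothetical.

```python
def hours_to_catch_up(lag_then_days, lag_now_days, elapsed_hours):
    """Estimate hours until lag reaches zero, assuming a constant catch-up rate."""
    recovered = lag_then_days - lag_now_days
    if recovered <= 0 or elapsed_hours <= 0:
        raise ValueError("lag is not shrinking; no estimate possible")
    rate = recovered / elapsed_hours  # days of lag removed per wall-clock hour
    return lag_now_days / rate


if __name__ == "__main__":
    # 6 days -> 2 days of lag over an assumed 8 hours.
    print(hours_to_catch_up(6, 2, 8))
```

Under these assumptions the host removes half a day of lag per hour, so the remaining 2 days would take about 4 more hours.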

Marostegui updated the task description. Edited Jul 24 2017, 2:57 PM
Marostegui claimed this task.

I will start with s7, the last pending shard, on Monday as I am off from tomorrow.

Change 367650 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] check_private_data.py: Add socket parameter

https://gerrit.wikimedia.org/r/367650

Change 367650 merged by Marostegui:
[operations/puppet@production] check_private_data.py: Add socket parameter

https://gerrit.wikimedia.org/r/367650

Change 368391 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] labsdb-replicas: Update new labsdb hosts to stretch/systemd

https://gerrit.wikimedia.org/r/368391

Change 368391 merged by Jcrespo:
[operations/puppet@production] labsdb-replicas: Update new labsdb hosts to stretch/systemd

https://gerrit.wikimedia.org/r/368391

Change 368408 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] labsdb: Rename sanitarium2 to sanitarium multisource

https://gerrit.wikimedia.org/r/368408

Change 368408 merged by Jcrespo:
[operations/puppet@production] labsdb: Rename sanitarium2 to sanitarium multisource

https://gerrit.wikimedia.org/r/368408

Mentioned in SAL (#wikimedia-operations) [2017-07-31T07:17:35Z] <marostegui> Stop replication on s7 on db1102 for maintenance - T153743

Mentioned in SAL (#wikimedia-operations) [2017-07-31T09:28:49Z] <marostegui> Stop replication on labsdb1009 and labsdb1010 for maintenance - T153743

Mentioned in SAL (#wikimedia-operations) [2017-07-31T11:12:18Z] <marostegui> Compress s6 on db1102 - T153743

Mentioned in SAL (#wikimedia-operations) [2017-08-03T05:17:48Z] <marostegui> Stop replication on labsdb1011 for maintenance - T153743

Marostegui updated the task description. Mon, Aug 7, 11:32 AM
Marostegui closed this task as Resolved. Mon, Aug 7, 11:37 AM

Hi,

So the last shard pending import (s7) is now on the new labs hosts, which means that they now hold all production shards!
The views have been created and the grants added, so this is finally all done!

cloud-services-team: the new hosts are now ready (and so are the views). We might still find issues (e.g. some missing grants), but we'll only find those once more and more users start using them.
We'll probably also need to keep an eye on how they perform and see whether we have to set up query killers and similar measures, as we have on the old labs servers.

Keep in mind that these hosts are read-only and do not hold user databases; that is a pending discussion that needs to happen to see how we can address that problem.

Thanks Jaime, cloud-services-team and everyone for all the help and work to get this milestone finally achieved!