Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts
Closed, Resolved (Public)

Assigned To

Authored By

	• Marostegui
	Dec 20 2016, 10:09 AM

Details

Subject	Repo	Branch	Lines +/-
labsdb: Rename sanitarium2 to sanitarium multisource	operations/puppet	production	+57 -75
labsdb-replicas: Update new labsdb hosts to stretch/systemd	operations/puppet	production	+8 -26
check_private_data.py: Add socket parameter	operations/puppet	production	+9 -3
s2.hosts: Add labsdb1011 to s2 list of hosts	operations/software	master	+1 -0
s2.hosts: Add labsdb1009 and labsdb1010	operations/software	master	+2 -0
s6.hosts: Add labsdb1011	operations/software	master	+1 -0
s6.hosts: Add labsdb1009 and labsdb1010	operations/software	master	+2 -0
sanitarium3.my.cnf: Save binlogs 30 days	operations/puppet	production	+1 -1
db-eqiad.php: Repool db1079	operations/mediawiki-config	master	+3 -3
s7.hosts: db1102 now replicates s7	operations/software	master	+1 -0
db1079.yaml: Specify ROW as binlog format	operations/puppet	production	+1 -1
db-eqiad.php: Depool db1079	operations/mediawiki-config	master	+3 -3
db-eqiad.php: db1079 as sanitarium3 master for s7	operations/mediawiki-config	master	+1 -1
s6.hosts: Add db1102	operations/software	master	+1 -0
db-eqiad.php: Repool db1085	operations/mediawiki-config	master	+3 -3
db1085.yaml: Add ROW as binlog format	operations/puppet	production	+1 -1
db-eqiad.php: Depool db1085	operations/mediawiki-config	master	+3 -3
db-eqiad.php: Depool db1060	operations/mediawiki-config	master	+2 -2
redact_sanitarium: Add db1102 to the allowed hosts	operations/puppet	production	+1 -1
s2.hosts: Add db1102	operations/software	master	+1 -0
db1060.yaml: Change to ROW binlog format	operations/puppet	production	+1 -1
site.pp: Add db1102 sanitarium role	operations/puppet	production	+6 -0
db-eqiad.php: Depool db1070	operations/mediawiki-config	master	+4 -4
s5.hosts: Add new labs infra hosts	operations/software	master	+4 -0
db-eqiad.php: Depool db1070	operations/mediawiki-config	master	+6 -6
db-eqiad.php: Add a few comments	operations/mediawiki-config	master	+1 -1
db-eqiad.php: Depool db1070	operations/mediawiki-config	master	+5 -5
site.pp: Enable ROW binlog for db1070	operations/puppet	production	+9 -1
db-eqiad.php: Add comment for db1064	operations/mediawiki-config	master	+1 -1
db-eqiad.php: Depool db1064	operations/mediawiki-config	master	+6 -6
check_private_data.py: Add missing quote	operations/puppet	production	+1 -1
db-eqiad.php: Depool db1064	operations/mediawiki-config	master	+6 -6
site.pp: Change db1064 to ROW	operations/puppet	production	+9 -1

Related Objects

Status	Assigned	Task
Resolved	jcrespo	T140788 Labs databases rearchitecture (tracking)
Resolved	jcrespo	T153058 LabsDB infrastructure pending work
Resolved	None	T159423 Meta ticket: Migrate multi-source database hosts to multi-instance
Resolved	• Marostegui	T153743 Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts
Resolved	• Marostegui	T157931 s5: db1070 not using file per table
Resolved	• Marostegui	T168021 setup dewiki and wikidatawiki on the labsdb1009, 1010 and 1011

Event Timeline

Change 363137 merged by jenkins-bot:
[operations/software@master] s2.hosts: Add db1102

https://gerrit.wikimedia.org/r/363137

Change 363138 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] redact_sanitarium: Add db1102 to the allowed hosts

https://gerrit.wikimedia.org/r/363138

Change 363138 merged by Marostegui:
[operations/puppet@production] redact_sanitarium: Add db1102 to the allowed hosts

https://gerrit.wikimedia.org/r/363138

Mentioned in SAL (#wikimedia-operations) [2017-07-04T08:54:30Z] <marostegui> Run redact_sanitarium on db1102 (sanitarium3) - T153743

Mentioned in SAL (#wikimedia-operations) [2017-07-04T09:54:37Z] <marostegui> Stop replication on db1095 for maintenance - T153743

Mentioned in SAL (#wikimedia-operations) [2017-07-04T09:58:51Z] <marostegui> Move labsdb1009 main general replication thread to a named replication thread called db1095 - T153743

Mentioned in SAL (#wikimedia-operations) [2017-07-04T10:33:17Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1060 - T153743 (duration: 02m 49s)

Change 363155 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1085

https://gerrit.wikimedia.org/r/363155

Mentioned in SAL (#wikimedia-operations) [2017-07-04T10:40:28Z] <marostegui> Stop replication on db1102 (sanitarium3) on s2 shard for maintenance - T153743

Change 363155 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1085

https://gerrit.wikimedia.org/r/363155

Mentioned in SAL (#wikimedia-operations) [2017-07-05T09:17:30Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1085 - T153743 (duration: 02m 50s)

Change 363299 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1085.yaml: Add ROW as binlog format

https://gerrit.wikimedia.org/r/363299

Mentioned in SAL (#wikimedia-operations) [2017-07-05T09:27:01Z] <marostegui> Stop MySQL on db1085 for maintenance - T153743

Change 363299 merged by Marostegui:
[operations/puppet@production] db1085.yaml: Add ROW as binlog format

https://gerrit.wikimedia.org/r/363299

Mentioned in SAL (#wikimedia-operations) [2017-07-05T10:41:56Z] <marostegui> Run redact_sanitarium on s6 databases db1102 - T153743

Change 363335 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Repool db1085

https://gerrit.wikimedia.org/r/363335

Change 363335 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Repool db1085

https://gerrit.wikimedia.org/r/363335

Mentioned in SAL (#wikimedia-operations) [2017-07-05T12:32:59Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1085 - T153743 (duration: 02m 49s)

Mentioned in SAL (#wikimedia-operations) [2017-07-05T12:33:09Z] <marostegui> Stop all replication threads on db1095 for maintenance - T153743

Mentioned in SAL (#wikimedia-operations) [2017-07-05T12:36:45Z] <marostegui> Move labsdb1010 main general replication thread to a named replication thread called db1095 - T153743

Change 363339 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/software@master] s6.hosts: Add db1102

https://gerrit.wikimedia.org/r/363339

Change 363339 merged by jenkins-bot:
[operations/software@master] s6.hosts: Add db1102

https://gerrit.wikimedia.org/r/363339

I thought I would update with the latest news from this task:

db1102 is the new sanitarium3, running multi-instance with GTID and SSL enabled. (jcrespo, feel free to test the new role there if you want: nothing else runs on it apart from the replication threads, which can be stopped at any time.)
It now has s2 and s6 sanitized and replicating with the replication filters in place.

labsdb1009 and labsdb1010 got their replication thread renamed: instead of using the general one, that thread has been moved to a named replication thread called 'db1095' (db1095 is the master for that thread, which replicates s1, s3, s4 and s5).
The reason is that they will also need to replicate s2, s6 and s7 from db1102, so they will have two threads.
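
For readers unfamiliar with named replication threads, here is a minimal sketch of how the move described above could look using MariaDB's multi-source syntax (`CHANGE MASTER 'name' TO ...`, `START SLAVE 'name'`). It is not the procedure that was actually run. The function name, the replication user, the host name and the binlog coordinates are all hypothetical, and binlog coordinates (rather than GTID) are assumed purely for illustration.

```python
def move_to_named_connection(name, host, log_file, log_pos, user="repl"):
    """Build the SQL to recreate the default (unnamed) replication
    connection as a named MariaDB multi-source connection.

    Assumes the default thread is already stopped and that log_file and
    log_pos were recorded from SHOW SLAVE STATUS at that point
    (Relay_Master_Log_File / Exec_Master_Log_Pos).
    Credentials are deliberately left out of this sketch.
    """
    if not name.isalnum():
        raise ValueError(f"unsafe connection name: {name!r}")
    change_master = (
        f"CHANGE MASTER '{name}' TO "
        f"MASTER_HOST='{host}', MASTER_USER='{user}', "
        f"MASTER_LOG_FILE='{log_file}', MASTER_LOG_POS={int(log_pos)}, "
        "MASTER_SSL=1;"
    )
    return [
        "RESET SLAVE ALL;",        # drop the default connection's settings
        change_master,             # recreate it under the given name
        f"START SLAVE '{name}';",  # start only the named thread
    ]


# Hypothetical values, for illustration only.
for stmt in move_to_named_connection(
    "db1095", "db1095.example", "db1095-bin.000001", 4
):
    print(stmt)
```

Once the general thread has a name, a second named connection can be added alongside it without the two interfering with each other.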

Krinkle mentioned this in T169486: 2017-07-03 Save Timing spike (300% increase). Jul 6 2017, 5:27 AM

Change 363815 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: db1079 as sanitarium3 master for s7

https://gerrit.wikimedia.org/r/363815

Before starting any replication for s2 or s6 on the labs servers, we need to either upgrade db1102 to 10.1 (which will require a bit of work, as it is running the sanitarium role for multi-instance, and that one doesn't have 10.1) or temporarily change the binlog format on the intermediate masters of s2 and s6 to STATEMENT. Otherwise the triggers will not run on db1102.
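
The constraint above can be summarized in a small sketch. It assumes the behaviour of MariaDB's `slave_run_triggers_for_rbr` variable (available from 10.1, default `NO`): with STATEMENT events the replica re-executes the SQL, so its own triggers fire, while with ROW events they only fire if that variable is enabled. The function name and the version tuples are illustrative, and MIXED is left out on purpose because its behaviour depends on each individual event.

```python
def triggers_run_on_replica(binlog_format, version, run_triggers_for_rbr="NO"):
    """Return True if replica-side triggers fire for replicated events.

    binlog_format: format used by the replica's master ("STATEMENT" or "ROW").
    version: replica's MariaDB version as a tuple, e.g. (10, 0) or (10, 1).
    run_triggers_for_rbr: value of slave_run_triggers_for_rbr on the replica.
    """
    fmt = binlog_format.upper()
    if fmt == "STATEMENT":
        # The replica re-executes the statement, so its triggers fire.
        return True
    if fmt == "ROW":
        # Row events bypass triggers unless 10.1+ is told to run them.
        return version >= (10, 1) and run_triggers_for_rbr.upper() in ("YES", "LOGGING")
    raise ValueError(f"unsupported binlog format in this sketch: {binlog_format!r}")


print(triggers_run_on_replica("ROW", (10, 0)))         # False
print(triggers_run_on_replica("STATEMENT", (10, 0)))   # True
print(triggers_run_on_replica("ROW", (10, 1), "YES"))  # True
```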

Mentioned in SAL (#wikimedia-operations) [2017-07-10T11:36:13Z] <marostegui> Stop MySQL on db1102 for maintenance - T153743

Mentioned in SAL (#wikimedia-operations) [2017-07-10T12:02:54Z] <marostegui> Upgrade db1102 to 10.1 and enable rbr triggers - T153743

Change 363815 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: db1079 as sanitarium3 master for s7

https://gerrit.wikimedia.org/r/363815

Mentioned in SAL (#wikimedia-operations) [2017-07-10T12:26:11Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: db1079 to become master for sanitarium3 - T153743 (duration: 00m 41s)

Mentioned in SAL (#wikimedia-operations) [2017-07-10T12:58:28Z] <marostegui> Run redact_sanitarium on s2 and s6 - db1102 - T153743

Mentioned in SAL (#wikimedia-operations) [2017-07-10T13:28:08Z] <marostegui> Disable puppet on db1102 to run check_private_data - T153743

I have upgraded db1102 to 10.1, so we are now using RBR triggers there.
After sanitizing s2 and s6, I ran the check_private_data script and it reported nothing, so that looks good (I ran that process twice).
I am going to leave both shards replicating and I will run the check_private_data script again tomorrow to make sure the triggers are working fine. I will also check the users table manually.

If it all goes fine, I will import s7 into sanitarium3 tomorrow too.
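
As an illustration of the kind of manual check mentioned above, here is a rough sketch that builds one `COUNT(*)` query per column expected to be redacted. This is not the real check_private_data.py. The function names, the database name and the column list are assumptions made for the example, as is the idea that "sanitized" means NULL or empty.

```python
def leak_check_queries(database, private_columns):
    """Build one query per redacted column that counts rows still
    holding a non-empty value. A sanitized host should return 0 for all."""
    queries = []
    for table, columns in sorted(private_columns.items()):
        for column in columns:
            queries.append(
                f"SELECT COUNT(*) FROM `{database}`.`{table}` "
                f"WHERE `{column}` IS NOT NULL AND `{column}` <> '';"
            )
    return queries


def leaked(counts):
    """Given {query: row_count} results, keep only the queries that found data."""
    return {query: n for query, n in counts.items() if n > 0}


# Hypothetical wiki name and column list, for illustration only.
for q in leak_check_queries("examplewiki", {"user": ["user_email", "user_token"]}):
    print(q)
```

Running the generated queries once right after sanitizing and again a day later would cover both the initial redaction and the triggers applied to newly replicated rows.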

Change 364247 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1079.yaml: Specify ROW as binlog format

https://gerrit.wikimedia.org/r/364247

Change 364372 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1079

https://gerrit.wikimedia.org/r/364372

Change 364372 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1079

https://gerrit.wikimedia.org/r/364372

Mentioned in SAL (#wikimedia-operations) [2017-07-11T07:15:28Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1079 - T153743 (duration: 00m 41s)

Mentioned in SAL (#wikimedia-operations) [2017-07-11T07:26:26Z] <marostegui> Stop MySQL on db1079 for maintenance - T153743

Change 364247 merged by Marostegui:
[operations/puppet@production] db1079.yaml: Specify ROW as binlog format

https://gerrit.wikimedia.org/r/364247

Mentioned in SAL (#wikimedia-operations) [2017-07-11T07:38:25Z] <marostegui> Stop MySQL db1102 for maintenance - T153743

Mentioned in SAL (#wikimedia-operations) [2017-07-11T09:40:14Z] <marostegui> Stop slave s6 on db1102 for exporting its content - T153743

Change 364393 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/software@master] s7.hosts: db1102 now replicates s7

https://gerrit.wikimedia.org/r/364393

Change 364393 merged by jenkins-bot:
[operations/software@master] s7.hosts: db1102 now replicates s7

https://gerrit.wikimedia.org/r/364393

Change 364395 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Repool db1079

https://gerrit.wikimedia.org/r/364395

Change 364395 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Repool db1079

https://gerrit.wikimedia.org/r/364395

Mentioned in SAL (#wikimedia-operations) [2017-07-11T09:54:00Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1079 with low weight - T153743 (duration: 00m 42s)

Mentioned in SAL (#wikimedia-operations) [2017-07-11T15:25:06Z] <marostegui> Stop replication labsdb1009 for maintenance - T153743

Mentioned in SAL (#wikimedia-operations) [2017-07-11T15:29:31Z] <marostegui> Stop replication labsdb1010 for maintenance - T153743

Change 364659 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] sanitarium3.my.cnf: Save binlogs 30 days

https://gerrit.wikimedia.org/r/364659

Change 364659 merged by Marostegui:
[operations/puppet@production] sanitarium3.my.cnf: Save binlogs 30 days

https://gerrit.wikimedia.org/r/364659

Change 365036 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/software@master] s6.hosts: Add labsdb1009 and labsdb1010

https://gerrit.wikimedia.org/r/365036

Change 365036 merged by jenkins-bot:
[operations/software@master] s6.hosts: Add labsdb1009 and labsdb1010

https://gerrit.wikimedia.org/r/365036

Mentioned in SAL (#wikimedia-operations) [2017-07-14T07:22:45Z] <marostegui> Stop replication on labsdb1011 for maintenance - T153743

Change 365233 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/software@master] s6.hosts: Add labsdb1011

https://gerrit.wikimedia.org/r/365233

Mentioned in SAL (#wikimedia-operations) [2017-07-17T06:50:32Z] <marostegui> Stop replication on db1095 for maintenance - T153743

Mentioned in SAL (#wikimedia-operations) [2017-07-17T07:00:16Z] <marostegui> Rename labsdb1011 main replication thread to an specific one - T153743

Change 365233 merged by jenkins-bot:
[operations/software@master] s6.hosts: Add labsdb1011

https://gerrit.wikimedia.org/r/365233

Mentioned in SAL (#wikimedia-operations) [2017-07-17T07:12:39Z] <marostegui> Stop slave s2 on db1102 for maintenance - T153743

Mentioned in SAL (#wikimedia-operations) [2017-07-17T09:05:07Z] <marostegui> Disable puppet on labsdb1009 for maintenance - T153743

Mentioned in SAL (#wikimedia-operations) [2017-07-17T09:22:19Z] <marostegui> Stop replication on labsdb1009 and labsdb1010 for maintenance - T153743

Mentioned in SAL (#wikimedia-operations) [2017-07-17T09:24:38Z] <marostegui> Disable puppet on labsdb1010 for maintenance - T153743

Mentioned in SAL (#wikimedia-operations) [2017-07-17T14:34:31Z] <marostegui> Run maintain-views on labsdb1009,10 and 11 for s6 - T153743

s6 is now available on the new labs servers. I have created the views too.

s2 is being imported now

• Marostegui updated the task description. Jul 17 2017, 2:37 PM

Mentioned in SAL (#wikimedia-operations) [2017-07-20T05:05:18Z] <marostegui> Configure replication for s2 on labsdb1009 and labsdb1010 - T153743

Change 366510 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/software@master] s2.hosts: Add labsdb1009 and labsdb1010

https://gerrit.wikimedia.org/r/366510

s2 has been imported to labsdb1009 and labsdb1010. I will start with labsdb1011 in a bit.
The reason I don't do all the hosts at the same time is basically safety: if for whatever reason the import makes one of the hosts fail, we'd still have labsdb1011 as an intact clone source to reclone the other two.

Views will only be created once the import is done on all the hosts and once another sanitize check has been run.

Mentioned in SAL (#wikimedia-operations) [2017-07-20T07:55:44Z] <marostegui> Start importing s2 into labsdb1011 - T153743

Mentioned in SAL (#wikimedia-operations) [2017-07-20T08:01:47Z] <marostegui> Stop replication on labsdb1011 for maintenance - T153743

Change 366510 merged by jenkins-bot:
[operations/software@master] s2.hosts: Add labsdb1009 and labsdb1010

https://gerrit.wikimedia.org/r/366510

Mentioned in SAL (#wikimedia-operations) [2017-07-24T05:40:34Z] <marostegui> Configure and start s2 replication on labsdb1011 - T153743

Change 367357 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/software@master] s2.hosts: Add labsdb1011 to s2 list of hosts

https://gerrit.wikimedia.org/r/367357

Change 367357 merged by jenkins-bot:
[operations/software@master] s2.hosts: Add labsdb1011 to s2 list of hosts

https://gerrit.wikimedia.org/r/367357

Mentioned in SAL (#wikimedia-operations) [2017-07-24T14:29:50Z] <marostegui> Run maintain-views on labsdb1009, labsdb1010 and labsdb1011 for s2 wikis - T153743

s2 has been imported on labsdb1009, labsdb1010 and labsdb1011. Views have been created so these wikis are now fully available.
labsdb1011 is catching up with replication; it is around 2 days delayed, as it had replication stopped while the import was running.
Since around 6 AM UTC it has gone from 6 days delayed to 2 days, so I reckon it should be up to date later today.

I will start with s7, the last pending shard, on Monday as I am off from tomorrow.
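
The catch-up estimate above can be reproduced with simple arithmetic, assuming the lag keeps shrinking at a constant rate. The elapsed time since 6 AM UTC is not stated in the comment, so the 8 hours used below is a made-up figure, and the function name is likewise illustrative.

```python
def hours_until_caught_up(lag_start_h, lag_now_h, elapsed_h):
    """Estimate the hours left until replication lag reaches zero,
    assuming the lag keeps shrinking at the rate observed so far."""
    recovered = lag_start_h - lag_now_h
    if elapsed_h <= 0 or recovered <= 0:
        raise ValueError("lag is not shrinking; no estimate possible")
    rate = recovered / elapsed_h  # hours of lag recovered per wall-clock hour
    return lag_now_h / rate


# 6 days -> 2 days of lag over a hypothetical 8 hours of catching up.
print(hours_until_caught_up(6 * 24, 2 * 24, 8.0))  # 4.0
```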

Change 367650 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] check_private_data.py: Add socket parameter

https://gerrit.wikimedia.org/r/367650

Change 367650 merged by Marostegui:
[operations/puppet@production] check_private_data.py: Add socket parameter

https://gerrit.wikimedia.org/r/367650

Change 368391 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] labsdb-replicas: Update new labsdb hosts to stretch/systemd

https://gerrit.wikimedia.org/r/368391

Change 368391 merged by Jcrespo:
[operations/puppet@production] labsdb-replicas: Update new labsdb hosts to stretch/systemd

https://gerrit.wikimedia.org/r/368391

Change 368408 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] labsdb: Rename sanitarium2 to sanitarium multisource

https://gerrit.wikimedia.org/r/368408

Change 368408 merged by Jcrespo:
[operations/puppet@production] labsdb: Rename sanitarium2 to sanitarium multisource

https://gerrit.wikimedia.org/r/368408

Mentioned in SAL (#wikimedia-operations) [2017-07-31T07:17:35Z] <marostegui> Stop replication on s7 on db1102 for maintenance - T153743

Mentioned in SAL (#wikimedia-operations) [2017-07-31T09:28:49Z] <marostegui> Stop replication on labsdb1009 and labsdb1010 for maintenance - T153743

Mentioned in SAL (#wikimedia-operations) [2017-07-31T11:12:18Z] <marostegui> Compress s6 on db1102 - T153743

• Marostegui mentioned this in T165233: Data Lake edit data missing for many wikis. Jul 31 2017, 3:22 PM

Mentioned in SAL (#wikimedia-operations) [2017-08-03T05:17:48Z] <marostegui> Stop replication on labsdb1011 for maintenance - T153743

• Marostegui mentioned this in T155041: Prepare and check storage layer for wikimania2018wiki. Aug 4 2017, 9:01 AM

• Marostegui mentioned this in T166344: db1016 m1 master: Possibly faulty BBU. Aug 7 2017, 8:18 AM

• Marostegui updated the task description. Aug 7 2017, 11:32 AM

Hi,

So the last shard pending import (s7) is now on the new labs hosts, which means they now hold all production shards!
The views have been created and the grants added, so this is finally all done!

cloud-services-team: the new hosts are now ready (and so are the views). We might still find issues (e.g. some missing grants), but we'll only find those once more and more users start using them.
We'll probably also need to keep an eye on how they perform and see whether we have to set up query killers and similar measures, as we have on the old labs servers.

Keep in mind that these hosts are read-only and they do not hold user databases - that is a pending discussion that needs to happen to see how we can address that problem.

Thanks Jaime, cloud-services-team and everyone for all the help and work to get this milestone finally achieved!

bd808 mentioned this in T172704: Promote initial use of new Wiki Replica servers. Aug 7 2017, 3:56 PM

bd808 mentioned this in T173513: Create a database on the wikireplica servers called "datasets_p". Aug 21 2017, 3:50 PM

bd808 mentioned this in T173511: Implement technical details and process for "datasets_p" on wikireplica hosts. Aug 27 2017, 12:26 AM

bd808 mentioned this in Blog Post: New Wiki Replica servers ready for use. Sep 1 2017, 12:58 AM

jcrespo removed a subtask: T159423: Meta ticket: Migrate multi-source database hosts to multi-instance. Mar 26 2018, 7:59 PM

jcrespo added a parent task: T159423: Meta ticket: Migrate multi-source database hosts to multi-instance.