
Provision sanitized data on labsdb1009, labsdb1010, labsdb1011 from db1095
Closed, ResolvedPublic

Description

Once db1095 is done we should copy the data over to the new labs servers. Also remember to drop the triggers and enable the events: https://phabricator.wikimedia.org/diffusion/OSOF/browse/master/dbtools/events_labsdb.sql
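For reference, enabling the events on a freshly provisioned replica is roughly the following; this is a sketch only, and the SOURCE path (relative to a checkout of the repository above) is an assumption, not the exact procedure used:

SET GLOBAL event_scheduler = ON;
-- from the mysql client, load the events file:
SOURCE dbtools/events_labsdb.sql;
-- confirm the events are defined and enabled:
SELECT event_schema, event_name, status FROM information_schema.events;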

These are the new labs servers:

labsdb1009.eqiad.wmnet
labsdb1010.eqiad.wmnet
labsdb1011.eqiad.wmnet

Better to leave labsdb1009 until the end, as it is currently being used for testing.

The accounts are being handled on T149933

Event Timeline

The transfer from db1095 to labsdb1010 has just started

Change 324899 had a related patch set uploaded (by Jcrespo):
Enable new TLS certs on labsdb hosts

https://gerrit.wikimedia.org/r/324899

Change 324899 merged by Jcrespo:
Enable new TLS certs on labsdb hosts

https://gerrit.wikimedia.org/r/324899

Change 324900 had a related patch set uploaded (by Jcrespo):
Update new labsdb configuration template

https://gerrit.wikimedia.org/r/324900

Change 324900 merged by Jcrespo:
Update new labsdb configuration template

https://gerrit.wikimedia.org/r/324900

labsdb1010 now has the data from db1095 and is catching up.
I have applied the events too: https://phabricator.wikimedia.org/diffusion/OSOF/browse/master/dbtools/events_labsdb.sql

Pending: drop triggers
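Dropping the leftover triggers would look roughly like this; the schema and trigger names below are placeholders, not the real ones:

-- find any triggers that came over with the copy:
SELECT trigger_schema, trigger_name, event_object_table
FROM information_schema.triggers;
-- drop each one found, per wiki database:
DROP TRIGGER IF EXISTS enwiki.some_sanitarium_trigger;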

Replication had broken because events from the sanitarium, the production slaves and labs were all running there, conflicting on the information_schema_p tables. We need to drop all the production events on all these 4 servers and make sure the one populating information_schema_p only runs on the sanitariums.

Additionally, I think it would be better if we set up one separate replication channel per shard, so if something goes wrong or is slow, it only affects one shard (and we can do maintenance independently on each shard).

> Replication had broken because events from the sanitarium, the production slaves and labs were all running there, conflicting on the information_schema_p tables. We need to drop all the production events on all these 4 servers and make sure the one populating information_schema_p only runs on the sanitariums.

Ah, that is because the events were loaded on db1095 and were of course included when I did the transfer. Too bad.
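Cleaning up the inherited production events would be along these lines; the schema and event names here are placeholders:

-- list everything the copy brought over:
SELECT event_schema, event_name, status FROM information_schema.events;
-- drop the ones that should only run on production/sanitarium hosts:
DROP EVENT IF EXISTS ops.some_production_event;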

> Additionally, I think it would be better if we set up one separate replication channel per shard, so if something goes wrong or is slow, it only affects one shard (and we can do maintenance independently on each shard).

I am not sure if we can set one replication channel per shard on labs, as both channels would be replicating from the same master id, and I am unsure (off the top of my head) how that is going to behave.
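For what it's worth, MariaDB multi-source replication uses named connections, so a per-shard channel would look something like the sketch below. The connection name, credentials and coordinates are placeholders, and the same-master concern above still applies:

CHANGE MASTER 's1' TO
MASTER_HOST = 'db1095.eqiad.wmnet',
MASTER_USER = 'repl',
MASTER_PASSWORD = '...',
MASTER_LOG_FILE = 'db1095-bin.000001',
MASTER_LOG_POS = 4;
START SLAVE 's1';
SHOW SLAVE 's1' STATUS;
-- or manage all channels at once:
START ALL SLAVES;
SHOW ALL SLAVES STATUS;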

Change 325176 had a related patch set uploaded (by Jcrespo):
mariadb: Update check private data script to handle BINARY fields

https://gerrit.wikimedia.org/r/325176

Change 325255 had a related patch set uploaded (by Marostegui):
labsdb-replica: Disable parallel replication

https://gerrit.wikimedia.org/r/325255

Mentioned in SAL (#wikimedia-operations) [2016-12-05T08:01:13Z] <marostegui> Stop MySQL labsdb1010 - maintenance T152194

I have started transferring data from labsdb1010 to labsdb1011. Once we have both up, we can try testing replication in different channels on one of the servers.

Change 325255 merged by Marostegui:
labsdb-replica: Disable parallel replication

https://gerrit.wikimedia.org/r/325255
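The patch presumably does this through the my.cnf template; the runtime equivalent on a MariaDB replica would be roughly the following sketch:

-- slave_parallel_threads can only be changed while the slaves are stopped:
STOP ALL SLAVES;
SET GLOBAL slave_parallel_threads = 0;
START ALL SLAVES;
SELECT @@slave_parallel_threads;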

labsdb1011 is up and running with a single replication channel. I will test two channels as well.

Change 325176 merged by Jcrespo:
mariadb: Update check private data script to handle BINARY fields

https://gerrit.wikimedia.org/r/325176

db1095$ check_private_data.py 
-- Non-public databases that are present:
DROP DATABASE IF EXISTS `test`;
-- Non-public tables that are present:
-- Unfiltered columns that are present:

:-)

> db1095$ check_private_data.py 
> -- Non-public databases that are present:
> DROP DATABASE IF EXISTS `test`;
> -- Non-public tables that are present:
> -- Unfiltered columns that are present:
>
> :-)

Nice job!!!! :-)

Mentioned in SAL (#wikimedia-operations) [2016-12-05T13:04:00Z] <marostegui> Stopping mysql labsdb1010 and labsdb1009 for maintenance - T152194

I have started transferring the data to labsdb1009.
Also took a backup of the existing data, just in case @chasemp needs it: labsdb1009:/srv/tmp/labsdb1009.sql - if not, let me know and I will nuke the file.

Change 325303 had a related patch set uploaded (by Marostegui):
mariadb: Added gtid_domain_id variable

https://gerrit.wikimedia.org/r/325303

labsdb1009 is now up and running. The three servers are replicating fine.

I have also enabled SSL.
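A quick way to double-check TLS on the replication side (just a sketch, not the actual verification used):

SHOW GLOBAL VARIABLES LIKE 'have_ssl';
-- Master_SSL_Allowed / Master_SSL_Cipher in the slave status show whether the channel uses TLS:
SHOW SLAVE STATUS\G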

Regarding innodb_buffer_pool_size, it is currently 75% of the total memory. Shall we decrease it to... 50% to start with?
I have checked the current labs servers and they have the buffer pool size set to 25% of RAM.

> I have checked the current labs servers and they have the buffer pool size set to 25% of RAM.

That was back when TokuDB was the main engine.

I would put it at 75% on the web ones, and maybe increase it later, and at 60% on the analytics one, and maybe decrease it later.
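Checking and adjusting the value would be something like the sketch below; the target number is a placeholder, and on the MariaDB versions in use a change most likely needs a restart after editing the configuration template:

-- current value, in GiB:
SELECT @@innodb_buffer_pool_size / POW(1024, 3) AS buffer_pool_gib;
-- target goes into the my.cnf template, e.g.:
-- innodb_buffer_pool_size = 300G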

I have added the 3 new labsdb hosts to tendril, cleaned up their accounts, and added the admin ones that the existing labs hosts have.
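Reviewing the accounts is just the following (the account names below are placeholders):

SELECT user, host FROM mysql.user ORDER BY user, host;
SHOW GRANTS FOR 'some_old_account'@'10.%';
DROP USER 'some_old_account'@'10.%';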

Change 325303 merged by Marostegui:
mariadb: Added gtid_domain_id variable

https://gerrit.wikimedia.org/r/325303

The 3 new labsdb hosts and sanitarium2 now have the gtid_domain_id variable deployed and enabled.
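As a sketch, the runtime side of that is just the following; the id value is a placeholder, the real ones come from the merged puppet change:

SET GLOBAL gtid_domain_id = 1095;
SELECT @@gtid_domain_id, @@gtid_current_pos;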

I have seen that modules/role/manifests/labs/db/replica.pp already includes the firewall class:

include role::mariadb::monitor
include base::firewall
include role::mariadb::ferm
include passwords::misc::scripts

We probably need custom rules, rather than role::mariadb::ferm.

jcrespo claimed this task.