
Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy
Closed, Resolved | Public

Description

db1095 is still in multi-source format, hosting s1, s3, s5 and s8. We may soon purchase an extra host to handle the growing capacity needs of the different sections. This is the main standalone goal for #DBAs for 2017-2018 Q4 (there could be other, more important tasks and goals, but this one should have no hard external dependencies other than hardware provisioning).

  • Design a strategy (do we set up a db1095-equivalent to prevent downtime? Distribution of sections?)
  • Define the purchases needed for the new host(s) and request them (T189590) [Note: the decision seems firm, but the purchases have not yet gone through]
  • Set up the new hardware needed. If it doesn't arrive on time, maybe use one of the new 8-core hosts as a temporary measure
    • eqiad T194780
    • codfw T194781
    • Copied to a temporary new host until the new definitive hardware arrives (db1116, db1120, db2075, db2092)
  • Fix sanitarium_multiinstance puppetization https://gerrit.wikimedia.org/r/#/c/425087/
  • Copy data needed for the service
    • eqiad
      • db1124
      • db1125
    • codfw
      • db2094
      • db2095
  • Change codfw sanitariums (db2094 and db2095) to replicate from codfw hosts instead of from eqiad ones.
    • db2094
      • s1
      • s3
      • s5
      • s8
    • db2095
      • s2
      • s4
      • s6
      • s7
  • Switch over the replicas (labsdb hosts) to use the new hosts
    • labsdb1009
    • labsdb1010
    • labsdb1011
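
The codfw section distribution listed above can be written down as a small lookup table. This is only an illustrative sketch: the host and section names come from the task description, while the constant and function names are hypothetical and not part of any Wikimedia tooling.

```python
# Section layout of the codfw sanitarium hosts, as listed in the task description.
# CODFW_SANITARIUM_SECTIONS and host_for_section are hypothetical names.
CODFW_SANITARIUM_SECTIONS = {
    "db2094": ["s1", "s3", "s5", "s8"],
    "db2095": ["s2", "s4", "s6", "s7"],
}


def host_for_section(section, layout=CODFW_SANITARIUM_SECTIONS):
    """Return the sanitarium host that carries `section`.

    Raises KeyError if the section is missing or assigned to more than one host.
    """
    hosts = [host for host, sections in layout.items() if section in sections]
    if len(hosts) != 1:
        raise KeyError(f"{section!r} is assigned to {len(hosts)} hosts, expected exactly 1")
    return hosts[0]
```

For example, `host_for_section("s4")` returns `"db2095"`, which matches the later comment about moving db2095:s4.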

Follow up / clean up tasks:
Clean up sanitarium_multisource related code T196376
Productionize old/temporary eqiad sanitariums T196527
Implement a script to facilitate sanitarium failovers between DCs T196367

Event Timeline


Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts:

['db2092.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201805310831_marostegui_27000.log.

Completed auto-reimage of hosts:

['db2092.codfw.wmnet']

and were ALL successful.

Change 436506 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool sanitarium masters

https://gerrit.wikimedia.org/r/436506

Change 436506 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool sanitarium masters

https://gerrit.wikimedia.org/r/436506

Mentioned in SAL (#wikimedia-operations) [2018-05-31T11:14:22Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool sanitarium masters - T190704 (duration: 01m 22s)

Mentioned in SAL (#wikimedia-operations) [2018-05-31T12:30:21Z] <marostegui> Stop replication on all sanitarium masters - T190704

labsdb1009 has been moved under the new sanitarium hosts. We will leave it replicating until Monday before proceeding with labsdb1010.

Mentioned in SAL (#wikimedia-operations) [2018-05-31T13:41:44Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool sanitarium masters - T190704 (duration: 01m 21s)

Change 436724 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Repool db2092 and db2062

https://gerrit.wikimedia.org/r/436724

Change 436724 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Repool db2092 and db2062

https://gerrit.wikimedia.org/r/436724

Change 436725 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Move db2075 back to s5

https://gerrit.wikimedia.org/r/436725

Change 436725 merged by Marostegui:
[operations/puppet@production] mariadb: Move db2075 back to s5

https://gerrit.wikimedia.org/r/436725

Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts:

['db2075.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201806010545_marostegui_5586.log.

Completed auto-reimage of hosts:

['db2075.codfw.wmnet']

and were ALL successful.

Mentioned in SAL (#wikimedia-operations) [2018-06-01T06:15:38Z] <marostegui> Stop MySQL on db2059 to clone db2075 - T190704

Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts:

['db2059.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201806010723_marostegui_26798.log.

Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts:

['db2059.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201806010746_marostegui_8297.log.

Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts:

['db2059.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201806010813_marostegui_14194.log.

Completed auto-reimage of hosts:

['db2059.codfw.wmnet']

Of which those FAILED:

['db2059.codfw.wmnet']

Change 437165 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1121

https://gerrit.wikimedia.org/r/437165

Change 437165 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1121

https://gerrit.wikimedia.org/r/437165

Mentioned in SAL (#wikimedia-operations) [2018-06-04T05:50:34Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool db1121 - T190704 (duration: 00m 49s)

Mentioned in SAL (#wikimedia-operations) [2018-06-04T05:52:12Z] <marostegui> Stop replication in sync on db1121 and db2051 - T190704

db2095:s4 has finally been moved under db2073, as db2051 (codfw master) had already caught up with eqiad, and so had db2073 (the sanitarium master for that section).
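
Moving a replica between masters this way relies on the hosts involved having stopped at the same upstream position. A minimal sketch of that check follows, assuming db1121 and db2051 replicate from the same upstream master and that their `SHOW SLAVE STATUS` output has been loaded into dictionaries. The function name and the sample coordinates are hypothetical; only the two column names are real `SHOW SLAVE STATUS` fields.

```python
def stopped_at_same_position(status_a, status_b):
    """Return True if both replicas executed up to the same upstream binlog coordinates."""
    keys = ("Relay_Master_Log_File", "Exec_Master_Log_Pos")
    return all(status_a[key] == status_b[key] for key in keys)


# Hypothetical coordinates, for illustration only (not taken from the real hosts).
db1121_status = {"Relay_Master_Log_File": "example-bin.000123", "Exec_Master_Log_Pos": 4567}
db2051_status = {"Relay_Master_Log_File": "example-bin.000123", "Exec_Master_Log_Pos": 4567}

if stopped_at_same_position(db1121_status, db2051_status):
    print("Both hosts stopped in sync; the replica can be repointed safely.")
```

This is not necessarily how the move was actually verified; it only illustrates the condition that "stop replication in sync" is meant to guarantee.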

Mentioned in SAL (#wikimedia-operations) [2018-06-04T06:05:26Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Repool db1121 - T190704 (duration: 00m 49s)

Change 437172 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Repool db2059 and db2075

https://gerrit.wikimedia.org/r/437172

Change 437172 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Repool db2059 and db2075

https://gerrit.wikimedia.org/r/437172

Mentioned in SAL (#wikimedia-operations) [2018-06-04T06:18:41Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Repool db2059, db2075 - T190704 (duration: 00m 49s)

Change 437179 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] redact_sanitarium.sh: Add db1124,db1125

https://gerrit.wikimedia.org/r/437179

Change 437179 merged by Marostegui:
[operations/puppet@production] redact_sanitarium.sh: Add db1124,db1125

https://gerrit.wikimedia.org/r/437179

Change 437204 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] dbproxy1010: Depool labsdb1010

https://gerrit.wikimedia.org/r/437204

Change 437204 merged by Marostegui:
[operations/puppet@production] dbproxy1010: Depool labsdb1010

https://gerrit.wikimedia.org/r/437204

Mentioned in SAL (#wikimedia-operations) [2018-06-04T09:39:34Z] <marostegui> Reload haproxy on dbproxy1010 to depool labsdb1010 - https://phabricator.wikimedia.org/T190704

Change 437235 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool all sanitarium masters

https://gerrit.wikimedia.org/r/437235

Change 437235 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool all sanitarium masters

https://gerrit.wikimedia.org/r/437235

Mentioned in SAL (#wikimedia-operations) [2018-06-04T14:05:45Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool all sanitariums masters - T190704 (duration: 00m 49s)

Mentioned in SAL (#wikimedia-operations) [2018-06-04T14:09:59Z] <marostegui> Stop replication on all sanitarium masters to move labsdb1010 to another sanitarium host - T190704

labsdb1010 was switched over to the new sanitarium hosts.

Mentioned in SAL (#wikimedia-operations) [2018-06-04T14:34:05Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Repool all sanitariums masters - T190704 (duration: 00m 49s)

Mentioned in SAL (#wikimedia-operations) [2018-06-04T14:34:16Z] <marostegui> Reload haproxy on dbproxy1010 to repool labsdb1010 - T190704
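
The labsdb1010 switchover above follows a fixed order: depool the replica, depool the sanitarium masters, stop replication, move the replica, then repool everything. As a rough sketch, the SAL timestamps can be used to estimate how long each depool window lasted. The timestamps are copied from the SAL entries; the step labels and the helper name are hypothetical, and the results are approximate because SAL records when an entry was logged, not when the action took effect.

```python
from datetime import datetime

SAL_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

# Timestamps copied from the SAL entries for the labsdb1010 switchover.
# The step labels are shorthand chosen for this sketch.
LABSDB1010_STEPS = [
    ("2018-06-04T09:39:34Z", "depool labsdb1010"),
    ("2018-06-04T14:05:45Z", "depool sanitarium masters"),
    ("2018-06-04T14:09:59Z", "stop replication"),
    ("2018-06-04T14:34:05Z", "repool sanitarium masters"),
    ("2018-06-04T14:34:16Z", "repool labsdb1010"),
]


def seconds_between(steps, start_label, end_label):
    """Return the number of seconds between two labelled steps."""
    times = {label: datetime.strptime(stamp, SAL_FORMAT) for stamp, label in steps}
    return int((times[end_label] - times[start_label]).total_seconds())


masters_window = seconds_between(
    LABSDB1010_STEPS, "depool sanitarium masters", "repool sanitarium masters"
)
print(masters_window)  # 1700 seconds, i.e. 28 minutes 20 seconds
```

By the same calculation, labsdb1010 itself was depooled for 17682 seconds (just under five hours), most of which passed before the sanitarium masters were touched.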

Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts:

['db2059.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201806050544_marostegui_25872.log.

Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts:

['db2059.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201806050556_marostegui_28049.log.

Completed auto-reimage of hosts:

['db2059.codfw.wmnet']

and were ALL successful.

Change 437670 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] dbproxy1010: Depool labsdb1010

https://gerrit.wikimedia.org/r/437670

Change 437670 merged by Marostegui:
[operations/puppet@production] dbproxy1010: Depool labsdb1010

https://gerrit.wikimedia.org/r/437670

Mentioned in SAL (#wikimedia-operations) [2018-06-06T05:24:07Z] <marostegui> Reload haproxy on dbproxy1010 to depool labsdb1010 - T190704

Mentioned in SAL (#wikimedia-operations) [2018-06-06T05:31:56Z] <marostegui> Reload haproxy on dbproxy1010 to repool labsdb1010 - T190704

Change 437674 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] dbproxy1010: Depool labsdb1011

https://gerrit.wikimedia.org/r/437674

Change 437674 merged by Marostegui:
[operations/puppet@production] dbproxy1010: Depool labsdb1011

https://gerrit.wikimedia.org/r/437674

Mentioned in SAL (#wikimedia-operations) [2018-06-06T05:46:29Z] <marostegui> Reload haproxy on dbproxy1010 to depool labsdb1011 - T190704

Change 437676 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool all sanitarium masters

https://gerrit.wikimedia.org/r/437676

Change 437676 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool all sanitarium masters

https://gerrit.wikimedia.org/r/437676

Mentioned in SAL (#wikimedia-operations) [2018-06-06T06:04:46Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool all sanitariums masters - T190704 (duration: 01m 09s)

Mentioned in SAL (#wikimedia-operations) [2018-06-06T07:48:54Z] <marostegui> Stop replication on all sanitarium masters to move labsdb1011 - T190704

Mentioned in SAL (#wikimedia-operations) [2018-06-06T08:14:57Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Repool all sanitariums masters - T190704 (duration: 00m 57s)

Mentioned in SAL (#wikimedia-operations) [2018-06-06T08:25:17Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Repool all sanitariums masters - T190704 (duration: 00m 56s)

labsdb1011 has been moved over to the new sanitarium hosts.
This was the last host to be moved.
Let's wait to make sure it goes fine.

Mentioned in SAL (#wikimedia-operations) [2018-06-06T08:29:05Z] <marostegui> Reload haproxy on dbproxy1010 to repool labsdb1011 - T190704

There is a script operations/software/dbtools/events_sanitarium.sql that should be checked, updated and deployed (?) to all sanitarium hosts.

I do see it is deployed on db1095 and on db1102, in the ops database.
It needs some checking, but I guess there are lots of historical reasons behind those triggers, so I am not sure why they were created or whether they are still needed.

if they are really needed anymore

They are needed; how many changes they need is a different matter, but cloud users rely on information_schema_p (and so do we, so that they don't bring down production trying to query information_schema).

I definitely think we do not need the ops database ones on sanitarium hosts. Those are probably entries from the source this particular instance on db1102 was built from, with the ops database copied over along the way; pretty much the same applies to the db1095 one:

mysql:root@localhost [ops]> select * from event_log;
+-----------+---------------------+-------------------------------+------------------------------------------------------------------------------------------------------------------------+
| server_id | stamp               | event                         | content                                                                                                                |
+-----------+---------------------+-------------------------------+------------------------------------------------------------------------------------------------------------------------+
|    104822 | 2015-01-12 12:22:33 | wmf_slave_wikiuser_slow       | kill 12225880154; SELECT /* SpecialWhatLinksHere::showIndirectLinks 178.255.215.84 */  page_id,page_namespace,page_tit |
| 171974686 | 2017-07-03 15:00:05 | wmf_slave_wikiuser_sleep      | kill 3804646727                                                                                                        |
| 171974686 | 2017-07-03 16:21:05 | wmf_slave_wikiuser_sleep      | kill 3806982004                                                                                                        |
| 171974686 | 2017-07-04 00:16:33 | wmf_slave_wikiuser_slow (>60) | kill 3819361168; SELECT /* ApiQueryRecentChanges::run  */  rc_id,rc_timestamp,rc_namespace,rc_title,rc_cur_id,rc_type  |
| 171974686 | 2017-07-04 00:16:33 | wmf_slave_wikiuser_slow (>60) | kill 3819361168; SELECT /* ApiQueryRecentChanges::run  */  rc_id,rc_timestamp,rc_namespace,rc_title,rc_cur_id,rc_type  |
| 171974686 | 2017-07-04 00:18:03 | wmf_slave_wikiuser_slow (>60) | kill 3819392154; SELECT /* ApiQueryRecentChanges::run  */  rc_id,rc_timestamp,rc_namespace,rc_title,rc_cur_id,rc_type  |
| 171974686 | 2017-07-04 00:18:03 | wmf_slave_wikiuser_slow (>60) | kill 3819392154; SELECT /* ApiQueryRecentChanges::run  */  rc_id,rc_timestamp,rc_namespace,rc_title,rc_cur_id,rc_type  |
+-----------+---------------------+-------------------------------+------------------------------------------------------------------------------------------------------------------------+
7 rows in set (0.00 sec)

mysql:root@localhost [ops]> show events;
+-----+--------------------------+----------------+-----------+-----------+------------+----------------+----------------+---------------------+------+---------+------------+----------------------+----------------------+--------------------+
| Db  | Name                     | Definer        | Time zone | Type      | Execute at | Interval value | Interval field | Starts              | Ends | Status  | Originator | character_set_client | collation_connection | Database Collation |
+-----+--------------------------+----------------+-----------+-----------+------------+----------------+----------------+---------------------+------+---------+------------+----------------------+----------------------+--------------------+
| ops | wmf_slave_overload       | root@localhost | SYSTEM    | RECURRING | NULL       | 10             | SECOND         | 2017-04-28 00:00:01 | NULL | ENABLED |  171974686 | utf8                 | utf8_general_ci      | binary             |
| ops | wmf_slave_purge          | root@localhost | SYSTEM    | RECURRING | NULL       | 15             | MINUTE         | 2017-04-28 00:00:00 | NULL | ENABLED |  171974686 | utf8                 | utf8_general_ci      | binary             |
| ops | wmf_slave_wikiuser_sleep | root@localhost | SYSTEM    | RECURRING | NULL       | 30             | SECOND         | 2017-04-28 00:00:05 | NULL | ENABLED |  171974686 | utf8                 | utf8_general_ci      | binary             |
| ops | wmf_slave_wikiuser_slow  | root@localhost | SYSTEM    | RECURRING | NULL       | 30             | SECOND         | 2017-04-28 00:00:03 | NULL | ENABLED |  171974686 | utf8                 | utf8_general_ci      | binary             |
+-----+--------------------------+----------------+-----------+-----------+------------+----------------+----------------+---------------------+------+---------+------------+----------------------+----------------------+--------------------+
4 rows in set (0.00 sec)
mysql:root@localhost [ops]> select @@hostname;
+------------+
| @@hostname |
+------------+
| db1095     |
+------------+
1 row in set (0.00 sec)

mysql:root@localhost [ops]> select * from event_log;
+-----------+---------------------+--------------------------+------------------------------------------------------------------------------------------------------------------------+
| server_id | stamp               | event                    | content                                                                                                                |
+-----------+---------------------+--------------------------+------------------------------------------------------------------------------------------------------------------------+
|    101616 | 2014-08-31 10:03:35 | wmf_slave_wikiuser_sleep | kill 6537719837                                                                                                        |
|    101616 | 2014-08-31 10:04:05 | wmf_slave_wikiuser_sleep | kill 6537740876                                                                                                        |
|    101616 | 2014-08-31 10:04:35 | wmf_slave_wikiuser_sleep | kill 6537748124                                                                                                        |
|    101616 | 2014-08-31 18:05:35 | wmf_slave_wikiuser_sleep | kill 6553774149                                                                                                        |
|    101616 | 2014-08-31 20:32:35 | wmf_slave_wikiuser_sleep | kill 6558617326                                                                                                        |
|    101616 | 2014-08-31 20:38:05 | wmf_slave_wikiuser_sleep | kill 6558790024                                                                                                        |
|    101616 | 2014-08-31 20:39:05 | wmf_slave_wikiuser_sleep | kill 6558806160                                                                                                        |
|    101616 | 2014-08-31 23:29:05 | wmf_slave_wikiuser_sleep | kill 6564200174                                                                                                        |
|    101633 | 2015-03-29 09:23:03 | wmf_slave_wikiuser_slow  | kill 10384582902; SELECT /* SpecialWhatLinksHere::showIndirectLinks redacted */  page_id,page_namespace,page_titl |
|    101633 | 2015-03-29 09:23:03 | wmf_slave_wikiuser_slow  | kill 10384582902; SELECT /* SpecialWhatLinksHere::showIndirectLinks redacted */  page_id,page_namespace,page_titl |
|    101633 | 2015-03-29 09:24:03 | wmf_slave_wikiuser_slow  | kill 10384607000; SELECT /* SpecialWhatLinksHere::showIndirectLinks redacted */  page_id,page_namespace,page_titl |
|    101633 | 2015-03-29 09:24:03 | wmf_slave_wikiuser_slow  | kill 10384607000; SELECT /* SpecialWhatLinksHere::showIndirectLinks redacted */  page_id,page_namespace,page_titl |
|    101633 | 2015-03-30 00:49:33 | wmf_slave_wikiuser_slow  | kill 10409920681; SELECT /* SpecialWhatLinksHere::showIndirectLinks redacted ... */  page_id,page_namespace,page |
|    101633 | 2015-03-30 00:49:33 | wmf_slave_wikiuser_slow  | kill 10409920681; SELECT /* SpecialWhatLinksHere::showIndirectLinks redacted ... */  page_id,page_namespace,page |
+-----------+---------------------+--------------------------+------------------------------------------------------------------------------------------------------------------------+
14 rows in set (0.02

We can probably clean up the ones in the ops database and deploy the information_schema ones.

Never mind my comments above. They have nothing to do with the sanitarium events.
The ones in the file are indeed needed. I will deploy them tomorrow.

Change 437802 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/software@master] events_sanitarium: Update sanitarium hosts

https://gerrit.wikimedia.org/r/437802

Change 437802 merged by jenkins-bot:
[operations/software@master] events_sanitarium: Update sanitarium hosts

https://gerrit.wikimedia.org/r/437802

Mentioned in SAL (#wikimedia-operations) [2018-06-07T05:15:06Z] <marostegui> Deploy event_sanitarium on codfw sanitariums - T190704

Mentioned in SAL (#wikimedia-operations) [2018-06-07T06:26:26Z] <marostegui> Deploy sanitarium events on db1125 - T190704

Mentioned in SAL (#wikimedia-operations) [2018-06-08T05:34:28Z] <marostegui> Deploy sanitarium events on db1124 - T190704

Change 438215 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad,db.codfw.php: Unify and update sanitarium comments

https://gerrit.wikimedia.org/r/438215

Change 438215 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad,db.codfw.php: Unify and update sanitarium comments

https://gerrit.wikimedia.org/r/438215

Mentioned in SAL (#wikimedia-operations) [2018-06-08T09:18:21Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Unify and update sanitarium comments - T190704 (duration: 00m 50s)

Mentioned in SAL (#wikimedia-operations) [2018-06-08T09:19:30Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Unify and update sanitarium comments - T190704 (duration: 00m 50s)

Marostegui closed this task as Resolved. Jun 11 2018, 5:53 AM
Marostegui claimed this task.

Everything has been fine for more than a week now (including the events).
Today I even restarted MySQL on all the sanitariums to pick up a new filter. I think we can now consider the scope of the goal completed.
There are some follow-up/clean-up tasks, each of which already has its own task for tracking (T196527, T196376, T196367).

Marostegui updated the task description.