
Productionize clouddb10[13-20]
Open, Medium, Public

Description

The following hosts will be part of the new wiki replicas infrastructure, which will eventually replace the existing labsdb hosts.

  • clouddb1013
  • clouddb1014
  • clouddb1015
  • clouddb1016
  • clouddb1017
  • clouddb1018
  • clouddb1019
  • clouddb1020
  • Apply lvextend -L+1100G /dev/mapper/tank-data && xfs_growfs /srv on each host (a verification sketch follows this list).
  • All hosts added to Tendril and Zarcillo
    • clouddb1013:3311
    • clouddb1013:3313
    • clouddb1014:3312
    • clouddb1014:3317
    • clouddb1015:3314
    • clouddb1015:3316
    • clouddb1016:3315
    • clouddb1016:3318
    • clouddb1017:3311
    • clouddb1017:3313
    • clouddb1018:3312
    • clouddb1018:3317
    • clouddb1019:3314
    • clouddb1019:3316
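
After running the lvextend/xfs_growfs command on a host, the result can be verified with something like the following (a sketch; the clouddb1013 prompt is just an example):

root@clouddb1013:~# lvs /dev/mapper/tank-data
root@clouddb1013:~# df -h /srv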

Change the root password to match the one used on the labsdb hosts rather than the ones from the sanitarium hosts (from which these instances are being cloned); a sketch follows the list:

  • clouddb1013:3311
  • clouddb1013:3313
  • clouddb1014:3312
  • clouddb1014:3317
  • clouddb1015:3314
  • clouddb1015:3316
  • clouddb1016:3315
  • clouddb1016:3318
  • clouddb1017:3311
  • clouddb1017:3313
  • clouddb1018:3312
  • clouddb1018:3317
  • clouddb1019:3314
  • clouddb1019:3316
  • clouddb1020:3315
  • clouddb1020:3318
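
On each instance this boils down to a statement along these lines (a sketch; the labsdb root password is a placeholder and the socket path varies per instance):

root@clouddb1013:~# mysql -S /run/mysqld/mysqld.s1.sock -e "SET PASSWORD FOR 'root'@'localhost' = PASSWORD('<labsdb root password>');"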

Double-check private data on all hosts before considering them fully populated with data:

  • clouddb1013:3311
  • clouddb1013:3313
  • clouddb1014:3312
  • clouddb1014:3317
  • clouddb1015:3314
  • clouddb1015:3316
  • clouddb1016:3315
  • clouddb1016:3318
  • clouddb1017:3311
  • clouddb1017:3313
  • clouddb1018:3312
  • clouddb1018:3317
  • clouddb1019:3314
  • clouddb1019:3316
  • clouddb1020:3315
  • clouddb1020:3318

Check that GTID is enabled on all instances:

  • clouddb1013:3311
  • clouddb1013:3313
  • clouddb1014:3312
  • clouddb1014:3317
  • clouddb1015:3314
  • clouddb1015:3316
  • clouddb1016:3315
  • clouddb1016:3318
  • clouddb1017:3311
  • clouddb1017:3313
  • clouddb1018:3312
  • clouddb1018:3317
  • clouddb1019:3314
  • clouddb1019:3316
  • clouddb1020:3315
  • clouddb1020:3318

Sections per host are still to be decided; see the proposal at T265135#6598952.
The Puppet roles to be applied are wmcs::db::wikireplicas::web_multiinstance and wmcs::db::wikireplicas::analytics_multiinstance.

Event Timeline


Attempting the same on s6 with:

Running a check on s6 tables on db1125

clouddb1015:3316 innodb_change_buffering=none and event_scheduler=OFF (make sure all the triggers are removed)
clouddb1019:3316 innodb_change_buffering=none and event_scheduler=ON (leave the triggers in place)
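
For reference, both settings can be changed at runtime on the target instance (a minimal sketch using the s6 socket; persisting them across restarts still requires the corresponding my.cnf change):

root@clouddb1015:~# mysql -S /run/mysqld/mysqld.s6.sock -e "SET GLOBAL innodb_change_buffering='none'; SET GLOBAL event_scheduler=OFF;"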

This came back clean; tomorrow I will do the transfer for s6 to the above hosts.


Mentioned in SAL (#wikimedia-operations) [2020-11-19T06:08:10Z] <marostegui> Stop mysql on db1125:3316 to clone clouddb1015 and clouddb1019, there will be lag on s6 on wikireplicas - T267090

Change 641874 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] check_private_data: Add clouddb1015 and clouddb1019

https://gerrit.wikimedia.org/r/641874

Change 641874 merged by Marostegui:
[operations/puppet@production] check_private_data: Add clouddb1015 and clouddb1019

https://gerrit.wikimedia.org/r/641874


Data has been transferred to clouddb1015:3316 and clouddb1019:3316.

  • Triggers cleaned (a removal sketch follows this list)
  • mysql_upgrade done
  • Now running CHECK TABLE on both hosts
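
The trigger cleanup is presumably scripted; as a minimal sketch, the DROP statements for every trigger on an instance can be generated and executed in one go:

root@clouddb1015:~# mysql -S /run/mysqld/mysqld.s6.sock -NBe "SELECT CONCAT('DROP TRIGGER ', trigger_schema, '.', trigger_name, ';') FROM information_schema.triggers" | mysql -S /run/mysqld/mysqld.s6.sock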

Once the check is done, I will configure replication using:

root@db1125:~# mysql -S /run/mysqld/mysqld.s6.sock -e "show master status\G"
*************************** 1. row ***************************
            File: db1125-bin.001921
        Position: 197854833
    Binlog_Do_DB:
Binlog_Ignore_DB:
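
Those coordinates translate into a CHANGE MASTER TO statement along these lines on clouddb1015:3316 and clouddb1019:3316 (a sketch: the replication user and password are placeholders, and the master port is assumed to be 3316 to match the s6 instance on db1125):

root@clouddb1015:~# mysql -S /run/mysqld/mysqld.s6.sock -e "CHANGE MASTER TO master_host='db1125.eqiad.wmnet', master_port=3316, master_user='repl', master_password='<redacted>', master_log_file='db1125-bin.001921', master_log_pos=197854833; START SLAVE;"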

Marostegui updated the task description. Thu, Nov 19, 7:40 AM

This is very bad news.
clouddb1013:3311 (s1) and clouddb1017:3311 (s1) crashed at the same time with the same error we have seen before:

Nov 17 22:48:56 clouddb1013 mysqld[31534]: 2020-11-17 22:48:56 1 [ERROR] InnoDB: Unable to find a record to delete-mark

They were both cloned from db1124:3311 (sanitarium).
I am going to update the MariaDB bug, copy the data again after running a CHECK TABLE on db1124:3311, and start it with innodb_change_buffering=none.

The checks on db1124:3311 came back clean, so I am going to transfer the data again to:

clouddb1013: innodb_change_buffering=none and event_scheduler=OFF (make sure all the triggers are removed)
clouddb1017: innodb_change_buffering=none and event_scheduler=ON (leave triggers)

I will run a check once the transfer is done.
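
The check itself can be run against the whole instance with mysqlcheck (a sketch; in practice it may be limited to the suspect tables):

root@clouddb1013:~# mysqlcheck -S /run/mysqld/mysqld.s1.sock --all-databases --check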

Mentioned in SAL (#wikimedia-operations) [2020-11-19T09:40:05Z] <marostegui> Stop mysql on db1124:3311 to clone clouddb1013 and clouddb1017, there will be lag on s1 on wikireplicas - T267090


The table checks came back clean on db1125:3316, clouddb1015:3316 and clouddb1019:3316.
I have started replication on both hosts.

Configuration flags:

# for i in clouddb1015:3316 clouddb1019:3316; do echo "###$i###"; mysql.py -h$i -e "show global variables like 'event_scheduler'; show global variables like 'innodb_change_buffering'";done
###clouddb1015:3316###
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| event_scheduler | OFF   |
+-----------------+-------+
+-------------------------+-------+
| Variable_name           | Value |
+-------------------------+-------+
| innodb_change_buffering | none  |
+-------------------------+-------+
###clouddb1019:3316###
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| event_scheduler | ON    |
+-----------------+-------+
+-------------------------+-------+
| Variable_name           | Value |
+-------------------------+-------+
| innodb_change_buffering | none  |
+-------------------------+-------+
Marostegui updated the task description. Thu, Nov 19, 10:27 AM

The enwiki transfer from db1124:3311 to clouddb1013 and clouddb1017 finished, but as soon as I started MySQL on them, they returned errors. So I am going to go with option B: skip the transfer from the sanitarium and copy directly from the sanitarium master, then sanitize.

Some tables are reported as corrupted by CHECK TABLE, but they are not the ones reporting InnoDB errors. Interestingly, those tables never returned errors when checked on db1124.
The errors are exactly the same on both hosts.

Mentioned in SAL (#wikimedia-operations) [2020-11-19T12:25:00Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1106 T267090', diff saved to https://phabricator.wikimedia.org/P13334 and previous config saved to /var/cache/conftool/dbconfig/20201119-122459-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-11-19T12:38:00Z] <marostegui> Stop mysql on db1106 to clone clouddb1013 and clouddb1017 T267090

Change 641992 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] wikireplicas_multiinstance.my.cnf: Disable event scheduler

https://gerrit.wikimedia.org/r/641992

Mentioned in SAL (#wikimedia-operations) [2020-11-19T14:41:22Z] <marostegui> Sanitize enwiki on clouddb1013 T267090

Mentioned in SAL (#wikimedia-operations) [2020-11-19T14:47:25Z] <marostegui> Sanitize enwiki on clouddb1017 T267090
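
For context, "sanitize" here means applying the same redaction the sanitarium applies (dropping private databases and tables, and blanking private columns), which check_private_data.py then verifies. The real runs use the maintained redaction scripts; purely as an illustration of the column blanking, using column names from MediaWiki's user table:

root@clouddb1013:~# mysql -S /run/mysqld/mysqld.s1.sock enwiki -e "UPDATE user SET user_password='', user_newpassword='', user_email='', user_token='';"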

Change 641992 merged by Marostegui:
[operations/puppet@production] wikireplicas_multiinstance.my.cnf: Disable event scheduler

https://gerrit.wikimedia.org/r/641992

Marostegui updated the task description. Fri, Nov 20, 6:48 AM
Marostegui added a comment (edited). Fri, Nov 20, 6:52 AM

s1 situation:

  • Transfer from db1106 (sanitarium master) to clouddb1013:3311 and clouddb1017:3311 completed successfully.
  • Sanitization on clouddb1013:3311 and clouddb1017:3311 was done.
root@clouddb1013:~# check_private_data.py -S /run/mysqld/mysqld.s1.sock
-- Non-public databases that are present:
-- Non-public tables that are present:
-- Unfiltered columns that are present:

root@clouddb1017:~# check_private_data.py -S /run/mysqld/mysqld.s1.sock
-- Non-public databases that are present:
-- Non-public tables that are present:
-- Unfiltered columns that are present:
  • Root passwords changed
  • Mysqldump from db1124:3311 of information_schema_p was imported into clouddb1013:3311 and clouddb1017:3311
  • Replication configured and started on:
master_host='db1124.eqiad.wmnet', master_port=3311, master_log_file='db1124-bin.003382', master_log_pos=752182450;
  • Current configuration:
# for i in clouddb1013:3311 clouddb1017:3311; do echo "###$i###"; mysql.py -h$i -e "show global variables like 'event_scheduler'; show global variables like 'innodb_change_buffering'";done
###clouddb1013:3311###
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| event_scheduler | OFF   |
+-----------------+-------+
+-------------------------+-------+
| Variable_name           | Value |
+-------------------------+-------+
| innodb_change_buffering | none  |
+-------------------------+-------+
###clouddb1017:3311###
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| event_scheduler | ON    |
+-----------------+-------+
+-------------------------+-------+
| Variable_name           | Value |
+-------------------------+-------+
| innodb_change_buffering | none  |
+-------------------------+-------+
  • Replication is flowing:
# for i in clouddb1013:3311  clouddb1017:3311; do echo $i; mysql.py -h$i -e "show slave status\G" | grep Seconds ; done
clouddb1013:3311
         Seconds_Behind_Master: 63718
clouddb1017:3311
         Seconds_Behind_Master: 63718
  • As of now the error log looks clean of InnoDB errors

Mentioned in SAL (#wikimedia-operations) [2020-11-20T08:12:59Z] <marostegui> Enable GTID on clouddb1015:3316 clouddb1019:3316 - T267090
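
On MariaDB, enabling GTID amounts to switching the replication connection to GTID mode and then verifying it (a sketch, run per instance):

root@clouddb1015:~# mysql -S /run/mysqld/mysqld.s6.sock -e "STOP SLAVE; CHANGE MASTER TO master_use_gtid=slave_pos; START SLAVE;"
root@clouddb1015:~# mysql -S /run/mysqld/mysqld.s6.sock -e "SHOW SLAVE STATUS\G" | grep Using_Gtid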

clouddb1013:3313 and clouddb1017:3313 have been cloned from db1124:3313.

  • Root passwords changed
  • Triggers removed on all the wikis
  • Configuration:
# for i in clouddb1013:3313 clouddb1017:3313; do echo "###$i###"; mysql.py -h$i -e "show global variables like 'event_scheduler'; show global variables like 'innodb_change_buffering'";done
###clouddb1013:3313###
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| event_scheduler | OFF   |
+-----------------+-------+
+-------------------------+-------+
| Variable_name           | Value |
+-------------------------+-------+
| innodb_change_buffering | none  |
+-------------------------+-------+
###clouddb1017:3313###
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| event_scheduler | ON    |
+-----------------+-------+
+-------------------------+-------+
| Variable_name           | Value |
+-------------------------+-------+
| innodb_change_buffering | none  |
+-------------------------+-------+
  • Replication is flowing:
# for i in clouddb1013:3313  clouddb1017:3313; do echo $i; mysql.py -h$i -e "show slave status\G" | grep Seconds ; done
clouddb1013:3313
         Seconds_Behind_Master: 0
clouddb1017:3313
         Seconds_Behind_Master: 0
  • GTID enabled
  • So far no InnoDB errors in the logs.
Marostegui updated the task description. Fri, Nov 20, 11:09 AM

Change 642371 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] install_server: Do not reimage clouddb1013

https://gerrit.wikimedia.org/r/642371

Change 642371 merged by Marostegui:
[operations/puppet@production] install_server: Do not reimage clouddb1013

https://gerrit.wikimedia.org/r/642371

Mentioned in SAL (#wikimedia-operations) [2020-11-20T12:14:30Z] <marostegui> Run check private data on clouddb1013:3311 clouddb1013:3313 clouddb1015:3316 clouddb1017:3311 clouddb1017:3313 clouddb1019:3316 T267090

Marostegui updated the task description. Fri, Nov 20, 12:24 PM
Marostegui updated the task description. Fri, Nov 20, 12:27 PM

Mentioned in SAL (#wikimedia-operations) [2020-11-23T06:46:16Z] <marostegui> Restart clouddb1013 clouddb1015 clouddb1017 clouddb1019 for testing T267090

The following hosts have been serving fine during the weekend, with no crashes and no InnoDB errors:
clouddb1013:3311
clouddb1013:3313
clouddb1015:3316
clouddb1017:3311
clouddb1017:3313
clouddb1019:3316

I have restarted MySQL on all of them to see if InnoDB errors would arise, as we have seen in the past. So far so good.
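
The post-restart check is essentially watching each host's error log for InnoDB messages like the one quoted earlier, e.g. (a sketch):

root@clouddb1013:~# grep '\[ERROR\] InnoDB' /var/log/syslog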

Marostegui updated the task description. Mon, Nov 23, 7:26 AM

Mentioned in SAL (#wikimedia-operations) [2020-11-23T07:27:00Z] <marostegui> Stop MySQL on db1125:3314 to clone clouddb1015 and clouddb1019 - lag will appear on commonswiki on wikireplicas - T267090

Marostegui added a comment (edited). Mon, Nov 23, 7:30 AM

On-going transfers:

db1125:3314 -> clouddb1015
db1125:3314 -> clouddb1019

Marostegui updated the task description. Mon, Nov 23, 11:15 AM
Marostegui updated the task description. Mon, Nov 23, 11:19 AM
Marostegui added a comment (edited). Mon, Nov 23, 12:23 PM


The clouddb1015:3314 and clouddb1019:3314 instances crashed after being started, with the same corruption errors, so I am going to try copying the data from the sanitarium master instead.

Marostegui updated the task description. Mon, Nov 23, 12:24 PM

Mentioned in SAL (#wikimedia-operations) [2020-11-23T12:25:50Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1121 to clone clouddb1017:3314 clouddb1019:3314 T267090', diff saved to https://phabricator.wikimedia.org/P13366 and previous config saved to /var/cache/conftool/dbconfig/20201123-122549-marostegui.json

Change 643027 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1121: Disable notifications

https://gerrit.wikimedia.org/r/643027

Change 643027 merged by Marostegui:
[operations/puppet@production] db1121: Disable notifications

https://gerrit.wikimedia.org/r/643027

Mentioned in SAL (#wikimedia-operations) [2020-11-24T06:28:21Z] <marostegui> Sanitize clouddb1015:3314 T267090

Mentioned in SAL (#wikimedia-operations) [2020-11-24T06:31:14Z] <marostegui> Sanitize clouddb1019:3314 T267090

s4 situation:

  • Transfer from db1121 (sanitarium master) to clouddb1015:3314 and clouddb1019:3314 completed successfully.
  • Sanitization on clouddb1015:3314 and clouddb1019:3314 was done.
root@clouddb1015:~# check_private_data.py  -S /run/mysqld/mysqld.s4.sock
-- Non-public databases that are present:
-- Non-public tables that are present:
-- Unfiltered columns that are present:

root@clouddb1019:~# check_private_data.py  -S /run/mysqld/mysqld.s4.sock
-- Non-public databases that are present:
-- Non-public tables that are present:
-- Unfiltered columns that are present:
  • Root password changed
  • Triggers removed from commonswiki and testcommonswiki
  • Mysqldump from db1125:3314 of information_schema_p was imported into clouddb1015:3314 and clouddb1019:3314
  • Added prometheus grants (see the sketch after this list)
  • Replication configured and started on both hosts:
master_log_file='db1125-bin.005025', master_log_pos=117873515;
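
The prometheus grants are managed via Puppet; as a rough sketch of what the exporter account needs (the user name and auth method here are assumptions):

root@clouddb1015:~# mysql -S /run/mysqld/mysqld.s4.sock -e "GRANT PROCESS, REPLICATION CLIENT ON *.* TO 'prometheus'@'localhost' IDENTIFIED VIA unix_socket;"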

No InnoDB errors so far.

Marostegui updated the task description. Tue, Nov 24, 10:32 AM
Marostegui updated the task description. Tue, Nov 24, 11:11 AM

Mentioned in SAL (#wikimedia-operations) [2020-11-24T11:12:22Z] <marostegui> Stop mysql on db1125:3312 to clone clouddb1014:3312 and clouddb1018:3312 - T267090

On-going transfers:

db1125:3312 -> clouddb1014
db1125:3312 -> clouddb1018

Marostegui updated the task description. Tue, Nov 24, 1:22 PM
Marostegui updated the task description. Tue, Nov 24, 1:24 PM

The transfer finished on clouddb1014:3312 and clouddb1018:3312, but as soon as replication was started they both showed:

Nov 24 13:23:57 clouddb1014 mysqld[7223]: 2020-11-24 13:23:57 111 [ERROR] InnoDB: Record in index `pl_backlinks_namespace` of table `trwiki`.`pagelinks` was not found on update: TUPLE (info_bits=0, 4 fields): {[4]    (0x80000004),[4]    (0x8000000C),[13]KB1_hatalar  (0x4B

So I am going to go for the sanitarium master copy approach.

Marostegui updated the task description. Tue, Nov 24, 1:26 PM

Mentioned in SAL (#wikimedia-operations) [2020-11-24T13:37:09Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1074 to clone clouddb1018 and clouddb1014 T267090', diff saved to https://phabricator.wikimedia.org/P13388 and previous config saved to /var/cache/conftool/dbconfig/20201124-133709-marostegui.json

Change 643254 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1074: Disable notifications

https://gerrit.wikimedia.org/r/643254

Change 643254 merged by Marostegui:
[operations/puppet@production] db1074: Disable notifications

https://gerrit.wikimedia.org/r/643254

Mentioned in SAL (#wikimedia-operations) [2020-11-24T13:40:08Z] <marostegui> Stop MySQL on db1074 to clone clouddb1018 and clouddb1014 T267090

On-going transfers:

db1074 -> clouddb1014
db1074 -> clouddb1018

Marostegui updated the task description. Tue, Nov 24, 3:01 PM

Mentioned in SAL (#wikimedia-operations) [2020-11-24T15:01:44Z] <marostegui> Enable GTID on clouddb1013:3311 clouddb1015:3314 clouddb1017:3311 clouddb1019:3314 T267090

Marostegui updated the task description. Tue, Nov 24, 3:03 PM

Mentioned in SAL (#wikimedia-operations) [2020-11-25T05:48:55Z] <marostegui> Sanitize clouddb1014:3312 and clouddb1018:3312 T267090

Change 643399 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] check_private_data: Add clouddb1014, clouddb1018

https://gerrit.wikimedia.org/r/643399

Change 643399 merged by Marostegui:
[operations/puppet@production] check_private_data: Add clouddb1014, clouddb1018

https://gerrit.wikimedia.org/r/643399

s2 situation:

  • Transfer from db1074 (sanitarium master) to clouddb1014:3312 and clouddb1018:3312 completed successfully.
  • Sanitization on clouddb1014:3312 and clouddb1018:3312 was done.
  • Root password changed
  • Triggers removed from all wikis in s2.dblist
  • Added prometheus grants
  • Mysqldump from db1125:3312 of information_schema_p was imported into clouddb1014:3312 and clouddb1018:3312
  • Configured replication on:
master_log_file='db1125-bin.003001', master_log_pos=279469845
  • Added clouddb1014:3312 and clouddb1018:3312 to tendril and zarcillo

No InnoDB errors so far.

Marostegui updated the task description. Wed, Nov 25, 6:20 AM

Mentioned in SAL (#wikimedia-operations) [2020-11-25T06:28:49Z] <marostegui> Check private data on clouddb1014:3312 and clouddb1018:3312 T267090

Marostegui updated the task description. Wed, Nov 25, 6:31 AM

Restarted clouddb1015:3314, clouddb1015:3316, clouddb1019:3314 and clouddb1019:3316 (they had shown no errors for a day); let's give them another 24h to see if they stay clean.

Mentioned in SAL (#wikimedia-operations) [2020-11-25T06:38:10Z] <marostegui> Stop mysql on db1125:3317 to clone clouddb1014:3317 clouddb1018:3317 T267090

On-going transfers:

db1125:3317 -> clouddb1014:3317
db1125:3317 -> clouddb1018:3317

Marostegui updated the task description. Wed, Nov 25, 8:49 AM
Marostegui updated the task description.

The transfers to clouddb1014:3317 and clouddb1018:3317 finished; replication was started at:

master_log_file='db1125-bin.002695', master_log_pos=494341008;
  • Root password changed
  • Triggers removed

So far no InnoDB errors.

Marostegui updated the task description. Wed, Nov 25, 9:01 AM
Marostegui updated the task description. Wed, Nov 25, 9:19 AM
Marostegui updated the task description. Wed, Nov 25, 11:49 AM
Marostegui updated the task description. Wed, Nov 25, 11:58 AM
Marostegui updated the task description. Thu, Nov 26, 6:10 AM

Mentioned in SAL (#wikimedia-operations) [2020-11-26T06:17:16Z] <marostegui> Stop mysql on db1124:3315 to clone clouddb1016:3315 T267090

Marostegui updated the task description. Thu, Nov 26, 7:07 AM

clouddb1016:3315:

  • Data copied from db1124:3315
  • Host added to tendril and zarcillo
  • Root password changed
  • Replication started from:
master_log_file='db1124-bin.001558', master_log_pos=103503868;

Mentioned in SAL (#wikimedia-operations) [2020-11-26T07:12:20Z] <marostegui> Enable GTID on clouddb1018:3317 clouddb1014:3317 T267090

Marostegui updated the task description. Thu, Nov 26, 7:13 AM
Marostegui updated the task description.