
Productionize clouddb10[13-20]
Closed, ResolvedPublic

Description

The following hosts will be part of the new Wiki Replicas infrastructure, which will eventually replace the existing labsdb hosts.

  • clouddb1013
  • clouddb1014
  • clouddb1015
  • clouddb1016
  • clouddb1017
  • clouddb1018
  • clouddb1019
  • clouddb1020
  • Apply: lvextend -L+1100G /dev/mapper/tank-data && xfs_growfs /srv to each host.
  • All hosts added to Tendril and Zarcillo
    • clouddb1013:3311
    • clouddb1013:3313
    • clouddb1014:3312
    • clouddb1014:3317
    • clouddb1015:3314
    • clouddb1015:3316
    • clouddb1016:3315
    • clouddb1016:3318
    • clouddb1017:3311
    • clouddb1017:3313
    • clouddb1018:3312
    • clouddb1018:3317
    • clouddb1019:3314
    • clouddb1019:3316
    • clouddb1020:3315
    • clouddb1020:3318

Change the root password to match the labsdb one rather than the sanitarium one (the hosts are being cloned from the sanitariums):

  • clouddb1013:3311
  • clouddb1013:3313
  • clouddb1014:3312
  • clouddb1014:3317
  • clouddb1015:3314
  • clouddb1015:3316
  • clouddb1016:3315
  • clouddb1016:3318
  • clouddb1017:3311
  • clouddb1017:3313
  • clouddb1018:3312
  • clouddb1018:3317
  • clouddb1019:3314
  • clouddb1019:3316
  • clouddb1020:3315
  • clouddb1020:3318

Double-check private data on all hosts before considering them fully populated:

  • clouddb1013:3311
  • clouddb1013:3313
  • clouddb1014:3312
  • clouddb1014:3317
  • clouddb1015:3314
  • clouddb1015:3316
  • clouddb1016:3315
  • clouddb1016:3318
  • clouddb1017:3311
  • clouddb1017:3313
  • clouddb1018:3312
  • clouddb1018:3317
  • clouddb1019:3314
  • clouddb1019:3316
  • clouddb1020:3315
  • clouddb1020:3318
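The check used later in this task for that verification is check_private_data.py, run against each per-section socket. A minimal dry-run sketch for one host (the section list is an example taken from clouddb1013's port assignments, 3311 = s1 and 3313 = s3):

```shell
# Dry-run sketch: print the private-data check command for every section
# instance on a host. The section list is an example (clouddb1013 carries
# s1 and s3); adjust per host. Drop the echo to actually execute.
for section in s1 s3; do
    echo "check_private_data.py -S /run/mysqld/mysqld.${section}.sock"
done
```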

Check GTID enabled on all the instances:

  • clouddb1013:3311
  • clouddb1013:3313
  • clouddb1014:3312
  • clouddb1014:3317
  • clouddb1015:3314
  • clouddb1015:3316
  • clouddb1016:3315
  • clouddb1016:3318
  • clouddb1017:3311
  • clouddb1017:3313
  • clouddb1018:3312
  • clouddb1018:3317
  • clouddb1019:3314
  • clouddb1019:3316
  • clouddb1020:3315
  • clouddb1020:3318
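One way to verify this on a MariaDB replica (a sketch of the method, not necessarily the exact check used here) is to look at Using_Gtid in SHOW SLAVE STATUS; a value of Slave_Pos means GTID replication is active:

```shell
# Dry-run sketch: print the GTID check for each local section socket.
# On MariaDB, "Using_Gtid: Slave_Pos" in SHOW SLAVE STATUS indicates the
# replica is using GTID. The section list is an example; drop the echo
# to run the commands for real.
for section in s1 s3; do
    echo "mysql -S /run/mysqld/mysqld.${section}.sock -e 'SHOW SLAVE STATUS\G' | grep Using_Gtid"
done
```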

Compare data on all the instances:

  • clouddb1013:3311
  • clouddb1013:3313
  • clouddb1014:3312
  • clouddb1014:3317
  • clouddb1015:3314
  • clouddb1015:3316
  • clouddb1016:3315
  • clouddb1016:3318
  • clouddb1017:3311
  • clouddb1017:3313
  • clouddb1018:3312
  • clouddb1018:3317
  • clouddb1019:3314
  • clouddb1019:3316
  • clouddb1020:3315
  • clouddb1020:3318

Sections per host to be decided, see proposal at: T265135#6598952
The puppet roles to be applied are wmcs::db::wikireplicas::web_multiinstance and wmcs::db::wikireplicas::analytics_multiinstance.

Event Timeline


s2 situation:

  • Transfer from db1074 (sanitarium master) to clouddb1014:3312 and clouddb1018:3312 completed successfully.
  • Sanitization on clouddb1014:3312 and clouddb1018:3312 was done.
  • Root password changed
  • Triggers removed from all s2.dblist
  • Added prometheus grants
  • Mysqldump from db1125:3312 of information_schema_p was imported into clouddb1014:3312 and clouddb1018:3312
  • Configured replication on:
master_log_file='db1125-bin.003001', master_log_pos=279469845
  • Added clouddb1014:3312 and clouddb1018:3312 to tendril and zarcillo

No InnoDB errors so far.
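The coordinates above correspond to a CHANGE MASTER TO statement roughly like the following; the master host, port, and repl user are assumptions based on the clone source (db1125:3312), and the password is elided:

```shell
# Sketch only: build and print the CHANGE MASTER TO implied by the
# coordinates above. Host, port, and user are assumptions (db1125:3312
# was the clone source); password elided. GTID was enabled separately
# later, so replication starts from a binlog file/position here.
sql="CHANGE MASTER TO
  MASTER_HOST='db1125.eqiad.wmnet',
  MASTER_PORT=3312,
  MASTER_USER='repl',
  MASTER_LOG_FILE='db1125-bin.003001',
  MASTER_LOG_POS=279469845;
START SLAVE;"
echo "$sql"
```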

Mentioned in SAL (#wikimedia-operations) [2020-11-25T06:28:49Z] <marostegui> Check private data on clouddb1014:3312 and clouddb1018:3312 T267090

Restarted clouddb1015:3314, clouddb1015:3316, clouddb1019:3314 and clouddb1019:3316 (they had no errors for a day); let's give them another 24h to see if they stay clean.

Mentioned in SAL (#wikimedia-operations) [2020-11-25T06:38:10Z] <marostegui> Stop mysql on db1125:3317 to clone clouddb1014:3317 clouddb1018:3317 T267090

On-going transfers:

db1125:3317 -> clouddb1014:3317
db1125:3317 -> clouddb1018:3317

These transfers finished; replication started at:

master_log_file='db1125-bin.002695', master_log_pos=494341008;
  • Root password changed
  • Triggers removed

So far no InnoDB errors.

Mentioned in SAL (#wikimedia-operations) [2020-11-26T06:17:16Z] <marostegui> Stop mysql on db1124:3315 to clone clouddb1016:3315 T267090

clouddb1016:3315:

  • Data copied from db1124:3315
  • Host added to tendril and zarcillo
  • Root password changed
  • Replication started from:
master_log_file='db1124-bin.001558', master_log_pos=103503868;

Mentioned in SAL (#wikimedia-operations) [2020-11-26T07:12:20Z] <marostegui> Enable GTID on clouddb1018:3317 clouddb1014:3317 T267090

Change 643868 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] install_server: Do not reimage clouddb1018

https://gerrit.wikimedia.org/r/643868

Change 643868 merged by Marostegui:
[operations/puppet@production] install_server: Do not reimage clouddb1018

https://gerrit.wikimedia.org/r/643868

Mentioned in SAL (#wikimedia-operations) [2020-11-30T07:05:48Z] <marostegui> Stop mysql on db1124:3318 to clone clouddb1016:3318, lag will show up on wikireplicas on s8 T267090

Change 644084 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] check_private_data: Add clouddb1016 and clouddb1020

https://gerrit.wikimedia.org/r/644084

Change 644084 merged by Marostegui:
[operations/puppet@production] check_private_data: Add clouddb1016 and clouddb1020

https://gerrit.wikimedia.org/r/644084

Mentioned in SAL (#wikimedia-operations) [2020-11-30T08:36:44Z] <marostegui> Compare data between clouddb1016:3315 labsdb1012 T267090

I did a transfer from db1124:3318 to clouddb1016:3318, and InnoDB errors appeared right after I started replication:

Nov 30 09:34:36 clouddb1016 mysqld[27700]: 2020-11-30  9:34:36 51 [Note] Slave I/O thread: connected to master 'repl@db1124.eqiad.wmnet:3318',replication started in log 'db1124-bin.004478' at position 548141123
Nov 30 09:36:20 clouddb1016 mysqld[27700]: 2020-11-30  9:36:20 52 [ERROR] InnoDB: Record in index `pl_namespace` of table `wikidatawiki`.`pagelinks` was not found on update: TUPLE (info_bits=0, 3 fields): {[4]    (0x80000000),[9]Q17682262(0x513137363832323632),[4] ,  (0x012CAEC5)} at: COMPACT RECORD(info_bits=0, 3 fields): {[4]    (0x80000000),[9]Q17682262(0x513137363832323632),[4] ,uZ(0x012C755A)}

Going with the sanitarium master (db1087) -> clouddb1016 transfer approach instead.

Mentioned in SAL (#wikimedia-operations) [2020-11-30T09:39:10Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1087 from s8 and pool db1092 instead temporarily on vslow T267090', diff saved to https://phabricator.wikimedia.org/P13466 and previous config saved to /var/cache/conftool/dbconfig/20201130-093909-marostegui.json

Change 644182 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1087: Disable notifications

https://gerrit.wikimedia.org/r/644182

Mentioned in SAL (#wikimedia-operations) [2020-11-30T09:40:39Z] <marostegui> Stop MySQL on db1087 to clone clouddb1016:3318 T267090

Change 644182 merged by Marostegui:
[operations/puppet@production] db1087: Disable notifications

https://gerrit.wikimedia.org/r/644182

Mentioned in SAL (#wikimedia-operations) [2020-11-30T10:29:42Z] <marostegui> Compare data between clouddb1012:3312 clouddb1018:3312 labsdb1012 T267090

Mentioned in SAL (#wikimedia-operations) [2020-11-30T10:29:52Z] <marostegui> Compare data between clouddb1014:3312 clouddb1018:3312 labsdb1012 T267090

Mentioned in SAL (#wikimedia-operations) [2020-11-30T11:43:46Z] <marostegui> Sanitize clouddb1016:3318 - T267090

s8 situation:

  • Transfer from db1087 (sanitarium master) to clouddb1016:3318 completed successfully.
  • Sanitization on clouddb1016:3318 was done.
root@clouddb1016:/srv# check_private_data.py -S /run/mysqld/mysqld.s8.sock
-- Non-public databases that are present:
-- Non-public tables that are present:
-- Unfiltered columns that are present:
root@clouddb1016:/srv#
  • Root passwords changed
  • Triggers removed
  • Mysqldump from db1124:3318 of information_schema_p was imported into clouddb1016:3318
  • Replication configured and started on:
master_log_file='db1124-bin.004479', master_log_pos=434289975

No InnoDB errors so far.

Mentioned in SAL (#wikimedia-operations) [2020-12-01T11:48:53Z] <marostegui> Install bsd-mailx on the new clouddb hosts (needed for the check private data) T267090 T268725

Change 644519 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1106: Disable notifications

https://gerrit.wikimedia.org/r/644519

Change 644519 merged by Marostegui:
[operations/puppet@production] db1106: Disable notifications

https://gerrit.wikimedia.org/r/644519

Mentioned in SAL (#wikimedia-operations) [2020-12-01T17:19:31Z] <marostegui> Sanitize s1 on clouddb1013 and clouddb1017 - T267090

s1 situation:

Transfer from db1106 (sanitarium master) to clouddb1013:3311 and clouddb1017:3311 completed successfully.
Sanitization on clouddb1013:3311 and clouddb1017:3311 was done.

root@clouddb1017:~# check_private_data.py -S /run/mysqld/mysqld.s1.sock
-- Non-public databases that are present:
-- Non-public tables that are present:
-- Unfiltered columns that are present:

root@clouddb1013:~# check_private_data.py -S /run/mysqld/mysqld.s1.sock
-- Non-public databases that are present:
-- Non-public tables that are present:
-- Unfiltered columns that are present:
  • Root passwords changed
  • Triggers removed
  • Mysqldump from db1124:3311 of information_schema_p was imported into clouddb1013:3311 and clouddb1017:3311
  • Replication configured and started on:
master_log_file='db1124-bin.003446', master_log_pos=879160046

No InnoDB errors so far.

@Bstorm just for my own organization, any ETA on when clouddb1020 will be released from your side?
Thanks

Change 645114 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] wikireplicas: let clouddb1020 join the party

https://gerrit.wikimedia.org/r/645114

Change 645227 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] install_server: Do not reimage clouddb1019

https://gerrit.wikimedia.org/r/645227

Change 645227 merged by Marostegui:
[operations/puppet@production] install_server: Do not reimage clouddb1019

https://gerrit.wikimedia.org/r/645227

Change 645114 merged by Marostegui:
[operations/puppet@production] wikireplicas: let clouddb1020 join the party

https://gerrit.wikimedia.org/r/645114

Mentioned in SAL (#wikimedia-operations) [2020-12-04T07:09:30Z] <marostegui> Stop mysql on clouddb1016 to clone clouddb1020 T267090

Mentioned in SAL (#wikimedia-operations) [2020-12-15T11:09:50Z] <marostegui> Create fake db to trigger data checks alerts for clouddb hosts T267090

All this is pretty much done. The last thing I am testing is that all the hosts properly send an email if private data is detected.
For that I have created a test database on each instance, and I will wait for the weekly data check to see whether all the instances report it correctly.


This worked fine and emails for all the hosts arrived.
I have dropped that empty test database.
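The alerting test described above amounts to something like this (the database name and socket are placeholders, not the ones actually used):

```shell
# Dry-run sketch of the private-data alert test: create an obviously
# non-public database, let the weekly check flag it and send its email,
# then drop it. Name and socket are placeholders.
db="some_fake_test_db"
sock="/run/mysqld/mysqld.s1.sock"
echo "mysql -S ${sock} -e 'CREATE DATABASE ${db}'"
echo "# wait for the weekly private-data check to send its alert email"
echo "mysql -S ${sock} -e 'DROP DATABASE ${db}'"
```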

For the record: notifications are disabled and won't be enabled until the hosts start receiving users.

Mentioned in SAL (#wikimedia-operations) [2021-01-14T20:17:32Z] <mutante> ACKing all unhandled crit alerts about systemd on clouddb hosts - notifications are disabled but this cleans up Icinga web UI noise - T267090

Thanks @Dzahn for the above! I have fixed them: it was the old single-instance pt-kill service, which has been replaced by a multi-instance one; the old unit was a leftover from the installation.

Mentioned in SAL (#wikimedia-operations) [2021-01-29T08:20:14Z] <marostegui> Change buffer pool sizes on clouddb1013,1015,1017,1019 T267090

I am starting to change buffer pool sizes on all the clouddb hosts to make sure we use 403GB out of the 512GB of RAM (which matches what we use at the moment). This is what I pushed today and will be doing next week for the other hosts too: https://gerrit.wikimedia.org/r/c/operations/puppet/+/659729
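As a rough sketch of the arithmetic (the even split across the two instances per host is an assumption; the actual per-section sizes are in the Gerrit change above):

```shell
# Rough arithmetic: 512 GB RAM per host, ~403 GB targeted for InnoDB
# buffer pools, two mariadb instances per host. An even split is an
# assumption; the real per-section values live in the Gerrit change.
target_gb=403
instances=2
per_instance_gb=$(( target_gb / instances ))
echo "${per_instance_gb} GB buffer pool per instance"
```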

Change 660989 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] clouddb*: Enable notifications

https://gerrit.wikimedia.org/r/660989

Change 660989 merged by Marostegui:
[operations/puppet@production] clouddb*: Enable notifications

https://gerrit.wikimedia.org/r/660989