
Productionize clouddb10[13-20]
Open, Medium, Public

Description

The following hosts will be part of the new wiki replicas infrastructure, which will eventually replace the existing labsdb hosts.

  • clouddb1013
  • clouddb1014
  • clouddb1015
  • clouddb1016
  • clouddb1017
  • clouddb1018
  • clouddb1019
  • clouddb1020
  • Apply lvextend -L+1100G /dev/mapper/tank-data && xfs_growfs /srv on each host (a verification sketch follows this list).
  • All hosts added to Tendril and Zarcillo
    • clouddb1013:3311
    • clouddb1013:3313
    • clouddb1014:3312
    • clouddb1014:3317
    • clouddb1015:3314
    • clouddb1015:3316
    • clouddb1016:3315
    • clouddb1016:3318
    • clouddb1017:3311
    • clouddb1017:3313
    • clouddb1018:3312
    • clouddb1018:3317
    • clouddb1019:3314
    • clouddb1019:3316
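
After running the lvextend/xfs_growfs command on a host, the result can be verified with something like the following (a sketch; the clouddb1013 prompt is just an example):

root@clouddb1013:~# lvs /dev/mapper/tank-data
root@clouddb1013:~# df -h /srv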

Change the root password to match the one used on the labsdb hosts rather than the ones from the sanitarium hosts (from which these instances are being cloned); a sketch follows the list:

  • clouddb1013:3311
  • clouddb1013:3313
  • clouddb1014:3312
  • clouddb1014:3317
  • clouddb1015:3314
  • clouddb1015:3316
  • clouddb1016:3315
  • clouddb1016:3318
  • clouddb1017:3311
  • clouddb1017:3313
  • clouddb1018:3312
  • clouddb1018:3317
  • clouddb1019:3314
  • clouddb1019:3316
  • clouddb1020:3315
  • clouddb1020:3318
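
On each instance this boils down to a statement along these lines (a sketch; the labsdb root password is a placeholder and the socket path varies per instance):

root@clouddb1013:~# mysql -S /run/mysqld/mysqld.s1.sock -e "SET PASSWORD FOR 'root'@'localhost' = PASSWORD('<labsdb root password>');"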

Double-check private data on all hosts before considering them fully populated with data:

  • clouddb1013:3311
  • clouddb1013:3313
  • clouddb1014:3312
  • clouddb1014:3317
  • clouddb1015:3314
  • clouddb1015:3316
  • clouddb1016:3315
  • clouddb1016:3318
  • clouddb1017:3311
  • clouddb1017:3313
  • clouddb1018:3312
  • clouddb1018:3317
  • clouddb1019:3314
  • clouddb1019:3316
  • clouddb1020:3315
  • clouddb1020:3318

Check that GTID is enabled on all instances:

  • clouddb1013:3311
  • clouddb1013:3313
  • clouddb1014:3312
  • clouddb1014:3317
  • clouddb1015:3314
  • clouddb1015:3316
  • clouddb1016:3315
  • clouddb1016:3318
  • clouddb1017:3311
  • clouddb1017:3313
  • clouddb1018:3312
  • clouddb1018:3317
  • clouddb1019:3314
  • clouddb1019:3316
  • clouddb1020:3315
  • clouddb1020:3318

Sections per host are still to be decided; see the proposal at T265135#6598952.
The Puppet roles to be applied are wmcs::db::wikireplicas::web_multiinstance and wmcs::db::wikireplicas::analytics_multiinstance.

Event Timeline


Attempting the same on s6 with:

Running a check on s6 tables on db1125

clouddb1015:3316 innodb_change_buffering=none and event_scheduler=OFF (make sure all the triggers are removed)
clouddb1019:3316 innodb_change_buffering=none and event_scheduler=ON (leave the triggers in place)
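
For reference, both settings can be changed at runtime on the target instance (a minimal sketch using the s6 socket; persisting them across restarts still requires the corresponding my.cnf change):

root@clouddb1015:~# mysql -S /run/mysqld/mysqld.s6.sock -e "SET GLOBAL innodb_change_buffering='none'; SET GLOBAL event_scheduler=OFF;"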

This came back clean; tomorrow I will do the transfer for s6 to the above hosts.


Mentioned in SAL (#wikimedia-operations) [2020-11-19T06:08:10Z] <marostegui> Stop mysql on db1125:3316 to clone clouddb1015 and clouddb1019, there will be lag on s6 on wikireplicas - T267090

Change 641874 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] check_private_data: Add clouddb1015 and clouddb1019

https://gerrit.wikimedia.org/r/641874

Change 641874 merged by Marostegui:
[operations/puppet@production] check_private_data: Add clouddb1015 and clouddb1019

https://gerrit.wikimedia.org/r/641874


Data has been transferred to clouddb1015:3316 and clouddb1019:3316.

  • Triggers cleaned (a removal sketch follows this list)
  • mysql_upgrade done
  • Now running CHECK TABLE on both hosts
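
The trigger cleanup is presumably scripted; as a minimal sketch, the DROP statements for every trigger on an instance can be generated and executed in one go:

root@clouddb1015:~# mysql -S /run/mysqld/mysqld.s6.sock -NBe "SELECT CONCAT('DROP TRIGGER ', trigger_schema, '.', trigger_name, ';') FROM information_schema.triggers" | mysql -S /run/mysqld/mysqld.s6.sock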

Once the check is done, I will configure replication using:

root@db1125:~# mysql -S /run/mysqld/mysqld.s6.sock -e "show master status\G"
*************************** 1. row ***************************
            File: db1125-bin.001921
        Position: 197854833
    Binlog_Do_DB:
Binlog_Ignore_DB:
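
Those coordinates translate into a CHANGE MASTER TO statement along these lines on clouddb1015:3316 and clouddb1019:3316 (a sketch: the replication user and password are placeholders, and the master port is assumed to be 3316 to match the s6 instance on db1125):

root@clouddb1015:~# mysql -S /run/mysqld/mysqld.s6.sock -e "CHANGE MASTER TO master_host='db1125.eqiad.wmnet', master_port=3316, master_user='repl', master_password='<redacted>', master_log_file='db1125-bin.001921', master_log_pos=197854833; START SLAVE;"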

Marostegui updated the task description. Thu, Nov 19, 7:40 AM

This is very bad news.
clouddb1013:3311 (s1) and clouddb1017:3311 (s1) crashed at the same time with the same error we have seen before:

Nov 17 22:48:56 clouddb1013 mysqld[31534]: 2020-11-17 22:48:56 1 [ERROR] InnoDB: Unable to find a record to delete-mark

They were both cloned from db1124:3311 (sanitarium).
I am going to update the MariaDB bug, copy the data again after running a CHECK TABLE on db1124:3311, and start it with innodb_change_buffering=none.

The checks on db1124:3311 came back clean, so I am going to transfer the data again to:

clouddb1013: innodb_change_buffering=none and event_scheduler=OFF (make sure all the triggers are removed)
clouddb1017: innodb_change_buffering=none and event_scheduler=ON (leave triggers)

I will run a check once the transfer is done.
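
The check itself can be run against the whole instance with mysqlcheck (a sketch; in practice it may be limited to the suspect tables):

root@clouddb1013:~# mysqlcheck -S /run/mysqld/mysqld.s1.sock --all-databases --check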

Mentioned in SAL (#wikimedia-operations) [2020-11-19T09:40:05Z] <marostegui> Stop mysql on db1124:3311 to clone clouddb1013 and clouddb1017, there will be lag on s1 on wikireplicas - T267090


The table checks came back clean on db1125:3316, clouddb1015:3316 and clouddb1019:3316.
I have started replication on both hosts.

Configuration flags:

# for i in clouddb1015:3316 clouddb1019:3316; do echo "###$i###"; mysql.py -h$i -e "show global variables like 'event_scheduler'; show global variables like 'innodb_change_buffering'";done
###clouddb1015:3316###
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| event_scheduler | OFF   |
+-----------------+-------+
+-------------------------+-------+
| Variable_name           | Value |
+-------------------------+-------+
| innodb_change_buffering | none  |
+-------------------------+-------+
###clouddb1019:3316###
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| event_scheduler | ON    |
+-----------------+-------+
+-------------------------+-------+
| Variable_name           | Value |
+-------------------------+-------+
| innodb_change_buffering | none  |
+-------------------------+-------+
Marostegui updated the task description. Thu, Nov 19, 10:27 AM

The enwiki transfer from db1124:3311 to clouddb1013 and clouddb1017 finished, but as soon as I started MySQL on them, they returned errors. So I am going to go with option B: skip the transfer from the sanitarium and copy directly from the sanitarium master, then sanitize.

Some tables are reported as corrupted by CHECK TABLE, but they are not the ones reporting InnoDB errors. Interestingly, those tables never returned errors when checked on db1124.
The errors are exactly the same on both hosts.

Mentioned in SAL (#wikimedia-operations) [2020-11-19T12:25:00Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1106 T267090', diff saved to https://phabricator.wikimedia.org/P13334 and previous config saved to /var/cache/conftool/dbconfig/20201119-122459-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-11-19T12:38:00Z] <marostegui> Stop mysql on db1106 to clone clouddb1013 and clouddb1017 T267090

Change 641992 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] wikireplicas_multiinstance.my.cnf: Disable event scheduler

https://gerrit.wikimedia.org/r/641992

Mentioned in SAL (#wikimedia-operations) [2020-11-19T14:41:22Z] <marostegui> Sanitize enwiki on clouddb1013 T267090

Mentioned in SAL (#wikimedia-operations) [2020-11-19T14:47:25Z] <marostegui> Sanitize enwiki on clouddb1017 T267090
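
For context, "sanitize" here means applying the same redaction the sanitarium applies (dropping private databases and tables, and blanking private columns), which check_private_data.py then verifies. The real runs use the maintained redaction scripts; purely as an illustration of the column blanking, using column names from MediaWiki's user table:

root@clouddb1013:~# mysql -S /run/mysqld/mysqld.s1.sock enwiki -e "UPDATE user SET user_password='', user_newpassword='', user_email='', user_token='';"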

Change 641992 merged by Marostegui:
[operations/puppet@production] wikireplicas_multiinstance.my.cnf: Disable event scheduler

https://gerrit.wikimedia.org/r/641992

Marostegui updated the task description. Fri, Nov 20, 6:48 AM
Marostegui added a comment (edited). Fri, Nov 20, 6:52 AM

s1 situation:

  • Transfer from db1106 (sanitarium master) to clouddb1013:3311 and clouddb1017:3311 completed successfully.
  • Sanitization on clouddb1013:3311 and clouddb1017:3311 was done.
root@clouddb1013:~# check_private_data.py -S /run/mysqld/mysqld.s1.sock
-- Non-public databases that are present:
-- Non-public tables that are present:
-- Unfiltered columns that are present:

root@clouddb1017:~# check_private_data.py -S /run/mysqld/mysqld.s1.sock
-- Non-public databases that are present:
-- Non-public tables that are present:
-- Unfiltered columns that are present:
  • Root passwords changed
  • Mysqldump from db1124:3311 of information_schema_p was imported into clouddb1013:3311 and clouddb1017:3311
  • Replication configured and started on:
master_host='db1124.eqiad.wmnet', master_port=3311, master_log_file='db1124-bin.003382', master_log_pos=752182450;
  • Current configuration:
# for i in clouddb1013:3311 clouddb1017:3311; do echo "###$i###"; mysql.py -h$i -e "show global variables like 'event_scheduler'; show global variables like 'innodb_change_buffering'";done
###clouddb1013:3311###
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| event_scheduler | OFF   |
+-----------------+-------+
+-------------------------+-------+
| Variable_name           | Value |
+-------------------------+-------+
| innodb_change_buffering | none  |
+-------------------------+-------+
###clouddb1017:3311###
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| event_scheduler | ON    |
+-----------------+-------+
+-------------------------+-------+
| Variable_name           | Value |
+-------------------------+-------+
| innodb_change_buffering | none  |
+-------------------------+-------+
  • Replication is flowing:
# for i in clouddb1013:3311  clouddb1017:3311; do echo $i; mysql.py -h$i -e "show slave status\G" | grep Seconds ; done
clouddb1013:3311
         Seconds_Behind_Master: 63718
clouddb1017:3311
         Seconds_Behind_Master: 63718
  • As of now the error log looks clean of InnoDB errors

Mentioned in SAL (#wikimedia-operations) [2020-11-20T08:12:59Z] <marostegui> Enable GTID on clouddb1015:3316 clouddb1019:3316 - T267090
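
On MariaDB, enabling GTID amounts to switching the replication connection to GTID mode and then verifying it (a sketch, run per instance):

root@clouddb1015:~# mysql -S /run/mysqld/mysqld.s6.sock -e "STOP SLAVE; CHANGE MASTER TO master_use_gtid=slave_pos; START SLAVE;"
root@clouddb1015:~# mysql -S /run/mysqld/mysqld.s6.sock -e "SHOW SLAVE STATUS\G" | grep Using_Gtid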

clouddb1013:3313 and clouddb1017:3313 have been cloned from db1124:3313.

  • Root passwords changed
  • Triggers removed on all the wikis
  • Configuration:
# for i in clouddb1013:3313 clouddb1017:3313; do echo "###$i###"; mysql.py -h$i -e "show global variables like 'event_scheduler'; show global variables like 'innodb_change_buffering'";done
###clouddb1013:3313###
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| event_scheduler | OFF   |
+-----------------+-------+
+-------------------------+-------+
| Variable_name           | Value |
+-------------------------+-------+
| innodb_change_buffering | none  |
+-------------------------+-------+
###clouddb1017:3313###
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| event_scheduler | ON    |
+-----------------+-------+
+-------------------------+-------+
| Variable_name           | Value |
+-------------------------+-------+
| innodb_change_buffering | none  |
+-------------------------+-------+
  • Replication is flowing:
# for i in clouddb1013:3313  clouddb1017:3313; do echo $i; mysql.py -h$i -e "show slave status\G" | grep Seconds ; done
clouddb1013:3313
         Seconds_Behind_Master: 0
clouddb1017:3313
         Seconds_Behind_Master: 0
  • GTID enabled
  • So far no InnoDB errors in the logs.
Marostegui updated the task description. Fri, Nov 20, 11:09 AM

Change 642371 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] install_server: Do not reimage clouddb1013

https://gerrit.wikimedia.org/r/642371

Change 642371 merged by Marostegui:
[operations/puppet@production] install_server: Do not reimage clouddb1013

https://gerrit.wikimedia.org/r/642371

Mentioned in SAL (#wikimedia-operations) [2020-11-20T12:14:30Z] <marostegui> Run check private data on clouddb1013:3311 clouddb1013:3313 clouddb1015:3316 clouddb1017:3311 clouddb1017:3313 clouddb1019:3316 T267090

Marostegui updated the task description. Fri, Nov 20, 12:24 PM
Marostegui updated the task description. Fri, Nov 20, 12:27 PM

Mentioned in SAL (#wikimedia-operations) [2020-11-23T06:46:16Z] <marostegui> Restart clouddb1013 clouddb1015 clouddb1017 clouddb1019 for testing T267090

The following hosts have been serving fine during the weekend, with no crashes and no InnoDB errors:
clouddb1013:3311
clouddb1013:3313
clouddb1015:3316
clouddb1017:3311
clouddb1017:3313
clouddb1019:3316

I have restarted MySQL on all of them to see if InnoDB errors would arise, as we have seen in the past. So far so good.
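
The post-restart check is essentially watching each host's error log for InnoDB messages like the one quoted earlier, e.g. (a sketch):

root@clouddb1013:~# grep '\[ERROR\] InnoDB' /var/log/syslog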

Marostegui updated the task description. Mon, Nov 23, 7:26 AM

Mentioned in SAL (#wikimedia-operations) [2020-11-23T07:27:00Z] <marostegui> Stop MySQL on db1125:3314 to clone clouddb1015 and clouddb1019 - lag will appear on commonswiki on wikireplicas - T267090

Marostegui added a comment (edited). Mon, Nov 23, 7:30 AM

On-going transfers:

db1125:3314 -> clouddb1015
db1125:3314 -> clouddb1019

Marostegui updated the task description. Mon, Nov 23, 11:15 AM
Marostegui updated the task description. Mon, Nov 23, 11:19 AM
Marostegui added a comment (edited). Mon, Nov 23, 12:23 PM


The clouddb1015:3314 and clouddb1019:3314 instances crashed after being started, with the same corruption errors, so I am going to try copying the data from the sanitarium master instead.

Marostegui updated the task description. Mon, Nov 23, 12:24 PM

Mentioned in SAL (#wikimedia-operations) [2020-11-23T12:25:50Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1121 to clone clouddb1017:3314 clouddb1019:3314 T267090', diff saved to https://phabricator.wikimedia.org/P13366 and previous config saved to /var/cache/conftool/dbconfig/20201123-122549-marostegui.json

Change 643027 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1121: Disable notifications

https://gerrit.wikimedia.org/r/643027

Change 643027 merged by Marostegui:
[operations/puppet@production] db1121: Disable notifications

https://gerrit.wikimedia.org/r/643027

Mentioned in SAL (#wikimedia-operations) [2020-11-24T06:28:21Z] <marostegui> Sanitize clouddb1015:3314 T267090

Mentioned in SAL (#wikimedia-operations) [2020-11-24T06:31:14Z] <marostegui> Sanitize clouddb1019:3314 T267090

s4 situation:

  • Transfer from db1121 (sanitarium master) to clouddb1015:3314 and clouddb1019:3314 completed successfully.
  • Sanitization on clouddb1015:3314 and clouddb1019:3314 was done.
root@clouddb1015:~# check_private_data.py  -S /run/mysqld/mysqld.s4.sock
-- Non-public databases that are present:
-- Non-public tables that are present:
-- Unfiltered columns that are present:

root@clouddb1019:~# check_private_data.py  -S /run/mysqld/mysqld.s4.sock
-- Non-public databases that are present:
-- Non-public tables that are present:
-- Unfiltered columns that are present:
  • Root password changed
  • Triggers removed from commonswiki and testcommonswiki
  • Mysqldump from db1125:3314 of information_schema_p was imported into clouddb1015:3314 and clouddb1019:3314
  • Added prometheus grants (see the sketch after this list)
  • Replication configured and started on both hosts:
master_log_file='db1125-bin.005025', master_log_pos=117873515;
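
The prometheus grants are managed via Puppet; as a rough sketch of what the exporter account needs (the user name and auth method here are assumptions):

root@clouddb1015:~# mysql -S /run/mysqld/mysqld.s4.sock -e "GRANT PROCESS, REPLICATION CLIENT ON *.* TO 'prometheus'@'localhost' IDENTIFIED VIA unix_socket;"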

No InnoDB errors so far.

Marostegui updated the task description. Tue, Nov 24, 10:32 AM
Marostegui updated the task description. Tue, Nov 24, 11:11 AM

Mentioned in SAL (#wikimedia-operations) [2020-11-24T11:12:22Z] <marostegui> Stop mysql on db1125:3312 to clone clouddb1014:3312 and clouddb1018:3312 - T267090

On-going transfers:

db1125:3312 -> clouddb1014
db1125:3312 -> clouddb1018

Marostegui updated the task description. Tue, Nov 24, 1:22 PM
Marostegui updated the task description. Tue, Nov 24, 1:24 PM

The transfer finished on clouddb1014:3312 and clouddb1018:3312, but as soon as replication was started they both showed:

Nov 24 13:23:57 clouddb1014 mysqld[7223]: 2020-11-24 13:23:57 111 [ERROR] InnoDB: Record in index `pl_backlinks_namespace` of table `trwiki`.`pagelinks` was not found on update: TUPLE (info_bits=0, 4 fields): {[4]    (0x80000004),[4]    (0x8000000C),[13]KB1_hatalar  (0x4B

So I am going to go for the sanitarium master copy approach.

Marostegui updated the task description. Tue, Nov 24, 1:26 PM

Mentioned in SAL (#wikimedia-operations) [2020-11-24T13:37:09Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1074 to clone clouddb1018 and clouddb1014 T267090', diff saved to https://phabricator.wikimedia.org/P13388 and previous config saved to /var/cache/conftool/dbconfig/20201124-133709-marostegui.json

Change 643254 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1074: Disable notifications

https://gerrit.wikimedia.org/r/643254

Change 643254 merged by Marostegui:
[operations/puppet@production] db1074: Disable notifications

https://gerrit.wikimedia.org/r/643254

Mentioned in SAL (#wikimedia-operations) [2020-11-24T13:40:08Z] <marostegui> Stop MySQL on db1074 to clone clouddb1018 and clouddb1014 T267090

On-going transfers:

db1074 -> clouddb1014
db1074 -> clouddb1018

Marostegui updated the task description. Tue, Nov 24, 3:01 PM

Mentioned in SAL (#wikimedia-operations) [2020-11-24T15:01:44Z] <marostegui> Enable GTID on clouddb1013:3311 clouddb1015:3314 clouddb1017:3311 clouddb1019:3314 T267090

Marostegui updated the task description. Tue, Nov 24, 3:03 PM

Mentioned in SAL (#wikimedia-operations) [2020-11-25T05:48:55Z] <marostegui> Sanitize clouddb1014:3312 and clouddb1018:3312 T267090

Change 643399 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] check_private_data: Add clouddb1014, clouddb1018

https://gerrit.wikimedia.org/r/643399

Change 643399 merged by Marostegui:
[operations/puppet@production] check_private_data: Add clouddb1014, clouddb1018

https://gerrit.wikimedia.org/r/643399

s2 situation:

  • Transfer from db1074 (sanitarium master) to clouddb1014:3312 and clouddb1018:3312 completed successfully.
  • Sanitization on clouddb1014:3312 and clouddb1018:3312 was done.
  • Root password changed
  • Triggers removed from all wikis in s2.dblist
  • Added prometheus grants
  • Mysqldump from db1125:3312 of information_schema_p was imported into clouddb1014:3312 and clouddb1018:3312
  • Configured replication on:
master_log_file='db1125-bin.003001', master_log_pos=279469845
  • Added clouddb1014:3312 and clouddb1018:3312 to tendril and zarcillo

No InnoDB errors so far.

Marostegui updated the task description. Wed, Nov 25, 6:20 AM

Mentioned in SAL (#wikimedia-operations) [2020-11-25T06:28:49Z] <marostegui> Check private data on clouddb1014:3312 and clouddb1018:3312 T267090

Marostegui updated the task description. Wed, Nov 25, 6:31 AM

Restarted clouddb1015:3314, clouddb1015:3316, clouddb1019:3314 and clouddb1019:3316 (they had shown no errors for a day); let's give them another 24h to see if they stay clean.

Mentioned in SAL (#wikimedia-operations) [2020-11-25T06:38:10Z] <marostegui> Stop mysql on db1125:3317 to clone clouddb1014:3317 clouddb1018:3317 T267090

On-going transfers:

db1125:3317 -> clouddb1014:3317
db1125:3317 -> clouddb1018:3317

Marostegui updated the task description. Wed, Nov 25, 8:49 AM
Marostegui updated the task description.

The transfers to clouddb1014:3317 and clouddb1018:3317 finished; replication was started at:

master_log_file='db1125-bin.002695', master_log_pos=494341008;
  • Root password changed
  • Triggers removed

So far no InnoDB errors.

Marostegui updated the task description. Wed, Nov 25, 9:01 AM
Marostegui updated the task description. Wed, Nov 25, 9:19 AM
Marostegui updated the task description. Wed, Nov 25, 11:49 AM
Marostegui updated the task description. Wed, Nov 25, 11:58 AM
Marostegui updated the task description. Thu, Nov 26, 6:10 AM

Mentioned in SAL (#wikimedia-operations) [2020-11-26T06:17:16Z] <marostegui> Stop mysql on db1124:3315 to clone clouddb1016:3315 T267090

Marostegui updated the task description. Thu, Nov 26, 7:07 AM

clouddb1016:3315:

  • Data copied from db1124:3315
  • Host added to tendril and zarcillo
  • Root password changed
  • Replication started from:
master_log_file='db1124-bin.001558', master_log_pos=103503868;

Mentioned in SAL (#wikimedia-operations) [2020-11-26T07:12:20Z] <marostegui> Enable GTID on clouddb1018:3317 clouddb1014:3317 T267090

Marostegui updated the task description. Thu, Nov 26, 7:13 AM
Marostegui updated the task description.