Test upgrading sanitarium hosts to Buster + 10.4
Closed, Resolved, Public

Description

While populating data from sanitarium to the new clouddb hosts (T267090) some sections showed InnoDB errors after the upgrade and as soon as replication started.
Examples:
s1 T267090#6629364
s2 T267090#6644949
s4 T267090#6640399

The workaround that has worked so far is to copy the data from the sanitarium master instead and then sanitize the clouddb hosts.
This makes me think that upgrading the sanitarium hosts might be what triggers the InnoDB errors for the sections above.

Let's take two hosts from T267043 ((Need By: 2020-11-29) rack/setup/install db11[51-76]) and use them as temporary sanitarium hosts running 10.4: copy the data from the existing sanitarium hosts to them and see what happens with the upgrade.
Chosen hosts:

  • db1154
    • s1
    • s3
    • s5
    • s8
  • db1155
    • s2
    • s4
    • s6
    • s7
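The layout above can be written down as a small lookup. This is only a sketch: the port rule (3310 plus the section number) is an assumption inferred from the instance addresses that appear in this task, such as db1154:3315 for s5 and db1155:3317 for s7, and the function names are made up for illustration.

```python
# Host -> sections, as listed in the task description.
SANITARIUM_SECTIONS = {
    "db1154": ["s1", "s3", "s5", "s8"],
    "db1155": ["s2", "s4", "s6", "s7"],
}


def section_port(section):
    """Return the assumed instance port for a section name such as 's5'.

    Assumption: port = 3310 + section number (s5 -> 3315).
    """
    if not (section.startswith("s") and section[1:].isdigit()):
        raise ValueError(f"unexpected section name: {section!r}")
    return 3310 + int(section[1:])


def instances():
    """List every host:port instance implied by the layout above."""
    return [
        f"{host}:{section_port(section)}"
        for host, sections in SANITARIUM_SECTIONS.items()
        for section in sections
    ]


if __name__ == "__main__":
    for instance in instances():
        print(instance)
```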

There are two scenarios:

a) It works fine, so once they are ready we can just move the clouddb hosts under them and re-use the other ones for core.
b) The upgrade doesn't work and we need to "convert" them by using the workaround above (copying the data from the sanitarium masters and sanitizing them).

  • Ensure InnoDB is compressed on the new hosts
    • db1154
    • db1155
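The task does not show how the compression is applied. One common way in MariaDB is to rebuild each table with `ROW_FORMAT=COMPRESSED`; the sketch below only generates the statements. The `KEY_BLOCK_SIZE` value, the helper names, and the example table are assumptions for illustration, not the procedure used here.

```python
# Query that lists InnoDB tables not yet stored in compressed row format.
# information_schema.tables exposes ENGINE and ROW_FORMAT columns.
FIND_UNCOMPRESSED_SQL = (
    "SELECT table_schema, table_name "
    "FROM information_schema.tables "
    "WHERE engine = 'InnoDB' AND row_format <> 'Compressed'"
)


def compress_statement(db, table, key_block_size=8):
    """Build an ALTER TABLE that rebuilds one table as compressed InnoDB.

    key_block_size=8 is an assumed default, not taken from the task.
    """
    return (
        f"ALTER TABLE `{db}`.`{table}` "
        f"ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE={key_block_size};"
    )


if __name__ == "__main__":
    # Hypothetical example table.
    print(compress_statement("enwiki", "revision"))
```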

Event Timeline

Change 649839 merged by Marostegui:
[operations/puppet@production] site.pp: Clarify the situation with db1154 and db1155

https://gerrit.wikimedia.org/r/649839

Change 649856 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Add db1154 as a temp sanitarium

https://gerrit.wikimedia.org/r/649856

Change 649856 merged by Marostegui:
[operations/puppet@production] mariadb: Add db1154 as a temp sanitarium

https://gerrit.wikimedia.org/r/649856

Mentioned in SAL (#wikimedia-operations) [2020-12-16T11:52:42Z] <marostegui> Stop s1, s3, s5 and s8 on db1124 to copy it to db1154 (this will generate lag on wikireplicas) T268742

Marostegui moved this task from Ready to In progress on the DBA board. Dec 16 2020, 11:55 AM

I built db1154 with s1, s3, s5 and s8, and as soon as I started s1 there were InnoDB errors, which I was kind of expecting after seeing T267090#6629364.
So s1 is not to be trusted and needs to be built from the sanitarium masters.

s3, s5 and s8 haven't shown errors for now. Replication was just enabled.

> and needs to be built from the sanitarium masters

I'm not saying it has to be done this way, but as an alternative that would save manual work: could it be rebuilt from the sanitarium hosts in codfw? That way no manual tweaking would be necessary, even if the transfer would take more time.

The codfw hosts will probably have the same problem.
Also, I prefer not to mix data between DCs, especially because the existing clouddb hosts were cloned from sanitarium's masters in eqiad.

s8 has InnoDB errors.

Mentioned in SAL (#wikimedia-operations) [2020-12-17T05:55:56Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1106 for cloning db1154:3311 T268742 ', diff saved to https://phabricator.wikimedia.org/P13560 and previous config saved to /var/cache/conftool/dbconfig/20201217-055556-marostegui.json

Marostegui updated the task description. Dec 17 2020, 6:03 AM

Change 650029 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] redact_sanitarium: Add db1154 to the list of sanitarium hosts

https://gerrit.wikimedia.org/r/650029

Change 650029 merged by Marostegui:
[operations/puppet@production] redact_sanitarium: Add db1154 to the list of sanitarium hosts

https://gerrit.wikimedia.org/r/650029

Change 650030 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] redact_sanitarium: Fix typo

https://gerrit.wikimedia.org/r/650030

Change 650030 merged by Marostegui:
[operations/puppet@production] redact_sanitarium: Fix typo

https://gerrit.wikimedia.org/r/650030

Mentioned in SAL (#wikimedia-operations) [2020-12-17T07:19:03Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1082 for cloning db1154:3315 T268742 ', diff saved to https://phabricator.wikimedia.org/P13563 and previous config saved to /var/cache/conftool/dbconfig/20201217-071903-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-12-17T09:00:31Z] <marostegui> Sanitize s1 and s5 on db1154 T268742

Change 650089 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] check_private_data_report: Add db1154 as sanitarium host

https://gerrit.wikimedia.org/r/650089

Change 650089 merged by Marostegui:
[operations/puppet@production] check_private_data_report: Add db1154 as sanitarium host

https://gerrit.wikimedia.org/r/650089

s5 has been built from sanitarium's master on db1154:3315. It has been sanitized, and running check_private_data returned clean results. Let's see if InnoDB errors show up in the next few days.

Marostegui updated the task description. Dec 17 2020, 12:07 PM

Mentioned in SAL (#wikimedia-operations) [2020-12-17T12:40:53Z] <marostegui@cumin1001> dbctl commit (dc=all): 'db1082 (re)pooling @ 25%: Repooling after cloning db1154:3315 as sanitarium T268742', diff saved to https://phabricator.wikimedia.org/P13566 and previous config saved to /var/cache/conftool/dbconfig/20201217-124052-root.json

Mentioned in SAL (#wikimedia-operations) [2020-12-17T12:54:46Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Repool db1106 after cloning db1154:3311 as sanitarium T268742', diff saved to https://phabricator.wikimedia.org/P13567 and previous config saved to /var/cache/conftool/dbconfig/20201217-125446-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-12-17T12:55:56Z] <marostegui@cumin1001> dbctl commit (dc=all): 'db1082 (re)pooling @ 50%: Repooling after cloning db1154:3315 as sanitarium T268742', diff saved to https://phabricator.wikimedia.org/P13569 and previous config saved to /var/cache/conftool/dbconfig/20201217-125556-root.json

Mentioned in SAL (#wikimedia-operations) [2020-12-17T13:01:01Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1087 to clone db1154:3318 add db1092 as vslow,dump service for s8 T268742 ', diff saved to https://phabricator.wikimedia.org/P13571 and previous config saved to /var/cache/conftool/dbconfig/20201217-130101-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-12-17T13:11:00Z] <marostegui@cumin1001> dbctl commit (dc=all): 'db1082 (re)pooling @ 75%: Repooling after cloning db1154:3315 as sanitarium T268742', diff saved to https://phabricator.wikimedia.org/P13574 and previous config saved to /var/cache/conftool/dbconfig/20201217-131059-root.json

Mentioned in SAL (#wikimedia-operations) [2020-12-17T13:26:03Z] <marostegui@cumin1001> dbctl commit (dc=all): 'db1082 (re)pooling @ 100%: Repooling after cloning db1154:3315 as sanitarium T268742', diff saved to https://phabricator.wikimedia.org/P13576 and previous config saved to /var/cache/conftool/dbconfig/20201217-132603-root.json
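The repooling entries above bring db1082 back in four steps (25%, 50%, 75%, 100%). As a rough sketch, the spacing between the steps can be computed from the SAL timestamps; the timestamps are copied from the entries above, while the variable and function names are only illustrative.

```python
from datetime import datetime

# (timestamp, pooled percentage) pairs copied from the SAL entries above.
REPOOL_STEPS = [
    ("2020-12-17T12:40:53Z", 25),
    ("2020-12-17T12:55:56Z", 50),
    ("2020-12-17T13:11:00Z", 75),
    ("2020-12-17T13:26:03Z", 100),
]


def parse_ts(ts):
    """Parse a SAL timestamp such as 2020-12-17T12:40:53Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def step_gaps(steps):
    """Return the number of seconds between consecutive repool steps."""
    times = [parse_ts(ts) for ts, _ in steps]
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    for gap in step_gaps(REPOOL_STEPS):
        print(f"{gap / 60:.1f} minutes")
```

Each step lands roughly 15 minutes after the previous one.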

s1 and s8 are being sanitized on db1154, which I expect to take a couple of days or so.

Marostegui updated the task description. Dec 18 2020, 6:58 AM

Mentioned in SAL (#wikimedia-operations) [2020-12-18T07:06:48Z] <marostegui> Stop mysql on db1124:3313 T268742

s3 gave errors when copied from db1124 and upgraded, so it also needs to be copied from the sanitarium masters.

Dec 18 08:52:53 db1154 mysqld[20635]: InnoDB: tuple DATA TUPLE: 3 fields;
Dec 18 08:52:53 db1154 mysqld[20635]:  0: len 4; hex 80000000; asc     ;;
Dec 18 08:52:53 db1154 mysqld[20635]:  1: len 26; hex 5379726961635f556e696f6e5f50617274795f28537972696129; asc Syriac_Union_Party_(Syria);;
Dec 18 08:52:53 db1154 mysqld[20635]:  2: len 4; hex 00029ff4; asc     ;;
Dec 18 08:52:53 db1154 mysqld[20635]: InnoDB: record PHYSICAL RECORD: n_fields 3; compact format; info bits 0
Dec 18 08:52:53 db1154 mysqld[20635]:  0: len 4; hex 80000000; asc     ;;
Dec 18 08:52:53 db1154 mysqld[20635]:  1: len 26; hex 5379726961635f556e696f6e5f50617274795f28537972696129; asc Syriac_Union_Party_(Syria);;
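For reading dumps like the one above, the `hex` part of each field can be decoded by hand. The following is a minimal sketch: the regular expression and helper names are illustrative, and treating field 0 as a signed integer column (InnoDB stores signed integers big-endian with the sign bit flipped) is an assumption about the index, not something the log states.

```python
import re

# Matches the "N: len L; hex H; asc ..." part of an InnoDB tuple dump line.
FIELD_RE = re.compile(r"(\d+): len (\d+); hex ([0-9a-f]*); asc")

# Two lines copied from the log excerpt above.
SAMPLE_LOG = [
    "Dec 18 08:52:53 db1154 mysqld[20635]:  0: len 4; hex 80000000; asc     ;;",
    "Dec 18 08:52:53 db1154 mysqld[20635]:  1: len 26; hex "
    "5379726961635f556e696f6e5f50617274795f28537972696129; "
    "asc Syriac_Union_Party_(Syria);;",
]


def decode_fields(log_lines):
    """Return (field index, declared length, raw bytes) for each field line."""
    fields = []
    for line in log_lines:
        match = FIELD_RE.search(line)
        if match:
            index, length, hexdata = match.groups()
            fields.append((int(index), int(length), bytes.fromhex(hexdata)))
    return fields


def innodb_signed_int(raw):
    """Decode a signed integer stored with its sign bit flipped.

    Assumption: the field is a signed INT column.
    """
    return int.from_bytes(raw, "big") - (1 << (8 * len(raw) - 1))


if __name__ == "__main__":
    for index, length, raw in decode_fields(SAMPLE_LOG):
        print(index, length, raw)
```

Under that assumption, field 0 decodes to 0 and field 1 is the ASCII title shown in the `asc` column.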

Marostegui updated the task description. Dec 21 2020, 7:01 AM

Replication flowing in s1 and s8 with no InnoDB errors for the last 2 days

Mentioned in SAL (#wikimedia-operations) [2020-12-21T07:07:48Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1112 T268742 ', diff saved to https://phabricator.wikimedia.org/P13609 and previous config saved to /var/cache/conftool/dbconfig/20201221-070748-marostegui.json

Marostegui updated the task description. Dec 21 2020, 7:57 AM
Marostegui updated the task description. Dec 21 2020, 8:32 AM
Marostegui updated the task description. Dec 23 2020, 5:51 AM

Change 651694 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] install_server: Do not reimage db1154

https://gerrit.wikimedia.org/r/651694

Change 651694 merged by Marostegui:
[operations/puppet@production] install_server: Do not reimage db1154

https://gerrit.wikimedia.org/r/651694

Marostegui updated the task description. Dec 23 2020, 6:28 AM

So far no errors on s1, s3, s5 and s8 on db1154 since they were re-cloned from the sanitarium masters.
Triggers seem to be working fine, and the private data check reports no PII there.
Manually checking the user table also reveals no PII on any of those sections.

No errors on db1154 after 10 days

Change 654370 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Productionize db1155 as sanitarium

https://gerrit.wikimedia.org/r/654370

Change 654370 merged by Marostegui:
[operations/puppet@production] mariadb: Productionize db1155 as sanitarium

https://gerrit.wikimedia.org/r/654370

Mentioned in SAL (#wikimedia-operations) [2021-01-05T06:40:27Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1074 to clone db1155:3312 T268742 ', diff saved to https://phabricator.wikimedia.org/P13647 and previous config saved to /var/cache/conftool/dbconfig/20210105-064026-marostegui.json

Change 654371 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1074: Disable notifications

https://gerrit.wikimedia.org/r/654371

Change 654371 merged by Marostegui:
[operations/puppet@production] db1074: Disable notifications

https://gerrit.wikimedia.org/r/654371

On-going transfer from db1074 to db1155

root@cumin1001:/home/marostegui/T270053# mysql.py -hdb1125:3312 -e "show slave status\G"
*************************** 1. row ***************************
Slave_IO_State: Reconnecting after a failed master event read
Master_Host: db1074.eqiad.wmnet
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: db1074-bin.006147
Read_Master_Log_Pos: 401395959
Relay_Log_File: db1125-relay-bin.000384
Relay_Log_Pos: 401396248
Relay_Master_Log_File: db1074-bin.006147
Slave_IO_Running: Connecting
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table: mysql.%,oai.%,advisorswiki.%,arbcom_cswiki.%,arbcom_dewiki.%,arbcom_enwiki.%,arbcom_fiwiki.%,arbcom_nlwiki.%,arbcom_ruwiki.%,auditcomwiki.%,boardgovcomwiki.%,boardwiki.%,chairwiki.%,chapcomwiki.%,checkuserwiki.%,collabwiki.%,ecwikimedia.%,electcomwiki.%,execwiki.%,fdcwiki.%,grantswiki.%,id_internalwikimedia.%,iegcomwiki.%,ilwikimedia.%,internalwiki.%,legalteamwiki.%,movementroleswiki.%,noboard_chapterswikimedia.%,officewiki.%,ombudsmenwiki.%,otrs_wikiwiki.%,projectcomwiki.%,searchcomwiki.%,spcomwiki.%,stewardwiki.%,sysop_itwiki.%,techconductwiki.%,transitionteamwiki.%,wg_enwiki.%,wikimaniateamwiki.%,zerowiki.%,%.__wmf_checksums,%.accountaudit_login,%.arbcom1_vote,%.archive_old,%.blob_orphans,%.blob_tracking,%.bot_passwords,%.bv2009_edits,%.categorylinks_old,%.click_tracking,%.cu_changes,%.cu_log,%.cur,%.echo_email_batch,%.echo_event,%.echo_target_page,%.echo_unread_wikis,%.echo_notification,%.echo_push_subscription,%.edit_page_tracking,%.email_capture,%.exarchive,%.exrevision,%.globalnames,%.hidden,%.image_old,%.job,%.linkscc,%.localnames,%.log_search,%.logging_old,%.long_run_profiling,%.migrateuser_medium,%.moodbar_feedback,%.moodbar_feedback_response,%.msg_resource,%.oathauth_users,%.oauth_accepted_consumer,%.oauth_ratelimit_client_tier,%.oauth_registered_consumer,%.oauth2_access_tokens,%.objectcache,%.old_growth,%.oldimage_old,%.optin_survey,%.prefstats,%.prefswitch_survey,%.profiling,%.querycache,%.querycache_info,%.querycache_old,%.querycachetwo,%.reading_list,%.reading_list_entry,%.securepoll_cookie_match,%.securepoll_elections,%.securepoll_entity,%.securepoll_lists,%.securepoll_msgs,%.securepoll_options,%.securepoll_properties,%.securepoll_questions,%.securepoll_strike,%.securepoll_voters,%.securepoll_votes,%.spoofuser,%.text,%.titlekey,%.transcache,%.uploadstash,%.urlshortcodes,%.user_newtalk,%.vote_log,%.watchlist,%.watchlist_expiry,%.wikimedia_editor_tasks_counts,%.wikimedia_editor_tasks_keys,%.wikimedia_editor_tasks_targets_passed
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 401395959
Relay_Log_Space: 401396591
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: Yes
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 2003
Last_IO_Error: error reconnecting to master 'repl@db1074.eqiad.wmnet:3306' - retry-time: 60 maximum-retries: 86400 message: Can't connect to MySQL server on 'db1074.eqiad.wmnet' (111 "Connection refused")
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 171966668
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: Slave_Pos
Gtid_IO_Pos: 0-180359173-4858865027,171966668-171966668-3090,180359173-180359173-70817914,171978786-171978786-2475603130,171970567-171970567-390719906,171966574-171966574-2221092918,180359241-180359241-121693516,171966670-171966670-2410812544,180359271-180359271-332498589
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: conservative
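The point of saving this output is to know where the old sanitarium stopped on db1074, so that the new host can start replicating from the same place. A minimal sketch of turning such output into a `CHANGE MASTER TO` statement follows. It uses binlog file and position; the hosts above run with `Using_Gtid: Slave_Pos`, so the real setup may well use GTID instead. The helper names are illustrative, credentials are deliberately left out, and the `Last_IO_Error` sample line is shortened.

```python
import re

# A few fields from the output above (Last_IO_Error shortened).
SAMPLE = """\
Master_Host: db1074.eqiad.wmnet
Master_Port: 3306
Relay_Master_Log_File: db1074-bin.006147
Exec_Master_Log_Pos: 401395959
Last_IO_Error: error reconnecting to master 'repl@db1074.eqiad.wmnet:3306'
"""

# "Key: value", tolerating a pasted line number in front of the key.
LINE_RE = re.compile(r"^\s*(?:\d+\s*)?([A-Za-z_]+):\s*(.*)$")


def parse_slave_status(text):
    """Turn SHOW SLAVE STATUS\\G output into a dict of strings."""
    status = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            status[match.group(1)] = match.group(2).strip()
    return status


def change_master_statement(status):
    """Build CHANGE MASTER TO from the last executed master coordinates."""
    return (
        "CHANGE MASTER TO "
        f"MASTER_HOST='{status['Master_Host']}', "
        f"MASTER_PORT={int(status['Master_Port'])}, "
        f"MASTER_LOG_FILE='{status['Relay_Master_Log_File']}', "
        f"MASTER_LOG_POS={int(status['Exec_Master_Log_Pos'])};"
    )


if __name__ == "__main__":
    print(change_master_statement(parse_slave_status(SAMPLE)))
```

`Relay_Master_Log_File` and `Exec_Master_Log_Pos` are used rather than the read position because they describe what the SQL thread has actually applied.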

Change 654372 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Add db1155 to redact sanitarium and check_private_data

https://gerrit.wikimedia.org/r/654372

Change 654372 merged by Marostegui:
[operations/puppet@production] mariadb: Add db1155 to redact sanitarium and check_private_data

https://gerrit.wikimedia.org/r/654372

s2 has been brought up on db1155, so far no InnoDB errors - it was cloned from the sanitarium master, db1074.
Sanitization done.
Triggers and filters working fine
InnoDB compression is on-going

Marostegui updated the task description. Tue, Jan 5, 10:29 AM

Change 654974 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1074: Disable notifications

https://gerrit.wikimedia.org/r/654974

Mentioned in SAL (#wikimedia-operations) [2021-01-08T06:33:01Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1085 to clone db1155:3316 T268742 ', diff saved to https://phabricator.wikimedia.org/P13666 and previous config saved to /var/cache/conftool/dbconfig/20210108-063301-marostegui.json

Change 654974 merged by Marostegui:
[operations/puppet@production] db1085: Disable notifications

https://gerrit.wikimedia.org/r/654974

On-going transfer from db1085 to db1155

s6 replication positions:

mysql.py -hdb1125:3316 -e "show slave status\G"
*************************** 1. row ***************************
Slave_IO_State: Reconnecting after a failed master event read
Master_Host: db1085.eqiad.wmnet
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: db1085-bin.003884
Read_Master_Log_Pos: 313241001
Relay_Log_File: db1125-relay-bin.000338
Relay_Log_Pos: 313241290
Relay_Master_Log_File: db1085-bin.003884
Slave_IO_Running: Connecting
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table: mysql.%,oai.%,advisorswiki.%,arbcom_cswiki.%,arbcom_dewiki.%,arbcom_enwiki.%,arbcom_fiwiki.%,arbcom_nlwiki.%,arbcom_ruwiki.%,auditcomwiki.%,boardgovcomwiki.%,boardwiki.%,chairwiki.%,chapcomwiki.%,checkuserwiki.%,collabwiki.%,ecwikimedia.%,electcomwiki.%,execwiki.%,fdcwiki.%,grantswiki.%,id_internalwikimedia.%,iegcomwiki.%,ilwikimedia.%,internalwiki.%,legalteamwiki.%,movementroleswiki.%,noboard_chapterswikimedia.%,officewiki.%,ombudsmenwiki.%,otrs_wikiwiki.%,projectcomwiki.%,searchcomwiki.%,spcomwiki.%,stewardwiki.%,sysop_itwiki.%,techconductwiki.%,transitionteamwiki.%,wg_enwiki.%,wikimaniateamwiki.%,zerowiki.%,%.__wmf_checksums,%.accountaudit_login,%.arbcom1_vote,%.archive_old,%.blob_orphans,%.blob_tracking,%.bot_passwords,%.bv2009_edits,%.categorylinks_old,%.click_tracking,%.cu_changes,%.cu_log,%.cur,%.echo_email_batch,%.echo_event,%.echo_target_page,%.echo_unread_wikis,%.echo_notification,%.echo_push_subscription,%.edit_page_tracking,%.email_capture,%.exarchive,%.exrevision,%.globalnames,%.hidden,%.image_old,%.job,%.linkscc,%.localnames,%.log_search,%.logging_old,%.long_run_profiling,%.migrateuser_medium,%.moodbar_feedback,%.moodbar_feedback_response,%.msg_resource,%.oathauth_users,%.oauth_accepted_consumer,%.oauth_ratelimit_client_tier,%.oauth_registered_consumer,%.oauth2_access_tokens,%.objectcache,%.old_growth,%.oldimage_old,%.optin_survey,%.prefstats,%.prefswitch_survey,%.profiling,%.querycache,%.querycache_info,%.querycache_old,%.querycachetwo,%.reading_list,%.reading_list_entry,%.securepoll_cookie_match,%.securepoll_elections,%.securepoll_entity,%.securepoll_lists,%.securepoll_msgs,%.securepoll_options,%.securepoll_properties,%.securepoll_questions,%.securepoll_strike,%.securepoll_voters,%.securepoll_votes,%.spoofuser,%.text,%.titlekey,%.transcache,%.uploadstash,%.urlshortcodes,%.user_newtalk,%.vote_log,%.watchlist,%.watchlist_expiry,%.wikimedia_editor_tasks_counts,%.wikimedia_editor_tasks_keys,%.wikimedia_editor_tasks_targets_passed
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 313241001
Relay_Log_Space: 313241633
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: Yes
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 2003
Last_IO_Error: error reconnecting to master 'repl@db1085.eqiad.wmnet:3306' - retry-time: 60 maximum-retries: 86400 message: Can't connect to MySQL server on 'db1085.eqiad.wmnet' (111 "Connection refused")
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 171970663
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: Slave_Pos
Gtid_IO_Pos: 171974883-171974883-1921892293,0-180359184-3049354376,171974662-171974662-315898504,171970594-171970594-1063329989,171970663-171970663-313,171978904-171978904-200854125,171970705-171970705-239075865,171978766-171978766-1375989821,180359184-180359184-35598956,180367474-180367474-91976046,180367475-180367475-264159665
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: conservative
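These positions are only useful if the old sanitarium had applied everything it received before the master went down. A small sketch of that sanity check, using values copied from the s6 output above (the function name and the exact set of conditions are illustrative assumptions):

```python
# Values copied from the s6 output above.
S6_STATUS = {
    "Master_Log_File": "db1085-bin.003884",
    "Read_Master_Log_Pos": "313241001",
    "Relay_Master_Log_File": "db1085-bin.003884",
    "Exec_Master_Log_Pos": "313241001",
    "Slave_SQL_Running": "Yes",
    "Last_SQL_Errno": "0",
}


def is_caught_up(status):
    """True if the SQL thread has applied everything the IO thread read."""
    return (
        status["Slave_SQL_Running"] == "Yes"
        and status["Last_SQL_Errno"] == "0"
        and status["Master_Log_File"] == status["Relay_Master_Log_File"]
        and int(status["Read_Master_Log_Pos"])
        == int(status["Exec_Master_Log_Pos"])
    )


if __name__ == "__main__":
    print("caught up:", is_caught_up(S6_STATUS))
```

Here the read and executed positions match, so the recorded coordinates describe a consistent stopping point.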

s2 compression on db1155 finished

s6 started on db1155, so far no InnoDB errors after the PII removal. Replication is working fine. Triggers also working fine and check_private_data was clean.
I am compressing InnoDB

Marostegui updated the task description. Fri, Jan 8, 11:04 AM

Compression finished on s2.

On-going transfer from db1121 to db1155:3314
s4 sanitarium positions

root@cumin1001:~# mysql.py -hdb1125:3314 -e "show slave status\G"
*************************** 1. row ***************************
Slave_IO_State: Reconnecting after a failed master event read
Master_Host: db1121.eqiad.wmnet
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: db1121-bin.005996
Read_Master_Log_Pos: 901413614
Relay_Log_File: db1125-relay-bin.001031
Relay_Log_Pos: 901413903
Relay_Master_Log_File: db1121-bin.005996
Slave_IO_Running: Connecting
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table: mysql.%,oai.%,advisorswiki.%,arbcom_cswiki.%,arbcom_dewiki.%,arbcom_enwiki.%,arbcom_fiwiki.%,arbcom_nlwiki.%,arbcom_ruwiki.%,auditcomwiki.%,boardgovcomwiki.%,boardwiki.%,chairwiki.%,chapcomwiki.%,checkuserwiki.%,collabwiki.%,ecwikimedia.%,electcomwiki.%,execwiki.%,fdcwiki.%,grantswiki.%,id_internalwikimedia.%,iegcomwiki.%,ilwikimedia.%,internalwiki.%,legalteamwiki.%,movementroleswiki.%,noboard_chapterswikimedia.%,officewiki.%,ombudsmenwiki.%,otrs_wikiwiki.%,projectcomwiki.%,searchcomwiki.%,spcomwiki.%,stewardwiki.%,sysop_itwiki.%,techconductwiki.%,transitionteamwiki.%,wg_enwiki.%,wikimaniateamwiki.%,zerowiki.%,%.__wmf_checksums,%.accountaudit_login,%.arbcom1_vote,%.archive_old,%.blob_orphans,%.blob_tracking,%.bot_passwords,%.bv2009_edits,%.categorylinks_old,%.click_tracking,%.cu_changes,%.cu_log,%.cur,%.echo_email_batch,%.echo_event,%.echo_target_page,%.echo_unread_wikis,%.echo_notification,%.echo_push_subscription,%.edit_page_tracking,%.email_capture,%.exarchive,%.exrevision,%.globalnames,%.hidden,%.image_old,%.job,%.linkscc,%.localnames,%.log_search,%.logging_old,%.long_run_profiling,%.migrateuser_medium,%.moodbar_feedback,%.moodbar_feedback_response,%.msg_resource,%.oathauth_users,%.oauth_accepted_consumer,%.oauth_ratelimit_client_tier,%.oauth_registered_consumer,%.oauth2_access_tokens,%.objectcache,%.old_growth,%.oldimage_old,%.optin_survey,%.prefstats,%.prefswitch_survey,%.profiling,%.querycache,%.querycache_info,%.querycache_old,%.querycachetwo,%.reading_list,%.reading_list_entry,%.securepoll_cookie_match,%.securepoll_elections,%.securepoll_entity,%.securepoll_lists,%.securepoll_msgs,%.securepoll_options,%.securepoll_properties,%.securepoll_questions,%.securepoll_strike,%.securepoll_voters,%.securepoll_votes,%.spoofuser,%.text,%.titlekey,%.transcache,%.uploadstash,%.urlshortcodes,%.user_newtalk,%.vote_log,%.watchlist,%.watchlist_expiry,%.wikimedia_editor_tasks_counts,%.wikimedia_editor_tasks_keys,%.wikimedia_editor_tasks_targets_passed
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 901413614
Relay_Log_Space: 901414246
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: Yes
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 2003
Last_IO_Error: error reconnecting to master 'repl@db1121.eqiad.wmnet:3306' - retry-time: 60 maximum-retries: 86400 message: Can't connect to MySQL server on 'db1121.eqiad.wmnet' (111 "Connection refused")
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 171974668
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: Slave_Pos
Gtid_IO_Pos: 171974668-171974668-1546,171978876-171978876-1972981824,171966557-171966557-1846583256,180359190-180359190-192195477,171978775-171978775-4822899280,180363436-180363436-1155411339
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: conservative
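`Gtid_IO_Pos` is a comma-separated list of MariaDB GTIDs, each in the form domain_id-server_id-sequence. A short sketch for splitting the s4 value above into its parts (the class and function names are only illustrative):

```python
from typing import NamedTuple


class Gtid(NamedTuple):
    domain_id: int
    server_id: int
    sequence: int


# Gtid_IO_Pos value copied from the s4 output above.
S4_GTID_IO_POS = (
    "171974668-171974668-1546,171978876-171978876-1972981824,"
    "171966557-171966557-1846583256,180359190-180359190-192195477,"
    "171978775-171978775-4822899280,180363436-180363436-1155411339"
)


def parse_gtid_list(value):
    """Split a comma-separated GTID list into Gtid tuples."""
    return [
        Gtid(*(int(part) for part in item.split("-")))
        for item in value.split(",")
        if item.strip()
    ]


def position_for_domain(gtids, domain_id):
    """Return the GTID recorded for one replication domain, or None."""
    for gtid in gtids:
        if gtid.domain_id == domain_id:
            return gtid
    return None


if __name__ == "__main__":
    gtids = parse_gtid_list(S4_GTID_IO_POS)
    # 171974668 is the Master_Server_Id shown in the output above.
    print(position_for_domain(gtids, 171974668))
```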

Mentioned in SAL (#wikimedia-operations) [2021-01-11T09:31:55Z] <marostegui> Sanitize db1155:3314 - T268742

s4 on db1155:3314 is being sanitized

db1155:3314 is now replicating. So far no InnoDB errors.
Triggers seem to be working fine
Going to start compression there.
Check private data came back clean.

Marostegui updated the task description. Mon, Jan 11, 1:15 PM

Mentioned in SAL (#wikimedia-operations) [2021-01-12T06:16:09Z] <marostegui> Stop mysql on db1079 to clone db1155:3317 T268742

Sanitarium positions to replicate from on s7

# mysql.py -hdb1125:3317 -e "show slave status\G"
*************************** 1. row ***************************
Slave_IO_State: Reconnecting after a failed master event read
Master_Host: db1079.eqiad.wmnet
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: db1079-bin.005403
Read_Master_Log_Pos: 604784159
Relay_Log_File: db1125-relay-bin.000382
Relay_Log_Pos: 604784448
Relay_Master_Log_File: db1079-bin.005403
Slave_IO_Running: Connecting
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table: mysql.%,oai.%,advisorswiki.%,arbcom_cswiki.%,arbcom_dewiki.%,arbcom_enwiki.%,arbcom_fiwiki.%,arbcom_nlwiki.%,arbcom_ruwiki.%,auditcomwiki.%,boardgovcomwiki.%,boardwiki.%,chairwiki.%,chapcomwiki.%,checkuserwiki.%,collabwiki.%,ecwikimedia.%,electcomwiki.%,execwiki.%,fdcwiki.%,grantswiki.%,id_internalwikimedia.%,iegcomwiki.%,ilwikimedia.%,internalwiki.%,legalteamwiki.%,movementroleswiki.%,noboard_chapterswikimedia.%,officewiki.%,ombudsmenwiki.%,otrs_wikiwiki.%,projectcomwiki.%,searchcomwiki.%,spcomwiki.%,stewardwiki.%,sysop_itwiki.%,techconductwiki.%,transitionteamwiki.%,wg_enwiki.%,wikimaniateamwiki.%,zerowiki.%,%.__wmf_checksums,%.accountaudit_login,%.arbcom1_vote,%.archive_old,%.blob_orphans,%.blob_tracking,%.bot_passwords,%.bv2009_edits,%.categorylinks_old,%.click_tracking,%.cu_changes,%.cu_log,%.cur,%.echo_email_batch,%.echo_event,%.echo_target_page,%.echo_unread_wikis,%.echo_notification,%.echo_push_subscription,%.edit_page_tracking,%.email_capture,%.exarchive,%.exrevision,%.globalnames,%.hidden,%.image_old,%.job,%.linkscc,%.localnames,%.log_search,%.logging_old,%.long_run_profiling,%.migrateuser_medium,%.moodbar_feedback,%.moodbar_feedback_response,%.msg_resource,%.oathauth_users,%.oauth_accepted_consumer,%.oauth_ratelimit_client_tier,%.oauth_registered_consumer,%.oauth2_access_tokens,%.objectcache,%.old_growth,%.oldimage_old,%.optin_survey,%.prefstats,%.prefswitch_survey,%.profiling,%.querycache,%.querycache_info,%.querycache_old,%.querycachetwo,%.reading_list,%.reading_list_entry,%.securepoll_cookie_match,%.securepoll_elections,%.securepoll_entity,%.securepoll_lists,%.securepoll_msgs,%.securepoll_options,%.securepoll_properties,%.securepoll_questions,%.securepoll_strike,%.securepoll_voters,%.securepoll_votes,%.spoofuser,%.text,%.titlekey,%.transcache,%.uploadstash,%.urlshortcodes,%.user_newtalk,%.vote_log,%.watchlist,%.watchlist_expiry,%.wikimedia_editor_tasks_counts,%.wikimedia_editor_tasks_keys,%.wikimedia_editor_tasks_targets_passed
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 604784159
Relay_Log_Space: 604784791
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: Yes
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 2003
Last_IO_Error: error reconnecting to master 'repl@db1079.eqiad.wmnet:3306' - retry-time: 60 maximum-retries: 86400 message: Can't connect to MySQL server on 'db1079.eqiad.wmnet' (111 "Connection refused")
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 171966555
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: Slave_Pos
Gtid_IO_Pos: 171966555-171966555-1411,0-180359185-3359637071,180367395-180367395-284439564,171978767-171978767-4484858466,180355111-180355111-131673159,171970664-171970664-2046603966,180359185-180359185-71998080,171970590-171970590-196280066
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: conservative
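The `Replicate_Wild_Ignore_Table` filter is what keeps private wikis and tables off the sanitarium in the first place. Its patterns are matched against `db.table` using LIKE-style wildcards (`%` for any run of characters, `_` for a single character). A rough sketch for checking whether a given table would be filtered, using a handful of patterns copied from the output above; it ignores backslash escapes and case-sensitivity rules, and the function names are illustrative:

```python
import re

# A small subset of the patterns shown in the output above.
PATTERNS = [
    "mysql.%",
    "officewiki.%",
    "%.cu_changes",
    "%.text",
    "%.watchlist",
]


def wild_to_regex(pattern):
    """Translate a LIKE-style pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # any sequence of characters
        elif ch == "_":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def is_ignored(db, table, patterns):
    """True if db.table matches at least one wild-ignore pattern."""
    name = f"{db}.{table}"
    return any(wild_to_regex(p).fullmatch(name) for p in patterns)


if __name__ == "__main__":
    for db, table in [("enwiki", "watchlist"), ("enwiki", "revision"),
                      ("officewiki", "page")]:
        print(db, table, is_ignored(db, table, PATTERNS))
```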

I have finished the sanitization of db1155:3317, and triple checked that centralauth is clean.

Replication started on db1155:3317. Also InnoDB compression has been started

Compressing the recently created databases on db1154:3315, and on clouddb1016:3315 and clouddb1020 too.

No errors on db1155:3317, compression still on-going

Marostegui updated the task description. Wed, Jan 13, 6:25 AM
Marostegui updated the task description. Thu, Jan 14, 12:28 PM

Change 656145 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] check_private_data_report: Run the checks on db1155's sections

https://gerrit.wikimedia.org/r/656145

Change 656145 merged by Marostegui:
[operations/puppet@production] check_private_data_report: Run the checks on db1155's sections

https://gerrit.wikimedia.org/r/656145

Marostegui closed this task as Resolved. Mon, Jan 18, 6:08 AM

db1155 ran its automatic data check with no issues, so everything is completed.
The next step is to move all the new clouddb hosts under these new sanitarium hosts (T272008).