ToolsDB replication is broken
Closed, Resolved · Public

Description

On Oct 27 14:09:54 UTC, replication stopped for a data drift error:

Could not execute Update_rows_v1 event on table s54518__mw.online; Can't find record in 'online', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log log.221984, end_log_pos 27719835

I filtered the table and restarted replication, but it simply failed on another table. After investigation, I found that the replica was set to read-write mode, which could explain the drift, though we don't advertise the existence of that server. Someone could have set their system to write to it for some reason, but this would be surprising.

Overall, this has blocked failover (T263679) until we can fix or rebuild this server.
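For reference, the prevention half of this (a replica accidentally left writable) is a one-line server setting; a minimal my.cnf sketch — where exactly this lands in the puppet-managed config is not shown here:

```ini
# my.cnf sketch: keep replicas read-only across restarts.
# SET GLOBAL read_only=ON alone is lost when mysqld restarts.
[mysqld]
read_only = 1
```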

Coordinating etherpad here: https://etherpad.wikimedia.org/p/toolsdb-2020-replica-rebuild

Event Timeline

I see these warnings, poking around more:

Oct 22 13:28:17 clouddb1002 mysqld[2726]: 2020-10-22 13:28:17 140615019570944 [Warning] InnoDB: Difficult to find free blocks in the buffer pool (21 search iterations)! 0 failed attempts to flush a page!
Oct 22 13:28:17 clouddb1002 mysqld[2726]: 2020-10-22 13:28:17 140615019570944 [Note] InnoDB: Consider increasing the buffer pool size.
Oct 22 13:28:17 clouddb1002 mysqld[2726]: 2020-10-22 13:28:17 140615019570944 [Note] InnoDB: Pending flushes (fsync) log: 0 buffer pool: 0 OS file reads: 1665814 OS file writes: 20110376 OS fsyncs: 14838967

There might be some tuning needed. I think when I built this cluster, I might have just started a fresh replica from scratch, because re-reading the materials from that time, it took a long time for things to catch up. I wish this dataset weren't so large that it's tied to the hardware, or I could just spin up another replica.
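The warning above is the classic symptom of an undersized buffer pool, and it has a direct tuning knob; a my.cnf sketch with an illustrative value — the right size depends on the host's RAM:

```ini
# my.cnf sketch: increase the InnoDB buffer pool, per the warning's advice.
# A common rule of thumb is 50-70% of RAM on a dedicated database host;
# 48G here is illustrative only.
[mysqld]
innodb_buffer_pool_size = 48G
```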

@Andrew Question: we have enough storage that I could build a new database replica on a Ceph VM, right? I'd wanted to wait until we had cinder volumes to migrate these to allow detachability, but that would be one way to get this "unlocked" from the hardware. They'd need far less of a throttle for storage, of course.

Yes, there's room. If we moved all four to Ceph it would take a pretty big bite -- maybe 10% of total capacity (and 20% of currently remaining capacity)

Mentioned in SAL (#wikimedia-cloud) [2020-10-29T23:49:56Z] <bstorm> launching clouddb-toolsdb-03 (trying to make a better naming convention) to replace failed replica T266587

Change 637726 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolsdb: Make socket a parameter so new servers might work right

https://gerrit.wikimedia.org/r/637726

Change 637726 merged by Bstorm:
[operations/puppet@production] toolsdb: Make socket a parameter so new servers might work right

https://gerrit.wikimedia.org/r/637726

Change 637751 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolsdb: Fix the my.cnf template to include parameter for socket

https://gerrit.wikimedia.org/r/637751

Change 637751 merged by Bstorm:
[operations/puppet@production] toolsdb: Fix the my.cnf template to include parameter for socket

https://gerrit.wikimedia.org/r/637751

Change 637763 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolsdb: actually use the read_only parameter from the profiles

https://gerrit.wikimedia.org/r/637763

Change 637763 merged by Bstorm:
[operations/puppet@production] toolsdb: actually use the read_only parameter from the profiles

https://gerrit.wikimedia.org/r/637763

So trying to get a dump in order to rebuild replication is infuriatingly tricky because so much changes on ToolsDB.
Today, I got this far (48GB compressed dump):

root@clouddb1001:/srv/labsdb/backup1# mysqldump --skip-lock-tables --single-transaction --flush-logs --hex-blob --master-data=2 -A | gzip  > dump.sql.gz
mysqldump: Error 1412: Table definition has changed, please retry transaction when dumping table `ch_fixed_links` at row: 0

I think I'm not going to be able to get around scheduling a read lock on the tables to take a solid backup to start replication from. With an announcement, we could possibly discourage changing table definitions for a day or so, and then my command might work. It takes so long that I'm not sure that's worth it vs. the brute-force method of locking the whole setup for a bit. I can give it another shot tomorrow.
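For what it's worth, the dump invocation above can be wrapped in a tiny helper so retries use exactly the same flags; a sketch (the output path is a placeholder, and note that --single-transaction only gives a consistent snapshot for InnoDB tables, so DDL during the dump still breaks it, as the Error 1412 shows):

```shell
# Sketch: assemble the mysqldump command used above so retries stay consistent.
build_dump_cmd() {
    local outfile="$1"
    # --master-data=2 writes the binlog coordinates as a comment in the dump.
    echo "mysqldump --skip-lock-tables --single-transaction --flush-logs" \
         "--hex-blob --master-data=2 -A | gzip > ${outfile}"
}

build_dump_cmd /srv/labsdb/backup1/dump.sql.gz
```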

I did not get a chance to give it another shot today. I suspect I really need to schedule a downtime, but I could always start another run just in case I get lucky one more time.

It did not work. Scheduling downtime.

@Marostegui Just wanted a second opinion based on my comments above. I believe that all means I need to take a backup with a read lock in place. Correct?

You could also set the host to read-only with set global read_only=ON if there is nothing writing as root.

Not sure if you'll be using mysqldump or mydumper. If mysqldump, make sure to use --master-data=2 so you get the binlog coordinates to start replication from in the .sql file.

Although if you are planning to set the host as read-only, the position shown by show master status shouldn't be changing, so you can grab it anytime before setting the host writable again.
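The coordinate-grabbing step above can be done mechanically instead of by eye; a sketch assuming the usual `SHOW MASTER STATUS\G` output format (the sample text below is illustrative, reusing the file/position numbers from the error at the top of this task, not live coordinates):

```shell
# Sketch: pull File and Position out of `SHOW MASTER STATUS\G` output.
parse_coords() {
    awk -F': ' '/^ *File:/ {f=$2} /^ *Position:/ {p=$2} END {print f, p}'
}

# Illustrative sample of the \G output format:
sample='*************************** 1. row ***************************
            File: log.221984
        Position: 27719835
    Binlog_Do_DB: '

printf '%s\n' "$sample" | parse_coords   # → log.221984 27719835
```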

Just out of curiosity I checked for MyISAM tables:

MariaDB [(none)]> SELECT TABLE_SCHEMA,TABLE_NAME,ENGINE FROM INFORMATION_SCHEMA.TABLES WHERE ENGINE = 'MyISAM';
+--------------------------------+---------------------------------------+--------+
| TABLE_SCHEMA | TABLE_NAME | ENGINE |
+--------------------------------+---------------------------------------+--------+
| heartbeat | heartbeat | MyISAM |
| mysql | column_stats | MyISAM |
| mysql | columns_priv | MyISAM |
| mysql | db | MyISAM |
| mysql | event | MyISAM |
| mysql | func | MyISAM |
| mysql | help_category | MyISAM |
| mysql | help_keyword | MyISAM |
| mysql | help_relation | MyISAM |
| mysql | help_topic | MyISAM |
| mysql | host | MyISAM |
| mysql | index_stats | MyISAM |
| mysql | ndb_binlog_index | MyISAM |
| mysql | plugin | MyISAM |
| mysql | proc | MyISAM |
| mysql | procs_priv | MyISAM |
| mysql | proxies_priv | MyISAM |
| mysql | roles_mapping | MyISAM |
| mysql | servers | MyISAM |
| mysql | table_stats | MyISAM |
| mysql | tables_priv | MyISAM |
| mysql | time_zone | MyISAM |
| mysql | time_zone_leap_second | MyISAM |
| mysql | time_zone_name | MyISAM |
| mysql | time_zone_transition | MyISAM |
| mysql | time_zone_transition_type | MyISAM |
| mysql | user | MyISAM |
| s51083__ukbot | articles | MyISAM |
| s51083__ukbot | contests | MyISAM |
| s51083__ukbot | contribs | MyISAM |
| s51083__ukbot | fulltexts | MyISAM |
| s51083__ukbot | notifications | MyISAM |
| s51083__ukbot | prizes | MyISAM |
| s51083__ukbot | schemachanges | MyISAM |
| s51083__ukbot | users | MyISAM |
| s51109__cb | fulltexts | MyISAM |
| s51109__cb | trr | MyISAM |
| s51138__heritage_p | commonscat | MyISAM |
| s51138__heritage_p | id_dump | MyISAM |
| s51138__heritage_p | image | MyISAM |
| s51138__heritage_p | monuments-old_ch_(de) | MyISAM |
| s51138__heritage_p | monuments-old_ch_(fr) | MyISAM |
| s51138__heritage_p | monuments-old_ch_(it) | MyISAM |
| s51138__heritage_p | monuments_by_(be-x-old) | MyISAM |
| s51138__heritage_p | monuments_de-nrw_(de) | MyISAM |
| s51138__heritage_p | monuments_hk-hb_(en) | MyISAM |
| s51138__heritage_p | monuments_pl-old_(pl) | MyISAM |
| s51138__heritage_p | monuments_ru-old_(ru) | MyISAM |
| s51138__heritage_p | monuments_se-arbetsliv_(sv) | MyISAM |
| s51138__heritage_p | monuments_se-fornminne_(sv) | MyISAM |
| s51138__heritage_p | prox_search | MyISAM |
| s51138__heritage_p | wlpa_at_(de) | MyISAM |
| s51138__heritage_p | wlpa_es-ct_(ca) | MyISAM |
| s51138__heritage_p | wlpa_fi_(fi) | MyISAM |
| s51138__heritage_p | wlpa_il_(en) | MyISAM |
| s51139__mwtest | searchindex | MyISAM |
| s51149__wikitest-rtl | searchindex | MyISAM |
| s51185__zhwp | zhwp_dykarticle | MyISAM |
| s51185__zhwp | zhwp_updatequeue | MyISAM |
| s51201_wiki | searchindex | MyISAM |
| s51204__whichisbetterwiki | foosearchindex | MyISAM |
| s51211__duplicity_p | bad_cats | MyISAM |
| s51211__duplicity_p | no_wd | MyISAM |
| s51211__feed_p | feed | MyISAM |
| s51211__feed_p | item_cache | MyISAM |
| s51211__ib2_p | searchindex | MyISAM |
| s51211__sourcerer_p | suggestions | MyISAM |
| s51219__db | wiki | MyISAM |
| s51258__bookmanager | searchindex | MyISAM |
| s51280__catmonitor | articles | MyISAM |
| s51280__catmonitor | stats | MyISAM |
| s51280__cmon_nn | articles | MyISAM |
| s51280__cmon_nn | stats | MyISAM |
| s51280__cmon_no | articles | MyISAM |
| s51280__cmon_no | stats | MyISAM |
| s51290__dpl_p | dab_hof | MyISAM |
| s51296__huwiki_p | ores_dump | MyISAM |
| s51296__huwiki_p | revision | MyISAM |
| s51315__DBName | searchindex | MyISAM |
| s51315__wiki | searchindex | MyISAM |
| s51375_wiki | searchindex | MyISAM |
| s51391__mp | comentors | MyISAM |
| s51391__mp | logging | MyISAM |
| s51391__mp | mentee | MyISAM |
| s51391__mp | mentee_articles | MyISAM |
| s51391__mp | mentee_mentor | MyISAM |
| s51391__mp | mentor | MyISAM |
| s51397__owintes | delete---searchindex | MyISAM |
| s51397__owintes | mw_searchindex | MyISAM |
| s51397__owintes | ow2_searchindex | MyISAM |
| s51397__owintes | ow3_searchindex | MyISAM |
| s51397__owintes | tst001_searchindex | MyISAM |
| s51397__owintes | xx02_searchindex | MyISAM |
| s51397__owintes | xx02_translate_tmf | MyISAM |
| s51397__owintes | zuzuzu_searchindex | MyISAM |
| s51401__pop_data | config_history | MyISAM |
| s51401__pop_data | pop_Jun16 | MyISAM |
| s51415__stations_p | links | MyISAM |
| s51415__stations_p | stations | MyISAM |
| s51419__data | ipp_changes | MyISAM |
| s51419__data | ipp_changes_old | MyISAM |
| s51419__data | ipp_vandalen | MyISAM |
| s51419__data | ipp_vandalen_old | MyISAM |
| s51419__data | npp_nb_data | MyISAM |
| s51419__data | npp_nb_words | MyISAM |
| s51419__data | npp_new_pages | MyISAM |
| s51419__data | npp_old_pages | MyISAM |
| s51419__data | npp_success | MyISAM |
| s51422__geophotoreq | photocoords | MyISAM |
| s51434__mixnmatch2015_p | auxiliary | MyISAM |
| s51434__mixnmatch2015_p | catalog | MyISAM |
| s51434__mixnmatch2015_p | entry | MyISAM |
| s51434__mixnmatch2015_p | log | MyISAM |
| s51434__mixnmatch2015_p | user | MyISAM |
| s51434__mixnmatch2015_p | user_edits | MyISAM |
| s51434__mixnmatch2015_p | vw_common_names | MyISAM |
| s51511__lestatywiki | searchindex | MyISAM |
| s51511__wiki | searchindex | MyISAM |
| s51522__jeltebot | ircbot_wikipages | MyISAM |
| s51522__jeltebot | ircbot_wikiusers | MyISAM |
| s51835_usagestats | FriendlySessions | MyISAM |
| s51884_wikibot | pmid_id | MyISAM |
| s51892_toolserverdb | language | MyISAM |
| s51892_toolserverdb | namespacename | MyISAM |
| s51892_toolserverdb | servermapping | MyISAM |
| s51892_toolserverdb | wiki | MyISAM |
| s51892_toolserverdb_p | language | MyISAM |
| s51892_toolserverdb_p | namespace | MyISAM |
| s51892_toolserverdb_p | namespace_test | MyISAM |
| s51892_toolserverdb_p | namespacename | MyISAM |
| s51892_toolserverdb_p | servermapping | MyISAM |
| s51892_toolserverdb_p | wiki | MyISAM |
| s51932__mw | searchindex | MyISAM |
| s51964_denkmalliste | DenkmalGemeinde | MyISAM |
| s51964_denkmalliste | monuments_at_(de) | MyISAM |
| s51964_denkmalliste | wle_at_wp | MyISAM |
| s51964_denkmalliste | wle_at_wp_objects | MyISAM |
| s51999__merge_candidates | bad_images | MyISAM |
| s51999__merge_candidates | daily | MyISAM |
| s51999__merge_candidates | genderless_people | MyISAM |
| s51999__merge_candidates | item_pairs | MyISAM |
| s51999__merge_candidates | people_no_date | MyISAM |
| s51999__merge_candidates | people_no_dob | MyISAM |
| s51999__merge_candidates | people_no_dod | MyISAM |
| s51999__merge_candidates | potential_alma_mater | MyISAM |
| s51999__merge_candidates | potential_commonscat | MyISAM |
| s51999__merge_candidates | potential_disambig | MyISAM |
| s51999__merge_candidates | potential_genre | MyISAM |
| s51999__merge_candidates | potential_nationality | MyISAM |
| s51999__merge_candidates | potential_new_pages | MyISAM |
| s51999__merge_candidates | potential_occupation | MyISAM |
| s51999__merge_candidates | potential_people | MyISAM |
| s51999__merge_candidates | scores | MyISAM |
| s51999__merge_candidates | users | MyISAM |
| s52078__ESPC | 2015_fotos | MyISAM |
| s52078__ESPC | 2015_fotos_commons | MyISAM |
| s52078__ESPC | 2015_jury | MyISAM |
| s52078__ESPC | 2015_points | MyISAM |
| s52078__ESPC | 2015_v_jury | MyISAM |
| s52078__ESPC | 2015_v_votes | MyISAM |
| s52078__ESPC | 2015_votes | MyISAM |
| s52078__wlm | mon_pics | MyISAM |
| s52078__wlm | mon_pics_commons | MyISAM |
| s52131__ltcommunity | feed_matches | MyISAM |
| s52164__worldadmin98_p | worldadmin98 | MyISAM |
| s52229__wlmuk_p | coords | MyISAM |
| s52229__wlmuk_p | existing_media | MyISAM |
| s52229__wlmuk_p | search_entry | MyISAM |
| s52229__wlmuk_p | search_region | MyISAM |
| s52247__align | alig_equiv | MyISAM |
| s52247__align | alig_pages | MyISAM |
| s52247__align | alig_values | MyISAM |
| s52252__d1 | searchindex | MyISAM |
| s52258__mw | searchindex | MyISAM |
| s52276__analyze | parsec_questions | MyISAM |
| s52281__directory | dir_ft | MyISAM |
| s52302__mw | searchindex | MyISAM |
| s52322__wlm | ratings | MyISAM |
| s52322__wlm | sessions | MyISAM |
| s52322__wlm | settings | MyISAM |
| s52322__wlm | users | MyISAM |
| s52355__wikisoba_p | answer | MyISAM |
| s52355__wikisoba_p | unit | MyISAM |
| s52355__wikisoba_p | user | MyISAM |
| s52380__metamine_p | catalogue | MyISAM |
| s52380__metamine_p | catalogue_meta | MyISAM |
| s52380__metamine_p | fact | MyISAM |
| s52380__metamine_p | fact_meta | MyISAM |
| s52542__commonsarchivewiki | searchindex | MyISAM |
| s52567__pywikibot_testwiki | searchindex | MyISAM |
| s52585__cb | trr | MyISAM |
| s52663__freefiles | searchindex | MyISAM |
| s52707__wlm_jury_at | wlm_2015_fotos | MyISAM |
| s52707__wlm_jury_at | wlm_2015_jury | MyISAM |
| s52707__wlm_jury_at | wlm_2015_points | MyISAM |
| s52707__wlm_jury_at | wlm_2015_v_jury | MyISAM |
| s52707__wlm_jury_at | wlm_2015_v_votes | MyISAM |
| s52707__wlm_jury_at | wlm_2015_votes | MyISAM |
| s52722__wikidb | searchindex | MyISAM |
| s52779__mediawikidev | searchindex | MyISAM |
| s52780__etherpad | store | MyISAM |
| s52861__bwAPI_p | Bilderwunsch_Geo | MyISAM |
| s52861__fnbot | edit_log | MyISAM |
| s52861__fnbot | review_filter | MyISAM |
| s52861__fnbot | review_opt_in | MyISAM |
| s52861__request | cvu | MyISAM |
| s52861__request | fnbot_edit_log | MyISAM |
| s52861__request | log_filter | MyISAM |
| s52861__request | logfilter_user | MyISAM |
| s52861__request | rev_query | MyISAM |
| s52861__request | sumdisc_log | MyISAM |
| s52861__request | thanks | MyISAM |
| s52861__request | thanks_dump | MyISAM |
| s52861__request | thanks_offset | MyISAM |
| s52861__request | thanks_user | MyISAM |
| s52861__secWatch | log | MyISAM |
| s52861__secWatch | optin_list | MyISAM |
| s52861__secWatch | rev_log | MyISAM |
| s52861__thanks_de | thx_list | MyISAM |
| s52973__alvarowiki | alvaro_searchindex | MyISAM |
| s52973__alvarowiki_p | searchindex | MyISAM |
| s53093__wiki | mw_searchindex | MyISAM |
| s53115__cb | trr | MyISAM |
| s53152__dow | tmp_pages | MyISAM |
| s53152__dow | tmp_redirects | MyISAM |
| s53209__suggester-test | searchindex | MyISAM |
| s53218__testwiki | searchindex | MyISAM |
| s53306__test | lime_answers | MyISAM |
| s53306__test | lime_assessments | MyISAM |
| s53306__test | lime_boxes | MyISAM |
| s53306__test | lime_conditions | MyISAM |
| s53306__test | lime_defaultvalues | MyISAM |
| s53306__test | lime_expression_errors | MyISAM |
| s53306__test | lime_failed_login_attempts | MyISAM |
| s53306__test | lime_groups | MyISAM |
| s53306__test | lime_labels | MyISAM |
| s53306__test | lime_labelsets | MyISAM |
| s53306__test | lime_map_tutorial_users | MyISAM |
| s53306__test | lime_notifications | MyISAM |
| s53306__test | lime_participant_attribute | MyISAM |
| s53306__test | lime_participant_attribute_names | MyISAM |
| s53306__test | lime_participant_attribute_names_lang | MyISAM |
| s53306__test | lime_participant_attribute_values | MyISAM |
| s53306__test | lime_participant_shares | MyISAM |
| s53306__test | lime_participants | MyISAM |
| s53306__test | lime_permissions | MyISAM |
| s53306__test | lime_plugin_settings | MyISAM |
| s53306__test | lime_plugins | MyISAM |
| s53306__test | lime_question_attributes | MyISAM |
| s53306__test | lime_questions | MyISAM |
| s53306__test | lime_quota | MyISAM |
| s53306__test | lime_quota_languagesettings | MyISAM |
| s53306__test | lime_quota_members | MyISAM |
| s53306__test | lime_saved_control | MyISAM |
| s53306__test | lime_sessions | MyISAM |
| s53306__test | lime_settings_global | MyISAM |
| s53306__test | lime_settings_user | MyISAM |
| s53306__test | lime_survey_213756 | MyISAM |
| s53306__test | lime_survey_213756_timings | MyISAM |
| s53306__test | lime_survey_links | MyISAM |
| s53306__test | lime_survey_url_parameters | MyISAM |
| s53306__test | lime_surveymenu | MyISAM |
| s53306__test | lime_surveymenu_entries | MyISAM |
| s53306__test | lime_surveys | MyISAM |
| s53306__test | lime_surveys_groups | MyISAM |
| s53306__test | lime_surveys_languagesettings | MyISAM |
| s53306__test | lime_template_configuration | MyISAM |
| s53306__test | lime_templates | MyISAM |
| s53306__test | lime_tutorial_entries | MyISAM |
| s53306__test | lime_tutorial_entry_relation | MyISAM |
| s53306__test | lime_tutorials | MyISAM |
| s53306__test | lime_user_groups | MyISAM |
| s53306__test | lime_user_in_groups | MyISAM |
| s53306__test | lime_users | MyISAM |
| s53314__testwiki | searchindex | MyISAM |
| s53343__comprende_p | searchindex | MyISAM |
| s53378__mediawiki | searchindex | MyISAM |
| s53507__zh_delwiki | searchindex | MyISAM |
| s53613__ace2018 | ace2018 | MyISAM |
| s53666__testwiki | searchindex | MyISAM |
| s53666__testwiki2 | searchindex | MyISAM |
| s53674__wiki | mw_searchindex | MyISAM |
| s53700__ddescriptions | demoniemen | MyISAM |
| s53700__ddescriptions | log | MyISAM |
| s53936__wiki | searchindex | MyISAM |
| s53949__wikidb | searchindex | MyISAM |
| s53951__wmdereference-previews | searchindex | MyISAM |
| s54007__wiki | searchindex | MyISAM |
| s54101__flexion_p | adjektive | MyISAM |
| s54101__flexion_p | nomen | MyISAM |
| s54101__flexion_p | verben | MyISAM |
| s54193__data | w_searchindex | MyISAM |
| s54224__wudele | fd_framadate_migration | MyISAM |
| s54226__sigmabot | mailbox | MyISAM |
| s54226__sigmabot | quotes | MyISAM |
| s54235__commonswiki | searchindex | MyISAM |
| s54235__commonswiki | translate_tmf | MyISAM |
| s54235__fansvillewiki | searchindex | MyISAM |
| s54235__fansvillewiki | translate_tmf | MyISAM |
| s54235__hamishwiki | searchindex | MyISAM |
| s54235__logins1wiki | searchindex | MyISAM |
| s54235__logins1wiki | translate_tmf | MyISAM |
| s54235__test2wiki | searchindex | MyISAM |
| s54235__test2wiki | translate_tmf | MyISAM |
| s54235__test35wiki | searchindex | MyISAM |
| s54235__test35wiki | translate_tmf | MyISAM |
| s54235__testwiki | searchindex | MyISAM |
| s54235__testwiki | translate_tmf | MyISAM |
| s54242__nokib | Taxonomy | MyISAM |
| s54242__nokib | Users | MyISAM |
| s54378__wiki | searchindex | MyISAM |
| s54389__tutorial | wp9c_loginizer_logs | MyISAM |
| s54435__tutorial | wp9c_loginizer_logs | MyISAM |
| s54490__dase_lab | searchindex | MyISAM |
| s54507__daselabdb | smwsearchindex | MyISAM |
| s54518__mw | h_attachments | MyISAM |
| s54518__mw | h_banned_emails | MyISAM |
| s54518__mw | h_banned_ips | MyISAM |
| s54518__mw | h_categories | MyISAM |
| s54518__mw | h_custom_fields | MyISAM |
| s54518__mw | h_custom_statuses | MyISAM |
| s54518__mw | h_kb_articles | MyISAM |
| s54518__mw | h_kb_attachments | MyISAM |
| s54518__mw | h_kb_categories | MyISAM |
| s54518__mw | h_logins | MyISAM |
| s54518__mw | h_mail | MyISAM |
| s54518__mw | h_notes | MyISAM |
| s54518__mw | h_online | MyISAM |
| s54518__mw | h_pipe_loops | MyISAM |
| s54518__mw | h_replies | MyISAM |
| s54518__mw | h_reply_drafts | MyISAM |
| s54518__mw | h_reset_password | MyISAM |
| s54518__mw | h_service_messages | MyISAM |
| s54518__mw | h_std_replies | MyISAM |
| s54518__mw | h_ticket_templates | MyISAM |
| s54518__mw | h_tickets | MyISAM |
| s54518__mw | h_users | MyISAM |
| s54518__mw | searchindex | MyISAM |
| s54524__mediawiki | searchindex | MyISAM |
| s54548__certify | Certificate | MyISAM |
| s54548__certify | Queue | MyISAM |
| s54548__certify | Response | MyISAM |
| s54548__certify | Users | MyISAM |
| s54548__certify | Workshop | MyISAM |
| s54560__cptdb | pagetriagesearchindex | MyISAM |
| s54566__wiki | mw_searchindex | MyISAM |
| u10076__wikidata | P569 | MyISAM |
| u10076__wikidata | P570 | MyISAM |
| u2138__wiki | searchindex | MyISAM |
| u2264_cx_db | cxsearchindex | MyISAM |
| u2815__p | coord_afwiki | MyISAM |
| u2815__p | coord_alswiki | MyISAM |
| u2815__p | coord_arwiki | MyISAM |
| u2815__p | coord_astwiki | MyISAM |
| u2815__p | coord_azwiki | MyISAM |
| u2815__p | coord_bewiki | MyISAM |
| u2815__p | coord_bgwiki | MyISAM |
| u2815__p | coord_bpywiki | MyISAM |
| u2815__p | coord_cawiki | MyISAM |
| u2815__p | coord_cebwiki | MyISAM |
| u2815__p | coord_commonswiki | MyISAM |
| u2815__p | coord_cswiki | MyISAM |
| u2815__p | coord_dawiki | MyISAM |
| u2815__p | coord_dewiki | MyISAM |
| u2815__p | coord_elwiki | MyISAM |
| u2815__p | coord_enwiki | MyISAM |
| u2815__p | coord_eowiki | MyISAM |
| u2815__p | coord_eswiki | MyISAM |
| u2815__p | coord_etwiki | MyISAM |
| u2815__p | coord_euwiki | MyISAM |
| u2815__p | coord_fawiki | MyISAM |
| u2815__p | coord_fiwiki | MyISAM |
| u2815__p | coord_frwiki | MyISAM |
| u2815__p | coord_fywiki | MyISAM |
| u2815__p | coord_gawiki | MyISAM |
| u2815__p | coord_glwiki | MyISAM |
| u2815__p | coord_hewiki | MyISAM |
| u2815__p | coord_hiwiki | MyISAM |
| u2815__p | coord_hrwiki | MyISAM |
| u2815__p | coord_htwiki | MyISAM |
| u2815__p | coord_huwiki | MyISAM |
| u2815__p | coord_hywiki | MyISAM |
| u2815__p | coord_idwiki | MyISAM |
| u2815__p | coord_ilowiki | MyISAM |
| u2815__p | coord_itwiki | MyISAM |
| u2815__p | coord_jawiki | MyISAM |
| u2815__p | coord_kawiki | MyISAM |
| u2815__p | coord_kkwiki | MyISAM |
| u2815__p | coord_kowiki | MyISAM |
| u2815__p | coord_kuwiki | MyISAM |
| u2815__p | coord_lawiki | MyISAM |
| u2815__p | coord_lbwiki | MyISAM |
| u2815__p | coord_ltwiki | MyISAM |
| u2815__p | coord_lvwiki | MyISAM |
| u2815__p | coord_mkwiki | MyISAM |
| u2815__p | coord_mlwiki | MyISAM |
| u2815__p | coord_mswiki | MyISAM |
| u2815__p | coord_ndswiki | MyISAM |
| u2815__p | coord_newwiki | MyISAM |
| u2815__p | coord_nlwiki | MyISAM |
| u2815__p | coord_nnwiki | MyISAM |
| u2815__p | coord_nowiki | MyISAM |
| u2815__p | coord_nvwiki | MyISAM |
| u2815__p | coord_ocwiki | MyISAM |
| u2815__p | coord_oswiki | MyISAM |
| u2815__p | coord_pamwiki | MyISAM |
| u2815__p | coord_plwiki | MyISAM |
| u2815__p | coord_pmswiki | MyISAM |
| u2815__p | coord_ptwiki | MyISAM |
| u2815__p | coord_rowiki | MyISAM |
| u2815__p | coord_ruwiki | MyISAM |
| u2815__p | coord_shwiki | MyISAM |
| u2815__p | coord_simplewiki | MyISAM |
| u2815__p | coord_skwiki | MyISAM |
| u2815__p | coord_slwiki | MyISAM |
| u2815__p | coord_srwiki | MyISAM |
| u2815__p | coord_svwiki | MyISAM |
| u2815__p | coord_swwiki | MyISAM |
| u2815__p | coord_tawiki | MyISAM |
| u2815__p | coord_tewiki | MyISAM |
| u2815__p | coord_thwiki | MyISAM |
| u2815__p | coord_tlwiki | MyISAM |
| u2815__p | coord_trwiki | MyISAM |
| u2815__p | coord_ukwiki | MyISAM |
| u2815__p | coord_uzwiki | MyISAM |
| u2815__p | coord_vecwiki | MyISAM |
| u2815__p | coord_viwiki | MyISAM |
| u2815__p | coord_vowiki | MyISAM |
| u2815__p | coord_warwiki | MyISAM |
| u2815__p | coord_zhwiki | MyISAM |
| u2815__p | principle_links | MyISAM |
| u2815__p | projectbanner | MyISAM |
| u2815__p | redlinks_enwiki_p | MyISAM |
| u2815__p | region_cache | MyISAM |
| u2815__p | task | MyISAM |
| u2815__p | task_page | MyISAM |
| u2815__p | worldadmin98 | MyISAM |
+--------------------------------+---------------------------------------+--------+
437 rows in set (0.28 sec)

I would contact the tools' owners and check if there is a specific reason for them to be MyISAM, and if not, encourage them to convert them to InnoDB.
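For tools with no reason to stay on MyISAM, the conversion is one ALTER per table; a sketch that just prints the statements for review rather than running them (the schema and table names below are examples taken from the list, and you would pipe the output to mysql only after checking it):

```shell
# Sketch: print ALTER TABLE ... ENGINE=InnoDB statements for a tool's
# MyISAM tables, so they can be reviewed before being run.
gen_alters() {
    local schema="$1"
    shift
    local t
    for t in "$@"; do
        printf 'ALTER TABLE `%s`.`%s` ENGINE=InnoDB;\n' "$schema" "$t"
    done
}

# Example schema/tables from the list above:
gen_alters s51083__ukbot articles contests contribs
```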

s54524__mediawiki.searchindex (notwikilambda tool) is MyISAM because it’s a MediaWiki table that needs fulltext indexes. (Compare T107875.) The same is probably true for a handful of other tables in that list.

Mentioned in SAL (#wikimedia-cloud) [2020-11-10T16:24:20Z] <bstorm> set toolsdb to read-only T266587

Mentioned in SAL (#wikimedia-cloud) [2020-11-10T16:37:58Z] <arturo> icinga downtime toolschecker for 2h becasue toolsdb maintenance (T266587)

Mentioned in SAL (#wikimedia-cloud) [2020-11-10T16:41:21Z] <arturo> set paws in sqlite mode because T266587 (kubectl --namespace prod edit configmap hub-config)

Change 640483 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolsdb: set up quickcopy to move the dump

https://gerrit.wikimedia.org/r/640483

Change 640483 merged by Bstorm:
[operations/puppet@production] toolsdb: set up quickcopy to move the dump

https://gerrit.wikimedia.org/r/640483

Dump is still progressing. It's taking a very long time, but it has already gotten much further in read-only mode than it did in full operation (now at 60GB compressed).

Mentioned in SAL (#wikimedia-cloud) [2020-11-11T06:06:42Z] <bstorm> dump completed, transferring to replica to start things up again T266587

I would contact the tools' owners and check if there is a specific reason for them to be MyISAM, and if not, encourage them to convert them to InnoDB.

It seems weird to me that the mysql tables are MyISAM. Is that something we can fix on the fly?

I can spawn tickets to try to get users to switch. I see some of these are ones that have crashed replication before, which isn't surprising, and I've been really hoping to remove things that don't replicate rather than the other way around.

Transfer complete. Loading the database on the new replica.

Mentioned in SAL (#wikimedia-cloud) [2020-11-11T08:20:38Z] <bstorm> loading dump on the replica T266587

First, thank you for fixing this. Do you have any best/worst guess for when ToolsDB will be up again? There's WLM jurying going on using http://montage.toolforge.org, which is down, and we need to inform the jurors of when reviewing the photos is expected to continue.

@Zache At this point I am sure it will be done today (US day). I don't have a better estimate yet. The dump took 14 hours. The load will hopefully be quicker. I started it last night, and compression is not happening in the middle of it. I worry the primary will drift too far if I turn off read-only just now.

If the load also takes 14 hours, which I hope it won't, it started at around 0830 UTC. That would put the completion at 2230 UTC. I hope it will be faster, as I said.

For future reference, while this imports, the compressed size of a full dump of ToolsDB at this point in time is 110GB. The uncompressed size is more like 1TB.

That means that if we get much growth (like 500GB) without introducing cinder or T212595 (which I believe also depends on cinder), this will become impossible in the future. Good to know.

This is taking much longer than my estimate. As it is, I can see that it is still recreating some of the very large databases.

It's still rebuilding. The real disadvantage of this monolithic design is that the few really huge databases in there hold up progress for lots of tiny ones. It'll get there. No errors or anything so far.

In trying to calculate a better time frame for how quickly this is likely to finish, it appears to have imported about a quarter of the data in 28 hours. In light of that, we've decided to take the risk and let data continue to flow while the base import continues. As long as it doesn't take too long, it should still be able to pick up replication when it is all set.

Fingers crossed that we don't have to start all over again, but since there's a good chance we don't need to keep it readonly until this is done, it's unlocked now. In the event this proves wrong, we will have to start over at a later time.

The import died with ERROR 2006 (HY000) at line 617487: MySQL server has gone away
Trying to find out why now.

This comment was removed by Bstorm.

I've restarted the import while I look for some reason that might have happened.

Mentioned in SAL (#wikimedia-cloud) [2020-11-15T19:36:04Z] <bstorm> set max_allowed_package to 64MB on clouddb-toolsdb-03 T266587

Removed comment on replication since that was actually just old logs. Nothing current about that. It is entirely possible that there are larger-than-default inserts in there that used a larger max_allowed_packet (not package...sorry for the typo in the log) in the session variables. By doubling what they were (32MB), it might succeed.

That increases the chances that we will fall too far behind to start replication. Really hoping not :(

Mentioned in SAL (#wikimedia-cloud) [2020-11-15T19:45:45Z] <bstorm> restarting the import to clouddb-toolsdb-03 with --max-allowed-packet=1G to rule out that as a problem entirely T266587

Since it seems very likely that is why mariadb hung up on me, setting the session variable to the maximum to try to prevent this from happening again.
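For reference, the two one-off changes above (64MB server-side, then --max-allowed-packet=1G on the import client) have a persistent my.cnf equivalent; a sketch with illustrative values:

```ini
# my.cnf sketch: make the packet limit survive restarts instead of
# relying on SET GLOBAL. 1G is the server-side maximum.
[mysqld]
max_allowed_packet = 1G
```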

I would contact the tools' owners and check if there is a specific reason for them to be MyISAM, and if not, encourage them to convert them to InnoDB.

It seems weird to me that the mysql tables are MyISAM. Is that something we can fix on the fly?

That's the default until 10.4.
From 10.4 they use Aria, which is supposedly crash-safe.

I can spawn tickets to try to get users to switch. I see some of these are ones that have crashed replication before which isn't surprising, and I've been really hoping to remove things that don't replicate rather than the other way around.

The database load died again. This time it was:
ERROR 1030 (HY000) at line 663214: Got error 175 "File too short; Expected more data in file" from storage engine Aria
@Marostegui any ideas? Is my dump bad somehow? I'll google around a bit in case there's something I can do. I'd rather not spend another 14 hours in read-only mode. I also wonder if I can get away with throwing this in read-only and copying the data directory. I have a solid rsync setup.

At this point, I wouldn't try mysqldump (or mydumper) again. What you suggest is what I would do:

  • Set the host read-only.
  • Once you are sure it is read-only, run show master status\G a few times to make sure the coordinates don't change (changing coordinates would mean writes are still happening). Write down those coordinates.
  • Stop MySQL.
  • Copy the data directory to the new replica.
  • Once that is done, on the new replica run systemctl set-environment MYSQLD_OPTS="--skip-slave-start" and start mysql (make sure the directory privileges are good, otherwise mysql will not start).
  • On the new replica, reset replication with mysql -e "stop slave; reset slave all", and also run mysql_upgrade just in case.
  • Configure replication on the new replica: change master to master_host='master.fqdn', master_user='theuseryouuse', master_password='itspassword', master_ssl=1, master_log_file='the log file the above show master status showed', master_log_pos=the position grabbed from show master status; and then start slave;.
  • Start mysql on the master, and the slave will reconnect and start replicating if all went fine.
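The steps above, as a rough shell transcript (hostnames, the replication user, and the data directory path are all placeholders; the binlog coordinates come from the show master status output noted earlier):

```
# On the source (old replica):
mysql -e "SET GLOBAL read_only = 1;"
mysql -e "SHOW MASTER STATUS\G"     # repeat; coordinates must not move
systemctl stop mariadb

# Copy the data directory to the new replica:
rsync -a /srv/sqldata/ new-replica.example.org:/srv/sqldata/

# On the new replica:
chown -R mysql:mysql /srv/sqldata
systemctl set-environment MYSQLD_OPTS="--skip-slave-start"
systemctl start mariadb
mysql -e "STOP SLAVE; RESET SLAVE ALL;"
mysql_upgrade

# Point it at the master using the recorded coordinates (note that
# master_log_pos is numeric and must not be quoted):
mysql -e "CHANGE MASTER TO master_host='master.fqdn', master_user='repl',
  master_password='itspassword', master_ssl=1,
  master_log_file='log.NNNNNN', master_log_pos=NNNNNN; START SLAVE;"
```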

Sounds good. I'll set another time for it. Based on the rsync speeds I saw last time with the dump, I am hopeful that this will be a comparatively short read-only period.

@Bstorm keep in mind that it won't be read-only time; it will be downtime, as mysql needs to be switched off.
Not sure if in your environment you can use transfer.py with xtrabackup for it: https://wikitech.wikimedia.org/wiki/Transfer.py — if you can, then you only need read-only mode and you won't need to stop mysql.

Thank you again! I'll poke around and see.

FYI this is on toolsdb secondary server

aborrero@clouddb1002:~$ sudo run-puppet-agent
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Function lookup() did not find a value for the name 'profile::wmcs::services::toolsdb::rebuild_primary' (file: /etc/puppet/modules/profile/manifests/wmcs/services/toolsdb_secondary.pp, line: 5) on node clouddb1002.clouddb-services.eqiad.wmflabs
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run

Sorry if this is not the right ticket to report this.

I know. I broke that. I'll add a value there so it will not keep erroring.

Outage scheduled for 2020-12-16@1700 UTC.

Also:

clouddb1002 is a Cloud user database secondary (role::wmcs::db::toolsdb_secondary)
The last Puppet run was at Tue Nov 10 21:48:10 UTC 2020 (40400 minutes ago).

I seem to have left that broken for way longer than I should have 😅
Puppet is fixed there now.

Change 649893 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolsdb: set the rsync script to be the whole data dir

https://gerrit.wikimedia.org/r/649893

Change 649893 merged by Bstorm:
[operations/puppet@production] toolsdb: set the rsync script to be the whole data dir

https://gerrit.wikimedia.org/r/649893

Mentioned in SAL (#wikimedia-cloud) [2020-12-16T17:06:14Z] <bstorm> switching the secondary config back to clouddb1002 in order to minimize concerns about affecting ceph performance T266587

Mentioned in SAL (#wikimedia-cloud) [2020-12-16T17:24:28Z] <bstorm> settings toolsdb to readonly to prepare for shutdown T266587

MariaDB [(none)]> show master status\G
*************************** 1. row ***************************
 File: log.225791
 Position: 66155872
 Binlog_Do_DB:
 Binlog_Ignore_DB:
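With those coordinates, reconfiguring the replica once it is back up would look something like this (master hostname and credentials are placeholders; the log file and position are the ones recorded above):

```
CHANGE MASTER TO
  master_host='master.fqdn',
  master_user='repl',
  master_password='itspassword',
  master_ssl=1,
  master_log_file='log.225791',
  master_log_pos=66155872;
START SLAVE;
```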

Mentioned in SAL (#wikimedia-cloud) [2020-12-16T17:29:48Z] <bstorm> stopped mariadb on the replica T266587

Mentioned in SAL (#wikimedia-cloud) [2020-12-16T17:31:49Z] <bstorm> sync started from toolsdb to its replica server T266587

Mentioned in SAL (#wikimedia-cloud) [2020-12-16T18:34:40Z] <bstorm> restarted sync from toolsdb to its replica server after cleanup to prevent disk filling T266587

Mentioned in SAL (#wikimedia-cloud) [2020-12-17T01:10:38Z] <bstorm> the sync is done and we have a good copy of the toolsdb data, proceeding with the upgrades and stuff to that hypervisor while configuring replication to work again T266587

Mentioned in SAL (#wikimedia-cloud) [2020-12-17T02:14:19Z] <bstorm> toolsdb is back and so is the replica T266587

Mentioned in SAL (#wikimedia-cloud) [2020-12-17T02:22:16Z] <bstorm> Set PAWS hub back to using mariadb T266587

Bstorm claimed this task.

It's still replicating this morning. This is done, and this task is also a fine archive of what we have to do if ToolsDB does this again. At least we know it will only take a day or so.