s3 master emergency failover (db1075)
Closed, Resolved, Public

Description

Due to issues on A2 PDUs, we have to perform an emergency failover from db1075 to db1078 (which is on C3).
This will impact around 800 wikis, which will go read-only for a few minutes (reads will not be affected).

root@cumin1001:/home/marostegui/git/mediawiki-config/dblists# cat s3.dblist
aawiki
aawikibooks
aawiktionary
abwiki
abwiktionary
acewiki
advisorswiki
advisorywiki
adywiki
afwiki
afwikibooks
afwikiquote
afwiktionary
akwiki
akwikibooks
akwiktionary
alswiki
amwiki
amwikimedia
amwikiquote
amwiktionary
angwiki
angwikibooks
angwikiquote
angwikisource
angwiktionary
anwiki
anwiktionary
arbcom_cswiki
arbcom_dewiki
arbcom_enwiki
arbcom_fiwiki
arbcom_nlwiki
arcwiki
arwikibooks
arwikimedia
arwikinews
arwikiquote
arwikisource
arwikiversity
arwiktionary
arzwiki
astwiki
astwikibooks
astwikiquote
astwiktionary
aswiki
aswikibooks
aswikisource
aswiktionary
atjwiki
auditcomwiki
avwiki
avwiktionary
aywiki
aywikibooks
aywiktionary
azbwiki
azwiki
azwikibooks
azwikiquote
azwikisource
azwiktionary
barwiki
bat_smgwiki
bawiki
bawikibooks
bclwiki
bdwikimedia
be_x_oldwiki
betawikiversity
bewiki
bewikibooks
bewikimedia
bewikiquote
bewikisource
bewiktionary
bgwikibooks
bgwikinews
bgwikiquote
bgwikisource
bhwiki
bhwiktionary
biwiki
biwikibooks
biwiktionary
bjnwiki
bmwiki
bmwikibooks
bmwikiquote
bmwiktionary
bnwiki
bnwikibooks
bnwikisource
bnwiktionary
bnwikivoyage
boardgovcomwiki
boardwiki
bowiki
bowikibooks
bowiktionary
bpywiki
brwiki
brwikimedia
brwikiquote
brwikisource
brwiktionary
bswiki
bswikibooks
bswikinews
bswikiquote
bswikisource
bswiktionary
bugwiki
bxrwiki
cawikibooks
cawikimedia
cawikinews
cawikiquote
cawikisource
cawiktionary
cbk_zamwiki
cdowiki
cewiki
chairwiki
chapcomwiki
checkuserwiki
chowiki
chrwiki
chrwiktionary
chwiki
chwikibooks
chwiktionary
chywiki
ckbwiki
cnwikimedia
collabwiki
cowiki
cowikibooks
cowikimedia
cowikiquote
cowiktionary
crhwiki
crwiki
crwikiquote
crwiktionary
csbwiki
csbwiktionary
cswikibooks
cswikinews
cswikiquote
cswikisource
cswikiversity
cswiktionary
cuwiki
cvwiki
cvwikibooks
cywiki
cywikibooks
cywikiquote
cywikisource
cywiktionary
dawiki
dawikibooks
dawikiquote
dawikisource
dawiktionary
dewikibooks
dewikinews
dewikiquote
dewikisource
dewikiversity
dewikivoyage
dewiktionary
dinwiki
diqwiki
dkwikimedia
donatewiki
dsbwiki
dtywiki
dvwiki
dvwiktionary
dzwiki
dzwiktionary
ecwikimedia
eewiki
electcomwiki
elwiki
elwikibooks
elwikinews
elwikiquote
elwikisource
elwikiversity
elwikivoyage
elwiktionary
emlwiki
enwikibooks
enwikinews
enwikisource
enwikiversity
eowikibooks
eowikinews
eowikiquote
eowikisource
eowiktionary
eswikibooks
eswikinews
eswikiquote
eswikisource
eswikiversity
eswikivoyage
eswiktionary
etwiki
etwikibooks
etwikimedia
etwikiquote
etwikisource
etwiktionary
euwiki
euwikibooks
euwikiquote
euwikisource
euwiktionary
execwiki
extwiki
fawikibooks
fawikinews
fawikiquote
fawikisource
fawikivoyage
fawiktionary
fdcwiki
ffwiki
fiu_vrowiki
fiwikibooks
fiwikimedia
fiwikinews
fiwikiquote
fiwikisource
fiwikiversity
fiwikivoyage
fiwiktionary
fixcopyrightwiki
fjwiki
fjwiktionary
foundationwiki
fowiki
fowikisource
fowiktionary
frpwiki
frrwiki
frwikibooks
frwikinews
frwikiquote
frwikisource
frwikiversity
frwikivoyage
furwiki
fywiki
fywikibooks
fywiktionary
gagwiki
ganwiki
gawiki
gawikibooks
gawikiquote
gawiktionary
gdwiki
gdwiktionary
glkwiki
glwiki
glwikibooks
glwikiquote
glwikisource
glwiktionary
gnwiki
gnwikibooks
gnwiktionary
gomwiki
gorwiki
gotwiki
gotwikibooks
grantswiki
guwiki
guwikibooks
guwikiquote
guwikisource
guwiktionary
gvwiki
gvwiktionary
hakwiki
hawiki
hawiktionary
hawwiki
hewikibooks
hewikinews
hewikiquote
hewikisource
hewikivoyage
hewiktionary
hifwiki
hifwiktionary
hiwiki
hiwikibooks
hiwikimedia
hiwikiquote
hiwikiversity
hiwikivoyage
hiwiktionary
howiki
hrwiki
hrwikibooks
hrwikiquote
hrwikisource
hrwiktionary
hsbwiki
hsbwiktionary
htwiki
htwikisource
huwikibooks
huwikinews
huwikiquote
huwikisource
huwiktionary
hywiki
hywikibooks
hywikiquote
hywikisource
hywiktionary
hzwiki
iawiki
iawikibooks
iawiktionary
idwikibooks
idwikimedia
id_internalwikimedia
idwikiquote
idwikisource
idwiktionary
iegcomwiki
iewiki
iewikibooks
iewiktionary
igwiki
iiwiki
ikwiki
ikwiktionary
ilowiki
ilwikimedia
incubatorwiki
inhwiki
internalwiki
iowiki
iowiktionary
iswiki
iswikibooks
iswikiquote
iswikisource
iswiktionary
itwikibooks
itwikinews
itwikiquote
itwikisource
itwikiversity
itwikivoyage
itwiktionary
iuwiki
iuwiktionary
jamwiki
jawikibooks
jawikinews
jawikiquote
jawikisource
jawikiversity
jawiktionary
jbowiki
jbowiktionary
jvwiki
jvwiktionary
kaawiki
kabwiki
kawiki
kawikibooks
kawikiquote
kawiktionary
kbdwiki
kbpwiki
kgwiki
kiwiki
kjwiki
kkwiki
kkwikibooks
kkwikiquote
kkwiktionary
klwiki
klwiktionary
kmwiki
kmwikibooks
kmwiktionary
knwiki
knwikibooks
knwikiquote
knwikisource
knwiktionary
koiwiki
kowikibooks
kowikinews
kowikiquote
kowikisource
kowikiversity
kowiktionary
krcwiki
krwiki
krwikiquote
kshwiki
kswiki
kswikibooks
kswikiquote
kswiktionary
kuwiki
kuwikibooks
kuwikiquote
kuwiktionary
kvwiki
kwwiki
kwwikiquote
kwwiktionary
kywiki
kywikibooks
kywikiquote
kywiktionary
ladwiki
lawiki
lawikibooks
lawikiquote
lawikisource
lawiktionary
lbewiki
lbwiki
lbwikibooks
lbwikiquote
lbwiktionary
legalteamwiki
lezwiki
lfnwiki
lgwiki
lijwiki
liwiki
liwikibooks
liwikinews
liwikiquote
liwikisource
liwiktionary
lmowiki
lnwiki
lnwikibooks
lnwiktionary
loginwiki
lowiki
lowiktionary
lrcwiki
ltgwiki
ltwiki
ltwikibooks
ltwikiquote
ltwikisource
ltwiktionary
lvwiki
lvwikibooks
lvwiktionary
maiwiki
maiwikimedia
map_bmswiki
mdfwiki
mediawikiwiki
mgwiki
mgwikibooks
mhrwiki
mhwiki
mhwiktionary
minwiki
miwiki
miwikibooks
miwiktionary
mkwiki
mkwikibooks
mkwikimedia
mkwikisource
mkwiktionary
mlwiki
mlwikibooks
mlwikiquote
mlwikisource
mlwiktionary
mnwiki
mnwikibooks
mnwiktionary
movementroleswiki
mrjwiki
mrwiki
mrwikibooks
mrwikiquote
mrwikisource
mrwiktionary
mswiki
mswikibooks
mswiktionary
mtwiki
mtwiktionary
muswiki
mwlwiki
mxwikimedia
myvwiki
mywiki
mywikibooks
mywiktionary
mznwiki
nahwiki
nahwikibooks
nahwiktionary
napwiki
nawiki
nawikibooks
nawikiquote
nawiktionary
nds_nlwiki
ndswiki
ndswikibooks
ndswikiquote
ndswiktionary
newiki
newikibooks
newiktionary
newwiki
ngwiki
nlwikibooks
nlwikimedia
nlwikinews
nlwikiquote
nlwikisource
nlwikivoyage
nlwiktionary
nnwiki
nnwikiquote
nnwiktionary
noboard_chapterswikimedia
nostalgiawiki
novwiki
nowikibooks
nowikimedia
nowikinews
nowikiquote
nowikisource
nowiktionary
nrmwiki
nsowiki
nvwiki
nycwikimedia
nywiki
nzwikimedia
ocwiki
ocwikibooks
ocwiktionary
officewiki
olowiki
ombudsmenwiki
omwiki
omwiktionary
orwiki
orwikisource
orwiktionary
oswiki
otrs_wikiwiki
outreachwiki
pa_uswikimedia
pagwiki
pamwiki
papwiki
pawiki
pawikibooks
pawikisource
pawiktionary
pcdwiki
pdcwiki
pflwiki
pihwiki
piwiki
piwiktionary
plwikibooks
plwikimedia
plwikinews
plwikiquote
plwikisource
plwikivoyage
plwiktionary
pmswiki
pmswikisource
pnbwiki
pnbwiktionary
pntwiki
projectcomwiki
pswiki
pswikibooks
pswikivoyage
pswiktionary
ptwikibooks
ptwikimedia
ptwikinews
ptwikiquote
ptwikisource
ptwikiversity
ptwikivoyage
ptwiktionary
punjabiwikimedia
qualitywiki
quwiki
quwikibooks
quwikiquote
quwiktionary
rmwiki
rmwikibooks
rmwiktionary
rmywiki
rnwiki
rnwiktionary
roa_rupwiki
roa_rupwiktionary
roa_tarawiki
romdwikimedia
rowikibooks
rowikinews
rowikiquote
rowikisource
rowikivoyage
rowiktionary
rswikimedia
ruewiki
ruwikibooks
ruwikimedia
ruwikinews
ruwikiquote
ruwikisource
ruwikiversity
ruwikivoyage
ruwiktionary
rwwiki
rwwiktionary
sahwiki
sahwikiquote
sahwikisource
satwiki
sawiki
sawikibooks
sawikiquote
sawikisource
sawiktionary
scnwiki
scnwiktionary
scowiki
scwiki
scwiktionary
sdwiki
sdwikinews
sdwiktionary
searchcomwiki
sewiki
sewikibooks
sewikimedia
sgwiki
sgwiktionary
shwiktionary
shnwiki
simplewiki
simplewikibooks
simplewikiquote
simplewiktionary
siwiki
siwikibooks
siwiktionary
skwiki
skwikibooks
skwikiquote
skwikisource
skwiktionary
slwiki
slwikibooks
slwikiquote
slwikisource
slwikiversity
slwiktionary
smwiki
smwiktionary
snwiki
snwiktionary
sourceswiki
sowiki
sowiktionary
spcomwiki
specieswiki
sqwiki
sqwikibooks
sqwikinews
sqwikiquote
sqwiktionary
srnwiki
srwikibooks
srwikinews
srwikiquote
srwikisource
srwiktionary
sswiki
sswiktionary
stewardwiki
stqwiki
strategywiki
stwiki
stwiktionary
suwiki
suwikibooks
suwikiquote
suwiktionary
svwikibooks
svwikinews
svwikiquote
svwikisource
svwikiversity
svwikivoyage
svwiktionary
swwiki
swwikibooks
swwiktionary
szlwiki
tawiki
tawikibooks
tawikinews
tawikiquote
tawikisource
tawiktionary
tcywiki
techconductwiki
tenwiki
test2wiki
testwiki
testwikidatawiki
tetwiki
tewiki
tewikibooks
tewikiquote
tewikisource
tewiktionary
tgwiki
tgwikibooks
tgwiktionary
thwikibooks
thwikinews
thwikiquote
thwikisource
thwiktionary
tiwiki
tiwiktionary
tkwiki
tkwikibooks
tkwikiquote
tkwiktionary
tlwiki
tlwikibooks
tlwiktionary
tnwiki
tnwiktionary
towiki
towiktionary
tpiwiki
tpiwiktionary
transitionteamwiki
trwikibooks
trwikimedia
trwikinews
trwikiquote
trwikisource
trwiktionary
tswiki
tswiktionary
ttwiki
ttwikibooks
ttwikiquote
ttwiktionary
tumwiki
twwiki
twwiktionary
tyvwiki
tywiki
uawikimedia
udmwiki
ugwiki
ugwikibooks
ugwikiquote
ugwiktionary
ukwikibooks
ukwikinews
ukwikiquote
ukwikisource
ukwikivoyage
ukwiktionary
urwiki
urwikibooks
urwikiquote
urwiktionary
usabilitywiki
uzwiki
uzwikibooks
uzwikiquote
uzwiktionary
vecwiki
vecwikisource
vecwiktionary
vepwiki
vewiki
viwikibooks
viwikiquote
viwikisource
viwikivoyage
viwiktionary
vlswiki
votewiki
vowiki
vowikibooks
vowikiquote
vowiktionary
warwiki
wawiki
wawikibooks
wawiktionary
wbwikimedia
wg_enwiki
wikimania2005wiki
wikimania2006wiki
wikimania2007wiki
wikimania2008wiki
wikimania2009wiki
wikimania2010wiki
wikimania2011wiki
wikimania2012wiki
wikimania2013wiki
wikimania2014wiki
wikimania2015wiki
wikimania2016wiki
wikimania2017wiki
wikimania2018wiki
wikimaniawiki
wikimaniateamwiki
wowiki
wowikiquote
wowiktionary
wuuwiki
xalwiki
xhwiki
xhwikibooks
xhwiktionary
xmfwiki
yiwiki
yiwikisource
yiwiktionary
yowiki
yowikibooks
yowiktionary
yuewiktionary
zawiki
zawikibooks
zawikiquote
zawiktionary
zeawiki
zerowiki
zh_classicalwiki
zh_min_nanwiki
zh_min_nanwikibooks
zh_min_nanwikiquote
zh_min_nanwikisource
zh_min_nanwiktionary
zh_yuewiki
zhwikibooks
zhwikinews
zhwikiquote
zhwikisource
zhwikiversity
zhwikivoyage
zhwiktionary
zuwiki
zuwikibooks
zuwiktionary
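
The number of affected wikis can be checked by counting the entries in the dblist rather than estimating it. The following is a minimal sketch, assuming one database name per line; the function name `count_wikis` and the skipping of `#` comment lines are assumptions for illustration, not part of any existing tooling.

```python
def count_wikis(dblist_text: str) -> int:
    """Count database names, skipping blank lines and '#' comment lines."""
    names = [
        line.strip()
        for line in dblist_text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]
    return len(names)


# Tiny hypothetical sample; in practice the text would come from s3.dblist.
sample = "aawiki\naawikibooks\n\n# comment\naawiktionary\n"
print(count_wikis(sample))  # 3
```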

Date: Thursday 17th January
Time: 07:00 AM UTC - 07:30 AM UTC (we do not expect to use the full 30-minute window)

Impact: All those wikis will go read-only. No edits will be allowed. Reads will not be impacted.

Details

Related Gerrit Patches:
[operations/dns@master] wmnet: Update s3 alias
[operations/mediawiki-config@master] db-eqiad.php: Promote db1078 to master
[operations/mediawiki-config@master] db-eqiad.php: Set s3 to read only
[operations/puppet@production] mariadb: Promote db1078 to s3 master
[operations/mediawiki-config@master] mariadb: Depool db1123 for maintenance
[operations/mediawiki-config@master] mariadb: Depool db1077 for maintenance

Event Timeline

Marostegui triaged this task as High priority. Jan 15 2019, 7:58 PM
Marostegui created this task.
Restricted Application added a subscriber: Aklapper. Jan 15 2019, 7:58 PM
Marostegui moved this task from Triage to In progress on the DBA board. Jan 15 2019, 7:59 PM
Marostegui updated the task description.

I would like to aim for Thursday at 7AM UTC

@jcrespo I have created the usual failover checklist

@Anomie I will let you know if we need to pause your migration script - as the lag in codfw would make the failover harder. Once we have agreed on a date/time we will talk to you!

Anomie added a comment. Edited Jan 15 2019, 9:40 PM

Thanks for letting me know about the failover. It will probably kill the script anyway when the old master goes away, or at least whichever s3 wiki it happens to be processing at the time.

I won't be around at 7am UTC as that's 2am for me. If nothing else, it's running on anomie@mwmaint1002 in a screen named T188327-actor-migration-s3. If you kill the process but leave the screen running then it should work to just hit "resurrect" afterwards, or otherwise I can restart it when I get in (probably around 14:00 UTC).

Don't worry: as soon as we agree on a date/time, I will stop it, so that we are sure there is no lag before the failover.
I will leave the screen running and just kill the process, so you can resume it once we are out of the woods.

Thank you for your understanding

Change 484612 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Promote db1078 to s3 master

https://gerrit.wikimedia.org/r/484612

Change 484613 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Put s3 on read only

https://gerrit.wikimedia.org/r/484613

Change 484614 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Promote db1078 to master

https://gerrit.wikimedia.org/r/484614

Stashbot added a subscriber: Stashbot.

Mentioned in SAL (#wikimedia-operations) [2019-01-16T09:29:34Z] <marostegui> Stop s3 actor-migration script in order to allow s3 to catch up and to avoid lag during the failover - T188327 T213858

@Anomie I have stopped the script as we are most likely going to go ahead with the failover in EU morning (still waiting for the managers to confirm)

Change 484620 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool db1077 for maintenance

https://gerrit.wikimedia.org/r/484620

Change 484620 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Depool db1077 for maintenance

https://gerrit.wikimedia.org/r/484620

Change 484642 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool db1123 for maintenance

https://gerrit.wikimedia.org/r/484642

Just to confirm:
Date: Thursday 17th January
Time: 07:00 AM UTC - 07:30 AM UTC (we do not expect to use the full 30-minute window)

Impact: All those wikis will go read-only. No edits will be allowed. Reads will not be impacted.

Marostegui updated the task description. Jan 16 2019, 12:54 PM

Change 484642 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Depool db1123 for maintenance

https://gerrit.wikimedia.org/r/484642

switchover script works as expected (tested on db1111/db1112):

./switchover.py --skip-slave-move db1111 db1112
Starting preflight checks...
* Original read only values are as expected (master: read_only=0, slave: read_only=1)
* The host to fail over is a direct replica of the master
* Replication is up and running between the 2 hosts
* The replication lag is acceptable: 0 (lower than the configured or default timeout)
* The master is not a replica of any other host
----- OUTPUT of '/bin/ps --no-hea...pid,args -C perl' -----                              
 3301 /usr/bin/perl /usr/local/bin/pt-heartbeat-wikimedia --defaults-file=/dev/null --user=root --host=localhost -D heartbeat --shard=test-s4 --datacenter=eqiad --update --replace --interval=1 --set-vars=binlog_format=STATEMENT -S /run/mysqld/mysqld.sock --daemonize --pid /var/run/pt-heartbeat.pid
================                                                                         
PASS:  |████████████████████████████████████| 100% (1/1) [00:00<00:00,  2.96hosts/s]     
FAIL:  |                                            |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/ps --no-hea...pid,args -C perl'.                                                                              
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.                                                                                     
Stopping heartbeat pid 3301 at db1111.eqiad.wmnet:3306/(none)
----- OUTPUT of '/bin/kill 3301' -----                                                   
================                                                                         
PASS:  |████████████████████████████████████| 100% (1/1) [00:00<00:00,  3.14hosts/s]     
FAIL:  |                                            |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/kill 3301'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.                                                                                     
Setting up original master as read-only
Slave caught up to the master after waiting 0.008825302124023438 seconds
Servers sync at master: db1111-bin.000285:24750836 slave: db1112-bin.000258:561179
Stopping original master->slave replication
Setting up replica as read-write
All commands where successful, current status: original master read_only: 1 / original slave read_only: 0
Trying to invert replication direction
Starting heartbeat section test-s4 at db1112.eqiad.wmnet
----- OUTPUT of '/usr/bin/nohup /...d &> /dev/null &' -----                              
================                                                                         
PASS:  |████████████████████████████████████| 100% (1/1) [00:00<00:00,  3.10hosts/s]     
FAIL:  |                                            |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/usr/bin/nohup /...d &> /dev/null &'.                                                                              
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.                                                                                     
----- OUTPUT of '/bin/ps --no-hea...pid,args -C perl' -----                              
 3613 /usr/bin/perl /usr/local/bin/pt-heartbeat-wikimedia --defaults-file=/dev/null --user=root --host=localhost -D heartbeat --shard=test-s4 --datacenter=eqiad --update --replace --interval=1 --set-vars=binlog_format=STATEMENT -S /run/mysqld/mysqld.sock --daemonize --pid /var/run/pt-heartbeat.pid
================                                                                         
PASS:  |████████████████████████████████████| 100% (1/1) [00:00<00:00,  3.13hosts/s]     
FAIL:  |                                            |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/ps --no-hea...pid,args -C perl'.                                                                              
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.                                                                                     
Detected heartbeat at db1112.eqiad.wmnet running with PID 3613
Verifying everything went as expected...
SUCCESS: Master switch completed successfully
root@cumin1001:~/wmfmariadbpy/wmfmariadbpy$ ./switchover.py --skip-slave-move db1112 db1111
Starting preflight checks...
* Original read only values are as expected (master: read_only=0, slave: read_only=1)
* The host to fail over is a direct replica of the master
* Replication is up and running between the 2 hosts
* The replication lag is acceptable: 0 (lower than the configured or default timeout)
* The master is not a replica of any other host
----- OUTPUT of '/bin/ps --no-hea...pid,args -C perl' -----                              
 3613 /usr/bin/perl /usr/local/bin/pt-heartbeat-wikimedia --defaults-file=/dev/null --user=root --host=localhost -D heartbeat --shard=test-s4 --datacenter=eqiad --update --replace --interval=1 --set-vars=binlog_format=STATEMENT -S /run/mysqld/mysqld.sock --daemonize --pid /var/run/pt-heartbeat.pid
================                                                                         
PASS:  |████████████████████████████████████| 100% (1/1) [00:00<00:00,  3.14hosts/s]     
FAIL:  |                                            |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/ps --no-hea...pid,args -C perl'.                                                                              
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.                                                                                     
Stopping heartbeat pid 3613 at db1112.eqiad.wmnet:3306/(none)
----- OUTPUT of '/bin/kill 3613' -----                                                   
================                                                                         
PASS:  |████████████████████████████████████| 100% (1/1) [00:00<00:00,  3.97hosts/s]     
FAIL:  |                                            |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/kill 3613'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.                                                                                     
Setting up original master as read-only
Slave caught up to the master after waiting 0.008609294891357422 seconds
Servers sync at master: db1112-bin.000258:566093 slave: db1111-bin.000285:24755750
Stopping original master->slave replication
Setting up replica as read-write
All commands where successful, current status: original master read_only: 1 / original slave read_only: 0
Trying to invert replication direction
Starting heartbeat section test-s4 at db1111.eqiad.wmnet
----- OUTPUT of '/usr/bin/nohup /...d &> /dev/null &' -----                              
================                                                                         
PASS:  |████████████████████████████████████| 100% (1/1) [00:00<00:00,  3.05hosts/s]     
FAIL:  |                                            |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/usr/bin/nohup /...d &> /dev/null &'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
----- OUTPUT of '/bin/ps --no-hea...pid,args -C perl' -----                              
32398 /usr/bin/perl /usr/local/bin/pt-heartbeat-wikimedia --defaults-file=/dev/null --user=root --host=localhost -D heartbeat --shard=test-s4 --datacenter=eqiad --update --replace --interval=1 --set-vars=binlog_format=STATEMENT -S /run/mysqld/mysqld.sock --daemonize --pid /var/run/pt-heartbeat.pid
================                                                                         
PASS:  |████████████████████████████████████| 100% (1/1) [00:00<00:00,  3.12hosts/s]     
FAIL:  |                                            |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/ps --no-hea...pid,args -C perl'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Detected heartbeat at db1111.eqiad.wmnet running with PID 32398
Verifying everything went as expected...
SUCCESS: Master switch completed successfully
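
The order of operations visible in the output above can be sketched as a small simulation. This is not the code of switchover.py: the `Host` class, the `max_lag` parameter, and both function names are hypothetical, and the hosts are plain in-memory objects rather than real database connections. It only illustrates the sequence of checks and state changes that the log reports.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Host:
    """Hypothetical in-memory stand-in for a database host."""
    name: str
    read_only: bool
    replicates_from: Optional[str] = None  # name of the upstream host, if any
    replication_running: bool = False
    lag_seconds: float = 0.0
    heartbeat_running: bool = False


def preflight(master: Host, replica: Host, max_lag: float = 1.0) -> None:
    """Mirror the checks listed under 'Starting preflight checks...'."""
    if master.read_only or not replica.read_only:
        raise RuntimeError("unexpected read_only values")
    if replica.replicates_from != master.name:
        raise RuntimeError("host is not a direct replica of the master")
    if not replica.replication_running:
        raise RuntimeError("replication is not running")
    if replica.lag_seconds > max_lag:
        raise RuntimeError("replication lag is too high")
    if master.replicates_from is not None:
        raise RuntimeError("master is itself a replica")


def switchover(master: Host, replica: Host) -> None:
    """Apply the state changes in the order the log reports them."""
    preflight(master, replica)
    master.heartbeat_running = False       # stop heartbeat on the old master
    master.read_only = True                # old master stops accepting writes
    replica.lag_seconds = 0.0              # stands in for waiting to catch up
    replica.replication_running = False    # stop master -> replica replication
    replica.replicates_from = None
    replica.read_only = False              # new master accepts writes
    master.replicates_from = replica.name  # invert the replication direction
    master.replication_running = True
    replica.heartbeat_running = True       # start heartbeat on the new master


old = Host("db1111", read_only=False, heartbeat_running=True)
new = Host("db1112", read_only=True, replicates_from="db1111",
           replication_running=True)
switchover(old, new)
print(old.read_only, new.read_only)  # True False
```

Setting the old master read-only before waiting for the replica to catch up is what keeps the two hosts at the same position when the roles are swapped, which is why the log reports matching binlog coordinates before replication is stopped.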

Awesome news!
We have to include it in the list of steps on our etherpad. I wrote that list late yesterday evening, so it probably contains errors and needs to be reviewed by you!

Mentioned in SAL (#wikimedia-operations) [2019-01-17T06:10:51Z] <marostegui> Downtime s3 hosts for 2 hours - T213858

Mentioned in SAL (#wikimedia-operations) [2019-01-17T06:14:07Z] <marostegui> Disable gtid on s3 hosts - T213858

Mentioned in SAL (#wikimedia-operations) [2019-01-17T06:19:58Z] <marostegui> Change s3 topology to get ready for s3 failover - T213858

Mentioned in SAL (#wikimedia-operations) [2019-01-17T06:26:52Z] <marostegui> Enable GTID back on all hosts but db1075 db1078 - T213858

Mentioned in SAL (#wikimedia-operations) [2019-01-17T06:30:45Z] <marostegui> Disable puppet on db1075 and db1078 - T213858

Change 484612 merged by Marostegui:
[operations/puppet@production] mariadb: Promote db1078 to s3 master

https://gerrit.wikimedia.org/r/484612

Change 484613 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Set s3 to read only

https://gerrit.wikimedia.org/r/484613

Change 484614 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Promote db1078 to master

https://gerrit.wikimedia.org/r/484614

Mentioned in SAL (#wikimedia-operations) [2019-01-17T07:00:14Z] <marostegui> Start s3 failover T213858

Mentioned in SAL (#wikimedia-operations) [2019-01-17T07:01:01Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Set s3 on read-only T213858 (duration: 00m 31s)

Mentioned in SAL (#wikimedia-operations) [2019-01-17T07:03:09Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Switchover s3master eqiad from db1075 to db1078 T213858 (duration: 00m 30s)

Mentioned in SAL (#wikimedia-operations) [2019-01-17T07:04:20Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Remove s3 ready only T213858 (duration: 00m 30s)
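
The gaps between the three config syncs can be read directly from the SAL timestamps. The sketch below parses the bracketed timestamp, author, and message from each entry and prints the seconds between consecutive syncs. The regular expression and the `parse` helper are assumptions made for this example, not an existing SAL tool; the entry text is copied from the log lines above.

```python
import re
from datetime import datetime

# Matches "[<ISO timestamp>Z] <author> message" as it appears in the SAL mentions.
SAL_PATTERN = re.compile(
    r"\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z\] <([^>]+)> (.*)"
)

entries = [
    "[2019-01-17T07:01:01Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Set s3 on read-only T213858 (duration: 00m 31s)",
    "[2019-01-17T07:03:09Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Switchover s3master eqiad from db1075 to db1078 T213858 (duration: 00m 30s)",
    "[2019-01-17T07:04:20Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Remove s3 ready only T213858 (duration: 00m 30s)",
]


def parse(entry: str):
    """Return (timestamp, author, message) for one SAL entry."""
    match = SAL_PATTERN.search(entry)
    if match is None:
        raise ValueError(f"not a SAL entry: {entry!r}")
    stamp, author, message = match.groups()
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S"), author, message


times = [parse(entry)[0] for entry in entries]
gaps = [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]
print(gaps)  # [128, 71]
```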

Mentioned in SAL (#wikimedia-operations) [2019-01-17T07:18:11Z] <marostegui> Enable GTID on db1075 - T213858

Change 484860 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Update s3 alias

https://gerrit.wikimedia.org/r/484860

Mentioned in SAL (#wikimedia-operations) [2019-01-17T07:20:39Z] <marostegui> Change thread_pool_stall_limit on db1075 and db1078 - T213858

Change 484860 merged by Marostegui:
[operations/dns@master] wmnet: Update s3 alias

https://gerrit.wikimedia.org/r/484860

Marostegui closed this task as Resolved. Jan 17 2019, 7:32 AM
Marostegui claimed this task.

This was done:
Read only ON at: 07:01:00
Read only OFF at: 07:04:20

Total read-only time: 3 minutes 20 seconds

If you see something strange, please let us know.

@Anomie you can restart s3 migration script

Mentioned in SAL (#wikimedia-operations) [2019-02-28T17:23:38Z] <jynus> recreating replicas, master ops events for db1078, db1075 T213858