
Rebuild s6 and s7 on labsdb1002
Closed, Resolved, Public

Description

In order to make replication work again, I skipped one replication event on s6 and thousands on s7. Those databases need to be reinitialized.

This was the error:

Could not execute Update_rows_v1 event on table eswiki.user_daily_contribs; Can't find record in 'user_daily_contribs', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log s7-bin.000846, end_log_pos 214616413
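
For reference, skipping a single event on a MariaDB multi-source slave usually looks like the sketch below (not necessarily the exact statements run here; the 's7' connection name is taken from the error above, and sql_slave_skip_counter acts on the session's default_master_connection):

STOP SLAVE 's7';
SET default_master_connection = 's7';   -- the skip counter applies to this connection
SET GLOBAL sql_slave_skip_counter = 1;  -- skip exactly one event
START SLAVE 's7';
SHOW SLAVE 's7' STATUS;                 -- confirm Slave_SQL_Running: Yes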

Event Timeline

jcrespo raised the priority of this task from to Needs Triage.
jcrespo updated the task description. (Show Details)
jcrespo added projects: Cloud-Services, DBA.
jcrespo added a subscriber: jcrespo.
jcrespo added a subscriber: Springle.

@Springle did you fix this today at 8 UTC? I am interested in why it failed in the first place.

I did indeed do a similar fix to get s6 going, and then sinned slightly by setting slave_exec_mode=idempotent to keep it alive for the weekend.
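
For reference, that stop-gap is a single global switch (a sketch only; it makes the SQL thread tolerate missing-row and duplicate-key errors from row-based events, which is why it merely hides the underlying inconsistency):

SET GLOBAL slave_exec_mode = 'IDEMPOTENT';  -- tolerate HA_ERR_KEY_NOT_FOUND / duplicate-key errors
-- once the data has been rebuilt, back to the safe default:
SET GLOBAL slave_exec_mode = 'STRICT';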

The mysqld process restarted on Jun 5th, but I don't know what triggered that. I noticed a lot of "XFS: possible memory allocation deadlock in kmem_alloc" noise in dmesg from around that time, and in syslog lots of:

Jun  5 16:40:08 labsdb1002 kernel: [15865487.149823] INFO: task mysqld:10266 blocked for more than 120 seconds.
Jun  5 16:40:08 labsdb1002 kernel: [15865487.157416]       Tainted: G          I   3.13.0-40-generic #69-Ubuntu
Jun  5 16:40:08 labsdb1002 kernel: [15865487.164998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun  5 16:40:08 labsdb1002 kernel: [15865487.174038] mysqld          D ffff88180fcd4480     0 10266   8259 0x00000000
Jun  5 16:40:08 labsdb1002 kernel: [15865487.174044]  ffff8819c4697d10 0000000000000082 ffff88010fdf1800 ffff8819c4697fd8
Jun  5 16:40:08 labsdb1002 kernel: [15865487.174048]  0000000000014480 0000000000014480 ffff88010fdf1800 ffff881876180240
Jun  5 16:40:08 labsdb1002 kernel: [15865487.174051]  ffff881876180244 ffff88010fdf1800 00000000ffffffff ffff881876180248
Jun  5 16:40:08 labsdb1002 kernel: [15865487.174054] Call Trace:
Jun  5 16:40:08 labsdb1002 kernel: [15865487.174068]  [<ffffffff81723b69>] schedule_preempt_disabled+0x29/0x70
Jun  5 16:40:08 labsdb1002 kernel: [15865487.174071]  [<ffffffff817259d5>] __mutex_lock_slowpath+0x135/0x1b0
Jun  5 16:40:08 labsdb1002 kernel: [15865487.174077]  [<ffffffff81178a2a>] ? do_numa_page+0x14a/0x1b0
Jun  5 16:40:08 labsdb1002 kernel: [15865487.174080]  [<ffffffff81725a6f>] mutex_lock+0x1f/0x2f
Jun  5 16:40:08 labsdb1002 kernel: [15865487.174132]  [<ffffffffa02b7f3c>] xfs_file_buffered_aio_write+0x5c/0x1a0 [xfs]
Jun  5 16:40:08 labsdb1002 kernel: [15865487.174150]  [<ffffffffa02b8140>] xfs_file_aio_write+0xc0/0x120 [xfs]
Jun  5 16:40:08 labsdb1002 kernel: [15865487.174158]  [<ffffffff811bd21a>] do_sync_write+0x5a/0x90
Jun  5 16:40:08 labsdb1002 kernel: [15865487.174160]  [<ffffffff811bd9a4>] vfs_write+0xb4/0x1f0
Jun  5 16:40:08 labsdb1002 kernel: [15865487.174163]  [<ffffffff811be552>] SyS_pwrite64+0x72/0xb0
Jun  5 16:40:08 labsdb1002 kernel: [15865487.174166]  [<ffffffff8172fd2d>] system_call_fastpath+0x1a/0x1f

Memory and blocking errors related to XFS, plus the fact that these boxes do not use the production RAID+BBU setup, plus no InnoDB replication tables... all of that might mean the datadir is now untrustworthy.

Haven't done any further digging yet.

jcrespo raised the priority of this task from Low to Medium.Jun 7 2015, 12:59 PM

Yes, I saw MySQL had restarted, but with a "normal shutdown" and the machine uptime was not affected, so I assumed the labs people were doing some regular maintenance and just hadn't restarted the slave (they weren't available at the time).

I would take that box out of labs production on Monday and perform a filesystem check; we may have cursed it too much by now. Hopefully it will be a software-only problem.

jcrespo moved this task from Triage to In progress on the DBA board.

This looks similar to https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1382333, but we are running the exact kernel that supposedly patched this (I checked the changelog), yet we are still getting these errors after the reboot:

Mar 21 14:12:41 labsdb1002 kernel: [9289832.236151] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)

I recommend updating the kernel, rebooting and performing an xfs check, then reevaluating.

In https://github.com/wikimedia/operations-puppet/blob/production/modules/toollabs/templates/hosts.erb#L240 we should probably change these aliases to point to labsdb1003? We could do this earlier (a day before?) to ensure tools start using it before the actual downtime starts.

If possible, we should also update the MariaDB version due to T100408.

jcrespo added a parent task: Restricted Task.Jun 8 2015, 1:48 PM
jcrespo removed a parent task: Restricted Task.Jun 8 2015, 4:58 PM

Change 216971 had a related patch set uploaded (by Yuvipanda):
tools: Move c2 *.labsdb aliases to c1 and some to c3

https://gerrit.wikimedia.org/r/216971

Change 216971 merged by Yuvipanda:
tools: Move c2 *.labsdb aliases to c1 and some to c3

https://gerrit.wikimedia.org/r/216971

Done:

Updated the kernel and the mariadb package
restarted
unmounted the partition and ran xfs_check on it
reviewed the logs with no findings (good news?)

STOP SLAVE and RESET SLAVE ALL for s6 and s7, to avoid accidentally starting replication before the imports have finished (see the sketch below)
s6 exported to /srv/tmp/s6.sql.gz
s7 exported to /srv/tmp/s7.sql.gz
s6 import
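
In MariaDB multi-source syntax the stop-and-reset step maps roughly to the following (a sketch; the exact statements may have differed, and the connections are recreated later with CHANGE MASTER TO before starting them again):

STOP SLAVE 's6';
STOP SLAVE 's7';
RESET SLAVE 's6' ALL;  -- forget the connection so it cannot be started by accident
RESET SLAVE 's7' ALL;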

STOP ALL SLAVES;
SET GLOBAL slave_exec_mode=STRICT;
START ALL SLAVES;

s6:
Drop triggers (they were included in the import by mistake; see the sketch below)
CHANGE MASTER TO MASTER_LOG_FILE='s6-bin.000411', MASTER_LOG_POS=1062757892;
START SLAVE 's6';
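
Listing and removing the stray triggers can be done through information_schema (a sketch; the schema filter and trigger name below are placeholders, not the actual objects involved):

-- find triggers that came along with the dump
SELECT TRIGGER_SCHEMA, TRIGGER_NAME
  FROM information_schema.TRIGGERS
 WHERE TRIGGER_SCHEMA = 'frwiki';            -- placeholder schema
-- then drop each one individually
DROP TRIGGER IF EXISTS frwiki.some_trigger;  -- placeholder name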

s7:
Import
CHANGE MASTER TO MASTER_LOG_FILE='s7-bin.000855', MASTER_LOG_POS=899016129;
START SLAVE 's7';

wait until replication lag goes away
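
Per-connection lag can be checked like this (a sketch):

SHOW SLAVE 's6' STATUS;   -- Seconds_Behind_Master should trend towards 0
SHOW SLAVE 's7' STATUS;
SHOW ALL SLAVES STATUS;   -- or all connections at once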

There could be some issues left with the %auth% databases, but those can be solved without downtime.