Labs: High Replication Lag (s6 on c1, s1 and s5 on c2)
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	Boshomi
	Jan 8 2016, 6:48 PM

Description

Currently we have a high replication lag at wmflabs.

valhallasw@tools-bastion-01:~$ for db in c1 c2 c3; do echo Server: $db; echo "select * from heartbeat_p.heartbeat order by shard" | mysql -h $db.labsdb; done
Server: c1
shard   last_updated    lag
s1      2016-01-08T20:45:54.000760      0
s2      2016-01-08T20:45:54.001090      0
s3      2016-01-08T20:45:54.001120      0
s4      2016-01-08T20:45:54.001050      0
s5      2016-01-08T20:45:54.000580      0
s6      2016-01-08T19:26:26.001130      4767
s7      2016-01-08T20:45:54.001120      0
Server: c2
shard   last_updated    lag
s1      2016-01-07T23:43:19.500950      75754
s2      2016-01-08T20:45:54.001090      0
s3      2016-01-08T20:45:54.001120      0
s4      2016-01-08T20:45:54.001050      0
s5      2016-01-08T15:54:56.000540      17457
s6      2016-01-08T20:45:54.001180      0
s7      2016-01-08T20:45:54.001120      0
Server: c3
shard   last_updated    lag
s1      2016-01-08T20:45:54.000760      0
s2      2016-01-08T20:45:54.001090      0
s3      2016-01-08T20:45:54.001120      0
s4      2016-01-08T20:45:54.001050      0
s5      2016-01-08T20:45:54.000580      0
s6      2016-01-08T20:45:54.001180      0
s7      2016-01-08T20:45:54.001120      0

Event Timeline

Boshomi created this task.Jan 8 2016, 6:48 PM

Boshomi raised the priority of this task from to Needs Triage.

Boshomi updated the task description. (Show Details)

Boshomi subscribed.

Restricted Application added subscribers: StudiesWorld, Aklapper. · View Herald TranscriptJan 8 2016, 6:48 PM

Krenair added projects: Cloud-Services, DBA.Jan 8 2016, 7:12 PM

Krenair subscribed.

@Boshomi: How to see that problem / check that it still exists? (Also see https://mediawiki.org/wiki/How_to_report_a_bug )

valhallasw renamed this task from Labs: High Replication Lag to Labs: High Replication Lag (s6 on c1, s1 and s5 on c2).Jan 8 2016, 8:46 PM

valhallasw updated the task description. (Show Details)

valhallasw set Security to None.

@Aklapper Yes the problem still exists. supercount shows a delay of 21954 seconds, but supercount is not the case, there are other tools for editing like https://tools.wmflabs.org/giftbot/weblinksuche.fcgi which should work with very very small delay.

Luke081515 added subscribers: Luke081515, jcrespo.Jan 8 2016, 10:18 PM

This is a known issue, 2 databases have crashed in the last 2 days. I have yet to investigate the core cause but I am taking care of fixing the replication issues first.

Please note that the fact that there are some issues with some replication channels shouldn't impact directly labs users, as not all are exposed as the default hosts (e.g. you should not use s1 on c2). See:
https://tools.wmflabs.org/replag/
For the lag of the ones exposed directly.

jcrespo claimed this task.Jan 8 2016, 11:15 PM

jcrespo triaged this task as High priority.

jcrespo moved this task from Triage to In progress on the DBA board.

To give an advance of the work done:

Things are now getting corrected, and all replication channels are active again, but it will take some hours to reduce the replication lag back to 0. However, until the root cause is eliminated and due to some corruption in users's tables using non-transactional engines, we may have some hiccups in the following days.

A first evaluation points to some large imports going on a few hours before the crashes, but I cannot say for sure that is the real cause. I will continue investigating and reporting back.

This was fixed days ago, will close the ticket, but continue monitoring labs to understand the root cause.

$ for db in labsdb1001 labsdb1002 labsdb1003; do echo Server: $db; echo "select * from heartbeat_p.heartbeat order by shard" | mysql -h $db; done
Server: labsdb1001
shard	last_updated	lag
s1	2016-01-18T13:32:12.000640	0
s2	2016-01-18T13:32:12.001130	0
s3	2016-01-18T13:32:12.000630	0
s4	2016-01-18T13:32:12.001250	0
s5	2016-01-18T13:32:12.000450	0
s6	2016-01-18T13:32:12.001060	0
s7	2016-01-18T13:32:12.001100	0
Server: labsdb1002
shard	last_updated	lag
s1	2016-01-18T13:32:12.000640	0
s2	2016-01-18T13:32:12.001130	0
s3	2016-01-18T13:32:12.000630	0
s4	2016-01-18T13:32:12.001250	0
s5	2016-01-18T13:32:12.000450	0
s6	2016-01-18T13:32:12.001060	0
s7	2016-01-18T13:32:12.001100	0
Server: labsdb1003
shard	last_updated	lag
s1	2016-01-18T13:32:12.000640	0
s2	2016-01-18T13:32:12.001130	0
s3	2016-01-18T13:32:12.000630	0
s4	2016-01-18T13:32:12.001250	0
s5	2016-01-18T13:32:12.000450	0
s6	2016-01-18T13:32:12.001060	0
s7	2016-01-18T13:32:12.001100	0

https://tools.wmflabs.org/replag/ shows high replag again:

Shard Lag (seconds) Lag (time)
s1 0 00:00:00
s2 0 00:00:00
s3 122 00:02:02
s4 0 00:00:00
s5 17701 04:55:01
s6 0 00:00:00
s7 0 00:00:00
Wiki Shard Lag (seconds) Lag (time)
dewiki s5 17701 04:55:01
wikidatawiki s5 17701 04:55:01
aawiki s3 122 00:02:02
aawikibooks s3 122 00:02:02
aawiktionary s3 122 00:02:02
...

This is intended- see https://phabricator.wikimedia.org/T124172#1948046

Labs: High Replication Lag (s6 on c1, s1 and s5 on c2)Closed, ResolvedPublicActions

Description

Event Timeline

Labs: High Replication Lag (s6 on c1, s1 and s5 on c2)
Closed, ResolvedPublic
Actions