Page MenuHomePhabricator

High replication activity filled up labsdb1004 with binlogs
Closed, ResolvedPublic

Description

Today labsdb1004 paged for FS full, after investigation turns out there were many binlogs created and the FS was extended.

@Volans looked into the binlogs and it looks like most activity came from s51370__isbntool.nrmentry. Judging by network activity starting 2016-11-10 around 3.30 UTC.

Event Timeline

Just to clarify, this is because of row-based replication, for each 100MB relay log there was ~93k updated rows for the table s51370__isbntool.nrmentry.

I was unable to find any related slow query on tendril on labsdb1005, but the same thread from the relay log is still connected to labsdb1005 and performing queries like:

update nrmentry set ant = 14 where nid = 1 and wki = 164

For which the count is 109151 rows.

Adding to the task the members of the isbn tool group listed on http://tools.wmflabs.org/contact/ :
Could you please slow down the updates that you're performing either reducing the number of rows updated for each query of the type listed above or increasing the time between one query and the next?
Thanks for the collaboration.

To the isbn tool group:
Looks like you have some issue in the queries you're performing because I can see that you're updating always the same ~100k rows with a different value of ant.
I doubt this is intended, I've seen those queries that were running shortly one after the other:

update nrmentry set ant = 4 where nid = 1 and wki = 164
update nrmentry set ant = 5 where nid = 1 and wki = 164
update nrmentry set ant = 21 where nid = 1 and wki = 164

those are updating each time ~1/3rd of the whole table, the same ~100k rows.

Please fix the error ASAP.

I already added all 7 of them listed there in a previous post ;)

I have already some banned tools from replication (meaning they are not guaranteed redundancy) due to high import/updater rates.

If this happens again, we can filter out on the slave 's51370\_\_isbntool.%' here: https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/templates/mariadb/tools.my.cnf.erb;2b87787472fde6f41a4561cbf45423b49bb600ce$62

I have stopped the jobs, that cause this and will not run them again before investigating the cause.

Ireas has not been active for some time now.

I am the one that responded (partly) to the page by increasing the space on said filesystem. That was done via LVM resizing and XFS grow filesystem mechanism. I did notice the following that should probably be looked at. It's completely irrelevant to the above about the Tool, all of which still stands as is. Mostly consists of some housecleaning I think we should do.

akosiaris@labsdb1004:/srv/labsdb$ ls -lrt
total 592
drwx------   2 mysql mysql   4096 Sep 30  2015 s51071__templatetiger_p
drwx------   2 root  root  196608 Oct 21  2015 export-20151021-132530
drwxr-xr-x 203 mysql mysql  57344 Oct 21  2015 data.bak
drwxr-xr-x 292 mysql mysql 114688 Nov 14 09:15 data
drwxr-xr-x   2 mysql mysql 147456 Nov 14 09:24 binlogs
drwxr-xr-x   2 mysql mysql   4096 Nov 14 09:29 tmp
akosiaris@labsdb1004:/srv/labsdb$ sudo du -hsc *
628G	binlogs
673G	data
459G	data.bak
239G	export-20151021-132530
48G	s51071__templatetiger_p
2.4G	tmp
2.0T	total

3 directories with quite indicative of temporary nature names > 1 year old occupying >700GB of disk. Probably worth deleting.

Mentioned in SAL (#wikimedia-operations) [2016-11-14T23:07:33Z] <godog> silence disk space alerts on labsdb1004 for 4h while investigating reoccurence - T150553

Mentioned in SAL (#wikimedia-operations) [2016-11-14T23:18:09Z] <godog> add 150G to labsdb1004:/srv/labsdb to get it out of warning threshold T150553

I talked to Jaime yesterday and we agreed that data from the /srv/labsdb can be deleted.
So I have deleted the directory data.bak which was 459G and it was all from Oct 2015.

So we are in a healthy status again.

root@labsdb1004:/srv/labsdb# df -hT /srv/labsdb/
Filesystem                        Type  Size  Used Avail Use% Mounted on
/dev/mapper/labsdb1004--vg-labsdb xfs   2.3T  1.6T  765G  68% /srv/labsdb

There is still around 280G that can be delete from old stuff there.

Change 321618 had a related patch set uploaded (by Jcrespo):
toolsdb: Ignore events from s51071templatetiger_p, s52721pagecount_stats_p

https://gerrit.wikimedia.org/r/321618

Change 321618 merged by Jcrespo:
toolsdb: Ignore events from s51071templatetiger_p, s52721pagecount_stats_p

https://gerrit.wikimedia.org/r/321618

This i the state now:

Size  Used Avail Use% Mounted on
2.3T  410G  1.9T  18% /srv/labsdb

But binary logs grows at 101M/minute. Statement based replication is not an option for user-written databases, as they will fail to replicate soon after it is enabled. I will reset the replication on the master to check there is not transaction going back and forth.

jcrespo changed the task status from Open to Stalled.Nov 16 2016, 10:23 AM
jcrespo triaged this task as Low priority.
jcrespo moved this task from In progress to Backlog on the DBA board.