
labsdb1004's toolsdb mariadb is lagging behind labsdb1005
Closed, Resolved, Public

Description

Replication on labsdb1004 seems to be stuck executing log.212147:29750512. There is no user impact, as this host is passive, but it reduces our redundancy.

After the first day, I tried stopping and resetting replication, but there appears to be a very large open transaction (I checked, and there is ongoing InnoDB activity by the replication thread) that is taking a long time to execute in ROW (?). I have left it running for a while, hoping it will finish on its own in a couple of days. If it does not, we will have to identify the cause and either ignore it on replication or rebuild the replica. We can wait a bit, but as data grows, there is a limit to how much we can store and how far behind the replica can fall.
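As a rough illustration of how "stuck" can be told apart from "slow", the sketch below compares two samples of the replica status taken some time apart. It assumes each sample is a dict built from the output of SHOW SLAVE STATUS (using its Relay_Master_Log_File, Exec_Master_Log_Pos and Slave_SQL_Running columns); the function names are hypothetical, and this is not the check that was actually run on labsdb1004.

```python
def replication_position(status):
    """Return the (binlog file, position) the replica has executed up to."""
    return (status["Relay_Master_Log_File"], int(status["Exec_Master_Log_Pos"]))


def looks_stuck(earlier, later):
    """Heuristic: the SQL thread is running but the executed position has not moved.

    `earlier` and `later` are dicts sampled from SHOW SLAVE STATUS at two
    different times (an assumption of this sketch). A long transaction that
    has not finished applying yet would also show up this way.
    """
    sql_running = later["Slave_SQL_Running"] == "Yes"
    return sql_running and replication_position(earlier) == replication_position(later)
```

If the position is unchanged while the SQL thread reports that it is running, the replica is most likely still applying a single large transaction, which matches the behaviour described above.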

Event Timeline

After 4.5 days of waiting for replication to catch up, I had to ignore s51290\_\_dpl\_p.% replication. CC @Dispenser @JaGa @russblau.
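The backslashes in that pattern matter if, as seems likely, it is used with a wildcard replication filter such as MariaDB's replicate_wild_ignore_table: there, `%` matches any run of characters, `_` matches any single character, and `\_` matches a literal underscore. The sketch below is a simplified Python model of that matching, not the server's implementation; the helper names are hypothetical.

```python
import re


def wild_to_regex(pattern):
    """Translate a LIKE-style wildcard pattern into a compiled regex."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # A backslash makes the next character literal (e.g. "\_").
            parts.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch == "%":
            parts.append(".*")   # any run of characters
        elif ch == "_":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.compile("".join(parts), re.DOTALL)


def is_ignored(pattern, db, table):
    """True if the qualified name db.table matches the wildcard pattern."""
    return wild_to_regex(pattern).fullmatch(f"{db}.{table}") is not None
```

With the escaped pattern, only tables in s51290__dpl_p are filtered. Without the backslashes, a database whose name has other characters in place of the underscores would be filtered as well.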

FYI, while this doesn't change your day-to-day work, it means that the database s51290__dpl_p, like a few others (see T127164), will not be replicated. You should keep your own backups, because the data will be lost if the hardware has issues or the host is replaced, and it will not be available during scheduled maintenance. For a database to be replicated, its transactions have to be small (e.g. no LOAD DATA or imports, no large deletes, etc.).

I am making you aware of this: if you can change the way you run queries, I can reimport the database and replicate it again. Otherwise, it will remain unsupported in the way I described above, and you should set up your own backups.
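As one possible way to keep transactions small, a large delete can be split into primary-key ranges, committing after each range so the replica never has to apply one huge transaction. The sketch below is only an illustration under stated assumptions: the table `events` and its columns `id` and `ts` are hypothetical, and it uses Python's built-in sqlite3 module as a stand-in so it runs anywhere. A MariaDB driver would follow the same pattern but typically uses `%s` placeholders instead of `?`.

```python
import sqlite3


def delete_in_batches(conn, cutoff, batch_size=1000):
    """Delete rows of the hypothetical `events` table with ts < cutoff.

    Walks the primary key in ranges of `batch_size` ids and commits after
    each range, so every transaction touches a bounded number of rows.
    Returns the total number of rows deleted.
    """
    cur = conn.cursor()
    cur.execute("SELECT MIN(id), MAX(id) FROM events")
    lo, hi = cur.fetchone()
    if lo is None:
        return 0  # empty table, nothing to do

    deleted = 0
    start = lo
    while start <= hi:
        end = start + batch_size
        cur.execute(
            "DELETE FROM events WHERE id >= ? AND id < ? AND ts < ?",
            (start, end, cutoff),
        )
        deleted += cur.rowcount
        conn.commit()  # one small transaction per key range
        start = end
    return deleted
```

Bulk loads can be handled the same way, inserting and committing a limited number of rows at a time instead of importing everything in a single statement.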

Change 453355 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] toolsdb: Ignore s51290__dpl_p replication on toolsdb replica

https://gerrit.wikimedia.org/r/453355

jcrespo claimed this task.

I have backed up s51290__dpl_p at /srv/labsdb/s51290__dpl_p-20180817.sql.gz on the replica and applied the filter; replication has started to flow again: https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=labsdb1004&var-port=9104&from=1534032000000&to=1534636799999

Change 453355 merged by Jcrespo:
[operations/puppet@production] toolsdb: Ignore s51290__dpl_p replication on toolsdb replica

https://gerrit.wikimedia.org/r/453355

This is all resolved from an infrastructure perspective. Tool maintainers, please reopen if anything changes or you have questions on your end.