
db1053 transient replica lag during dumping
Closed, Resolved · Public

Description

db1053 lagged by up to 555 seconds over a period of ~30 minutes tonight, from 22:45 to 23:15 UTC.
This DB has the vslow and dump roles, so it is not user-facing.

The replica lagged because of what, after a first analysis, appears to be a deadlock while acquiring a semaphore lock; details to follow.

Event Timeline

These are my first findings:

  • The replica thread (0x7f2a514b3700 - 139819729237760) was waiting to acquire the lock needed to perform the next UPDATE query.
  • The main thread (139776246134528) was waiting to acquire the lock needed to save a table's statistics in the function dict_stats_save(...) at storage/xtradb/dict/dict0stats.cc:2391: rw_lock_x_lock(&dict_operation_lock);. This is usually called as part of an ALTER TABLE, so it may be related to one of our events.
  • There is also a reference to srv0srv.cc:2667, where mutex_exit(&log_sys->mutex); is called.

(I'm checking against upstream code now, so the line references could be off.)
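
For reference, roughly the same semaphore information can be pulled on demand instead of waiting for the long-semaphore warnings in the error log. This is only a sketch of the commands I would use, assuming the innodb_status_output variables are available in 10.0's XtraDB (the output corresponds to the SEMAPHORES and TRANSACTIONS sections quoted below):

-- Print the SEMAPHORES and TRANSACTIONS sections on demand
SHOW ENGINE INNODB STATUS\G

-- Keep writing the monitor output to the error log periodically
-- (roughly every 15 seconds), including extended lock information
SET GLOBAL innodb_status_output = ON;
SET GLOBAL innodb_status_output_locks = ON;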

From the MySQL error log:

InnoDB: Warning: a long semaphore wait:
--Thread 139819729237760 has waited at row0upd.cc line 2367 for 241.00 seconds the semaphore:
X-lock (wait_ex) on RW-latch at 0x7f2a568252c0 '&block->lock'
a writer (thread id 139819729237760) has reserved it in mode  wait exclusive
number of readers 1, waiters flag 0, lock_word: ffffffffffffffff
Last time read locked in file btr0sea.cc line 979
Last time write locked in file /home/jynus/mariadb-10.0.22/storage/xtradb/row/row0upd.cc line 2367
InnoDB: Warning: semaphore wait:
--Thread 139819729237760 has waited at row0upd.cc line 2367 for 241.00 seconds the semaphore:
X-lock (wait_ex) on RW-latch at 0x7f2a568252c0 '&block->lock'
a writer (thread id 139819729237760) has reserved it in mode  wait exclusive
number of readers 1, waiters flag 0, lock_word: ffffffffffffffff
Last time read locked in file btr0sea.cc line 979
Last time write locked in file /home/jynus/mariadb-10.0.22/storage/xtradb/row/row0upd.cc line 2367
InnoDB: Warning: Writer thread is waiting this semaphore:
--Thread 139819729237760 has waited at row0upd.cc line 2367 for 241.00 seconds the semaphore:
X-lock (wait_ex) on RW-latch at 0x7f2a568252c0 '&block->lock'
a writer (thread id 139819729237760) has reserved it in mode  wait exclusive
number of readers 1, waiters flag 0, lock_word: ffffffffffffffff
Last time read locked in file btr0sea.cc line 979
Last time write locked in file /home/jynus/mariadb-10.0.22/storage/xtradb/row/row0upd.cc line 2367
InnoDB: Warning: semaphore wait:
--Thread 139776246134528 has waited at srv0srv.cc line 2667 for 238.00 seconds the semaphore:
X-lock (wait_ex) on RW-latch at 0x1420640 '&dict_operation_lock'
a writer (thread id 139776246134528) has reserved it in mode  wait exclusive
number of readers 1, waiters flag 0, lock_word: ffffffffffffffff
Last time read locked in file row0purge.cc line 768
Last time write locked in file /home/jynus/mariadb-10.0.22/storage/xtradb/dict/dict0stats.cc line 2391
InnoDB: Warning: Writer thread is waiting this semaphore:
--Thread 139776246134528 has waited at srv0srv.cc line 2667 for 238.00 seconds the semaphore:
X-lock (wait_ex) on RW-latch at 0x1420640 '&dict_operation_lock'
a writer (thread id 139776246134528) has reserved it in mode  wait exclusive
number of readers 1, waiters flag 0, lock_word: ffffffffffffffff
Last time read locked in file row0purge.cc line 768
Last time write locked in file /home/jynus/mariadb-10.0.22/storage/xtradb/dict/dict0stats.cc line 2391
InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info:
InnoDB: Pending preads 30, pwrites 0

These are the transactions caught by the InnoDB monitor:

------------
TRANSACTIONS
------------
Trx id counter 93857352074
Purge done for trx's n:o < 93826926033 undo n:o < 0 state: running
History list length 14722220
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 0, not started
MySQL thread id 10846128, OS thread handle 0x7f208ddca700, query id 6860096425 10.64.0.15 watchdog cleaning up
---TRANSACTION 0, not started
MySQL thread id 10846174, OS thread handle 0x7f25a8177700, query id 6860096248 10.64.0.15 watchdog cleaning up
---TRANSACTION 87791088131, not started
MySQL thread id 1, OS thread handle 0x7f33aa325700, query id 0 Waiting for background binlog tasks
---TRANSACTION 93857351984, ACTIVE 241 sec updating or deleting
mysql tables in use 1, locked 1
2 lock struct(s), heap size 360, 1 row lock(s)
MySQL thread id 1888840, OS thread handle 0x7f2a514b3700, query id 6860094793 updating
UPDATE /* Title::invalidateCache Cyberbot I */  `page` SET page_touched = '20160409224513' WHERE page_id = '20642622' AND (page_touched < '20160409224513')
Trx #rec lock waits 0 #table lock waits 0
Trx total rec lock wait time 0 SEC
Trx total table lock wait time 0 SEC

In addition, for the last ~3 days the server has been under heavy load (~30), mostly I/O waits, and there are 27 parallel WikiExporter::dumpFrom long-running queries plus two even longer-running queries (UncategorizedPagesPage::reallyDoQuery and DeadendPagesPage::reallyDoQuery), but they should be unrelated to the deadlock.
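
As a side note, this is a sketch of the kind of check used to spot those long-running queries, just the process list filtered by runtime (the 10-minute threshold is arbitrary):

-- List queries that have been running for more than 10 minutes, longest first
SELECT id, user, db, time, LEFT(info, 100) AS query
FROM information_schema.PROCESSLIST
WHERE command = 'Query' AND time > 600
ORDER BY time DESC;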

The higher load due to the long-running queries ended this morning; the DB is in much better shape right now.

@ArielGlenn not fully, that's why I've left it open.
As soon as I have time I want to double-check the code references against our source code and dig deeper.

For reference, the main thread was in the state "enforcing dict cache limit" when the InnoDB status monitor started.

Maybe related to: T133262

Lag could trigger an automatic depool, which may create an apparent spike in MediaWiki failures (although it should not be visible to end users).

There could be an issue with the backups being over-aggressive, maybe when combined with long-running queries from the maintenance server (terbium). I would, however, give that low priority, as this host should not be pooled for regular traffic, and it may soon have its own set of machines: T131363

A more worrying issue would be that dumps may be running against a depooled datacenter, and that could create a different set of issues (we saw dumps from snapshot1006 going to db1065 even though it does not have the dump role).

Change 297250 had a related patch set uploaded (by Jcrespo):
Failover db1053 to db1072

https://gerrit.wikimedia.org/r/297250

Change 297250 merged by Jcrespo:
Failover db1053 to db1072

https://gerrit.wikimedia.org/r/297250

jcrespo renamed this task from db1053 transient replica lag, possible deadlock to db1053 transient replica lag during dumping.Jul 4 2016, 5:07 PM

So after some testing, db1072 also lagged.

With this in mind, my preliminary conclusions are that: 1) the lag is caused directly by the dumping process, not by any other process; 2) the host and the hardware are not to blame. However, *maybe* the previous host served the dumps from a specially-partitioned slave (to be checked), so the previous and the new state may not be 100% comparable.
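
A quick way to check the partitioning question would be to compare the partition layout of the large tables on both hosts. A sketch against the enwiki schema (the schema name here is just an example):

-- Compare partition layout between the old and the new dump host
SELECT table_name, partition_name, partition_method, table_rows
FROM information_schema.PARTITIONS
WHERE table_schema = 'enwiki' AND partition_name IS NOT NULL
ORDER BY table_name, partition_ordinal_position;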

I am concerned about the parallelism: I believe we could achieve a smaller overall dump time with lower parallelism, because we would not be I/O-bound. But that would need testing.
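
To test that, this is a sketch of the signals I would sample on the replica while varying the number of parallel dump jobs (these are standard replication and InnoDB status outputs):

-- Replica lag: Seconds_Behind_Master should stay near 0
SHOW SLAVE STATUS\G

-- Rough I/O pressure indicators inside InnoDB
SHOW GLOBAL STATUS LIKE 'Innodb_data_pending%';
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads';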

Another possibility is that some sizes reached a tipping point at which the previous setup no longer scales (it could be the size being dumped, but it could be anything else, such as more parallel writes while the dumps are ongoing, etc.).

I think the next step is to let this run continue, and before the next one, @ArielGlenn, please ping me in advance so I can try some techniques: parallel replication T85266 (once I have tested that it is safe), reduced durability, or compression T139055.
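
For reference, these are the kind of knobs I mean, as a sketch only (exact values to be decided after testing; reduced durability would only be acceptable on a non-user-facing replica like this one):

-- Parallel replication (T85266): the slave must be stopped to change it
STOP SLAVE;
SET GLOBAL slave_parallel_threads = 4;  -- example value, to be tuned
START SLAVE;

-- Reduced durability while the dumps run
SET GLOBAL innodb_flush_log_at_trx_commit = 2;
SET GLOBAL sync_binlog = 0;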

In the meantime I will be working on splitting groups of jobs into sequential batches (configurable), so that it is ready to be applied to the stubs step on the next run.

I've disabled notifications on db1072 for slave s1 lag for the next couple of days, just so that if it does get above the threshold again no one will get paged for it.

Change 297399 had a related patch set uploaded (by ArielGlenn):
allow stubs to be done in serial batches of parallel jobs

https://gerrit.wikimedia.org/r/297399

Change 297399 merged by ArielGlenn:
allow stubs to be done in serial batches of parallel jobs

https://gerrit.wikimedia.org/r/297399

Not yet deployed; this should be done before the next dumps run.

Changeset https://gerrit.wikimedia.org/r/#/c/297427/ should be tested and merged before then as well.

Changesets are all merged and deployed; we'll see what happens during the next run.

"Blocked" only because waiting on updates of those changes, see what happens.

Change 300224 had a related patch set uploaded (by ArielGlenn):
fix up xmlstubs batch jobs setting for en wiki xml dumps

https://gerrit.wikimedia.org/r/300224

Change 300224 merged by ArielGlenn:
fix up xmlstubs batch jobs setting for en wiki xml dumps

https://gerrit.wikimedia.org/r/300224

This failed because I forgot how to set the config setting properly. Interestingly enough, we had lag of only about 15 seconds max during the en wiki stubs phase this run, so it's definitely an interaction with something else. Anyway, we'll see again on Aug 1 for the regular run.

15 seconds is not a concern. Also, as I have said many times, if you told me you needed to stop replication, that would be OK (and I would like to explore that once we have dedicated hardware); it is just that I believe less load == faster backups.

Yeah I don't think we should need to stop it. This approach of running the problematic job in two batches should be just fine, as long as I remember how to invoke the feature which I wrote just recently (eyeroll).

This will kick off on the 1st or 2nd of August and I'll be watching it.

Today's dump skipped over that stage and hung on a dependency, with no info in the logs, which is problematic. It skipped over the tables job as well. I've shot the runner and will see what happens tonight when the cron job attempts to restart.

There was another directory in the old dumps that couldn't be cleaned up because it was owned by root. I'm doing a mass chown -R now, but the exception should have wound up somewhere I could see it; I'll look into that later. In the meantime, I'm waiting for the cron job to kick off in a little while, and we'll see how it goes.

Stubs are running; the first batch (14 jobs) started as expected. @jcrespo I expect to see no issues with the DB, but here's the heads-up that the job has kicked off.

ArielGlenn claimed this task.

And it's done, both phases, so closing this ticket!