
enwiki master oaiUpdatePage spike
Closed, Resolved (Public)

Description

Tonight, load on the enwiki master (db1052) spiked with many concurrent oaiUpdatePage statements from jobrunners:

REPLACE /* oaiUpdatePage 127.0.0.1 */ INTO updates (up_page,up_action,up_timestamp,up_sequence) VALUES ('0','modify','20150421134316',NULL)

Notice up_page=0. The timestamp steadily increments. The production enwiki slaves do not show excessive lag, but this is likely due to the master throttling commit speed via the semi-synchronous replication plugin. Other slaves not in the semi-sync group (db1047, dbstore*) are showing replag. The REPLACE statements take up to 10s to commit at times as they fight for locks.

Other anecdotal info from IRC:

  • MaxSem said up_page=0 is related to replag.
  • legoktm said SUL finalization started on enwiki around that time
  • I didn't notice SUL traffic going slow, but perhaps this was some cumulative thing

https://gerrit.wikimedia.org/r/#/c/205606/

We should find out how not to fill up binlogs with hordes of identical queries.
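One possible direction, sketched here purely as an illustration and not something proposed in this task: since REPLACE keeps only the last row per up_page anyway, a writer could coalesce pending updates per page before issuing them. The function name `coalesce_updates` and the in-memory list of pending updates are hypothetical; timestamps are assumed to be 14-digit strings as in the query above, so they compare correctly as text.

```python
def coalesce_updates(pending):
    """Collapse pending (page_id, action, timestamp) tuples to one per page.

    Hypothetical helper: only the newest entry for each page id survives,
    so a burst of identical updates turns into a single statement.
    """
    latest = {}
    for page_id, action, timestamp in pending:
        current = latest.get(page_id)
        # 14-digit timestamps sort lexicographically in time order.
        if current is None or timestamp >= current[1]:
            latest[page_id] = (action, timestamp)
    return [(page_id, action, ts) for page_id, (action, ts) in latest.items()]


pending = [
    (0, "modify", "20150421134316"),
    (0, "modify", "20150421134317"),
    (7, "modify", "20150421134317"),
    (0, "modify", "20150421134318"),
]
coalesced = coalesce_updates(pending)
```

Whether batching like this is practical depends on how the jobs are dispatched; it only helps when several updates for the same page are visible to one writer at the same time.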

Event Timeline

Springle raised the priority of this task from to Needs Triage.
Springle updated the task description.
Springle set Security to None.

Deployed MaxSem's debugging info patch. The queries are of the "OAIHook::updateMove-to" type, which means they're definitely caused by all the SUL finalization page moves.

They're all setting up_page = 0 because we're suppressing redirects. Does it actually make sense to update a row for that? The data seems to be useless, since no page actually has an id of 0.
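The idea of skipping the write when the page id is 0 can be sketched as follows. This is a minimal illustration using Python and SQLite, not the actual extension code: the table layout is an assumption based only on the columns visible in the REPLACE statement above, and `make_db` and `record_update` are hypothetical names.

```python
import sqlite3


def make_db():
    """Create an in-memory table with the columns seen in the REPLACE query.

    The column types and the primary key on up_page are assumptions.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE updates ("
        " up_page INTEGER PRIMARY KEY,"
        " up_action TEXT NOT NULL,"
        " up_timestamp TEXT NOT NULL,"
        " up_sequence INTEGER)"
    )
    return conn


def record_update(conn, page_id, action, timestamp):
    """Write an update row; return True if a row was written."""
    if page_id == 0:
        # No page has id 0 (for example, a move that suppressed the
        # redirect), so there is nothing useful to record.
        return False
    conn.execute(
        "REPLACE INTO updates (up_page, up_action, up_timestamp, up_sequence)"
        " VALUES (?, ?, ?, NULL)",
        (page_id, action, timestamp),
    )
    return True
```

With a guard like this, every suppressed-redirect move becomes a no-op instead of one more writer contending for the same up_page=0 row.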

Change 205615 had a related patch set uploaded (by Legoktm):
Don't try to update up_page=0 if page moves suppressed redirects

https://gerrit.wikimedia.org/r/205615

Change 205615 merged by jenkins-bot:
Don't try to update up_page=0 if page moves suppressed redirects

https://gerrit.wikimedia.org/r/205615

Change 205628 had a related patch set uploaded (by Legoktm):
Don't try to update up_page=0 if page moves suppressed redirects

https://gerrit.wikimedia.org/r/205628

Change 205629 had a related patch set uploaded (by Legoktm):
Don't try to update up_page=0 if page moves suppressed redirects

https://gerrit.wikimedia.org/r/205629

Change 205628 merged by jenkins-bot:
Don't try to update up_page=0 if page moves suppressed redirects

https://gerrit.wikimedia.org/r/205628

Change 205629 merged by jenkins-bot:
Don't try to update up_page=0 if page moves suppressed redirects

https://gerrit.wikimedia.org/r/205629

If db1047 is using single-threaded replication, how is there lock contention on the slaves?

Also, semi-sync replication only guarantees that the log has been replicated and fsynced on another slave, not that the transaction has actually been applied there. So I'm curious how much that helps with slow queries.

@aaron, Sorry, I was unclear: The REPLACE statements take up to 10s to commit on the master...

Semi-sync delays commit on the master until fsync occurs on a slave. This acts like a throttle on the master's transaction throughput, forcing clients on the master to wait a little longer for every commit. In this case, where many small transactions appeared in surges, at least some of the 10s is due to semi-sync. Without semi-sync, I think more slaves would have behaved like db1047.
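As a rough back-of-the-envelope illustration of the throttling effect, consider one client connection issuing small commits back to back. The model below ignores group commit and lock waits, and the latencies are made-up numbers, not measurements from db1052; `commits_per_second` is a hypothetical helper.

```python
def commits_per_second(local_commit_s, slave_ack_s=0.0):
    """Upper bound on commits/s for one connection committing serially.

    Each commit costs the local commit time plus, with semi-sync enabled,
    the wait for a slave to acknowledge that the event was flushed.
    """
    return 1.0 / (local_commit_s + slave_ack_s)


# Hypothetical latencies chosen only to show the shape of the effect.
without_semisync = commits_per_second(0.001)
with_semisync = commits_per_second(0.001, slave_ack_s=0.004)
```

In this toy model the extra acknowledgement wait cuts per-connection throughput to a fifth, which is the sense in which the lag shifts from the slaves onto clients of the master.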

I'm not saying semi-sync helps with slow queries, only that there are interesting side-effects that can shift the visible lag from replication onto every client connection on the master.

Springle claimed this task.

https://gerrit.wikimedia.org/r/#/c/205629/ fixed the immediate problem.

Legoktm mentioned that this should be discussed further in case the fix is not the right approach for OAI, but apparently it's being killed off...