
enwiki master oaiUpdatePage spike
Closed, ResolvedPublic

Description

Tonight enwiki master db1052 load spiked with many concurrent oaiUpdatePage statements from jobrunners:

REPLACE /* oaiUpdatePage 127.0.0.1 */ INTO updates (up_page,up_action,up_timestamp,up_sequence) VALUES ('0','modify','20150421134316',NULL)

Notice up_page=0. The timestamp steadily increments. The production enwiki slaves do not show excessive lag, but this is likely due to the master throttling commit speed via the semi-synchronous replication plugin. Other slaves not in the semi-sync group (db1047, dbstore*) are showing replag. The REPLACE statements take up to 10s to commit at times as they fight for locks.

Other anecdotal info from IRC:

  • MaxSem said up_page=0 is related to replag.
  • legoktm said SUL finalization started on enwiki around that time.
  • I didn't notice SUL traffic going slow, but perhaps this was some cumulative effect.

https://gerrit.wikimedia.org/r/#/c/205606/

We should figure out how to avoid filling the binlogs with hordes of identical queries.
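For context, here is a rough sketch of the kind of write path that produces these statements. It is illustrative rather than the OAI extension's actual code, and it assumes up_page is the unique key of the updates table:

<?php
// Illustrative sketch only, not the OAI extension's actual code.
// Assumes 'up_page' is the unique index on the updates table, so every
// row written with up_page = 0 targets the same index entry.
function recordOaiUpdate( $pageId, $action ) {
    $dbw = wfGetDB( DB_MASTER );
    $dbw->replace(
        'updates',
        [ 'up_page' ], // unique key: all $pageId = 0 rows collide here
        [
            'up_page'      => $pageId, // 0 when the move suppressed the redirect
            'up_action'    => $action, // e.g. 'modify'
            'up_timestamp' => $dbw->timestamp(),
            'up_sequence'  => null,
        ],
        __METHOD__
    );
}

Since REPLACE deletes and re-inserts on the unique key, every concurrent jobrunner writing up_page = 0 queues on the same index record, which would explain both the lock contention and the stream of near-identical statements in the binlog.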

Event Timeline

Springle created this task. Apr 21 2015, 2:09 PM
Springle raised the priority of this task from to Needs Triage.
Springle updated the task description. (Show Details)
Springle added subscribers: Springle, aaron, brion and 2 others.
Restricted Application added a subscriber: Aklapper. Apr 21 2015, 2:09 PM

Improved debugging information in https://gerrit.wikimedia.org/r/205606

Springle triaged this task as High priority. Apr 21 2015, 2:11 PM
Springle set Security to None.

Deployed MaxSem's debugging info patch; the queries are of the "OAIHook::updateMove-to" type, which means they're definitely caused by all the SUL finalization page moves.

They're all setting up_page = 0 because we're suppressing redirects... does it actually make sense to update a row for that? The data seems to be useless since no page actually has an id of 0.
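The fix is essentially an early return: if the title being recorded no longer has a page row (article ID 0), skip the write entirely. A minimal sketch under that assumption, using a hypothetical handler name and the recordOaiUpdate() helper sketched in the description; the real change is in the Gerrit patches below:

<?php
// Hypothetical handler name and signature; the real change lives in the
// Gerrit patches referenced below.
function recordMoveUpdate( Title $title ) {
    $pageId = $title->getArticleID();

    // With redirect suppression there is no page left behind, so the ID is 0.
    // No page ever has ID 0, so the row would be meaningless; skip it instead
    // of hammering the same up_page = 0 key from every jobrunner.
    if ( $pageId === 0 ) {
        return;
    }

    recordOaiUpdate( $pageId, 'modify' );
}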

Change 205615 had a related patch set uploaded (by Legoktm):
Don't try to update up_page=0 if page moves suppressed redirects

https://gerrit.wikimedia.org/r/205615

Change 205615 merged by jenkins-bot:
Don't try to update up_page=0 if page moves suppressed redirects

https://gerrit.wikimedia.org/r/205615

Change 205628 had a related patch set uploaded (by Legoktm):
Don't try to update up_page=0 if page moves suppressed redirects

https://gerrit.wikimedia.org/r/205628

Change 205629 had a related patch set uploaded (by Legoktm):
Don't try to update up_page=0 if page moves suppressed redirects

https://gerrit.wikimedia.org/r/205629

Change 205628 merged by jenkins-bot:
Don't try to update up_page=0 if page moves suppressed redirects

https://gerrit.wikimedia.org/r/205628

Change 205629 merged by jenkins-bot:
Don't try to update up_page=0 if page moves suppressed redirects

https://gerrit.wikimedia.org/r/205629

aaron added a comment. Apr 21 2015, 5:01 PM

If db1047 is using single-threaded replication, how is there lock contention on the slaves?

Also, semi-sync replication only makes sure the log event replication + fsync make it to another slave, not that the transaction has actually been applied. So I'm curious how much that helps with slow queries.

@aaron, Sorry, I was unclear: The REPLACE statements take up to 10s to commit on the master...

Semi-sync delays commit on the master until the fsync occurs on a slave. This acts like a throttle on the master's transaction throughput, forcing clients on the master to wait a little longer for every commit. In this case, where many small transactions arrived in surges, at least some of the 10s is due to semi-sync. Without semi-sync, I think more slaves would have behaved like db1047.

I'm not saying semi-sync helps with slow queries, only that there are interesting side-effects that can shift the visible lag from replication onto every client connection on the master.
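As an aside, the master's semi-sync wait is visible in the plugin's status counters. A rough sketch of how to check them with plain mysqli; the host and credentials are placeholders, and the variable names come from the stock rpl_semi_sync_master plugin:

<?php
// Rough sketch: inspect the semi-sync status counters on the master.
// Host and credentials are placeholders.
$db = new mysqli( 'db1052.example.invalid', 'monitor', 'secret' );

$result = $db->query( "SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_master%'" );
while ( $row = $result->fetch_assoc() ) {
    // Rpl_semi_sync_master_status:           ON while semi-sync is active
    // Rpl_semi_sync_master_tx_avg_wait_time: average microseconds each
    //                                        transaction waited for a slave ACK
    printf( "%-45s %s\n", $row['Variable_name'], $row['Value'] );
}

A rising tx_avg_wait_time during a spike like this would be consistent with the commit throttling described above.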

Springle closed this task as Resolved. Apr 21 2015, 11:11 PM
Springle claimed this task.

https://gerrit.wikimedia.org/r/#/c/205629/ fixed the immediate problem.

Legoktm mentioned that this should be discussed further in case the fix is not the right approach for OAI, but apparently OAI itself is being killed off...