
wfWaitForSlaves in JobRunner can massively slow down run rate if just a single slave is lagged
Closed, Duplicate · Public

Description

Ideally it would use some "majority of non-zero load slaves" logic rather than the current "all slaves up-to-date" requirement.
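For illustration only, here is a minimal sketch of that idea in Python (not MediaWiki code; the replica list and get_lag callback are hypothetical): instead of blocking until every replica reports low lag, the job runner would block only until a strict majority of the replicas that actually carry read load have caught up.

```
import time

def wait_for_majority(replicas, get_lag, max_lag=1.0, timeout=60.0, poll=0.5):
    """Wait until a majority of non-zero-load replicas report lag <= max_lag.

    replicas: list of (name, load) tuples; get_lag(name) -> lag in seconds.
    Returns True if a majority caught up before the timeout, else False.
    """
    loaded = [name for name, load in replicas if load > 0]
    needed = len(loaded) // 2 + 1  # strict majority of loaded replicas
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        caught_up = sum(1 for name in loaded if get_lag(name) <= max_lag)
        if caught_up >= needed:
            return True
        time.sleep(poll)
    return False
```

The current behaviour corresponds to requiring every replica, the slowest one included, to catch up, which is what lets a single lagged slave throttle the whole run rate.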

Event Timeline

aaron created this task. Apr 11 2015, 4:33 AM
aaron raised the priority of this task from to Needs Triage.
aaron updated the task description. (Show Details)
aaron added a project: Availability.
aaron moved this task to Backlog on the Availability board.
aaron added a subscriber: aaron.
Restricted Application added a subscriber: Aklapper. Apr 11 2015, 4:33 AM
Nemo_bis set Security to None.
aaron claimed this task. Sep 12 2016, 10:52 PM
aaron triaged this task as High priority.
aaron added a project: Performance-Team.
aaron added a project: DBA.
aaron added a subscriber: jcrespo.

This is a more complex issue than it seems, aaron. Because we have "groups", if all slaves are up to date except one, the recentchanges one, we now have a degradation of service. Ideally, in that case, that service would jump to a secondary server, or to the main traffic one; in practice, I do not see that happening (especially due to special partitioning and buffer pool status).

At present, because of recent issues, we have set up the dump slave (the one most prone to getting behind) with load 0 and all the others with load at least 1, so they are taken into account for lag even if they shouldn't receive main traffic.

This is not an easy task, and most of it has to do with slave groups, which are at the same time a great idea and a threat to availability.
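As a rough illustration of the setup described above (hostnames, loads, and group assignments are made up, not the real wmf-config), the pinned query group is what turns a single lagged host into a degradation of that service, while the load-0 dump host is the one kept out of main traffic:

```
# Hypothetical section layout; not the actual production configuration.
replicas = {
    "db1": {"load": 100},  # general read traffic
    "db2": {"load": 100},
    "db3": {"load": 1},    # kept at load >= 1 so lag checks still cover it
    "db4": {"load": 0},    # dump/vslow host, prone to lag, no main traffic
}

# Query groups pinned to specific hosts: if db3 falls behind,
# "recentchanges" degrades even though every other replica is current.
group_loads = {
    "recentchanges": {"db3": 1},
    "vslow": {"db4": 1},
    "dump": {"db4": 1},
}
```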

aaron added a comment. Sep 13 2016, 8:33 AM

Yeah, there are complexities, which is why I never got around to it (though I looked at it before).

aaron lowered the priority of this task from High to Medium. Sep 13 2016, 8:33 AM
Gilles lowered the priority of this task from Medium to Low. Dec 7 2016, 7:41 PM
aaron removed aaron as the assignee of this task. Mar 28 2017, 11:24 PM
jcrespo raised the priority of this task from Low to Medium. Nov 30 2017, 2:49 PM

I am going to put this back to normal because there is a probable connection with T180918, unless someone disagrees.

1978Gage2001 moved this task from Triage to In progress on the DBA board. Dec 11 2017, 9:45 AM
Marostegui moved this task from In progress to Triage on the DBA board. Dec 11 2017, 11:06 AM
Krinkle edited projects, added Availability (MediaWiki-MultiDC); removed Availability.
Krinkle moved this task from MediaWiki-MultiDC to Backlog on the Availability board.
Krinkle edited projects, added Availability; removed Availability (MediaWiki-MultiDC).

I don't understand how we can implement the task as described. It's intentional that write-heavy maintenance scripts go at the speed of the slowest slave. If you only wait for a majority then you could have 50% of slaves permanently lagged, potentially by days or weeks.

aaron added a comment. Jun 12 2018, 9:13 AM

> I don't understand how we can implement the task as described. It's intentional that write-heavy maintenance scripts go at the speed of the slowest slave. If you only wait for a majority then you could have 50% of slaves permanently lagged, potentially by days or weeks.

There are lots of ways to go about this. If there is just one server lagging, then it might be better to redirect traffic away from it rather than slow down for it (or stop-the-world if it is totally broken). If the lagged servers are only in vslow/dump, it might make sense to tolerate more lag; this helps for spikes of jobs though not for overall permanently high load. It gets complicated fairly quickly though...I can't think of anything simple.
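As a purely illustrative sketch of one such variant (group names, tolerances, and the outlier rule below are assumptions, not existing MediaWiki behaviour): apply a looser lag tolerance to vslow/dump hosts, and treat a single extreme outlier as a depool candidate rather than something to wait on.

```
# Hypothetical per-group lag tolerances, in seconds.
LAG_TOLERANCE = {"generic": 1.0, "recentchanges": 1.0, "vslow": 60.0, "dump": 60.0}

def effective_wait_lag(lags_by_host, group_by_host):
    """Return (excess_lag_to_wait_on, hosts_to_depool).

    lags_by_host: {"db1": 0.2, ...}; group_by_host: {"db1": "generic", ...}.
    Hosts lagging far beyond their group's tolerance become depool
    candidates instead of throttling the job runner.
    """
    depool, waitable = [], []
    for host, lag in lags_by_host.items():
        tolerance = LAG_TOLERANCE.get(group_by_host.get(host, "generic"), 1.0)
        if lag > 10 * tolerance:
            depool.append(host)  # extreme outlier: redirect traffic away
        else:
            waitable.append(max(0.0, lag - tolerance))
    return (max(waitable) if waitable else 0.0, depool)
```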

Krinkle closed this task as a duplicate of Restricted Task. Jun 20 2018, 10:13 AM

Talked about at the offsite. Decided it probably doesn't make sense to exclude a lagged slave only for the purposes of LB:waitForAll(). Instead, these slaves should possibly be depooled entirely. For example via MediaWiki's rdbms/LoadMonitor in APC, or via a proxy of sorts.
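A minimal sketch of that alternative (illustrative only; the cache interface, thresholds, and class name are assumptions, not the rdbms/LoadMonitor API): a monitor that caches a short-lived pooled/depooled decision per replica so that connection selection simply skips badly lagged hosts, instead of every caller waiting on them.

```
import time

DEPOOL_LAG = 30.0  # seconds of lag after which a replica is considered down
FLAG_TTL = 10.0    # how long a cached depool decision is trusted

class SimpleLoadMonitor:
    """Toy stand-in for an APC- or proxy-based depool mechanism."""

    def __init__(self, cache, get_lag):
        self.cache = cache      # dict-like: host -> (pooled, expiry)
        self.get_lag = get_lag  # get_lag(host) -> replication lag in seconds

    def is_pooled(self, host):
        entry = self.cache.get(host)
        if entry and entry[1] > time.monotonic():
            return entry[0]  # reuse a recent decision instead of re-checking
        pooled = self.get_lag(host) <= DEPOOL_LAG
        self.cache[host] = (pooled, time.monotonic() + FLAG_TTL)
        return pooled

    def pick_replica(self, hosts):
        pooled = [h for h in hosts if self.is_pooled(h)]
        return pooled[0] if pooled else None  # caller handles the empty case
```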

Merged into T180918.