Lag on a single commons server created user-facing issues
Closed, ResolvedPublic

Description

How can issues on a single server affect all the service? Is this server (db1040) a SPOF? If yes, why?

The server started lagging due to an ongoing schema change; however, that schema change, to avoid issues, only happens on a server at a time.

This is a 'vslow' slave, but that should have little user impact (except for statistics)- as far as I know, job queue is run on all slaves, and only this one was affected by lag.
https://graphite.wikimedia.org/render/?width=846&height=557&_salt=1463722181.272&from=-1hours&target=MediaWiki.jobqueue.acks.ThumbnailRender.rate


https://commons.wikimedia.org/wiki/Commons:Village_pump#The_October_2015_categorization_lag_is_back
https://grafana.wikimedia.org/dashboard/db/job-queue-health?panelId=12&fullscreen&from=1463684575201&to=1463726580442&var-jobType=all

jcrespo created this task.May 20 2016, 6:41 AM
Restricted Application added subscribers: Zppix, Pokefan95, Steinsplitter. · View Herald TranscriptMay 20 2016, 6:41 AM
jcrespo edited the task description. (Show Details)May 20 2016, 6:43 AM

Change 289820 had a related patch set uploaded (by Jcrespo):
Reduce max table lock to identify metadata locks and abort

https://gerrit.wikimedia.org/r/289820

Change 289820 merged by Jcrespo:
Reduce max table lock to identify metadata locks and abort

https://gerrit.wikimedia.org/r/289820

Now, if it is blocked for more than 60 seconds it says:

ERROR 1205 (HY000) at line 1: Lock wait timeout exceeded; try restarting transaction
WARNING commonswiki : categorylinks encountered problems
continue? yes/no

However, this is the "operations side", I would like some help from someone with better mediawiki knowledge and see if we can improve mediawiki db issues detection. If not, the alternative is to kill long running queries server-side (plus T112473).

jcrespo added a comment.EditedMay 20 2016, 8:43 AM

I think I could have identified the issue: some jobqueues perform a write, then get locked WAIT_FOR_BINLOG_POS, that is a horrible pattern- not only it is nonstandard (does not work on multi-tier slaves or other kinds of database T135027) , it makes active waits- not in the "I am using CPU resources"-active, but consuming connections and blocking on a potentially always-blocked server. When a slave is found as "lagged"- be it one transaction behind or 30 minutes, the job should disconnect, reload its config, and try another random slave (and not just stick to a single server) aka "Soft-fail".

A slave being 1 second continuously behind has a huge overhead maintaining 1 second worth of connections == thousands. User requests should wait and get blocked, when needed, but jobs can be handled a bit more roughly.

In an ideal world, transactions should take a max of 1 second or be killable- and that should be enforced on server side.

See also: T133333#2311726

Bawolff added a subscriber: Bawolff.EditedMay 20 2016, 10:01 AM

The code has stuff like:

if ( $this->mLoads[$i] > 0 ) {
        $ok = $this->doWait( $i, true, $timeout ) && $ok;
}

So in principle it should not be waiting due to slave lag on the vslow slave, since that slave has a mLoads of 0. Thus it sounds like there is a MW bug somewhere here.

try another random slave (and not just stick to a single server)

In the current code (As I understand it), between each job, the job queue isn't waiting for the current slave, but waiting for all slaves (With a non-zero load [in theory]). refreshLinksJob.php also happens to wait for the current slave to catch up to the stored master pos of when the job is generated, but that current slave should never be the vslow slave.

In the current code (As I understand it), between each job, the job queue isn't waiting for the current slave

The problem is not between job executions, but wait time for each individual job between a write and its propagation to the slave. If there is an issue on a slave, mediawiki should be made aware of it withing reasonable limits and not get blocked for so long.

The point I'm trying to make, is MW is already supposed to ignore issues on slaves that have a load of 0 (like vslow), so the fact that this bug actually happened suggests that there is something wrong with the current implementation, not that the current design is bad (The current design may also be unideal for all the reasons you pointed out, just that even with the current design, this bug should not have happened)

jcrespo added a comment.EditedMay 20 2016, 10:43 AM

just that even with the current design, this bug should not have happened

We agree on that :-D, I am puzzled.

I need help to do more research on this, mediawiki is not my strong skill. I fixed the immediate cause of the issue, but this could happen again for other complete unrelated reasons.

Meanwhile I will try (from the ops side) to create an alert to detect this on T112473.

jcrespo added a comment.EditedMay 20 2016, 10:52 AM

I have a suspicion, now that I see the logs- probably user-facing traffic was not affected at all- probably only the job queue was. Load balancing on the queue probably is to blame. I know this is out of scope, but I have asked several times for the job queue to use a separate role "rpc" for more flexibility in case of issues-specially given the high amount of connections it generates. The real fix would be something like periodically check the state of the servers every few seconds and error out faster.

jcrespo removed jcrespo as the assignee of this task.
jcrespo lowered the priority of this task from "High" to "Normal".May 20 2016, 12:18 PM

Setting it back to normal because this is not an ongoing issue anymore and also measures have been taken to prevent future issues like this; this ticket is just open for the underlaying causes.

Change 289993 had a related patch set uploaded (by Aaron Schulz):
Do not count 0-load servers in getMaxLag()

https://gerrit.wikimedia.org/r/289993

Jarekt added a subscriber: Jarekt.May 23 2016, 6:48 PM

It just happen that Commons finally got arbitrary access to Wikidata and there some development of wikidata based tools. For example my Module:Authority control is not pooling identifiers from Wikidata at 2M pages. I hope those two events are not related.

Change 289993 merged by jenkins-bot:
Do not count 0-load servers in getMaxLag()

https://gerrit.wikimedia.org/r/289993

Bawolff assigned this task to aaron.May 23 2016, 11:00 PM
Bawolff closed this task as "Resolved".

It just happen that Commons finally got arbitrary access to Wikidata and there some development of wikidata based tools. For example my Module:Authority control is not pooling identifiers from Wikidata at 2M pages. I hope those two events are not related.

I'm not entirely sure what "pooling identifiers from Wikidata at 2M pages" means. However the actual issue at commons has been resolved for three days (And really only lasted for a couple hours), I also don't think it would affect accessing wikidata properties (just possibly refreshing things when they change). The only reason this bug was still open was to make sure that it didn't happen again. With Aaron's patch now merged, that's complete