Lag on a single commons server created user-facing issues
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	• jcrespo
	May 20 2016, 6:41 AM

Description

How can issues on a single server affect all the service? Is this server (db1040) a SPOF? If yes, why?

The server started lagging due to an ongoing schema change; however, that schema change, to avoid issues, only happens on a server at a time.

This is a 'vslow' slave, but that should have little user impact (except for statistics)- as far as I know, job queue is run on all slaves, and only this one was affected by lag.
https://graphite.wikimedia.org/render/?width=846&height=557&_salt=1463722181.272&from=-1hours&target=MediaWiki.jobqueue.acks.ThumbnailRender.rate

Screenshot from 2016-05-20 08:34:04.png (959×1 px, 80 KB)

https://commons.wikimedia.org/wiki/Commons:Village_pump#The_October_2015_categorization_lag_is_back
https://grafana.wikimedia.org/dashboard/db/job-queue-health?panelId=12&fullscreen&from=1463684575201&to=1463726580442&var-jobType=all

Details

	Subject	Repo	Branch	Lines +/-
	Do not count 0-load servers in getMaxLag()	mediawiki/core	master	+1 -1
	Reduce max table lock to identify metadata locks and abort	operations/software	master	+3 -1

Customize query in gerrit

Related Objects

Mentioned In: T133333: Audit MySQL configurations
T130692: Add new indexes from eec016ece6d2b30addcdf3d3efcc2ba59b10e858 to production databases
Mentioned Here: T133333: Audit MySQL configurations
T135027: MASTER_POS_WAIT() alternative that works cross-DC
T112473: Better mysql monitoring for number of connections and processlist strange patterns

Event Timeline

• jcrespo created this task.May 20 2016, 6:41 AM

Restricted Application added subscribers: Zppix, Poyekhali, Steinsplitter. · View Herald TranscriptMay 20 2016, 6:41 AM

• jcrespo updated the task description. (Show Details)May 20 2016, 6:43 AM

Change 289820 had a related patch set uploaded (by Jcrespo):
Reduce max table lock to identify metadata locks and abort

https://gerrit.wikimedia.org/r/289820

gerritbot added a project: Patch-For-Review.May 20 2016, 7:27 AM

Change 289820 merged by Jcrespo:
Reduce max table lock to identify metadata locks and abort

https://gerrit.wikimedia.org/r/289820

Now, if it is blocked for more than 60 seconds it says:

ERROR 1205 (HY000) at line 1: Lock wait timeout exceeded; try restarting transaction
WARNING commonswiki : categorylinks encountered problems
continue? yes/no

However, this is the "operations side", I would like some help from someone with better mediawiki knowledge and see if we can improve mediawiki db issues detection. If not, the alternative is to kill long running queries server-side (plus T112473).

• jcrespo removed a project: Patch-For-Review.May 20 2016, 7:48 AM

zhuyifei1999 subscribed.May 20 2016, 7:51 AM

• jcrespo mentioned this in T130692: Add new indexes from eec016ece6d2b30addcdf3d3efcc2ba59b10e858 to production databases.May 20 2016, 7:53 AM

I think I could have identified the issue: some jobqueues perform a write, then get locked WAIT_FOR_BINLOG_POS, that is a horrible pattern- not only it is nonstandard (does not work on multi-tier slaves or other kinds of database T135027) , it makes active waits- not in the "I am using CPU resources"-active, but consuming connections and blocking on a potentially always-blocked server. When a slave is found as "lagged"- be it one transaction behind or 30 minutes, the job should disconnect, reload its config, and try another random slave (and not just stick to a single server) aka "Soft-fail".

A slave being 1 second continuously behind has a huge overhead maintaining 1 second worth of connections == thousands. User requests should wait and get blocked, when needed, but jobs can be handled a bit more roughly.

In an ideal world, transactions should take a max of 1 second or be killable- and that should be enforced on server side.

See also: T133333#2311726

• jcrespo mentioned this in T133333: Audit MySQL configurations.May 20 2016, 8:46 AM

The code has stuff like:

if ( $this->mLoads[$i] > 0 ) {
        $ok = $this->doWait( $i, true, $timeout ) && $ok;
}

So in principle it should not be waiting due to slave lag on the vslow slave, since that slave has a mLoads of 0. Thus it sounds like there is a MW bug somewhere here.

try another random slave (and not just stick to a single server)

In the current code (As I understand it), between each job, the job queue isn't waiting for the current slave, but waiting for all slaves (With a non-zero load [in theory]). refreshLinksJob.php also happens to wait for the current slave to catch up to the stored master pos of when the job is generated, but that current slave should never be the vslow slave.

In the current code (As I understand it), between each job, the job queue isn't waiting for the current slave

The problem is not between job executions, but wait time for each individual job between a write and its propagation to the slave. If there is an issue on a slave, mediawiki should be made aware of it withing reasonable limits and not get blocked for so long.

The point I'm trying to make, is MW is already supposed to ignore issues on slaves that have a load of 0 (like vslow), so the fact that this bug actually happened suggests that there is something wrong with the current implementation, not that the current design is bad (The current design may also be unideal for all the reasons you pointed out, just that even with the current design, this bug should not have happened)

just that even with the current design, this bug should not have happened

We agree on that :-D, I am puzzled.

I need help to do more research on this, mediawiki is not my strong skill. I fixed the immediate cause of the issue, but this could happen again for other complete unrelated reasons.

Meanwhile I will try (from the ops side) to create an alert to detect this on T112473.

I have a suspicion, now that I see the logs- probably user-facing traffic was not affected at all- probably only the job queue was. Load balancing on the queue probably is to blame. I know this is out of scope, but I have asked several times for the job queue to use a separate role "rpc" for more flexibility in case of issues-specially given the high amount of connections it generates. The real fix would be something like periodically check the state of the servers every few seconds and error out faster.

• jcrespo removed • jcrespo as the assignee of this task.May 20 2016, 11:00 AM

• jcrespo edited projects, added MediaWiki-Core-JobQueue; removed Commons, DBA.

Poyekhali awarded a token.May 20 2016, 11:38 AM

Setting it back to normal because this is not an ongoing issue anymore and also measures have been taken to prevent future issues like this; this ticket is just open for the underlaying causes.

Johan moved this task from To Triage to In current Tech/News draft on the User-notice board.May 20 2016, 3:14 PM

Liuxinyu970226 subscribed.May 20 2016, 11:21 PM

Change 289993 had a related patch set uploaded (by Aaron Schulz):
Do not count 0-load servers in getMaxLag()

https://gerrit.wikimedia.org/r/289993

gerritbot added a project: Patch-For-Review.May 21 2016, 10:45 AM

• MZMcBride subscribed.May 22 2016, 12:44 AM

• jcrespo removed a parent task: T130692: Add new indexes from eec016ece6d2b30addcdf3d3efcc2ba59b10e858 to production databases.May 23 2016, 8:07 AM

It just happen that Commons finally got arbitrary access to Wikidata and there some development of wikidata based tools. For example my Module:Authority control is not pooling identifiers from Wikidata at 2M pages. I hope those two events are not related.

Change 289993 merged by jenkins-bot:
Do not count 0-load servers in getMaxLag()

https://gerrit.wikimedia.org/r/289993

In T135809#2320708, @Jarekt wrote:

It just happen that Commons finally got arbitrary access to Wikidata and there some development of wikidata based tools. For example my Module:Authority control is not pooling identifiers from Wikidata at 2M pages. I hope those two events are not related.

I'm not entirely sure what "pooling identifiers from Wikidata at 2M pages" means. However the actual issue at commons has been resolved for three days (And really only lasted for a couple hours), I also don't think it would affect accessing wikidata properties (just possibly refreshing things when they change). The only reason this bug was still open was to make sure that it didn't happen again. With Aaron's patch now merged, that's complete

Sjoerddebruin moved this task from In current Tech/News draft to Recently announced in Tech/News on the User-notice board.May 25 2016, 7:27 AM

Johan moved this task from Recently announced in Tech/News to Already announced/Archive on the User-notice board.May 26 2016, 7:28 PM

Liuxinyu970226 unsubscribed.May 28 2016, 2:41 PM

ReleaseTaggerBot added projects: MW-1.28-release-notes, MW-1.28-release (WMF-deploy-2016-06-07_(1.28.0-wmf.5)).May 31 2016, 8:01 PM

Ladsgroup edited projects, added User-notice-archive; removed User-notice.Aug 13 2022, 1:54 PM

Restricted Application added a project: Performance-Team. · View Herald TranscriptAug 13 2022, 1:54 PM

	F4033555: Screenshot from 2016-05-20 08:34:04.png
	May 20 2016, 6:41 AM

Lag on a single commons server created user-facing issuesClosed, ResolvedPublicActions

Description

Details

Related Objects

Event Timeline

Lag on a single commons server created user-facing issues
Closed, ResolvedPublic
Actions