s3 database resource usage and contention increased 2-10x times
Closed, Resolved (Public)

Authored By

	jcrespo
	Dec 14 2016, 10:07 AM

Description

Happening since 20:51-20:57 2016-12-13 UTC. Reverting 1.29.0-wmf.6 -> wmf.5 made no difference:

Suspicious patterns on the s3 master (a decrease in writes, but very spiky): https://grafana.wikimedia.org/dashboard/db/mysql?from=now-24h&to=now&var-dc=eqiad%20prometheus%2Fops&var-server=db1075
3x the query throughput (not a definitive measure, but could indicate something is wrong): https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?var-dc=eqiad%20prometheus%2Fops&var-group=All&var-shard=s3&var-role=All&from=now-24h&to=now
Traffic increased 5x: https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?panelId=2&fullscreen&var-dc=eqiad%20prometheus%2Fops&var-group=All&var-shard=s3&var-role=All&from=now-24h&to=now
For a main s3 slave (db1077): 1.5-2x the load of the average enwiki slave, when it used to be only 0.5x
3-4x the QPS and traffic on a single slave: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=16&fullscreen&var-dc=eqiad%20prometheus%2Fops&var-server=db1077&from=now-24h&to=now https://grafana.wikimedia.org/dashboard/db/mysql?panelId=5&fullscreen&var-dc=eqiad%20prometheus%2Fops&var-server=db1077&from=now-24h&to=now
2-3x the server load: https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=db1077&var-network=eth0&from=now-24h&to=now
IO activity dropped hugely, which can be a sign of server contention: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=20&fullscreen&var-dc=eqiad%20prometheus%2Fops&var-server=db1077&from=now-24h&to=now https://grafana.wikimedia.org/dashboard/db/mysql?panelId=34&fullscreen&var-dc=eqiad%20prometheus%2Fops&var-server=db1077&from=now-24h&to=now
Mutex contention increased 10x: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=23&fullscreen&var-dc=eqiad%20prometheus%2Fops&var-server=db1077&from=now-24h&to=now

Revisions and Commits

rOMWC Wikimedia - MediaWiki Config
	rOMWC11cd0b822f50 all wikis to 1.29.0-wmf.5

Related Objects

Mentioned Here: T152761: db1028 increased lag after extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php
T153159: pl proc line: 2959: warning: points must have either 4 or 2 values per line

Event Timeline

jcrespo created this task. Dec 14 2016, 10:07 AM

Restricted Application added a subscriber: Aklapper. Dec 14 2016, 10:07 AM

Legoktm added a parent task: T152563: MW-1.29.0-wmf.6 deployment blockers. Dec 14 2016, 10:21 AM

jcrespo updated the task description. Dec 14 2016, 10:23 AM

After a first look, it is not 100% clear that this is due to a train deployment: the issue seems to continue, but there is no appreciable load on group0 wikis. The load on s3 is very timezone-sensitive, so there are no firm conclusions yet. @Legoktm and I commented that it would be a huge coincidence for the issue to have started at the same time as the scap.

are there any async jobs running along with group0?

This may be related:
After deploying the train, I noticed a lot of this error in logstash: pl proc line: 2959: warning: points must have either 4 or 2 ( and I filed T153159 )

@jcrespo: I can revert and see what happens?

• mmodell added a commit: rOMWC11cd0b822f50: all wikis to 1.29.0-wmf.5. Dec 14 2016, 7:57 PM

Mentioned in SAL (#wikimedia-operations) [2016-12-14T20:13:16Z] <twentyafterfour@tin> rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.29.0-wmf.5 refs T153184

I just reverted; waiting a while to see if it makes any difference in the graphs linked above.

It doesn't appear to be helping.

In T153184#2873726, @mmodell wrote:

are there any async jobs running along with group0?

This may be related:
After deploying the train, I noticed a lot of this error in logstash: pl proc line: 2959: warning: points must have either 4 or 2 ( and I filed T153159 )

It's not related, that's gone on for months now. It has something to do with ploticus (for the Timeline extension). I'm pretty sure we see it during deploys because some cache gets invalidated.

ok so rollback of the train hasn't helped. Any other ideas of what could be causing this?

On s3 we had the same bump in traffic over the weekend, from Dec 10 08:55 UTC until Dec 11 20:55 UTC:

S3-7days-traffic.png (421×765 px, 59 KB)

removing from deployment blockers because this is apparently unrelated to the new branch.

• mmodell removed a parent task: T152563: MW-1.29.0-wmf.6 deployment blockers. Dec 14 2016, 9:06 PM

Thanks for checking. I really mean it. I will continue investigating to find the source of this overhead.

@jcrespo: Glad to help, that's what I'm here for :)

This maintenance T152761#2874723 would explain the extra resource usage.

The script is finally finished.

Confirmed that was the cause: https://grafana-admin.wikimedia.org/dashboard/db/mysql-aggregated?from=1481220766259&to=1481825566259&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s3&var-role=All
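For anyone reading the link above without access to the dashboard, the time window it covers can be recovered from the URL itself. This is a minimal sketch, assuming the `from` and `to` query parameters are absolute timestamps in epoch milliseconds (the usual Grafana convention for fixed ranges); the helper name `grafana_window` is made up for illustration.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlsplit, parse_qs

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def grafana_window(url):
    """Return (start, end) as UTC datetimes.

    Assumes `from` and `to` are absolute epoch-millisecond values;
    relative ranges such as `now-24h` are not handled here.
    """
    params = parse_qs(urlsplit(url).query)
    start_ms = int(params["from"][0])
    end_ms = int(params["to"][0])
    # Integer millisecond arithmetic avoids float rounding.
    return (EPOCH + timedelta(milliseconds=start_ms),
            EPOCH + timedelta(milliseconds=end_ms))

# The confirmation link from this task.
URL = ("https://grafana-admin.wikimedia.org/dashboard/db/mysql-aggregated"
       "?from=1481220766259&to=1481825566259"
       "&var-dc=eqiad%20prometheus%2Fops&var-group=core"
       "&var-shard=s3&var-role=All")

start, end = grafana_window(URL)
print(start.isoformat(), end.isoformat(), end - start)
```

Under that assumption the link spans exactly seven days, from 2016-12-08 to 2016-12-15 UTC, so it includes both the weekend bump of Dec 10-11 and the increase that started on Dec 13.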

	F5055852: S3-7days-traffic.png
	Dec 14 2016, 8:41 PM
