Page MenuHomePhabricator

s51362 has been rate limited to 2 concurrent connections for creating hundreds of 1400-second queries to labsdb1001 and labsdb1003 every 10 seconds
Closed, ResolvedPublic

Description

Event Timeline

jcrespo removed jcrespo as the assignee of this task.Apr 8 2017, 11:30 AM
jcrespo created this task.
jcrespo added a project: Tools.
jcrespo moved this task from Triage to Blocked external/Not db team on the DBA board.

It is believed this was the source of a brief outage happened today at 10-11 UTC, affecting all labsdb users.

jcrespo added a comment.EditedApr 8 2017, 11:41 AM

For lab admins: This incident https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&panelId=17&fullscreen&orgId=1&var-server=labsdb1001&var-network=eth0&from=1491640887345&to=1491651687345 (which stopped replication and caused lag due to the 500GB temporary tables) forced me to go briefly to read only mode and limit long running queries to 1000 seconds. I have already reverted the read only and replication lag is back to 0 https://tools.wmflabs.org/replag/ , but the hard query limit is still in place. I will take it away if things are stable on the following hours.

Noting here that s51362 is the mysql user for tool erwin85 - http://tools.wmflabs.org/?tool=erwin85

Akoopal added a subscriber: Akoopal.

Did:

tail -10000 access.log | awk '{print $7}' | sed 's/\?.*$//' | sort | uniq -c | sort -rn | less

(yes, not pretty, just quick) giving:

5941 /erwin85/xcontribs.php
2304 /erwin85/randomarticle.php
 296 /erwin85/relatedchanges.php
 267 /erwin85/categorycount.php
 227 /erwin85/contribs.php

Maybe splitting of the first tools, so they are more independent?

From irc:

12:00:15 valhallas : akoopal: the only script that has UNION_ALL is contribs.php
12:00:28 valhallas : so that's likely the culprit

The problem seems to be that the query uses the revision view rather than the revision_userindex view, which causes

id	select_type	table	type	possible_keys	key	key_len	ref	rows	Extra
1	SIMPLE	revision	ALL	None	None	None	None	321741843	Using where; Using filesort
1	SIMPLE	page	eq_ref	PRIMARY	PRIMARY	4	enwiki.revision.rev_page	1

as EXPLAIN output, rather than

id	select_type	table	type	possible_keys	key	key_len	ref	rows	Extra
1	SIMPLE	revision	ref	usertext_timestamp	usertext_timestamp	257	const	178	Using where; Using filesort
1	SIMPLE	page	eq_ref	PRIMARY	PRIMARY	4	enwiki.revision.rev_page	1

So the correct query would be along the lines of

select * from revision_userindex left join page on rev_page=page_id where rev_user_text='Valhallasw' order by rev_timestamp desc limit 1000;

Thanks; I switched to revision_userindex in cb8336a5238a72dff503bc598781059c8cead899 . Unless a way is provided to assess the impact of the rate limit (and hence the best way to give maximum service within the rate limit), please remove the rate limit and allow us to test whether the performance is sufficient.

On the same day I see one user who made many requests to xcontribs.php which generates a loop of user table queries. If *that* was the problem, we could instead block or rate limit that users.

I have increased the limit to 10 concurrent connections.

I would definitely support splitting tools accounts- othewise, if only one has issues, all will be affected. Please implement some concurrency limit to the database, too. Thank you.

Marostegui closed this task as Resolved.Jul 31 2017, 9:46 AM
Marostegui assigned this task to jcrespo.

Closing this - feel free to reopen if you believe this is not yet fixed.