Description
| Status | Subtype | Assigned | Task | ||
|---|---|---|---|---|---|
| Resolved | jcrespo | T132431 labsdb1001 and labsdb1003 short on available space | |||
| Resolved | jcrespo | T162519 s51362 has been rate limited to 2 concurrent connections for creating hundreds of 1400-second queries to labsdb1001 and labsdb1003 every 10 seconds |
Event Timeline
It is believed this was the source of a brief outage happened today at 10-11 UTC, affecting all labsdb users.
For lab admins: This incident https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&panelId=17&fullscreen&orgId=1&var-server=labsdb1001&var-network=eth0&from=1491640887345&to=1491651687345 (which stopped replication and caused lag due to the 500GB temporary tables) forced me to go briefly to read only mode and limit long running queries to 1000 seconds. I have already reverted the read only and replication lag is back to 0 https://tools.wmflabs.org/replag/ , but the hard query limit is still in place. I will take it away if things are stable on the following hours.
Noting here that s51362 is the mysql user for tool erwin85 - http://tools.wmflabs.org/?tool=erwin85
Did:
tail -10000 access.log | awk '{print $7}' | sed 's/\?.*$//' | sort | uniq -c | sort -rn | less
(yes, not pretty, just quick) giving:
5941 /erwin85/xcontribs.php 2304 /erwin85/randomarticle.php 296 /erwin85/relatedchanges.php 267 /erwin85/categorycount.php 227 /erwin85/contribs.php
Maybe splitting of the first tools, so they are more independent?
From irc:
12:00:15 valhallas : akoopal: the only script that has UNION_ALL is contribs.php
12:00:28 valhallas : so that's likely the culprit
The problem seems to be that the query uses the revision view rather than the revision_userindex view, which causes
id select_type table type possible_keys key key_len ref rows Extra 1 SIMPLE revision ALL None None None None 321741843 Using where; Using filesort 1 SIMPLE page eq_ref PRIMARY PRIMARY 4 enwiki.revision.rev_page 1
as EXPLAIN output, rather than
id select_type table type possible_keys key key_len ref rows Extra 1 SIMPLE revision ref usertext_timestamp usertext_timestamp 257 const 178 Using where; Using filesort 1 SIMPLE page eq_ref PRIMARY PRIMARY 4 enwiki.revision.rev_page 1
So the correct query would be along the lines of
select * from revision_userindex left join page on rev_page=page_id where rev_user_text='Valhallasw' order by rev_timestamp desc limit 1000;
Thanks; I switched to revision_userindex in cb8336a5238a72dff503bc598781059c8cead899 . Unless a way is provided to assess the impact of the rate limit (and hence the best way to give maximum service within the rate limit), please remove the rate limit and allow us to test whether the performance is sufficient.
On the same day I see one user who made many requests to xcontribs.php which generates a loop of user table queries. If *that* was the problem, we could instead block or rate limit that users.
I would definitely support splitting tools accounts- othewise, if only one has issues, all will be affected. Please implement some concurrency limit to the database, too. Thank you.
