18:15 < shinken-w> PROBLEM - Free space - all mounts on deployment-db2 is CRITICAL: CRITICAL: deployment-prep.deployment-db2.diskspace._mnt.byte_percentfree (<11.11%)
18:24 < wmf-insec> Project beta-update-databases-eqiad build #3855: FAILURE in 4 min 9 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/3855/
18:26 < shinken-w> PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
18:26 < shinken-w> PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
18:34 < shinken-w> PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
18:35 < shinken-w> PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
18:39 < shinken-w> RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 38951 bytes in 0.470 second response time
18:39 < shinken-w> RECOVERY - App Server Main HTTP Response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 38645 bytes in 1.207 second response time
18:41 < shinken-w> RECOVERY - App Server Main HTTP Response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 38636 bytes in 0.530 second response time
18:41 < shinken-w> PROBLEM - Free space - all mounts on deployment-bastion is CRITICAL: CRITICAL: deployment-prep.deployment-bastion.diskspace._var.byte_percentfree (<60.00%)
18:41 < shinken-w> RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 30385 bytes in 1.020 second response time
18:45 < shinken-w> PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
18:47 < shinken-w> PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
18:47 < shinken-w> PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
18:48 <+twentyaft> is it a general problem with labs infrastructure? I don't see a pattern ...
18:49 <+ greg-g> https://logstash-beta.wmflabs.org/#/dashboard/elasticsearch/fatalmonitor
18:49 <+ greg-g> lot of db errors
18:50 <+marxarell> "Error: 1021 Disk full (/mnt/tmp/#sql_6f5_1); waiting for someone to free some space..."
18:51 < shinken-w> PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
18:52 <+twentyaft> deployment-db2
18:52 <+twentyaft> has full /mnt
18:53 <+twentyaft> and there isn't much I see to delete
18:53 <+twentyaft> I can delte the error log file
18:54 <+ greg-g> :/
18:54 <+marxarell> looks like a massive temp disk table
18:54 <+marxarell> at /mnt/tmp/#sql_6f5_0.MAD
18:54 <+marxarell> 77G
18:54 <+twentyaft> so someone ran a very naughty query?
18:54 <+marxarell> likely
18:55 <+twentyaft> can we kill said query?
18:55 <+twentyaft> I don't know how to get root on mysql
18:55 <+marxarell> "| 15840440 | wikiadmin | 10.68.16.127:60495 | enwiki | Query | 3775 | Copying to tmp table | SELECT /* Flow\Formatter\ContributionsQuery::queryRevisions Luke081515 */ * F"
18:55 <+twentyaft> is the password stored somewhere I can look it up?
18:56 <+marxarell> twentyafterfour: you can sudo, then just `mysql`
18:56 <+marxarell> i'm guessing it's stored in root's my.cnf
18:56 <+twentyaft> oh I tried sudo mysql; didn't work, but sudo su; mysql; did work
18:58 <+twentyaft> so should we kill it? it's almost certainly not going to be able to complete since the disk can't be enlarged to accommodate ;)
18:58 <+marxarell> !log Killed mysql process 15840440 on account of its gargantuan temp file filling up /mnt
18:58 < qa-morebo> Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
18:58 <+twentyaft> marxarelli: nice work
18:59 <+twentyaft> !log deleted atop.log.* files on deployment-bastion. when are we going to enlarge /var on this instance. grr
18:59 <+marxarell> take that, fiend!
18:59 < qa-morebo> Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
19:00 < shinken-w> RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 38951 bytes in 0.520 second response time
19:00 < shinken-w> RECOVERY - App Server Main HTTP Response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 38652 bytes in 0.647 second response time
19:01 < shinken-w> RECOVERY - Free space - all mounts on deployment-bastion is OK: OK: All targets OK
19:02 < shinken-w> RECOVERY - App Server Main HTTP Response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 38638 bytes in 0.512 second response time
19:55 < Krenair> State: Master has sent all binlog to slave; waiting for binlog to be updated
19:55 < Krenair> but on db2:
19:55 < Krenair> State: Waiting for master to send event
19:56 < Krenair> I wonder if that's correct..
19:59 < Luke08151> hm. If this error was produced by my last action, it could be only this one: http://en.wikipedia.beta.wmflabs.org/w/index.php?title=Special%3ALog&type=rights&user=&page=User%3ASelenium+Echo+user+2&year=&month=-1&tagfilter=
19:59 < Luke08151> But I don't know, why this action can make a db crash
20:00 < Krenair> Luke081515, sounds like it was you loading a user's contributions
20:01 < Luke08151> db in cause of a read action? o.O
20:01 < Krenair> sure
20:03 < Luke08151> I looked up the contributions of selenium user 2 and selenium user. selenium user later, and I loaded the pages of him with Special:Nuke, but don't did an action
20:03 < Luke08151> maybe an error there?
20:04 < Luke08151> or it
20:04 < Luke08151> *or at http://en.wikipedia.beta.wmflabs.org/wiki/Talk:Flow_QA, this is a very big flow board
20:05 < Luke08151> I hope that helps you
20:22 < Luke08151> greg-g: Good, that this happens only at beta. Imagine that would happend at production....
20:35 < Krenair> Why is Seconds_Behind_Master NULL...
20:36 < Krenair> seems that means replication broke
20:50 <+marxarell> i'm seeing readonly errors from api calls to bc
20:50 < Krenair> to beta?
20:50 <+ greg-g> :(
20:50 <+marxarell> yeah
20:50 < Krenair> yes, it's read-only at the moment
20:50 <+marxarell> ah, ok
20:50 < Krenair> replication from deployment-db1 to deployment-db2 is broken
20:54 <+marxarell> !log deployment-db2 shows slave io but slave sql failed on duplicate key
20:54 < qa-morebo> Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
21:13 <+marxarell> !log deployment-db1 binlog deployment-db1-bin.000062 appears corrupt
21:13 < qa-morebo> Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
22:18 <+marxarell> !log dump of deployment-db1 failed due to "View 'labswiki.bounce_records' references invalid table(s)"
Related Objects
Mentioned In
- T78671: Beta Cluster Special:Contributions lags by a long time and notes slow Flow queries
- T117446: Move `tmpdir` on deployment-db instances
- T116793: Investigate slow query logging/digest for Beta Cluster
- T116364: +Sysop for User:Mww113

Mentioned Here
- T78671: Beta Cluster Special:Contributions lags by a long time and notes slow Flow queries
Event Timeline
The corrupt binlog was likely a result of /mnt filling up due to an earlier long-running, massive query that created a 70G+ temporary file on /mnt. One long-term solution to mitigate corruption from such unexpected tmpdir usage would be to move tmpdir to a volume/partition separate from the datadir.
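A minimal sketch of what that could look like in the MariaDB configuration, assuming a dedicated volume mounted at /srv/tmp (the mount points and paths below are illustrative, not what the instance currently uses):

# /etc/mysql/my.cnf (or a conf.d snippet) -- paths are illustrative
[mysqld]
# table data and binlogs stay on the existing data volume
datadir = /srv/sqldata
# implicit temporary tables ("Copying to tmp table") go to a separate
# volume, so a runaway query can no longer fill the data partition
tmpdir  = /srv/tmp

With that split, a runaway query would still eventually fail with "disk full", but only its own temp table would be affected rather than the binlogs and table data.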
Replication is finally working again, after multiple errors due to views with invalid references in labswiki. No idea why it's set up the way it is (the entire database appears to consist of views of tables in deploymentwiki), but I was able to work around it by dumping the schema and data separately. (In hindsight, the separate dumps may not have been strictly necessary, but they made it much easier to find the culprit dump queries.)
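For reference, the separate dumps were roughly of this shape. This is a sketch rather than the exact commands used; the host name comes from above and the database name is just an example:

# schema only: table and view definitions, no rows -- sketch, not the exact commands used
mysqldump --no-data --single-transaction -h deployment-db1 labswiki > labswiki-schema.sql
# data only: rows, no CREATE statements
mysqldump --no-create-info --single-transaction -h deployment-db1 labswiki > labswiki-data.sql

Dumping the schema on its own makes it much easier to see which definition trips the "references invalid table(s)" error, instead of it being buried in one huge combined dump.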
Good that this happened only on Beta. Imagine if that had happened in production....
I think this could not disrupt the production wikis because Flow is on a separate set of hosts there. However, I would like the code owners to have a look at Flow\Formatter\ContributionsQuery::queryRevisions to see whether this is reproducible in production.
Also, I would like to work with RelEng to make Beta more similar to production (as part of other ongoing tasks). Let me know if I can help with anything related to the databases.
Thanks @jcrespo, would it be useful to have a conversation about what you're thinking about specifically? We (RelEng) probably won't be able to commit any time to it right now, but similar work (effectively the same work as you describe, afaict) is scheduled in our Q3 plans: https://www.mediawiki.org/wiki/Wikimedia_Release_Engineering_Team/Goals/201516Q3 (assuming plans don't change, which is a big assumption ;) ).
I looked, and unfortunately that function is way too versatile: there's a 3x3 matrix of possible query types. I still have to analyze these queries to see what indexes they can use, but here's what the queries look like.
The basic structure of the query is this (things in [brackets] are optional, i.e. not always present):
SELECT * FROM flow_revision
  (joins go here)
WHERE rev_user_wiki='somewiki'
  AND workflow_wiki='somewiki'
  [AND workflow_namespace=somenumber]
  [AND rev_id < number | rev_id > number]
  (additional where clauses go here)
ORDER BY rev_id DESC
LIMIT number;
There are 3 options for the joins:
SELECT * FROM flow_revision
  INNER JOIN flow_tree_revision ON tree_rev_id=rev_id
  INNER JOIN flow_tree_node ON tree_descendant_id=tree_rev_descendant_id
  INNER JOIN flow_workflow ON workflow_id=tree_ancestor_id

--- or ---

SELECT * FROM flow_revision
  INNER JOIN flow_workflow ON workflow_id=rev_type_id AND rev_type='header'

--- or ---

SELECT * FROM flow_revision, flow_workflow, flow_tree_node
  WHERE workflow_id=tree_ancestor_id AND tree_descendant_id=rev_type_id AND rev_type='post-summary'
Then there are three possible sets of where clauses that will be added:
AND rev_user_id IS NULL AND rev_user_ip='something'

--- or ---

AND rev_user_id=number AND rev_user_ip IS NULL

--- or ---

AND rev_user_id > number AND rev_user_id NOT IN (number, number, number, ....) AND rev_user_ip IS NULL
I'll analyze these later, but my hunch is that the last query type, with the NOT IN and the range scan on rev_user_id, might be a problem.
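As a starting point for that analysis, something along these lines would show the plan and the candidate indexes for the suspect form (all literal values below are placeholders, not taken from the incident):

-- placeholders only: the user id, exclusion list and wiki are made up
EXPLAIN
SELECT * FROM flow_revision
INNER JOIN flow_workflow ON workflow_id = rev_type_id AND rev_type = 'header'
WHERE rev_user_wiki = 'enwiki'
  AND workflow_wiki = 'enwiki'
  AND rev_user_id > 100
  AND rev_user_id NOT IN (101, 102, 103)
  AND rev_user_ip IS NULL
ORDER BY rev_id DESC
LIMIT 51;

-- and the indexes it has to choose from
SHOW INDEX FROM flow_revision;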
Do we have a way to find out what the slow query was that caused this outage?
You might want to follow up on query optimization and code audit in a different task?
Unfortunately, we don't have the full query. The slow query log is disabled, and I didn't grab the full output of SHOW FULL PROCESSLIST (with a pager that wouldn't truncate it, at least) before killing the process. Perhaps we should experiment with enabling the slow query log with a suitable long_query_time to catch and analyze such troublesome queries in the future. I'll open a follow-up task.
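For the follow-up task, enabling it at runtime is roughly this; the threshold and log path are starting-point guesses, not settled values:

-- takes effect immediately; needs matching my.cnf entries to survive a restart
SET GLOBAL slow_query_log = 1;
SET GLOBAL slow_query_log_file = '/var/log/mysql/slow.log';  -- path is illustrative
SET GLOBAL long_query_time = 10;  -- seconds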
It appears there was a query logged by HHVM's SlowTimer 5 minutes before the moment I killed the other one, in case this helps. It appears at least somewhat related, and may even be the same query if HHVM logs these preemptively; I'm not familiar enough with the mechanism to say.
{ "_index": "logstash-2015.10.23", "_type": "hhvm", "_id": "AVCV1YAja1EjumVdvU2M", "_score": null, "_source": { "message": "SlowTimer [59996ms] at runtime/ext_mysql: slow query: SELECT /* Flow\\\\Formatter\\\\ContributionsQuery::queryRevisions Luke081515 */ * FROM `flow_revision`,`flow_workflow`,`flow_tree_node` WHERE rev_user_id = '820' AND rev_user_ip IS NULL AND rev_user_wiki = 'enwiki' AND (rev_id < '#005G\\xE9\\xEFp\\\\0\\\\0\\\\0\\\\0\\\\0\\\\0') AND workflow_wiki = 'enwiki' AND (tree_descendant_id = rev_type_id) AND rev_type = 'post-summary' ORDER BY rev_id DESC LIMIT 51", "@version": "1", "@timestamp": "2015-10-23T17:53:28.000Z", "type": "hhvm", "host": "deployment-mediawiki02", "priority": 11, "timestamp": "Oct 23 17:53:28", "logsource": "deployment-mediawiki02", "program": "hhvm", "severity": 3, "facility": 1, "facility_label": "user-level", "severity_label": "Error", "level": "ERROR", "tags": [ "syslog", "es", "_grokparsefailure", "normalized_message_trimmed" ], "normalized_message": "SlowTimer [59996ms] at runtime/ext_mysql: slow query: SELECT /* Flow\\\\Formatter\\\\ContributionsQuery::queryRevisions Luke081515 */ * FROM `flow_revision`,`flow_workflow`,`flow_tree_node` WHERE rev_user_id = '820' AND rev_user_ip IS NULL AND rev_user_wi" }, "sort": [ 1445622808000 ] }
Looks like this is the same query as the one called out at T78671: Beta Cluster Special:Contributions lags by a long time and notes slow Flow queries, which is the one I suspected.
Spoke too soon: the query type I suspected of being bad was the one with a range scan on rev_user_id, while this one has a range scan on rev_id. I'll analyze both.
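For completeness, here is the logged query reshaped so it can be fed to EXPLAIN; the rev_id bound is a placeholder, since the real value in the log is a binary UUID:

-- the rev_id literal below is a stand-in for the binary UUID in the log
EXPLAIN
SELECT * FROM flow_revision, flow_workflow, flow_tree_node
WHERE rev_user_id = '820'
  AND rev_user_ip IS NULL
  AND rev_user_wiki = 'enwiki'
  AND rev_id < x'0000000000000000000000'
  AND workflow_wiki = 'enwiki'
  AND tree_descendant_id = rev_type_id
  AND rev_type = 'post-summary'
ORDER BY rev_id DESC
LIMIT 51;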
The root cause was a Flow query, which is tracked via T78671. Nothing left to be done here.