
[postmortem] Beta Cluster outage: deployment-db2 disk filled up, locked db replication
Closed, ResolvedPublic

Description

18:15 < shinken-w> PROBLEM - Free space - all mounts on deployment-db2 is CRITICAL: CRITICAL: deployment-prep.deployment-db2.diskspace._mnt.byte_percentfree (<11.11%)
18:24 < wmf-insec> Project beta-update-databases-eqiad build #3855: FAILURE in 4 min 9 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/3855/
18:26 < shinken-w> PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
18:26 < shinken-w> PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
18:34 < shinken-w> PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
18:35 < shinken-w> PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
18:39 < shinken-w> RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 38951 bytes in 0.470 second response time  
18:39 < shinken-w> RECOVERY - App Server Main HTTP Response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 38645 bytes in 1.207 second response time  
18:41 < shinken-w> RECOVERY - App Server Main HTTP Response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 38636 bytes in 0.530 second response time  
18:41 < shinken-w> PROBLEM - Free space - all mounts on deployment-bastion is CRITICAL: CRITICAL: deployment-prep.deployment-bastion.diskspace._var.byte_percentfree (<60.00%)  
18:41 < shinken-w> RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 30385 bytes in 1.020 second response time
18:45 < shinken-w> PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
18:47 < shinken-w> PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
18:47 < shinken-w> PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
18:48 <+twentyaft> is it a general problem with labs infrastructure? I don't see a pattern ...  
18:49 <+   greg-g> https://logstash-beta.wmflabs.org/#/dashboard/elasticsearch/fatalmonitor  
18:49 <+   greg-g> lot of db errors  
18:50 <+marxarell> "Error: 1021 Disk full (/mnt/tmp/#sql_6f5_1); waiting for someone to free some space..."
18:51 < shinken-w> PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
18:52 <+twentyaft> deployment-db2  
18:52 <+twentyaft> has full /mnt  
18:53 <+twentyaft> and there isn't much I see to delete  
18:53 <+twentyaft> I can delte the error log file  
18:54 <+   greg-g> :/  
18:54 <+marxarell> looks like a massive temp disk table  
18:54 <+marxarell> at /mnt/tmp/#sql_6f5_0.MAD  
18:54 <+marxarell> 77G  
18:54 <+twentyaft> so someone ran a very naughty query?  
18:54 <+marxarell> likely  
18:55 <+twentyaft> can we kill said query?  
18:55 <+twentyaft> I don't know how to get root on mysql  
18:55 <+marxarell> "| 15840440 | wikiadmin   | 10.68.16.127:60495 | enwiki      | Query   |    3775 | Copying to tmp table                   | SELECT /* Flow\Formatter\ContributionsQuery::queryRevisions Luke081515 */  *  F"
18:55 <+twentyaft> is the password stored somewhere I can look it up?
18:56 <+marxarell> twentyafterfour: you can sudo, then just `mysql`
18:56 <+marxarell> i'm guessing it's stored in root's my.cnf
18:56 <+twentyaft> oh I tried sudo mysql; didn't work, but sudo su; mysql;  did work
18:58 <+twentyaft> so should we kill it? it's almost certainly not going to be able to complete since the disk can't be enlarged to accommodate ;)
18:58 <+marxarell> !log Killed mysql process 15840440 on account of its gargantuan temp file filling up /mnt
18:58 < qa-morebo> Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
18:58 <+twentyaft> marxarelli: nice work
18:59 <+twentyaft> !log deleted atop.log.* files on deployment-bastion. when are we going to enlarge /var on this instance. grr
18:59 <+marxarell> take that, fiend!
18:59 < qa-morebo> Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
19:00 < shinken-w> RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 38951 bytes in 0.520 second response time
19:00 < shinken-w> RECOVERY - App Server Main HTTP Response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 38652 bytes in 0.647 second response time
19:01 < shinken-w> RECOVERY - Free space - all mounts on deployment-bastion is OK: OK: All targets OK
19:02 < shinken-w> RECOVERY - App Server Main HTTP Response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 38638 bytes in 0.512 second response time

19:55 <   Krenair>    State: Master has sent all binlog to slave; waiting for binlog to be updated
19:55 <   Krenair> but on db2:
19:55 <   Krenair>    State: Waiting for master to send event
19:56 <   Krenair> I wonder if that's correct..
19:59 < Luke08151> hm. If this error was produced by my last action, it could be only this one: http://en.wikipedia.beta.wmflabs.org/w/index.php?title=Special%3ALog&type=rights&user=&page=User%3ASelenium+Echo+user+2&year=&month=-1&tagfilter=
19:59 < Luke08151> But I don't know, why this action can make a db crash
20:00 <   Krenair> Luke081515, sounds like it was you loading a user's contributions
20:01 < Luke08151> db in cause of a read action? o.O
20:01 <   Krenair> sure
20:03 < Luke08151> I looked up the contributions of selenium user 2 and selenium user. selenium user later, and I loaded the pages of him with Special:Nuke, but don't did an action
20:03 < Luke08151> maybe an error there?
20:04 < Luke08151> or it 
20:04 < Luke08151> *or at http://en.wikipedia.beta.wmflabs.org/wiki/Talk:Flow_QA, this is a very big flow board
20:05 < Luke08151> I hope that helps you
20:22 < Luke08151> greg-g: Good, that this happens only at beta. Imagine that would happend at production....
20:35 <   Krenair> Why is Seconds_Behind_Master NULL...
20:36 <   Krenair> seems that means replication broke


20:50 <+marxarell> i'm seeing readonly errors from api calls to bc
20:50 <   Krenair> to beta?
20:50 <+   greg-g> :(
20:50 <+marxarell> yeah
20:50 <   Krenair> yes, it's read-only at the moment
20:50 <+marxarell> ah, ok
20:50 <   Krenair> replication from deployment-db1 to deployment-db2 is broken

20:54 <+marxarell> !log deployment-db2 shows slave io but slave sql failed on duplicate key
20:54 < qa-morebo> Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
21:13 <+marxarell> !log deployment-db1 binlog deployment-db1-bin.000062 appears corrupt
21:13 < qa-morebo> Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
22:18 <+marxarell> !log dump of deployment-db1 failed due to "View 'labswiki.bounce_records' references invalid table(s)"

Event Timeline

greg raised the priority of this task from to Unbreak Now!.
greg updated the task description. (Show Details)
greg subscribed.

The corrupt binlog was likely the result of /mnt filling up due to an earlier long/massive query that created a 70G+ file in /mnt/sqldata. One long-term way to mitigate corruption from such unexpected tmpdir usage would be to move the tmpdir to a volume/partition separate from the datadir.
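
As a rough sketch of what checking and changing that would look like (assuming MariaDB/MySQL on the deployment-db instances; the example path is an assumption, not the instance's actual layout):

-- Where does the server currently put on-disk temporary tables?
SHOW GLOBAL VARIABLES LIKE 'tmpdir';
SHOW GLOBAL VARIABLES LIKE 'datadir';
-- tmpdir is not a dynamic variable, so it would have to be pointed at a
-- dedicated volume in the server config (e.g. tmpdir = /srv/mysql-tmp under
-- [mysqld] in my.cnf) and the server restarted.

That way a runaway implicit temporary table could only fill its own partition instead of starving the datadir and binlogs.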

Replication is finally working again, after multiple errors due to views with invalid references in labswiki. No idea why it's set up the way it is (the entire database appears to consist of views of tables in deploymentwiki), but I was able to work around it by dumping the data and schema separately. (After a final look, the separate dumps may not have been strictly necessary, but they made it much easier to find the culprit dump queries.)
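
For the record, re-pointing the slave after reloading it from a dump usually looks something like the following (a minimal sketch of the standard MariaDB steps, not necessarily the exact commands run here; the binlog coordinates are placeholders that would come from SHOW MASTER STATUS on deployment-db1 at dump time):

-- On deployment-db2, after importing the fresh dump:
STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST = 'deployment-db1',                  -- assumed master host
  MASTER_LOG_FILE = 'deployment-db1-bin.000063',   -- placeholder coordinates
  MASTER_LOG_POS = 4;
START SLAVE;
-- Seconds_Behind_Master should go back to a number (not NULL) once the SQL thread catches up:
SHOW SLAVE STATUS\G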

Good, that this happens only at beta. Imagine that would happend at production....

I don't think this could disrupt production wikis because Flow is on a separate set of hosts. However, I would like the code owners to have a look at the code of Flow\Formatter\ContributionsQuery::queryRevisions to see if that is reproducible on production.

Also, I would like to work with RelEng to make Beta more similar to production (as part of other ongoing tasks). Let me know if I can help with anything related to the databases.

Thanks @jcrespo. Would it be useful to have a conversation about what you're thinking of specifically? We (RelEng) probably won't be able to commit any time to it right now, but similar work (effectively the same work you describe, afaict) is scheduled in our Q3 plans: https://www.mediawiki.org/wiki/Wikimedia_Release_Engineering_Team/Goals/201516Q3 (assuming plans don't change, which is a big assumption ;) ).

Good, that this happens only at beta. Imagine that would happend at production....

I don't think this could disrupt production wikis because Flow is on a separate set of hosts. However, I would like the code owners to have a look at the code of Flow\Formatter\ContributionsQuery::queryRevisions to see if that is reproducible on production.

I looked, and unfortunately that function is way too versatile: there's a 3x3 matrix of possible query types. I still have to analyze these queries to see what indexes they can use, but here's what the queries look like.

The basic structure of the query is this (things in [brackets] are optional, i.e. not always present):

SELECT *
FROM flow_revision
(joins go here)
WHERE
rev_user_wiki='somewiki'
AND workflow_wiki='somewiki'
[AND workflow_namespace=somenumber]
[AND rev_id < number | rev_id > number]
(additional where clauses go here)
ORDER BY rev_id DESC
LIMIT number;

There are 3 options for the joins:

SELECT *
FROM flow_revision
INNER JOIN flow_tree_revision ON tree_rev_id=rev_id
INNER JOIN flow_tree_node ON tree_descendant_id=tree_rev_descendant_id
INNER JOIN flow_workflow ON workflow_id=tree_ancestor_id

--- or ---

SELECT *
FROM flow_revision
INNER JOIN flow_workflow ON workflow_id=rev_type_id AND rev_type='header'

--- or ---

SELECT *
FROM flow_revision, flow_workflow, flow_tree_node
WHERE workflow_id=tree_ancestor_id
AND tree_descendant_id=rev_type_id
AND rev_type='post-summary'

Then there are three possible sets of where clauses that will be added:

AND rev_user_id IS NULL
AND rev_user_ip='something'

--- or ---

AND rev_user_id=number
AND rev_user_ip IS NULL

--- or ---

AND rev_user_id > number
AND rev_user_id NOT IN (number, number, number, ....)
AND rev_user_ip IS NULL

I'll analyze these later, but my hunch is that the last variant, with the NOT IN and the range scan on rev_user_id, might be a problem.
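
For whoever digs into the index usage, here is one fully composed instance of the variant I suspect (the third join form plus the NOT IN WHERE set). The wiki name, user ids, and LIMIT are made-up placeholders, so treat it as a sketch for an EXPLAIN rather than a captured query:

EXPLAIN
SELECT *
FROM flow_revision, flow_workflow, flow_tree_node
WHERE workflow_id = tree_ancestor_id
  AND tree_descendant_id = rev_type_id
  AND rev_type = 'post-summary'
  AND rev_user_wiki = 'enwiki'
  AND workflow_wiki = 'enwiki'
  AND rev_user_id > 0
  AND rev_user_id NOT IN (1, 2, 3)
  AND rev_user_ip IS NULL
ORDER BY rev_id DESC
LIMIT 51;

The interesting parts of the EXPLAIN output would be which index (if any) is chosen on flow_revision and whether the ORDER BY rev_id DESC needs a filesort.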

Do we have a way to find out what the slow query was that caused this outage?

You might want to follow up on query optimization and code audit in a different task?

Do we have a way to find out what the slow query was that caused this outage?

Unfortunately, we don't have the full query. The slow query log is disabled, and I didn't grab the full output of SHOW FULL PROCESSLIST (with a pager that wouldn't truncate it, at least) before killing the process. Perhaps we should experiment with enabling the slow query log with a suitable long_query_time to catch and analyze such troublesome queries in the future. I'll open a follow-up task.
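
Something along these lines would be a starting point (a sketch only; the log path and threshold are assumptions, not values tested on Beta):

-- Enable the slow query log at runtime (MariaDB/MySQL dynamic variables):
SET GLOBAL slow_query_log = 1;
SET GLOBAL slow_query_log_file = '/var/log/mysql/slow.log';  -- path is an assumption
SET GLOBAL long_query_time = 10;  -- seconds; tune once we see how noisy it is
-- And next time, capture the whole statement before killing the thread:
SHOW FULL PROCESSLIST;

The same settings would also need to land in the server config to survive a restart.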

In case it helps, a query was logged by HHVM's SlowTimer 5 minutes before I killed the other one. It appears at least somewhat related, and may even be the same query if HHVM logs these preemptively; I'm not familiar enough with the mechanism to say.

{
  "_index": "logstash-2015.10.23",
  "_type": "hhvm",
  "_id": "AVCV1YAja1EjumVdvU2M",
  "_score": null,
  "_source": {
    "message": "SlowTimer [59996ms] at runtime/ext_mysql: slow query: SELECT /* Flow\\\\Formatter\\\\ContributionsQuery::queryRevisions Luke081515 */  *  FROM `flow_revision`,`flow_workflow`,`flow_tree_node`   WHERE rev_user_id = '820' AND rev_user_ip IS NULL AND rev_user_wiki = 'enwiki' AND (rev_id < '#005G\\xE9\\xEFp\\\\0\\\\0\\\\0\\\\0\\\\0\\\\0') AND workflow_wiki = 'enwiki' AND (tree_descendant_id = rev_type_id) AND rev_type = 'post-summary'  ORDER BY rev_id DESC LIMIT 51",
    "@version": "1",
    "@timestamp": "2015-10-23T17:53:28.000Z",
    "type": "hhvm",
    "host": "deployment-mediawiki02",
    "priority": 11,
    "timestamp": "Oct 23 17:53:28",
    "logsource": "deployment-mediawiki02",
    "program": "hhvm",
    "severity": 3,
    "facility": 1,
    "facility_label": "user-level",
    "severity_label": "Error",
    "level": "ERROR",
    "tags": [
      "syslog",
      "es",
      "_grokparsefailure",
      "normalized_message_trimmed"
    ],
    "normalized_message": "SlowTimer [59996ms] at runtime/ext_mysql: slow query: SELECT /* Flow\\\\Formatter\\\\ContributionsQuery::queryRevisions Luke081515 */  *  FROM `flow_revision`,`flow_workflow`,`flow_tree_node`   WHERE rev_user_id = '820' AND rev_user_ip IS NULL AND rev_user_wi"
  },
  "sort": [
    1445622808000
  ]
}

Looks like this is the same query as the one called out at T78671: Beta Cluster Special:Contributions lags by a long time and notes slow Flow queries, which is the one I suspected.

Looks like this is the same query as the one called out at T78671: Beta Cluster Special:Contributions lags by a long time and notes slow Flow queries, which is the one I suspected.

Spoke too soon: the query type I suspected of being bad was the one with a range scan on rev_user_id, whereas this one has a range scan on rev_id. I'll analyze both.
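
And here is the rev_id variant, reconstructed from the SlowTimer entry above so it can be EXPLAINed as well. The binary rev_id literal in the log is mangled, so a hex placeholder stands in for it; note that the logged statement, as captured, shows no workflow_id = tree_ancestor_id condition, which may be worth double-checking:

EXPLAIN
SELECT *
FROM flow_revision, flow_workflow, flow_tree_node
WHERE rev_user_id = '820'
  AND rev_user_ip IS NULL
  AND rev_user_wiki = 'enwiki'
  AND rev_id < x'0000000000000000'   -- placeholder for the logged binary rev_id value
  AND workflow_wiki = 'enwiki'
  AND tree_descendant_id = rev_type_id
  AND rev_type = 'post-summary'
ORDER BY rev_id DESC
LIMIT 51;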

hashar set Security to None.

Pro tip: you can syntax-highlight IRC logs in Markdown with lang=irc

Reopening since we still need to follow up on determining the root cause of the outage.

hashar renamed this task from Beta Cluster outage: deployment-db2 disk filled up, locked db replication to [postmortem] Beta Cluster outage: deployment-db2 disk filled up, locked db replication.Oct 30 2015, 10:50 PM
hashar lowered the priority of this task from Unbreak Now! to High.
hashar moved this task from To Triage to In-progress on the Beta-Cluster-Infrastructure board.

Reopening since we still need to follow up on determining the root cause of the outage.

This fell off our radar. What else should be done here?

Peter601980 renamed this task from [postmortem] Beta Cluster outage: deployment-db2 disk filled up, locked db replication to k.Dec 28 2015, 8:08 AM
Peter601980 closed this task as Invalid.
Peter601980 removed dduvall as the assignee of this task.
Peter601980 lowered the priority of this task from High to Lowest.
Peter601980 updated the task description. (Show Details)
Aklapper renamed this task from k to [postmortem] Beta Cluster outage: deployment-db2 disk filled up, locked db replication.Dec 28 2015, 9:30 AM
Aklapper reopened this task as Open.
Aklapper assigned this task to dduvall.
Aklapper raised the priority of this task from Lowest to High.
Aklapper updated the task description. (Show Details)
Aklapper added subscribers: hashar, Catrope, MZMcBride and 6 others.

The root cause was a Flow query, which is tracked via T78671. Nothing left to be done here.