18:15 < shinken-w> PROBLEM - Free space - all mounts on deployment-db2 is CRITICAL: CRITICAL: deployment-prep.deployment-db2.diskspace._mnt.byte_percentfree (<11.11%)
18:24 < wmf-insec> Project beta-update-databases-eqiad build #3855: FAILURE in 4 min 9 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/3855/
18:26 < shinken-w> PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
18:26 < shinken-w> PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
18:34 < shinken-w> PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
18:35 < shinken-w> PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
18:39 < shinken-w> RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 38951 bytes in 0.470 second response time
18:39 < shinken-w> RECOVERY - App Server Main HTTP Response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 38645 bytes in 1.207 second response time
18:41 < shinken-w> RECOVERY - App Server Main HTTP Response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 38636 bytes in 0.530 second response time
18:41 < shinken-w> PROBLEM - Free space - all mounts on deployment-bastion is CRITICAL: CRITICAL: deployment-prep.deployment-bastion.diskspace._var.byte_percentfree (<60.00%)
18:41 < shinken-w> RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 30385 bytes in 1.020 second response time
18:45 < shinken-w> PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
18:47 < shinken-w> PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
18:47 < shinken-w> PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
18:48 <+twentyaft> is it a general problem with labs infrastructure? I don't see a pattern ...
18:49 <+ greg-g> https://logstash-beta.wmflabs.org/#/dashboard/elasticsearch/fatalmonitor
18:49 <+ greg-g> lot of db errors
18:50 <+marxarell> "Error: 1021 Disk full (/mnt/tmp/#sql_6f5_1); waiting for someone to free some space..."
18:51 < shinken-w> PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
18:52 <+twentyaft> deployment-db2
18:52 <+twentyaft> has full /mnt
18:53 <+twentyaft> and there isn't much I see to delete
18:53 <+twentyaft> I can delte the error log file
18:54 <+ greg-g> :/
18:54 <+marxarell> looks like a massive temp disk table
18:54 <+marxarell> at /mnt/tmp/#sql_6f5_0.MAD
18:54 <+marxarell> 77G
18:54 <+twentyaft> so someone ran a very naughty query?
18:54 <+marxarell> likely
18:55 <+twentyaft> can we kill said query?
18:55 <+twentyaft> I don't know how to get root on mysql
18:55 <+marxarell> "| 15840440 | wikiadmin | 10.68.16.127:60495 | enwiki | Query | 3775 | Copying to tmp table | SELECT /* Flow\Formatter\ContributionsQuery::queryRevisions Luke081515 */ * F"
18:55 <+twentyaft> is the password stored somewhere I can look it up?
18:56 <+marxarell> twentyafterfour: you can sudo, then just `mysql`
18:56 <+marxarell> i'm guessing it's stored in root's my.cnf
18:56 <+twentyaft> oh I tried sudo mysql; didn't work, but sudo su; mysql; did work
18:58 <+twentyaft> so should we kill it? it's almost certainly not going to be able to complete since the disk can't be enlarged to accommodate ;)
18:58 <+marxarell> !log Killed mysql process 15840440 on account of its gargantuan temp file filling up /mnt
18:58 < qa-morebo> Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
18:58 <+twentyaft> marxarelli: nice work
18:59 <+twentyaft> !log deleted atop.log.* files on deployment-bastion. when are we going to enlarge /var on this instance. grr
18:59 <+marxarell> take that, fiend!
18:59 < qa-morebo> Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
19:00 < shinken-w> RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 38951 bytes in 0.520 second response time
19:00 < shinken-w> RECOVERY - App Server Main HTTP Response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 38652 bytes in 0.647 second response time
19:01 < shinken-w> RECOVERY - Free space - all mounts on deployment-bastion is OK: OK: All targets OK
19:02 < shinken-w> RECOVERY - App Server Main HTTP Response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 38638 bytes in 0.512 second response time
19:55 < Krenair> State: Master has sent all binlog to slave; waiting for binlog to be updated
19:55 < Krenair> but on db2:
19:55 < Krenair> State: Waiting for master to send event
19:56 < Krenair> I wonder if that's correct..
19:59 < Luke08151> hm. If this error was produced by my last action, it could be only this one: http://en.wikipedia.beta.wmflabs.org/w/index.php?title=Special%3ALog&type=rights&user=&page=User%3ASelenium+Echo+user+2&year=&month=-1&tagfilter=
19:59 < Luke08151> But I don't know, why this action can make a db crash
20:00 < Krenair> Luke081515, sounds like it was you loading a user's contributions
20:01 < Luke08151> db in cause of a read action? o.O
20:01 < Krenair> sure
20:03 < Luke08151> I looked up the contributions of selenium user 2 and selenium user. selenium user later, and I loaded the pages of him with Special:Nuke, but don't did an action
20:03 < Luke08151> maybe an error there?
20:04 < Luke08151> or it
20:04 < Luke08151> *or at http://en.wikipedia.beta.wmflabs.org/wiki/Talk:Flow_QA, this is a very big flow board
20:05 < Luke08151> I hope that helps you
20:22 < Luke08151> greg-g: Good, that this happens only at beta. Imagine that would happend at production....
20:35 < Krenair> Why is Seconds_Behind_Master NULL...
20:36 < Krenair> seems that means replication broke
20:50 <+marxarell> i'm seeing readonly errors from api calls to bc
20:50 < Krenair> to beta?
20:50 <+ greg-g> :(
20:50 <+marxarell> yeah
20:50 < Krenair> yes, it's read-only at the moment
20:50 <+marxarell> ah, ok
20:50 < Krenair> replication from deployment-db1 to deployment-db2 is broken
20:54 <+marxarell> !log deployment-db2 shows slave io but slave sql failed on duplicate key
20:54 < qa-morebo> Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
21:13 <+marxarell> !log deployment-db1 binlog deployment-db1-bin.000062 appears corrupt
21:13 < qa-morebo> Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
22:18 <+marxarell> !log dump of deployment-db1 failed due to "View 'labswiki.bounce_records' references invalid table(s)"