Change Details

18:15 < shinken-w> PROBLEM - Free space - all mounts on deployment-db2 is CRITICAL: CRITICAL: deployment-prep.deployment-db2.diskspace._mnt.byte_percentfree (<11.11%)
18:24 < wmf-insec> Project beta-update-databases-eqiad build #3855: FAILURE in 4 min 9 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/3855/
18:26 < shinken-w> PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
18:26 < shinken-w> PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
18:34 < shinken-w> PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
18:35 < shinken-w> PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
18:39 < shinken-w> RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 38951 bytes in 0.470 second response time
18:39 < shinken-w> RECOVERY - App Server Main HTTP Response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 38645 bytes in 1.207 second response time
18:41 < shinken-w> RECOVERY - App Server Main HTTP Response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 38636 bytes in 0.530 second response time
18:41 < shinken-w> PROBLEM - Free space - all mounts on deployment-bastion is CRITICAL: CRITICAL: deployment-prep.deployment-bastion.diskspace._var.byte_percentfree (<60.00%)
18:41 < shinken-w> RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 30385 bytes in 1.020 second response time
18:45 < shinken-w> PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
18:47 < shinken-w> PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
18:47 < shinken-w> PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
18:48 <+twentyaft> is it a general problem with labs infrastructure? I don't see a pattern ...
18:49 <+ greg-g> https://logstash-beta.wmflabs.org/#/dashboard/elasticsearch/fatalmonitor
18:49 <+ greg-g> lot of db errors
18:50 <+marxarell> "Error: 1021 Disk full (/mnt/tmp/#sql_6f5_1); waiting for someone to free some space..."
18:51 < shinken-w> PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
18:52 <+twentyaft> deployment-db2
18:52 <+twentyaft> has full /mnt
18:53 <+twentyaft> and there isn't much I see to delete
18:53 <+twentyaft> I can delte the error log file
18:54 <+ greg-g> :/
18:54 <+marxarell> looks like a massive temp disk table
18:54 <+marxarell> at /mnt/tmp/#sql_6f5_0.MAD
18:54 <+marxarell> 77G
18:54 <+twentyaft> so someone ran a very naughty query?
18:54 <+marxarell> likely
18:55 <+twentyaft> can we kill said query?
18:55 <+twentyaft> I don't know how to get root on mysql
18:55 <+marxarell> "| 15840440 | wikiadmin | 10.68.16.127:60495 | enwiki | Query | 3775 | Copying to tmp table | SELECT /* Flow\Formatter\ContributionsQuery::queryRevisions Luke081515 */ * F"
18:55 <+twentyaft> is the password stored somewhere I can look it up?
18:56 <+marxarell> twentyafterfour: you can sudo, then just `mysql`
18:56 <+marxarell> i'm guessing it's stored in root's my.cnf
18:56 <+twentyaft> oh I tried sudo mysql; didn't work, but sudo su; mysql; did work
18:58 <+twentyaft> so should we kill it? it's almost certainly not going to be able to complete since the disk can't be enlarged to accommodate ;)
18:58 <+marxarell> !log Killed mysql process 15840440 on account of its gargantuan temp file filling up /mnt
18:58 < qa-morebo> Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
18:58 <+twentyaft> marxarelli: nice work
18:59 <+twentyaft> !log deleted atop.log.* files on deployment-bastion. when are we going to enlarge /var on this instance. grr
18:59 <+marxarell> take that, fiend!
18:59 < qa-morebo> Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
19:00 < shinken-w> RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 38951 bytes in 0.520 second response time
19:00 < shinken-w> RECOVERY - App Server Main HTTP Response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 38652 bytes in 0.647 second response time
19:01 < shinken-w> RECOVERY - Free space - all mounts on deployment-bastion is OK: OK: All targets OK
19:02 < shinken-w> RECOVERY - App Server Main HTTP Response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 38638 bytes in 0.512 second response time
19:55 < Krenair> State: Master has sent all binlog to slave; waiting for binlog to be updated
19:55 < Krenair> but on db2:
19:55 < Krenair> State: Waiting for master to send event
19:56 < Krenair> I wonder if that's correct..
19:59 < Luke08151> hm. If this error was produced by my last action, it could be only this one: http://en.wikipedia.beta.wmflabs.org/w/index.php?title=Special%3ALog&type=rights&user=&page=User%3ASelenium+Echo+user+2&year=&month=-1&tagfilter=
19:59 < Luke08151> But I don't know, why this action can make a db crash
20:00 < Krenair> Luke081515, sounds like it was you loading a user's contributions
20:01 < Luke08151> db in cause of a read action? o.O
20:01 < Krenair> sure
20:03 < Luke08151> I looked up the contributions of selenium user 2 and selenium user. selenium user later, and I loaded the pages of him with Special:Nuke, but don't did an action
20:03 < Luke08151> maybe an error there?
20:04 < Luke08151> or it
20:04 < Luke08151> *or at http://en.wikipedia.beta.wmflabs.org/wiki/Talk:Flow_QA, this is a very big flow board
20:05 < Luke08151> I hope that helps you
20:22 < Luke08151> greg-g: Good, that this happens only at beta. Imagine that would happend at production....
20:35 < Krenair> Why is Seconds_Behind_Master NULL...
20:36 < Krenair> seems that means replication broke
20:50 <+marxarell> i'm seeing readonly errors from api calls to bc
20:50 < Krenair> to beta?
20:50 <+ greg-g> :(
20:50 <+marxarell> yeah
20:50 < Krenair> yes, it's read-only at the moment
20:50 <+marxarell> ah, ok
20:50 < Krenair> replication from deployment-db1 to deployment-db2 is broken
20:54 <+marxarell> !log deployment-db2 shows slave io but slave sql failed on duplicate key
20:54 < qa-morebo> Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
21:13 <+marxarell> !log deployment-db1 binlog deployment-db1-bin.000062 appears corrupt
21:13 < qa-morebo> Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
22:18 <+marxarell> !log dump of deployment-db1 failed due to "View 'labswiki.bounce_records' references invalid table(s)"
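
Note on the 18:55 process-list row: the runaway query in this log was found by reading a process-list row by hand. The sketch below shows one way such a row could be parsed and flagged automatically. It is a minimal illustration, not tooling that was used during this incident. The names `ProcessRow`, `parse_processlist_row` and `is_runaway`, and the 600-second threshold, are hypothetical; the column order is assumed to match the row quoted in the log (id, user, host, db, command, time, state, info).

```python
from typing import NamedTuple, Optional


class ProcessRow(NamedTuple):
    """One row of process-list output (hypothetical container type)."""
    pid: int
    user: str
    host: str
    db: str
    command: str
    seconds: int
    state: str
    info: str


def parse_processlist_row(line: str) -> Optional[ProcessRow]:
    """Parse one '|'-delimited row; return None for borders and headers."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    # A data row needs eight columns, a numeric id, and a numeric time.
    if len(cells) < 8 or not cells[0].isdigit() or not cells[5].isdigit():
        return None
    pid, user, host, db, command, seconds, state = cells[:7]
    # The query text may itself contain '|', so rejoin the remainder.
    info = "|".join(cells[7:])
    return ProcessRow(int(pid), user, host, db, command, int(seconds), state, info)


def is_runaway(row: ProcessRow, min_seconds: int = 600) -> bool:
    """Flag long-running queries that are building a temp table.

    The 600-second default is an arbitrary assumption for illustration.
    """
    return (
        row.command == "Query"
        and row.seconds >= min_seconds
        and "tmp table" in row.state
    )


if __name__ == "__main__":
    # The row quoted at 18:55, truncated exactly as it appears in the log.
    sample = (
        r"| 15840440 | wikiadmin | 10.68.16.127:60495 | enwiki | Query | 3775 "
        r"| Copying to tmp table | SELECT /* Flow\Formatter\ContributionsQuery"
        r"::queryRevisions Luke081515 */ * F"
    )
    row = parse_processlist_row(sample)
    if row is not None and is_runaway(row):
        print(f"candidate to kill: process {row.pid} ({row.seconds}s, {row.state})")
```

A check like this only narrows down candidates; whether to kill a process remains a human decision, as it was here.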