13:27 <phe> valhallasw`cloud, can you get a look if tools-webgrid-lighttpd-1201 has trouble?
13:28 <valhallasw`cloud> phe: what's wrong?
13:28 <phe> my tools return 404, web server seems up from qstat but webservice restart timeout trying to kill the server
13:28 <phe> it runs on tools-webgrid-lighttpd-1201 which seems down, I can't ssh to
13:29 <phe> other tools on 1201 seems to freeze too
Description
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | None | T124133 NFS overload is causing instances to freeze |
| Resolved | | chasemp | T122719 tools-webgrid-lighttpd-1201 webservices and ssh unaccessible |
Event Timeline
Comment Actions
root login also hangs:
debug1: Offering RSA public key: labs-root.id_rsa
debug1: Server accepts key: pkalg ssh-rsa blen 277
debug1: Authentication succeeded (publickey).
Authenticated to tools-webgrid-lighttpd-1201. (via proxy).
debug1: channel 0: new [client-session]
debug1: Requesting no-more-sessions@openssh.com
debug1: Entering interactive session.
Currently-running webservices are
valhallasw@tools-bastion-01:~/accountingtools$ qhost -h tools-webgrid-lighttpd-1201 -j
HOSTNAME ARCH NCPU LOAD MEMTOT MEMUSE SWAPTO SWAPUS
-------------------------------------------------------------------------------
global - - - - - - -
tools-webgrid-lighttpd-1201.eqiad.wmflabs lx26-amd64 4 - 7.8G - 23.9G -
job-ID prior name user state submit/start at queue master ja-task-ID
----------------------------------------------------------------------------------------------
596 0.30601 lighttpd-p tools.potd-f r 12/30/2015 04:08:02 webgrid-li MASTER
640 0.30601 lighttpd-d tools.dexbot r 12/30/2015 04:08:51 webgrid-li MASTER
678 0.30600 lighttpd-w tools.wikida r 12/30/2015 04:09:29 webgrid-li MASTER
710 0.30600 lighttpd-c tools.catmon r 12/30/2015 04:10:05 webgrid-li MASTER
775 0.30600 lighttpd-w tools.wiktio r 12/30/2015 04:10:48 webgrid-li MASTER
818 0.30600 lighttpd-s tools.sdbot r 12/30/2015 04:11:35 webgrid-li MASTER
855 0.30600 lighttpd-t tools.tools- r 12/30/2015 04:12:07 webgrid-li MASTER
876 0.30600 lighttpd-y tools.yadkar dr 12/30/2015 04:12:38 webgrid-li MASTER
909 0.30600 lighttpd-t tools.tree-o r 12/30/2015 04:13:13 webgrid-li MASTER
959 0.30600 lighttpd-g tools.geohac r 12/30/2015 04:14:14 webgrid-li MASTER
1053 0.30599 lighttpd-h tools.heimda r 12/30/2015 04:15:17 webgrid-li MASTER
1102 0.30599 lighttpd-b tools.blahma r 12/30/2015 04:16:10 webgrid-li MASTER
1148 0.30599 lighttpd-b tools.betabo r 12/30/2015 04:17:05 webgrid-li MASTER
1182 0.30599 lighttpd-w tools.wikili r 12/30/2015 04:18:02 webgrid-li MASTER
1218 0.30599 lighttpd-c tools.cats-p r 12/30/2015 04:18:32 webgrid-li MASTER
1272 0.30599 lighttpd-p tools.projek dr 12/30/2015 04:19:33 webgrid-li MASTER
1371 0.30598 lighttpd-c tools.cobain r 12/30/2015 04:20:56 webgrid-li MASTER
1409 0.30598 lighttpd-u tools.url-co r 12/30/2015 04:21:44 webgrid-li MASTER
1443 0.30598 lighttpd-t tools.transl r 12/30/2015 04:22:19 webgrid-li MASTER
1462 0.30598 lighttpd-i tools.icalen r 12/30/2015 04:22:51 webgrid-li MASTER
52246 0.30300 lighttpd-p tools.phetoo dr 12/31/2015 08:23:28 webgrid-li MASTER
After
qmod -rq "webgrid-lighttpd@tools-webgrid-lighttpd-1201.eqiad.wmflabs"
these are left:
876 0.30600 lighttpd-y tools.yadkar dr 12/30/2015 04:12:38 webgrid-li MASTER
1272 0.30599 lighttpd-p tools.projek dr 12/30/2015 04:19:33 webgrid-li MASTER
52246 0.30301 lighttpd-p tools.phetoo dr 12/31/2015 08:23:28 webgrid-li MASTER
I force-deleted those with
valhallasw@tools-bastion-01:~/accountingtools$ qdel -f 876 1272 52246
warning: valhallasw forced the deletion of job 876
warning: valhallasw forced the deletion of job 1272
warning: valhallasw forced the deletion of job 52246
which should bring the tools back online.
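For reference, picking out the jobs that still need a forced delete can be sketched as below. This is only an illustration, assuming the job rows look like the qhost -j listing pasted above and that a "d" in the state column marks a job with a pending deletion; the names JOB_ROW, stuck_job_ids and qdel_command are hypothetical helpers, not part of any existing tool. The sketch only builds the argument list and does not run anything against the grid.

```python
import re

# One job row of the listing pasted above:
# job-ID, priority, name, user, state, start date, start time, ...
JOB_ROW = re.compile(
    r"^\s*(?P<job_id>\d+)\s+(?P<prior>\d+\.\d+)\s+(?P<name>\S+)\s+"
    r"(?P<user>\S+)\s+(?P<state>[A-Za-z]+)\s+"
    r"\d{2}/\d{2}/\d{4}\s+\d{2}:\d{2}:\d{2}"
)


def stuck_job_ids(listing):
    """Return the IDs of jobs whose state contains 'd' (assumed: deletion pending)."""
    ids = []
    for line in listing.splitlines():
        match = JOB_ROW.match(line)
        # Header, separator and host lines do not match and are skipped.
        if match and "d" in match.group("state"):
            ids.append(int(match.group("job_id")))
    return ids


def qdel_command(ids):
    """Build the argument list for a forced delete; the caller decides whether to run it."""
    return ["qdel", "-f"] + [str(job_id) for job_id in ids]
```

Feeding it the listing from this comment should give the same three job IDs that were force-deleted by hand.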
Comment Actions
Same trouble on tools-webgrid-lighttpd-1202: ssh and my tools running on it freeze.
Comment Actions
At https://wikitech.wikimedia.org/wiki/Special:NovaInstance, "get console output" gives "Failed to get console output for instance tools-webgrid-lighttpd-1201.". Trying to reboot gives "Failed to reboot instance tools-webgrid-lighttpd-1201."
Comment Actions
I still cannot ssh into that instance:
[tim@passepartout ~]$ ssh -v tools-webgrid-lighttpd-1201.tools.eqiad.wmflabs
OpenSSH_7.1p2, OpenSSL 1.0.2e-fips 3 Dec 2015
debug1: Reading configuration data /home/tim/.ssh/config
debug1: /home/tim/.ssh/config line 10: Applying options for *.eqiad.wmflabs
debug1: /home/tim/.ssh/config line 16: Applying options for *.wmflabs
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 56: Applying options for *
debug1: Control socket "/home/tim/.ssh/scfc@tools-webgrid-lighttpd-1201.tools.eqiad.wmflabs:22" does not exist
debug1: Executing proxy command: exec ssh -a -q -W tools-webgrid-lighttpd-1201.tools.eqiad.wmflabs:22 bastion.wmflabs.org
debug1: permanently_drop_suid: 1000
debug1: identity file /home/tim/.ssh/id_rsa type 1
debug1: key_load_public: No such file or directory
debug1: identity file /home/tim/.ssh/id_rsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/tim/.ssh/id_dsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/tim/.ssh/id_dsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/tim/.ssh/id_ecdsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/tim/.ssh/id_ecdsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/tim/.ssh/id_ed25519 type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/tim/.ssh/id_ed25519-cert type -1
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_7.1
ssh_exchange_identification: Connection closed by remote host
[tim@passepartout ~]$