
deployment-fluorine becomes unresponsive frequently
Closed, Resolved, Public

Description

21:57 < etonkovid> something is wrong with fluorine? ssh deployment-fluorine.eqiad.wmflabs -- does not do anything

gjg@deployment-tin:~$ ping deployment-fluorine
PING deployment-fluorine.deployment-prep.eqiad.wmflabs (10.68.16.198) 56(84) bytes of data.
64 bytes from deployment-fluorine.deployment-prep.eqiad.wmflabs (10.68.16.198): icmp_seq=1 ttl=64 time=0.174 ms
64 bytes from deployment-fluorine.deployment-prep.eqiad.wmflabs (10.68.16.198): icmp_seq=2 ttl=64 time=0.228 ms
64 bytes from deployment-fluorine.deployment-prep.eqiad.wmflabs (10.68.16.198): icmp_seq=3 ttl=64 time=0.209 ms
^C
--- deployment-fluorine.deployment-prep.eqiad.wmflabs ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2001ms
rtt min/avg/max/mdev = 0.174/0.203/0.228/0.027 ms
gjg@deployment-tin:~$ ssh deployment-fluorine
[no response.....]
^C

Event Timeline

I pressed buttons and it seems to work now. Try again?

So I first tried a 'hard reboot' in Horizon but didn't immediately notice anything. I then logged into silver, ran OS_TENANT_NAME=deployment-prep nova reboot deployment-fluorine, and checked the action log again. It seems that I ended up rebooting it twice, but it worked.

Happened again. I worked around it by rebooting in wikitech, but this shouldn't keep happening.

greg renamed this task from "Can't ssh to deployment-fluorine" to "deployment-fluorine becomes unresponsive frequently". Jul 14 2016, 11:16 PM
greg removed AlexMonk-WMF as the assignee of this task.
greg triaged this task as High priority.

This happened again today, but as far as nova is concerned things are OK. Tyler also noted he could connect to port 22 but not log in over ssh. I'm beginning to suspect something with the host itself is wonky and not nova.

edit: I rebooted it as well :)
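
The "can connect to 22 but not ssh" observation can be narrowed down a little with a small probe. This is only a sketch, and probe_ssh is a hypothetical helper, not something used in this task. It checks whether the TCP connection opens and whether the server then sends its SSH identification line (the "SSH-2.0-..." banner an SSH server sends right after connect). A host that accepts the connection and shows a banner but still hangs at login would point at something later in the session setup rather than at the network.

```python
import socket


def probe_ssh(host, port=22, timeout=5.0):
    """Return (tcp_ok, banner).

    tcp_ok is False if the TCP connection could not be opened.
    banner is None if the server sent nothing before the timeout.
    Hypothetical helper for illustration only.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Refused, unreachable, or timed out during connect.
        return (False, None)
    with sock:
        sock.settimeout(timeout)
        try:
            data = sock.recv(256)
        except OSError:
            # Connected, but no identification line arrived in time.
            return (True, None)
    banner = data.decode("ascii", errors="replace").strip()
    return (True, banner or None)


if __name__ == "__main__":
    # Hostname taken from the task; adjust as needed.
    print(probe_ssh("deployment-fluorine.deployment-prep.eqiad.wmflabs"))
```

A result of (True, None) would mean the port is open but sshd is not talking, which is a different failure from a refused connection.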

Hmmm, well one thing is /srv/ is full:

thcipriani@deployment-fluorine:/srv/mw-log$ df -h
Filesystem                          Size  Used Avail Use% Mounted on
/dev/vda1                            18G  2.1G   15G  13% /
udev                                3.9G  8.0K  3.9G   1% /dev
tmpfs                               799M  276K  799M   1% /run
none                                5.0M     0  5.0M   0% /run/lock
none                                3.9G     0  3.9G   0% /run/shm
/dev/mapper/vd-second--local--disk   61G   58G  4.0K 100% /srv
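
For catching this before the host wedges, a check along these lines could flag a nearly full mount. It is a rough sketch: the function names and the 90% threshold are assumptions, not anything configured on this instance. Note that df's Use% is based on used / (used + available), which is why /srv above shows 100% even though 58G of 61G is used; the sketch follows the same arithmetic (df also rounds up, which the sketch does not).

```python
import shutil


def df_use_percent(used, avail):
    """Percentage in the style of df's Use% column: used / (used + avail)."""
    if used + avail == 0:
        return 0.0
    return 100.0 * used / (used + avail)


def check_mount(path, max_percent=90.0):
    """Return (percent_used, is_over_threshold) for the filesystem at path.

    max_percent is an illustrative threshold, not a value from this task.
    """
    usage = shutil.disk_usage(path)
    # usage.free is the space available to unprivileged users,
    # which corresponds to df's Avail column.
    pct = df_use_percent(usage.used, usage.free)
    return pct, pct >= max_percent


if __name__ == "__main__":
    print(check_mount("/srv"))
```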

Deleting deployment-fluorine:/srv/mw-log/archive/*-20160{5,6}* freed 30 GB.

So far I can log out and still log back in since the last reboot, so there's something.
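
A cleanup like this can be previewed before anything is removed. The sketch below is an assumption about how one might do it, not the command that was run: find_archives and prune are hypothetical helpers, and since Python's glob does not expand {5,6} braces, the months are passed as separate prefixes. With dry_run=True it only reports how many bytes would be freed.

```python
import glob
import os


def find_archives(archive_dir, months):
    """Return regular files in archive_dir whose names contain -<month>.

    months are date prefixes such as "201605"; hypothetical helper.
    """
    paths = set()
    for month in months:
        paths.update(glob.glob(os.path.join(archive_dir, f"*-{month}*")))
    return sorted(p for p in paths if os.path.isfile(p))


def prune(archive_dir, months, dry_run=True):
    """Return bytes freed (or that would be freed when dry_run is True)."""
    freed = 0
    for path in find_archives(archive_dir, months):
        freed += os.path.getsize(path)
        if not dry_run:
            os.remove(path)
    return freed


if __name__ == "__main__":
    # Dry run against the directory named in the task.
    print(prune("/srv/mw-log/archive", ["201605", "201606"], dry_run=True))
```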

Change 299672 had a related patch set uploaded (by Thcipriani):
Use hiera for udp2log-mw logrotate count

https://gerrit.wikimedia.org/r/299672

Change 299672 merged by Dzahn:
Use hiera for udp2log-mw logrotate count

https://gerrit.wikimedia.org/r/299672

Patch merged. Donezors.