
deployment-fluorine becomes unresponsive frequently
Closed, Resolved, Public

Description

21:57 < etonkovid> something is wrong with fluorine? ssh deployment-fluorine.eqiad.wmflabs -- does not do anything

gjg@deployment-tin:~$ ping deployment-fluorine
PING deployment-fluorine.deployment-prep.eqiad.wmflabs (10.68.16.198) 56(84) bytes of data.
64 bytes from deployment-fluorine.deployment-prep.eqiad.wmflabs (10.68.16.198): icmp_seq=1 ttl=64 time=0.174 ms
64 bytes from deployment-fluorine.deployment-prep.eqiad.wmflabs (10.68.16.198): icmp_seq=2 ttl=64 time=0.228 ms
64 bytes from deployment-fluorine.deployment-prep.eqiad.wmflabs (10.68.16.198): icmp_seq=3 ttl=64 time=0.209 ms
^C
--- deployment-fluorine.deployment-prep.eqiad.wmflabs ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2001ms
rtt min/avg/max/mdev = 0.174/0.203/0.228/0.027 ms
gjg@deployment-tin:~$ ssh deployment-fluorine
[no response.....]
^C

Event Timeline

greg created this task. Jul 13 2016, 10:08 PM
Restricted Application added subscribers: Zppix, Aklapper. Jul 13 2016, 10:08 PM

I pressed buttons and it seems to work now. Try again?

Mattflaschen-WMF closed this task as Resolved. Jul 13 2016, 11:23 PM
Mattflaschen-WMF assigned this task to AlexMonk-WMF.

@AlexMonk-WMF and @greg - it miraculously works now. Thx!

So I first tried a 'hard reboot' in Horizon but didn't immediately notice any effect. I then logged into silver, ran OS_TENANT_NAME=deployment-prep nova reboot deployment-fluorine, and checked the action log again. It seems that I ended up rebooting it twice, but it worked.

Mattflaschen-WMF reopened this task as Open. Jul 14 2016, 11:12 PM

Happened again. I worked around it by rebooting in wikitech, but this shouldn't keep happening.

greg renamed this task from "Can't ssh to deployment-fluorine" to "deployment-fluorine becomes unresponsive frequently". Jul 14 2016, 11:16 PM
greg removed AlexMonk-WMF as the assignee of this task.
greg triaged this task as High priority.
chasemp added a subscriber: chasemp. Edited Jul 18 2016, 8:48 PM

This happened again today, but as far as nova is concerned things are OK. Tyler also noted he could connect to port 22 but not ssh in. I'm beginning to suspect that something on the host itself is wonky, not nova.
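
The "port 22 connects but ssh does not" symptom can be checked from another host with a small probe. The following is only a sketch, not something that was run on this task: the function names, the 5 second timeout, and the three result labels are assumptions made for illustration. It opens a TCP connection and then waits for the identification string that an SSH server sends first (it begins with "SSH-").

```python
import socket


def classify_banner(data):
    """Label the first bytes read from the server (labels are illustrative)."""
    return "banner" if data.startswith(b"SSH-") else "no_banner"


def probe_ssh(host, port=22, timeout=5.0):
    """Return 'unreachable', 'no_banner', or 'banner' for host:port.

    'unreachable' -> the TCP connection itself failed
    'no_banner'   -> TCP connected, but no SSH identification string arrived
    'banner'      -> the server sent a line starting with 'SSH-'
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "unreachable"
    with sock:
        sock.settimeout(timeout)
        try:
            data = sock.recv(256)
        except OSError:
            # Covers timeouts and resets after the connection was accepted.
            return "no_banner"
    return classify_banner(data)


if __name__ == "__main__":
    print(probe_ssh("deployment-fluorine.deployment-prep.eqiad.wmflabs"))
```

A "no_banner" result would point at the host or its sshd rather than at the network path, which matches the suspicion above; it does not by itself say why the daemon is stuck.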

edit: I rebooted it as well :)

Hmmm, well one thing is /srv/ is full:

thcipriani@deployment-fluorine:/srv/mw-log$ df -h
Filesystem                          Size  Used Avail Use% Mounted on
/dev/vda1                            18G  2.1G   15G  13% /
udev                                3.9G  8.0K  3.9G   1% /dev
tmpfs                               799M  276K  799M   1% /run
none                                5.0M     0  5.0M   0% /run/lock
none                                3.9G     0  3.9G   0% /run/shm
/dev/mapper/vd-second--local--disk   61G   58G  4.0K 100% /srv
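
A check like the one below could flag this condition before the volume fills up completely. It is a minimal sketch under stated assumptions: the 90% threshold and the function names are hypothetical, and the percentage is computed as used / (used + available), rounded up, which is how GNU df is understood to derive its Use% column.

```python
import shutil


def use_percent(used, avail):
    """Used space as a whole percentage of used + available, rounded up."""
    total = used + avail
    if total == 0:
        return 0
    # Integer ceiling division, avoiding floating-point rounding surprises.
    return -(-100 * used // total)


def is_nearly_full(path, threshold=90):
    """True if the filesystem holding `path` is at or above `threshold` percent.

    The default threshold is an arbitrary illustrative value.
    """
    usage = shutil.disk_usage(path)
    return use_percent(usage.used, usage.free) >= threshold


if __name__ == "__main__":
    print("/srv nearly full:", is_nearly_full("/srv"))
```

With the figures from the df output above, 58G used and 4.0K available rounds up to 100%.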

Deleting deployment-fluorine:/srv/mw-log/archive/*-20160{5,6}* freed 30 GB.
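
For reference, the shell expands the brace pattern *-20160{5,6}* into two globs, *-201605* and *-201606*, so only archives stamped May or June 2016 were matched. The sketch below reproduces that selection without deleting anything; the function names and the file names in the usage example are hypothetical, not taken from the actual archive directory.

```python
from fnmatch import fnmatch


def expand_month_patterns(months):
    """Build one glob per month stamp, mirroring the shell's brace expansion."""
    return ["*-" + month + "*" for month in months]


def select_archives(names, months=("201605", "201606")):
    """Return the names that match any of the per-month globs."""
    patterns = expand_month_patterns(months)
    return [name for name in names if any(fnmatch(name, p) for p in patterns)]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    sample = ["api.log-20160512.gz", "api.log-20160630.gz", "api.log-20160701.gz"]
    for name in select_archives(sample):
        print(name)
```

Listing the matches first is a cheap way to confirm that a pattern does not reach newer logs before anything is removed.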

So far I can log out and still log back in since the last reboot, so there's something.

Change 299672 had a related patch set uploaded (by Thcipriani):
Use hiera for udp2log-mw logrotate count

https://gerrit.wikimedia.org/r/299672

greg assigned this task to thcipriani. Aug 6 2016, 8:35 AM

Change 299672 merged by Dzahn:
Use hiera for udp2log-mw logrotate count

https://gerrit.wikimedia.org/r/299672

greg closed this task as Resolved. Aug 19 2016, 11:08 PM

Patch merged. Donezors.