phlogiston-2 hangs every week
Closed, Resolved · Public · 40 Story Points

Description

Steps to Reproduce:

  1. restart phlogiston-2
  2. wait a few days (maybe 6.5 days, if the log message corresponds to the failure)

Actual Results:
Can't ssh to phlogiston-2 or browse to phlogiston-dev.

Expected Results:
Can ssh and browse.

Rebooting via the wikitech control panel works within a few minutes.

From the console log:

phlogiston-2 login: [557400.432106] INFO: task kworker/u16:0:6 blocked for more than 120 seconds.
[557400.434759]       Not tainted 3.13.0-77-generic #121-Ubuntu
[557400.435392] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[557400.437345] INFO: task jbd2/vda1-8:197 blocked for more than 120 seconds.
[557400.438123]       Not tainted 3.13.0-77-generic #121-Ubuntu
[557400.438752] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[557400.440052] INFO: task postgres:7260 blocked for more than 120 seconds.
[557400.440850]       Not tainted 3.13.0-77-generic #121-Ubuntu
[557400.441498] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[557520.440078] INFO: task kworker/u16:0:6 blocked for more than 120 seconds.
[557520.442162]       Not tainted 3.13.0-77-generic #121-Ubuntu
[557520.442795] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[557520.443857] INFO: task jbd2/vda1-8:197 blocked for more than 120 seconds.
[557520.444671]       Not tainted 3.13.0-77-generic #121-Ubuntu
[557520.445308] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[557520.446257] INFO: task atop:27460 blocked for more than 120 seconds.
[557520.446990]       Not tainted 3.13.0-77-generic #121-Ubuntu
[557520.447656] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[557520.448999] INFO: task postgres:7260 blocked for more than 120 seconds.
[557520.449763]       Not tainted 3.13.0-77-generic #121-Ubuntu
[557520.450394] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[557520.451321] INFO: task ntpq:7263 blocked for more than 120 seconds.
[557520.452053]       Not tainted 3.13.0-77-generic #121-Ubuntu
[557520.452702] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[557640.452082] INFO: task kworker/u16:0:6 blocked for more than 120 seconds.
[557640.454929]       Not tainted 3.13.0-77-generic #121-Ubuntu
[557640.455562] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[557640.456621] INFO: task jbd2/vda1-8:197 blocked for more than 120 seconds.
[557640.457396]       Not tainted 3.13.0-77-generic #121-Ubuntu
[557640.458029] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
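
To check the "maybe 6.5 days" guess against the log, the hung-task lines can be parsed and the first timestamp converted to days. This is a minimal sketch, assuming the bracketed value is seconds since boot; the names `parse_hung_tasks`, `summarize` and `SAMPLE` are made up for illustration, and the sample is a few lines copied from the log above.

```python
import re
from collections import Counter

# Matches kernel hung-task warnings such as:
# [557400.432106] INFO: task jbd2/vda1-8:197 blocked for more than 120 seconds.
HUNG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\] INFO: task (?P<task>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds\."
)


def parse_hung_tasks(log_text):
    """Return a list of (timestamp_seconds, task_name, pid) tuples."""
    events = []
    for line in log_text.splitlines():
        match = HUNG_RE.search(line)
        if match:
            events.append(
                (float(match.group("ts")), match.group("task"), int(match.group("pid")))
            )
    return events


def summarize(events):
    """Return (reports per task, uptime in days at the first report)."""
    counts = Counter(task for _, task, _ in events)
    first_ts = min(ts for ts, _, _ in events)
    return counts, first_ts / 86400.0


# A few lines copied from the console log in this task.
SAMPLE = """\
phlogiston-2 login: [557400.432106] INFO: task kworker/u16:0:6 blocked for more than 120 seconds.
[557400.437345] INFO: task jbd2/vda1-8:197 blocked for more than 120 seconds.
[557400.440052] INFO: task postgres:7260 blocked for more than 120 seconds.
[557520.443857] INFO: task jbd2/vda1-8:197 blocked for more than 120 seconds.
"""

if __name__ == "__main__":
    counts, days = summarize(parse_hung_tasks(SAMPLE))
    print(counts.most_common(), round(days, 2))
```

On the sample, the first report lands at roughly 6.45 days of uptime, and the reports repeat every 120 seconds, which matches the hung-task timeout named in the messages.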

Event Timeline

Restricted Application added a project: User-bd808. · Mar 14 2016, 5:30 PM
Restricted Application added a subscriber: Aklapper.
bd808 removed bd808 as the assignee of this task. · Mar 14 2016, 6:35 PM
bd808 removed a project: User-bd808.
bd808 added a subscriber: bd808.

This looks to be a bursty i/o problem. The kernel is too busy doing something (probably flushing buffered disk writes to disk) to do anything else.

Do you have some cron job(s) that kick off and put a whole lot of pressure on ram and/or disk?
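
One way to test the write-flush theory is to watch how much data is waiting to be written back while the cron job runs. The sketch below is only an illustration: it parses text in the format of /proc/meminfo and adds the Dirty and Writeback fields. The function names and the sample values are invented for the example, not measurements from phlogiston-2.

```python
def parse_meminfo(text):
    """Map each /proc/meminfo field name to its integer value (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def writeback_backlog_kb(fields):
    """Data not yet on disk: dirty pages plus pages currently being written."""
    return fields.get("Dirty", 0) + fields.get("Writeback", 0)


# Illustrative values only, in the format /proc/meminfo uses.
SAMPLE_MEMINFO = """\
MemTotal:       16433276 kB
MemFree:          204812 kB
Dirty:            812344 kB
Writeback:         40960 kB
"""

if __name__ == "__main__":
    # On a live Linux host, read the real file instead of the sample:
    # with open("/proc/meminfo") as f: text = f.read()
    print(writeback_backlog_kb(parse_meminfo(SAMPLE_MEMINFO)), "kB pending")
```

A backlog that keeps growing during the job would support the flush explanation; a small, stable one would point elsewhere.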

There is a daily cron job that runs every evening (PT). It doesn't max out RAM or disk, but it does run one CPU at 100%, mostly on big PostgreSQL queries. However, the same arrangement on a smaller server, phlogiston-1, has never hung like this.

Meanwhile, I can't find the controls to restart it; did the wikitech interface change? The closest I can find is https://wikitech.wikimedia.org/wiki/Nova_Resource:Phlogiston.

Luke081515 moved this task from Triage to Backlog on the Cloud-Services board. · Mar 25 2016, 3:00 PM

Possibly related: the server was in a very weird state today, with load at 66 growing to 68. ps -auxf showed a lot of these:

root     12928  0.0  0.2 108296 35692 ?        D    15:52   0:00 /usr/bin/ruby /usr/bin/puppet agent --onetime --no-daemonize -
root     13280  0.0  0.0  30604  2232 ?        D    16:16   0:00 apt-get update -qq                                           
root     13366  0.0  0.2 108296 35704 ?        D    16:22   0:00 /usr/bin/ruby /usr/bin/puppet agent --onetime --no-daemonize -
root     13718  0.0  0.0  30604  2228 ?        D    16:46   0:00 apt-get update -qq                                            
root     13801  0.0  0.2 108304 35688 ?        D    16:52   0:01 /usr/bin/ruby /usr/bin/puppet agent --onetime --no-daemonize -
root     14164  0.0  0.0  30604  2228 ?        D    17:16   0:00 apt-get update -qq                                            
root     14250  0.0  0.2 108296 35692 ?        D    17:22   0:01 /usr/bin/ruby /usr/bin/puppet agent --onetime --no-daemonize -
root     14602  0.0  0.0  30604  2228 ?        D    17:46   0:00 apt-get update -qq
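
Every process in that listing is in state D (uninterruptible sleep, usually waiting on I/O), which is why the load average climbs while CPU use stays near zero. As a rough sketch, lines like these can be filtered out of ps output as follows; it assumes the standard ps aux column order (USER, PID, %CPU, %MEM, VSZ, RSS, TTY, STAT, START, TIME, COMMAND), and the function name is made up for the example.

```python
def uninterruptible(ps_text):
    """Return (pid, start_time, command) for each ps aux line in state D."""
    stuck = []
    for line in ps_text.splitlines():
        cols = line.split(None, 10)  # the 11th column is the full command line
        if len(cols) < 11 or not cols[1].isdigit():
            continue  # skip the header row and malformed lines
        if cols[7].startswith("D"):
            stuck.append((int(cols[1]), cols[8], cols[10].strip()))
    return stuck


# Two lines copied from the listing above, plus a header and a made-up
# sleeping process to show that non-D states are skipped.
SAMPLE_PS = """\
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     12928  0.0  0.2 108296 35692 ?        D    15:52   0:00 /usr/bin/ruby /usr/bin/puppet agent --onetime --no-daemonize -
root     13280  0.0  0.0  30604  2232 ?        D    16:16   0:00 apt-get update -qq
root       999  0.0  0.0  12345   678 ?        Ss   10:00   0:00 /usr/sbin/example-daemon
"""

if __name__ == "__main__":
    for pid, start, command in uninterruptible(SAMPLE_PS):
        print(pid, start, command)
```

Processes in state D cannot be killed until the I/O they are waiting on completes, which is consistent with kill having no effect here.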

It wouldn't respond to a kill or even a reboot from the command line, but it did reboot from the wikitech control panel.

JAufrecht edited projects, added Phlogiston (Technical Debt); removed Phlogiston.
Restricted Application added a subscriber: TerraCodes. · Apr 19 2016, 10:57 PM
JAufrecht set the point value for this task to 40. · Apr 26 2016, 4:13 PM
chasemp triaged this task as Medium priority. · May 31 2016, 3:18 PM
hashar closed this task as Resolved. · Dec 7 2016, 12:44 PM
hashar claimed this task.
hashar added a subscriber: MoritzMuehlenhoff.

CI had the same issue with jbd2/vda blocking (T138281), and I am pretty sure it was due to a kernel soft lock (T138281#2395843). Quoting from that task:

I haven't seen that kernel soft lock occurring for a while. I guess it was a bug in the kernel that ran on labvirt hosts.

That makes sense, the labvirt hosts were upgraded from 3.13 to 4.4 about a month ago.

So I am pretty sure phlogiston-2 is fine now that the Linux kernel has been upgraded on the OpenStack infrastructure (compute node).