Beta cluster intermittent failures
Closed, Resolved (Public)

Description

Shinken has been blowing up -releng with errors since about 7pm PDT last night.

Intermittent reports of errors all morning.

Event Timeline

thcipriani raised the priority of this task to Needs Triage.
thcipriani updated the task description.
thcipriani subscribed.
greg triaged this task as Unbreak Now! priority. (Apr 23 2015, 3:25 PM)
greg updated the task description.
greg set Security to None.
greg subscribed.

This problem is still ongoing, although @coren and @Andrew may have found the root cause: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1346917

After updating the kernel on labvirt1001 and labvirt1002, the following guests may be problem free:

  • deployment-bastion
  • deployment-cache-text02
  • deployment-elastic08
  • deployment-lucid-salt
  • deployment-memc03
  • deployment-pdf01
  • deployment-rsync01
  • deployment-salt
  • deployment-urldownloader
  • deployment-cache-upload02
  • deployment-db1
  • deployment-eventlogging02
  • deployment-mathoid
  • deployment-memc04
  • deployment-pdf02
  • deployment-sca01
  • deployment-videoscaler01

Overnight monitoring of those libvirt guests should tell us whether or not the root cause of the problem has been solved.

From T96905, it seems MySQL/MariaDB is not started on boot, and deployment-db1 was rebooted on Thu Apr 23 at 23:53. I have restarted MySQL :-)

This should be all fixed now; I'm not seeing the intermittent VM stalls anymore and all kernels have been upgraded to the fixed kernel.