
[Regression] QA: Puppet failing for Role::Ci::Slave::Browsertests/elasticsearch
Closed, Resolved, Public

Description

Graph: Memory last week - Nagf

See also
https://bugzilla.wikimedia.org/show_bug.cgi?id=68260
https://bugzilla.wikimedia.org/show_bug.cgi?id=72014

As of about 23:00 UTC on October 15, Puppet has been failing on integration-dev-precise. The run cannot apply File[/var/lib/elasticsearch] in Role::Ci::Slave::Browsertests, so every elasticsearch resource that depends on it is skipped.

Before, on the morning of 2014-10-15:

Oct 15 07:01:02 integration-dev-precise puppet-agent[14037]: Sleeping for 26 seconds (splay is enabled)
Oct 15 07:01:28 integration-dev-precise puppet-agent[14037]: Retrieving plugin
Oct 15 07:01:30 integration-dev-precise puppet-agent[14037]: Loading facts
..
Oct 15 07:01:35 integration-dev-precise puppet-agent[14037]: Caching catalog for i-00000650.eqiad.wmflabs
Oct 15 07:01:36 integration-dev-precise puppet-agent[14037]: Applying configuration version '1413356223'
..
Oct 15 07:01:57 integration-dev-precise puppet-agent[14037]: hostname: integration-dev-precise
Oct 15 07:01:57 integration-dev-precise puppet-agent[14037]: (/Stage[main]/Role::Labs::Instance/Notify[hostname: integration-dev-precise]/message) defined 'message' as 'hostname: integration-dev-precise'
Oct 15 07:01:58 integration-dev-precise kernel: [1157554.617854] init: ganglia-monitor main process (14773) terminated with status 1
Oct 15 07:01:58 integration-dev-precise kernel: [1157554.617881] init: ganglia-monitor main process ended, respawning
Oct 15 07:01:58 integration-dev-precise puppet-agent[14037]: (/Stage[main]/Ganglia_new::Monitor::Service/Service[ganglia-monitor]/ensure) ensure changed 'stopped' to 'running'
Oct 15 07:01:58 integration-dev-precise puppet-agent[14037]: (/Stage[main]/Ganglia_new::Monitor::Service/Service[ganglia-monitor]) Unscheduling refresh on Service[ganglia-monitor]
Oct 15 07:01:58 integration-dev-precise kernel: [1157554.624667] init: ganglia-monitor main process (14774) terminated with status 1
Oct 15 07:01:58 integration-dev-precise kernel: [1157554.624694] init: ganglia-monitor main process ended, respawning
..
Oct 15 07:01:58 integration-dev-precise kernel: [1157554.678814] init: ganglia-monitor main process (14787) terminated with status 1
Oct 15 07:01:58 integration-dev-precise kernel: [1157554.678840] init: ganglia-monitor respawning too fast, stopped
Oct 15 07:02:01 integration-dev-precise puppet-agent[14037]: Finished catalog run in 25.22 seconds

After, late on 2014-10-15, shortly before midnight:

Oct 15 23:41:04 integration-dev-precise puppet-agent[5268]: Sleeping for 39 seconds (splay is enabled)
Oct 15 23:41:43 integration-dev-precise puppet-agent[5268]: Retrieving plugin
Oct 15 23:41:45 integration-dev-precise puppet-agent[5268]: Loading facts
..
Oct 15 23:41:53 integration-dev-precise puppet-agent[5268]: Caching catalog for i-00000650.eqiad.wmflabs
Oct 15 23:41:58 integration-dev-precise puppet-agent[5268]: Applying configuration version '1413416388'
..
Oct 15 23:42:37 integration-dev-precise puppet-agent[5268]: hostname: integration-dev-precise
Oct 15 23:42:37 integration-dev-precise puppet-agent[5268]: (/Stage[main]/Role::Labs::Instance/Notify[hostname: integration-dev-precise]/message) defined 'message' as 'hostname: integration-dev-precise'
Oct 15 23:42:41 integration-dev-precise puppet-agent[5268]: (/Stage[main]/Ganglia_new::Monitor::Service/Service[ganglia-monitor]/ensure) ensure changed 'stopped' to 'running'
Oct 15 23:42:41 integration-dev-precise kernel: [1217597.861552] init: ganglia-monitor main process (6548) terminated with status 1
Oct 15 23:42:41 integration-dev-precise kernel: [1217597.861576] init: ganglia-monitor main process ended, respawning
Oct 15 23:42:41 integration-dev-precise puppet-agent[5268]: (/Stage[main]/Ganglia_new::Monitor::Service/Service[ganglia-monitor]) Unscheduling refresh on Service[ganglia-monitor]
Oct 15 23:42:41 integration-dev-precise kernel: [1217597.868022] init: ganglia-monitor main process (6549) terminated with status 1
Oct 15 23:42:41 integration-dev-precise kernel: [1217597.868049] init: ganglia-monitor main process ended, respawning
..
Oct 15 23:42:41 integration-dev-precise kernel: [1217597.930890] init: ganglia-monitor main process (6561) terminated with status 1
Oct 15 23:42:41 integration-dev-precise kernel: [1217597.930913] init: ganglia-monitor respawning too fast, stopped
Oct 15 23:42:41 integration-dev-precise puppet-agent[5268]: (/Stage[main]/Role::Labs::Lvm::Mnt/Labs_lvm::Volume[second-local-disk]/Labs_lvm::Extend[/mnt]/Exec[extend-vd-/mnt]/returns) executed successfully
Oct 15 23:42:48 integration-dev-precise puppet-agent[5268]: (/Stage[main]/Role::Ci::Slave::Browsertests/File[/var/lib/elasticsearch]) Not removing directory; use 'force' to override
Oct 15 23:42:48 integration-dev-precise puppet-agent[5268]: (/Stage[main]/Role::Ci::Slave::Browsertests/File[/var/lib/elasticsearch]) Not removing directory; use 'force' to override
Oct 15 23:42:48 integration-dev-precise puppet-agent[5268]: Could not remove existing file
Oct 15 23:42:48 integration-dev-precise puppet-agent[5268]: (/Stage[main]/Role::Ci::Slave::Browsertests/File[/var/lib/elasticsearch]/ensure) change from directory to link failed: Could not remove existing file
Oct 15 23:42:48 integration-dev-precise puppet-agent[5268]: (/Stage[main]/Elasticsearch/File[/var/log/elasticsearch/elasticsearch_index_search_slowlog.log]) Dependency File[/var/lib/elasticsearch] has failures: true
Oct 15 23:42:48 integration-dev-precise puppet-agent[5268]: (/Stage[main]/Elasticsearch/File[/var/log/elasticsearch/elasticsearch_index_search_slowlog.log]) Skipping because of failed dependencies
Oct 15 23:42:48 integration-dev-precise puppet-agent[5268]: (/Stage[main]/Elasticsearch/File[/var/log/elasticsearch/elasticsearch_index_indexing_slowlog.log]) Dependency File[/var/lib/elasticsearch] has failures: true
Oct 15 23:42:48 integration-dev-precise puppet-agent[5268]: (/Stage[main]/Elasticsearch/File[/var/log/elasticsearch/elasticsearch_index_indexing_slowlog.log]) Skipping because of failed dependencies
Oct 15 23:42:48 integration-dev-precise puppet-agent[5268]: (/Stage[main]/Elasticsearch/File[/etc/logrotate.d/elasticsearch]) Dependency File[/var/lib/elasticsearch] has failures: true
Oct 15 23:42:48 integration-dev-precise puppet-agent[5268]: (/Stage[main]/Elasticsearch/File[/etc/logrotate.d/elasticsearch]) Skipping because of failed dependencies
Oct 15 23:42:48 integration-dev-precise puppet-agent[5268]: (/Stage[main]/Elasticsearch/File[/etc/elasticsearch/elasticsearch.yml]) Dependency File[/var/lib/elasticsearch] has failures: true
Oct 15 23:42:48 integration-dev-precise puppet-agent[5268]: (/Stage[main]/Elasticsearch/File[/etc/elasticsearch/elasticsearch.yml]) Skipping because of failed dependencies
Oct 15 23:42:48 integration-dev-precise puppet-agent[5268]: (/Stage[main]/Elasticsearch/File[/var/log/elasticsearch/elasticsearch.log]) Dependency File[/var/lib/elasticsearch] has failures: true
Oct 15 23:42:48 integration-dev-precise puppet-agent[5268]: (/Stage[main]/Elasticsearch/File[/var/log/elasticsearch/elasticsearch.log]) Skipping because of failed dependencies
Oct 15 23:42:48 integration-dev-precise puppet-agent[5268]: (/Stage[main]/Elasticsearch/File[/etc/elasticsearch/logging.yml]) Dependency File[/var/lib/elasticsearch] has failures: true
Oct 15 23:42:48 integration-dev-precise puppet-agent[5268]: (/Stage[main]/Elasticsearch/File[/etc/elasticsearch/logging.yml]) Skipping because of failed dependencies
Oct 15 23:42:48 integration-dev-precise puppet-agent[5268]: (/Stage[main]/Elasticsearch/File[/etc/default/elasticsearch]) Dependency File[/var/lib/elasticsearch] has failures: true
Oct 15 23:42:48 integration-dev-precise puppet-agent[5268]: (/Stage[main]/Elasticsearch/File[/etc/default/elasticsearch]) Skipping because of failed dependencies
Oct 15 23:42:48 integration-dev-precise puppet-agent[5268]: (/Stage[main]/Elasticsearch/Service[elasticsearch]) Dependency File[/var/lib/elasticsearch] has failures: true
Oct 15 23:42:48 integration-dev-precise puppet-agent[5268]: (/Stage[main]/Elasticsearch/Service[elasticsearch]) Skipping because of failed dependencies
Oct 15 23:42:49 integration-dev-precise puppet-agent[5268]: Finished catalog run in 53.29 seconds
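
The failing run above mixes one real failure with many skipped dependents. The following is a minimal sketch of how such a syslog excerpt could be summarized; it is not part of the original report. The function name `summarize` and the regular expression are illustrative assumptions that only cover the line shapes quoted in this task.

```python
import re

# Matches puppet-agent syslog lines of the form quoted in this task:
#   "<timestamp> <host> puppet-agent[<pid>]: (<resource path>) <message>"
# Assumption: the resource path itself never contains ") ".
LINE_RE = re.compile(
    r"puppet-agent\[\d+\]: \((?P<path>/Stage\[main\]/.*?)\) (?P<msg>.*)$"
)


def summarize(lines):
    """Split puppet-agent log lines into failed and skipped resources.

    Returns (failed, skipped): failed is a list of (path, message) tuples,
    skipped is a list of resource paths skipped due to failed dependencies.
    """
    failed = []
    skipped = []
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a per-resource puppet-agent line
        path, msg = match.group("path"), match.group("msg")
        if " failed: " in msg:
            failed.append((path, msg))
        elif msg.startswith("Skipping because of failed dependencies"):
            skipped.append(path)
    return failed, skipped


# Lines copied from the failing run quoted in this task.
SAMPLE = [
    "Oct 15 23:42:48 integration-dev-precise puppet-agent[5268]: Could not remove existing file",
    "Oct 15 23:42:48 integration-dev-precise puppet-agent[5268]: (/Stage[main]/Role::Ci::Slave::Browsertests/File[/var/lib/elasticsearch]/ensure) change from directory to link failed: Could not remove existing file",
    "Oct 15 23:42:48 integration-dev-precise puppet-agent[5268]: (/Stage[main]/Elasticsearch/File[/var/log/elasticsearch/elasticsearch_index_search_slowlog.log]) Dependency File[/var/lib/elasticsearch] has failures: true",
    "Oct 15 23:42:48 integration-dev-precise puppet-agent[5268]: (/Stage[main]/Elasticsearch/File[/var/log/elasticsearch/elasticsearch_index_search_slowlog.log]) Skipping because of failed dependencies",
    "Oct 15 23:42:48 integration-dev-precise puppet-agent[5268]: (/Stage[main]/Elasticsearch/Service[elasticsearch]) Skipping because of failed dependencies",
]

failed, skipped = summarize(SAMPLE)
for path, msg in failed:
    print("FAILED ", path, "->", msg)
for path in skipped:
    print("SKIPPED", path)
```

Run against the full excerpt, this would report the single `File[/var/lib/elasticsearch]` failure and list the elasticsearch resources that were skipped because of it.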

Attached: Graphs of the relevant time period from
https://tools.wmflabs.org/nagf/?project=integration#h_integration-dev-precise_memory


Attached:

nagf-mem.png (250×400 px, 13 KB)

Details

Reference
bz72255

Event Timeline

bzimport raised the priority of this task from to Needs Triage. Nov 22 2014, 3:50 AM
bzimport set Reference to bz72255.
bzimport added a subscriber: Unknown Object (MLST).

Created attachment 16815
Graph: Disk space last week - Nagf

The /mnt mount first appears during this puppet run.

Attached:

nagf-disk.png (250×400 px, 11 KB)

Created attachment 16816
Graph: Puppet runs last week - Nagf

Puppet starts failing at 23:00 UTC October 15.

Attached:

nagf-puppet.png (250×400 px, 9 KB)

zeljkofilipin set Security to None.
zeljkofilipin added a subscriber: hashar.
Krinkle claimed this task.

Haven't seen this error in the 2 instance re-creation sprints. Works for me.

greg raised the priority of this task from Low to Medium. Apr 30 2015, 3:49 PM
greg moved this task from Inbox to Done on the Browser-Tests-Infrastructure board.
greg lowered the priority of this task from Medium to Low. Apr 30 2015, 4:03 PM