elasticsearch 1.6.0 fails to start after reboot
Closed, Resolved (Public)

Assigned To

	hashar

Authored By

	JanZerebecki
	Aug 18 2015, 6:52 PM

Description

The 1.6.0 version of the elasticsearch Debian package used for CI on precise fails to start elasticsearch after a reboot because it requires /var/run/elasticsearch/ to be a directory. The Debian package contains /var/run/elasticsearch/, but /var/run is a symlink to /run, which is on a tmpfs, so the directory is lost on every reboot. The start script doesn't create the directory when it is missing.
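A quick way to confirm this failure mode on an affected host can be sketched as below. This is only an illustration: the helper name `check_run_dir` is hypothetical and is not part of the package or of any existing tooling.

```shell
# check_run_dir DIR
# Hypothetical diagnostic helper (an assumption, not part of the package).
# Reports whether DIR's parent is a symlink and whether DIR itself exists.
# Returns 0 if DIR exists as a directory, 1 otherwise.
check_run_dir() {
    crd_dir="$1"
    crd_parent=$(dirname "$crd_dir")

    # On precise, /var/run is expected to be a symlink to /run (a tmpfs).
    if [ -L "$crd_parent" ]; then
        echo "$crd_parent is a symlink to $(readlink "$crd_parent")"
    fi

    if [ -d "$crd_dir" ]; then
        echo "$crd_dir exists"
        return 0
    fi

    echo "$crd_dir is missing"
    return 1
}
```

Running `check_run_dir /var/run/elasticsearch` right after a reboot should show whether the directory shipped by the package has disappeared.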

Details

	Subject	Repo	Branch	Lines +/-
	elasticsearch: ensure /var/run subdir exists	operations/puppet	production	+11 -0


Related Objects

		Status	Subtype	Assigned	Task
		Resolved		hashar	T109497 elasticsearch 1.6.0 fails to start after reboot
		Resolved		MoritzMuehlenhoff	T111781 Please backport ElasticSearch 1.7.x from wikimedia-trusty to wikimedia-precise for CI needs

Event Timeline

JanZerebecki created this task. Aug 18 2015, 6:52 PM

JanZerebecki raised the priority of this task to High.

JanZerebecki updated the task description.

JanZerebecki added projects: Elasticsearch, Continuous-Integration-Infrastructure.

JanZerebecki added subscribers: JanZerebecki, thcipriani.

Restricted Application added a project: Discovery-ARCHIVED. Aug 18 2015, 6:52 PM

Restricted Application added a subscriber: Aklapper.

Krenair subscribed. Aug 19 2015, 6:37 PM

greg added subscribers: greg, dduvall. Aug 19 2015, 6:43 PM

Krenair mentioned this in T109704: MediaWiki phpunit job failing with "Can't connect to local MySQL server". Aug 20 2015, 12:46 PM

Krinkle updated the task description. Aug 20 2015, 12:47 PM

Krinkle set Security to None.

hashar mentioned this in T110041: integration-slave-precise-1011 and integration-slave-precise-1014 went offline. Aug 24 2015, 2:48 PM

Change 233413 had a related patch set uploaded (by Hashar):
elasticsearch: ensure /var/run subdir exists

https://gerrit.wikimedia.org/r/233413

gerritbot added a project: Patch-For-Review. Aug 24 2015, 3:01 PM

https://gerrit.wikimedia.org/r/233413 teaches puppet to create /var/run/elasticsearch before starting the service. The instance would still deadlock on reboot until puppet has a chance to run and create the dir.

At least that unlocks puppet. The proper fix is to have the init script create the parent directory. Maybe a more recent Debian package fixes the issue.
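For illustration, an init-script step along these lines would cover the missing directory. This is a minimal sketch under stated assumptions, not the change shipped in any elasticsearch package: the function name `ensure_pid_dir` and its arguments are hypothetical.

```shell
# ensure_pid_dir DIR [OWNER]
# Hypothetical init-script helper (an assumption, not the packaged fix).
# Creates DIR if it is missing, as it would be after a reboot when its
# parent lives on a tmpfs, and optionally hands it to OWNER (user or
# user:group). Safe to call on every start because mkdir -p is idempotent.
ensure_pid_dir() {
    epd_dir="$1"
    epd_owner="${2:-}"

    # Refuse to run without a target directory.
    [ -n "$epd_dir" ] || return 1

    mkdir -p "$epd_dir" || return 1

    # Only change ownership when an owner was actually requested.
    if [ -n "$epd_owner" ]; then
        chown "$epd_owner" "$epd_dir" || return 1
    fi
    return 0
}
```

An init script could call something like `ensure_pid_dir /var/run/elasticsearch elasticsearch` just before launching the daemon. The `elasticsearch` owner here is also an assumption about the service account name.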

hashar claimed this task. Aug 24 2015, 3:05 PM

hashar moved this task from Untriaged to In-progress on the Continuous-Integration-Infrastructure board.

Applied a puppet workaround to force create the directory

This is fixed in elasticsearch package 1.7.1, which is in the wikimedia repos for trusty but not for precise. Can we also add the newer version to the precise repo?

Restricted Application added a subscriber: Matanya. Aug 27 2015, 1:22 PM

Or we can remove ElasticSearch from the Jenkins CI slaves. I think I got it installed for the CirrusSearch browser tests, which are now running on Trusty. The only reason we could still need ES on Precise would be if the CirrusSearch Zend tests require it as a backend.

In T109497#1583546, @hashar wrote:

Or we can remove ElasticSearch from the Jenkins CI slaves. I think I got it installed for the CirrusSearch browser tests, which are now running on Trusty. The only reason we could still need ES on Precise would be if the CirrusSearch Zend tests require it as a backend.

If we still need ES for some jobs, we can always set up specific slave server(s) for them. ES is currently the largest consumer of resident memory (1.4G) on most slaves, and it would be great to free that up for the vast majority of jobs that don't require it.

I do not know what the consensus is here. Should we apply the patch or abandon precise/1.6 support? @JanZerebecki ?

Can we identify the jobs that actually need ElasticSearch? We could move the jobs to Trusty. A potential blocker would be a PHPUnit Zend job that requires an ES backend.

Short term it seems easy to just backport elasticsearch for precise-wikimedia.

Filed T111781 to request the backport to Precise. Then we will just upgrade the package on the Precise instances and abandon https://gerrit.wikimedia.org/r/233413

Yes, doing T111781 is the easier short-term fix. In the long run we want to abandon precise anyway.

MoritzMuehlenhoff closed subtask T111781: Please backport ElasticSearch 1.7.x from wikimedia-trusty to wikimedia-precise for CI needs as Resolved. Sep 9 2015, 8:12 AM

Upgraded them:

root@integration-saltmaster:~# salt '*precise*' pkg.install elasticsearch
integration-slave-precise-1014.integration.eqiad.wmflabs:
    ----------
    elasticsearch:
        ----------
        new:
            1.7.1
        old:
            1.6.0
integration-slave-precise-1012.integration.eqiad.wmflabs:
    ----------
    elasticsearch:
        ----------
        new:
            1.7.1
        old:
            1.6.0
integration-slave-precise-1011.integration.eqiad.wmflabs:
    ----------
    elasticsearch:
        ----------
        new:
            1.7.1
        old:
            1.6.0
integration-slave-precise-1013.integration.eqiad.wmflabs:
    ----------
    elasticsearch:
        ----------
        new:
            1.7.1
        old:
            1.6.0

Change 233413 abandoned by Hashar:
elasticsearch: ensure /var/run subdir exists

Reason:
I removed it from the integration puppetmaster. Precise now has ElasticSearch 1.7.1 which comes with a fix to the init script.

https://gerrit.wikimedia.org/r/233413

I have rebooted integration-slave-precise-1014 and it came back.

Then I removed the Gerrit change https://gerrit.wikimedia.org/r/233413 from the puppetmaster, ran puppet and rebooted. Machine came back.

The issue now, though, is that elasticsearch does not start :-D

Actually ElasticSearch is started and the machines reboot just fine. I rebooted all Precise slaves.

hashar moved this task from In-progress to Done on the Continuous-Integration-Infrastructure board. Sep 14 2015, 2:09 PM

greg added a project: Essential-Work. Sep 21 2015, 8:53 PM
