elasticsearch 1.6.0 fails to start after reboot
Closed, Resolved · Public

Description

The 1.6.0 version of the elasticsearch Debian package used for CI on precise fails to start elasticsearch after a reboot because it requires /var/run/elasticsearch/ to be a directory. The Debian package contains /var/run/elasticsearch/, but /var/run is a symlink to /run, which is on a tmpfs, so the directory is lost on every reboot. The start script doesn't create the directory when it is missing.
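To confirm the diagnosis on a given host, a small check along these lines may help. It is only a sketch: `describe_path` is a hypothetical helper name, and it assumes GNU coreutils (`readlink -f` and `stat -f -c %T`) are available.

```shell
#!/bin/sh
# Hypothetical helper (not part of the elasticsearch package):
# print where a path really points and the filesystem type it lives on.
# Assumes GNU coreutils for `readlink -f` and `stat -f -c %T`.
describe_path() {
    # Resolve symlinks, e.g. /var/run -> /run on affected hosts.
    target=$(readlink -f "$1") || return 1
    # Ask for the filesystem type name of the resolved location.
    fstype=$(stat -f -c %T "$target") || return 1
    printf '%s -> %s (%s)\n' "$1" "$target" "$fstype"
}
```

Running `describe_path /var/run` on an affected slave would be expected to report /run as the target with a tmpfs filesystem type, which is why anything the package installs under it does not survive a reboot.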

Event Timeline

JanZerebecki raised the priority of this task from to High.
JanZerebecki updated the task description.
JanZerebecki added subscribers: JanZerebecki, thcipriani.
Restricted Application added a subscriber: Aklapper.
Krinkle set Security to None.

Change 233413 had a related patch set uploaded (by Hashar):
elasticsearch: ensure /var/run subdir exists

https://gerrit.wikimedia.org/r/233413

https://gerrit.wikimedia.org/r/233413 teaches puppet to create /var/run/elasticsearch before starting the service. The instance will still deadlock on reboot until puppet has had a chance to run and create the directory.

At least that unlocks puppet. The proper fix is to have the init script create the directory itself. Maybe a more recent Debian package fixes the issue.
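As an illustration of what "have the init script create the directory" could look like, here is a minimal sketch. It is not the code shipped in any elasticsearch package: the function name `ensure_pid_dir`, its arguments, and the `elasticsearch:elasticsearch` owner used in the example call are assumptions.

```shell
#!/bin/sh
# Hypothetical init-script helper (assumed names, not the packaged script):
# recreate the runtime directory if a reboot wiped the tmpfs under /run.
ensure_pid_dir() {
    pid_dir="$1"   # e.g. /var/run/elasticsearch
    owner="$2"     # e.g. elasticsearch:elasticsearch (assumed); empty to skip chown

    if [ ! -d "$pid_dir" ]; then
        # -p also creates missing parents and is harmless on a race.
        mkdir -p "$pid_dir" || return 1
        if [ -n "$owner" ]; then
            # The daemon must be able to write its pid file here.
            chown "$owner" "$pid_dir" || return 1
        fi
    fi
    return 0
}
```

An init script would call something like `ensure_pid_dir /var/run/elasticsearch elasticsearch:elasticsearch` at the top of its start action, before launching the daemon, so the service no longer depends on puppet having run after the reboot.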

Applied a puppet workaround to force-create the directory.

This is fixed in the elasticsearch package 1.7.1, which is in the wikimedia repos for trusty but not for precise. Can we also add the newer version to the precise repo?

Or we can remove ElasticSearch from the Jenkins CI slaves. I think I got it installed for the CirrusSearch browser tests, which are now running on Trusty. The only reason we could still need ES on Precise would be if the CirrusSearch Zend tests require it as a backend.

If we still need ES for some jobs, we can always set up specific slave server(s) for them. ES is currently the largest consumer of resident memory (1.4G) on most slaves, and it would be great to free that up for the vast majority of jobs that don't require it.

I do not know what the consensus is here. Should we apply the patch or should we abandon precise/1.6 support? @JanZerebecki ?

Can we identify the jobs that actually need ElasticSearch? We could move the jobs to Trusty. A potential blocker would be a PHPUnit Zend job that requires an ES backend.

In the short term, it seems easy to just backport elasticsearch for precise-wikimedia.

Yes, doing T111781 is the easier short-term fix. In the long run we want to abandon precise anyway.

Upgraded them:

root@integration-saltmaster:~# salt '*precise*' pkg.install elasticsearch
integration-slave-precise-1014.integration.eqiad.wmflabs:
    ----------
    elasticsearch:
        ----------
        new:
            1.7.1
        old:
            1.6.0
integration-slave-precise-1012.integration.eqiad.wmflabs:
    ----------
    elasticsearch:
        ----------
        new:
            1.7.1
        old:
            1.6.0
integration-slave-precise-1011.integration.eqiad.wmflabs:
    ----------
    elasticsearch:
        ----------
        new:
            1.7.1
        old:
            1.6.0
integration-slave-precise-1013.integration.eqiad.wmflabs:
    ----------
    elasticsearch:
        ----------
        new:
            1.7.1
        old:
            1.6.0

Change 233413 abandoned by Hashar:
elasticsearch: ensure /var/run subdir exists

Reason:
I removed it from the integration puppetmaster. Precise now has ElasticSearch 1.7.1 which comes with a fix to the init script.

https://gerrit.wikimedia.org/r/233413

I have rebooted integration-slave-precise-1014 and it came back.

Then I removed the Gerrit change https://gerrit.wikimedia.org/r/233413 from the puppetmaster, ran puppet and rebooted. The machine came back.

The issue, though, is that elasticsearch now does not start :-D

hashar claimed this task.

Actually, ElasticSearch is started and the machines reboot just fine. I rebooted all Precise slaves.