The 1.6.0 version of the elasticsearch Debian package used for CI on Precise fails to start elasticsearch after a reboot because it requires /var/run/elasticsearch/ to be a directory. The Debian package contains /var/run/elasticsearch/, but /var/run is a symlink to /run, which is on a tmpfs, so the directory is lost on every reboot. The start script doesn't create the directory when needed.
Description
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
elasticsearch: ensure /var/run subdir exists | operations/puppet | production | +11 -0
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | hashar | T109497 elasticsearch 1.6.0 fails to start after reboot
Resolved | | MoritzMuehlenhoff | T111781 Please backport ElasticSearch 1.7.x from wikimedia-trusty to wikimedia-precise for CI needs
Event Timeline
Change 233413 had a related patch set uploaded (by Hashar):
elasticsearch: ensure /var/run subdir exists
https://gerrit.wikimedia.org/r/233413 teaches Puppet to create /var/run/elasticsearch before starting the service. The instance would still deadlock on reboot until Puppet has a chance to run and create the directory.
At least that unblocks Puppet. The proper fix is to have the init script create the parent directory. Maybe a more recent Debian package fixes the issue.
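As a rough illustration of what "have the init script create the directory" could look like, here is a minimal sketch in POSIX shell. It is not the code from the actual elasticsearch package; the function name `ensure_run_dir` is hypothetical, and the `elasticsearch` user name in the usage note is an assumption.

```shell
#!/bin/sh
# Hypothetical helper, not taken from the elasticsearch package.
# Usage: ensure_run_dir DIR [OWNER]
ensure_run_dir() {
    dir="$1"
    owner="${2:-}"

    # /var/run may live on a tmpfs, so the directory can be gone after a
    # reboot even though the package shipped it. Recreate it if missing.
    if [ ! -d "$dir" ]; then
        mkdir -p "$dir" || return 1
    fi

    # Only change ownership when an owner was given and that user exists.
    if [ -n "$owner" ] && id "$owner" >/dev/null 2>&1; then
        chown "$owner" "$dir" || return 1
    fi

    return 0
}
```

An init script could call something like `ensure_run_dir /var/run/elasticsearch elasticsearch` right before launching the daemon, assuming the service runs as a user named `elasticsearch`. The call is idempotent, so it is safe to run on every start.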
This is fixed in the elasticsearch package 1.7.1 that is in the Wikimedia repos for Trusty, but not for Precise. Can we also add the newer version to the Precise repo?
Or we can remove ElasticSearch from the Jenkins CI slaves. I think I got it installed for the CirrusSearch browser tests, which are now running on Trusty. The only reason we could still need ES on Precise would be if the CirrusSearch Zend tests require it as a backend.
If we still need ES for some jobs, we can always set up specific slave server(s) for them. ES is currently the largest consumer of resident memory (1.4G) on most slaves, and it would be great to free that up for the vast majority of jobs that don't require it.
I do not know what the consensus is here. Should we apply the patch or should we abandon precise/1.6 support? @JanZerebecki ?
Can we identify the jobs that actually need ElasticSearch? We could move the jobs to Trusty. A potential blocker would be a PHPUnit Zend job that requires an ES backend.
Short term, it seems easy to just backport elasticsearch for precise-wikimedia.
Filed T111781 to request the backport to Precise. Then we will just upgrade the package on the Precise instances and abandon https://gerrit.wikimedia.org/r/233413
Yes, doing T111781 is the easier short-term fix. In the long run we want to abandon Precise anyway.
Upgraded them:
    root@integration-saltmaster:~# salt '*precise*' pkg.install elasticsearch
    integration-slave-precise-1014.integration.eqiad.wmflabs:
        ----------
        elasticsearch:
            ----------
            new:
                1.7.1
            old:
                1.6.0
    integration-slave-precise-1012.integration.eqiad.wmflabs:
        ----------
        elasticsearch:
            ----------
            new:
                1.7.1
            old:
                1.6.0
    integration-slave-precise-1011.integration.eqiad.wmflabs:
        ----------
        elasticsearch:
            ----------
            new:
                1.7.1
            old:
                1.6.0
    integration-slave-precise-1013.integration.eqiad.wmflabs:
        ----------
        elasticsearch:
            ----------
            new:
                1.7.1
            old:
                1.6.0
Change 233413 abandoned by Hashar:
elasticsearch: ensure /var/run subdir exists
Reason:
I removed it from the integration puppetmaster. Precise now has ElasticSearch 1.7.1 which comes with a fix to the init script.
I have rebooted integration-slave-precise-1014 and it came back.
Then I removed the Gerrit change https://gerrit.wikimedia.org/r/233413 from the puppetmaster, ran Puppet and rebooted. The machine came back.
The issue, though, is that elasticsearch now does not start :-D
Actually ElasticSearch is started and the machines reboot just fine. I rebooted all Precise slaves.