We should keep all our elasticsearch clusters synced to the same versions. Since the other clusters are being upgraded to 2.3, it is time to upgrade this cluster as well. Ideally we won't be upgrading kibana or logstash as part of this, but we need to test whether the current versions will still work with elasticsearch 2. The goal is to have this completed by the end of July 2016.
Production release plan. Aiming for week of July:
- Import kibana 4.5 .deb to apt.wikimedia.org: https://gerrit.wikimedia.org/r/296477
- Merge elasticsearch configuration required for 2.x: https://gerrit.wikimedia.org/r/296475
- Record transient settings from /_cluster/settings to be reapplied after cluster restart
- Review and double-check that these are all sensible to re-apply on 2.3
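- e.g. (assuming elasticsearch is listening on its default port): curl -s 'localhost:9200/_cluster/settings?pretty'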
- Verify with migration plugin that all indices >= logstash-2016.07.01 will start in 2.x
- install plugin to local ES instance: ./bin/plugin -i migration -u https://github.com/elastic/elasticsearch-migration/releases/download/v1.18/elasticsearch-migration-1.18.zip
- setup ssh tunnel to logstash: ssh -L 9222:localhost:9200 logstash1001.eqiad.wmnet
- Visit the local copy of the site plugin and point it at the tunneled elasticsearch server (see the example URL below)
- May need to delete indices that won't work, as long as we still have a reasonably complete set of recent logs
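- If memory serves, the site plugin is served at http://localhost:9200/_plugin/migration/ ; from there point it at http://localhost:9222, the tunneled production instance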
- Export .kibana index from deployment-logstash3.eqiad.wmflabs
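- One way to do the export (a sketch, assuming elasticdump from npm is available; kibana-export.json is just a scratch file name): elasticdump --input=http://localhost:9200/.kibana --output=kibana-export.json --type=data (run again with --type=mapping if we also want the mappings)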
- Announce on the ops list (and wikitech?) that logstash.wikimedia.org will be only intermittently available
- Disable icinga alerts for elasticsearch/logstash
- Disable puppet on all nodes so services don't come back before they are needed
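- e.g. on each node: sudo puppet agent --disable 'elasticsearch 2.3 upgrade'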
- Delete all indices created before July 01.
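- Something like the following, repeated per month (assumes wildcard deletes are allowed, i.e. action.destructive_requires_name is unset): curl -XDELETE 'localhost:9200/logstash-2016.06.*'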
- Double-check again with the migration plugin that the indices are all happy; if they are not, elasticsearch 2.3 won't start.
- Shut down elasticsearch and logstash on logstash1001-1003.
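- e.g. (assuming the stock init scripts): sudo service logstash stop && sudo service elasticsearch stop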
- Wait 5 or 10 minutes to make sure all prod services really are OK without access to elasticsearch. In theory, since all logstash input comes via UDP, I don't foresee any problems, but better safe than sorry.
- Delete indices that need to be re-imported from cleaned dumps:
- Force a flush of all indices with curl -s -XPOST 'localhost:9200/_flush/synced' to aid the cluster recovery process
- Assuming nothing is on fire, shut down logstash1004-1006
- Manually install elasticsearch 2.3 .deb to logstash1001-1006
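- e.g. (hypothetical file name; use whatever the actual 2.3.x package is called): sudo dpkg -i elasticsearch-2.3.3.deb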
- Re-apply transient settings recorded/reviewed in the pre-check stage: https://phabricator.wikimedia.org/T136001#2437761
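- e.g. (the setting shown is just a placeholder; substitute the values recorded earlier): curl -XPUT 'localhost:9200/_cluster/settings' -d '{"transient": {"cluster.routing.allocation.enable": "all"}}'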
- Bring cluster back up. Wait for green.
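- e.g.: curl -s 'localhost:9200/_cluster/health?wait_for_status=green&timeout=10m&pretty'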
- Verify logs are once again moving from logstash into elasticsearch indices
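- e.g. check that today's index exists and its doc count is climbing: curl -s 'localhost:9200/_cat/indices/logstash-*?v'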
- Re-enable puppet
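- e.g. on each node: sudo puppet agent --enable && sudo puppet agent --test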
- Import .kibana index exported from deployment-logstash3.eqiad.wmflabs
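- Mirror of the export above (again assuming elasticdump): elasticdump --input=kibana-export.json --output=http://localhost:9200/.kibana --type=data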
- Merge puppet change to upgrade from kibana3 to kibana4: https://gerrit.wikimedia.org/r/#/c/296279/
- Verify puppet runs and is happy.
- Verify new dashboards are up and available, HTTP auth still works as expected, etc.
- Send announcement to ops list that things are in working order again.
- Import logstash-2016.07.04 through logstash-2016.07.14 data from dumps. (start with 2016.07.14 and work backwards)
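- Per-index sketch, assuming the dumps are elasticdump-style JSON (adjust to whatever format the dumps are actually in): elasticdump --input=logstash-2016.07.14.json --output=http://localhost:9200/logstash-2016.07.14 --type=data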
- Import elasticsearch 2.3 .deb to apt.wikimedia.org: https://gerrit.wikimedia.org/r/#/c/283466/
- Delete kibana3 files from logstash1001-3: /srv/deployment/kibana/kibana
- Delete kibana3 config from logstash1001-3: /etc/kibana
- Follow similar process to upgrade deployment-logstash2.eqiad.wmflabs
- Delete deployment-logstash3.eqiad.wmflabs
- Remove temporary patch from deployment-puppetmaster: https://gerrit.wikimedia.org/r/#/c/295442/
In case of Fire:
But more seriously: once elasticsearch 1.x indices have been opened by elasticsearch 2.x there is no going back (without losing all the data). If there are concerns, we could dump the last couple of days of indices to file before doing the upgrade, but I'm not sure that's really necessary.