Devise a plan on how to upgrade to Elasticsearch 2.3 without turning user-facing search features off
Closed, Resolved, Public

Description

Upgrade plan, starting with codfw cluster. Feel free to edit as necessary to better reflect reality, lessons learned while performing initial beta cluster migration, etc.

25-27 May, after wmf.3 branch cut:

  1. Update warmers on current 1.7 cluster
  2. Record the output of curl search.svc.codfw.wmnet/_cluster/settings to be re-applied
  3. Update plugins repository. Do not sync out to prod
  4. Pull new plugins to beta cluster
  5. Install elasticsearch 2.3.3 deb to beta cluster
  6. Bring down beta cluster elasticsearch servers
  7. Bring full cluster back up. All servers are master capable.
  8. Re-apply transient cluster settings recorded above
  9. At this point we are running new elasticsearch on old CirrusSearch code. Ensure writes are still going through.
  10. Merge es2.x branch to master (CirrusSearch, Elastica, and vendor). Merge individual patches for ApiFeatureUsage, Translate and GeoData. Push to gerrit and merge
  11. Manual testing. Ensure daily browser tests still pass.
  12. TODO: Is it possible to do more testing with beta cluster this week?
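
Steps 2 and 8 above (record the cluster settings, then re-apply the transient ones) could be scripted roughly as below. This is a minimal sketch under stated assumptions, not the commands that were actually run: the helper names `build_transient_payload` and `reapply` are hypothetical, and the saved output is assumed to be the JSON body returned by `GET /_cluster/settings`, with top-level `persistent` and `transient` keys. Only the transient section is re-sent, since transient settings do not survive a full cluster restart while persistent ones do.

```python
import json
import urllib.request


def build_transient_payload(saved_settings):
    """Build a PUT /_cluster/settings body from a saved GET response.

    Only the "transient" section is kept; persistent settings are stored
    in the cluster state and do not need to be re-applied.
    """
    return {"transient": saved_settings.get("transient", {})}


def reapply(base_url, saved_settings):
    """Send the recorded transient settings back to the cluster.

    base_url is an assumption supplied by the caller (scheme, host, port).
    Returns the decoded response, or None if there was nothing to restore.
    """
    payload = build_transient_payload(saved_settings)
    if not payload["transient"]:
        return None  # nothing was recorded, so skip the request entirely
    request = urllib.request.Request(
        base_url.rstrip("/") + "/_cluster/settings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

The payload builder is kept separate from the HTTP call so the recorded file can be inspected, or the body diffed against the live settings, before anything is written to the cluster.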

Monday, May 30:

  1. Prepare and merge patch disabling completion suggester index rebuilds (not needed: the Elasticsearch version check will prevent the script from running)
  2. Prepare mediawiki-config patch to send wmf.4 (expected to be cut May 31) search traffic to codfw cluster. https://gerrit.wikimedia.org/r/291257
  3. SWAT out the config patch sending wmf.4 searches to codfw during the Monday evening (SF) SWAT
  4. Pull plugins repository on terbium and sync out to all elasticsearch servers
    1. eqiad systems *must not* be restarted after this has been done
  5. Copy ttmserver index from eqiad to codfw (16 to 20 min with adhoc bash script on terbium)
  6. Install elasticsearch 2.3.3 deb to all elasticsearch servers in codfw
  7. Bring full codfw cluster down
  8. Start all codfw master nodes
    1. Note that once a node has been started under 2.x we will no longer be able to restart that node as 1.7 without losing all the data.
  9. After a master has been decided bring up all codfw data nodes
  10. Ensure codfw cluster is green and that writes are draining from the job queue into the cluster
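
The "ensure the cluster is green" check in the last step could be automated with a small poller along these lines. It is a sketch only: `fetch_health` and `wait_for_green` are hypothetical names, the base URL is left to the caller, and the only API it relies on is `GET /_cluster/health`, whose response carries a `status` of `green`, `yellow`, or `red`. It does not cover the second half of the step, confirming that writes are draining from the job queue.

```python
import json
import time
import urllib.request


def fetch_health(base_url):
    """Fetch and decode GET /_cluster/health from the given cluster."""
    with urllib.request.urlopen(base_url.rstrip("/") + "/_cluster/health") as response:
        return json.load(response)


def wait_for_green(fetch, attempts=60, delay=10.0, sleep=time.sleep):
    """Poll until the cluster reports green.

    fetch is any zero-argument callable returning the health document,
    which keeps the loop testable without a live cluster.
    Returns True once green is seen, False if attempts run out.
    """
    for attempt in range(attempts):
        health = fetch()
        if health.get("status") == "green":
            return True
        if attempt < attempts - 1:
            sleep(delay)  # no need to wait after the final attempt
    return False
```

A call such as `wait_for_green(lambda: fetch_health(base_url))` would block until recovery finishes or the attempts are exhausted; the attempt count and delay are placeholders and would need tuning to how long shard recovery takes on the real cluster.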

Tues:

  • Test on testwiki after train rolls forward. Monitor logs to ensure everything seems sane

Wed:

  • Be available during train rollout
  • Same as Tuesday, but for non-Wikipedia sites

Thurs:

  1. Be available during train deploy for any potential issues.
  2. Verify via grafana that all search traffic is on codfw.
  3. Prepare and merge patch enabling completion suggester index rebuilds

Fri/Mon:

  1. Prepare mediawiki-config patch to remove wmf.4 hacks and point all search traffic back at eqiad
  2. Follow same process to upgrade eqiad cluster as was used for codfw
  3. Ensure eqiad cluster is green and that writes are draining from the job queue into the cluster

Alternative ideas rejected:

  • Considered doing the upgrade outside the train, with a big "flip the switch" sending all traffic to codfw. Rejected as more dangerous than necessary.

Expected errors:

  • If the 1.7 code is talking to a 2.x cluster (index updates), any exception such as the normal DocumentMissingException will cause an Array to string conversion notice. These are acceptable.

Event Timeline

It's likely possible to backport some of the changes to the non es2.x branch. For some other features that were removed perhaps we can put together a patch to push out before we make the switchover that lets things continue to work while using deprecated (but not removed) features such as Filters. Will have to test, but seems plausible.

Let's get this one into the sprint. It was mentioned in today's standup that some discussions would need to happen.

Upgrade plan, starting with codfw cluster. Consider this a first draft to be discussed at offsite:

25-27 May, after wmf.3 branch cut:

  1. Update plugins repository. Do not sync out to prod
  2. Pull new plugins to beta cluster
  3. Install elasticsearch 2.3.2 deb to beta cluster
  4. Bring down beta cluster elasticsearch servers
  5. Bring full cluster back up. All servers are master capable.
  6. At this point we are running new elasticsearch on old CirrusSearch code. Ensure writes are still going through.
  7. Merge es2.x branch to master (CirrusSearch and Elastica). Merge individual patches for ApiFeatureUsage, Translate and GeoData. Push to gerrit and merge
  8. Manual testing. Ensure daily browser tests still pass.
  9. TODO: Is it possible to do more testing with beta cluster this week?

Monday, May 30:

  1. Prepare mediawiki-config patch to send wmf.4 (expected to be cut May 31) search traffic to codfw cluster.
  2. SWAT out the config patch sending wmf.4 searches to codfw during the Monday evening (SF) SWAT
  3. Pull plugins repository on terbium and sync out to all elasticsearch servers
    1. eqiad systems *must not* be restarted after this has been done
  4. Install elasticsearch 2.3.2 deb to all elasticsearch servers in codfw
  5. Bring full codfw cluster down
  6. Start all codfw master nodes
    1. Note that once a node has been started under 2.x we will no longer be able to restart that node as 1.7 without losing all the data.
  7. After a master has been decided bring up all codfw data nodes
  8. Ensure codfw cluster is green and that writes are draining from the job queue into the cluster

Tues:

  • Test on testwiki after train rolls forward. Monitor logs to ensure everything seems sane

Wed:

  • Be available during train rollout
  • Same as Tuesday, but for non-Wikipedia sites

Thurs:

  1. Be available during train deploy for any potential issues.
  2. Verify via grafana that all search traffic is on codfw.

Fri/Mon:

  1. Prepare mediawiki-config patch to remove wmf.4 hacks and point all search traffic back at eqiad
  2. Install elasticsearch 2.3.2 deb to all elasticsearch servers in eqiad
  3. Bring full eqiad cluster down
  4. Start all eqiad master nodes
  5. After a master has been decided bring up all eqiad data nodes
  6. Ensure eqiad cluster is green and that writes are draining from the job queue into the cluster

Alternative ideas rejected:

  • Considered doing the upgrade outside the train, with a big "flip the switch" sending all traffic to codfw. Rejected as more dangerous than necessary.

Expected errors:

  • If the 1.7 code is talking to a 2.x cluster (index updates), any exception such as the normal DocumentMissingException will cause an Array to string conversion notice. These are acceptable.

@aude @Gehel @Smalyshev @dcausse Above is my first draft plan for rolling out the elasticsearch 2.x upgrade. Please post any thoughts here, or bring them up at the offsite in our unstructured afternoon hacking time.

  1. Merge es2.x branch to master (CirrusSearch and Elastica). Push to gerrit and merge

Did you check other extensions deployed on the WMF cluster for compatibility? I see the following seem to have nontrivial references to "Elastica":

  • Translate
  • ApiFeatureUsage
  • GeoData
  • Flow

It looks like tasks have been filed for Translate (T124423) and GeoData (T133428), but I'm not seeing one for Flow or ApiFeatureUsage.

Is there a document anywhere of just what is incompatible between the current version and the new version?

We also need to coordinate the upgrade of the Logstash cluster. There *should* be no issue, but I have not tested yet. We don't want any issue with logstash during the Cirrus cluster update, so I propose doing it once the Cirrus upgrade is fully complete. We should still update the logstash beta cluster at the same time as the Cirrus beta cluster.

@bd808: does this sound good to you?

I looked at the Logstash + Elasticsearch 2.0 upgrade notes a few days ago. The only potential issue I saw for the Logstash clusters was the prohibition against field names containing the . character. I have not audited our mappings to see if that will cause any problems. I do agree that the beta cluster should be updated first. I can't guarantee that a lack of mapping conflicts in the beta cluster indices will prove that we will not have any issues in production. I think that we have inputs in production which have no beta cluster equivalent.
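
An audit of the mappings for dotted field names, as mentioned above, might look something like the following. This is a sketch rather than an existing tool: the function names are hypothetical, and it assumes the input is the decoded JSON from `GET /_mapping`, shaped as `{index: {"mappings": {type: {"properties": {...}}}}}`. Nested paths are joined with `/` instead of `.` so that a dot inside a single field name stays visible.

```python
def dotted_fields(properties, prefix=""):
    """Return paths of fields whose own name contains a '.' character."""
    found = []
    for name, definition in properties.items():
        path = prefix + "/" + name if prefix else name
        if "." in name:
            found.append(path)
        # Object and nested fields carry their sub-fields under "properties".
        found.extend(dotted_fields(definition.get("properties", {}), path))
    return found


def audit_mappings(mappings_response):
    """Collect dotted field names per index and document type.

    Indices and types with no offending fields are left out of the report.
    """
    report = {}
    for index, body in mappings_response.items():
        for doc_type, mapping in body.get("mappings", {}).items():
            hits = dotted_fields(mapping.get("properties", {}))
            if hits:
                report.setdefault(index, {})[doc_type] = hits
    return report
```

An empty report from the beta cluster would still not rule out problems in production, for the reason given above: production has inputs with no beta equivalent, so the same audit would need to be run against the production mappings as well.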

As far as I know, no one has done any testing of the versions of Logstash and Kibana we run against Elasticsearch 2.x, so there may be other issues lurking. If we need to upgrade either, that's a much bigger project. Modern versions of Kibana are very different from the old branch that we run, including known issues with display timezones and the introduction of a required node service.

In general the upgrade of the Elasticsearch cluster backing Logstash probably deserves its own set of tickets.

@Anomie sorry to not get back earlier, was out traveling all last week. Flow, while it has some Elasticsearch code, doesn't actually use it. It was an experiment that didn't end up getting used.

I took a look over the ApiFeatureUsage extension and it should continue to work as is; the only issue is that it will start spamming the logs with deprecation warnings. I've put up a patch at https://gerrit.wikimedia.org/r/290250

Moved the plan into the main ticket description so anyone can edit it.

Mentioned in SAL [2016-05-26T07:31:35Z] <gehel> deployment-prep deploying new elasticsearch plugins (T133124)

Mentioned in SAL [2016-05-26T07:36:13Z] <dcausse> deployment-prep elastic: updating cirrussearch warmers (T133124)

Mentioned in SAL [2016-05-26T07:48:08Z] <gehel> deployment-prep upgrading elasticsearch to 2.3.3 and restarting (T133124)

dcausse updated the task description.

Change 291257 had a related patch set uploaded (by EBernhardson):
Send wmf.4 search traffic to codfw

https://gerrit.wikimedia.org/r/291257

Change 291257 merged by jenkins-bot:
Send wmf.4 search and ttmserver traffic to codfw

https://gerrit.wikimedia.org/r/291257