
[EPIC] Upgrade elasticsearch cluster supporting logging to 2.3
Closed, ResolvedPublic

Description

We should keep all our elasticsearch clusters synced to the same versions. Since the other clusters are being upgraded to 2.3, it is time to upgrade this cluster as well. Ideally we won't be upgrading kibana or logstash as part of this, but we need to test whether the current versions will still work with elasticsearch 2. Goal is to have this completed by end of July 2016.

Production release plan. Aiming for the week of July 18th:

Pre-check:

Upgrade Elasticsearch:

  • Announce that logstash.wikimedia.org will be intermittently available on ops list (and wikitech?)
  • Disable icinga alerts for elasticsearch/logstash
  • Disable puppet on all nodes so services don't come back before they are needed
  • Delete all indices created before July 01.
  • Double-check again with the migration plugin that the indices are all happy. If they are not, elasticsearch 2.3 won't start.
  • Shut down elasticsearch and logstash on logstash1001-1003.
    • Wait 5 or 10 minutes to make sure all prod services really are ok without having access to elasticsearch. In theory since logstash input all comes via UDP I don't foresee any problems, but probably better safe than sorry.
  • Delete indices that need to be re-imported from cleaned dumps:
    • logstash-2016.07.14
    • logstash-2016.07.13
    • logstash-2016.07.12
    • logstash-2016.07.11
    • logstash-2016.07.10
    • logstash-2016.07.09
    • logstash-2016.07.08
    • logstash-2016.07.07
    • logstash-2016.07.06
    • logstash-2016.07.05
    • logstash-2016.07.04
  • Force a flush of all indices with curl -s -XPOST 'localhost:9200/_flush/synced' to aid the cluster recovery process
  • Assuming nothing is on fire, shut down logstash1004-1006
  • Manually install elasticsearch 2.3 .deb to logstash1001-1006
  • Re-apply transient settings recorded/reviewed in the pre-check stage: https://phabricator.wikimedia.org/T136001#2437761
  • Bring cluster back up. Wait for green.
  • Verify logs are once again moving from logstash into elasticsearch indices
  • Re-enable puppet
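
The "delete all indices created before July 01" step can be sketched as a small selection helper. This is only an illustration: it assumes daily indices named logstash-YYYY.MM.DD (as in the list above), and the function names are hypothetical rather than part of any existing tooling.

```python
from datetime import date, datetime

PREFIX = "logstash-"


def index_date(name):
    """Return the date encoded in a logstash-YYYY.MM.DD name, or None."""
    if not name.startswith(PREFIX):
        return None
    try:
        return datetime.strptime(name[len(PREFIX):], "%Y.%m.%d").date()
    except ValueError:
        # Not a daily logstash index (for example a differently named index).
        return None


def indices_before(names, cutoff):
    """Daily logstash indices strictly older than cutoff, oldest first."""
    dated = ((index_date(n), n) for n in names)
    return [n for d, n in sorted((d, n) for d, n in dated if d is not None and d < cutoff)]


if __name__ == "__main__":
    sample = ["logstash-2016.06.30", "logstash-2016.07.01", ".kibana"]
    for name in indices_before(sample, date(2016, 7, 1)):
        print(name)
```

Each selected name could then be removed with a DELETE request against that index, for example curl -XDELETE 'localhost:9200/<name>'. Reviewing the list by hand before deleting anything seems prudent.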

Kibana:

  • Import .kibana index exported from deployment-logstash3.eqiad.wmflabs
  • Merge puppet change to upgrade from kibana3 to kibana4: https://gerrit.wikimedia.org/r/#/c/296279/
  • Verify puppet runs and is happy.
  • Verify new dashboards are up and available, HTTP auth still works as expected, etc.
  • Send announcement to ops list that things are in working order again.

Later:

  • Import logstash-2016.07.04 through logstash-2016.07.14 data from dumps. (start with 2016.07.14 and work backwards)
  • Import elasticsearch 2.3 .deb to apt.wikimedia.org: https://gerrit.wikimedia.org/r/#/c/283466/
  • Delete kibana3 files from logstash1001-3: /srv/deployment/kibana/kibana
  • Delete kibana3 config from logstash1001-3: /etc/kibana
  • Follow similar process to upgrade deployment-logstash2.eqiad.wmflabs
  • Delete deployment-logstash3.eqiad.wmflabs
  • Remove temporary patch from deployment-puppetmaster: https://gerrit.wikimedia.org/r/#/c/295442/
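
The re-import order described above (start with 2016.07.14 and work backwards to 2016.07.04) can be generated rather than typed out. A minimal sketch, assuming the same logstash-YYYY.MM.DD naming; reimport_order is a hypothetical helper name.

```python
from datetime import date, timedelta


def reimport_order(first, last):
    """Daily index names from last back to first, both inclusive."""
    days = (last - first).days
    return ["logstash-{:%Y.%m.%d}".format(last - timedelta(days=i)) for i in range(days + 1)]


if __name__ == "__main__":
    for name in reimport_order(date(2016, 7, 4), date(2016, 7, 14)):
        print(name)
```

Working newest-first means the most recent logs become searchable again before the older ones.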

In case of Fire:
¯\_(ツ)_/¯

But more seriously, once elasticsearch 1.x indices have been opened by elasticsearch 2.x there is no going back (without losing all the data). If there are concerns we could dump the last couple days of indices to file before doing the upgrade, but not sure that's really necessary.

Event Timeline

Restricted Application added subscribers: Zppix, Aklapper.May 23 2016, 3:13 PM
bd808 added a comment.May 23 2016, 4:12 PM

We also need to coordinate the upgrade of the Logstash cluster. There *should* be no issue, but I have not tested yet. We don't want any issue with logstash during the Cirrus cluster update, so I propose doing it once the Cirrus upgrade is fully complete. We should still update the logstash beta cluster at the same time as the Cirrus beta cluster.

@bd808: does this sound good to you?

I looked at the Logstash + Elasticsearch 2.0 upgrade notes a few days ago. The only potential issue I saw for the Logstash clusters was the prohibition against field names containing the . character. I have not audited our mappings to see if that will cause any problems. I do agree that the beta cluster should be updated first. I can't guarantee that a lack of mapping conflicts in the beta cluster indices will prove that we will not have any issues in production. I think that we have inputs in production which have no beta cluster equivalent.
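
An audit of the mappings for dotted field names could look roughly like the sketch below. It assumes the mapping has already been fetched (for example from GET /_mapping) and parsed into a dict shaped as {index: {"mappings": {type: {"properties": {...}}}}}; that shape and the helper names are assumptions made for illustration.

```python
def dotted_fields(properties, path=()):
    """Recursively collect field paths whose own name contains a dot."""
    found = []
    for name, spec in properties.items():
        here = path + (name,)
        if "." in name:
            found.append(here)
        # Object fields nest their children under "properties".
        found.extend(dotted_fields(spec.get("properties", {}), here))
    return found


def audit(mapping_response):
    """Return sorted (index, type, field_path) tuples for dotted field names."""
    hits = []
    for index, body in mapping_response.items():
        for doc_type, type_mapping in body.get("mappings", {}).items():
            for field_path in dotted_fields(type_mapping.get("properties", {})):
                hits.append((index, doc_type, field_path))
    return sorted(hits)
```

An empty result on the beta cluster would not prove production is clean, for the reason given above: production has inputs with no beta equivalent.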

As far as I know, no one has done any testing of the versions of Logstash and Kibana we run against Elasticsearch 2.x, so there may be other issues lurking. If we need to upgrade either that's a much bigger project. Modern versions of Kibana are very very different from the old branch that we run including known issues with display timezones and the introduction of a required node service.

In general the upgrade of the Elasticsearch cluster backing Logstash probably deserves its own set of tickets.

Danny_B renamed this task from Upgrade elasticsearch cluster supporting logging to 2.3 to Upgrade elasticsearch cluster supporting logging to 2.3 (tracking).May 23 2016, 4:12 PM
Deskana triaged this task as Medium priority.May 26 2016, 10:17 PM
Deskana added a project: Discovery.
Deskana moved this task from needs triage to This Quarter on the Discovery-Search board.
debt moved this task from This Quarter to Up Next on the Discovery-Search board.Jun 2 2016, 10:09 PM
debt renamed this task from Upgrade elasticsearch cluster supporting logging to 2.3 (tracking) to [EPIC} Upgrade elasticsearch cluster supporting logging to 2.3 (tracking).Jun 7 2016, 10:11 PM
debt added a project: Epic.
debt removed a project: Tracking-Neverending.
EBernhardson renamed this task from [EPIC} Upgrade elasticsearch cluster supporting logging to 2.3 (tracking) to [EPIC} Upgrade elasticsearch cluster supporting logging to 2.3 .Jun 7 2016, 10:12 PM
debt renamed this task from [EPIC} Upgrade elasticsearch cluster supporting logging to 2.3 to [EPIC] Upgrade elasticsearch cluster supporting logging to 2.3 .Jun 7 2016, 10:30 PM
EBernhardson added a subscriber: dcausse.EditedJun 28 2016, 10:23 PM

Moved to ticket description so anyone can edit.

EBernhardson updated the task description.Jul 5 2016, 3:58 PM
EBernhardson updated the task description.Jul 5 2016, 4:00 PM

scheduled with @Gehel for Thursday, July 7 @9am pacific / 4pm GMT.

Gehel added a comment.Jul 7 2016, 3:43 PM

Elasticsearch cluster settings to re-apply after upgrade:

curl -XPUT localhost:9200/_cluster/settings -d '{
    "transient" : {
        "indices.recovery.translog_size" : "16mb",
        "indices.recovery.concurrent_streams" : "16",
        "indices.recovery.concurrent_small_file_streams" : "8",
        "indices.recovery.translog_ops" : "10000",
        "indices.recovery.max_bytes_per_sec" : "120mb",
        "indices.recovery.file_chunk_size" : "1024k",
        "cluster.routing.allocation.cluster_concurrent_rebalance" : "4",
        "cluster.routing.allocation.node_concurrent_recoveries" : "4"
    }
}'
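
For scripting the same step, the curl call above can be expressed with the Python standard library. This is a sketch only: the setting values are copied from the command above, while the function names and the base URL argument are illustrative assumptions.

```python
import json
import urllib.request

# Values copied verbatim from the curl command in this comment.
TRANSIENT_SETTINGS = {
    "indices.recovery.translog_size": "16mb",
    "indices.recovery.concurrent_streams": "16",
    "indices.recovery.concurrent_small_file_streams": "8",
    "indices.recovery.translog_ops": "10000",
    "indices.recovery.max_bytes_per_sec": "120mb",
    "indices.recovery.file_chunk_size": "1024k",
    "cluster.routing.allocation.cluster_concurrent_rebalance": "4",
    "cluster.routing.allocation.node_concurrent_recoveries": "4",
}


def build_settings_request(base_url, transient):
    """Build (but do not send) the PUT /_cluster/settings request."""
    body = json.dumps({"transient": transient}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/_cluster/settings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def apply_settings(base_url, transient):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_settings_request(base_url, transient)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Keeping request construction separate from sending makes it possible to inspect the payload before touching the cluster; apply_settings("http://localhost:9200", TRANSIENT_SETTINGS) would perform the actual call.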
Gehel updated the task description.Jul 7 2016, 3:51 PM

Mentioned in SAL [2016-07-07T16:24:51Z] <gehel> starting elasticsearch and kibana upgrade on logstash cluster (T136001)

Invalid fields / mappings have appeared in the latest index as well. The de-dotting did not work properly. We are taking dumps of the last 7 days of data to clean them and re-import after upgrade.

Change 298115 had a related patch set uploaded (by BryanDavis):
Fix de_dot to process keys with falsey values

https://gerrit.wikimedia.org/r/298115

bd808 added a comment.Jul 9 2016, 3:58 PM

De-dot fix test results:

$ mwscript eval.php enwiki
> wfDebugLog('redis', 'de-dot test', 'all', ['foo.bar.true' => true, 'foo.bar.false' => false, 'foo.bar.0' => 0, 'foo.bar.1' => 1]);

before

{
  "_index": "logstash-2016.07.09",
  "_type": "mediawiki",
  "_id": "AVXP9pmTUJNpVbIylOfp",
  "_score": null,
  "_source": {
    "message": "de-dot test",
    "@version": 1,
    "@timestamp": "2016-07-09T14:01:31.000Z",
    "type": "mediawiki",
    "host": "deployment-tin",
    "level": "INFO",
    "tags": [
      "syslog",
      "es",
      "es"
    ],
    "channel": "redis",
    "normalized_message": "de-dot test",
    "wiki": "enwiki",
    "mwversion": "1.28.0-alpha",
    "reqId": "d543772db0e9b24fa0f3a15b",
    "foo.bar.false": false,
    "private": false,
    "foo_bar_true": true,
    "foo_bar_0": 0,
    "foo_bar_1": 1
  },
  "sort": [
    1468072891000
  ]
}

after

{
  "_index": "logstash-2016.07.09",
  "_type": "mediawiki",
  "_id": "AVXQWzkoUJNpVbIyl7NA",
  "_score": null,
  "_source": {
    "message": "de-dot test",
    "@version": 1,
    "@timestamp": "2016-07-09T15:51:26.000Z",
    "type": "mediawiki",
    "host": "deployment-tin",
    "level": "INFO",
    "tags": [
      "syslog",
      "es",
      "es"
    ],
    "channel": "redis",
    "normalized_message": "de-dot test",
    "wiki": "enwiki",
    "mwversion": "1.28.0-alpha",
    "reqId": "db6479168779a7c8b5175dbb",
    "private": false,
    "foo_bar_true": true,
    "foo_bar_false": false,
    "foo_bar_0": 0,
    "foo_bar_1": 1
  },
  "sort": [
    1468079486000
  ]
}
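
The difference between the two results can be reproduced with a small model of the bug class. This is not the actual patch: it assumes the old filter skipped keys whose value was falsey in the Ruby sense (only false and nil, which is why foo.bar.0 was still renamed), a guess consistent with the output above rather than a reading of the change.

```python
def is_ruby_falsey(value):
    """Model Ruby truthiness: only false and nil are falsey, 0 is truthy."""
    return value is None or value is False


def de_dot_buggy(event):
    """Rename dotted keys, but (incorrectly) skip keys with falsey values."""
    out = {}
    for key, value in event.items():
        if "." in key and not is_ruby_falsey(value):
            out[key.replace(".", "_")] = value
        else:
            out[key] = value
    return out


def de_dot_fixed(event):
    """Rename every dotted key regardless of its value."""
    return {key.replace(".", "_"): value for key, value in event.items()}


if __name__ == "__main__":
    event = {"foo.bar.true": True, "foo.bar.false": False, "foo.bar.0": 0, "foo.bar.1": 1}
    print(sorted(de_dot_buggy(event)))
    print(sorted(de_dot_fixed(event)))
```

Under that assumption the buggy variant leaves foo.bar.false untouched, matching the "before" document, while the fixed variant produces foo_bar_false as in the "after" document.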

Change 298115 merged by BryanDavis:
Fix de_dot to process keys with falsey values

https://gerrit.wikimedia.org/r/298115

Mentioned in SAL [2016-07-09T19:50:37Z] <bd808> restarted logstash on logstash1001 for de-dot plugin update (T136001)

Mentioned in SAL [2016-07-09T19:52:16Z] <bd808> restarted logstash on logstash1002 for de-dot plugin update (T136001)

Mentioned in SAL [2016-07-09T19:54:10Z] <bd808> restarted logstash on logstash1003 for de-dot plugin update (T136001)

Change 298295 had a related patch set uploaded (by BryanDavis):
logstash: Update default mappings for Elasticsearch 2.x

https://gerrit.wikimedia.org/r/298295

bd808 added a comment.Jul 12 2016, 3:52 PM

Applying the default mapping change helped, but we still have several conflicting mappings in the logstash-2016.07.12 index:

  • Mapping for field eventlogging:code conflicts with: mediawiki:code. Check parameters: doc_values, index, norms.enabled
    • Caused by the new mapping not creating the same mappings for values seen initially as strings and values seen initially as other types
  • Mapping for field aqs:err_code conflicts with: kartotherian:err_code. Check parameters: doc_values, index, norms.enabled
    • Caused by the new mapping not creating the same mappings for values seen initially as strings and values seen initially as other types
  • Mapping for field mediawiki:response conflicts with: mml:response. Check parameters: norms.enabled, type
    • The mml.response field is an object rather than a string.

The string/long mapping fix can be made in https://gerrit.wikimedia.org/r/#/c/298295. I hoped that it would act recursively but that is apparently not correct.

The mml.response issue will need to be corrected in a Logstash filter by renaming the mml field to something other than response (probably response_object).
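
Elasticsearch 2.x requires fields with the same name in different types of one index to share a mapping, so the response conflict can be found mechanically and fixed by a rename. The sketch below compares only the field type and uses hypothetical helper names and sample data; the real rename would live in a Logstash filter, not in Python.

```python
def field_kind(spec):
    """Treat a field with sub-properties (or no explicit type) as an object."""
    return "object" if "properties" in spec else spec.get("type", "object")


def find_conflicts(mappings):
    """mappings: {doc_type: {field: spec}} -> {field: {doc_type: kind}} where kinds differ."""
    kinds = {}
    for doc_type, fields in mappings.items():
        for field, spec in fields.items():
            kinds.setdefault(field, {})[doc_type] = field_kind(spec)
    return {f: by_type for f, by_type in kinds.items() if len(set(by_type.values())) > 1}


def rename_field(event, doc_type, old, new):
    """Return a copy of event with old renamed to new when its type matches."""
    if event.get("type") != doc_type or old not in event:
        return dict(event)
    renamed = {k: v for k, v in event.items() if k != old}
    renamed[new] = event[old]
    return renamed
```

For example, rename_field(event, "mml", "response", "response_object") leaves mediawiki events alone and moves the object-valued field out of the way for mml events.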

Change 298295 merged by Gehel:
logstash: Update default mappings for Elasticsearch 2.x

https://gerrit.wikimedia.org/r/298295

bd808 updated the task description.Jul 18 2016, 4:21 PM
bd808 updated the task description.

Mentioned in SAL [2016-07-18T19:04:11Z] <gehel> starting elasticsearch upgrade for logstash (T136001)

EBernhardson updated the task description.Jul 18 2016, 7:11 PM
EBernhardson updated the task description.
Gehel updated the task description.Jul 18 2016, 7:24 PM
EBernhardson updated the task description.Jul 18 2016, 7:26 PM
EBernhardson updated the task description.Jul 18 2016, 7:38 PM
bd808 updated the task description.Jul 18 2016, 7:40 PM
EBernhardson updated the task description.Jul 18 2016, 8:10 PM
EBernhardson updated the task description.Jul 18 2016, 8:12 PM
EBernhardson updated the task description.Jul 18 2016, 9:16 PM
Gehel updated the task description.Jul 18 2016, 9:41 PM
debt closed this task as Resolved.Jul 21 2016, 6:12 PM
debt claimed this task.