
ELK 5.x deployment plan
Closed, Resolved · Public

Description

Perform both upgrades on a Monday: each one will cause minor downtime of the logging service, and we don't want to lose logs related to the train rolling forward.

Deploy logstash 5.x around April 10th.

  1. Take logstash out of the experimental apt repo and upload it to main
  2. Merge and git-deploy the new logstash plugins
  3. Merge the puppet patch for the logstash 5.x configuration
  4. Force a puppet run on logstash100[123]
  5. apt-get install logstash on logstash100[123]. Unfortunately, because there is no LVS balancing writes, we may lose some of the logs sent to hosts while they are being upgraded.
  6. Check that logs are still coming in. Verify that no problems are recorded in /var/lib/logstash. Note that some existing errors about Gelfd are expected (T161563)

Assuming no hiccups, deploy elasticsearch 5.x and kibana 5.x on April 24th.

  1. Check that all indices on logstash have been created with elasticsearch >= 2.x. Delete anything old.
  2. Record the output of curl logstash1001.eqiad.wmnet/_cluster/settings to be re-applied. Evaluate if any of these settings should instead be moved into puppet as permanent settings.
  3. Merge elasticsearch plugins to operations/software/elasticsearch/plugins
  4. Pull new elasticsearch plugins to beta cluster
  5. Install the elasticsearch .debs across the cluster
  6. Shut down elasticsearch on all nodes
  7. Remount /var/lib/elasticsearch/production-logstash-eqiad to /srv/elasticsearch/production-logstash-eqiad on logstash100[456]
    1. Can this be done prior to everything else, one node at a time?
  8. Bring up one logstash data node, make sure everything is happy
  9. Bring up the rest of the logstash cluster
  10. Install new kibana deb on logstash100[123]
  11. Double check that logstash installed the newly deployed 5.x versions of the index templates. If not, manually install the logstash and apifeatureusage templates via the REST API.
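
Step 1 of the elasticsearch plan could be checked with something like the following minimal sketch. It assumes the index settings have already been fetched (for example with `GET /_all/_settings`) and parsed into a dict, and it relies on `index.version.created` encoding the major version in the millions place (for example `2030399` for 2.3.3). The function names and the sample index names are hypothetical, for illustration only.

```python
def created_major(version_id):
    """Major version encoded in an index.version.created id (2030399 -> 2)."""
    return int(version_id) // 1000000


def indices_older_than(settings_response, min_major=2):
    """Return the sorted names of indices created before `min_major`."""
    old = []
    for name, body in settings_response.items():
        created = body["settings"]["index"]["version"]["created"]
        if created_major(created) < min_major:
            old.append(name)
    return sorted(old)


# Hypothetical sample shaped like a parsed GET /_all/_settings response.
sample = {
    "logstash-2017.04.01": {
        "settings": {"index": {"version": {"created": "2030399"}}}
    },
    "logstash-2015.01.01": {
        "settings": {"index": {"version": {"created": "1070599"}}}
    },
}

# Indices listed here are candidates for deletion before the upgrade.
print(indices_older_than(sample))
```

Anything the check lists was created before 2.x and would need to be deleted (or reindexed) before the 5.x upgrade.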

Event Timeline

Restricted Application added a subscriber: Aklapper. · Mar 31 2017, 4:25 PM
EBernhardson updated the task description. · Apr 3 2017, 5:51 PM
EBernhardson updated the task description.

@Gehel @dcausse Please review, see if this makes sense as a deployment plan.

I can't comment on the logstash deployment, but this looks good to me. A few comments concerning elasticsearch:

  • we should probably do step 1 ASAP in case something important needs to be migrated
  • concerning point 2: search.svc.codfw.wmnet is surely a typo? Looking at the logstash1003.eqiad.wmnet cluster settings, I see numerous transient settings:
{
  "persistent": {
    "action": {
      "destructive_requires_name": "true"
    }
  },
  "transient": {
    "cluster": {
      "routing": {
        "allocation": {
          "cluster_concurrent_rebalance": "4",
          "node_concurrent_recoveries": "4",
          "enable": "all"
        }
      }
    },
    "indices": {
      "recovery": {
        "concurrent_streams": "16",
        "translog_size": "16mb",
        "translog_ops": "10000",
        "concurrent_small_file_streams": "8",
        "max_bytes_per_sec": "120mb",
        "file_chunk_size": "1024k"
      }
    }
  }
}

They all seem to be there to speed up recovery. Are they still needed?
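
To make a review of these settings easier, the nested JSON can be flattened into dotted setting names, one per line. This is only a sketch: `flatten_settings` is a hypothetical helper, and the `transient` dict below is an abbreviated copy of the output quoted above.

```python
def flatten_settings(nested, prefix=""):
    """Flatten nested cluster settings into {"dotted.name": value}."""
    flat = {}
    for key, value in nested.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_settings(value, dotted))
        else:
            flat[dotted] = value
    return flat


# Abbreviated copy of the transient settings quoted above.
transient = {
    "cluster": {
        "routing": {
            "allocation": {
                "cluster_concurrent_rebalance": "4",
                "node_concurrent_recoveries": "4",
                "enable": "all",
            }
        }
    },
    "indices": {"recovery": {"max_bytes_per_sec": "120mb"}},
}

# One line per setting, ready to compare against what puppet manages.
for name, value in sorted(flatten_settings(transient).items()):
    print(f"{name} = {value}")
```

Each dotted name can then be checked individually: keep it as a transient setting, move it into puppet, or drop it.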

EBernhardson updated the task description. · Apr 6 2017, 5:51 PM
EBernhardson updated the task description. · Apr 10 2017, 7:22 PM

Mentioned in SAL (#wikimedia-operations) [2017-04-10T19:38:26Z] <gehel> starting logstash upgrade - some log messages will be lost! - T161908

Mentioned in SAL (#wikimedia-operations) [2017-04-10T19:38:48Z] <gehel> disabling puppet on logstash1* - T161908

Gehel added a comment. · Edited · Apr 10 2017, 7:43 PM

The final plan for the logstash upgrade looks like this:

  1. disable puppet on logstash cluster
  2. merge puppet change
  3. run puppet on salt masters (to activate git-fat for plugin deployment)
  4. merge plugins
  5. deploy plugins
  6. for each server (logstash100[123]):
    1. stop logstash
    2. mask logstash
    3. apt-get install logstash
    4. puppet run
    5. unmask and restart logstash
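
The per-server loop in step 6 can be sketched as a dry-run plan generator that finishes every step on one host before moving to the next. The exact commands (`systemctl ...`, `puppet agent --test`) and the `eqiad.wmnet` domain suffix are assumptions made for illustration; the plan above only names the actions, not the commands used. Nothing here is executed, the plan is only printed.

```python
def expand_hosts(prefix, suffixes, domain="eqiad.wmnet"):
    """Expand e.g. ("logstash100", "123") into fully qualified host names."""
    return [f"{prefix}{suffix}.{domain}" for suffix in suffixes]


# Assumed command for each action in the plan; adjust to the real tooling.
PER_HOST_STEPS = [
    "systemctl stop logstash",
    "systemctl mask logstash",
    "apt-get install logstash",
    "puppet agent --test",
    "systemctl unmask logstash",
    "systemctl restart logstash",
]


def rolling_plan(hosts, steps=PER_HOST_STEPS):
    """Return (host, command) pairs, one host fully done before the next."""
    return [(host, step) for host in hosts for step in steps]


hosts = expand_hosts("logstash100", "123")
for host, step in rolling_plan(hosts):
    print(f"{host}: {step}")
```

Masking the service before the package install keeps the package's post-install scripts from starting logstash before puppet has written the 5.x configuration, which appears to be the reason for the stop, mask, install, puppet, unmask order.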

Perhaps worth looking into later:

  • The initial git deploy sync failed because the logstash hosts all had the remote set to mira. Manually adjusted it to point to tin instead.
  • After a successful git deploy, the files that git-fat should have initialized were still empty. Manually ran git-fat pull on the logstash hosts to fix this.

Mentioned in SAL (#wikimedia-operations) [2017-04-10T20:29:59Z] <gehel> upgrading logstash on logstash1001 - T161908

Mentioned in SAL (#wikimedia-operations) [2017-04-10T21:13:51Z] <gehel> running puppet on logstash1001 to deploy new logstash plugins - T161908

Mentioned in SAL (#wikimedia-operations) [2017-04-10T21:17:40Z] <gehel> logstash upgrade on logstash1001 completed - T161908

Mentioned in SAL (#wikimedia-operations) [2017-04-10T21:22:15Z] <gehel> upgrading logstash on logstash1002 - T161908

Mentioned in SAL (#wikimedia-operations) [2017-04-10T21:31:26Z] <gehel> upgrading logstash on logstash1003 - T161908

Mentioned in SAL (#wikimedia-operations) [2017-04-10T21:33:44Z] <gehel> logstash upgrade on all logstash1* nodes completed - T161908

debt added a subscriber: debt. · Apr 11 2017, 5:13 PM

Awaiting more patch releases on April 24 to finish this out.

Deskana triaged this task as Medium priority. · Apr 18 2017, 5:10 PM
ayounsi added a subscriber: ayounsi. · May 5 2017, 7:22 AM

Change 352590 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] ELK - upgrade reprepro to version 5.3.2 of ELK

https://gerrit.wikimedia.org/r/352590

Change 352590 merged by Gehel:
[operations/puppet@production] ELK - upgrade reprepro to version 5.3.2 of ELK

https://gerrit.wikimedia.org/r/352590

Change 352605 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] elastic - simplify configuration of elastic.co reprepro repositories

https://gerrit.wikimedia.org/r/352605

Change 352605 merged by Gehel:
[operations/puppet@production] elastic - simplify configuration of elastic.co reprepro repositories

https://gerrit.wikimedia.org/r/352605

Change 352608 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] ELK - logstash package does not follow the same version naming

https://gerrit.wikimedia.org/r/352608

Change 352608 merged by Gehel:
[operations/puppet@production] ELK - logstash package does not follow the same version naming

https://gerrit.wikimedia.org/r/352608

Mentioned in SAL (#wikimedia-operations) [2017-05-08T19:28:02Z] <gehel> starting ELK (logstash) upgrade - T161908

Change 352648 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] logstash upgrade to elasticsearch 5

https://gerrit.wikimedia.org/r/352648

Change 352648 merged by Gehel:
[operations/puppet@production] logstash upgrade to elasticsearch 5

https://gerrit.wikimedia.org/r/352648

Mentioned in SAL (#wikimedia-operations) [2017-05-08T19:47:21Z] <gehel> logstash / elasticsearch downtime coming up - T161908

Mentioned in SAL (#wikimedia-operations) [2017-05-08T20:02:50Z] <gehel> restarting elasticsearch on logstash cluster after upgrade - T161908

Mentioned in SAL (#wikimedia-operations) [2017-05-08T20:21:56Z] <gehel> upgrading kibana on logstash cluster - T161908

Mentioned in SAL (#wikimedia-operations) [2017-05-08T20:27:51Z] <gehel> restarted kibana on logstash cluster - T161908

Change 352666 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] kibana - cleanup of /opt/kibana has been done

https://gerrit.wikimedia.org/r/352666

Mentioned in SAL (#wikimedia-operations) [2017-05-08T20:37:42Z] <gehel> silencing elasticsearch shard icinga check, recovery after upgrade is going to take a long time - T161908

Change 352666 merged by Gehel:
[operations/puppet@production] kibana - cleanup of /opt/kibana has been done

https://gerrit.wikimedia.org/r/352666

All elasticsearch instances have been migrated to 5.x. The 5.1 to 5.3 upgrade still needs to happen, but that is a separate task.

debt closed this task as Resolved. · May 30 2017, 5:23 PM
debt claimed this task.