Replace and expand Elasticsearch/Kafka storage in eqiad and upgrade the cluster from Debian jessie to stretch
Closed, Resolved (Public)

Description

Proposed plan to complete this task:

  • Put new storage systems in production with stretch
    • Install new systems with stretch (shipment and racking pending, T210498)
    • Add hosts to elasticsearch cluster
    • Add hosts to kafka cluster
  • Take old storage hosts out of service for elasticsearch
    • Migrate all ES indices onto new hosts
    • Retire ES on old hosts
  • Take old storage hosts out of service for kafka
    • Migrate logstash1004 to logstash1010
    • Migrate logstash1005 to logstash1011
    • Migrate logstash1006 to logstash1012
    • Point all producers/consumers to the new kafka brokers
  • Replace logstash jessie VMs (logstash100[789]) with stretch VMs
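
One common way to carry out the "Migrate all ES indices onto new hosts" step is Elasticsearch's shard allocation filtering: excluding the old nodes by name makes the cluster relocate their shards to the remaining nodes. The task does not say which method was actually used, so the following is only an illustrative sketch. It builds the request body for PUT /_cluster/settings; the helper name drain_settings and the assumption that node names match hostnames are hypothetical.

```python
import json


def drain_settings(old_nodes):
    """Build a _cluster/settings body that excludes old_nodes from shard allocation.

    Hypothetical helper: assumes Elasticsearch node names equal the hostnames.
    """
    return {
        "transient": {
            # Comma-separated list of node names that should hold no shards.
            "cluster.routing.allocation.exclude._name": ",".join(old_nodes)
        }
    }


old_hosts = ["logstash1004", "logstash1005", "logstash1006"]
body = drain_settings(old_hosts)
# Body to send with: PUT /_cluster/settings
print(json.dumps(body, indent=2))
```

Once the excluded nodes hold no shards, ES can be stopped on them without losing data.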

Event Timeline

fgiunchedi triaged this task as Normal priority. (Jan 16 2019, 11:12 AM)
fgiunchedi created this task.
herron moved this task from Backlog to Working on on the User-herron board. (Jan 22 2019, 9:24 PM)

Change 490401 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: move role::ls::eventlogging to profile::logstash::collector

https://gerrit.wikimedia.org/r/490401

Change 490601 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] install_server: use default distribution for logstash100[789]

https://gerrit.wikimedia.org/r/490601

Change 490602 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] scap: use logstash service name for logstash_host

https://gerrit.wikimedia.org/r/490602

Change 490601 merged by Filippo Giunchedi:
[operations/puppet@production] install_server: use default distribution for logstash100[789]

https://gerrit.wikimedia.org/r/490601

Change 490602 merged by Filippo Giunchedi:
[operations/puppet@production] scap: use logstash1008 for logstash_host

https://gerrit.wikimedia.org/r/490602

Mentioned in SAL (#wikimedia-operations) [2019-02-14T14:45:34Z] <godog> depool and stop logstash1009 for stretch reimage - T213898

Change 490401 merged by Herron:
[operations/puppet@production] logstash: move role::ls::eventlogging to profile::logstash::collector

https://gerrit.wikimedia.org/r/490401

On stretch we install elasticsearch-curator from the stretch repository by default, which is at version 4.2. The package needs to be at version >= 5 and must therefore come from stretch-wikimedia.
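
The version constraint can be checked mechanically. The sketch below is a simplified illustration, not how the puppet change enforces it: it only looks at the leading number of the upstream version and ignores the finer points of Debian version ordering. The function name meets_minimum is hypothetical.

```python
import re


def meets_minimum(version, minimum_major=5):
    """Return True if the upstream major version is >= minimum_major.

    Simplified: strips a Debian epoch ("1:") and reads the first number only.
    """
    upstream = version.split(":", 1)[-1]
    match = re.match(r"(\d+)", upstream)
    if match is None:
        raise ValueError(f"cannot parse version: {version!r}")
    return int(match.group(1)) >= minimum_major


print(meets_minimum("4.2"))  # version shipped in stretch
print(meets_minimum("5.0"))  # any 5.x release satisfies the constraint
```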

Change 490809 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] logstash: force use elasticsearch-curator 5

https://gerrit.wikimedia.org/r/490809

Change 490809 merged by Filippo Giunchedi:
[operations/puppet@production] logstash: force use elasticsearch-curator 5

https://gerrit.wikimedia.org/r/490809

Change 491794 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] Revert "scap: use logstash1008 for logstash_host"

https://gerrit.wikimedia.org/r/491794

Change 491794 merged by Filippo Giunchedi:
[operations/puppet@production] Revert "scap: use logstash1008 for logstash_host"

https://gerrit.wikimedia.org/r/491794

Mentioned in SAL (#wikimedia-operations) [2019-02-20T16:36:10Z] <godog> depool and reimage logstash1008 with stretch - T213898

Mentioned in SAL (#wikimedia-operations) [2019-02-21T13:57:21Z] <godog> depool and reimage logstash1007 - T213898

fgiunchedi updated the task description. (Feb 21 2019, 2:38 PM)
herron updated the task description. (Feb 25 2019, 3:43 PM)

Change 492695 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: remove elasticsearch role from logstash100[456]

https://gerrit.wikimedia.org/r/492695

Mentioned in SAL (#wikimedia-operations) [2019-02-25T21:11:03Z] <herron> turning down elasticsearch service on logstash100[456] (data has been migrated to logstash101[012]) T213898

Change 492769 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: add logstash101[012] to unicast hosts

https://gerrit.wikimedia.org/r/492769

Change 492769 merged by Herron:
[operations/puppet@production] logstash: add logstash101[012] to unicast hosts

https://gerrit.wikimedia.org/r/492769

Mentioned in SAL (#wikimedia-operations) [2019-02-25T23:15:15Z] <herron> service restarts to make logstash101[012] master eligible are taking longer than expected, leaving elasticsearch on logstash100[456] enabled overnight T213898

Mentioned in SAL (#wikimedia-operations) [2019-02-26T16:09:46Z] <herron> elasticsearch stopped on logstash100[456] T213898

Change 492695 merged by Herron:
[operations/puppet@production] logstash: remove elasticsearch role from logstash100[456]

https://gerrit.wikimedia.org/r/492695

Change 493098 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: shrink es cluster back to 3 nodes, remove retired hosts

https://gerrit.wikimedia.org/r/493098
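
Changing the number of master-eligible nodes matters under zen discovery (which the "unicast hosts" change above suggests is in use), because discovery.zen.minimum_master_nodes is conventionally set to a strict majority of the master-eligible nodes. Whether this patch touches that setting is an assumption, as is the figure of six nodes for the period when old and new hosts ran side by side. A minimal sketch of the usual formula:

```python
def minimum_master_nodes(master_eligible):
    """Strict majority of master-eligible nodes: floor(n / 2) + 1."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


# 6 = old plus new hosts during the migration (assumed), 3 = after shrinking.
for nodes in (6, 3):
    print(nodes, "master-eligible nodes -> quorum of", minimum_master_nodes(nodes))
```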

Change 493102 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: add kafka logging role to logstash101[012]

https://gerrit.wikimedia.org/r/493102

herron updated the task description. (Feb 27 2019, 3:23 PM)

Change 493102 abandoned by Herron:
logstash: add kafka logging role to logstash101[012]

Reason:
in favor of replacing hosts in rolling fashion

https://gerrit.wikimedia.org/r/493102

Change 493290 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kafka-logging: replace logstash1004 with logstash1010

https://gerrit.wikimedia.org/r/493290

Mentioned in SAL (#wikimedia-operations) [2019-02-27T19:26:51Z] <herron> replacing kafka on logstash1004 with logstash1010 T213898

Change 493290 merged by Herron:
[operations/puppet@production] kafka-logging: replace logstash1004 with logstash1010

https://gerrit.wikimedia.org/r/493290

Change 493303 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash1010: add ipv6 mapped address

https://gerrit.wikimedia.org/r/493303

Change 493303 merged by Herron:
[operations/puppet@production] logstash1010: add ipv6 mapped address

https://gerrit.wikimedia.org/r/493303

Change 493306 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash101[12]: add ipv6 mapped address

https://gerrit.wikimedia.org/r/493306

Change 493306 merged by Herron:
[operations/puppet@production] logstash101[12]: add ipv6 mapped address

https://gerrit.wikimedia.org/r/493306

The kafka service on logstash1004 has been migrated to logstash1010, and logstash1004 has now been transitioned to spare::system.

herron updated the task description. (Feb 27 2019, 10:49 PM)
herron updated the task description.

Change 493429 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kafka-logging: replace logstash1005 with logstash1011

https://gerrit.wikimedia.org/r/493429

Change 493430 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: disable notifications on logstash1005 and logstash1011

https://gerrit.wikimedia.org/r/493430

Change 493430 merged by Herron:
[operations/puppet@production] logstash: disable notifications on logstash1005 and logstash1011

https://gerrit.wikimedia.org/r/493430

Change 493440 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] rsyslog: replace logstash1004 with logstash1010 in kafka_shipper

https://gerrit.wikimedia.org/r/493440

Change 493440 merged by Herron:
[operations/puppet@production] rsyslog: replace logstash1004 with logstash1010 in kafka_shipper

https://gerrit.wikimedia.org/r/493440

Mentioned in SAL (#wikimedia-operations) [2019-02-28T16:27:56Z] <herron> migrating kafka on logstash1005 to logstash1011 T213898

Mentioned in SAL (#wikimedia-operations) [2019-02-28T16:28:00Z] <herron> migrating kafka on logstash1005 to logstash1011 T213898

Change 493429 merged by Herron:
[operations/puppet@production] kafka-logging: replace logstash1005 with logstash1011

https://gerrit.wikimedia.org/r/493429

Mentioned in SAL (#wikimedia-operations) [2019-02-28T17:51:46Z] <herron> logstash1011 kafka now in sync. transitioning logstash1005 to spare system T213898
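
"In sync" here presumably means that the new broker has rejoined the in-sync replica (ISR) set of every partition it hosts, which is the point at which the old broker can safely be retired. The sketch below illustrates that check over partition metadata. The data layout, the broker ids and the topic name are invented for the example, and how the metadata is obtained is left out.

```python
def out_of_sync_partitions(partitions, broker_id):
    """Return (topic, partition) pairs where broker_id is a replica but not in the ISR."""
    return [
        (p["topic"], p["partition"])
        for p in partitions
        if broker_id in p["replicas"] and broker_id not in p["isr"]
    ]


# Hypothetical metadata; the broker ids below are invented for this example.
partitions = [
    {"topic": "example-topic", "partition": 0, "replicas": [1010, 1011], "isr": [1010, 1011]},
    {"topic": "example-topic", "partition": 1, "replicas": [1011, 1012], "isr": [1012]},
]

lagging = out_of_sync_partitions(partitions, 1011)
print(lagging)  # an empty list would mean the broker is fully in sync
```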

herron renamed this task from "Replace and expand Elasticsearch storage in eqiad and upgrade the cluster from Debian jessie to stretch" to "Replace and expand Elasticsearch/Kafka storage in eqiad and upgrade the cluster from Debian jessie to stretch". (Feb 28 2019, 5:52 PM)
herron updated the task description.

Change 493471 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kafka-logging: replace logstash1006 with logstash1012

https://gerrit.wikimedia.org/r/493471

Change 493476 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: disable notifications on logstash1006 and logstash1012

https://gerrit.wikimedia.org/r/493476

Change 493476 merged by Herron:
[operations/puppet@production] logstash: disable notifications on logstash1006 and logstash1012

https://gerrit.wikimedia.org/r/493476

Change 493477 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] rsyslog: replace logstash1005 with logstash1011 in kafka_shipper

https://gerrit.wikimedia.org/r/493477

Change 493477 merged by Herron:
[operations/puppet@production] rsyslog: replace logstash1005 with logstash1011 in kafka_shipper

https://gerrit.wikimedia.org/r/493477

Mentioned in SAL (#wikimedia-operations) [2019-02-28T18:52:06Z] <herron> migrating logstash1006 kafka to logstash1012 T213898

Change 493471 merged by Herron:
[operations/puppet@production] kafka-logging: replace logstash1006 with logstash1012

https://gerrit.wikimedia.org/r/493471

herron updated the task description. (Mar 1 2019, 10:37 PM)
herron updated the task description.

Change 494224 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] rsyslog: replace logstash1006 with logstash1012 in kafka_shipper

https://gerrit.wikimedia.org/r/494224

Change 494224 merged by Herron:
[operations/puppet@production] rsyslog: replace logstash1006 with logstash1012 in kafka_shipper

https://gerrit.wikimedia.org/r/494224

herron updated the task description. (Mar 4 2019, 2:52 PM)
herron closed this task as Resolved. (Mar 4 2019, 2:59 PM)
herron claimed this task.

Service migration and OS upgrade work is complete: ES and Kafka services are running from logstash101[012], and the frontend VMs logstash100[789] have been upgraded to stretch.

Tracking hardware retirement in task T217556