Logstash: add SSD tier to ELK7 cluster
Closed, Resolved · Public

Description

With new logstash ES hosts racked and installed (in T240881 and T240882), it's time to configure an SSD indexing tier for the ELK7 cluster.

Here's a checklist in rough sequence aiming to do this without unwanted shard relocation.
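
For orientation, the tiering mechanism behind the changes on this task can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the Puppet or curator code from the patches: it assumes Elasticsearch shard allocation filtering through the `index.routing.allocation.require.disktype` index setting, paired with the `node.attr.disktype` node attribute named in the changes. The index pattern `logstash-*` and all helper names are hypothetical.

```python
import json

# Custom node attribute name, matching node.attr.disktype from the changes on this task.
ATTR = "disktype"


def require_disktype(disktype):
    """Index-level setting that restricts shard allocation to nodes with this disktype."""
    return {f"index.routing.allocation.require.{ATTR}": disktype}


def template_body(index_patterns, disktype):
    """Body for a legacy index template (PUT _template/<name>) applied to new indices."""
    return {"index_patterns": index_patterns, "settings": require_disktype(disktype)}


def tier_for_age(age_days, hot_days=7):
    """Tier an index should require: SSD while recent, HDD once it reaches hot_days.

    The exact boundary (>= versus >) is an assumption made for this sketch.
    """
    return "hdd" if age_days >= hot_days else "ssd"


if __name__ == "__main__":
    # Hypothetical index pattern, for illustration only.
    print(json.dumps(template_body(["logstash-*"], "ssd"), indent=2))
    # Settings update an age-based job could apply (PUT <index>/_settings).
    print(json.dumps(require_disktype(tier_for_age(age_days=8))))
```

Judging by the merge order in the timeline, new indices were first pinned to "hdd" before the SSD hosts joined the cluster, which is presumably what keeps existing shards from relocating onto the new nodes. New indices were then switched to "ssd", with an age-based job moving them back to "hdd" after 7 days.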

Event Timeline

herron triaged this task as Medium priority. Mar 11 2020, 12:16 AM
herron created this task.
Restricted Application added a subscriber: Aklapper. Mar 11 2020, 12:16 AM

Change 578653 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] elasticsearch: add 'disktype' param to configure node.attr.disktype

https://gerrit.wikimedia.org/r/578653

herron updated the task description. Mar 11 2020, 12:24 AM

Change 579019 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: let new logstash machines use role(insetup)

https://gerrit.wikimedia.org/r/579019

Change 579019 merged by Dzahn:
[operations/puppet@production] site: let new logstash machines use role(insetup)

https://gerrit.wikimedia.org/r/579019

Change 578653 merged by Herron:
[operations/puppet@production] elasticsearch: add 'disktype' param to configure node.attr.disktype

https://gerrit.wikimedia.org/r/578653

herron updated the task description. Mar 12 2020, 5:13 PM

Change 579338 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] ELK7: require disktype "hdd" for new indices

https://gerrit.wikimedia.org/r/579338

Change 579340 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: add new SSD hosts to ELK7 cluster with disktype attr "ssd"

https://gerrit.wikimedia.org/r/579340

Change 579422 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: add curator job to require disktype hdd after 7 days

https://gerrit.wikimedia.org/r/579422

fgiunchedi moved this task from Inbox to In progress on the observability board. Mar 16 2020, 2:18 PM
herron updated the task description. Mar 17 2020, 5:22 PM
herron updated the task description. Mar 17 2020, 5:30 PM
herron updated the task description.

Change 579338 merged by Herron:
[operations/puppet@production] ELK7: require disktype "hdd" for new indices

https://gerrit.wikimedia.org/r/579338

Change 579340 merged by Herron:
[operations/puppet@production] logstash: add new SSD hosts to ELK7 cluster with disktype attr "ssd"

https://gerrit.wikimedia.org/r/579340

herron updated the task description. Mar 25 2020, 8:17 PM

Change 583438 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] ELK7: require disktype "ssd" for new indices

https://gerrit.wikimedia.org/r/583438

Change 583438 merged by Herron:
[operations/puppet@production] ELK7: require disktype "ssd" for new indices

https://gerrit.wikimedia.org/r/583438

herron updated the task description. Mar 25 2020, 8:40 PM

Change 579422 merged by Herron:
[operations/puppet@production] ELK7: add curator job to require disktype hdd after 7 days

https://gerrit.wikimedia.org/r/579422

herron closed this task as Resolved. Mar 27 2020, 5:21 PM
herron updated the task description.
Dzahn added a subscriber: Dzahn. Jun 2 2020, 11:13 AM

logstash2028 has been reporting failed SSH for 2 days. There is nothing in SAL and no open ticket. Notifications are disabled, but that could be left over from a previous reinstall.

All other checks are UNKNOWN. Is this a known issue?

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=logstash2028

herron added a comment. Jun 2 2020, 6:21 PM

Strange. This host was semi-responsive over the serial console for a short period, then hung completely. I power cycled it via IPMI and it's back online now. Around the time this began there was a massive spike in iowait (https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&from=1590867150087&to=1591120410087&var-server=logstash2028&var-datasource=codfw%20prometheus%2Fops&var-cluster=logstash), but I do not see a clear reason for the spike in the logs leading up to the failure.

Dzahn added a comment. Jun 2 2020, 7:26 PM

Ah, thanks Herron!