
Create a cookbook for managing Logstash cluster restarts
Open, Needs Triage, Public

Description

A rolling restart of Logstash after a change to filter configuration, a custom script, or an index template is a fairly common task. The clearest indication of whether a restart has caused a problem is pipeline throughput, so the cookbook should monitor throughput to perform the restart safely.

The normal rolling restart procedure:

  1. Take all selected Logstash instances and batch them into execution groups by percentage per site
  2. Disable Puppet on the execution group
    1. This is to prevent Puppet from starting Logstash prematurely
  3. Stop Logstash on the execution group
  4. Watch Prometheus metric
    1. Event throughput on execution group should drop to 0
  5. Enable and run Puppet
    1. This step is optional and can be skipped
  6. Start Logstash
    1. Check that Logstash is running. Puppet may already have started it if step (5) was not skipped.
  7. Watch Prometheus metric
    1. Event throughput on execution group should be > 0
    2. On timeout:
      1. Stop Logstash on the execution group
      2. Disable Puppet on the execution group
      3. Notify the user
      4. Exit
  8. Ensure Puppet is enabled on the execution group
  9. Pick the next execution group and go to (2)
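
The procedure above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a proposed implementation: `batch_by_site`, `wait_for`, `rolling_restart`, and every method on the `ops` object are hypothetical names, and `ops` stands in for whatever actually talks to Puppet, Logstash, and Prometheus. The sketch assumes execution groups never span sites, and it raises an error if throughput does not drop to 0 in step 4, a case the procedure does not specify.

```python
import math
import time


def batch_by_site(hosts_by_site, percent):
    """Step 1: split each site's hosts into groups of about `percent` of that site.

    Assumption: a group never mixes hosts from different sites.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    groups = []
    for site in sorted(hosts_by_site):
        hosts = hosts_by_site[site]
        size = max(1, math.ceil(len(hosts) * percent / 100))
        for i in range(0, len(hosts), size):
            groups.append(hosts[i:i + size])
    return groups


def wait_for(predicate, timeout, interval, sleep=time.sleep, clock=time.monotonic):
    """Poll `predicate` until it returns True or `timeout` seconds elapse."""
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def rolling_restart(groups, ops, skip_puppet=False, timeout=300.0, interval=10.0):
    """Run steps 2-9 for each group. Returns True on success, False on abort.

    `ops` is a hypothetical object providing disable_puppet, enable_puppet,
    run_puppet, stop_logstash, start_logstash, throughput and notify.
    """
    for group in groups:
        ops.disable_puppet(group)   # step 2: keep Puppet from starting Logstash early
        ops.stop_logstash(group)    # step 3
        # Step 4: throughput on the group should drop to 0.
        if not wait_for(lambda g=group: ops.throughput(g) == 0, timeout, interval):
            # Assumption: the procedure does not say what to do here, so bail out.
            raise RuntimeError(f"throughput did not drop to 0 on {group}")
        if not skip_puppet:         # step 5 (optional)
            ops.enable_puppet(group)
            ops.run_puppet(group)
        ops.start_logstash(group)   # step 6: ensure Logstash is running
        # Step 7: throughput on the group should recover to > 0.
        if not wait_for(lambda g=group: ops.throughput(g) > 0, timeout, interval):
            ops.stop_logstash(group)
            ops.disable_puppet(group)
            ops.notify(f"throughput did not recover on {group}; aborting")
            return False
        ops.enable_puppet(group)    # step 8
    return True                     # step 9 is the loop itself
```

Passing the side-effecting operations in through `ops` keeps the control flow testable without touching real hosts; the timeout and polling interval defaults are placeholders.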

Related: