A rolling restart of Logstash is a fairly common task after a change to filter configuration, a custom script, or an index template. The clearest indication of whether the restart has caused a problem is pipeline throughput, so the procedure monitors it at each stage.
The normal rolling restart procedure:
1. Take all selected Logstash instances and batch them into execution groups by percentage per site
2. Disable Puppet on execution group
   - This is to prevent Puppet from starting Logstash prematurely
3. Stop Logstash on the execution group
4. Watch Prometheus metric
   - Event throughput on execution group should drop to 0
5. Enable and run Puppet
   - Skippable option
6. Start Logstash
   - Check and ensure Logstash is running. Puppet may do this action if not skipped in (5).
7. Watch Prometheus metric
   - Event throughput on execution group should be > 0
   - On timeout:
     - Stop Logstash on the execution group
     - Disable Puppet on the execution group
     - Notify the user
     - Exit
8. Ensure Puppet is enabled on execution group
9. Pick next execution group and GOTO: (2)
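
The control flow of the procedure above can be sketched in Python. This is a minimal illustration, not the actual tooling: `batch_by_site`, `wait_for`, `rolling_restart`, and the `ops` object with its methods (`disable_puppet`, `enable_puppet`, `run_puppet`, `stop_logstash`, `start_logstash`, `throughput`, `notify`) are all hypothetical names. The sketch also assumes that each execution group contains hosts from a single site, and that a timeout while waiting for throughput to drop to 0 aborts the run; the procedure itself only specifies timeout handling for the check after the restart.

```python
import math
import time


def batch_by_site(hosts_by_site, percent):
    """Split each site's hosts into groups of about `percent` % of that site."""
    groups = []
    for site in sorted(hosts_by_site):
        hosts = hosts_by_site[site]
        size = max(1, math.ceil(len(hosts) * percent / 100))
        groups.extend(hosts[i:i + size] for i in range(0, len(hosts), size))
    return groups


def wait_for(predicate, timeout, interval, sleep=time.sleep):
    """Poll `predicate` until it is true or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


def rolling_restart(groups, ops, timeout=300, interval=10, run_puppet=True,
                    sleep=time.sleep):
    """Restart each group in turn; return True only if every group recovers."""
    for group in groups:
        # Keep Puppet from starting Logstash prematurely, then stop Logstash.
        ops.disable_puppet(group)
        ops.stop_logstash(group)
        # Throughput should drop to 0. Aborting on timeout here is an
        # assumption; the procedure does not cover this case.
        if not wait_for(lambda: ops.throughput(group) == 0,
                        timeout, interval, sleep):
            ops.notify(f"throughput did not drop to 0 on {group}")
            return False
        # Optional Puppet run; Puppet may start Logstash itself.
        if run_puppet:
            ops.enable_puppet(group)
            ops.run_puppet(group)
        # Assumed to be harmless if Logstash is already running.
        ops.start_logstash(group)
        # Throughput should recover; on timeout, roll back and exit.
        if not wait_for(lambda: ops.throughput(group) > 0,
                        timeout, interval, sleep):
            ops.stop_logstash(group)
            ops.disable_puppet(group)
            ops.notify(f"no throughput after restart on {group}")
            return False
        # Leave Puppet enabled before moving on to the next group.
        ops.enable_puppet(group)
    return True
```

In this sketch `ops` stands in for whatever mechanism actually runs commands on the hosts and queries Prometheus. Passing it in as an object keeps the loop testable with a fake implementation and keeps the sketch from assuming any particular remote-execution or metrics API.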
Related: