The pyrra-filesystem-notify-thanos.path unit is failing and triggering the following alert: FIRING: [2x] SystemdUnitFailed: pyrra-filesystem-notify-thanos.path on titan1001:9100.
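The purpose of the pyrra-filesystem-notify-thanos.path unit is to reload Thanos every time Pyrra outputs a new rule file. However, Thanos is not expected to be reloaded many times in rapid succession, which is what happens when onboarding a new SLO: Pyrra writes several rule files quickly and the unit hits its start limit.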
Relevant IRC logs:
<jinxer-wm> FIRING: [2x] SystemdUnitFailed: pyrra-filesystem-notify-thanos.path on titan1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
1:54 PM <denisse> ^This happened yesterday. I think we'll need to modify the service.
1:55 PM <herron> yeah its hitting the start limit
1:56 PM <denisse> herron: I'm thinking of incresing the Start Limit, to something like "StartLimitBurst=10" and "StartLimitIntervalSec=60". What do you think?
1:57 PM <denisse> Another option would be to throttle the service restart but I think that we could lose information by doing that.
1:58 PM <denisse> I'm creating a task to track the issue and discuss possible solutions further.
1:58 PM <herron> I'm thinking something like TriggerLimitIntervalSec too
1:59 PM <herron> the idea is for thanos to be reloaded when pyrra outputs a new rule file, which happens several times quickly when onboarding a new slo
2:00 PM <denisse> I've created a task to track it, adding more info to it: https://phabricator.wikimedia.org/T364645
2:00 PM <herron> but we don't really want thanos to be sent a reload however many times in rapid succession
2:00 PM <herron> denisse: thanks
2:01 PM <denisse> Thanks for sharing context on the purpose of the service, I'm adding that info to the task.
2:04 PM <denisse> Looking at the `systemd` docs the `TriggerLimitIntervalSec` seems like a feasible solution too!
2:05 PM <denisse> However, I'm worried that it may require manual intervention because of how it behaves: "If the limit is hit, the socket unit is placed into a failure mode, and will not be connectible anymore until restarted."
2:06 PM <denisse> I think that `PollLimitIntervalSec` may be a better option for this as it would apply a temporary slowdown as compared to the permanent failure state that `TriggerLimitIntervalSec` would put the unit in. https://www.freedesktop.org/software/systemd/man/latest/systemd.socket.html#PollLimitIntervalSec=
2:06 PM <denisse> What do you think?
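
For reference, the first option from the discussion (raising the start limit) could look roughly like the drop-in below. This is only a sketch and has not been tested. It assumes the path unit activates the same-named pyrra-filesystem-notify-thanos.service (systemd's default when `Unit=` is not set) and that the start limit being hit is the service's; the drop-in file name is hypothetical.

```ini
# Hypothetical drop-in, e.g.:
# /etc/systemd/system/pyrra-filesystem-notify-thanos.service.d/start-limit.conf
# Assumption: the .path unit triggers the same-named .service unit.

[Unit]
# Values proposed in the IRC discussion:
# allow up to 10 starts of the service within any 60-second window.
StartLimitIntervalSec=60
StartLimitBurst=10
```

If something like this is deployed, a `systemctl daemon-reload` is needed for the drop-in to be picked up, and `systemctl reset-failed pyrra-filesystem-notify-thanos.path` would likely be needed to clear the current failed state before restarting the path unit.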