Page MenuHomePhabricator

The pyrra-filesystem-notify-thanos.path fails on titan1001 after reaching the start limit
Closed, ResolvedPublic

Description

The pyrra-filesystem-notify-thanos.path fails with the following message: FIRING: [2x] SystemdUnitFailed: pyrra-filesystem-notify-thanos.path on titan1001:9100.

The purpose of the pyrra-filesystem-notify-thanos.path service is to reload Thanos every time Pyrra outputs a new rule file however, it's not expected to be reloaded frequently in rapid succession.

Relevant IRC logs:

jinxer-wm> FIRING: [2x] SystemdUnitFailed: pyrra-filesystem-notify-thanos.path on titan1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
1:54 PM <denisse> ^This happened yesterday. I think we'll need to modify the service.
1:55 PM <herron> yeah its hitting the start limit
1:56 PM <denisse> herron: I'm thinking of incresing the Start Limit, to something like "StartLimitBurst=10" and "StartLimitIntervalSec=60". What do you think?
1:57 PM <denisse> Another option would be to throttle the service restart but I think that we could lose information by doing that.
1:58 PM <denisse> I'm creating a task to track the issue and discuss possible solutions further.
1:58 PM <herron> I'm thinking something like TriggerLimitIntervalSec too
1:59 PM <herron> the idea is for thanos to be reloaded when pyrra outputs a new rule file, which happens several times quickly when onboarding a new slo
2:00 PM <denisse> I've created a task to track it, adding more info to it: https://phabricator.wikimedia.org/T364645
2:00 PM <herron> but we don't really want thanos to be sent a reload however many times in rapid succession
2:00 PM <herron> denisse: thanks
2:01 PM <denisse> Thanks for sharing context on the purpose of the service, I'm adding that info to the task.
2:04 PM <denisse> Looking at the `systemd` docs the `TriggerLimitIntervalSec` seems like a feasible solution too!
2:05 PM <denisse> However, I'm worried that it may require manual intervention because of how it behaves: "If the limit is hit, the socket unit is placed into a failure mode, and will not be connectible anymore until restarted."
2:06 PM <denisse> I think that `PollLimitIntervalSec` may be a better option for this as it would apply a temporary slowdown as compared to the permanent failure state that `TriggerLimitIntervalSec` would put the unit in. https://www.freedesktop.org/software/systemd/man/latest/systemd.socket.html#PollLimitIntervalSec=
2:06 PM <denisse> What do you think?

Event Timeline

andrea.denisse renamed this task from The `pyrra-filesystem-notify-thanos.path` fails on titan1001 after reaching the start limit to The pyrra-filesystem-notify-thanos.path fails on titan1001 after reaching the start limit.May 10 2024, 6:01 PM

This started happening because I added a grouping workaround in the parent task where essentially the grouping is done by puppet instead of pyrra itself. It results in pyrra generating more output config files, e.g. now slo-$site.yaml vs what used to be slo.yaml

I'm wondering if increasing the start limits on pyrra-filesystem-notify-thanos.path along with adding and ExecStartPre=/bin/sleep 10 to the thanos-rule-reload.service would be a good enough approach to reload thanos rule only once when several pyrra output config files are changed rapidly. Thoughts?

A bit of a different route, though I can't remember if we have tried telling pyrra filesystem about "prometheus" being thanos-rule on localhost? i.e. --prometheus-url http://localhost:17902/rule/ ? In other words let pyrra filesystem effectively do the reload

A bit of a different route, though I can't remember if we have tried telling pyrra filesystem about "prometheus" being thanos-rule on localhost? i.e. --prometheus-url http://localhost:17902/rule/ ? In other words let pyrra filesystem effectively do the reload

This would be ideal, although as-is the prometheus-url is used to build links for each panel in the pyrra UI to prometheus (thanos in our case) for detailed query/metric view.

FWIW I opened https://github.com/pyrra-dev/pyrra/issues/986 about this last year which got some support but hasn't made any progress yet.

Change #1031050 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] pyrra-filesystem: increase StartLimits and delay notified unit

https://gerrit.wikimedia.org/r/1031050

A bit of a different route, though I can't remember if we have tried telling pyrra filesystem about "prometheus" being thanos-rule on localhost? i.e. --prometheus-url http://localhost:17902/rule/ ? In other words let pyrra filesystem effectively do the reload

This would be ideal, although as-is the prometheus-url is used to build links for each panel in the pyrra UI to prometheus (thanos in our case) for detailed query/metric view.

FWIW I opened https://github.com/pyrra-dev/pyrra/issues/986 about this last year which got some support but hasn't made any progress yet.

Thank you, indeed that's probably that issue I was vaguely remembering.

To clarify I'm referring to --prometheus-url http://localhost:17902/rule/ to pyrra filesystem, not pyrra api; AFAICS the prometheus client in filesystem is used only for reload so it should work as expected. The reload will work on titan hosts that do currently run thanos-rule, and will fail on hosts that don't run it which is benign

Thank you, indeed that's probably that issue I was vaguely remembering.

To clarify I'm referring to --prometheus-url http://localhost:17902/rule/ to pyrra filesystem, not pyrra api; AFAICS the prometheus client in filesystem is used only for reload so it should work as expected. The reload will work on titan hosts that do currently run thanos-rule, and will fail on hosts that don't run it which is benign

I see what you mean! Ok yes lets try it, I'll upload a patch

Change #1031492 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] pyrra-filesystem: set prom url to local thanos rule instance

https://gerrit.wikimedia.org/r/1031492

Change #1031492 merged by Herron:

[operations/puppet@production] pyrra-filesystem: set prom url to local thanos rule instance

https://gerrit.wikimedia.org/r/1031492

herron claimed this task.

Deployed an SLO change to Pyrra just now and looking much better -- No alerts, and thanos rule was automatically reloaded shortly after pyrra filesystem 😎

May 14 18:11:02 titan1001 systemd[1]: Started pyrra-filesystem.service - Pyrra filesystem operator and backend for the API..
May 14 18:11:02 titan1001 pyrra-filesystem[1006680]: level=info ts=2024-05-14T18:11:02.662356201Z caller=main.go:142 msg="using Prometheus" url=http://localhost:17902/rule/
May 14 18:11:02 titan1001 pyrra-filesystem[1006680]: level=info ts=2024-05-14T18:11:02.662430013Z caller=filesystem.go:131 msg="watching directory for changes" directory=/etc/pyrra/config
May 14 18:11:02 titan1001 pyrra-filesystem[1006680]: level=info ts=2024-05-14T18:11:02.662600494Z caller=filesystem.go:261 msg="starting up HTTP API" address=:9444

May 14 18:11:07 titan1001 thanos-rule[952]: level=info ts=2024-05-14T18:11:07.963527179Z caller=rule.go:895 component=rules msg="reload rule files" numFiles=61

Change #1031050 abandoned by Herron:

[operations/puppet@production] pyrra-filesystem: increase StartLimits and delay notified unit

Reason:

went with If9b087d05909aa9235e35c32ee8a0662a2cdff2f instead

https://gerrit.wikimedia.org/r/1031050