The pyrra-filesystem-notify-thanos.path fails on titan1001 after reaching the start limit
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	andrea.denisse
	May 10 2024, 5:59 PM

Description

The pyrra-filesystem-notify-thanos.path fails with the following message: FIRING: [2x] SystemdUnitFailed: pyrra-filesystem-notify-thanos.path on titan1001:9100.

The purpose of the pyrra-filesystem-notify-thanos.path service is to reload Thanos every time Pyrra outputs a new rule file however, it's not expected to be reloaded frequently in rapid succession.

Relevant IRC logs:

jinxer-wm> FIRING: [2x] SystemdUnitFailed: pyrra-filesystem-notify-thanos.path on titan1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
1:54 PM <denisse> ^This happened yesterday. I think we'll need to modify the service.
1:55 PM <herron> yeah its hitting the start limit
1:56 PM <denisse> herron: I'm thinking of incresing the Start Limit, to something like "StartLimitBurst=10" and "StartLimitIntervalSec=60". What do you think?
1:57 PM <denisse> Another option would be to throttle the service restart but I think that we could lose information by doing that.
1:58 PM <denisse> I'm creating a task to track the issue and discuss possible solutions further.
1:58 PM <herron> I'm thinking something like TriggerLimitIntervalSec too
1:59 PM <herron> the idea is for thanos to be reloaded when pyrra outputs a new rule file, which happens several times quickly when onboarding a new slo
2:00 PM <denisse> I've created a task to track it, adding more info to it: https://phabricator.wikimedia.org/T364645
2:00 PM <herron> but we don't really want thanos to be sent a reload however many times in rapid succession
2:00 PM <herron> denisse: thanks
2:01 PM <denisse> Thanks for sharing context on the purpose of the service, I'm adding that info to the task.
2:04 PM <denisse> Looking at the `systemd` docs the `TriggerLimitIntervalSec` seems like a feasible solution too!
2:05 PM <denisse> However, I'm worried that it may require manual intervention because of how it behaves: "If the limit is hit, the socket unit is placed into a failure mode, and will not be connectible anymore until restarted."
2:06 PM <denisse> I think that `PollLimitIntervalSec` may be a better option for this as it would apply a temporary slowdown as compared to the permanent failure state that `TriggerLimitIntervalSec` would put the unit in. https://www.freedesktop.org/software/systemd/man/latest/systemd.socket.html#PollLimitIntervalSec=
2:06 PM <denisse> What do you think?

Details

	Subject	Repo	Branch	Lines +/-
	pyrra-filesystem: increase StartLimits and delay notified unit	operations/puppet	production	+5 -0
	pyrra-filesystem: set prom url to local thanos rule instance	operations/puppet	production	+4 -2

Customize query in gerrit

Related Objects
Search...

Status	Assigned	Task
Open	None	T290924 Tooling for end-of-quarter SLO reporting
Open	None	T302995 Transition to Pyrra for SLO Visualization and Management
Resolved	herron	T364645 The pyrra-filesystem-notify-thanos.path fails on titan1001 after reaching the start limit

Event Timeline

andrea.denisse created this task.May 10 2024, 5:59 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMay 10 2024, 5:59 PM

herron updated the task description. (Show Details)May 10 2024, 6:00 PM

andrea.denisse renamed this task from The `pyrra-filesystem-notify-thanos.path` fails on titan1001 after reaching the start limit to The pyrra-filesystem-notify-thanos.path fails on titan1001 after reaching the start limit.May 10 2024, 6:01 PM

herron added a parent task: T302995: Transition to Pyrra for SLO Visualization and Management.May 10 2024, 6:02 PM

andrea.denisse updated the task description. (Show Details)May 10 2024, 6:03 PM

andrea.denisse updated the task description. (Show Details)May 10 2024, 6:10 PM

This started happening because I added a grouping workaround in the parent task where essentially the grouping is done by puppet instead of pyrra itself. It results in pyrra generating more output config files, e.g. now slo-$site.yaml vs what used to be slo.yaml

I'm wondering if increasing the start limits on pyrra-filesystem-notify-thanos.path along with adding and ExecStartPre=/bin/sleep 10 to the thanos-rule-reload.service would be a good enough approach to reload thanos rule only once when several pyrra output config files are changed rapidly. Thoughts?

A bit of a different route, though I can't remember if we have tried telling pyrra filesystem about "prometheus" being thanos-rule on localhost? i.e. --prometheus-url http://localhost:17902/rule/ ? In other words let pyrra filesystem effectively do the reload

In T364645#9789322, @fgiunchedi wrote:

A bit of a different route, though I can't remember if we have tried telling pyrra filesystem about "prometheus" being thanos-rule on localhost? i.e. --prometheus-url http://localhost:17902/rule/ ? In other words let pyrra filesystem effectively do the reload

This would be ideal, although as-is the prometheus-url is used to build links for each panel in the pyrra UI to prometheus (thanos in our case) for detailed query/metric view.

FWIW I opened https://github.com/pyrra-dev/pyrra/issues/986 about this last year which got some support but hasn't made any progress yet.

Change #1031050 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] pyrra-filesystem: increase StartLimits and delay notified unit

https://gerrit.wikimedia.org/r/1031050

gerritbot added a project: Patch-For-Review.May 13 2024, 7:23 PM

In T364645#9790603, @herron wrote:

In T364645#9789322, @fgiunchedi wrote:

A bit of a different route, though I can't remember if we have tried telling pyrra filesystem about "prometheus" being thanos-rule on localhost? i.e. --prometheus-url http://localhost:17902/rule/ ? In other words let pyrra filesystem effectively do the reload

This would be ideal, although as-is the prometheus-url is used to build links for each panel in the pyrra UI to prometheus (thanos in our case) for detailed query/metric view.

FWIW I opened https://github.com/pyrra-dev/pyrra/issues/986 about this last year which got some support but hasn't made any progress yet.

Thank you, indeed that's probably that issue I was vaguely remembering.

To clarify I'm referring to --prometheus-url http://localhost:17902/rule/ to pyrra filesystem, not pyrra api; AFAICS the prometheus client in filesystem is used only for reload so it should work as expected. The reload will work on titan hosts that do currently run thanos-rule, and will fail on hosts that don't run it which is benign

In T364645#9793355, @fgiunchedi wrote:

Thank you, indeed that's probably that issue I was vaguely remembering.

To clarify I'm referring to --prometheus-url http://localhost:17902/rule/ to pyrra filesystem, not pyrra api; AFAICS the prometheus client in filesystem is used only for reload so it should work as expected. The reload will work on titan hosts that do currently run thanos-rule, and will fail on hosts that don't run it which is benign

I see what you mean! Ok yes lets try it, I'll upload a patch

Change #1031492 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] pyrra-filesystem: set prom url to local thanos rule instance

https://gerrit.wikimedia.org/r/1031492

Change #1031492 merged by Herron:

[operations/puppet@production] pyrra-filesystem: set prom url to local thanos rule instance

https://gerrit.wikimedia.org/r/1031492

Deployed an SLO change to Pyrra just now and looking much better -- No alerts, and thanos rule was automatically reloaded shortly after pyrra filesystem 😎

May 14 18:11:02 titan1001 systemd[1]: Started pyrra-filesystem.service - Pyrra filesystem operator and backend for the API..
May 14 18:11:02 titan1001 pyrra-filesystem[1006680]: level=info ts=2024-05-14T18:11:02.662356201Z caller=main.go:142 msg="using Prometheus" url=http://localhost:17902/rule/
May 14 18:11:02 titan1001 pyrra-filesystem[1006680]: level=info ts=2024-05-14T18:11:02.662430013Z caller=filesystem.go:131 msg="watching directory for changes" directory=/etc/pyrra/config
May 14 18:11:02 titan1001 pyrra-filesystem[1006680]: level=info ts=2024-05-14T18:11:02.662600494Z caller=filesystem.go:261 msg="starting up HTTP API" address=:9444

May 14 18:11:07 titan1001 thanos-rule[952]: level=info ts=2024-05-14T18:11:07.963527179Z caller=rule.go:895 component=rules msg="reload rule files" numFiles=61

Change #1031050 abandoned by Herron:

[operations/puppet@production] pyrra-filesystem: increase StartLimits and delay notified unit

Reason:

went with If9b087d05909aa9235e35c32ee8a0662a2cdff2f instead

https://gerrit.wikimedia.org/r/1031050

Maintenance_bot removed a project: Patch-For-Review.May 14 2024, 6:31 PM

The pyrra-filesystem-notify-thanos.path fails on titan1001 after reaching the start limitClosed, ResolvedPublicActions

Description

Details

Related ObjectsSearch...

Event Timeline

The pyrra-filesystem-notify-thanos.path fails on titan1001 after reaching the start limit
Closed, ResolvedPublic
Actions

Related Objects
Search...