Finalize Charts SLO
Closed, Resolved (Public)

Description

https://wikitech.wikimedia.org/wiki/SLO/Charts is almost finalized, but not quite. We need to look at the data that we've gathered so far to get an idea for what target values would be realistic, then commit to those.

  • Decide on a maximum acceptable latency for the chart-renderer service. This will be the cutoff value in the combined latency-availability SLI for the service.
  • Decide on a target percentage for the latency-availability SLI for the service (what percentage of requests should complete faster than the cutoff value)
  • Decide on a target percentage for the availability SLI for client-side rendering
  • Add a graph for the combined latency-availability SLI to the Grafana dashboard
  • Update https://wikitech.wikimedia.org/wiki/SLO/Charts with these values and mark it as complete
  • Define the SLOs in Pyrra
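
One way to approach the first two items is to check, for a few candidate cutoffs, what share of the observed requests would have completed in time. The sketch below is only an illustration: the function name `fraction_within` and the sample latencies are made up, and real values would come from the service's metrics rather than a hard-coded list.

```python
def fraction_within(latencies_ms: list[float], cutoff_ms: float) -> float:
    """Share of requests whose latency is at or below the cutoff."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    within = sum(1 for value in latencies_ms if value <= cutoff_ms)
    return within / len(latencies_ms)


# Made-up latency samples in milliseconds, not real chart-renderer data.
observed = [45, 60, 80, 95, 110, 150, 180, 220, 400, 900]

# Compare a few candidate cutoffs to see which target would be realistic.
for cutoff in (100, 200, 500):
    print(f"cutoff {cutoff} ms: {fraction_within(observed, cutoff):.0%} of requests")
```

A cutoff is realistic when the share it yields on historical data sits comfortably above the target percentage being considered.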

Event Timeline

@Catrope Hi! Any update on the timeline for this task? :)

@egardner Hi! I see that Roan is on holiday, do you have any update on this? Thanks in advance :)

ovasileva triaged this task as Medium priority. Aug 4 2025, 4:15 PM

Getting back to this: the next step is adding the Pyrra config. We are experimenting with a new revision-based config, so I may wait a couple more days to avoid having to migrate Charts to it afterwards :)

To recap, these are the SLIs:

chart-renderer service: Combined latency-availability

The percentage of all requests that complete within 200 milliseconds and receive a non-error response (HTTP 200).
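
As a rough illustration of this definition, the sketch below counts a request as "good" only if it returned HTTP 200 and finished within the 200 ms cutoff. The `Request` type, the function names and the sample data are hypothetical, and treating exactly 200 ms as "within" is an assumption; the real SLI would be computed from Prometheus metrics, not from individual request records.

```python
from dataclasses import dataclass

LATENCY_CUTOFF_MS = 200  # cutoff from the SLI definition above


@dataclass
class Request:
    status: int        # HTTP status code of the response
    latency_ms: float  # time taken to serve the request


def is_good(request: Request) -> bool:
    """A request is good if it succeeded AND was fast enough."""
    return request.status == 200 and request.latency_ms <= LATENCY_CUTOFF_MS


def latency_availability_sli(requests: list[Request]) -> float:
    """Fraction of all requests that were good."""
    if not requests:
        # Assumption: no traffic counts as meeting the objective.
        return 1.0
    good = sum(1 for request in requests if is_good(request))
    return good / len(requests)


# Made-up sample: one slow success, one fast error, two good requests.
sample = [
    Request(200, 120),
    Request(200, 350),
    Request(500, 80),
    Request(200, 199),
]
print(f"SLI: {latency_availability_sli(sample):.0%}")
```

Both failure modes (an error response, or a success that is too slow) lower the same number, which is what makes the SLI "combined".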

In the dashboard the relevant graph seems to be this one, but I am not sure where the metric is emitted from. The chart-renderer service is deployed in k8s behind the Istio Gateway, and we usually use these metrics for Pyrra. What do you think?

Chart client side rendering: Availability

The percentage of client-side rendering attempts that successfully display a chart

In this case the metrics to use seem to be mediawiki_Chart_render_failure_total and mediawiki_Chart_render_end_total, am I correct?

chart-renderer service: Combined latency-availability

The percentage of all requests that complete within 200 milliseconds and receive a non-error response (HTTP 200).

In the dashboard the relevant graph seems to be this one, but I am not sure where the metric is emitted from. The chart-renderer service is deployed in k8s behind the Istio Gateway, and we usually use these metrics for Pyrra. What do you think?

I have no idea where either of those metrics come from or how they are different. I think @CDanis might be the one who set up the dashboard originally so they might know. My guess is that the Istio metrics you linked to are probably fine.

Chart client side rendering: Availability

The percentage of client-side rendering attempts that successfully display a chart

In this case the metrics to use seem to be mediawiki_Chart_render_failure_total and mediawiki_Chart_render_end_total, am I correct?

Yes, that is correct. Specifically, render_failure_total is the number of failed renders and render_end_total is the number of successful renders, so the success percentage is render_end_total / (render_end_total + render_failure_total). That formula is used in the Grafana chart too (except that it graphs the failure percentage instead).
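
For reference, the formula above can be sketched in a few lines of Python. This is only an illustration: the function name and the counter values are made up, and in practice the two counters would be queried from Prometheus over the SLO window rather than passed in by hand.

```python
def render_success_ratio(render_end_total: float, render_failure_total: float) -> float:
    """Fraction of client-side render attempts that displayed a chart."""
    attempts = render_end_total + render_failure_total
    if attempts == 0:
        # Assumption: with no render attempts, report full availability.
        return 1.0
    return render_end_total / attempts


# Hypothetical counter increases over some window, not real data.
success = render_success_ratio(render_end_total=9_990, render_failure_total=10)
failure = 1 - success  # what the Grafana chart plots, per the comment above
print(f"success: {success:.2%}, failure: {failure:.2%}")
```

Note that the denominator is the sum of both counters, since render_end_total counts only the successful renders, not all attempts.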

Catrope reassigned this task from Jdlrobson-WMF to elukey.
Catrope moved this task from Needs Refinement to Radar on the Reader Growth Team board.

The only remaining work here is for @elukey to configure the SLO in Pyrra, so reopening this task and assigning to him.

Thanks a lot for the patience folks, we have stopped onboarding new SLOs in Pyrra temporarily while we figure out T403729. We are comparing the results with another tool in T404171, so we are trying to avoid onboarding something on Pyrra to then do it again on the new tool. I hope that this will be solved soon!

Luca, do you want an early test subject for the Sloth trial?

Definitely, the first use case will be Citoid so we can make a comparison with Pyrra, but Charts could definitely be the second one!

Change #1190620 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] pyrra: add the Charts SLOs

https://gerrit.wikimedia.org/r/1190620

Change #1190620 merged by Elukey:

[operations/puppet@production] pyrra: add the Charts SLOs

https://gerrit.wikimedia.org/r/1190620

Dashboards look good to me, let's call this finalized!