Verify that the Pyrra dashboard is measuring what we think it is, and what it should be measuring
Open, High, Public

Description

The Wikifunctions SLO's error budget is currently very red:

The trend looks good, but from the Grafana dashboard it seems clear that at the moment we are not meeting the expected availability. We should figure out whether this is something related to Wikifunctions itself (to be fixed, and hence a show-stopper for future releases) or whether we have a problem with the metrics/Pyrra config.

Event Timeline

elukey triaged this task as High priority.
cmassaro renamed this task from Review the Wikifunction's SLO to Review the Wikifunctions SLO. Oct 22 2025, 5:40 PM
cmassaro renamed this task from Review the Wikifunctions SLO to Verify that the Pyrra dashboard is measuring what we think it is, and what it should be measuring.

I took a look at the Grafana dashboard, and the error ratio trend has been improving since Oct 1st:

Screenshot From 2025-10-23 11-37-16.png (910×1 px, 65 KB)

From the SAL, the timing seems to match the eqiad K8s cluster being upgraded to the new version (so completely wiped, redeployed, and upgraded).

The error ratio then dropped to a minimum, only to rise again early UTC on Oct 9th (https://sal.toolforge.org/production?p=5&q=&d=2025-10-09 shows some MW deployments, but I can't tell whether anything is connected/related).

The error budget is rising, as described in How_to_read_a_Pyrra_dashboard, because the most recent datapoints are better AFAICS. If WF keeps following this trend, we should see the error budget increase more and more.

The main question is: is the trend related to traffic or to improvements that have been happening?

Tried to plot:

The latter shows a similar pattern, and overall the latency looks better over the past few days (no more above-10s spikes).

The improvement seems to have stopped: we still see a positive error budget, but the error rate keeps fluctuating between what the SLO target accepts and what it doesn't. Given that the error budget was completely burned, we should focus on reliability first and establish two things:

  1. Is the SLO trend highlighted above reflected in WF's own signals/metrics? In other words, does it match the performance that you are seeing?
  2. Depending on the above, we may have an SLI metric problem to solve :)

After a chat with the AW team last week I tried to follow up again on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1192609, and I think I now understand what James is trying to do. I'll write down my understanding:

  • The metric that we are using in Pyrra, mediawiki_WikiLambda_mw_to_orchestrator_api_call_seconds_bucket, is generated by MediaWiki, and it accounts for the time taken by the Wikifunctions service on k8s plus the time it takes for the response to get back to MediaWiki.
  • The Wikifunctions k8s service is set to take at most 10s to render a call/request.
  • The Back-end API combined latency-availability SLI states that we are counting the requests that take more than 10s to be rendered.

So at the moment the SLI metric that we are using may record a request that hits the 10s render limit on the k8s backend as taking 10s+x, where x is a non-negligible number of milliseconds that MediaWiki spends getting the reply back from k8s. If we could exclude these extra milliseconds from the SLI metric, we'd probably see a nice, green SLO. One option would be to use a different set of SLI metrics, namely the Istio ones (Istio is the gateway that sits between envoy/wf-pods and MediaWiki). For example, this is the dashboard and this is the p99 graph.
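To make the skew concrete, here is a toy sketch (all numbers invented, not real Wikifunctions data) of how measuring latency at MediaWiki rather than at the backend pushes capped requests over the 10s boundary:

```python
# Toy model of the measurement-point skew described above. The k8s backend
# caps rendering at 10s, but MediaWiki observes backend time plus a network
# overhead x, so requests that hit the cap land just above the 10s bucket.

THRESHOLD_S = 10.0
NETWORK_OVERHEAD_S = 0.050  # hypothetical 50ms MediaWiki <-> k8s overhead

# Backend render times: the service itself never exceeds the 10s cap.
backend_latencies = [0.8, 2.5, 9.9, 10.0, 10.0]  # the last two hit the cap

# What MediaWiki's metric records for each request.
observed_latencies = [t + NETWORK_OVERHEAD_S for t in backend_latencies]

slow_at_backend = sum(t > THRESHOLD_S for t in backend_latencies)
slow_at_mediawiki = sum(t > THRESHOLD_S for t in observed_latencies)

print(slow_at_backend)    # 0: the backend never crosses 10s
print(slow_at_mediawiki)  # 2: the capped requests show up as >10s
```

So even a service that perfectly respects its 10s cap would burn error budget under an SLI measured at MediaWiki.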

The main problem that I see, though, is respecting the following SLO:

The percentage of all requests that complete within the 10s threshold and receive a non-error response, defined as above, shall be at least the limit.

10 s limit enforced by k8s request logic.

If k8s enforces a maximum 10s timeout, IIUC this means that no request could possibly be logged as taking more than 10s, given how the service is configured. Hence the SLO should be a solid 100% green; but then, what are we measuring? My understanding is that we should keep the number of 10s timeouts as low as possible, and that is what the current SLI is measuring. Am I missing anything?
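For reference, the combined latency-availability definition quoted above can be sketched as follows (illustrative request data and a placeholder target, not the real Wikifunctions values):

```python
# Minimal sketch of the combined latency-availability SLI from the SLO text:
# "requests that complete within the 10s threshold and receive a non-error
# response". The sample data and the target below are invented.

THRESHOLD_S = 10.0
SLO_TARGET = 0.99  # placeholder, not the real Wikifunctions target

# (latency_seconds, http_status) samples; a 504 is what WF reportedly
# returns when the 10s timeout fires.
requests = [
    (0.4, 200), (3.2, 200), (9.8, 200),
    (10.0, 504),   # timed out: fails both the latency and error criteria
    (1.1, 500),    # fast but an error: still not a "good" request
]

good = sum(1 for lat, status in requests
           if lat < THRESHOLD_S and status < 500)
sli = good / len(requests)

print(f"SLI = {sli:.2f}, meets target: {sli >= SLO_TARGET}")
```

This is why the timeout matters for both halves of the definition: a request that hits the cap is simultaneously "too slow" and (as a 504) "an error".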

Had a chat with David and other folks from the AW team: the rock-solid, always-100% target is what they aim for to verify that everything works as they expect, before iterating towards more precise values.

One caveat if we use Istio metrics, though: T392886 lists the buckets, and the one below 10s is 5s, which is maybe too big a jump for the final value. Now that prod is running a more up-to-date Istio, we could in theory customize the bucket list for each service, in which case we'd be able to create what we need. I'll follow up on this if you like the idea, lemme know!
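A quick sketch of why the bucket jump matters (only the 5s/10s pair comes from the discussion; the other bucket bounds are illustrative): with cumulative histogram buckets you only know which bucket a request fell into, so any SLO threshold between 5s and 10s cannot be evaluated.

```python
# Illustrative subset of histogram bucket upper bounds, in seconds.
buckets = [0.5, 1, 2.5, 5, 10]

def bucket_for(latency):
    """Return the smallest bucket upper bound containing the latency."""
    for ub in buckets:
        if latency <= ub:
            return ub
    return float("inf")  # +Inf bucket

# A 5.5s request and a 9.9s request land in the same bucket, so a future
# SLO threshold of, say, 8s could not be computed from these buckets.
print(bucket_for(5.5))  # 10
print(bucket_for(9.9))  # 10
print(bucket_for(4.0))  # 5
```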

@Jdforrester-WMF to review and confirm or revise what I've said above :)

Is the k8s-level 10s timeout configurable (e.g. to 10.2s)? If not, it seems that the easiest path forward is just to reduce our internal service timeouts to 10s-x, no?

@cmassaro I think it is probably something in the Docker image / WF service itself; I haven't found a k8s configuration that triggers the 10s timeout yet.

I reviewed T392886 for the Istio metrics, and changing the buckets below 10s (e.g. adding more once you're ready to lower the threshold in the SLO) is not super easy, so it is definitely quicker to do it in mediawiki_WikiLambda_mw_to_orchestrator_api_call_seconds_bucket if you need special time granularity.

Ah, apologies, I was asking about the bucket (which it sounds like we can configure). The 10s timeout on our side is configured in helm.

Exactly, yes, it is here. We also have an extra Envoy timeout set to 15s, IIUC as an extra fence, but the one that counts is the orchestrator's.

@DSantamaria while we decide the best approach, it would be great to also discuss the path that we'll take after this first round of configuration. IIUC from our last discussion, the 10s bucket is only the first step, after which we'll find a more suitable/realistic target for the SLO (one that wouldn't always report an error budget of 100%). Does AW have a number in mind? What steps are we going to take to find the right value?

@elukey No, we do not have a number in mind. Our approach is going to be to iterate on those metrics (not just that one), but @Jdforrester-WMF can correct me if I am mistaken.

So, this is still very unclear to me. If I understand correctly, there are two options:

  • we reduce the timeout you've linked here to 9700ms, OR
  • somehow, mediawiki_WikiLambda_mw_to_orchestrator_api_call_seconds_bucket gets changed and the 10s bucket becomes 10.2s or something.

The former is very easy for us to handle on our side. Is that the ideal path forward, or is it possible (and desirable) to do the latter?
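For illustration, the first option amounts to a one-line chart change along these lines (the key names below are hypothetical; the real setting lives in the wikifunctions helm chart values linked earlier):

```yaml
# Hypothetical sketch only: the key names are invented, not the actual
# wikifunctions chart schema.
orchestrator:
  # Reduced from 10000ms so requests killed by the orchestrator's own
  # timeout still land under the 10s SLI bucket once the MediaWiki <-> k8s
  # round-trip overhead is added on top.
  requestTimeoutMs: 9700
```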

Yes, I think the timeout reduction is OK; my comment about mediawiki_WikiLambda_mw_to_orchestrator_api_call_seconds_bucket was more about the future, when you'll need to set new latency buckets :)

@DSantamaria totally fine by me, but to be on the same page: the SLO will not be considered done until we find the real/final value, because otherwise we'll be watching the wrong reliability metrics. I am available throughout the whole refinement/iteration process to provide SRE support :)

@Jdforrester-WMF @cmassaro one thing that I am wondering - what is the HTTP response code that Wikifunctions returns when a request hits the 10s timeout?

Change #1205263 had a related patch set uploaded (by Cory Massaro; author: Cory Massaro):

[operations/deployment-charts@master] Bump the orchestrator timeout down a skosh.

https://gerrit.wikimedia.org/r/1205263

Change #1205263 merged by jenkins-bot:

[operations/deployment-charts@master] wikifunctions: Bump the orchestrator timeout down a skosh

https://gerrit.wikimedia.org/r/1205263

We've now deployed with the slightly reduced timeout. I hope to see the SLO number at 100% in the coming days, but let's see.

@Jdforrester-WMF @cmassaro one thing that I am wondering - what is the HTTP response code that Wikifunctions returns when a request hits the 10s timeout?

I believe it's a 504.

Looks like the new timeout value is still too generous. Might need to bump it down to 9s. @Jdforrester-WMF , what do you think?

Seems fine to me.

@Jdforrester-WMF @cmassaro one thing that I am wondering - what is the HTTP response code that Wikifunctions returns when a request hits the 10s timeout?

I believe it's a 504.

OK, and just to be sure: this will be recorded as HTTP 200 by mediawiki_WikiLambda_mw_to_orchestrator_api_call_seconds_bucket, right?

Looks like the new timeout value is still too generous. Might need to bump it down to 9s. @Jdforrester-WMF , what do you think?

While checking https://grafana.wikimedia.org/goto/G0VG73mDR?orgId=1 I didn't see any evidence of this; what metric are you checking? I'm not opposed to the change to 9s, I just wanted to know the rationale to better understand the process :)

Given what was said in the meeting, it sounds like the change to 9s won't necessarily help.

To recapitulate, it sounds like you'll keep tracking this down on your side, and we should look for any requests that crossed the 10s threshold. Is that right?

Today I took a look at spikes like the following, shown in Grafana by the Pyrra metrics:

Screenshot From 2025-11-25 15-38-46.png (2×3 px, 565 KB)

I tracked the issue down to the underlying mediawiki_WikiLambda_mw_to_orchestrator_api_call_seconds_count metric, which we expect to be a counter but in reality is not:

Screenshot From 2025-11-25 15-39-12.png (1×4 px, 338 KB)

(https://grafana.wikimedia.org/goto/EH6MoRZDR?orgId=1)

The timing matches perfectly with the last time the statsd-exporter pod was deployed in codfw:

root@deploy2002:~# kubectl get pods -n mw-wikifunctions 
NAME                                             READY   STATUS    RESTARTS   AGE
mw-wikifunctions.codfw.group0-5c88b7c957-2sndz   8/8     Running   0          4m56s
mw-wikifunctions.codfw.group0-5c88b7c957-772l2   8/8     Running   0          4m47s
mw-wikifunctions.codfw.group1-84b497c77f-85n2k   8/8     Running   0          4m59s
mw-wikifunctions.codfw.group1-84b497c77f-dq2sp   8/8     Running   0          4m52s
mw-wikifunctions.codfw.group1-84b497c77f-kvqmk   8/8     Running   0          4m43s
mw-wikifunctions.codfw.group1-84b497c77f-ltkff   8/8     Running   0          4m21s
mw-wikifunctions.codfw.group1-84b497c77f-tkrvq   8/8     Running   0          4m36s
mw-wikifunctions.codfw.group1-84b497c77f-xs76f   8/8     Running   0          4m29s
mw-wikifunctions.codfw.group2-794b4b8454-6z6vs   8/8     Running   0          4m58s
statsd-exporter-prometheus-7c8d4f77c9-hn6mx      1/1     Running   0          25d           <============================

And this is expected: the pod collects metrics from the other mw-wikifunctions pods and then offers a Prometheus HTTP API reporting the aggregates. It doesn't hold any previous state, so all the counters restart from zero the moment the pod is created :)

Pyrra uses the increase/rate Prometheus functions, which work consistently only if the metric is a counter (i.e., never decreasing); on a metric reset they can show weird spikes. This is not the first time this has happened; we are exploring ideas like using the clamp/deriv Prometheus functions, but Pyrra doesn't support them at the moment. I don't have a good solution in mind yet; I'll have a chat with the SLO WG and report back!
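To illustrate the reset problem, here is a toy sketch (a simplified per-sample model inspired by Prometheus' reset handling, not the actual implementation):

```python
# Why rate()/increase() assume a monotonic counter: a naive delta breaks
# on reset, and the reset-aware variant compensates by assuming the
# counter restarted from zero.

def naive_increase(samples):
    """Last minus first: goes negative when the counter resets."""
    return samples[-1] - samples[0]

def reset_aware_increase(samples):
    """Treat any decrease as a counter reset (restart from zero): the
    post-drop value is counted as growth since the reset."""
    total = 0
    for prev, cur in zip(samples, samples[1:]):
        total += cur if cur < prev else cur - prev
    return total

# A counter that resets when the statsd-exporter pod is redeployed.
samples = [100, 150, 200, 5, 30]  # drops to ~0 at the pod restart

print(naive_increase(samples))        # -70: nonsense
print(reset_aware_increase(samples))  # 130: 100 before the reset + 30 after
```

The flip side: when a series dips for any reason other than a genuine restart from zero (e.g. bad cached datapoints), the same compensation treats the dip as a reset and over-counts, which shows up as exactly the kind of spurious spike described above.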

Change #1211177 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] pyrra: wikifunctions: add eqiad/codfw site variants

https://gerrit.wikimedia.org/r/1211177

Comment on https://gerrit.wikimedia.org/r/1211177, patchset 1: I think it could be a good test, but I would try to explain why we get the difference outlined in https://w.wiki/GHoH, because IIUC we should really see the drops in the first place. Maybe there is something extra that we are not seeing?

Found a potential lead: for the most part, these dips seem to correlate with thanos-rule rule evaluation aborts.

Here's an updated view of https://w.wiki/GHoH with thanos_rule_evaluation_with_warnings_total included as the last panel: https://w.wiki/GJ6m

Screenshot 2025-11-25 at 2.13.52 PM.png (2×4 px, 694 KB)

Thanos ruler points to query-frontend as its thanos querier:

/usr/bin/thanos rule ... --query http://localhost:16902
...
/usr/bin/thanos query-frontend ... --http-address 0.0.0.0:16902 ...

Since the Query Frontend was relying on a bad cache (T411273: Thanos (store|query-frontend) memcached cache in bad status), I suspect that the dips in the recorded rules could have been caused by incorrect data.

High-level summary: while reviewing the Pyrra availability graphs with the Abstract Wikipedia team, we noticed several things that didn't make sense, like short, severe drops that also affected the downstream error budget calculations. After an investigation with Observability, it seems that Thanos, the system that Pyrra uses to create efficient and more compact time series / recording rules from the SLI metrics, has some consistency issues with its internal caching, and sometimes it ends up storing wrong datapoints/values in its long-term storage.
The Wikifunctions SLI metrics seem to be the most affected; we are going to investigate the issue and report back with a more permanent solution. Sadly, the old time series already in the Thanos long-term storage cannot easily be refreshed/overwritten, so the history may need to remain as it is.

Please note that the long-term error budget trend, while still not precise, still has some meaning: I checked the Pyrra dashboard this morning and noticed a clear regression happening from the 29th onward. @cmassaro is there anything that matches the regression on your side? New code/features/etc.?

Change #1211177 abandoned by Herron:

[operations/puppet@production] pyrra: wikifunctions: add eqiad/codfw site variants

https://gerrit.wikimedia.org/r/1211177

Updating the task after our last chat. The current dashboard seems to be recovering really well, and it is trending towards 100%:

Screenshot From 2026-01-13 12-21-05.png (2×3 px, 323 KB)

The error budget drop was related to a k8s pod issue: the container running Python and executing Wikifunctions sometimes locked up, failing to reply to clients, but the k8s readiness probe failed to recognize it.