ErrorBudgetBurn (sticky-headers, part 2)
Closed, Resolved · Public

Description

Common information

  • alertname: ErrorBudgetBurn
  • exhaustion: 2w
  • long: 1d
  • recorder: thanos-rule@main
  • revision: 1
  • service: mpic
  • severity: warning
  • short: 2h
  • slo: xlab-standalone-event-system-success-rate-v1
  • source: thanos
  • team: experiment-platform

Firing alerts


  • alertname: ErrorBudgetBurn
  • exhaustion: 2w
  • long: 1d
  • recorder: thanos-rule@main
  • revision: 1
  • service: mpic
  • severity: warning
  • short: 2h
  • slo: xlab-standalone-event-system-success-rate-v1
  • source: thanos
  • team: experiment-platform
  • Source
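The `short: 2h` / `long: 1d` / `exhaustion: 2w` labels suggest a multiwindow burn-rate alert in the style commonly used for SLO monitoring. A minimal sketch of that logic, where the threshold value and the requirement that both windows fire are assumptions rather than details taken from the actual Thanos recording rules:

```python
# Hypothetical multiwindow burn-rate check implied by the alert labels
# above (short=2h, long=1d). Threshold is an assumed value.

SLO_TARGET = 0.999             # 99.9% success objective
ERROR_BUDGET = 1 - SLO_TARGET  # 0.1% of requests may fail

def burn_rate(observed_error_rate: float) -> float:
    """How many times faster than 'budget-neutral' we are burning."""
    return observed_error_rate / ERROR_BUDGET

def should_alert(short_window_rate: float, long_window_rate: float,
                 threshold: float = 3.0) -> bool:
    """Fire only when BOTH windows exceed the threshold, so that short
    spikes and long-since-recovered incidents are both ignored."""
    return (burn_rate(short_window_rate) > threshold
            and burn_rate(long_window_rate) > threshold)

# e.g. a sustained 0.25% error rate burns budget at 2.5x:
print(round(burn_rate(0.0025), 2))  # → 2.5
```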

Event Timeline

The SLO on the Test Kitchen EventGate side of things is currently projected to be at risk, but errors do, as @cjming noted in chat, seem to have tapered off considerably (we saw initial error budget alerting via T412448: ErrorBudgetBurn (sticky-headers), but as discussed there a backport was pushed to fix the main problem).

Screenshot 2025-12-11 at 5.42.48 PM.png (500×3 px, 212 KB)

I think we can give it some time and see if it recovers.

A hastily constructed query suggests the overall error rate is higher than we'd want to tolerate on a longer-term basis, yielding sub-99.75% success versus the hoped-for 99.9%.
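Back-of-envelope math for those two numbers, assuming a 30-day SLO window (the window length is an assumption, not something stated here):

```python
# ~99.75% observed success versus the 99.9% objective.

target_success = 0.999
observed_success = 0.9975

error_budget = 1 - target_success            # 0.1% of requests may fail
observed_error_rate = 1 - observed_success   # 0.25% currently failing

burn = observed_error_rate / error_budget    # 2.5x budget-neutral pace
days_to_exhaustion = 30 / burn               # a fresh 30-day budget lasts ~12 days
```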

Screenshot 2025-12-11 at 5.48.29 PM.png (1×3 px, 401 KB)

But keep in mind that the errors are measured via a centralized counter, whereas the non-errors are counted via product_metrics.web_base. This is intentional and known, as data_platform.pp notes:

# For EventGate events, system errors (for numerator before complement)
# may be from arbitrary streams via schema fragments, but in general the
# desired destination stream will be product_metrics.web_base (for
# denominator). This means the error ratio may exaggerate SLO misses, which is
# okay in the context of the event system success rate measurement.
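A toy illustration (all numbers invented) of why that mismatch makes the ratio pessimistic: errors counted across all streams get divided by events from only the one stream, so the computed error ratio is an upper bound on the true one.

```python
# Assumed counts over some window, purely for illustration.
errors_all_streams = 50      # numerator: errors from any stream
web_base_events = 40_000     # denominator: product_metrics.web_base only
all_stream_events = 60_000   # true denominator (not what is recorded)

measured_error_ratio = errors_all_streams / web_base_events   # 0.125%
true_error_ratio = errors_all_streams / all_stream_events     # ~0.083%

# The measured ratio can only exaggerate, never understate, an SLO miss:
assert measured_error_ratio >= true_error_ratio
```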

A different query, scoped to the originally flapping stream, confirms that the error rate does exist on that stream.

Screenshot 2025-12-11 at 6.19.00 PM.png (1×3 px, 429 KB)

Screenshot 2025-12-11 at 6.21.52 PM.png (1×3 px, 331 KB)

It's just that the experiment in T412146: Launch Mobile Expanded Sections on non-English wikis is running on some comparatively high-traffic sites at 10% sampling, so its errors appear to seep through to the global metric we're monitoring.
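A toy model of that seep-through effect (every number below is assumed, none come from the actual dashboards):

```python
# One 10%-sampled experiment on high-traffic wikis contributing errors
# to the global success-rate metric.

site_pageviews_per_s = 500   # assumed traffic on the experiment wikis
sampling_rate = 0.10         # 10% of pageviews enrolled
error_fraction = 0.4         # assumed share of sampled events that error

experiment_errors_per_s = site_pageviews_per_s * sampling_rate * error_fraction
global_events_per_s = 10_000  # assumed global denominator

contribution = experiment_errors_per_s / global_events_per_s
print(f"{contribution:.2%} added to the global error ratio")  # → 0.20%
```

Even with these modest assumed numbers, the single experiment alone exceeds a 0.1% error budget.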

Screenshot 2025-12-11 at 6.26.18 PM.png (1×1 px, 231 KB)

Indeed, if we recalculate so as to exclude the sticky-headers schema, mediawiki.product_metrics.readerexperiments_stickyheaders, from the errors, then we're doing okay.

Screenshot 2025-12-11 at 8.08.46 PM.png (1×3 px, 472 KB)

However, we do have some errors occurring elsewhere that put us uncomfortably close to the 99.9% success target, so we may need a way to handle that: adjusting recording rules, adding new Prometheus metric label emission, or possibly addressing something closer to the fundamental root of the problem (in or near concerns regarding cookie expiration, for example, which is a knotty problem).

sum by (stream, error_type) (rate(eventgate_validation_errors_total{service="eventgate-analytics-external", prometheus="k8s", error_type="HoistingError"}[1d]))

{error_type="HoistingError", stream="mediawiki.product_metrics.readerexperiments_imagebrowsing"}
0.00003474153218396851
{error_type="HoistingError", stream="mediawiki.product_metrics.readerexperiments_stickyheaders"}
0.04143786160336746
{error_type="HoistingError", stream="product_metrics.web_base"}
0.007747897720029626

# MalformedHeaderError showed no errors for the previous day
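Plugging the per-stream rates from the query output above into a quick share calculation shows how thoroughly the sticky-headers stream dominates the HoistingError total:

```python
# Per-stream HoistingError rates, copied from the query results above.
rates = {
    "mediawiki.product_metrics.readerexperiments_imagebrowsing": 0.00003474153218396851,
    "mediawiki.product_metrics.readerexperiments_stickyheaders": 0.04143786160336746,
    "product_metrics.web_base": 0.007747897720029626,
}

total = sum(rates.values())
share = rates["mediawiki.product_metrics.readerexperiments_stickyheaders"] / total
print(f"sticky-headers share of HoistingErrors: {share:.1%}")  # → 84.2%
```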
dr0ptp4kt renamed this task from ErrorBudgetBurn to ErrorBudgetBurn (sticky-headers, part 2). Dec 12 2025, 2:22 AM
dr0ptp4kt claimed this task.