User Details
- User Since
- May 30 2017, 5:25 PM (445 w, 5 d)
- Availability
- Available
- IRC Nick
- herron
- LDAP User
- Herron
- MediaWiki User
- Unknown
Mon, Dec 8
Updated the wikifunctions sloth pilot SLO to enable low-priority "ticket" alerting
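In sloth spec terms, enabling ticket alerting amounts to flipping the `disable` flag on the ticket alert (a sketch of the relevant alerting block only, not the actual pilot config; paging stays off):

```yaml
alerting:
  page_alert:
    disable: true       # paging stays off
  ticket_alert:
    disable: false      # low-priority "ticket" burn-rate alerts now fire
```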
Thu, Dec 4
Onboarded wikifunctions today as well, with config:
Mon, Dec 1
Made a couple more adjustments to the dashboard to clean up the rolling window portion
Agreed, looks good!
Tue, Nov 25
https://gerrit.wikimedia.org/r/1211177 Elukey Patchset 1 11:50 AM
I think it could be a good test, but I would try to explain why we get the difference outlined in https://w.wiki/GHoH, because IIUC we should really see the drops in the first place. Maybe there is something extra that we are not seeing?
Mon, Nov 17
Thanks for the summary. Before we pull the trigger on the revert, could we try a couple of alternatives?
Nov 14 2025
I've set up the spare disks already present in titan1001 as an 800G LVM volume to host /srv. I've just kicked off an initial sync, and after that's complete I'll depool titan1001 to stop services for a final sync and remount. After that we can add the backing devices for the previous /srv filesystem (/dev/md2) into this LVM volume as well and effectively double our capacity.
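The steps above can be sketched roughly as follows. This is a dry-run sketch: the device names (/dev/sdc, /dev/sdd), VG/LV names, and staging mount point are assumptions, not the actual host layout, and each command is echoed rather than executed:

```shell
#!/bin/sh
# Dry-run sketch of the titan1001 /srv migration; swap `echo` for direct
# execution on the real host.
run() { echo "+ $*"; }

run pvcreate /dev/sdc /dev/sdd            # assumed spare-disk device names
run vgcreate vg-srv /dev/sdc /dev/sdd     # ~800G volume group
run lvcreate -n srv -l 100%FREE vg-srv
run mkfs.ext4 /dev/vg-srv/srv
run rsync -a /srv/ /mnt/newsrv/           # initial sync while still pooled
# ...depool, stop services, final rsync, remount /srv from the LV...
run vgextend vg-srv /dev/md2              # fold in the old /srv backing device
```

Folding /dev/md2 into the VG afterwards (rather than up front) is what allows the copy to run while the old filesystem is still mounted.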
Nov 10 2025
FWIW T326419 has some details about the last rebalance on kafka-logging
Nov 5 2025
sloth editcheck has been backfilled for range --start=2025-06-01T00:00:00Z --end=2025-11-01T00:00:00Z
# xlab.yml
version: "prometheus/v1"
service: "xlab"
labels:
  owner: "sre"
slos:
  - name: "xlab-standalone-event-validation-success-rate"
    objective: 95
    description: "xlab standalone event validation success rate"
    sli:
      events:
        error_query: |
          sum(
            rate(eventgate_validation_errors_total{service="eventgate-analytics-external", stream="product_metrics.web_base",
              error_type=~"HoistingError|MalformedHeaderError|ValidationError", prometheus="k8s"}[{{.window}}])
          )
          or vector(0)
        total_query: |
          sum(
            rate(eventgate_events_produced_total{service="eventgate-analytics-external", stream="product_metrics.web_base", prometheus="k8s"}[{{.window}}])
          ) +
          sum(
            rate(eventgate_validation_errors_total{service="eventgate-analytics-external", error_type=~"HoistingError|MalformedHeaderError", prometheus="k8s"}[{{.window}}])
          )
          or vector(0)
    alerting:
      page_alert:
        disable: true
      ticket_alert:
        disable: true

$ sudo docker run --rm -v $(pwd):/data ghcr.io/slok/sloth:latest generate -i /data/xlab.yml -o /data/xlab-out.yaml
INFO[0000] SLO period windows loaded  svc=alert.WindowsRepo version=3a24ef6384adfac15af1b8d5898e7a05bed2f5f0 window=30d windows=2
INFO[0000] Plugins loaded  sli-plugins=0 slo-plugins=11 version=3a24ef6384adfac15af1b8d5898e7a05bed2f5f0 window=30d
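As a sanity check on what a 95 objective means over sloth's 30d window, the implied error budget is 5% of the window. This is back-of-envelope arithmetic that follows from the objective alone, not output from sloth:

```shell
# Error budget implied by a 95% objective over a 30-day window
objective_hundredths=9500                  # 95.00%, in hundredths of a percent
window_min=$((30 * 24 * 60))               # 43200 minutes in 30 days
budget_min=$(( window_min * (10000 - objective_hundredths) / 10000 ))
echo "${budget_min} error-budget minutes"  # 2160 minutes, i.e. 36 hours
```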
Offhand, the sloth detail dashboard's "month error budget burn chart" panel uses the Grafana built-in "relative time" and "time shift" panel options to pin the panel to the current month.
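Concretely, those two panel options serialize into the panel JSON as the `timeFrom` and `timeShift` fields. This is a sketch of the shape only; the actual values on the dashboard may differ:

```json
{
  "title": "month error budget burn chart",
  "timeFrom": "30d",
  "timeShift": "0h"
}
```

`timeFrom` pins the panel's own time range independent of the dashboard picker, and `timeShift` offsets that range; together they hold the panel on a fixed window regardless of what the viewer selects.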
# editcheck.yml
version: "prometheus/v1"
service: "edit-check"
labels:
  owner: "sre"
slos:
  - name: "edit-check-pre-save-checks-ratio"
    objective: 99.0
    description: "Edit check pre save checks"
    sli:
      events:
        error_query: |
          sum(
            rate(editcheck_sli_presavechecks_shown_vs_available_total[{{.window}}])
          )
        total_query: |
          sum(
            rate(editcheck_sli_presavechecks_available_total[{{.window}}])
          )
    alerting:
      page_alert:
        disable: true
      ticket_alert:
        disable: true

# tonecheck.yml
version: "prometheus/v1"
service: "tonecheck"
labels:
  owner: "sre"
slos:
  - name: "tone-check-availability"
    objective: 95.0
    description: "Tone check pre save checks"
    sli:
      events:
        error_query: |
          sum(
            rate(istio_requests_total{source_workload_namespace="istio-system", source_workload="istio-ingressgateway",
              app="istio-ingressgateway", destination_canonical_service="edit-check-predictor",
              response_code=~"5..", prometheus="k8s-mlserve"}[{{.window}}])
          )
        total_query: |
          sum(
            rate(istio_requests_total{source_workload_namespace="istio-system", source_workload="istio-ingressgateway",
              app="istio-ingressgateway", destination_canonical_service="edit-check-predictor",
              prometheus="k8s-mlserve"}[{{.window}}])
          )
    alerting:
      page_alert:
        disable: true
      ticket_alert:
        disable: true

Nov 4 2025
I've upgraded the prometheus postgres exporters across the fleet to the version from trixie, which is capable of replica monitoring. I also granted prometheus the pg_monitor privileges where called for by the updated exporter. Next up is sorting out replication alerts using the updated metrics.
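For reference, the per-cluster privilege grant amounts to something like the following. This is a sketch; the role name for the exporter's connection user is assumed to be `prometheus`:

```sql
-- pg_monitor is PostgreSQL's built-in monitoring role; granting it lets the
-- exporter read the pg_stat_* views, including replication statistics
GRANT pg_monitor TO prometheus;
```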
Oct 23 2025
We now have two thanos rule instances running, "main" (the pre-existing instance) and a new instance called "pilot"
Oct 15 2025
Had a quick chat with @Vgutierrez and I've just copied the package to trixie-wikimedia
Oct 14 2025
Added details to the spreadsheet, thanks!
Oct 6 2025
@elukey would you be able to rule out whether this is related to tegola? I see a sharp rise in thanos swift-proxy utilization on Oct 2 that seems to correlate with IRC discussion about tegola maintenance, and today I'm seeing a lot of errors following the pattern below in the swift-proxy logs alongside thanos.
Sep 15 2025
Backfill process has been documented in https://wikitech.wikimedia.org/wiki/Thanos#Backfilling_Metrics and used successfully several times. Resolving!
Sep 10 2025
Tonecheck metrics have been backfilled with a clean history
Sep 8 2025
6 weeks worth of metrics have been backfilled
Aug 18 2025
Along with this we could explore whether adding Alertmanager Grafana datasources would be worthwhile for viewing and/or sending alerts, and how that might overlap with or complement karma, e.g. for browsing alerts and silences.
Jul 25 2025
T400071#11034605 steps through a backfill process with an ad-hoc prometheus (and ad-hoc sidecar) that worked to upload backfilled blocks to Thanos.
Seeing some success with the prometheus compactor and sidecar workaround. I've been able to upload backfilled blocks to Thanos in a way that at least partially works.
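Roughly, the workaround looks like the following dry-run sketch. File names and paths are illustrative; `promtool tsdb create-blocks-from openmetrics` is the stock Prometheus backfill entry point, and the compaction/upload steps ride on the ad-hoc prometheus and sidecar described above:

```shell
#!/bin/sh
# Dry-run sketch of the backfill flow; swap `echo` for direct execution.
run() { echo "+ $*"; }

# 1. turn exported samples (OpenMetrics text) into TSDB blocks
run promtool tsdb create-blocks-from openmetrics samples.om ./blocks
# 2. drop the blocks into the ad-hoc prometheus data dir so its compactor
#    can merge the overlapping blocks
run cp -r ./blocks/. /srv/adhoc-prometheus/data/
# 3. the thanos sidecar attached to the ad-hoc prometheus then uploads the
#    compacted blocks to the object store
```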
Jul 24 2025
Trying today with an ad-hoc prometheus instance to compact the overlapping blocks before uploading
Jul 23 2025
This morning I've done:
Jul 22 2025
(generating backfill blocks today, I'll keep updating with the commands used for future reference)
herron@turquoise:~/tmp/tonecheck$ cat add_replica_backfill_labels.sh
#!/bin/bash
# this may be duplicated effort since block upload sets an external label of
# the same; will verify on next backfill
Jul 8 2025
Today I reviewed a sampling of our published SLO docs, and while some do mention 'datacenter' and specific names like 'eqiad' and 'codfw', I didn't see a case where we explicitly document whether the targets are per-site or across all sites. I did find in the varnish SLO a mention of potentially both (per-site and aggregate), which is an interesting case to cover as well. And of course it can vary per SLO. Overall this seems a bit of a grey area that we could clarify. I think simplifying as you describe is worth trying, and IMO as we do, let's update the docs to make the datacenter scope that's being implemented and alerted on more explicit.
Jul 7 2025
I think we could do it, but before committing to the change could we expand a bit on rationale and side-effects/use cases?
Jun 30 2025
Opensearch reporting https://docs.opensearch.org/docs/latest/reporting/report-dashboard-index/ (already in place) checks each of these boxes with the exception of NDJSON.
Optimistically resolving, as we've tuned the window for the istio SLOs to 4w (from 12w)
Jun 24 2025
There doesn't appear to be a feature to generate a notification (push/sms/email/otherwise) on the acknowledge action in splunk oncall. There is the ability to integrate it via a webhook, but that doesn't solve the SMS handling piece by itself.
Declining this since I doubt we'll make changes to support these queries, and we can hack around it using results from the labels API
Jun 17 2025
The requested access has been merged and will be fully deployed within 30 minutes. I'll go ahead and resolve this but please don't hesitate to re-open if any followup is needed. Thanks!
Hi @AndyRussG_volunteer you should have just received an email regarding kerberos, and I'll update the account data to reflect krb: present now as well
Thanks! I've just emailed you as well for the out of band verification step