Bumped the grafana VMs in codfw and eqiad to 12G of memory (up from 4G), and used this as an opportunity to increase vCPUs to 2 as well
- Feed Queries
- All Stories
- Search
- Feed Search
- Transactions
- Transaction Logs
Wed, Jan 14
In T412229#11513890, @Jhancock.wm wrote: @herron two things
- do you mind if I rack this in the new expansion cage at codfw?
Dec 17 2025
In T412493#11456060, @RLazarus wrote: I think we can customize which fields go into the title, to add the slo field in there as you suggest, but I don't know how exactly -- I see in alertmanager.yml.erb that there's a title urlparam but I can't immediately tell what its format is. @herron, any ideas?
Dec 16 2025
In T412842#11466171, @gerritbot wrote: Change #1218821 had a related patch set uploaded (by Herron; author: Herron):
[operations/puppet@production] arclamp: reduce compress days
Dec 8 2025
Updated the wikifunctions slot pilot SLO to enable low priority "ticket" alerting
Dec 4 2025
onboarded wikifunctions today as well with config:
Dec 2 2025
Dec 1 2025
Made a couple more adjustments to https://grafana.wikimedia.org/d/slot-pilot-slo-detail/sloth-s-l-o-detail to clean up the rolling window portion
Agreed, looks good!
Nov 25 2025
https://gerrit.wikimedia.org/r/1211177 (Elukey, Patchset 1, 11:50 AM): I think it could be a good test but I would try to explain why we get the difference outlined in https://w.wiki/GHoH, because IIUC we should really see the drops in the first place. Maybe there is something extra that we are not seeing?
Nov 24 2025
Nov 20 2025
In T405946#11390399, @RobH wrote: We don't want to move anything the day before a holiday or weekend, as it doesn't allow for a follow-up fix if anything strange occurs. Additionally I'll be out of the office on December 1st and 2nd. As a result, the following migration dates are available, and we can move any or all of your 12 hosts in a single day (or more) depending on your team's service needs. With a number of the hosts remaining being primary/secondary to one another, I am going to imagine it is best to move all redundant nodes on one day, and then move the primary nodes a day or two later.
Nov 18 2025
In T409310#11363845, @elukey wrote: @herron Hi! Could you please backfill slo:period_error_budget_remaining:ratio too? I see that the time series starts from Oct 27th; this is the rolling window metric and I'd like to see how it looks over a quarter.
Nov 17 2025
Thanks for the summary. Before we pull the trigger on the revert, could we try a couple of alternatives?
Nov 14 2025
In T410152#11374841, @herron wrote: I've set up the spare disks already present in titan1001 as an 800G LVM volume to host /srv. I just kicked off an initial sync, and after that's complete I'll depool titan1001 to stop services for a final sync and remount. After that we can add the backing devices for the previous /srv filesystem (/dev/md2) into this LVM volume as well and effectively double our capacity.
Nov 10 2025
FWIW T326419 has some details about the last rebalance on kafka-logging
Nov 5 2025
sloth editcheck has been backfilled for range --start=2025-06-01T00:00:00Z --end=2025-11-01T00:00:00Z
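As an aside, exporting that much history is usually done in smaller slices (e.g. one query_range call per day) before the result is turned into blocks. A minimal sketch of that chunking in Python; the day-sized slice is an arbitrary illustrative choice, not something sloth or promtool mandates:

```python
from datetime import datetime, timedelta, timezone

def day_chunks(start: str, end: str):
    """Split an RFC3339 time range into day-sized [start, end) slices."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    t = datetime.strptime(start, fmt).replace(tzinfo=timezone.utc)
    stop = datetime.strptime(end, fmt).replace(tzinfo=timezone.utc)
    while t < stop:
        nxt = min(t + timedelta(days=1), stop)
        yield t, nxt
        t = nxt

# The editcheck backfill range above works out to 153 daily slices
chunks = list(day_chunks("2025-06-01T00:00:00Z", "2025-11-01T00:00:00Z"))
print(len(chunks))  # 153
```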
# xlab.yml
version: "prometheus/v1"
service: "xlab"
labels:
  owner: "sre"
slos:
  - name: "xlab-standalone-event-validation-success-rate"
    objective: 95
    description: "xlab standalone event validation success rate"
    sli:
      events:
        error_query: |
          sum(
            rate(eventgate_validation_errors_total{service="eventgate-analytics-external", stream="product_metrics.web_base",
              error_type=~"HoistingError|MalformedHeaderError|ValidationError", prometheus="k8s"}[{{.window}}])
          )
          or vector(0)
        total_query: |
          sum(
            rate(eventgate_events_produced_total{service="eventgate-analytics-external", stream="product_metrics.web_base", prometheus="k8s"}[{{.window}}])
          ) +
          sum(
            rate(eventgate_validation_errors_total{service="eventgate-analytics-external", error_type=~"HoistingError|MalformedHeaderError", prometheus="k8s"}[{{.window}}])
          )
          or vector(0)
    alerting:
      page_alert:
        disable: true
      ticket_alert:
        disable: true

$ sudo docker run --rm -v $(pwd):/data ghcr.io/slok/sloth:latest generate -i /data/xlab.yml -o /data/xlab-out.yaml
INFO[0000] SLO period windows loaded  svc=alert.WindowsRepo version=3a24ef6384adfac15af1b8d5898e7a05bed2f5f0 window=30d windows=2
INFO[0000] Plugins loaded  sli-plugins=0 slo-plugins=11 version=3a24ef6384adfac15af1b8d5898e7a05bed2f5f0 window=30d
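As a sanity check on what the 95 objective means for this SLI, the error budget arithmetic can be reproduced by hand. A hedged sketch; the event counts below are illustrative, not real xlab traffic:

```python
def error_budget_remaining(errors: float, total: float, objective_pct: float) -> float:
    """Fraction of the error budget left over the window:
    1.0 = untouched, 0.0 = exhausted, negative = overspent."""
    allowed = 1 - objective_pct / 100      # objective 95 -> up to 5% of events may fail
    observed = errors / total if total else 0.0
    return 1 - observed / allowed

# Illustrative: 1,000,000 events, 20,000 validation errors, objective 95
# 2% observed error rate against a 5% allowance -> 60% of the budget remains
print(round(error_budget_remaining(20_000, 1_000_000, 95), 3))  # 0.6
```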
Offhand, the sloth detail dashboard's "month error budget burn chart" panel uses the Grafana built-in "relative time" and "time shift" panel options to fix the panel on the current month.
# editcheck.yml
version: "prometheus/v1"
service: "edit-check"
labels:
  owner: "sre"
slos:
  - name: "edit-check-pre-save-checks-ratio"
    objective: 99.0
    description: "Edit check pre save checks"
    sli:
      events:
        error_query: |
          sum(
            rate(editcheck_sli_presavechecks_shown_vs_available_total[{{.window}}])
          )
        total_query: |
          sum(
            rate(editcheck_sli_presavechecks_available_total[{{.window}}])
          )
    alerting:
      page_alert:
        disable: true
      ticket_alert:
        disable: true

# tonecheck.yml
version: "prometheus/v1"
service: "tonecheck"
labels:
  owner: "sre"
slos:
  - name: "tone-check-availability"
    objective: 95.0
    description: "Tone check pre save checks"
    sli:
      events:
        error_query: |
          sum(
            rate(istio_requests_total{source_workload_namespace="istio-system", source_workload="istio-ingressgateway",
              app="istio-ingressgateway", destination_canonical_service="edit-check-predictor",
              response_code=~"5..", prometheus="k8s-mlserve"}[{{.window}}])
          )
        total_query: |
          sum(
            rate(istio_requests_total{source_workload_namespace="istio-system", source_workload="istio-ingressgateway",
              app="istio-ingressgateway", destination_canonical_service="edit-check-predictor",
              prometheus="k8s-mlserve"}[{{.window}}])
          )
    alerting:
      page_alert:
        disable: true
      ticket_alert:
        disable: true

Nov 4 2025
I've upgraded the prometheus postgres exporters across the fleet to the version from trixie, which is capable of replica monitoring. I also gave prometheus pg_monitor privileges where called for by the updated exporter. Next will be sorting out replication alerts using the updated metrics.
Oct 23 2025
We now have two thanos rule instances running, "main" (the pre-existing instance) and a new instance called "pilot"
Oct 15 2025
Had a quick chat with @Vgutierrez and I've just copied the package to trixie-wikimedia
Oct 14 2025
Added details to the spreadsheet, thanks!
Oct 7 2025
In T406496#11249377, @elukey wrote: @herron it was definitely Tegola, I was doing a cache refresh before the codfw cluster is repooled (it is a one-off that we do in these situations), that meant the re-creation of 90M tiles :(
I see that the metrics are better now, but we are going to repool the codfw cluster soon (so it will serve live traffic etc..). Lemme know if it is a concern, and/or if the metrics are good now. We can probably try to be more gentle with the cache refresh, it is a k8s cron that takes a long time to run but that can be parallelized easily (we run it from multiple pods).
Oct 6 2025
@elukey would you be able to rule out whether this is related to tegola? I see a sharp rise in thanos swift-proxy utilization on Oct 2 that seems to correlate with IRC discussion about tegola maintenance, and today I'm seeing a lot of errors following the pattern below in the swift-proxy logs alongside thanos
Oct 1 2025
Sep 30 2025
Sep 15 2025
Backfill process has been documented in https://wikitech.wikimedia.org/wiki/Thanos#Backfilling_Metrics and used successfully several times. Resolving!
Sep 10 2025
Tonecheck metrics have been backfilled with a clean history
Sep 8 2025
6 weeks' worth of metrics have been backfilled
Aug 18 2025
Along with this we could explore whether adding alertmanager Grafana datasources would be worthwhile for viewing and/or sending alerts, and how that might overlap with or complement karma, e.g. for browsing alerts and silences, etc.
Jul 25 2025
T400071#11034605 steps through a backfill process with an ad-hoc prometheus (and ad-hoc sidecar) that worked to upload backfilled blocks to Thanos.
Seeing some success with the prometheus compactor and sidecar workaround. I've been able to upload backfilled blocks to Thanos in a way that at least partially works.
Jul 24 2025
Trying today with an ad-hoc prometheus instance to compact the overlapping blocks before uploading
Jul 23 2025
In T349521#9706188, @fgiunchedi wrote: Following up from a chat yesterday:
The idea of creating backfilled blocks is sound, although I think we can get away with uploading said blocks straight to thanos (making sure we're using distinct labels with e.g. recoder=backfill) and they will be compacted and available as usual (to be tested!)
This morning I've done:
Jul 22 2025
(generating backfill blocks today, I'll keep updating with the commands used for future reference)
herron@turquoise:~/tmp/tonecheck$ cat add_replica_backfill_labels.sh
#!/bin/bash
# this may be duplicated effort since block upload sets an external label of the same; will verify on next backfill
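The gist of that script, as I understand it, is stamping external labels onto each backfilled block's meta.json so Thanos treats them as a distinct, deduplicatable source. A minimal Python sketch of the same idea, assuming the standard Thanos block layout (a meta.json carrying a thanos.labels map); the label value here is illustrative:

```python
import json
from pathlib import Path

def add_backfill_labels(block_dir: str, labels: dict) -> None:
    """Merge external labels into a TSDB block's meta.json under thanos.labels."""
    meta_path = Path(block_dir) / "meta.json"
    meta = json.loads(meta_path.read_text())
    # Thanos extends the Prometheus block meta with a "thanos" section
    meta.setdefault("thanos", {}).setdefault("labels", {}).update(labels)
    meta_path.write_text(json.dumps(meta, indent=2))

# e.g. for every block directory produced by the backfill:
# add_backfill_labels("/srv/backfill/<BLOCK_ULID>", {"replica": "backfill"})
```

Run against a copy of the blocks before upload, so a mistake doesn't corrupt the originals.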
Jul 21 2025
Jul 8 2025
Today I reviewed a sampling of our published SLO docs, and while some do mention 'datacenter' and specific names like 'eqiad' and 'codfw', I didn't see a case where we explicitly document whether the targets are per-site or all sites. I did find in the varnish SLO mention of potentially both (per-site and aggregate), which is an interesting case to cover as well. And of course it can vary per-SLO. Overall this seems a bit of a grey area that we could clarify. I think simplifying like you describe is worth trying, and IMO as we do let's update the docs to make the datacenter scope that's being implemented and alerted on more clear.
Jul 7 2025
I think we could do it, but before committing to the change could we expand a bit on rationale and side-effects/use cases?
Jun 30 2025
Opensearch reporting https://docs.opensearch.org/docs/latest/reporting/report-dashboard-index/ (already in place) checks each of these boxes with the exception of NDJSON.
Optimistically resolving as we've tuned the window for istio SLOs to 4w (from 12w)
Jun 25 2025
Jun 24 2025
There doesn't appear to be a feature to generate a notification (push/sms/email/otherwise) on the acknowledge action in splunk oncall. There is the ability to integrate it via a webhook, but that doesn't solve the SMS handling piece by itself.
Declining this since I doubt we'll make changes to support these queries, and we can hack around it using results from the labels API