
herron (Keith Herron)
Site Reliability Engineer

User Details

User Since
May 30 2017, 5:25 PM (445 w, 5 d)
Availability
Available
IRC Nick
herron
LDAP User
Herron
MediaWiki User
Unknown

Recent Activity

Mon, Dec 8

herron updated the task description for T404171: Evaluate Sloth as a possible replacement for Pyrra.
Mon, Dec 8, 8:06 PM · SRE-SLO
herron added a comment to T409310: Sloth: onboard subset of existing SLOs to pilot.

Updated the wikifunctions sloth pilot SLO to enable low priority "ticket" alerting

Mon, Dec 8, 8:05 PM · SRE-SLO

Thu, Dec 4

herron added a comment to T409310: Sloth: onboard subset of existing SLOs to pilot.

onboarded wikifunctions today as well with config:

Thu, Dec 4, 4:59 PM · SRE-SLO
herron updated the task description for T409310: Sloth: onboard subset of existing SLOs to pilot.
Thu, Dec 4, 4:58 PM · SRE-SLO

Tue, Dec 2

herron updated the task description for T404171: Evaluate Sloth as a possible replacement for Pyrra.
Tue, Dec 2, 4:52 PM · SRE-SLO

Mon, Dec 1

herron updated the task description for T404171: Evaluate Sloth as a possible replacement for Pyrra.
Mon, Dec 1, 9:01 PM · SRE-SLO
herron added a comment to T409312: Sloth: adapt default month view to quarter view (pilot).

Made a couple more adjustments to the dashboard to clean up the rolling window portion

Mon, Dec 1, 8:37 PM · SRE-SLO
herron updated the task description for T404171: Evaluate Sloth as a possible replacement for Pyrra.
Mon, Dec 1, 4:15 PM · SRE-SLO
herron closed T409312: Sloth: adapt default month view to quarter view (pilot), a subtask of T404171: Evaluate Sloth as a possible replacement for Pyrra, as Resolved.
Mon, Dec 1, 4:13 PM · SRE-SLO
herron closed T409312: Sloth: adapt default month view to quarter view (pilot) as Resolved.

Agreed, looks good!

Mon, Dec 1, 4:13 PM · SRE-SLO
herron renamed T409312: Sloth: adapt default month view to quarter view (pilot) from Sloth: adapt default month view to quarter view to Sloth: adapt default month view to quarter view (pilot).
Mon, Dec 1, 4:12 PM · SRE-SLO
herron updated the task description for T404171: Evaluate Sloth as a possible replacement for Pyrra.
Mon, Dec 1, 4:11 PM · SRE-SLO

Tue, Nov 25

herron added a comment to T407503: Verify that the Pyrra dashboard is measuring what we think it is, and what it should be measuring.

https://gerrit.wikimedia.org/r/1211177 (Elukey, Patchset 1, 11:50 AM): I think it could be a good test but I would try to explain why we get the difference outlined in https://w.wiki/GHoH, because IIUC we should really see the drops in the first place. Maybe there is something extra that we are not seeing?

Tue, Nov 25, 7:41 PM · Patch-For-Review, Essential-Work, Abstract Wikipedia team (26Q2 (Oct–Dec))

Mon, Nov 24

herron created T410933: Add Druid as a Private Grafana Datasource.
Mon, Nov 24, 6:26 PM · Observability-Metrics, SRE Observability (FY2025/2026-Q3), SRE

Thu, Nov 20

herron reassigned T405946: eqiad row C/D Observability host migrations from herron to RobH.

We don't want to move anything the day before a holiday or weekend, as that doesn't allow for a follow-up fix if anything strange occurs. Additionally, I'll be out of the office on December 1st and 2nd. As a result, the following migration dates are available, and we can move any or all of your 12 hosts in a single day (or more) depending on your team's service needs. Since a number of the remaining hosts are primary/secondary to one another, I imagine it is best to move all redundant nodes on one day, and then move the primary nodes a day or two later.

Thu, Nov 20, 2:42 AM · observability, SRE, DC-Ops, ops-eqiad

Tue, Nov 18

herron added a comment to T409310: Sloth: onboard subset of existing SLOs to pilot.

@herron Hi! Could you please backfill slo:period_error_budget_remaining:ratio too? I see that the time series start from Oct 27th, this is the rolling window metric and I'd like to see how it looks over a quarter.

Tue, Nov 18, 3:32 PM · SRE-SLO
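
For readers less familiar with the rolling-window metric requested above, here is a minimal sketch of the arithmetic behind a "period error budget remaining" ratio. It assumes the value is derived as one minus the period burn rate, where the burn rate is the period error ratio divided by the error budget; the function name and the sample ratios are hypothetical and are not taken from the pilot data.

```python
def period_budget_remaining(period_error_ratio: float, objective_percent: float) -> float:
    """Share of the error budget still unspent over the SLO period.

    1.0 means nothing spent, 0.0 means fully spent, negative means overspent.
    """
    if not 0.0 < objective_percent < 100.0:
        raise ValueError("objective_percent must be between 0 and 100 (exclusive)")
    budget = 1.0 - objective_percent / 100.0   # e.g. 99.0 -> 0.01
    burn_rate = period_error_ratio / budget     # 1.0 == spending exactly the budget
    return 1.0 - burn_rate


# Hypothetical period error ratios measured against a 99.0 objective.
healthy = period_budget_remaining(0.004, 99.0)
overspent = period_budget_remaining(0.02, 99.0)
print(f"healthy={healthy:.2f} overspent={overspent:.2f}")
```

A value that goes negative in this sketch means more than the whole budget was consumed in the period, which is the kind of trend a quarter-long view would make visible.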

Mon, Nov 17

herron added a comment to T410152: Disk space saturation (/srv) on Titan hosts.

Thanks for the summary, before we pull the trigger on the revert could we try a couple alternatives?

Mon, Nov 17, 6:20 PM · SRE Observability (FY2025/2026-Q2)

Nov 14 2025

herron added a comment to T410152: Disk space saturation (/srv) on Titan hosts.

I've set up the spare disks already present in titan1001 as an 800G LVM volume to host /srv. I just kicked off an initial sync, and after that's complete I will depool titan1001 to stop services for a final sync and remount. After that we can add the backing devices for the previous /srv filesystem (/dev/md2) into this LVM volume as well and effectively double our capacity.

Nov 14 2025, 7:52 PM · SRE Observability (FY2025/2026-Q2)
herron added a comment to T410152: Disk space saturation (/srv) on Titan hosts.

I've set up the spare disks already present in titan1001 as an 800G LVM volume to host /srv. I just kicked off an initial sync, and after that's complete I will depool titan1001 to stop services for a final sync and remount. After that we can add the backing devices for the previous /srv filesystem (/dev/md2) into this LVM volume as well and effectively double our capacity.

Nov 14 2025, 4:26 PM · SRE Observability (FY2025/2026-Q2)

Nov 10 2025

herron placed T407185: Fix Kafka replicas skew up for grabs.

FWIW T326419 has some details about the last rebalance on kafka-logging

Nov 10 2025, 3:21 PM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work, Data-Engineering-Radar, serviceops, Observability-Logging, Data-Engineering

Nov 5 2025

herron updated the task description for T409310: Sloth: onboard subset of existing SLOs to pilot.
Nov 5 2025, 10:05 PM · SRE-SLO
herron added a comment to T409310: Sloth: onboard subset of existing SLOs to pilot.

sloth editcheck has been backfilled for range --start=2025-06-01T00:00:00Z --end=2025-11-01T00:00:00Z

Nov 5 2025, 10:05 PM · SRE-SLO
herron updated the task description for T409310: Sloth: onboard subset of existing SLOs to pilot.
Nov 5 2025, 9:49 PM · SRE-SLO
herron updated the task description for T404171: Evaluate Sloth as a possible replacement for Pyrra.
Nov 5 2025, 9:48 PM · SRE-SLO
herron updated the task description for T404171: Evaluate Sloth as a possible replacement for Pyrra.
Nov 5 2025, 4:44 PM · SRE-SLO
herron updated the task description for T409310: Sloth: onboard subset of existing SLOs to pilot.
Nov 5 2025, 4:15 PM · SRE-SLO
herron updated the task description for T409310: Sloth: onboard subset of existing SLOs to pilot.
Nov 5 2025, 3:58 PM · SRE-SLO
herron added a comment to T409310: Sloth: onboard subset of existing SLOs to pilot.
# xlab.yml
version: "prometheus/v1"
service: "xlab"
labels:
  owner: "sre"
slos:
  - name: "xlab-standalone-event-validation-success-rate"
    objective: 95
    description: "xlab standalone event validation success rate"
    sli:
      events:
        error_query: |
          sum(
              rate(eventgate_validation_errors_total{service="eventgate-analytics-external", stream="product_metrics.web_base",
                   error_type=~"HoistingError|MalformedHeaderError|ValidationError", prometheus="k8s"}[{{.window}}])
          )
          or vector(0)
        total_query: |
          sum(
              rate(eventgate_events_produced_total{service="eventgate-analytics-external", stream="product_metrics.web_base", prometheus="k8s"}[{{.window}}])
          ) +
          sum(
              rate(eventgate_validation_errors_total{service="eventgate-analytics-external", error_type=~"HoistingError|MalformedHeaderError", prometheus="k8s"}[{{.window}}])
          )
          or vector(0)
    alerting:
      page_alert:
        disable: true
      ticket_alert:
        disable: true
$ sudo docker run --rm -v $(pwd):/data ghcr.io/slok/sloth:latest generate -i /data/xlab.yml -o /data/xlab-out.yaml
INFO[0000] SLO period windows loaded                     svc=alert.WindowsRepo version=3a24ef6384adfac15af1b8d5898e7a05bed2f5f0 window=30d windows=2
INFO[0000] Plugins loaded                                sli-plugins=0 slo-plugins=11 version=3a24ef6384adfac15af1b8d5898e7a05bed2f5f0 window=30d
Nov 5 2025, 3:43 PM · SRE-SLO
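
As a rough illustration of how an events-based SLI like the xlab config above is read: the error query is divided by the total query to give an error ratio, and that ratio is compared with the budget implied by the objective (100 - 95 = 5%). The sketch below uses made-up per-second rates; the function names are hypothetical and are not part of Sloth, and returning 0 for zero traffic is an assumption made only for this example.

```python
def error_budget(objective_percent: float) -> float:
    """Fraction of events allowed to fail for an objective (e.g. 95 -> 0.05)."""
    return 1.0 - objective_percent / 100.0


def sli_error_ratio(error_rate: float, total_rate: float) -> float:
    """Error ratio over a window.

    Assumption for this sketch: no traffic is treated as no errors.
    """
    if total_rate <= 0:
        return 0.0
    return error_rate / total_rate


# Hypothetical rates: 2 failed validations/s out of 100 events/s.
budget = error_budget(95)
ratio = sli_error_ratio(error_rate=2.0, total_rate=100.0)
within_objective = ratio <= budget
print(f"budget={budget:.2f} error_ratio={ratio:.2f} within_objective={within_objective}")
```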
herron added a comment to T409312: Sloth: adapt default month view to quarter view (pilot).

Offhand, the sloth detail dashboard's "month error budget burn chart" panel uses the Grafana built-in "relative time" and "time shift" options to pin the panel to the current month.

Nov 5 2025, 3:31 PM · SRE-SLO
herron created T409312: Sloth: adapt default month view to quarter view (pilot).
Nov 5 2025, 3:24 PM · SRE-SLO
herron added a comment to T409310: Sloth: onboard subset of existing SLOs to pilot.
# editcheck.yml
version: "prometheus/v1"
service: "edit-check"
labels:
  owner: "sre"
slos:
  - name: "edit-check-pre-save-checks-ratio"
    objective: 99.0
    description: "Edit check pre save checks"
    sli:
      events:
        error_query: |
          sum(
            rate(editcheck_sli_presavechecks_shown_vs_available_total[{{.window}}])
          )
        total_query: |
          sum(
            rate(editcheck_sli_presavechecks_available_total[{{.window}}])
          )
    alerting:
      page_alert:
        disable: true
      ticket_alert:
        disable: true
Nov 5 2025, 3:18 PM · SRE-SLO
herron added a comment to T409310: Sloth: onboard subset of existing SLOs to pilot.
# tonecheck.yml
version: "prometheus/v1"
service: "tonecheck"
labels:
  owner: "sre"
slos:
  - name: "tone-check-availability"
    objective: 95.0
    description: "Tone check pre save checks"
    sli:
      events:
        error_query: |
          sum(
            rate(istio_requests_total{source_workload_namespace="istio-system", source_workload="istio-ingressgateway",
            app="istio-ingressgateway", destination_canonical_service="edit-check-predictor",
            response_code=~"5..", prometheus="k8s-mlserve" }[{{.window}}])
          )
        total_query: |
          sum(
            rate(istio_requests_total{source_workload_namespace="istio-system", source_workload="istio-ingressgateway",
            app="istio-ingressgateway", destination_canonical_service="edit-check-predictor",
            prometheus="k8s-mlserve" }[{{.window}}])
          )
    alerting:
      page_alert:
        disable: true
      ticket_alert:
        disable: true
Nov 5 2025, 3:17 PM · SRE-SLO
herron created T409310: Sloth: onboard subset of existing SLOs to pilot.
Nov 5 2025, 3:17 PM · SRE-SLO
herron updated the task description for T404171: Evaluate Sloth as a possible replacement for Pyrra.
Nov 5 2025, 3:11 PM · SRE-SLO

Nov 4 2025

herron added a comment to T374839: Port postgresql replication check to Prometheus/Alertmanager.

I've upgraded the prometheus postgres exporters across the fleet to the version from trixie which is capable of replica monitoring. I also gave prometheus pg_monitor privileges where called for by the updated exporter. Next will be sorting out replication alerts using the updated metrics.

Nov 4 2025, 6:45 PM · PostgreSQL, Observability-Alerting

Oct 23 2025

herron closed T406054: Thanos: support multiple ruler instances as Resolved.

We now have two thanos rule instances running, "main" (the pre-existing instance) and a new instance called "pilot"

Oct 23 2025, 8:50 PM · SRE Observability (FY2025/2026-Q1), SRE-SLO

Oct 15 2025

herron added a comment to T407320: Package benthos/redpanda for trixie.

Had a quick chat with @Vgutierrez and I've just copied the package to trixie-wikimedia

Oct 15 2025, 3:32 PM · Observability-Logging, Traffic

Oct 14 2025

herron added a comment to T405946: eqiad row C/D Observability host migrations.

Added details to the spreadsheet thanks!

Oct 14 2025, 6:27 PM · observability, SRE, DC-Ops, ops-eqiad

Oct 7 2025

herron added a comment to T406496: ThanosCompactHasNotRun: Thanos Compact has not uploaded anything for last 24 hours..

@herron it was definitely Tegola, I was doing a cache refresh before the codfw cluster is repooled (it is a one-off that we do in these situations), that meant the re-creation of 90M tiles :(

I see that the metrics are better now, but we are going to repool the codfw cluster soon (so it will serve live traffic etc..). Lemme know if it is a concern, and/or if the metrics are good now. We can probably try to be more gentle with the cache refresh, it is a k8s cron that takes a long time to run but that can be parallelized easily (we run it from multiple pods).

Oct 7 2025, 3:34 PM · Data-Persistence, SRE Observability (FY2025/2026-Q2), Observability-Metrics

Oct 6 2025

herron updated subscribers of T406496: ThanosCompactHasNotRun: Thanos Compact has not uploaded anything for last 24 hours..

@elukey would you be able to rule out whether this is related to tegola? I see a sharp rise in thanos swift-proxy utilization on Oct 2 that seems to correlate with IRC discussion about tegola maintenance, and today I'm seeing a lot of errors following the pattern below in the swift-proxy logs alongside thanos

Oct 6 2025, 7:55 PM · Data-Persistence, SRE Observability (FY2025/2026-Q2), Observability-Metrics

Oct 1 2025

herron claimed T397757: Kafkamon -> Bookworm.
Oct 1 2025, 2:38 PM · Observability-Logging, SRE Observability (FY2025/2026-Q1)

Sep 30 2025

herron moved T406054: Thanos: support multiple ruler instances from Inbox to FY2025/2026-Q1 on the SRE Observability board.
Sep 30 2025, 5:12 PM · SRE Observability (FY2025/2026-Q1), SRE-SLO
herron created T406054: Thanos: support multiple ruler instances.
Sep 30 2025, 5:12 PM · SRE Observability (FY2025/2026-Q1), SRE-SLO

Sep 15 2025

herron closed T349521: Prometheus/Pyrra: establish backfill process for recording rules as Resolved.

Backfill process has been documented in https://wikitech.wikimedia.org/wiki/Thanos#Backfilling_Metrics and used successfully several times. Resolving!

Sep 15 2025, 3:50 PM · SRE-SLO, Patch-For-Review, User-herron, Observability-Metrics
herron closed T349521: Prometheus/Pyrra: establish backfill process for recording rules, a subtask of T302995: Transition to Pyrra for SLO Visualization and Management, as Resolved.
Sep 15 2025, 3:50 PM · Patch-For-Review, User-herron, Observability-Metrics

Sep 10 2025

herron closed T400071: Clear & Backfill Tonecheck Pyrra Metrics, a subtask of T349521: Prometheus/Pyrra: establish backfill process for recording rules, as Resolved.
Sep 10 2025, 1:58 PM · SRE-SLO, Patch-For-Review, User-herron, Observability-Metrics
herron closed T400071: Clear & Backfill Tonecheck Pyrra Metrics as Resolved.

Tonecheck metrics have been backfilled with a clean history

Sep 10 2025, 1:58 PM · SRE-SLO, Observability-Metrics
herron triaged T400071: Clear & Backfill Tonecheck Pyrra Metrics as Medium priority.
Sep 10 2025, 1:57 PM · SRE-SLO, Observability-Metrics

Sep 8 2025

herron closed T400073: Clear & Backfill citoid Pyrra Metrics, a subtask of T349521: Prometheus/Pyrra: establish backfill process for recording rules, as Resolved.
Sep 8 2025, 6:28 PM · SRE-SLO, Patch-For-Review, User-herron, Observability-Metrics
herron closed T400073: Clear & Backfill citoid Pyrra Metrics as Resolved.

6 weeks' worth of metrics have been backfilled

Sep 8 2025, 6:28 PM · SRE-SLO, Observability-Metrics
herron triaged T400073: Clear & Backfill citoid Pyrra Metrics as Medium priority.
Sep 8 2025, 6:26 PM · SRE-SLO, Observability-Metrics

Aug 18 2025

herron added a comment to T401908: Define a policy for Grafana Alerting.

Along with this we could explore whether adding alertmanager Grafana datasources would be worthwhile for viewing and/or sending alerts, and how that might overlap with or complement karma, e.g. for browsing alerts and silences, etc.

Aug 18 2025, 3:35 PM · SRE Observability (FY2025/2026-Q1), Grafana

Jul 25 2025

herron added a comment to T349521: Prometheus/Pyrra: establish backfill process for recording rules.

T400071#11034605 steps through a backfill process with an ad-hoc prometheus (and ad-hoc sidecar) that worked to upload backfilled blocks to Thanos.

Jul 25 2025, 2:44 PM · SRE-SLO, Patch-For-Review, User-herron, Observability-Metrics
herron added a comment to T400071: Clear & Backfill Tonecheck Pyrra Metrics.

Seeing some success with the prometheus compactor and sidecar workaround. I've been able to upload backfilled blocks to Thanos in a way that at least partially works.

Jul 25 2025, 1:26 PM · SRE-SLO, Observability-Metrics

Jul 24 2025

herron added a comment to T400071: Clear & Backfill Tonecheck Pyrra Metrics.

Trying today with an ad-hoc prometheus instance to compact the overlapping blocks before uploading

Jul 24 2025, 5:16 PM · SRE-SLO, Observability-Metrics

Jul 23 2025

herron added a comment to T349521: Prometheus/Pyrra: establish backfill process for recording rules.

Following up from a chat yesterday:

The idea of creating backfilled blocks is sound, although I think we can get away with uploading said blocks straight to thanos (making sure we're using distinct labels with e.g. recoder=backfill) and they will be compacted and available as usual (to be tested!)

Jul 23 2025, 6:21 PM · SRE-SLO, Patch-For-Review, User-herron, Observability-Metrics
herron added a comment to T400071: Clear & Backfill Tonecheck Pyrra Metrics.

This morning I've done:

Jul 23 2025, 1:48 PM · SRE-SLO, Observability-Metrics

Jul 22 2025

herron added a comment to T400071: Clear & Backfill Tonecheck Pyrra Metrics.

(generating backfill blocks today, I'll keep updating with the commands used for future reference)

herron@turquoise:~/tmp/tonecheck$ cat add_replica_backfill_labels.sh
#!/bin/bash
# this may be duplicated effort since block upload sets the same external label; will verify on next backfill
Jul 22 2025, 3:14 PM · SRE-SLO, Observability-Metrics

Jul 21 2025

herron created T400073: Clear & Backfill citoid Pyrra Metrics.
Jul 21 2025, 2:50 PM · SRE-SLO, Observability-Metrics
herron created T400071: Clear & Backfill Tonecheck Pyrra Metrics.
Jul 21 2025, 2:46 PM · SRE-SLO, Observability-Metrics

Jul 8 2025

herron added a comment to T398534: Reduce the pyrra's multi-dc configurations where it makes sense.

Today I reviewed a sampling of our published SLO docs and while some do make mention of 'datacenter' and specific names like 'eqiad' 'codfw', I didn't see a case where we explicitly document if the targets are per-site or all sites. I did find in the varnish SLO mention of potentially both (per-site and aggregate) which is an interesting case to cover as well. And of course it can vary per-SLO. Overall seems a bit of a grey area that we could clarify. I think simplifying like you describe is worth trying, and IMO as we do let's update the docs to make it more clear about the datacenter scope that's being implemented and alerted on.

Jul 8 2025, 3:52 PM · SRE-SLO

Jul 7 2025

herron added a comment to T398534: Reduce the pyrra's multi-dc configurations where it makes sense.

I think we could do it, but before committing to the change could we expand a bit on rationale and side-effects/use cases?

Jul 7 2025, 1:59 PM · SRE-SLO

Jun 30 2025

herron moved T394069: Rendering Graph's as images times out on Grafana 11 from FY2024/2025-Q4 to FY2025/2026-Q1 on the SRE Observability board.
Jun 30 2025, 6:44 PM · SRE Observability (FY2025/2026-Q1)
herron moved T372845: Migrate all o11y services to nftables from FY2024/2025-Q4 to FY2025/2026-Q1 on the SRE Observability board.
Jun 30 2025, 6:44 PM · SRE Observability (FY2025/2026-Q1), Observability-Metrics
herron moved T393894: New version of Grafana makes it not possible to remove option in long list of values from FY2024/2025-Q4 to FY2025/2026-Q1 on the SRE Observability board.
Jun 30 2025, 6:43 PM · SRE Observability (FY2025/2026-Q1), Grafana
herron moved T388506: Implement a less noisy way to remove nrpe checks (without UNKNOWN spam) from FY2024/2025-Q4 to FY2025/2026-Q1 on the SRE Observability board.
Jun 30 2025, 6:43 PM · SRE Observability (FY2025/2026-Q1), Observability-Alerting
herron moved T396626: Hardware retirement Graphite Infrastructure (ETA June 2026) from FY2024/2025-Q4 to FY2025/2026-Q1 on the SRE Observability board.
Jun 30 2025, 6:41 PM · SRE Observability (FY2025/2026-Q1), MW-1.45-notes (1.45.0-wmf.1; 2025-05-13), Technical-Debt, Observability-Metrics
herron moved T392886: Revisit default Istio histogram buckets from FY2024/2025-Q4 to FY2025/2026-Q1 on the SRE Observability board.
Jun 30 2025, 6:41 PM · SRE Observability (FY2025/2026-Q1), Patch-For-Review, Observability-Metrics
herron moved T390196: Deploy and document a method to dump logs from logstash from FY2024/2025-Q4 to FY2025/2026-Q1 on the SRE Observability board.
Jun 30 2025, 6:40 PM · SRE Observability (FY2025/2026-Q1), Observability-Logging
herron added a comment to T390196: Deploy and document a method to dump logs from logstash.

Opensearch reporting https://docs.opensearch.org/docs/latest/reporting/report-dashboard-index/ (already in place) checks each of these boxes with the exception of NDJSON.

Jun 30 2025, 6:39 PM · SRE Observability (FY2025/2026-Q1), Observability-Logging
herron moved T321808: Port all Icinga checks to Prometheus/Alertmanager from FY2024/2025-Q4 to FY2025/2026-Q1 on the SRE Observability board.
Jun 30 2025, 5:42 PM · SRE Observability (FY2025/2026-Q1), Observability-Alerting
herron moved T372242: Alert on unscrapable pods from FY2024/2025-Q4 to FY2025/2026-Q1 on the SRE Observability board.
Jun 30 2025, 5:42 PM · SRE Observability (FY2025/2026-Q1), Observability-Alerting, serviceops, Kubernetes
herron closed T387350: liftwing SLO performance issues, a subtask of T302995: Transition to Pyrra for SLO Visualization and Management, as Resolved.
Jun 30 2025, 5:40 PM · Patch-For-Review, User-herron, Observability-Metrics
herron closed T387350: liftwing SLO performance issues as Resolved.

Optimistically resolving as we've tuned the window for istio slos to 4w (from 12w)

Jun 30 2025, 5:40 PM · SRE Observability (FY2024/2025-Q4), SRE-SLO, Observability-Metrics

Jun 25 2025

herron edited projects for T397756: Kafka-logging -> Bookworm, added: Observability-Logging; removed observability.
Jun 25 2025, 2:05 PM · Observability-Logging, SRE Observability (FY2025/2026-Q1)
herron edited projects for T397757: Kafkamon -> Bookworm, added: Observability-Logging; removed observability.
Jun 25 2025, 2:05 PM · Observability-Logging, SRE Observability (FY2025/2026-Q1)

Jun 24 2025

herron moved T397757: Kafkamon -> Bookworm from FY2024/2025-Q4 to FY2025/2026-Q1 on the SRE Observability board.
Jun 24 2025, 7:28 PM · Observability-Logging, SRE Observability (FY2025/2026-Q1)
herron moved T397756: Kafka-logging -> Bookworm from FY2024/2025-Q4 to FY2025/2026-Q1 on the SRE Observability board.
Jun 24 2025, 7:28 PM · Observability-Logging, SRE Observability (FY2025/2026-Q1)
herron renamed T397757: Kafkamon -> Bookworm from Kafkamon -> bookworm to Kafkamon -> Bookworm.
Jun 24 2025, 6:49 PM · Observability-Logging, SRE Observability (FY2025/2026-Q1)
herron updated the task description for T353912: Observability Bookworm upgrades.
Jun 24 2025, 6:49 PM · SRE Observability (FY2025/2026-Q1), observability, Patch-For-Review
herron created T397757: Kafkamon -> Bookworm.
Jun 24 2025, 6:48 PM · Observability-Logging, SRE Observability (FY2025/2026-Q1)
herron created T397756: Kafka-logging -> Bookworm.
Jun 24 2025, 6:47 PM · Observability-Logging, SRE Observability (FY2025/2026-Q1)
herron triaged T396894: monitoring ACKs should be delivered via SMS as Low priority.
Jun 24 2025, 6:43 PM · SRE Observability, SRE
herron changed the status of T396894: monitoring ACKs should be delivered via SMS from Open to Stalled.

There doesn't appear to be a feature to generate a notification (push/sms/email/otherwise) on the acknowledge action in splunk oncall. There is the ability to integrate it via a webhook, but that doesn't solve the SMS handling piece by itself.

Jun 24 2025, 6:43 PM · SRE Observability, SRE
herron moved T393966: Update WDQS SLO lag queries to reflect graph split changes from Inbox to Radar on the observability board.
Jun 24 2025, 6:22 PM · Data-Platform-SRE (2025.11.07 - 2025.11.28), User-Elukey, Essential-Work, SRE-SLO, observability
herron closed T383923: Prometheus: queries matching on {__name__} error out on larger instances as Declined.

declining this since I doubt we'll make changes to support these queries, and we can hack around it using results from the labels api

Jun 24 2025, 4:51 PM · Observability-Metrics

Jun 18 2025

herron changed the status of T391714: Review logging cluster merge pressure, a subtask of T390215: Logstash is overwhelmed, from Open to Stalled.
Jun 18 2025, 2:44 PM · SRE Observability (FY2025/2026-Q1), Patch-For-Review, Observability-Logging
herron changed the status of T391714: Review logging cluster merge pressure from Open to Stalled.
Jun 18 2025, 2:44 PM · Patch-For-Review, Observability-Logging
herron changed the status of T391714: Review logging cluster merge pressure, a subtask of T391687: Consider sharding big logging indices, from Open to Stalled.
Jun 18 2025, 2:44 PM · Observability-Logging
herron triaged T383923: Prometheus: queries matching on {__name__} error out on larger instances as Low priority.
Jun 18 2025, 2:41 PM · Observability-Metrics
herron changed the status of T383923: Prometheus: queries matching on {__name__} error out on larger instances from Open to Stalled.
Jun 18 2025, 2:41 PM · Observability-Metrics
herron moved T397099: Grant Access to NDA LDAP for DerHexer from Backlog to NDA Pending on the LDAP-Access-Requests board.
Jun 18 2025, 1:33 PM · SRE, LDAP-Access-Requests

Jun 17 2025

herron closed T397004: Requesting access to analytics-privatedata-users for AndyRussG as Resolved.

The requested access has been merged and will be fully deployed within 30 minutes. I'll go ahead and resolve this but please don't hesitate to re-open if any followup is needed. Thanks!

Jun 17 2025, 6:03 PM · SRE, SRE-Access-Requests
herron added a member for WMF-NDA: AndyRussG_volunteer.
Jun 17 2025, 6:02 PM
herron added a comment to T397004: Requesting access to analytics-privatedata-users for AndyRussG.

Hi @AndyRussG_volunteer you should have just received an email regarding kerberos, and I'll update the account data to reflect krb: present now as well

Jun 17 2025, 5:58 PM · SRE, SRE-Access-Requests
herron updated the task description for T395917: Grant Access to analytics-privatedata-user (and LDAP nda, wmde) for Anton Kokh (WMDE).
Jun 17 2025, 5:40 PM · Data-Engineering-Radar, Data-Engineering, SRE, SRE-Access-Requests, LDAP-Access-Requests
herron added a comment to T397099: Grant Access to NDA LDAP for DerHexer.

Change #1160216 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] admin: add ldap_only entry for derhexer

https://gerrit.wikimedia.org/r/1160216

Jun 17 2025, 5:39 PM · SRE, LDAP-Access-Requests
herron updated the task description for T397004: Requesting access to analytics-privatedata-users for AndyRussG.
Jun 17 2025, 5:17 PM · SRE, SRE-Access-Requests
herron added a comment to T397004: Requesting access to analytics-privatedata-users for AndyRussG.

Thanks! I've just emailed you as well for the out of band verification step

Jun 17 2025, 4:56 PM · SRE, SRE-Access-Requests