RKemper (Ryan Kemper)
User

User Details

User Since
May 1 2020, 10:28 PM (311 w, 1 d)
Availability
Available
LDAP User
Unknown
MediaWiki User
RKemper (WMF) [ Global Accounts ]

Recent Activity

Fri, Apr 17

RKemper added a comment to T423327: Explore options for OpenSearch 2.x/3.x plugin packaging and distribution.

@bking the above approach seems really elegant (not quoting so I don't spam the ticket with a duplicate wall of text). It avoids a lot of the ugliness of the other hacks we've been putting in place to work around this. I'm fully in favor of this route.

Fri, Apr 17, 6:00 AM · Patch-For-Review, Data-Platform-SRE (2026-03-27 - 2026-04-17)

Wed, Apr 15

RKemper added a comment to T423327: Explore options for OpenSearch 2.x/3.x plugin packaging and distribution.

Great catch on that Brian. I uploaded a patch to remove the stale references in puppet.

Wed, Apr 15, 6:13 PM · Patch-For-Review, Data-Platform-SRE (2026-03-27 - 2026-04-17)
RKemper moved T423327: Explore options for OpenSearch 2.x/3.x plugin packaging and distribution from Backlog - project to Needs Review on the Data-Platform-SRE (2026-03-27 - 2026-04-17) board.

We should def work with upstream, but let's be pessimistic about timing, so we should patch on our end too.

Wed, Apr 15, 7:09 AM · Patch-For-Review, Data-Platform-SRE (2026-03-27 - 2026-04-17)
RKemper added a comment to T420691: Automate Wikimedia IDM ("Bitu") and GrowthBook role synchronization.

@BTullis raised the question of whether we should stick with the k8s cronjob approach or just integrate this into airflow. I can see valid arguments for both sides. I'm leaning airflow currently, but the underlying script will be the same in either case, so for now I'm just flagging this as a deferred decision to revisit later; the immediate priority is getting the full spec posted, getting team buy-in, and beginning to test out the script.

Wed, Apr 15, 4:40 AM · Patch-For-Review, Test Kitchen, Data-Platform-SRE, OKR-Work

Tue, Apr 14

RKemper moved T420696: API keys for GrowthBook from Backlog - operations to In Progress on the Data-Platform-SRE (2026-03-27 - 2026-04-17) board.

Keys provisioned and written to private-puppet; all that's left is the deployment-charts patch and then this can be closed

Tue, Apr 14, 3:09 PM · Patch-For-Review, Data-Platform-SRE (2026-03-27 - 2026-04-17), Test Kitchen, OKR-Work, Epic
RKemper added a comment to T420691: Automate Wikimedia IDM ("Bitu") and GrowthBook role synchronization.
  1. High-level Spec
Tue, Apr 14, 6:51 AM · Patch-For-Review, Test Kitchen, Data-Platform-SRE, OKR-Work

Fri, Apr 3

RKemper closed T242453: Detect and alert and/or remediate Blazegraph deadlocks as Resolved.

This is all done; remediation working across all blazegraph instance types (although practically, it's only wdqs-blazegraph on the wdqs-main clusters that should ever really have auto-restarts), with metrics flowing to prometheus. There's a subtask to add the grafana panels, but we can close this main ticket and leave that one open.

Fri, Apr 3, 9:26 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Patch-For-Review, Wikidata-Query-Service, Wikidata

Tue, Mar 31

RKemper closed T410577: sre.elasticsearch.rolling-operation: Fix reboot --start-datetime logic as Resolved.
Tue, Mar 31, 3:03 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Essential-Work

Sat, Mar 28

RKemper added a comment to T242453: Detect and alert and/or remediate Blazegraph deadlocks.

The recent datacenter switchover has changed what our steady-state thread count is, so I'm bumping the thread limit up a bit to reduce how frequently we're restarting.

Sat, Mar 28, 6:31 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Patch-For-Review, Wikidata-Query-Service, Wikidata
RKemper added a comment to T418723: Materialize analytics queries to improve superset dashboard latency.

Merged patch and ran sudo -u yarn yarn rmadmin -refreshQueues on the active master an-master1003

Sat, Mar 28, 2:27 AM · Patch-For-Review, SRE, Wikidata Platform Team (Sprint 03 (2026/03/03)), OKR-Work

Wed, Mar 25

RKemper updated the task description for T421285: Write cookbook for analytics coordinator reboot with automated service failover.
Wed, Mar 25, 8:00 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Patch-For-Review
RKemper changed the status of T421285: Write cookbook for analytics coordinator reboot with automated service failover from Open to In Progress.
Wed, Mar 25, 7:58 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Patch-For-Review
RKemper created T421285: Write cookbook for analytics coordinator reboot with automated service failover.
Wed, Mar 25, 7:54 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Patch-For-Review
RKemper added a comment to T242453: Detect and alert and/or remediate Blazegraph deadlocks.

Cleanup patch was merged last week, and I just uploaded a new patch to instrument prometheus metrics. Once this patch is shipped & confirmed working, this task will be officially done; we could implement SAL logging as well, but honestly I think the prometheus metrics are more than enough for now. Plus SAL logging could potentially get very noisy during persistent multi-DC outages anyway.

Wed, Mar 25, 7:57 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Patch-For-Review, Wikidata-Query-Service, Wikidata

Mar 19 2026

RKemper added a comment to T419041: Enable custom readahead settings for Ceph block devices serving workload on the dse-k8s clusters.

Brian and I merged and deployed the patch today, so the readahead fix is operating on codfw now. We'll circle back to eqiad next week after we've had some burn-in time.

Mar 19 2026, 9:41 PM · Patch-For-Review, Discovery-Search (2026.03.03 - 2026.04.03), Data-Platform-SRE (2026-03-06 - 2026-03-27)

Mar 17 2026

RKemper updated the task description for T420416: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet.
Mar 17 2026, 10:07 PM · SRE, Essential-Work, ops-eqiad, DC-Ops, Data-Platform-SRE (2026-03-06 - 2026-03-27)
RKemper triaged T420416: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet as Medium priority.
Mar 17 2026, 10:06 PM · SRE, Essential-Work, ops-eqiad, DC-Ops, Data-Platform-SRE (2026-03-06 - 2026-03-27)
RKemper added a project to T420416: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet: Essential-Work.
Mar 17 2026, 10:04 PM · SRE, Essential-Work, ops-eqiad, DC-Ops, Data-Platform-SRE (2026-03-06 - 2026-03-27)
RKemper moved T420416: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet from Backlog - project to Blocked/Waiting on the Data-Platform-SRE (2026-03-06 - 2026-03-27) board.
Mar 17 2026, 10:03 PM · SRE, Essential-Work, ops-eqiad, DC-Ops, Data-Platform-SRE (2026-03-06 - 2026-03-27)
RKemper moved T420416: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board.

Switched this to a HW failure ticket, given racadm getsel revealed a backplane issue

Mar 17 2026, 10:03 PM · SRE, Essential-Work, ops-eqiad, DC-Ops, Data-Platform-SRE (2026-03-06 - 2026-03-27)
RKemper renamed T420416: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet from an-worker1172.eqiad.wmnet unreachable since 2026-03-03 to hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet.
Mar 17 2026, 10:02 PM · SRE, Essential-Work, ops-eqiad, DC-Ops, Data-Platform-SRE (2026-03-06 - 2026-03-27)
RKemper created T420416: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet.
Mar 17 2026, 9:52 PM · SRE, Essential-Work, ops-eqiad, DC-Ops, Data-Platform-SRE (2026-03-06 - 2026-03-27)
RKemper closed T411568: October 2025 Bullseye reboots: Data Platform Engineering-owned hosts as Resolved.

Completed all remaining DPE-owned host reboots today (2026-03-17). All 143 reachable Bullseye hosts are now at or above the 5.10.0-36 target kernel (most are on 5.10.0-39). One host (an-worker1172) remains unreachable and needs to be investigated separately.

Mar 17 2026, 9:36 PM · Data-Platform-SRE (2026-03-06 - 2026-03-27), SRE
RKemper added a comment to T242453: Detect and alert and/or remediate Blazegraph deadlocks.

The bulk of this work is done. Remaining: 1 quick patch to clean up stuff, and (optional) instrument prometheus metrics [or SAL writing] so we have visibility into restarts getting performed

Mar 17 2026, 7:30 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Patch-For-Review, Wikidata-Query-Service, Wikidata

Mar 6 2026

RKemper updated the task description for T415073: Cleanup after decommission of the WDQS full graph endpoint.
Mar 6 2026, 6:39 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Wikidata, Wikidata-Query-Service, Essential-Work
RKemper added a comment to T415073: Cleanup after decommission of the WDQS full graph endpoint.

Merged puppet cleanup (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1247933)
and deployment-charts cleanup (https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1247947).
Also ran helmfile apply across staging/eqiad/codfw to tear down the
wikidata-query-legacy-full-gui helm release.

Mar 6 2026, 6:39 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Wikidata, Wikidata-Query-Service, Essential-Work

Feb 26 2026

RKemper moved T242453: Detect and alert and/or remediate Blazegraph deadlocks from Backlog - project to Needs Review on the Data-Platform-SRE (2026-02-13 - 2026-03-06) board.

Patches for some refactoring and for deploying to non-wikidata instances are ready for review. Covering the non-wdqs instances is more for completeness' sake, since we rarely see issues in WCQS or wdqs-internal-[main-scholarly].

Feb 26 2026, 2:21 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Patch-For-Review, Wikidata-Query-Service, Wikidata
RKemper changed the status of T242453: Detect and alert and/or remediate Blazegraph deadlocks from Open to In Progress.
Feb 26 2026, 2:19 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Patch-For-Review, Wikidata-Query-Service, Wikidata
RKemper added a comment to T242453: Detect and alert and/or remediate Blazegraph deadlocks.

This has been working great in eqiad and codfw. Already got some genuine restart events:

Feb 26 2026, 1:45 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Patch-For-Review, Wikidata-Query-Service, Wikidata

Feb 25 2026

RKemper added a comment to T242453: Detect and alert and/or remediate Blazegraph deadlocks.
ryankemper@wdqs2007:~$   cat /var/log/wdqs-blazegraph-deadlock-remediation.log
2026-02-25T17:54:08Z wdqs2007 RESTART: threads=525 (>120), restarting wdqs-blazegraph
2026-02-25T17:54:10Z wdqs2007 RESTART: wdqs-blazegraph restart issued successfully
2026-02-25T17:55:00Z wdqs2007 COOLDOWN: threads=381 (>120) but cooldown active (59m remaining)
2026-02-25T18:00:00Z wdqs2007 COOLDOWN: threads=860 (>120) but cooldown active (54m remaining)
Feb 25 2026, 6:03 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Patch-For-Review, Wikidata-Query-Service, Wikidata
RKemper added a comment to T411584: Refresh trafficserver_backend_requests_seconds histogram.

+1 to this — we're frequently running into the 1.2s ceiling during recent periods of WDQS instability.

Feb 25 2026, 3:49 AM · Traffic

Feb 24 2026

RKemper added a comment to T242453: Detect and alert and/or remediate Blazegraph deadlocks.

We have access to local jmx metrics, so the plan is to have a systemd timer running on each wdqs host (we'll deploy on just codfw in the first stage to test it out). If it detects thread count >= 1500, it restarts the service and logs to a file on disk (in the future, if this approach works, we could look into hooking it into the SAL). We'll also prohibit auto-restart if the service has been auto-restarted in the last hour.
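
A minimal sketch of the decision the timer would make on each tick, assuming the 1500-thread threshold and one-hour cooldown described above. The function and constant names are hypothetical, and reading the thread count from jmx and issuing the actual restart are left out; this is not the deployed script.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Values taken from the plan above; names are hypothetical.
THREAD_LIMIT = 1500
COOLDOWN = timedelta(hours=1)


def decide(threads: int, last_restart: Optional[datetime], now: datetime) -> str:
    """Return 'OK', 'COOLDOWN', or 'RESTART' for a single timer tick."""
    if threads < THREAD_LIMIT:
        return "OK"
    # Over the limit: only restart if no auto-restart happened in the last hour.
    if last_restart is not None and now - last_restart < COOLDOWN:
        return "COOLDOWN"
    return "RESTART"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(decide(1700, now - timedelta(minutes=10), now))
```

Keeping the decision in a pure function (no I/O) means the threshold and cooldown behavior can be checked without touching a live service.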

Feb 24 2026, 9:20 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Patch-For-Review, Wikidata-Query-Service, Wikidata
RKemper claimed T242453: Detect and alert and/or remediate Blazegraph deadlocks.
Feb 24 2026, 9:15 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Patch-For-Review, Wikidata-Query-Service, Wikidata
RKemper added a comment to T242453: Detect and alert and/or remediate Blazegraph deadlocks.

@bking We've been talking recently about implementing some rudimentary auto-remediation into WDQS. I figured we could revive this ticket and use it for our base of operations

Feb 24 2026, 8:48 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Patch-For-Review, Wikidata-Query-Service, Wikidata
RKemper added a comment to T416721: Requesting access to "Community Wishlist" dashboard for hmonroy on Superset.

Just stopping by, is this ticket ready to be closed out or is there something else still pending?

Feb 24 2026, 8:26 PM · Data-Platform-SRE (2026-02-13 - 2026-03-06)
RKemper closed T415696: Decommission WDQS Linked Data Fragment (LDF) endpoint as Resolved.

Removed references to LDF on wikitech:

Feb 24 2026, 8:20 PM · Data-Platform-SRE (2026-02-13 - 2026-03-06), Wikidata, Wikidata-Query-Service
RKemper claimed T403955: Switch all hard coded druid_public host urls to druid-public-coordinator svc url.
Feb 24 2026, 7:49 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Essential-Work, Patch-For-Review

Feb 23 2026

RKemper added a comment to T415696: Decommission WDQS Linked Data Fragment (LDF) endpoint.

Merging in all the cleanup patches now. I'll let puppet auto-run across next 30 mins, so if there's any issues with these patches they'll surface by then

Feb 23 2026, 11:13 PM · Data-Platform-SRE (2026-02-13 - 2026-03-06), Wikidata, Wikidata-Query-Service
RKemper added a comment to T415696: Decommission WDQS Linked Data Fragment (LDF) endpoint.

Good catch, thanks @TBurmeister

Feb 23 2026, 9:54 PM · Data-Platform-SRE (2026-02-13 - 2026-03-06), Wikidata, Wikidata-Query-Service

Feb 20 2026

RKemper added a comment to T411568: October 2025 Bullseye reboots: Data Platform Engineering-owned hosts.

an-test-worker* done

Feb 20 2026, 7:55 AM · Data-Platform-SRE (2026-03-06 - 2026-03-27), SRE

Feb 19 2026

RKemper added a comment to T415696: Decommission WDQS Linked Data Fragment (LDF) endpoint.

We missed the deadline slightly, but ldf has now officially been decommissioned! We'll circle back at a later date to merge the cleanup patches; let's give it at least a day before we proceed further.

Feb 19 2026, 10:32 PM · Data-Platform-SRE (2026-02-13 - 2026-03-06), Wikidata, Wikidata-Query-Service
RKemper added a comment to T410577: sre.elasticsearch.rolling-operation: Fix reboot --start-datetime logic.

The most recent round of feedback has been addressed

Feb 19 2026, 8:41 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Essential-Work

Feb 13 2026

RKemper added a comment to T415002: Unusually high disk errors on the an-worker nodes since upgrading the disks.

Hit another one: https://phabricator.wikimedia.org/T389065#11613962

Feb 13 2026, 7:15 AM · Data-Platform-SRE (2026-03-06 - 2026-03-27), SRE, ops-eqiad, DC-Ops
RKemper reopened T389065: hw troubleshooting: disk in slot 10 for an-worker1194, a subtask of T388512: Bring an-worker1[187-208] into the hadoop cluster, as Open.
Feb 13 2026, 7:09 AM · Data-Platform-SRE (2025.03.01 - 2025.03.21)
RKemper reopened T389065: hw troubleshooting: disk in slot 10 for an-worker1194 as "Open".

Slot 252:10 failed again D:

Feb 13 2026, 7:09 AM · SRE, ops-eqiad, DC-Ops

Feb 12 2026

RKemper moved T411919: hw troubleshooting: PERC1 battery failure for an-worker1148 from In Progress to Done on the Data-Platform-SRE (2026.01.05 - 2026.01.23) board.

an-worker1148 was missing a /boot entry in its fstab (but the mbr was still able to find the grub stuff on /dev/sda1, so the host was still bootable; it just wouldn't upgrade its kernel upon reboot like I'd expected). I added that back in, and the kernel is properly upgraded now.

Feb 12 2026, 11:28 PM · Essential-Work, Data-Platform-SRE (2026.01.05 - 2026.01.23), SRE, ops-eqiad, DC-Ops

Feb 11 2026

RKemper added a comment to T411568: October 2025 Bullseye reboots: Data Platform Engineering-owned hosts.

@RKemper You don't need to reboot these by hand, BTW. There's a cookbook (sre.kafka.roll-restart-reboot-brokers), which sets proper downtime, reboots one node at a time and performs proper cluster health checks before proceeding with the next.

Feb 11 2026, 9:25 PM · Data-Platform-SRE (2026-03-06 - 2026-03-27), SRE
RKemper added a comment to T414948: Decommission an-worker11[17-41] but reuse an-worker11[17,18,31,33,34] as dse-k8s-workers.

Running decom cookbook, and have the puppet patch up. There's still changes to the private puppet repo and site.pp that will need to be made, but I think we're better off doing those all at once when we decom all the hosts. Just did the non-site.pp puppet patch stuff for now so I don't forget it later

Feb 11 2026, 9:12 PM · SRE, DC-Ops, ops-eqiad, Data-Platform-SRE (2026-02-13 - 2026-03-06)
RKemper added a comment to T411919: hw troubleshooting: PERC1 battery failure for an-worker1148.

RAID card swap verified on an-worker1148. All checks pass:

Feb 11 2026, 9:00 PM · Essential-Work, Data-Platform-SRE (2026.01.05 - 2026.01.23), SRE, ops-eqiad, DC-Ops
RKemper added a comment to T411568: October 2025 Bullseye reboots: Data Platform Engineering-owned hosts.

kafka-jumbo successfully completed.

Feb 11 2026, 4:12 AM · Data-Platform-SRE (2026-03-06 - 2026-03-27), SRE
RKemper added a comment to T411568: October 2025 Bullseye reboots: Data Platform Engineering-owned hosts.

kafka-test* ongoing right now.

Feb 11 2026, 12:01 AM · Data-Platform-SRE (2026-03-06 - 2026-03-27), SRE

Feb 10 2026

RKemper closed T414692: Deploy the opensearch semantic-search cluster on dse-k8s-codfw, a subtask of T414693: Build and deploy the opensearch 3.x docker image, as Resolved.
Feb 10 2026, 10:12 PM · Essential-Work, Data-Platform-SRE (2026.01.05 - 2026.01.23)
RKemper closed T414692: Deploy the opensearch semantic-search cluster on dse-k8s-codfw, a subtask of T414703: Deploy the opensearch-semantic-search opensearch 3.x clusters, as Resolved.
Feb 10 2026, 10:12 PM · Data-Platform-SRE (2026-02-13 - 2026-03-06)
RKemper closed T414692: Deploy the opensearch semantic-search cluster on dse-k8s-codfw as Resolved.
Feb 10 2026, 10:12 PM · Data-Platform-SRE (2026.01.23 - 2026.02.13)
RKemper reassigned T411919: hw troubleshooting: PERC1 battery failure for an-worker1148 from RKemper to Jclark-ctr.

Alright, an-worker1132 has been shut down.

Feb 10 2026, 8:23 PM · Essential-Work, Data-Platform-SRE (2026.01.05 - 2026.01.23), SRE, ops-eqiad, DC-Ops
RKemper merged T414701: Define the opensearch-semantic-search namespace into T414702: Provision opensearch-semantic-search namespaces.
Feb 10 2026, 7:48 PM · Data-Platform-SRE (2026.01.23 - 2026.02.13)
RKemper merged task T414701: Define the opensearch-semantic-search namespace into T414702: Provision opensearch-semantic-search namespaces.
Feb 10 2026, 7:48 PM · Data-Platform-SRE (2026.01.23 - 2026.02.13)
RKemper claimed T411919: hw troubleshooting: PERC1 battery failure for an-worker1148.

Great, I'll ping you here when one is ready. Looks like we're a few days away from being drained, so the latest we'd have a host available to swap would be next monday, but there might be some magic we can do to terminate one earlier; I'll talk with Ben or somebody.

Feb 10 2026, 4:33 PM · Essential-Work, Data-Platform-SRE (2026.01.05 - 2026.01.23), SRE, ops-eqiad, DC-Ops
RKemper added a comment to T414948: Decommission an-worker11[17-41] but reuse an-worker11[17,18,31,33,34] as dse-k8s-workers.

Under 7 million now. Should be 3-4 more days.

Feb 10 2026, 4:30 PM · SRE, DC-Ops, ops-eqiad, Data-Platform-SRE (2026-02-13 - 2026-03-06)
RKemper added a comment to T411568: October 2025 Bullseye reboots: Data Platform Engineering-owned hosts.

Bleh, turned out I'd had a typo in my cumin query, so I'd inverted the hosts: the ones I listed as needing reboot were all already done.

Feb 10 2026, 6:03 AM · Data-Platform-SRE (2026-03-06 - 2026-03-27), SRE

Feb 7 2026

RKemper closed T414691: Deploy the opensearch semantic-search cluster on dse-k8s-eqiad, a subtask of T414693: Build and deploy the opensearch 3.x docker image, as Resolved.
Feb 7 2026, 5:33 AM · Essential-Work, Data-Platform-SRE (2026.01.05 - 2026.01.23)
RKemper closed T414691: Deploy the opensearch semantic-search cluster on dse-k8s-eqiad, a subtask of T414703: Deploy the opensearch-semantic-search opensearch 3.x clusters, as Resolved.
Feb 7 2026, 5:33 AM · Data-Platform-SRE (2026-02-13 - 2026-03-06)
RKemper closed T414691: Deploy the opensearch semantic-search cluster on dse-k8s-eqiad as Resolved.
Feb 7 2026, 5:33 AM · Data-Platform-SRE (2026.01.23 - 2026.02.13)
RKemper closed T414702: Provision opensearch-semantic-search namespaces, a subtask of T414691: Deploy the opensearch semantic-search cluster on dse-k8s-eqiad, as Resolved.
Feb 7 2026, 5:32 AM · Data-Platform-SRE (2026.01.23 - 2026.02.13)
RKemper closed T414702: Provision opensearch-semantic-search namespaces as Resolved.
Feb 7 2026, 5:32 AM · Data-Platform-SRE (2026.01.23 - 2026.02.13)
RKemper updated the task description for T414702: Provision opensearch-semantic-search namespaces.
Feb 7 2026, 5:32 AM · Data-Platform-SRE (2026.01.23 - 2026.02.13)

Feb 6 2026

RKemper moved T411919: hw troubleshooting: PERC1 battery failure for an-worker1148 from Reported to In Progress on the Data-Platform-SRE (2026.01.05 - 2026.01.23) board.

Upon rebooting the host, we're back to the same issue:

Feb 6 2026, 6:04 AM · Essential-Work, Data-Platform-SRE (2026.01.05 - 2026.01.23), SRE, ops-eqiad, DC-Ops
RKemper added a comment to T416166: Follow-up: Degraded Disk Not Yet Added to RAID (an-worker1175, an-worker1199).

an-worker1199 still remains to be done; looks like it's waiting for another drive swap.

Feb 6 2026, 3:04 AM · Data-Platform-SRE (2026-03-06 - 2026-03-27), SRE, ops-eqiad, DC-Ops
RKemper added a comment to T416166: Follow-up: Degraded Disk Not Yet Added to RAID (an-worker1175, an-worker1199).

Just took care of an-worker1175:

Feb 6 2026, 3:02 AM · Data-Platform-SRE (2026-03-06 - 2026-03-27), SRE, ops-eqiad, DC-Ops

Feb 5 2026

RKemper closed T415002: Unusually high disk errors on the an-worker nodes since upgrading the disks as Resolved.
Feb 5 2026, 10:56 PM · Data-Platform-SRE (2026-03-06 - 2026-03-27), SRE, ops-eqiad, DC-Ops
RKemper added a comment to T415002: Unusually high disk errors on the an-worker nodes since upgrading the disks.

Alright, entered the emergency shell, added a virtual device (/dev/sdm) for the new drive located at 252:8, formatted/partitioned/mounted, and verified everything looked good. More detailed shell log follows:

Feb 5 2026, 10:56 PM · Data-Platform-SRE (2026-03-06 - 2026-03-27), SRE, ops-eqiad, DC-Ops
RKemper moved T415696: Decommission WDQS Linked Data Fragment (LDF) endpoint from Needs Review to Blocked/Waiting on the Data-Platform-SRE (2026.01.23 - 2026.02.13) board.

+1'd all the patches. Everything looks great so I couldn't find anything to change, minus one small typo in the base commit message.

Feb 5 2026, 9:06 PM · Data-Platform-SRE (2026-02-13 - 2026-03-06), Wikidata, Wikidata-Query-Service
RKemper updated the task description for T416365: WDQS: Making BlazegraphFailedServerRatioIncrease alerts less sensitive.
Feb 5 2026, 9:04 PM · Wikidata Platform Team, Data-Platform-SRE (2026.01.23 - 2026.02.13)
RKemper added a comment to T414306: wdqs: alert on ratio of failed queries increase.

We've tuned the alert slightly; a little over a third of the time we were seeing the alert fire for between 1-14 minutes and then resolve. So we've bumped from 30m to 45m to make the alert more actionable on the SRE side. Let us know if there are any issues with the change; for the time being I've merged the patch.

Feb 5 2026, 9:01 PM · Wikidata Platform Team (Sprint 01 (2026/01/13)), OKR-Work, Wikidata
RKemper closed T416365: WDQS: Making BlazegraphFailedServerRatioIncrease alerts less sensitive as Resolved.

I'll update the Wikidata platform team on T414306 with this change. I think given it's a very simple change of 30m -> 45m, and the original intent of the alert is still preserved, there's no need to block this on getting formal approval. So I'll mark this resolved.

Feb 5 2026, 8:59 PM · Wikidata Platform Team, Data-Platform-SRE (2026.01.23 - 2026.02.13)
RKemper added a comment to T362114: OpenSearch on K8s: Create Dashboards.

Should have a more significant update on this tomorrow, but wanted to call out that I did just notice the ticket description mentions:

Feb 5 2026, 1:01 AM · Data-Platform-SRE (2026.01.23 - 2026.02.13), OKR-Work

Feb 4 2026

RKemper added a comment to T415002: Unusually high disk errors on the an-worker nodes since upgrading the disks.

Unsurprisingly there's some post-swap steps for us (DPE SRE) to resolve:

Feb 4 2026, 10:37 PM · Data-Platform-SRE (2026-03-06 - 2026-03-27), SRE, ops-eqiad, DC-Ops

Feb 3 2026

RKemper added a comment to T411568: October 2025 Bullseye reboots: Data Platform Engineering-owned hosts.

Resuming an-worker reboots.

Feb 3 2026, 7:30 PM · Data-Platform-SRE (2026-03-06 - 2026-03-27), SRE
RKemper updated the task description for T416167: Create SLOs for OpenSearch on k8s.
Feb 3 2026, 4:32 PM · Data-Platform-SRE (2026-03-06 - 2026-03-27)

Feb 2 2026

RKemper renamed T416167: Create SLOs for OpenSearch on k8s from Create SLO for OpenSearch on k8s to Create SLOs for OpenSearch on k8s.
Feb 2 2026, 11:25 PM · Data-Platform-SRE (2026-03-06 - 2026-03-27)
RKemper created T416269: Create SLIs for opensearch on k8s.
Feb 2 2026, 11:23 PM · Data-Platform-SRE (2026-03-06 - 2026-03-27)
RKemper renamed T416167: Create SLOs for OpenSearch on k8s from Create SLO for the OpenSearch cluster on k8s to Create SLO for OpenSearch on k8s.
Feb 2 2026, 11:07 PM · Data-Platform-SRE (2026-03-06 - 2026-03-27)
RKemper added a comment to T416167: Create SLOs for OpenSearch on k8s.

(Discussed with bking, brief summary follows)

Feb 2 2026, 11:04 PM · Data-Platform-SRE (2026-03-06 - 2026-03-27)
RKemper closed T408026: Add WDQS triples disrepancy alerting as Resolved.

This alert seems like it's in a decent spot for now (it will fire if there is >2% difference between the most up-to-date host and the furthest-behind host for 60 minutes). Closing now; we can re-open if further tuning is needed.
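
A rough sketch of the condition described above, assuming the gap is measured as the difference in triple counts relative to the most up-to-date host and sampled once per minute. The helper names and the sampling model are hypothetical; the real alert is defined as an alerting rule, not in Python.

```python
from typing import Dict, List


def relative_gap(triples_by_host: Dict[str, int]) -> float:
    """Gap between the highest and lowest triple count, relative to the highest."""
    most = max(triples_by_host.values())
    least = min(triples_by_host.values())
    return (most - least) / most if most else 0.0


def should_fire(gaps_per_minute: List[float],
                threshold: float = 0.02,
                hold_minutes: int = 60) -> bool:
    """Fire only if every one of the last `hold_minutes` samples exceeds the threshold."""
    recent = gaps_per_minute[-hold_minutes:]
    return len(recent) == hold_minutes and all(g > threshold for g in recent)


if __name__ == "__main__":
    gap = relative_gap({"hostA": 1_000_000, "hostB": 970_000})
    print(gap, should_fire([gap] * 60))
```

Requiring the gap to persist for the whole window keeps short-lived lag from paging anyone.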

Feb 2 2026, 10:50 PM · Data-Platform-SRE (2026.01.23 - 2026.02.13), Essential-Work, Wikidata, Wikidata-Query-Service

Jan 29 2026

RKemper moved T410577: sre.elasticsearch.rolling-operation: Fix reboot --start-datetime logic from In Progress to Needs Review on the Data-Platform-SRE (2026.01.23 - 2026.02.13) board.

Spicerack and cookbook patches are up. 99% of the logic lives in spicerack so that's the most important patch to review.

Jan 29 2026, 8:36 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Essential-Work
RKemper moved T408026: Add WDQS triples disrepancy alerting from Backlog - project to Needs Review on the Data-Platform-SRE (2026.01.23 - 2026.02.13) board.

I think with https://gerrit.wikimedia.org/r/c/operations/alerts/+/1174723 merged this task is likely done. Putting it in Needs Review for now until we verify that we're happy with these alerts (cc @bking)

Jan 29 2026, 5:15 AM · Data-Platform-SRE (2026.01.23 - 2026.02.13), Essential-Work, Wikidata, Wikidata-Query-Service
RKemper moved T414702: Provision opensearch-semantic-search namespaces from Backlog - project to Needs Review on the Data-Platform-SRE (2026.01.23 - 2026.02.13) board.

Pushed out patches for both opensearch-semantic-search and opensearch-semantic-search-test. We should be ready to provision these namespaces when we're ready.

Jan 29 2026, 5:03 AM · Data-Platform-SRE (2026.01.23 - 2026.02.13)
RKemper claimed T414691: Deploy the opensearch semantic-search cluster on dse-k8s-eqiad.

Alright, patches for both opensearch-semantic-search and opensearch-semantic-search-test are up, and outstanding comments have been addressed.

Jan 29 2026, 4:47 AM · Data-Platform-SRE (2026.01.23 - 2026.02.13)

Jan 23 2026

RKemper added a comment to T413360: Degraded RAID on an-worker1200.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   12/22/2025 13:29:20
Source:      system
Severity:    Critical
Description: Fault detected on drive 8 in disk drive bay 1.
-------------------------------------------------------------------------------
Record:      8
Date/Time:   01/15/2026 09:52:05
Source:      system
Severity:    Ok
Description: Drive 8 in disk drive bay 1 is operating normally.
-------------------------------------------------------------------------------
Jan 23 2026, 7:36 PM · Data-Platform-SRE (2026.01.23 - 2026.02.13), Essential-Work, SRE, DC-Ops, ops-eqiad
RKemper added a comment to T393966: Update WDQS SLOs to reflect graph split changes.

Merged patch for the new SLO (and corresponding recording rules; I realized pyrra wants stuff in terms of total and errors thus why there's 2 recording rules instead of 1 now).

Jan 23 2026, 5:10 AM · Wikidata Platform Team, Data-Platform-SRE (2026-03-27 - 2026-04-17), Wikidata, Wikidata-Query-Service, User-Elukey, Essential-Work, SRE-SLO, observability
RKemper added a comment to T414702: Provision opensearch-semantic-search namespaces.

I think I've got everything we need in this patch: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1230512

Jan 23 2026, 5:07 AM · Data-Platform-SRE (2026.01.23 - 2026.02.13)

Jan 22 2026

RKemper moved T414517: Request: wdqs shell access for user trueg from Quick Wins to Needs Review on the Data-Platform-SRE (2026.01.05 - 2026.01.23) board.

@trueg You should have access within the next 30 minutes, can you verify that you've got the expected access?

Jan 22 2026, 5:38 PM · Data-Platform-SRE (2026.01.23 - 2026.02.13), Wikidata, Wikidata-Query-Service
RKemper renamed T393966: Update WDQS SLOs to reflect graph split changes from Update WDQS SLO lag queries to reflect graph split changes to Update WDQS SLOs to reflect graph split changes.
Jan 22 2026, 5:21 PM · Wikidata Platform Team, Data-Platform-SRE (2026-03-27 - 2026-04-17), Wikidata, Wikidata-Query-Service, User-Elukey, Essential-Work, SRE-SLO, observability

Jan 16 2026

RKemper updated subscribers of T414517: Request: wdqs shell access for user trueg.

@Gehel Just realized your approval is necessary for wdqs-admins and wdqs-roots groups as well

Jan 16 2026, 10:11 PM · Data-Platform-SRE (2026.01.23 - 2026.02.13), Wikidata, Wikidata-Query-Service

Jan 15 2026

RKemper added a comment to T414695: Create openjdk-21 docker images based on Bookworm.
root@build2001:/srv/images/production-images# /srv/deployment/docker-pkg/venv/bin/docker-pkg -c /etc/production-images/config.yaml build images/ --select '*jre*'
== Step 0: scanning /srv/images/production-images/images/ ==
Will build the following images:
* docker-registry.discovery.wmnet/openjdk-21-jre:0.1
== Step 1: building images ==
* Built image docker-registry.discovery.wmnet/openjdk-21-jre:0.1
== Step 2: publishing ==
Successfully published image docker-registry.discovery.wmnet/openjdk-21-jre:0.1
== Build done! ==
You can see the logs at ./docker-pkg-build.log
Jan 15 2026, 10:45 PM · Essential-Work, Data-Platform-SRE (2026.01.05 - 2026.01.23), Semantic Search, Discovery-Search (2026.01.05 - 2026.01.30), CirrusSearch
RKemper added a comment to T414695: Create openjdk-21 docker images based on Bookworm.

Brian and I are testing out building of https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1227376 like so:

Jan 15 2026, 10:33 PM · Essential-Work, Data-Platform-SRE (2026.01.05 - 2026.01.23), Semantic Search, Discovery-Search (2026.01.05 - 2026.01.30), CirrusSearch
RKemper added a comment to T411568: October 2025 Bullseye reboots: Data Platform Engineering-owned hosts.

Got about 40 an-worker* hosts done, but there's still another ~80 left to be done

Jan 15 2026, 8:03 AM · Data-Platform-SRE (2026-03-06 - 2026-03-27), SRE

Jan 14 2026

RKemper closed T412969: Add https://cat.apis.beeldengeluid.nl/sparql to WDQS allowlist as Resolved.

Change rolled out. Please re-open and give us a shout if there's any issues

Jan 14 2026, 12:20 AM · Essential-Work, Data-Platform-SRE (2026.01.05 - 2026.01.23), Wikidata, Wikidata-Query-Service
RKemper closed T412969: Add https://cat.apis.beeldengeluid.nl/sparql to WDQS allowlist, a subtask of T402894: ☎️ Wikidata Allowlist nominations: link in documentation, as Resolved.
Jan 14 2026, 12:20 AM · Wikibase Cloud
RKemper closed T406721: Add DNB / GND to WDQS allowlist as Resolved.

Change rolled out. Please re-open and give us a shout if there's any issues

Jan 14 2026, 12:19 AM · Essential-Work, Data-Platform-SRE (2026.01.05 - 2026.01.23), Wikibase Cloud, Wikidata, Wikidata-Query-Service