User Details
- User Since
- May 1 2020, 10:28 PM (311 w, 1 d)
- Availability
- Available
- LDAP User
- Unknown
- MediaWiki User
- RKemper (WMF) [ Global Accounts ]
Fri, Apr 17
@bking The above approach seems really elegant (not quoting it so I don't spam the ticket with a duplicate wall of text). It avoids a lot of the ugliness of some of the other hacks we've been putting in place to work around this. I'm fully in favor of this route.
Wed, Apr 15
Great catch on that, Brian. I uploaded a patch to remove the stale references in puppet.
We should def work with upstream, but let's be pessimistic about timing, so we should patch on our end too.
@BTullis raised the question of whether we should stick with the k8s cronjob approach or integrate this into airflow. I can see valid arguments for both sides. I'm leaning towards airflow currently, but the underlying script will be the same either way, so for now I'm just flagging this as a deferred decision to revisit later; the immediate priority is getting the full spec posted, getting team buy-in, and beginning to test out the script.
Tue, Apr 14
Keys provisioned and written to private-puppet; all that's left is the deployment-charts patch and then this can be closed
- High-level Spec
Fri, Apr 3
This is all done; remediation working across all blazegraph instance types (although practically, it's only wdqs-blazegraph on the wdqs-main clusters that should ever really have auto-restarts), with metrics flowing to prometheus. There's a subtask to add the grafana panels, but we can close this main ticket and leave that one open.
Tue, Mar 31
Sat, Mar 28
The recent datacenter switchover has changed what our steady-state thread count is, so I'm bumping the thread limit up a bit to reduce how frequently we're restarting.
Merged patch and ran sudo -u yarn yarn rmadmin -refreshQueues on the active master an-master1003
Wed, Mar 25
Cleanup patch was merged last week, and I just uploaded a new patch to instrument prometheus metrics. Once this patch is shipped & confirmed working, this task will be officially done; we could implement SAL logging as well, but honestly I think the prometheus metrics are more than enough for now. Plus SAL logging could potentially get very noisy during persistent multi-DC outages anyway.
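For reference, one common way to expose this kind of restart counter (not necessarily what the actual patch does) is via node-exporter's textfile collector: the remediation script rewrites a small .prom file that node-exporter then picks up. A hypothetical sketch, with the collector directory and metric name as assumptions:

# Hypothetical sketch: emit a restart counter for node-exporter's textfile collector.
# The directory below is an assumption; the real path depends on how node-exporter is configured.
PROM_DIR=/var/lib/prometheus/node.d
PROM_FILE="$PROM_DIR/wdqs_deadlock_remediation.prom"

# Read the current count (0 if the file or metric doesn't exist yet), increment, and rewrite atomically.
COUNT=$(awk '/^wdqs_deadlock_auto_restarts_total/ {print int($2)}' "$PROM_FILE" 2>/dev/null || echo 0)
COUNT=$(( ${COUNT:-0} + 1 ))
{
  echo "# HELP wdqs_deadlock_auto_restarts_total Auto-restarts issued by the deadlock remediation script."
  echo "# TYPE wdqs_deadlock_auto_restarts_total counter"
  echo "wdqs_deadlock_auto_restarts_total $COUNT"
} > "$PROM_FILE.tmp" && mv "$PROM_FILE.tmp" "$PROM_FILE"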
Mar 19 2026
Brian and I merged and deployed the patch today, so the readahead fix is operating on codfw now. We'll circle back to eqiad next week after we've had some burn-in time.
Mar 17 2026
Switched this to a HW failure ticket, given racadm getsel revealed a backplane issue
Completed all remaining DPE-owned host reboots today (2026-03-17). All 143 reachable Bullseye hosts are now at or above the 5.10.0-36 target kernel (most are on 5.10.0-39). One host (an-worker1172) remains unreachable and needs to be investigated separately.
The bulk of this work is done. Remaining: one quick cleanup patch, and (optionally) instrumenting prometheus metrics [or SAL writing] so we have visibility into the restarts being performed.
Mar 6 2026
Merged puppet cleanup (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1247933)
and deployment-charts cleanup (https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1247947).
Also ran helmfile apply across staging/eqiad/codfw to tear down the wikidata-query-legacy-full-gui helm release.
Feb 26 2026
Patches for some refactoring and for deploying to the non-wikidata instances are ready for review. The non-wdqs instances are included more for completeness' sake, since we rarely see issues in WCQS or wdqs-internal-[main-scholarly].
This has been working great in eqiad and codfw. Already got some genuine restart events:
Feb 25 2026
ryankemper@wdqs2007:~$ cat /var/log/wdqs-blazegraph-deadlock-remediation.log
2026-02-25T17:54:08Z wdqs2007 RESTART: threads=525 (>120), restarting wdqs-blazegraph
2026-02-25T17:54:10Z wdqs2007 RESTART: wdqs-blazegraph restart issued successfully
2026-02-25T17:55:00Z wdqs2007 COOLDOWN: threads=381 (>120) but cooldown active (59m remaining)
2026-02-25T18:00:00Z wdqs2007 COOLDOWN: threads=860 (>120) but cooldown active (54m remaining)
+1 to this — we're frequently running into the 1.2s ceiling during recent periods of WDQS instability.
Feb 24 2026
We have access to local JMX metrics, so the plan is to have a systemd timer running on each wdqs host (we'll deploy to just codfw in the first stage to test it out). If it detects a thread count >= 1500, it restarts the service and logs to a file on disk (in the future, if this approach works, we could look into hooking it into the SAL). We'll also prohibit auto-restart if the service has already been auto-restarted in the last hour. A rough sketch of what that check could look like is below.
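The following is a minimal, hypothetical bash sketch of the check the systemd timer would run. The metrics endpoint, port, metric name, and state-file path are placeholder assumptions rather than details from the actual patch; the threshold and cooldown mirror the plan above.

#!/bin/bash
# Hypothetical sketch of the wdqs deadlock-remediation check, run periodically by a systemd timer.
# Assumptions: the blazegraph thread count is scraped from a local prometheus jmx exporter, and the
# time of the last auto-restart is tracked via a state file's mtime. Paths and ports are illustrative.
set -euo pipefail

THRESHOLD=1500              # restart once the thread count reaches this value
COOLDOWN_SECS=3600          # never auto-restart more than once per hour
LOGFILE=/var/log/wdqs-blazegraph-deadlock-remediation.log
STATEFILE=/var/run/wdqs-blazegraph-last-auto-restart
HOST=$(hostname)
NOW=$(date -u +%Y-%m-%dT%H:%M:%SZ)

# Placeholder source for the thread count; the real script would read the local jmx metrics.
THREADS=$(curl -s http://localhost:9101/metrics | awk '/^jvm_threads_current/ {print int($2); exit}')

if [ "${THREADS:-0}" -lt "$THRESHOLD" ]; then
    exit 0   # healthy, nothing to do
fi

# Enforce the cooldown: skip the restart if we already auto-restarted within the last hour.
LAST=$(stat -c %Y "$STATEFILE" 2>/dev/null || echo 0)
AGE=$(( $(date +%s) - LAST ))
if [ "$AGE" -lt "$COOLDOWN_SECS" ]; then
    echo "$NOW $HOST COOLDOWN: threads=$THREADS (>=$THRESHOLD) but cooldown active" >> "$LOGFILE"
    exit 0
fi

echo "$NOW $HOST RESTART: threads=$THREADS (>=$THRESHOLD), restarting wdqs-blazegraph" >> "$LOGFILE"
systemctl restart wdqs-blazegraph
touch "$STATEFILE"
echo "$NOW $HOST RESTART: wdqs-blazegraph restart issued successfully" >> "$LOGFILE"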
@bking We've been talking recently about implementing some rudimentary auto-remediation into WDQS. I figured we could revive this ticket and use it for our base of operations
Just stopping by: is this ticket ready to be closed out, or is there something else still pending?
Removed references to LDF on wikitech:
Feb 23 2026
Merging in all the cleanup patches now. I'll let puppet auto-run across the next 30 mins, so if there are any issues with these patches they'll surface by then.
Good catch, thanks @TBurmeister
Feb 20 2026
an-test-worker* done
Feb 19 2026
We missed the deadline slightly, but LDF has now officially been decommissioned! We'll circle back at a later date to merge the cleanup patches; let's give it at least a day before we proceed further.
The most recent rounds of feedback have been addressed.
Feb 13 2026
Hit another one: https://phabricator.wikimedia.org/T389065#11613962
Slot 252:10 failed again D:
Feb 12 2026
an-worker1148 was missing a /boot entry in its fstab (but the MBR was still able to find the GRUB stuff on /dev/sda1, so the host was still bootable; it just wouldn't upgrade its kernel upon reboot like I'd expected). I added that entry back in, and the kernel is properly upgraded now.
Feb 11 2026
Running the decom cookbook, and the puppet patch is up. There are still changes to the private puppet repo and site.pp that will need to be made, but I think we're better off doing those all at once when we decom all the hosts. I just did the non-site.pp puppet patch stuff for now so I don't forget it later.
RAID card swap verified on an-worker1148. All checks pass:
kafka-jumbo successfully completed.
kafka-test* ongoing right now.
Feb 10 2026
Alright, an-worker1132 has been shut down.
Great, I'll ping you here when one is ready. Looks like we're a few days away from being drained, so the latest we'd have a host available to swap would be next Monday, but there might be some magic we can do to terminate one earlier; I'll talk with Ben or somebody.
Under 7 million now. Should be 3-4 more days.
Bleh, turned out I'd had a typo in my cumin query, so I'd inverted the hosts: the ones I listed as needing reboot were all already done.
Feb 7 2026
Feb 6 2026
Upon rebooting the host, we're back to the same issue:
an-worker1199 still remains to be done; looks like it's waiting for another drive swap.
Just took care of an-worker1175:
Feb 5 2026
Alright, entered the emergency shell, added a virtual device (/dev/sdm) for the new drive located at 252:8, formatted/partitioned/mounted, and verified everything looked good. More detailed shell log follows:
+1'd all the patches. Everything looks great, so I couldn't find anything to change aside from one small typo in the base commit message.
We've tuned the alert slightly; a little over a third of the time, we were seeing the alert fire for between 1 and 14 minutes and then resolve. So we've bumped from 30m to 45m to make the alert more actionable on the SRE side. Let us know if there are any issues with the change; for the time being I've merged the patch.
I'll update the Wikidata platform team on T414306 with this change. I think given it's a very simple change of 30m -> 45m, and the original intent of the alert is still preserved, there's no need to block this on getting formal approval. So I'll mark this resolved.
Should have a more significant update on this tomorrow, but wanted to call out that I did just notice the ticket description mentions:
Feb 4 2026
Unsurprisingly, there are some post-swap steps for us (DPE SRE) to resolve:
Feb 3 2026
Resuming an-worker reboots.
Feb 2 2026
(Discussed with bking, brief summary follows)
This alert seems like it's in a decent spot for now (it will fire if there is a >2% difference between the most up-to-date host and the furthest-behind host for 60 minutes). Closing now; we can re-open if further tuning is needed.
Jan 29 2026
Spicerack and cookbook patches are up. 99% of the logic lives in spicerack so that's the most important patch to review.
I think with https://gerrit.wikimedia.org/r/c/operations/alerts/+/1174723 merged this task is likely done. Putting it in Needs Review for now until we verify that we're happy with these alerts (cc @bking)
Pushed out patches for both opensearch-semantic-search and opensearch-semantic-search-test. We should be able to provision these namespaces whenever we're ready.
Alright, patches for both opensearch-semantic-search and opensearch-semantic-search-test are up, and outstanding comments have been addressed.
Jan 23 2026
-------------------------------------------------------------------------------
Record:      7
Date/Time:   12/22/2025 13:29:20
Source:      system
Severity:    Critical
Description: Fault detected on drive 8 in disk drive bay 1.
-------------------------------------------------------------------------------
Record:      8
Date/Time:   01/15/2026 09:52:05
Source:      system
Severity:    Ok
Description: Drive 8 in disk drive bay 1 is operating normally.
-------------------------------------------------------------------------------
Merged the patch for the new SLO (and the corresponding recording rules; I realized pyrra wants things expressed in terms of totals and errors, which is why there are 2 recording rules instead of 1 now).
I think I've got everything we need in this patch: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1230512
Jan 22 2026
@trueg You should have access within the next 30 minutes; can you verify that you've got the expected access?
Jan 16 2026
@Gehel Just realized your approval is necessary for the wdqs-admins and wdqs-roots groups as well.
Jan 15 2026
root@build2001:/srv/images/production-images# /srv/deployment/docker-pkg/venv/bin/docker-pkg -c /etc/production-images/config.yaml build images/ --select '*jre*'
== Step 0: scanning /srv/images/production-images/images/ ==
Will build the following images:
* docker-registry.discovery.wmnet/openjdk-21-jre:0.1
== Step 1: building images ==
* Built image docker-registry.discovery.wmnet/openjdk-21-jre:0.1
== Step 2: publishing ==
Successfully published image docker-registry.discovery.wmnet/openjdk-21-jre:0.1
== Build done! ==
You can see the logs at ./docker-pkg-build.log
Brian and I are testing out building of https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1227376 like so:
Got about 40 an-worker* hosts done, but there are still another ~80 left to do.
Jan 14 2026
Change rolled out. Please re-open and give us a shout if there are any issues.
Change rolled out. Please re-open and give us a shout if there are any issues.
