User Details
- User Since
- Nov 3 2025, 12:35 PM (23 w, 1 d)
- Availability
- Available
- LDAP User
- Blake
- MediaWiki User
- BJensen-WMF
Yesterday
I'll merge the exclusion patch and work on updating the docs tomorrow.
Fri, Apr 10
Ah, thanks Cathal! The original patch was abandoned because I was struggling with git, the new patch is now https://gerrit.wikimedia.org/r/c/operations/alerts/+/1269994. I'll update it in accordance with the comment on the previous patch.
Thu, Apr 9
@Scott_French It sounds like it might be reasonable to exclude this service from the switchover, add a new Cumin alias for docker-registry-eqiad and docker-registry-codfw, and then add a small cookbook which can restart the relevant systemd service in either DC. Does that seem like an appropriate way to proceed?
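Roughly the shape I have in mind for that cookbook (a sketch only; the alias, unit, and class names are hypothetical, and the real Spicerack cookbook API may differ in detail):

```python
"""Restart the docker-registry unit in a chosen DC (illustrative sketch)."""
import argparse

from spicerack.cookbook import CookbookBase, CookbookRunnerBase


class RestartDockerRegistry(CookbookBase):
    """Restart the registry's systemd unit in the selected datacenter."""

    def argument_parser(self):
        parser = argparse.ArgumentParser(description=self.__doc__)
        parser.add_argument('dc', choices=('eqiad', 'codfw'))
        return parser

    def get_runner(self, args):
        return RestartDockerRegistryRunner(args, self.spicerack)


class RestartDockerRegistryRunner(CookbookRunnerBase):
    def __init__(self, args, spicerack):
        # 'A:docker-registry-eqiad' / 'A:docker-registry-codfw' would be the
        # new Cumin aliases proposed above; the unit name is a placeholder.
        self.hosts = spicerack.remote().query(f'A:docker-registry-{args.dc}')

    def run(self):
        self.hosts.run_sync('systemctl restart docker-registry.service')
```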
Wed, Apr 8
More of the same error, it seems.
Now that we've repooled and resized, closing this out.
Wed, Apr 1
@Trizek-WMF After a chat with some folks on the team, it sounds unlikely that this is related to the switchover, but I'll make a note to follow up and see if we encounter a similar issue next iteration. Thanks for all your help!
Mon, Mar 30
Moving this to the backlog for now.
@MLechvien-WMF This was not completed in time for the switchover. I'm in the middle of a significant rework after the last round of comments in https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/1239368. If I pivot back to working on this, I can likely have it out for review in a week or so.
Ah, thanks for the pointer! I'll have a read around and start to build context.
My inclination is not to start any of these jobs manually unless it's clear that there's some kind of user-facing impact from the job not being run. I'm going to do some exploration to better understand what this job is, and whether it's actually a problem that it isn't running.
I think the approach that gives us the most flexibility in alerting, and doesn't require per-cron customization, would be to add rich exit-code information to these crons. I'll have a chat with Claime when they're back about how best to raise this with the devs.
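To illustrate what I mean by rich exit codes (everything here is invented for illustration; the real scheme would need to be agreed with the devs):

```python
#!/usr/bin/env python3
"""Hypothetical wrapper showing a 'rich' exit-code contract for a cron job."""
import subprocess
import sys

EXIT_OK = 0
EXIT_RETRYABLE = 75  # transient failure (cf. EX_TEMPFAIL): retry, don't page
EXIT_PERMANENT = 1   # real failure: a human should look at it


def is_transient(code: int) -> bool:
    return code in {2, 3}  # placeholder classification


def main() -> int:
    # Placeholder path; ideally the job itself would classify its failures
    # rather than relying on a wrapper.
    result = subprocess.run(['/usr/local/bin/the-actual-job'])
    if result.returncode == 0:
        return EXIT_OK
    return EXIT_RETRYABLE if is_transient(result.returncode) else EXIT_PERMANENT


if __name__ == '__main__':
    sys.exit(main())
```

Alerting could then treat EXIT_RETRYABLE differently from EXIT_PERMANENT, instead of paging on any non-zero exit.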
Likely an artifact of the switchover last week:
Ah, yes. Same error, though:
I believe it was a side effect of the DC switchover last week. The relevant log line is:
It's a little odd that we're just getting this alert now - this cron has run several times since the error, and has completed successfully. I'll clear out the failed job.
Job cleared out.
Wed, Mar 25
From timestamps in IRC, the read-only time was 02:28.832528 (just under two and a half minutes).
I'll leave this open until we repool next week.
Switchover activities are complete - thanks!
The read-only time is over and the switchover has been completed successfully. Thank you!
The read-only time has not yet started - it's targeted for 15:00 UTC today.
Tue, Mar 24
k8s-ingress-wikikube-rw and rest-gateway are not excluded from the switchover in hieradata, but are marked active/passive. This will result in a day of cross-DC calls for services behind the ingress and the rest-gateway, but that's known and acceptable.
Pooled status of services pre-switchover:
Mon, Mar 23
Ah, great, thanks very much.
@Trizek-WMF, what is the 'wmfall' messaging task referenced in the description? What that is and how to do it are not currently in the SRE documentation for the switchover. Thanks!
Fri, Mar 20
This is fixed for now, but is a bit noisy in IRC. I'll see if there might be a way to reduce that noise, but it's unrelated to the functionality we needed here.
Wed, Mar 18
Any significant updates will be tracked on the Phabricator task for the switchover, which is T413974. If it would be helpful, I can also update this task with progress - I expect the biggest update will be on Wednesday the 25th, when we leave read-only, and have mostly cleanup left to do. If there are any changes with respect to duration, or problems that arise as a result of the switchover, I'll update T413974, and will reach out. If there's anything else that would be helpful, please let me know. Cheers!
Mar 11 2026
This appears to be a duplicate of T410152.
Mar 6 2026
Alright, these have all been imaged with Trixie, and have been pooled.
Hello! There have been no notable changes since the last switchover.
Proceeding with one-off reimages here so we can get these hosts repooled on Trixie.
On reflection, this probably shouldn't be High, because it's non-blocking for the parent task. Looking into it, I think it's a bit more complicated to do this in a way that retains the current batching and doesn't spam IRC and Phabricator. It also sounds (from a conversation with Janis) like we don't expect this situation to be permanent, so this might be a nice-to-have in the interim.
Mar 5 2026
Hey folks, would it be possible to get some more detail about what assistance is required here? Is applying the rewrite rule the piece of work that requires SRE assistance? Thanks!
That seems reasonable to me.
Mar 4 2026
I'll leave this open so I don't forget to update the cookbook.
Test completed successfully.
The current primary is codfw, so for the live test, we should move from eqiad to codfw.
Before we run the reimage, Janis guided me through verifying network connectivity for these hosts.
Mar 3 2026
Sounds good, I'll take a look at this tomorrow.
Mar 2 2026
Feb 26 2026
Feb 25 2026
The number of metrics statsd-exporter is reporting seems to be increasing similarly over the lifetime of the container:
Feb 24 2026
Currently, this is the list of hosts:
Feb 23 2026
This seems reasonable to me - in the longer term, I'd prefer that we (ServiceOps) find a way to improve the resilience of these scripts, so they mostly retry on failure rather than opening tickets, but until we fix that, someone should be aware of the failures.
Feb 20 2026
Unfortunately, it doesn't look like this is going to be straightforward. Adding a numeric alert threshold as a label isn't possible, because Prometheus metric labels are always strings, and there's no way to compare a string label numerically in the alerting expression.
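One possible workaround, assuming we control the exporter (the metric names and port below are invented): publish the threshold as a metric value rather than a label, since values, unlike labels, are numeric and can be compared in an alerting expression.

```python
from prometheus_client import Gauge, start_http_server

# Threshold exposed as a value, not a label, so PromQL can compare it.
job_threshold = Gauge(
    'cron_job_failure_threshold',
    'Maximum tolerated consecutive failures for a job',
    ['job_name'],
)
job_threshold.labels(job_name='example-cron').set(3)

start_http_server(9123)  # arbitrary port for illustration

# The alert expression could then look roughly like:
#   cron_job_failures_total > on(job_name) cron_job_failure_threshold
```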
I think I'd be inclined to prefer the more-defensive option (maybe @Clement_Goubert has a preference here?).
Feb 19 2026
Rather than having teams be defined as a property of a service, I'm wondering if there's a way we could define teams as a first-class structure, which is then imported wherever we need to use it.
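A rough sketch of the idea (all names hypothetical):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Team:
    name: str
    irc_channel: str
    alert_receiver: str


# Teams defined once, as first-class objects...
TEAMS = {
    'serviceops': Team('ServiceOps', '#wikimedia-serviceops', 'team-serviceops'),
}

# ...and services merely reference them by key, instead of embedding
# team details inline.
SERVICE = {'name': 'example-service', 'team': 'serviceops'}
receiver = TEAMS[SERVICE['team']].alert_receiver
```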
Feb 12 2026
Okay, this has been applied across all of the environments above, and https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=alertname%3DKubernetesContainerReachingMemoryLimit&q=container%3Dstatsd-exporter is blessedly quiet. Resolving.
Ahhhh, okay, thanks very much!
I didn't put exclude_from_switchover into discovery because of this comment where "discovery" is defined - specifically, exclusion from the switchover didn't seem to me to be a property of the service's DNS Discovery capabilities, but rather a property of the service itself. It's possible this was mistaken, as I don't have full context for DNS Discovery.
switchdc/services was deleted in https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1225500, because of the work in T412211, where we migrated away from the EXCLUDED_SERVICES constant to instead pull from a property (exclude_from_switchover) that was recently added to the service registry.
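For the shape of that change, a simplified stand-in (the real service catalog objects differ in detail):

```python
from dataclasses import dataclass


@dataclass
class Service:
    """Simplified stand-in for a service registry entry."""
    name: str
    exclude_from_switchover: bool = False


catalog = [
    Service('mw-web'),
    Service('docker-registry', exclude_from_switchover=True),
]

# The switchover code can now derive the exclusion list from the
# registry instead of a hard-coded EXCLUDED_SERVICES constant.
services_to_switch = [s for s in catalog if not s.exclude_from_switchover]
```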
Hey folks, for triaging purposes, is there remaining work here? If so, is it for observability, or serviceops? Thanks!
Feb 11 2026
I'm a little confused. It doesn't look like the change was applied in k8s after executing scap sync-world --k8s-only --k8s-confirm-diff -Dbuild_mw_container_image:False, as described here.
Feb 10 2026
I think it might make sense for the identifier we add to match a 'team' in the context of alerting receivers. For instance, it would be useful to associate the mw-api-int namespace alerts with teams in alertmanager, because that is a reference we know points at humans who want to receive those alerts.
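As a toy illustration of the mapping I have in mind (all names invented): the namespace label resolves to a team identifier, which alertmanager routing can then use to pick a receiver.

```python
# Hypothetical namespace -> team mapping for alert routing.
NAMESPACE_TEAM = {
    'mw-api-int': 'serviceops',
}


def receiver_for(namespace: str) -> str:
    """Resolve an alert's namespace label to a receiver name."""
    return 'team-' + NAMESPACE_TEAM.get(namespace, 'default')


assert receiver_for('mw-api-int') == 'team-serviceops'
```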