Blake (Blake)
User

User Details

User Since
Nov 3 2025, 12:35 PM (23 w, 1 d)
Availability
Available
LDAP User
Blake
MediaWiki User
BJensen-WMF [ Global Accounts ]

Recent Activity

Yesterday

Blake added a comment to T422166: scap can’t deploy (blob upload unknown) after apus.discovery.wmnet is repooled in codfw.

I'll merge the exclusion patch and work on updating the docs tomorrow.

Tue, Apr 14, 3:43 PM · Ceph, SRE-swift-storage, Patch-For-Review, ServiceOps new, Datacenter-Switchover, SRE

Fri, Apr 10

Blake added a comment to T356877: Increase visibility of kubernetes network status.

Ah, thanks Cathal! The original patch was abandoned because I was struggling with git; the new patch is now https://gerrit.wikimedia.org/r/c/operations/alerts/+/1269994. I'll update it in accordance with the comment on the previous patch.

Fri, Apr 10, 2:43 PM · Patch-For-Review, Sustainability (Incident Followup), ServiceOps-good-first-task, Infrastructure-Foundations, netops, ServiceOps new, observability, Prod-Kubernetes, Kubernetes

Thu, Apr 9

Blake added a comment to T422166: scap can’t deploy (blob upload unknown) after apus.discovery.wmnet is repooled in codfw.

@Scott_French It sounds like it might be reasonable to exclude this service from the switchover, add a new Cumin alias for docker-registry-eqiad and docker-registry-codfw, and then add a small cookbook which can restart the relevant systemd service in either DC. Does that seem like an appropriate way to proceed?

Thu, Apr 9, 1:34 PM · Ceph, SRE-swift-storage, Patch-For-Review, ServiceOps new, Datacenter-Switchover, SRE
Blake claimed T422166: scap can’t deploy (blob upload unknown) after apus.discovery.wmnet is repooled in codfw.
Thu, Apr 9, 10:00 AM · Ceph, SRE-swift-storage, Patch-For-Review, ServiceOps new, Datacenter-Switchover, SRE

Wed, Apr 8

Blake added a comment to T422678: MediaWiki periodic job update-special-pages-s5 failed.

More of the same error, it seems.

Wed, Apr 8, 3:07 PM · ServiceOps new, Wikimedia-production-error, MediaWiki-Special-pages
Blake added a subtask for T422486: MediaWiki periodic job failures due to timeouts: T422678: MediaWiki periodic job update-special-pages-s5 failed.
Wed, Apr 8, 3:06 PM · ServiceOps new (Next quarter), DBA
Blake added a parent task for T422678: MediaWiki periodic job update-special-pages-s5 failed: T422486: MediaWiki periodic job failures due to timeouts.
Wed, Apr 8, 3:06 PM · ServiceOps new, Wikimedia-production-error, MediaWiki-Special-pages
Blake closed T413974: Northward Datacenter Switchover (March 2026; codfw to eqiad) as Resolved.

Now that we've repooled and resized, closing this out.

Wed, Apr 8, 10:59 AM · ServiceOps new, Datacenter-Switchover

Wed, Apr 1

Blake added a comment to T414096: MoveComms support for Northward Datacentre Switchover (March 2026; codfw to eqiad).

@Trizek-WMF After a chat with some folks on the team, it sounds unlikely that this is related to the switchover, but I'll make a note to follow up and see if we encounter a similar issue next iteration. Thanks for all your help!

Wed, Apr 1, 2:33 PM · MoveComms-Support

Mon, Mar 30

Blake moved T330997: Support locking cookbooks run except for switchover related cookbooks from Scheduled (this Q) to Backlog on the ServiceOps new board.

Moving this to the backlog for now.

Mon, Mar 30, 2:55 PM · Patch-For-Review, ServiceOps new, SRE-tools, Infrastructure-Foundations, Datacenter-Switchover, SRE
Blake added a comment to T330997: Support locking cookbooks run except for switchover related cookbooks.

@MLechvien-WMF This was not completed in time for the switchover. I'm in the middle of a significant rework after the last round of comments in https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/1239368. If I pivot back to working on this, I can likely have it out for review in a week or so.

Mon, Mar 30, 12:18 PM · Patch-For-Review, ServiceOps new, SRE-tools, Infrastructure-Foundations, Datacenter-Switchover, SRE
Blake added a comment to T421337: MediaWiki periodic job updatequerypages-wantedpages-s1 failed.

Ah, thanks for the pointer! I'll have a read around and start to build context.

Mon, Mar 30, 11:02 AM · ServiceOps new, Wikimedia-production-error, MediaWiki-Special-pages
Blake added a comment to T421337: MediaWiki periodic job updatequerypages-wantedpages-s1 failed.

My inclination is to not start any of these jobs manually unless it's clear that there's some kind of user-facing impact of the job not being run. I'm going to do some exploration to better understand what this job is, and whether it not running is actually problematic.

Mon, Mar 30, 10:57 AM · ServiceOps new, Wikimedia-production-error, MediaWiki-Special-pages
Blake added a comment to T416576: Make mw-cron jobs alert thresholds easily configurable.

I think the approach that gives us the most flexibility in alerting and does not require per-cron customization would be to add rich exit code information for these crons. I'll have a chat with Claime when they're back to see how we should best raise this issue with the devs.

Mon, Mar 30, 10:47 AM · ServiceOps new
Blake added a comment to T421679: MediaWiki periodic job initsitestats failed.

Likely an artifact of the switchover last week:

Mon, Mar 30, 10:45 AM · ServiceOps new, Wikimedia-production-error, MediaWiki-Special-pages
Blake added a comment to T421337: MediaWiki periodic job updatequerypages-wantedpages-s1 failed.

Ah, yes. Same error, though:

Mon, Mar 30, 10:44 AM · ServiceOps new, Wikimedia-production-error, MediaWiki-Special-pages
Blake added a comment to T421337: MediaWiki periodic job updatequerypages-wantedpages-s1 failed.

I believe it was a side effect of the DC switchover last week. The relevant log line is:

Mon, Mar 30, 10:23 AM · ServiceOps new, Wikimedia-production-error, MediaWiki-Special-pages
Blake closed T421679: MediaWiki periodic job initsitestats failed as Resolved.
Mon, Mar 30, 10:09 AM · ServiceOps new, Wikimedia-production-error, MediaWiki-Special-pages
Blake added a comment to T421679: MediaWiki periodic job initsitestats failed.

It's a little odd that we're just getting this alert now - this cron has run several times since the error, and has completed successfully. I'll clear out the failed job.

Mon, Mar 30, 10:09 AM · ServiceOps new, Wikimedia-production-error, MediaWiki-Special-pages
Blake claimed T421679: MediaWiki periodic job initsitestats failed.
Mon, Mar 30, 9:08 AM · ServiceOps new, Wikimedia-production-error, MediaWiki-Special-pages
Blake closed T421337: MediaWiki periodic job updatequerypages-wantedpages-s1 failed as Resolved.

Job cleared out.

Mon, Mar 30, 9:05 AM · ServiceOps new, Wikimedia-production-error, MediaWiki-Special-pages

Wed, Mar 25

Blake added a comment to T413974: Northward Datacenter Switchover (March 2026; codfw to eqiad).

From timestamps in IRC, the RO time was 02:28.832528, just under 2 and a half minutes.

Wed, Mar 25, 3:52 PM · ServiceOps new, Datacenter-Switchover
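
As a quick check of the read-only duration quoted in the comment above, here is a minimal sketch that parses the `MM:SS.ffffff` figure and compares it to two and a half minutes. The helper name `parse_mm_ss` is hypothetical and not part of any Wikimedia tooling; only the duration string comes from the comment.

```python
from datetime import timedelta


def parse_mm_ss(text: str) -> timedelta:
    """Parse a 'MM:SS.ffffff' duration string into a timedelta (hypothetical helper)."""
    minutes, seconds = text.split(":")
    return timedelta(minutes=int(minutes), seconds=float(seconds))


# Duration string as quoted in the comment above.
ro_time = parse_mm_ss("02:28.832528")

# Total duration in seconds, and whether it is under 2.5 minutes (150 s).
ro_seconds = ro_time.total_seconds()
under_two_and_a_half = ro_time < timedelta(minutes=2, seconds=30)
print(ro_seconds, under_two_and_a_half)
```
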
Blake reopened T413974: Northward Datacenter Switchover (March 2026; codfw to eqiad) as "Open".

I'll leave this open until we repool next week.

Wed, Mar 25, 3:48 PM · ServiceOps new, Datacenter-Switchover
Blake added a comment to T414096: MoveComms support for Northward Datacentre Switchover (March 2026; codfw to eqiad).

Switchover activities are complete - thanks!

Wed, Mar 25, 3:47 PM · MoveComms-Support
Blake closed T413974: Northward Datacenter Switchover (March 2026; codfw to eqiad) as Resolved.
Wed, Mar 25, 3:46 PM · ServiceOps new, Datacenter-Switchover
Blake added a comment to T413974: Northward Datacenter Switchover (March 2026; codfw to eqiad).

The read-only time is over and the switchover has been completed successfully. Thank you!

Wed, Mar 25, 3:12 PM · ServiceOps new, Datacenter-Switchover
Blake added a comment to T413974: Northward Datacenter Switchover (March 2026; codfw to eqiad).

The read-only time has not yet started - it's targeted for 15:00 UTC today.

Wed, Mar 25, 2:23 PM · ServiceOps new, Datacenter-Switchover

Tue, Mar 24

Blake added a comment to T413974: Northward Datacenter Switchover (March 2026; codfw to eqiad).

k8s-ingress-wikikube-rw and rest-gateway are not excluded from the switchover in hieradata, but are marked active/passive. This is going to result in a day of cross-dc calls for services behind ingress and the rest-gateway, but that's known and acceptable.

Tue, Mar 24, 12:43 PM · ServiceOps new, Datacenter-Switchover
Blake added a comment to T413974: Northward Datacenter Switchover (March 2026; codfw to eqiad).

Pooled status of services pre-switchover:

Tue, Mar 24, 12:14 PM · ServiceOps new, Datacenter-Switchover

Mon, Mar 23

Blake added a comment to T414096: MoveComms support for Northward Datacentre Switchover (March 2026; codfw to eqiad).

Ah, great, thanks very much.

Mon, Mar 23, 3:12 PM · MoveComms-Support
Blake added a comment to T414096: MoveComms support for Northward Datacentre Switchover (March 2026; codfw to eqiad).

@Trizek-WMF, what is the 'wmfall' messaging task referenced in the description? What that is and how to do it are not currently in the SRE documentation for the switchover. Thanks!

Mon, Mar 23, 1:03 PM · MoveComms-Support

Fri, Mar 20

Blake closed T419032: Allow roll-reimage-nodes to reimage nodes absent from conftool, a subtask of T418142: Fix k8s node automation to account for new rack topology, as Resolved.
Fri, Mar 20, 3:05 PM · Kubernetes, serviceops-tooling, ServiceOps new
Blake closed T419032: Allow roll-reimage-nodes to reimage nodes absent from conftool as Resolved.

This is fixed for now, but is a bit noisy in IRC. I'll see if there might be a way to reduce that noise, but it's unrelated to the functionality we needed here.

Fri, Mar 20, 3:05 PM · Patch-For-Review, Kubernetes, serviceops-tooling, ServiceOps new

Wed, Mar 18

Blake added a comment to T414096: MoveComms support for Northward Datacentre Switchover (March 2026; codfw to eqiad).

Any significant updates will be tracked on the Phabricator task for the switchover, which is T413974. If it would be helpful, I can also update this task with progress - I expect the biggest update will be on Wednesday the 25th, when we leave read-only, and have mostly cleanup left to do. If there are any changes with respect to duration, or problems that arise as a result of the switchover, I'll update T413974, and will reach out. If there's anything else that would be helpful, please let me know. Cheers!

Wed, Mar 18, 10:31 AM · MoveComms-Support
Blake merged task T416451: October 2025 Bullseye reboots (ServiceOps hosts) into Restricted Task.
Wed, Mar 18, 9:50 AM · ServiceOps-Upgrades-Hardware, ServiceOps new, Essential-Work, Vuln-VulnComponent, SecTeam-Processed, Infrastructure Security, SRE, Security

Mar 11 2026

Blake closed T418383: Investigate mw-on-k8s statsd-exporter RAM usage pattern as Invalid.

This appears to be a duplicate of T410152.

Mar 11 2026, 2:56 PM · MW-on-K8s, ServiceOps-Mediawiki, ServiceOps new
Blake closed T418133: Northward Datacenter Switchover Live Test (March 2026; codfw to eqiad), a subtask of T413974: Northward Datacenter Switchover (March 2026; codfw to eqiad), as Resolved.
Mar 11 2026, 9:53 AM · ServiceOps new, Datacenter-Switchover
Blake closed T418133: Northward Datacenter Switchover Live Test (March 2026; codfw to eqiad) as Resolved.
Mar 11 2026, 9:53 AM · ServiceOps new, Datacenter-Switchover

Mar 6 2026

Blake closed T417772: wikikube-worker23[32-56] implementation tracking, a subtask of T408757: Q2:rack/setup/install wikikube-worker2332-56, as Resolved.
Mar 6 2026, 3:54 PM · ServiceOps-Upgrades-Hardware, ServiceOps new, SRE, ops-codfw, DC-Ops
Blake closed T417772: wikikube-worker23[32-56] implementation tracking as Resolved.
Mar 6 2026, 3:54 PM · ServiceOps-Upgrades-Hardware, ServiceOps new
Blake added a comment to T417772: wikikube-worker23[32-56] implementation tracking.

Alright, these have all been imaged with Trixie, and have been pooled.

Mar 6 2026, 3:54 PM · ServiceOps-Upgrades-Hardware, ServiceOps new
Blake added a comment to T414096: MoveComms support for Northward Datacentre Switchover (March 2026; codfw to eqiad).

Hello! There have been no notable changes since the last switchover.

Mar 6 2026, 1:53 PM · MoveComms-Support
Blake added a comment to T417772: wikikube-worker23[32-56] implementation tracking.

Proceeding with one-off reimages here so we can get these hosts repooled on Trixie.

Mar 6 2026, 10:06 AM · ServiceOps-Upgrades-Hardware, ServiceOps new
Blake lowered the priority of T419032: Allow roll-reimage-nodes to reimage nodes absent from conftool from High to Low.

On reflection, this probably shouldn't be High, because it's non-blocking for the parent task. Looking into it, I think it's a bit more complicated to do this in a way that retains the current batching and doesn't spam IRC and Phabricator. It also sounds (from a conversation with Janis) like we don't expect this situation to be permanent, so this might be a nice-to-have in the interim.

Mar 6 2026, 9:56 AM · Patch-For-Review, Kubernetes, serviceops-tooling, ServiceOps new

Mar 5 2026

Blake triaged T417026: Create a rewrite for the GraphQL endpoint on wikidata.org as Medium priority.
Mar 5 2026, 5:04 PM · ServiceOps new, Wikimedia-Apache-configuration, SRE, Wikidata, Wikibase GraphQL, Wikibase Reuse Team
Blake added a comment to T417026: Create a rewrite for the GraphQL endpoint on wikidata.org.

Hey folks, would it be possible to get some more detail about what assistance is required here? Is applying the rewrite rule the piece of work that requires SRE assistance? Thanks!

Mar 5 2026, 3:44 PM · ServiceOps new, Wikimedia-Apache-configuration, SRE, Wikidata, Wikibase GraphQL, Wikibase Reuse Team
Blake triaged T418200: Migrate Service Ops Docker images running in production away from Bullseye as Medium priority.
Mar 5 2026, 3:30 PM · ServiceOps new, ServiceOps-Upgrades-Hardware, ServiceOps-Mediawiki
Blake triaged T419032: Allow roll-reimage-nodes to reimage nodes absent from conftool as High priority.

That seems reasonable to me.

Mar 5 2026, 10:41 AM · Patch-For-Review, Kubernetes, serviceops-tooling, ServiceOps new

Mar 4 2026

Blake moved T419032: Allow roll-reimage-nodes to reimage nodes absent from conftool from Inbox to Scheduled (this Q) on the ServiceOps new board.
Mar 4 2026, 4:56 PM · Patch-For-Review, Kubernetes, serviceops-tooling, ServiceOps new
Blake created T419032: Allow roll-reimage-nodes to reimage nodes absent from conftool.
Mar 4 2026, 4:55 PM · Patch-For-Review, Kubernetes, serviceops-tooling, ServiceOps new
Blake reopened T418133: Northward Datacenter Switchover Live Test (March 2026; codfw to eqiad), a subtask of T413974: Northward Datacenter Switchover (March 2026; codfw to eqiad), as Open.
Mar 4 2026, 4:50 PM · ServiceOps new, Datacenter-Switchover
Blake reopened T418133: Northward Datacenter Switchover Live Test (March 2026; codfw to eqiad) as "Open".

I'll leave this open so I don't forget to update the cookbook.

Mar 4 2026, 4:50 PM · ServiceOps new, Datacenter-Switchover
Blake added a comment to T418133: Northward Datacenter Switchover Live Test (March 2026; codfw to eqiad).

Test completed successfully.

Mar 4 2026, 4:50 PM · ServiceOps new, Datacenter-Switchover
Blake closed T418133: Northward Datacenter Switchover Live Test (March 2026; codfw to eqiad), a subtask of T413974: Northward Datacenter Switchover (March 2026; codfw to eqiad), as Resolved.
Mar 4 2026, 4:47 PM · ServiceOps new, Datacenter-Switchover
Blake closed T418133: Northward Datacenter Switchover Live Test (March 2026; codfw to eqiad) as Resolved.
Mar 4 2026, 4:47 PM · ServiceOps new, Datacenter-Switchover
Blake added a comment to T418133: Northward Datacenter Switchover Live Test (March 2026; codfw to eqiad).

The current primary is codfw, so for the live test, we should move from eqiad to codfw.

Mar 4 2026, 3:58 PM · ServiceOps new, Datacenter-Switchover
Blake added a comment to T417772: wikikube-worker23[32-56] implementation tracking.

Before we run the reimage, Janis guided me through verifying network connectivity for these hosts.

Mar 4 2026, 11:31 AM · ServiceOps-Upgrades-Hardware, ServiceOps new

Mar 3 2026

Blake added a comment to T417772: wikikube-worker23[32-56] implementation tracking.

Sounds good, I'll take a look at this tomorrow.

Mar 3 2026, 3:49 PM · ServiceOps-Upgrades-Hardware, ServiceOps new
Blake updated the task description for T359375: make better use of spicerack's service_catalog().
Mar 3 2026, 2:57 PM · User-jijiki, Serviceops-easywins, ServiceOps new, Datacenter-Switchover

Mar 2 2026

Blake updated the task description for T413974: Northward Datacenter Switchover (March 2026; codfw to eqiad).
Mar 2 2026, 2:19 PM · ServiceOps new, Datacenter-Switchover

Feb 26 2026

Blake moved T397653: Incorporate new arm64 host in our tooling from Inbox to Radar (Awareness) on the ServiceOps new board.
Feb 26 2026, 5:18 PM · ServiceOps new, ARM support, Infrastructure-Foundations
Blake edited projects for T397653: Incorporate new arm64 host in our tooling, added: ServiceOps new; removed serviceops-deprecated.
Feb 26 2026, 5:18 PM · ServiceOps new, ARM support, Infrastructure-Foundations
Blake moved T391784: Gradually isolate mediawiki authentication code and infrastructure from Inbox to Radar (Awareness) on the ServiceOps new board.
Feb 26 2026, 5:16 PM · ServiceOps new, MediaWiki-Platform-Team, SecTeam-Processed, Security-Team, MediaWiki-extensions-CentralAuth, MediaWiki-Core-AuthManager, Security, Epic
Blake edited projects for T391784: Gradually isolate mediawiki authentication code and infrastructure, added: ServiceOps new; removed serviceops-deprecated.
Feb 26 2026, 5:15 PM · ServiceOps new, MediaWiki-Platform-Team, SecTeam-Processed, Security-Team, MediaWiki-extensions-CentralAuth, MediaWiki-Core-AuthManager, Security, Epic
Blake moved T417026: Create a rewrite for the GraphQL endpoint on wikidata.org from Inbox to Backlog on the ServiceOps new board.
Feb 26 2026, 5:14 PM · ServiceOps new, Wikimedia-Apache-configuration, SRE, Wikidata, Wikibase GraphQL, Wikibase Reuse Team
Blake edited projects for T417026: Create a rewrite for the GraphQL endpoint on wikidata.org, added: ServiceOps new; removed serviceops-deprecated.
Feb 26 2026, 5:12 PM · ServiceOps new, Wikimedia-Apache-configuration, SRE, Wikidata, Wikibase GraphQL, Wikibase Reuse Team

Feb 25 2026

Blake added a comment to T418383: Investigate mw-on-k8s statsd-exporter RAM usage pattern.

The number of metrics statsd-exporter is reporting seems to be increasing similarly over the lifetime of the container:

statsd_exporter_metrics_increasing.png (928×2 px, 489 KB)

Feb 25 2026, 3:09 PM · MW-on-K8s, ServiceOps-Mediawiki, ServiceOps new
Blake removed a project from T418160: wikifunctions-backend-combined-v1 SLI error budget has been rapidly dropping over Feb 2026: serviceops-deprecated.
Feb 25 2026, 1:52 PM · Essential-Work, ServiceOps new, Abstract Wikipedia team, SRE-SLO

Feb 24 2026

Blake moved T418200: Migrate Service Ops Docker images running in production away from Bullseye from Inbox to Backlog on the ServiceOps new board.
Feb 24 2026, 2:10 PM · ServiceOps new, ServiceOps-Upgrades-Hardware, ServiceOps-Mediawiki
Blake moved T418160: wikifunctions-backend-combined-v1 SLI error budget has been rapidly dropping over Feb 2026 from Inbox to Radar (Awareness) on the ServiceOps new board.
Feb 24 2026, 2:09 PM · Essential-Work, ServiceOps new, Abstract Wikipedia team, SRE-SLO
Blake moved T418133: Northward Datacenter Switchover Live Test (March 2026; codfw to eqiad) from Inbox to Scheduled (this Q) on the ServiceOps new board.
Feb 24 2026, 2:06 PM · ServiceOps new, Datacenter-Switchover
Blake added a comment to T418212: Automate the creation of implementation task from rack/setup/install tasks for Serviceops.

Currently, this is the list of hosts:

Feb 24 2026, 9:58 AM · ServiceOps-Upgrades-Hardware, serviceops-tooling, ServiceOps new, DC-Ops

Feb 23 2026

Blake created T418133: Northward Datacenter Switchover Live Test (March 2026; codfw to eqiad).
Feb 23 2026, 1:47 PM · ServiceOps new, Datacenter-Switchover
Blake added a comment to T417020: Proposal: mw-cron failure tasks that get automatically filed for unstewarded components should also tag the ServiceOps Phabricator project.

This seems reasonable to me - in the longer term, I'd prefer that we (ServiceOps) find a way to improve the resilience of these scripts, so they mostly retry rather than open tickets on failure, but until we fix that, someone should be aware of the failures.

Feb 23 2026, 9:46 AM · user-a_smart_kitten, ServiceOps new

Feb 20 2026

Blake added a comment to T416576: Make mw-cron jobs alert thresholds easily configurable.

Unfortunately, it doesn't look like this is going to be straightforward. Adding a numeric alert threshold isn't possible, because Prometheus metric labels are always strings, and we don't have a way to use that string in the alerting expression.

Feb 20 2026, 3:37 PM · ServiceOps new
Blake added a comment to T330997: Support locking cookbooks run except for switchover related cookbooks.

I think I'd be inclined to prefer the more-defensive option (maybe @Clement_Goubert has a preference here?).

Feb 20 2026, 10:48 AM · Patch-For-Review, ServiceOps new, SRE-tools, Infrastructure-Foundations, Datacenter-Switchover, SRE

Feb 19 2026

Blake updated subscribers of T330997: Support locking cookbooks run except for switchover related cookbooks.

@elukey and @Volans, do you happen to have thoughts about the best way to go about checking for a global lock and associated allowlist? My intuition would be that this might be something to be included in CookbookRunnerBase, but I'm not sure.

Feb 19 2026, 10:36 PM · Patch-For-Review, ServiceOps new, SRE-tools, Infrastructure-Foundations, Datacenter-Switchover, SRE
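
The global-lock-plus-allowlist check raised in the comment above could look roughly like the following. This is only a sketch under stated assumptions: `may_run`, the dotted cookbook names, and the prefix-matching rule are all hypothetical and do not reflect Spicerack's actual API or where such a check would live.

```python
def may_run(cookbook: str, lock_held: bool, allowlist: tuple[str, ...]) -> bool:
    """Decide whether a cookbook may start (hypothetical policy, not Spicerack's API).

    With no global lock, everything may run. While the lock is held, only
    cookbooks named in the allowlist, or nested under an allowlisted prefix,
    may run.
    """
    if not lock_held:
        return True
    return any(
        cookbook == prefix or cookbook.startswith(prefix + ".")
        for prefix in allowlist
    )


# Hypothetical allowlist: anything under the "switchover" cookbook tree.
ALLOWLIST = ("switchover",)

print(may_run("switchover.step_one", True, ALLOWLIST))
print(may_run("hosts.reimage", True, ALLOWLIST))
print(may_run("hosts.reimage", False, ALLOWLIST))
```

Matching on a whole dotted prefix, rather than a bare `startswith`, keeps a name like `switchover_other` from slipping past an allowlist entry for `switchover`.
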
Blake added a comment to T303744: Keep track of teams responsible for namespaces inside kubernetes.

I'm wondering if there's a way we could, rather than having teams be defined as a property of a service, define teams as a first class structure, which is then imported where we need to use it.

Feb 19 2026, 10:29 AM · ServiceOps new (Next quarter), observability, Serviceops-easywins, Prod-Kubernetes

Feb 12 2026

Blake closed T416985: Alert in need of triage: KubernetesContainerReachingMemoryLimit (instance wikikube-worker2071.codfw.wmnet) as Resolved.

Okay, this has been applied across all of the environments above, and https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=alertname%3DKubernetesContainerReachingMemoryLimit&q=container%3Dstatsd-exporter is blessedly quiet. Resolving.

Feb 12 2026, 4:24 PM · ServiceOps new, sre-alert-triage
Blake added a comment to T416985: Alert in need of triage: KubernetesContainerReachingMemoryLimit (instance wikikube-worker2071.codfw.wmnet).

Ahhhh, okay, thanks very much!

Feb 12 2026, 2:48 PM · ServiceOps new, sre-alert-triage
Blake added a comment to T359375: make better use of spicerack's service_catalog().

The reason I didn't put exclude_from_switchover into discovery is because of this comment where "discovery" is defined - specifically, exclusion from the switchover did not seem to me to be a property of the DNS Discovery capabilities of the service, but rather a property of the service itself. It's possible that this was mistaken, as I don't have full context for DNS Discovery.

Feb 12 2026, 12:20 PM · User-jijiki, Serviceops-easywins, ServiceOps new, Datacenter-Switchover
Blake added a comment to T359375: make better use of spicerack's service_catalog().

switchdc/services was deleted in https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1225500, because of the work in T412211, where we migrated the EXCLUDED_SERVICES constant to instead pull from a property (exclude_from_switchover), which was recently added to the service registry.

Feb 12 2026, 11:26 AM · User-jijiki, Serviceops-easywins, ServiceOps new, Datacenter-Switchover
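
To illustrate the change described in the comment above, moving from a hard-coded EXCLUDED_SERVICES constant to a per-service exclude_from_switchover property, here is a minimal sketch. The `Service` dataclass, the `services_to_switch` helper, and the service names are hypothetical stand-ins, not the real service catalog or cookbook code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    """Hypothetical stand-in for a service catalog entry."""
    name: str
    exclude_from_switchover: bool = False


def services_to_switch(catalog: list[Service]) -> list[str]:
    """Return names of services that take part in the switchover.

    The exclusion is read from each service's own property instead of
    being looked up in a separate constant.
    """
    return [svc.name for svc in catalog if not svc.exclude_from_switchover]


# Hypothetical catalog with one excluded service.
catalog = [
    Service("service-a"),
    Service("service-b", exclude_from_switchover=True),
    Service("service-c"),
]

print(services_to_switch(catalog))
```
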
Blake added a project to T383805: Performance assessment of PHP 8: Prod-Kubernetes.
Feb 12 2026, 10:22 AM · Prod-Kubernetes, ServiceOps new, MediaWiki-Platform-Team (Radar)
Blake moved T383805: Performance assessment of PHP 8 from Inbox to Backlog on the ServiceOps new board.
Feb 12 2026, 10:22 AM · Prod-Kubernetes, ServiceOps new, MediaWiki-Platform-Team (Radar)
Blake edited projects for T383805: Performance assessment of PHP 8, added: ServiceOps new; removed serviceops-deprecated.
Feb 12 2026, 10:22 AM · Prod-Kubernetes, ServiceOps new, MediaWiki-Platform-Team (Radar)
Blake moved T362568: Drop backports from base images from Inbox to Radar (Awareness) on the ServiceOps new board.
Feb 12 2026, 10:20 AM · ServiceOps new, Infrastructure-Foundations
Blake edited projects for T362568: Drop backports from base images, added: ServiceOps new; removed serviceops-deprecated.
Feb 12 2026, 10:20 AM · ServiceOps new, Infrastructure-Foundations
Blake moved T369607: High cardinality metrics break queries/dashboards (envoy, istio, ...) from Inbox to Radar (Awareness) on the ServiceOps new board.
Feb 12 2026, 10:19 AM · ServiceOps new, Kubernetes, Grafana, Observability-Metrics
Blake edited projects for T369607: High cardinality metrics break queries/dashboards (envoy, istio, ...), added: ServiceOps new; removed serviceops-deprecated.
Feb 12 2026, 10:18 AM · ServiceOps new, Kubernetes, Grafana, Observability-Metrics
Blake added a comment to T369607: High cardinality metrics break queries/dashboards (envoy, istio, ...).

Hey folks, for triaging purposes, is there remaining work here? If so, is it for observability, or serviceops? Thanks!

Feb 12 2026, 10:18 AM · ServiceOps new, Kubernetes, Grafana, Observability-Metrics
Blake moved T372753: Decommission cxserver endpoints from RESTBase from Inbox to Radar (Awareness) on the ServiceOps new board.
Feb 12 2026, 10:16 AM · ServiceOps new, RESTBase Sunsetting, CXServer, Language and Product Localization, Essential-Work
Blake edited projects for T372753: Decommission cxserver endpoints from RESTBase, added: ServiceOps new; removed serviceops-deprecated.
Feb 12 2026, 10:16 AM · ServiceOps new, RESTBase Sunsetting, CXServer, Language and Product Localization, Essential-Work
Blake added a project to T377805: WikiKube: Rename the last few "production" named helm releases to use "main" instead: Prod-Kubernetes.
Feb 12 2026, 10:14 AM · Prod-Kubernetes, ServiceOps-good-first-task, ServiceOps new, Data-Engineering-Radar, Data-Engineering, Recommendation-API, events, Event-Platform, Proton

Feb 11 2026

Blake added a comment to T416985: Alert in need of triage: KubernetesContainerReachingMemoryLimit (instance wikikube-worker2071.codfw.wmnet).

I'm a little confused. It doesn't look like the change has applied in k8s, after executing scap sync-world --k8s-only --k8s-confirm-diff -Dbuild_mw_container_image:False, as described here.

Feb 11 2026, 11:38 AM · sre-alert-triage, ServiceOps new

Feb 10 2026

Blake added a comment to T303744: Keep track of teams responsible for namespaces inside kubernetes.

I think it might make sense for the identifier we add to match a 'team' in the context of alerting receivers. For instance, it would be useful to associate the mw-api-int namespace alerts with teams in alertmanager, because that is a reference we know points at humans who care to receive alerts.

Feb 10 2026, 4:00 PM · ServiceOps new (Next quarter), observability, Serviceops-easywins, Prod-Kubernetes
Blake placed T417020: Proposal: mw-cron failure tasks that get automatically filed for unstewarded components should also tag the ServiceOps Phabricator project up for grabs.
Feb 10 2026, 3:04 PM · user-a_smart_kitten, ServiceOps new
Blake claimed T417020: Proposal: mw-cron failure tasks that get automatically filed for unstewarded components should also tag the ServiceOps Phabricator project.
Feb 10 2026, 3:04 PM · user-a_smart_kitten, ServiceOps new
Blake triaged T414167: Do not alert about a failed cron job when logs are already discarded as Low priority.
Feb 10 2026, 2:15 PM · ServiceOps new, Growth-Team, MW-on-K8s
Blake moved T414167: Do not alert about a failed cron job when logs are already discarded from Inbox to Backlog on the ServiceOps new board.
Feb 10 2026, 2:15 PM · ServiceOps new, Growth-Team, MW-on-K8s
Blake added a subtask for T416576: Make mw-cron jobs alert thresholds easily configurable: T414167: Do not alert about a failed cron job when logs are already discarded.
Feb 10 2026, 2:14 PM · ServiceOps new