RLazarus (Reuven Lazarus) (rzl)
User

User Details

User Since
Oct 15 2019, 4:02 PM (339 w, 3 d)
Availability
Available
IRC Nick
rzl
LDAP User
RLazarus
MediaWiki User
RLazarus (WMF) [ Global Accounts ]

Recent Activity

Wed, Apr 15

RLazarus added a comment to T423311: Writes to /*/wf-wan/ failing with CONNECTION FAILURE or SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY (mcrouter not being reached?).

On the prerequisites:

  • Double-checked, and both mcrouters have all the routes except /local/wf, which isn't on mw-mcrouter. That's fine, because...
  • Per @Jdforrester-WMF, we don't need to keep the /local/wf default. Nothing in mw-* namespaces, including mw-wikifunctions, uses it. (The orchestrator, running in the wikifunctions namespace, does, but that's out of scope here.)
Wed, Apr 15, 8:22 PM · Essential-Work, Abstract Wikipedia team (26Q4 (Apr–Jun)), Wikifunctions integration, WikiLambda
RLazarus added a comment to T423311: Writes to /*/wf-wan/ failing with CONNECTION FAILURE or SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY (mcrouter not being reached?).

So, the story is that

Wed, Apr 15, 12:58 AM · Essential-Work, Abstract Wikipedia team (26Q4 (Apr–Jun)), Wikifunctions integration, WikiLambda

Fri, Apr 10

RLazarus closed T367880: Set AppArmor profile via SecurityContext rather than annotations (k8s >=1.30), a subtask of T341984: Update Kubernetes clusters to 1.31, as Resolved.
Fri, Apr 10, 1:28 AM · Data-Platform-SRE (2026.01.05 - 2026.01.23), Epic, ServiceOps new, Patch-For-Review, collaboration-services, Kubernetes, Prod-Kubernetes
RLazarus closed T367880: Set AppArmor profile via SecurityContext rather than annotations (k8s >=1.30) as Resolved.
Fri, Apr 10, 1:28 AM · Patch-For-Review, ServiceOps new, Kubernetes, Prod-Kubernetes

Wed, Apr 8

RLazarus added a comment to T367880: Set AppArmor profile via SecurityContext rather than annotations (k8s >=1.30).

Weirdly, Envoy failed to start after the change, with this in the logs:

Wed, Apr 8, 9:35 PM · Patch-For-Review, ServiceOps new, Kubernetes, Prod-Kubernetes

Wed, Mar 25

RLazarus closed T420679: No mediawiki-on-kubernetes alerts are paging as Resolved.
Wed, Mar 25, 5:36 PM · MW-on-K8s, ServiceOps-Mediawiki, ServiceOps new, Sustainability (Incident Followup)
RLazarus updated subscribers of T420498: Factor in pooled status for SLO measurements.

We were actually just talking about this in the SLOs group last week (adding @Vgutierrez and @CDanis).

Wed, Mar 25, 4:22 PM · SRE-SLO, observability, Traffic

Tue, Mar 24

RLazarus assigned T380299: Revisit use of the wmf-deployment Gerrit group for deployment-charts rights to wiki_willy.

In the serviceops meeting today we decided to go ahead with this.

Tue, Mar 24, 12:47 AM · Infrastructure-Foundations, ServiceOps new, Kubernetes

Mon, Mar 23

RLazarus closed T420982: vopsbot !ack and !resolve without incident numbers aren't working as Resolved.
Mon, Mar 23, 11:06 PM · Observability-Alerting, SRE-OnFire, SRE
RLazarus created T420982: vopsbot !ack and !resolve without incident numbers aren't working.
Mon, Mar 23, 6:17 PM · Observability-Alerting, SRE-OnFire, SRE

Fri, Mar 20

RLazarus triaged T420679: No mediawiki-on-kubernetes alerts are paging as High priority.
Fri, Mar 20, 1:49 AM · MW-on-K8s, ServiceOps-Mediawiki, ServiceOps new, Sustainability (Incident Followup)
RLazarus created T420679: No mediawiki-on-kubernetes alerts are paging.
Fri, Mar 20, 1:49 AM · MW-on-K8s, ServiceOps-Mediawiki, ServiceOps new, Sustainability (Incident Followup)

Mar 17 2026

RLazarus moved T341441: Pushing mediawiki-multiversion Docker image from deploy server takes 4 minutes from Inbox to Backlog on the ServiceOps new board.
Mar 17 2026, 7:59 PM · ServiceOps-SharedInfra, ServiceOps new, Release-Engineering-Team (Radar), Scap, MW-on-K8s
RLazarus triaged T341441: Pushing mediawiki-multiversion Docker image from deploy server takes 4 minutes as Medium priority.
Mar 17 2026, 7:59 PM · ServiceOps-SharedInfra, ServiceOps new, Release-Engineering-Team (Radar), Scap, MW-on-K8s
RLazarus moved T380299: Revisit use of the wmf-deployment Gerrit group for deployment-charts rights from Inbox to Needs Info / Blocked on the ServiceOps new board.

(Service Ops triage here: moving this to Needs Info for discussion at our team meeting.)

Mar 17 2026, 5:30 PM · Infrastructure-Foundations, ServiceOps new, Kubernetes
RLazarus edited projects for T380299: Revisit use of the wmf-deployment Gerrit group for deployment-charts rights, added: ServiceOps new; removed serviceops-deprecated.
Mar 17 2026, 5:29 PM · Infrastructure-Foundations, ServiceOps new, Kubernetes
RLazarus moved T382710: Deploy portals independently of MediaWiki from Inbox to Backlog on the ServiceOps new board.
Mar 17 2026, 5:14 PM · MW-on-K8s, ServiceOps new, Wikimedia-Portals
RLazarus triaged T382710: Deploy portals independently of MediaWiki as Low priority.
Mar 17 2026, 5:14 PM · MW-on-K8s, ServiceOps new, Wikimedia-Portals
RLazarus moved T390946: Harmonise configs between API gateway and REST gateway from Inbox to Backlog on the ServiceOps new board.
Mar 17 2026, 5:07 PM · ServiceOps-SharedInfra, ServiceOps new
RLazarus triaged T390946: Harmonise configs between API gateway and REST gateway as Low priority.
Mar 17 2026, 5:07 PM · ServiceOps-SharedInfra, ServiceOps new
RLazarus added a comment to T420264: Data Platform SRE paging alerts and on-call SRE response.

One more axis to consider: Best-practices-wise, for alerting on Kubernetes platforms, there's a distinction between control plane and data plane.

Mar 17 2026, 3:27 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17), SRE

Mar 13 2026

RLazarus closed T410975: Upgrade Envoy to v1.35.7, a subtask of T380211: Upgrade Envoy to >= 1.24, as Resolved.
Mar 13 2026, 2:23 AM · SRE, serviceops-deprecated, envoy
RLazarus closed T410975: Upgrade Envoy to v1.35.7 as Resolved.

Resolving; the remaining hosts will go straight to 1.35.9 in T419637 instead.

Mar 13 2026, 2:23 AM · ServiceOps-Services-Oids, ServiceOps new, SRE, envoy

Mar 12 2026

RLazarus changed the status of T419637: Upgrade Envoy to v1.35.9 from Open to In Progress.
Mar 12 2026, 7:46 PM · User-Eevans, ServiceOps-Services-Oids, envoy, ServiceOps new
RLazarus moved T411058: Can't deploy machinetranslation due to exceeding resource quotas from Backlog to Radar (Pending) on the ServiceOps new board.
Mar 12 2026, 6:35 PM · ServiceOps new, LPL Essential (FY2025-26 Q3), LPL Projects (Other), Unplanned-Sprint-Work, MinT, Prod-Kubernetes, SRE
RLazarus moved T416623: Decommission NodeJS IPoid service from Radar (Awareness) to Radar (Pending) on the ServiceOps new board.
Mar 12 2026, 5:13 PM · Essential-Work, Product Safety and Integrity, ServiceOps-Services-Oids, ServiceOps new, iPoid-Service (IPoid OpenSearch)
RLazarus assigned T419747: Possible hardware issues on wikikube-worker2332.codfw.wmnet to Scott_French.
Mar 12 2026, 5:10 PM · SRE, ops-codfw, DC-Ops, ServiceOps new
RLazarus assigned T419058: Prepare packages and production images for ICU 72 upgrade to Scott_French.
Mar 12 2026, 5:06 PM · Essential-Work, User-Raine, ServiceOps new, ServiceOps-Upgrades-Hardware, ServiceOps-Mediawiki
RLazarus changed the status of T419831: Create a memcached::mediawiki:wikifunctions role from Open to In Progress.
Mar 12 2026, 4:16 PM · Patch-For-Review, ServiceOps-Datastores, ServiceOps new
RLazarus merged T419784: Change Wikifunctions k8s pods apparmor annotation to a config field, former is deprecated since k8s 1.30 into T367880: Set AppArmor profile via SecurityContext rather than annotations (k8s >=1.30).
Mar 12 2026, 2:33 AM · Patch-For-Review, ServiceOps new, Kubernetes, Prod-Kubernetes
RLazarus merged task T419784: Change Wikifunctions k8s pods apparmor annotation to a config field, former is deprecated since k8s 1.30 into T367880: Set AppArmor profile via SecurityContext rather than annotations (k8s >=1.30).
Mar 12 2026, 2:33 AM · Kubernetes, Abstract Wikipedia team, Essential-Work, function-orchestrator, function-evaluator
RLazarus updated subscribers of T419781: Lots of Wikifunctions k8s pods in staging stuck in "Terminating", some for 14 days+.

Unsurprisingly /var/log/kern.log on kubestage1006 (hosting the above example pod) is full of lines like:

Mar 12 2026, 2:30 AM · Patch-For-Review, Prod-Kubernetes, ServiceOps-Services-Oids, Kubernetes, ServiceOps new, Abstract Wikipedia team
RLazarus changed the status of T419781: Lots of Wikifunctions k8s pods in staging stuck in "Terminating", some for 14 days+ from Open to In Progress.
Mar 12 2026, 2:10 AM · Patch-For-Review, Prod-Kubernetes, ServiceOps-Services-Oids, Kubernetes, ServiceOps new, Abstract Wikipedia team
RLazarus added a comment to T419781: Lots of Wikifunctions k8s pods in staging stuck in "Terminating", some for 14 days+.

The actual profiles are at modules/profile/files/kubernetes/node/wikifunctions-evaluator and .../wikifunctions-orchestrator (which is the same except s/evaluator/orchestrator/g). The staging and prod containers have the same apparmor annotation[1] so should have the same signals policy, and anyway it looks correct at a glance.

Mar 12 2026, 1:57 AM · Patch-For-Review, Prod-Kubernetes, ServiceOps-Services-Oids, Kubernetes, ServiceOps new, Abstract Wikipedia team
RLazarus added projects to T419781: Lots of Wikifunctions k8s pods in staging stuck in "Terminating", some for 14 days+: ServiceOps-Services-Oids, Prod-Kubernetes.
rzl@deploy2002:~$ kube-env wikifunctions staging
rzl@deploy2002:~$ kubectl describe pod function-evaluator-javascript-evaluator-58c586f4c5-zgzvp
[ output trimmed to just the relevant lines: ]
Containers:
  function-evaluator-javascript-evaluator:
    Container ID:    containerd://31d8b176b3d2c8055e2a6c0d353eb9cf3964a4deb8e322c22316098b6adc6eb9
    State:           Running
  function-evaluator-javascript-evaluator-tls-proxy:
    Container ID:    containerd://ea12e56b08b979b586254f08b8ac7fb9011d9adc3e4370c5d8fd0d0237be2ac3
    State:           Terminated
      Reason:        Completed
      Exit Code:     0
Events:
  Type     Reason         Age                    From     Message
  ----     ------         ----                   ----     -------
  Warning  FailedMount    54m (x303 over 10h)    kubelet  MountVolume.SetUp failed for volume "tls-certs-volume" : object "wikifunctions"/"function-evaluator-javascript-evaluator-tls-proxy-certs" not registered
  Warning  FailedMount    44m (x308 over 10h)    kubelet  MountVolume.SetUp failed for volume "envoy-config-volume" : object "wikifunctions"/"function-evaluator-javascript-evaluator-envoy-config-volume" not registered
  Normal   Killing        14m (x895 over 10h)    kubelet  Stopping container function-evaluator-javascript-evaluator
  Warning  FailedKillPod  4m48s (x908 over 10h)  kubelet  error killing pod: [failed to "KillContainer" for "function-evaluator-javascript-evaluator" with KillContainerError: "rpc error: code = Unknown desc = failed to kill container \"31d8b176b3d2c8055e2a6c0d353eb9cf3964a4deb8e322c22316098b6adc6eb9\": unknown error after kill: runc did not terminate successfully: exit status 1: unable to signal init: permission denied\n: unknown", failed to "KillPodSandbox" for "fb689e76-3315-4fa2-8157-772ff7a8a45d" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to stop container \"31d8b176b3d2c8055e2a6c0d353eb9cf3964a4deb8e322c22316098b6adc6eb9\": failed to kill container \"31d8b176b3d2c8055e2a6c0d353eb9cf3964a4deb8e322c22316098b6adc6eb9\": unknown error after kill: runc did not terminate successfully: exit status 1: unable to signal init: permission denied\n: unknown"]
Mar 12 2026, 1:31 AM · Patch-For-Review, Prod-Kubernetes, ServiceOps-Services-Oids, Kubernetes, ServiceOps new, Abstract Wikipedia team

Mar 11 2026

RLazarus removed a project from T419647: Eqiad: lsw1-d2-eqiad BGP maintenance: ServiceOps new.

Service Ops triage here: Agreed there's nothing for us to do, thanks @ayounsi - untagging us.

Mar 11 2026, 4:24 PM · netops, Infrastructure-Foundations, SRE

Mar 10 2026

RLazarus moved T419637: Upgrade Envoy to v1.35.9 from Inbox to Scheduled (this Q) on the ServiceOps new board.
Mar 10 2026, 11:51 PM · User-Eevans, ServiceOps-Services-Oids, envoy, ServiceOps new
RLazarus created T419637: Upgrade Envoy to v1.35.9.
Mar 10 2026, 11:50 PM · User-Eevans, ServiceOps-Services-Oids, envoy, ServiceOps new
RLazarus added a comment to T417163: Noise in #wikimedia-operations is making incident response more difficult.

This makes tracking response difficult, and in some situations recently we have had to move to #wikimedia-sre in order to communicate properly. If we wish to pursue this as an official pattern, it needs to be documented and recorded.

Mar 10 2026, 9:01 PM · SRE-Unowned, SRE, Sustainability (Incident Followup)
RLazarus added a comment to F32455073: find_collations.py.

Here you go: https://gitlab.wikimedia.org/repos/sre/serviceops-kitchensink/-/merge_requests/28

Mar 10 2026, 6:44 PM
RLazarus added projects to T419229: Periodic job alerts could use some more information on what to do: ServiceOps-Mediawiki, MW-on-K8s.
Mar 10 2026, 4:00 PM · MW-on-K8s, ServiceOps-Mediawiki, ServiceOps new, SRE Observability
RLazarus triaged T419229: Periodic job alerts could use some more information on what to do as Low priority.

(ServiceOps bug triage here.)

Mar 10 2026, 3:59 PM · MW-on-K8s, ServiceOps-Mediawiki, ServiceOps new, SRE Observability

Mar 2 2026

RLazarus closed T417020: Proposal: mw-cron failure tasks that get automatically filed for unstewarded components should also tag the ServiceOps Phabricator project as Resolved.

(Copying from my Gerrit comment:)

Mar 2 2026, 10:36 PM · user-a_smart_kitten, ServiceOps new

Feb 26 2026

RLazarus added a comment to T354853: Service mesh envoy does not treat incoming connections as local.

Not offhand, but your reading sounds believable to me. I'd also be interested in whether the change to X-Forwarded-For (Envoy wouldn't append the remote address to it anymore) would cause any problems for anyone depending on that, but I wouldn't imagine so.

Feb 26 2026, 6:01 PM · ServiceOps-SharedInfra, ServiceOps new
RLazarus closed T372242: Alert on unscrapable pods as Declined.

From the serviceops triage meeting:

Feb 26 2026, 5:47 PM · SRE Observability (FY2025/2026-Q1), Observability-Alerting, serviceops-deprecated, Kubernetes

Feb 19 2026

RLazarus closed T417456: charlie should support --services_dir as Resolved.
rzl@deploy2002:/srv/deployment-charts/helmfile.d/ml-services$ charlie --services_dir . list
_example_: ml-serve-eqiad, ml-serve-codfw
article-descriptions: ml-staging-codfw, ml-serve-eqiad, ml-serve-codfw
article-models: ml-staging-codfw, ml-serve-eqiad, ml-serve-codfw
articletopic-outlink: ml-staging-codfw, ml-serve-eqiad, ml-serve-codfw
edit-check: ml-staging-codfw, ml-serve-eqiad, ml-serve-codfw
experimental: ml-staging-codfw, ml-serve-eqiad, ml-serve-codfw
llm: ml-staging-codfw, ml-serve-eqiad, ml-serve-codfw
logo-detection: ml-staging-codfw, ml-serve-eqiad, ml-serve-codfw
ores-legacy: ml-staging-codfw, ml-serve-eqiad, ml-serve-codfw
readability: ml-staging-codfw, ml-serve-eqiad, ml-serve-codfw
recommendation-api-ng: ml-staging-codfw, ml-serve-eqiad, ml-serve-codfw
revertrisk: ml-staging-codfw, ml-serve-eqiad, ml-serve-codfw
revise-tone-task-generator: ml-staging-codfw, ml-serve-eqiad, ml-serve-codfw
revision-models: ml-staging-codfw, ml-serve-eqiad, ml-serve-codfw
revscoring-articlequality: ml-staging-codfw, ml-serve-eqiad, ml-serve-codfw
revscoring-articletopic: ml-staging-codfw, ml-serve-eqiad, ml-serve-codfw
revscoring-draftquality: ml-staging-codfw, ml-serve-eqiad, ml-serve-codfw
revscoring-drafttopic: ml-staging-codfw, ml-serve-eqiad, ml-serve-codfw
revscoring-editquality-damaging: ml-staging-codfw, ml-serve-eqiad, ml-serve-codfw
revscoring-editquality-goodfaith: ml-staging-codfw, ml-serve-eqiad, ml-serve-codfw
revscoring-editquality-reverted: ml-staging-codfw, ml-serve-eqiad, ml-serve-codfw
Feb 19 2026, 6:41 PM · serviceops-tooling, Kubernetes, ServiceOps new

Feb 14 2026

RLazarus added a comment to T289202: Run httpbb periodically.

Yeah, I'm fine closing this. I'd still like to address the flapping alerts from intermittent failures someday, maybe with smarter logic around alerting if we fail N in a row, but we don't need to keep this task open to track that. Since the appserver config isn't in Puppet anymore, the situation in T289202#7812998 isn't a factor.

Feb 14 2026, 3:08 AM · ServiceOps-SharedInfra, ServiceOps new, SRE
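
The "fail N in a row" idea from the comment above could be sketched roughly as follows. This is only an illustration, not httpbb's or the alerting stack's actual logic: the function names, the list-of-booleans input, and the default threshold of 3 are all assumptions made for the example.

```python
from typing import Sequence


def trailing_failures(results: Sequence[bool]) -> int:
    """Count how many of the most recent checks failed in a row.

    `results` holds check outcomes, oldest first (True means the run passed).
    """
    count = 0
    for passed in reversed(results):
        if passed:
            break
        count += 1
    return count


def should_alert(results: Sequence[bool], threshold: int = 3) -> bool:
    """Alert only when the latest `threshold` runs have all failed.

    A single intermittent failure resets nothing permanently; it just
    never reaches the threshold unless the failures are consecutive.
    """
    return trailing_failures(results) >= threshold
```

With this shape, an isolated failure between passing runs never fires, which is the flapping case the comment describes; only a sustained streak at the end of the history does.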

Feb 13 2026

RLazarus changed the status of T417456: charlie should support --services_dir from Open to In Progress.
Feb 13 2026, 9:59 PM · serviceops-tooling, Kubernetes, ServiceOps new
RLazarus created T417456: charlie should support --services_dir.
Feb 13 2026, 9:59 PM · serviceops-tooling, Kubernetes, ServiceOps new

Feb 12 2026

RLazarus added a comment to T406836: The Edit Check's SLO has burned all its error budget.

Hi Editing Team -- is this still on your plate for the current work cycle? Let us know how we can help.

Feb 12 2026, 11:16 PM · Editing-team (Editing-Q4-27Apr-8May-2026), OKR-Work, Goal, EditCheck
RLazarus added a comment to T410975: Upgrade Envoy to v1.35.7.

We're very close. There are some bare-metal hosts still left to upgrade; I've updated the description (see also Debmonitor).

Feb 12 2026, 11:08 PM · ServiceOps-Services-Oids, ServiceOps new, SRE, envoy
RLazarus updated the task description for T410975: Upgrade Envoy to v1.35.7.
Feb 12 2026, 11:07 PM · ServiceOps-Services-Oids, ServiceOps new, SRE, envoy
RLazarus closed T380390: mw-videoscaler helm chart fails to render in staging as Invalid.

ServiceOps triage -- closing this as obsolete, as it was fixed at https://gerrit.wikimedia.org/r/1080583.

Feb 12 2026, 3:19 AM · serviceops-deprecated, MW-on-K8s
RLazarus moved T394433: mobileapps consistently 503s when a summary of an image is requested from Inbox to Backlog on the ServiceOps new board.
Feb 12 2026, 2:33 AM · ServiceOps-Services-Oids, ServiceOps new, Page Content Service
RLazarus triaged T394433: mobileapps consistently 503s when a summary of an image is requested as Medium priority.

Triaging this in 2026: We should evaluate if this 503 rate is still high, and go from there.

Feb 12 2026, 2:33 AM · ServiceOps-Services-Oids, ServiceOps new, Page Content Service
RLazarus moved T394501: Reduce wikifeeds 5xx rate in order to enable better SRE response from Inbox to Backlog on the ServiceOps new board.
Feb 12 2026, 2:21 AM · ServiceOps-Services-Oids, ServiceOps new, Wikifeeds
RLazarus triaged T394501: Reduce wikifeeds 5xx rate in order to enable better SRE response as Medium priority.

Triaging this in 2026: We should evaluate if the error rate is still high, and go from there. At a glance, it seems like things might have improved.

Feb 12 2026, 2:21 AM · ServiceOps-Services-Oids, ServiceOps new, Wikifeeds
RLazarus moved T367880: Set AppArmor profile via SecurityContext rather than annotations (k8s >=1.30) from Inbox to Backlog on the ServiceOps new board.
Feb 12 2026, 2:04 AM · Patch-For-Review, ServiceOps new, Kubernetes, Prod-Kubernetes
RLazarus edited projects for T367880: Set AppArmor profile via SecurityContext rather than annotations (k8s >=1.30), added: ServiceOps new; removed serviceops-deprecated.
Feb 12 2026, 2:03 AM · Patch-For-Review, ServiceOps new, Kubernetes, Prod-Kubernetes
RLazarus moved T392478: Move cloudweb to Ganeti VMs and repurpose the servers as wikikube nodes from Inbox to Radar (Awareness) on the ServiceOps new board.
Feb 12 2026, 2:01 AM · Prod-Kubernetes, ServiceOps new, cloud-services-team (FY2025/2026-Q3-Q4), Horizon, Striker, SRE
RLazarus edited projects for T392478: Move cloudweb to Ganeti VMs and repurpose the servers as wikikube nodes, added: ServiceOps new, Prod-Kubernetes; removed serviceops-deprecated.
Feb 12 2026, 2:01 AM · Prod-Kubernetes, ServiceOps new, cloud-services-team (FY2025/2026-Q3-Q4), Horizon, Striker, SRE
RLazarus moved T394657: Implement continuously running maintenance jobs from Inbox to Backlog on the ServiceOps new board.
Feb 12 2026, 1:55 AM · ServiceOps-Mediawiki, ServiceOps new, MW-on-K8s
RLazarus edited projects for T394657: Implement continuously running maintenance jobs, added: ServiceOps new, ServiceOps-Mediawiki; removed serviceops-deprecated.
Feb 12 2026, 1:55 AM · ServiceOps-Mediawiki, ServiceOps new, MW-on-K8s

Feb 5 2026

RLazarus closed T411411: Allow dash-suffixes for chart versions as Declined.

Got it, sounds like a fine workaround. :) Closing as above in that case, but I still hope we can make this better all around and I'll let you know how things progress.

Feb 5 2026, 8:23 PM · Prod-Kubernetes, ServiceOps new
RLazarus moved T416623: Decommission NodeJS IPoid service from Inbox to Radar (Awareness) on the ServiceOps new board.
Feb 5 2026, 5:28 PM · Essential-Work, Product Safety and Integrity, ServiceOps-Services-Oids, ServiceOps new, iPoid-Service (IPoid OpenSearch)
RLazarus added projects to T416623: Decommission NodeJS IPoid service: ServiceOps new, ServiceOps-Services-Oids.
Feb 5 2026, 5:27 PM · Essential-Work, Product Safety and Integrity, ServiceOps-Services-Oids, ServiceOps new, iPoid-Service (IPoid OpenSearch)

Feb 4 2026

RLazarus added a comment to T411411: Allow dash-suffixes for chart versions.

Thanks @MLechvien-WMF, sorry @daniel for not responding sooner.

Feb 4 2026, 6:54 PM · Prod-Kubernetes, ServiceOps new

Feb 3 2026

RLazarus added a comment to T405703: Update wikikube eqiad to kubernetes 1.31.

Before the next upgrade, we may want to give charlie the ability to optionally exclude mediawiki services, so that they can be sequenced independently (e.g., via SKIP_DIRS).

Feb 3 2026, 7:04 PM · Discovery-Search (2025.09.26 - 2025.10.17), Data-Platform-SRE (2025.09.26 - 2025.10.17), Patch-For-Review, collaboration-services, Kubernetes, Prod-Kubernetes, serviceops-deprecated

Jan 13 2026

RLazarus moved T411058: Can't deploy machinetranslation due to exceeding resource quotas from Inbox to Backlog on the ServiceOps new board.
Jan 13 2026, 2:36 AM · ServiceOps new, LPL Essential (FY2025-26 Q3), LPL Projects (Other), Unplanned-Sprint-Work, MinT, Prod-Kubernetes, SRE
RLazarus edited projects for T411058: Can't deploy machinetranslation due to exceeding resource quotas, added: ServiceOps new; removed serviceops-deprecated.
Jan 13 2026, 2:36 AM · ServiceOps new, LPL Essential (FY2025-26 Q3), LPL Projects (Other), Unplanned-Sprint-Work, MinT, Prod-Kubernetes, SRE
RLazarus moved T410626: WE6.2.6: ☂️ hcaptcha-proxy Production Readiness Review from Inbox to In Progress on the ServiceOps new board.
Jan 13 2026, 2:19 AM · User-Raine, Epic, ServiceOps-Services-Oids, ServiceOps new
RLazarus changed the status of T410626: WE6.2.6: ☂️ hcaptcha-proxy Production Readiness Review from Open to In Progress.
Jan 13 2026, 2:19 AM · User-Raine, Epic, ServiceOps-Services-Oids, ServiceOps new
RLazarus moved T410975: Upgrade Envoy to v1.35.7 from Inbox to In Progress on the ServiceOps new board.
Jan 13 2026, 2:13 AM · ServiceOps-Services-Oids, ServiceOps new, SRE, envoy
RLazarus changed the status of T410975: Upgrade Envoy to v1.35.7, a subtask of T380211: Upgrade Envoy to >= 1.24, from Open to In Progress.
Jan 13 2026, 2:13 AM · SRE, serviceops-deprecated, envoy
RLazarus changed the status of T410975: Upgrade Envoy to v1.35.7 from Open to In Progress.
Jan 13 2026, 2:12 AM · ServiceOps-Services-Oids, ServiceOps new, SRE, envoy

Jan 8 2026

RLazarus closed T414132: httpbb tests are failing due to changed error message as Resolved.
Jan 8 2026, 9:01 PM · serviceops-deprecated
RLazarus claimed T414132: httpbb tests are failing due to changed error message.
Jan 8 2026, 8:36 PM · serviceops-deprecated

Jan 6 2026

RLazarus closed T341553: Allow running one-off scripts manually as Resolved.

Happy new year! The core functionality here is long since complete and in widespread active use. I'm going to resolve this task, which should no longer be used for general mwscript-k8s feedback. There's still work to do, including both UX polish like T387268 and added functionality like T379675, and those tasks remain open to track that work.

Jan 6 2026, 2:10 AM · MW-on-K8s, serviceops-deprecated
RLazarus closed T341553: Allow running one-off scripts manually, a subtask of T341560: Migrate mwmaint server functionality to mw-on-k8s, as Resolved.
Jan 6 2026, 2:10 AM · serviceops-deprecated, MW-on-K8s

Dec 27 2025

RLazarus added a comment to T413544: Decide whether to exclude {api,rest}-gateway-ro from ATSBackendErrorsHigh.

Proposing https://gerrit.wikimedia.org/r/1221195 as an interim solution over the break, and once we're back we can either keep it (and resolve this task) or revert it (and abandon this task) or some third thing.

Dec 27 2025, 10:34 PM · serviceops-deprecated, SRE
RLazarus created T413544: Decide whether to exclude {api,rest}-gateway-ro from ATSBackendErrorsHigh.
Dec 27 2025, 10:32 PM · serviceops-deprecated, SRE

Dec 19 2025

RLazarus renamed T411058: Can't deploy machinetranslation due to exceeding resource quotas from machinetranslation eqiad pods in state ContainerStatusUnknown to Can't deploy machinetranslation due to exceeding resource quotas.
Dec 19 2025, 2:15 AM · ServiceOps new, LPL Essential (FY2025-26 Q3), LPL Projects (Other), Unplanned-Sprint-Work, MinT, Prod-Kubernetes, SRE
RLazarus added a comment to T411058: Can't deploy machinetranslation due to exceeding resource quotas.

Helm timed out again when I tried to deploy machinetranslation for the next round of envoy upgrades. I'll retitle this task, as the ContainerStatusUnknown pods aren't the cause of the problem, but we still can't run helmfile apply and we should fix that.

Dec 19 2025, 2:14 AM · ServiceOps new, LPL Essential (FY2025-26 Q3), LPL Projects (Other), Unplanned-Sprint-Work, MinT, Prod-Kubernetes, SRE
RLazarus closed T380211: Upgrade Envoy to >= 1.24 as Resolved.

Still some hosts remaining to upgrade to 1.35 in T410975, but we don't need this umbrella task open to track the multi-stage upgrade anymore.

Dec 19 2025, 1:37 AM · SRE, serviceops-deprecated, envoy
RLazarus closed T405808: Upgrade Envoy to v1.32.12, a subtask of T380211: Upgrade Envoy to >= 1.24, as Resolved.
Dec 19 2025, 1:29 AM · SRE, serviceops-deprecated, envoy
RLazarus closed T405808: Upgrade Envoy to v1.32.12 as Resolved.
Dec 19 2025, 1:29 AM · SRE, serviceops-deprecated, envoy
RLazarus closed T409510: Envoy config updates from v1.32, a subtask of T405808: Upgrade Envoy to v1.32.12, as Resolved.
Dec 19 2025, 1:29 AM · SRE, serviceops-deprecated, envoy
RLazarus closed T409510: Envoy config updates from v1.32 as Resolved.
Dec 19 2025, 1:29 AM · SRE, serviceops-deprecated, envoy

Dec 12 2025

RLazarus updated subscribers of T412493: ErrorBudgetBurn.

@dr0ptp4kt (And cc @herron) Good question!

Dec 12 2025, 9:01 PM · Test Kitchen (Test Kitchen (Experiment Platform Sprint 22))

Dec 3 2025

RLazarus renamed T410975: Upgrade Envoy to v1.35.7 from Upgrade Envoy to v1.35.6 to Upgrade Envoy to v1.35.7.
Dec 3 2025, 11:44 PM · ServiceOps-Services-Oids, ServiceOps new, SRE, envoy
RLazarus added a comment to T410975: Upgrade Envoy to v1.35.7.

Envoy 1.35.7 is about to come out, with security fixes: https://groups.google.com/g/envoy-announce/c/zr2OzwmJFqY

Dec 3 2025, 11:43 PM · ServiceOps-Services-Oids, ServiceOps new, SRE, envoy

Dec 1 2025

RLazarus added a comment to T411411: Allow dash-suffixes for chart versions.

The conversation in #wikimedia-serviceops when this was raised:

Dec 1 2025, 8:38 PM · Prod-Kubernetes, ServiceOps new

Nov 26 2025

RLazarus added a project to T410933: Add Druid as a Private Grafana Datasource: Observability-Metrics.

(Clinic duty here! Apparently a milestone tag, like SRE Observability (FY2025/2026-Q3), is mutually exclusive with the project tag, like SRE Observability, and that means the task shows up on the clinic duty dashboard as "needs triage." I'm adding Observability-Metrics at a guess, because that also takes it off the triage list, but if you'll be using those milestone tags going forward, we may want to adjust the clinic duty dashboard query.)

Nov 26 2025, 6:09 PM · Observability-Metrics, SRE Observability (FY2025/2026-Q3), SRE
RLazarus closed T410972: Requesting access to cassandra-staging-devs group for amastilovic as Resolved.

@Ahoelzl @KOfori Thanks both!

Nov 26 2025, 5:54 PM · SRE, SRE-Access-Requests
RLazarus updated the task description for T410972: Requesting access to cassandra-staging-devs group for amastilovic.
Nov 26 2025, 5:30 PM · SRE, SRE-Access-Requests

Nov 25 2025

RLazarus created T411058: Can't deploy machinetranslation due to exceeding resource quotas.
Nov 25 2025, 11:11 PM · ServiceOps new, LPL Essential (FY2025-26 Q3), LPL Projects (Other), Unplanned-Sprint-Work, MinT, Prod-Kubernetes, SRE
RLazarus moved T410972: Requesting access to cassandra-staging-devs group for amastilovic from Untriaged to Manager/NDA Approval/Confirmation on the SRE-Access-Requests board.
Nov 25 2025, 6:11 PM · SRE, SRE-Access-Requests
RLazarus closed T409707: Requesting access to Analytics_Privatedata for Chandra-WMDE as Resolved.

Added to nda:

rzl@ldap-maint1001:~$ ldapsearch -x cn=nda | grep chandra-wmde
member: uid=chandra-wmde,ou=people,dc=wikimedia,dc=org
Nov 25 2025, 6:10 PM · Data-Engineering, SRE, SRE-Access-Requests
RLazarus updated the task description for T409707: Requesting access to Analytics_Privatedata for Chandra-WMDE.
Nov 25 2025, 6:02 PM · Data-Engineering, SRE, SRE-Access-Requests
RLazarus added a comment to T410426: Requesting access to analytics-privatedata-users for dsmit.

Oh, and: On top of L3 which you've already read, please ensure you're also familiar with https://wikitech.wikimedia.org/wiki/Data_Platform/Data_access#User_responsibilities and reach out if you have any questions. Thanks!

Nov 25 2025, 5:58 PM · SRE, SRE-Access-Requests