RLazarus (Reuven Lazarus) (rzl)
User

User Details

User Since
Oct 15 2019, 4:02 PM (347 w, 1 d)
Availability
Available
IRC Nick
rzl
LDAP User
RLazarus
MediaWiki User
RLazarus (WMF)

Recent Activity

Today

RLazarus closed T427553: Requesting access to <Superset> for <APDube-WMF> as Resolved.

Great!

Wed, Jun 10, 5:04 PM · Data-Engineering, SRE, SRE-Access-Requests
RLazarus added a comment to T427553: Requesting access to <Superset> for <APDube-WMF>.

My fault, sorry about that!

Wed, Jun 10, 4:51 PM · Data-Engineering, SRE, SRE-Access-Requests

Yesterday

RLazarus added a comment to T428301: MediaWiki periodic job campaignevents-aggregateanswers-metawiki failed.

This is allowed, but you need to use deploy (read-write) credentials. Try it after

Tue, Jun 9, 9:22 PM · ServiceOps new, Connection-Team
RLazarus moved T426995: Requesting access to deployment for caro from Manager/NDA Approval/Confirmation to Awaiting User Input on the SRE-Access-Requests board.
Tue, Jun 9, 3:27 PM · SRE, SRE-Access-Requests
RLazarus closed T427553: Requesting access to <Superset> for <APDube-WMF> as Resolved.

Done! Please wait up to 30 minutes for that to propagate to all servers, then you should be all set. I'll resolve this but feel free to reopen it if you have any trouble using your new access.

Tue, Jun 9, 3:26 PM · Data-Engineering, SRE, SRE-Access-Requests
RLazarus updated subscribers of T428133: Alert in need of triage: ProbeDown (instance sophroid:4252).

Chatted with @Scott_French about this today. The cause is Sophroid's HTTP probe in service.yaml, which is routed to a port that doesn't handle HTTP, only gRPC.

Tue, Jun 9, 3:14 AM · ServiceOps new, sre-alert-triage

Mon, Jun 8

RLazarus updated the task description for T428416: Requesting access to "analytics-privatedata-users" for Mahmoud Abdelsattar (WMDE).
Mon, Jun 8, 10:17 PM · SRE, SRE-Access-Requests
RLazarus closed T427701: Requesting access to Cassandra staging for akhatun as Resolved.

This is done! Wait up to 30 minutes for it to propagate to all hosts, and you'll be all set. Resolving the task, feel free to reopen it if you have any trouble.

Mon, Jun 8, 10:03 PM · SRE, SRE-Access-Requests
RLazarus added a comment to T427553: Requesting access to <Superset> for <APDube-WMF>.

Hi @APDube-WMF! I see you provided an SSH key on the task, but if Superset access is all you need, we won't actually need it. I'll set you up without SSH access for now, but if you also need access to anything listed at level 2 or higher, like the stat servers, let me know, and we can always add your SSH key later.

Mon, Jun 8, 9:58 PM · Data-Engineering, SRE, SRE-Access-Requests
RLazarus updated the task description for T426995: Requesting access to deployment for caro.
Mon, Jun 8, 8:56 PM · SRE, SRE-Access-Requests
RLazarus closed T417056: SSH key replacement for tchanders as Resolved.

Never mind! Imagine my surprise to find that key already there. :) This was done in https://gerrit.wikimedia.org/r/1298282, thanks @ssingh!

Mon, Jun 8, 8:47 PM · SRE, SRE-Access-Requests
RLazarus changed the status of T417056: SSH key replacement for tchanders from Open to In Progress.

Verified out of band, updating.

Mon, Jun 8, 8:44 PM · SRE, SRE-Access-Requests
RLazarus updated the task description for T428416: Requesting access to "analytics-privatedata-users" for Mahmoud Abdelsattar (WMDE).
Mon, Jun 8, 8:39 PM · SRE, SRE-Access-Requests
RLazarus changed the status of T428416: Requesting access to "analytics-privatedata-users" for Mahmoud Abdelsattar (WMDE) from Open to In Progress.

Hi @mahmoud.abdelsattar.wmde! I see you already have restricted access (using the SSH key you included), so we should be able to additionally grant you analytics-privatedata-users. (For anyone following along, that effectively means level 2 access, even though only level 1 is strictly required for Superset.)

Mon, Jun 8, 8:39 PM · SRE, SRE-Access-Requests
RLazarus closed T428262: Requesting access to deployment for OSleger_WMF as Resolved.

All done! This will take up to 30 minutes to roll out everywhere, then make sure to follow the instructions in T428262#11990183.

Mon, Jun 8, 8:27 PM · SRE, SRE-Access-Requests
RLazarus updated the task description for T427701: Requesting access to Cassandra staging for akhatun.
Mon, Jun 8, 8:23 PM · SRE, SRE-Access-Requests
RLazarus added a comment to T426995: Requesting access to deployment for caro.

SSH public key (must be a separate key from Wikimedia cloud SSH access): N/A (already in modules/admin/data/data.yaml)

Mon, Jun 8, 8:13 PM · SRE, SRE-Access-Requests
RLazarus added a comment to T406836: The Edit Check's SLO has burned all its error budget.

I'm always happy to get involved if you need me, but my colleague @CDanis from the SLO working group is your best contact for the followup here. :)

Mon, Jun 8, 6:49 PM · Editing-team (Editing-current-Q4-8Jun-19Jun-2026), OKR-Work, Goal, EditCheck
RLazarus updated the task description for T428278: Consider emitting an HTTP header from the orchestrator when talking to the evaluator to tell Envoy the outer limits of the remaining timeout for this request.
Mon, Jun 8, 6:41 PM · Abstract Wikipedia team, function-orchestrator

Thu, Jun 4

RLazarus reassigned T427863: Service mesh configuration and network policies for evaluator-orchestrator callbacks from RLazarus to DMartin-WMF.
Thu, Jun 4, 11:52 PM · OKR-Work, Patch-For-Review, ServiceOps new, Abstract Wikipedia team (26Q4 (Apr–Jun))
RLazarus moved T427863: Service mesh configuration and network policies for evaluator-orchestrator callbacks from In Progress to Radar (Pending) on the ServiceOps new board.

Moving to Pending until the app-side work is done on the orchestrator and evaluator, then we can wrap this up. In order:

Thu, Jun 4, 11:51 PM · OKR-Work, Patch-For-Review, ServiceOps new, Abstract Wikipedia team (26Q4 (Apr–Jun))
RLazarus added a comment to T426044: Migrate Mediawiki memcached to Debian Trixie.

Word from the Abstract Wikipedia folks is that you should go ahead and reimage the mc-wf hosts without any prework -- just, one at a time please.

Thu, Jun 4, 4:32 PM · ServiceOps new

Tue, Jun 2

RLazarus added a comment to T427863: Service mesh configuration and network policies for evaluator-orchestrator callbacks.

I've reserved port 4974 for this, the next available service port after 4970 (evaluator) and 4971 (orchestrator main port).

Tue, Jun 2, 12:44 AM · OKR-Work, Patch-For-Review, ServiceOps new, Abstract Wikipedia team (26Q4 (Apr–Jun))

Mon, Jun 1

RLazarus moved T427863: Service mesh configuration and network policies for evaluator-orchestrator callbacks from Inbox to In Progress on the ServiceOps new board.
Mon, Jun 1, 11:25 PM · OKR-Work, Patch-For-Review, ServiceOps new, Abstract Wikipedia team (26Q4 (Apr–Jun))
RLazarus changed the status of T427863: Service mesh configuration and network policies for evaluator-orchestrator callbacks, a subtask of T421848: [Hypothesis] WE2.3.15 Complete implementation and deployment of Evaluator-Orchestrator callbacks, from Open to In Progress.
Mon, Jun 1, 11:25 PM · Abstract Wikipedia team (26Q4 (Apr–Jun)), Epic, OKR-Work
RLazarus changed the status of T427863: Service mesh configuration and network policies for evaluator-orchestrator callbacks from Open to In Progress.
Mon, Jun 1, 11:25 PM · OKR-Work, Patch-For-Review, ServiceOps new, Abstract Wikipedia team (26Q4 (Apr–Jun))
RLazarus created T427863: Service mesh configuration and network policies for evaluator-orchestrator callbacks.
Mon, Jun 1, 11:25 PM · OKR-Work, Patch-For-Review, ServiceOps new, Abstract Wikipedia team (26Q4 (Apr–Jun))

Thu, May 28

RLazarus added a comment to T427588: etherpad showing transient connection issues.

From the logs:

Thu, May 28, 11:48 PM · collaboration-services, Wikimedia-Etherpad

Fri, May 22

RLazarus reopened T418175: Create SLO for the opensearch-ipoid cluster that runs on our OpenSearch on K8s platform, a subtask of T348935: IPoid: Define service level indicators and service level objectives, as Open.
Fri, May 22, 3:49 PM · Product Safety and Integrity (Sprint lily-of-the-valley (May 4 - May 22)), Data-Platform-SRE (2026-04-24 - 2026-05-15), Essential-Work, ServiceOps new, SRE-SLO, iPoid-Service (iPoid 1.0)
RLazarus reopened T418175: Create SLO for the opensearch-ipoid cluster that runs on our OpenSearch on K8s platform as "Open".

Let's leave it open until we move it from "draft" to "approved." :)

Fri, May 22, 3:49 PM · Data-Platform-SRE (2026-04-24 - 2026-05-15)
RLazarus reopened T418175: Create SLO for the opensearch-ipoid cluster that runs on our OpenSearch on K8s platform, a subtask of T408586: ☂️ OpenSearch on K8s: Ensure that our first tenant workload is ready for production ☂️, as Open.
Fri, May 22, 3:48 PM · Data-Platform-SRE (2026-03-06 - 2026-03-27), OKR-Work
RLazarus added a comment to T418175: Create SLO for the opensearch-ipoid cluster that runs on our OpenSearch on K8s platform.

Done.

Fri, May 22, 3:36 PM · Data-Platform-SRE (2026-04-24 - 2026-05-15)

Thu, May 21

RLazarus added a comment to T425340: Upgrade the version of Rust in the abstractwiki-rust-web base images to something more modern.

Built and published:

Thu, May 21, 5:32 PM · Essential-Work, Abstract Wikipedia team (26Q4 (Apr–Jun)), function-evaluator, Abstract Wikipedia Fix-It tasks

Mon, May 18

RLazarus updated subscribers of T425340: Upgrade the version of Rust in the abstractwiki-rust-web base images to something more modern.

I chatted with @MoritzMuehlenhoff about this today (thanks Moritz).

Mon, May 18, 6:06 PM · Essential-Work, Abstract Wikipedia team (26Q4 (Apr–Jun)), function-evaluator, Abstract Wikipedia Fix-It tasks

Thu, May 14

RLazarus triaged T371069: Add helm rollback functionality to scap as Medium priority.
Thu, May 14, 5:49 PM · ServiceOps new, Release-Engineering-Team (Priority Backlog 📥), MW-on-K8s, Scap

May 11 2026

RLazarus added a parent task for T425298: MediaWiki periodic job update-special-pages-s2 failed: T422486: MediaWiki periodic job failures due to timeouts.
May 11 2026, 7:32 PM · ServiceOps new, Wikimedia-production-error, MediaWiki-Special-pages
RLazarus added a subtask for T422486: MediaWiki periodic job failures due to timeouts: T425298: MediaWiki periodic job update-special-pages-s2 failed.
May 11 2026, 7:32 PM · ServiceOps new (Next quarter), DBA
RLazarus added a comment to T425298: MediaWiki periodic job update-special-pages-s2 failed.

Relevant log lines from this one:

May 11 2026, 7:31 PM · ServiceOps new, Wikimedia-production-error, MediaWiki-Special-pages
RLazarus moved T424835: Check for Mediawiki redirect loops via httpbb from Inbox to Needs Info / Blocked on the ServiceOps new board.

We actually added a test for that case already (in T387549, for this incident):

May 11 2026, 7:10 PM · ServiceOps new, SRE Observability, Sustainability (Incident Followup)

May 9 2026

RLazarus removed a project from T371069: Add helm rollback functionality to scap: Sustainability (Incident Followup).

I think listing it in that incident was a mistake, actually -- there weren't any releases in state failed in that event, so this feature wouldn't have affected things at all. (I think the incident author wanted a one-line "roll back mediawiki without having to touch the charts repo" command, and thought from the task title that's what this task is. I'm not sure if I agree that feature would be a good idea, but it's not the same thing being discussed here.)

May 9 2026, 12:47 AM · ServiceOps new, Release-Engineering-Team (Priority Backlog 📥), MW-on-K8s, Scap

May 8 2026

RLazarus closed T424985: MediaWiki periodic job update-special-pages-s6 failed as Resolved.

Transient failure, followed by a successful run, resolving.

May 8 2026, 10:39 PM · ServiceOps new, Wikimedia-production-error, MediaWiki-Special-pages
RLazarus closed T425298: MediaWiki periodic job update-special-pages-s2 failed as Resolved.

Transient failure, followed by a successful run, resolving.

May 8 2026, 10:39 PM · ServiceOps new, Wikimedia-production-error, MediaWiki-Special-pages

May 6 2026

RLazarus triaged T385007: Extend functionality to support MediaWiki infrastructure Windows and related repos as Medium priority.
May 6 2026, 8:14 PM · User-jijiki, Release-Engineering-Team, ServiceOps new, Patch-For-Review, Wikimedia-Hackathon-2026, Tool-schedule-deployment
RLazarus changed the status of T385007: Extend functionality to support MediaWiki infrastructure Windows and related repos from Open to In Progress.

Posting as serviceops triage: Adding Releng for awareness (who don't own the schedule-deployment tool but do manage other deployment calendar automation).

May 6 2026, 8:14 PM · User-jijiki, Release-Engineering-Team, ServiceOps new, Patch-For-Review, Wikimedia-Hackathon-2026, Tool-schedule-deployment
RLazarus changed the status of T425255: Upgrade mcrouter to v2026.04.27.00 and switch build system from Open to In Progress.
May 6 2026, 7:59 PM · ServiceOps new (Next quarter), Patch-For-Review, ServiceOps-Datastores, Wikimedia-Hackathon-2026
RLazarus changed the status of T425255: Upgrade mcrouter to v2026.04.27.00 and switch build system, a subtask of T425258: Upgrade mcrouter production image to Trixie, from Open to In Progress.
May 6 2026, 7:59 PM · ServiceOps-Datastores
RLazarus moved T425390: api rate limits: split anon-mediawiki from unauthed-mediawiki from Inbox to Radar (Awareness) on the ServiceOps new board.
May 6 2026, 7:52 PM · MediaWiki-Platform-Team (Radar), ServiceOps new, OKR-Work, MW-Interfaces-Team, FY2025-26 KR 5.1
RLazarus moved T425391: api rate limits: implement more lenient limits for the Wikipedia App from Inbox to Radar (Awareness) on the ServiceOps new board.
May 6 2026, 7:52 PM · MediaWiki-Platform-Team (Radar), ServiceOps new, OKR-Work, MW-Interfaces-Team, FY2025-26 KR 5.1

May 4 2026

RLazarus changed the visibility for T424765: webrequest_sampled not updated.
May 4 2026, 11:24 PM · Incident Severity 3, Wikimedia-Incident
RLazarus created T425381: Clean up /etc/kafka/admin.properties.
May 4 2026, 11:22 PM · Data-Platform-SRE (2026-04-24 - 2026-05-15), Sustainability (Incident Followup)
RLazarus created T425380: More comprehensive end-to-end monitoring for webrequest data.
May 4 2026, 11:22 PM · Data-Platform-SRE, Sustainability (Incident Followup)
RLazarus created T425379: Tune kafka_server_BrokerTopicMetrics_BytesOut_total.
May 4 2026, 11:22 PM · Data-Platform-SRE (2026-06-05 - 2026-06-26), Sustainability (Incident Followup)

Apr 24 2026

RLazarus moved T371069: Add helm rollback functionality to scap from Inbox to Needs Info / Blocked on the ServiceOps new board.
Apr 24 2026, 5:10 PM · ServiceOps new, Release-Engineering-Team (Priority Backlog 📥), MW-on-K8s, Scap
RLazarus added a project to T371069: Add helm rollback functionality to scap: ServiceOps new.
Apr 24 2026, 5:10 PM · ServiceOps new, Release-Engineering-Team (Priority Backlog 📥), MW-on-K8s, Scap
RLazarus added a comment to T371069: Add helm rollback functionality to scap.

Thanks for digging into it!

Apr 24 2026, 4:54 PM · ServiceOps new, Release-Engineering-Team (Priority Backlog 📥), MW-on-K8s, Scap
RLazarus updated subscribers of T371069: Add helm rollback functionality to scap.

Suppose version A is running, and we're deploying version B.

Apr 24 2026, 3:43 AM · ServiceOps new, Release-Engineering-Team (Priority Backlog 📥), MW-on-K8s, Scap

Apr 21 2026

RLazarus moved T423626: Don't calculate in hot path from MCROUTER_SERVER in mc.php for WikiLambda from Inbox to In Progress on the ServiceOps new board.
Apr 21 2026, 5:51 PM · MW-1.47-notes (1.47.0-wmf.2; 2026-05-12), ServiceOps new, Abstract Wikipedia team (26Q4 (Apr–Jun)), Essential-Work, MW-1.46-notes (1.46.0-wmf.26; 2026-04-28), Technical-Debt, MW-on-K8s
RLazarus added a project to T423626: Don't calculate in hot path from MCROUTER_SERVER in mc.php for WikiLambda: ServiceOps new.
Apr 21 2026, 5:51 PM · MW-1.47-notes (1.47.0-wmf.2; 2026-05-12), ServiceOps new, Abstract Wikipedia team (26Q4 (Apr–Jun)), Essential-Work, MW-1.46-notes (1.46.0-wmf.26; 2026-04-28), Technical-Debt, MW-on-K8s

Apr 20 2026

RLazarus closed T423624: Drop in-pod mcrouter from mw-wikifunctions pod, no longer used as Resolved.
Apr 20 2026, 10:43 PM · Technical-Debt, MW-on-K8s
RLazarus added a comment to T423626: Don't calculate in hot path from MCROUTER_SERVER in mc.php for WikiLambda.

I agree that we should take the explode call out of the hot path, but reworking the existing env variables is probably more than we want to tackle -- we could orchestrate the config change for ourselves, but it'd also be a visible change for non-WMF users of MediaWiki with memcache. Not impossible but a lot of work, especially if we don't end up keeping MemcachedWrapper in the long run.

Apr 20 2026, 5:41 PM · MW-1.47-notes (1.47.0-wmf.2; 2026-05-12), ServiceOps new, Abstract Wikipedia team (26Q4 (Apr–Jun)), Essential-Work, MW-1.46-notes (1.46.0-wmf.26; 2026-04-28), Technical-Debt, MW-on-K8s

Apr 15 2026

RLazarus added a comment to T423311: Writes to /*/wf-wan/ failing with CONNECTION FAILURE or SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY (mcrouter not being reached?).

On the prerequisites:

  • Double-checked, and mw-mcrouter has all the routes except /local/wf. That's fine, because...
  • Per @Jdforrester-WMF, we don't need to keep the /local/wf default. Nothing in mw-* namespaces, including mw-wikifunctions, uses it. (The orchestrator, running in the wikifunctions namespace, does, but that's out of scope here.)
Apr 15 2026, 8:22 PM · Patch-For-Review, Essential-Work, Abstract Wikipedia team (26Q4 (Apr–Jun)), Wikifunctions integration, WikiLambda
RLazarus added a comment to T423311: Writes to /*/wf-wan/ failing with CONNECTION FAILURE or SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY (mcrouter not being reached?).

So, the story is that

Apr 15 2026, 12:58 AM · Patch-For-Review, Essential-Work, Abstract Wikipedia team (26Q4 (Apr–Jun)), Wikifunctions integration, WikiLambda

Apr 10 2026

RLazarus closed T367880: Set AppArmor profile via SecurityContext rather than annotations (k8s >=1.30), a subtask of T341984: Update Kubernetes clusters to 1.31, as Resolved.
Apr 10 2026, 1:28 AM · Data-Platform-SRE (2026.01.05 - 2026.01.23), Epic, ServiceOps new, Patch-For-Review, collaboration-services, Kubernetes, Prod-Kubernetes
RLazarus closed T367880: Set AppArmor profile via SecurityContext rather than annotations (k8s >=1.30) as Resolved.
Apr 10 2026, 1:28 AM · Patch-For-Review, ServiceOps new, Kubernetes, Prod-Kubernetes

Apr 8 2026

RLazarus added a comment to T367880: Set AppArmor profile via SecurityContext rather than annotations (k8s >=1.30).

Weirdly, Envoy failed to start after the change, with this in the logs:

Apr 8 2026, 9:35 PM · Patch-For-Review, ServiceOps new, Kubernetes, Prod-Kubernetes

Mar 25 2026

RLazarus closed T420679: No mediawiki-on-kubernetes alerts are paging as Resolved.
Mar 25 2026, 5:36 PM · MW-on-K8s, ServiceOps-Mediawiki, ServiceOps new, Sustainability (Incident Followup)
RLazarus updated subscribers of T420498: Factor in pooled status for SLO measurements.

We were actually just talking about this in the SLOs group last week (adding @Vgutierrez and @CDanis).

Mar 25 2026, 4:22 PM · SRE-SLO, observability, Traffic

Mar 24 2026

RLazarus assigned T380299: Revisit use of the wmf-deployment Gerrit group for deployment-charts rights to wiki_willy.

In the serviceops meeting today we decided to go ahead with this.

Mar 24 2026, 12:47 AM · Infrastructure-Foundations, ServiceOps new, Kubernetes

Mar 23 2026

RLazarus closed T420982: vopsbot !ack and !resolve without incident numbers aren't working as Resolved.
Mar 23 2026, 11:06 PM · Observability-Alerting, SRE-OnFire, SRE
RLazarus created T420982: vopsbot !ack and !resolve without incident numbers aren't working.
Mar 23 2026, 6:17 PM · Observability-Alerting, SRE-OnFire, SRE

Mar 20 2026

RLazarus triaged T420679: No mediawiki-on-kubernetes alerts are paging as High priority.
Mar 20 2026, 1:49 AM · MW-on-K8s, ServiceOps-Mediawiki, ServiceOps new, Sustainability (Incident Followup)
RLazarus created T420679: No mediawiki-on-kubernetes alerts are paging.
Mar 20 2026, 1:49 AM · MW-on-K8s, ServiceOps-Mediawiki, ServiceOps new, Sustainability (Incident Followup)

Mar 17 2026

RLazarus moved T341441: Pushing mediawiki-multiversion Docker image from deploy server takes 4 minutes from Inbox to Backlog on the ServiceOps new board.
Mar 17 2026, 7:59 PM · ServiceOps-SharedInfra, ServiceOps new, Release-Engineering-Team (Radar), Scap, MW-on-K8s
RLazarus triaged T341441: Pushing mediawiki-multiversion Docker image from deploy server takes 4 minutes as Medium priority.
Mar 17 2026, 7:59 PM · ServiceOps-SharedInfra, ServiceOps new, Release-Engineering-Team (Radar), Scap, MW-on-K8s
RLazarus moved T380299: Revisit use of the wmf-deployment Gerrit group for deployment-charts rights from Inbox to Needs Info / Blocked on the ServiceOps new board.

(Service Ops triage here: moving this to Needs Info for discussion at our team meeting.)

Mar 17 2026, 5:30 PM · Infrastructure-Foundations, ServiceOps new, Kubernetes
RLazarus edited projects for T380299: Revisit use of the wmf-deployment Gerrit group for deployment-charts rights, added: ServiceOps new; removed serviceops-deprecated.
Mar 17 2026, 5:29 PM · Infrastructure-Foundations, ServiceOps new, Kubernetes
RLazarus moved T382710: Deploy portals independently of MediaWiki from Inbox to Backlog on the ServiceOps new board.
Mar 17 2026, 5:14 PM · MW-on-K8s, ServiceOps new, Wikimedia-Portals
RLazarus triaged T382710: Deploy portals independently of MediaWiki as Low priority.
Mar 17 2026, 5:14 PM · MW-on-K8s, ServiceOps new, Wikimedia-Portals
RLazarus moved T390946: Harmonise configs between API gateway and REST gateway from Inbox to Backlog on the ServiceOps new board.
Mar 17 2026, 5:07 PM · ServiceOps-SharedInfra, ServiceOps new
RLazarus triaged T390946: Harmonise configs between API gateway and REST gateway as Low priority.
Mar 17 2026, 5:07 PM · ServiceOps-SharedInfra, ServiceOps new
RLazarus added a comment to T420264: Data Platform SRE paging alerts and on-call SRE response.

One more axis to consider: Best-practices-wise, for alerting on Kubernetes platforms, there's a distinction between control plane and data plane.

Mar 17 2026, 3:27 PM · Data-Platform-SRE (2026-04-24 - 2026-05-15), SRE

Mar 13 2026

RLazarus closed T410975: Upgrade Envoy to v1.35.7, a subtask of T380211: Upgrade Envoy to >= 1.24, as Resolved.
Mar 13 2026, 2:23 AM · SRE, serviceops-deprecated, envoy
RLazarus closed T410975: Upgrade Envoy to v1.35.7 as Resolved.

Resolving; the remaining hosts will go straight to 1.35.9 in T419637 instead.

Mar 13 2026, 2:23 AM · ServiceOps-Services-Oids, ServiceOps new, SRE, envoy

Mar 12 2026

RLazarus changed the status of T419637: Upgrade Envoy to v1.35.9 from Open to In Progress.
Mar 12 2026, 7:46 PM · ServiceOps-Services-Oids, envoy, ServiceOps new
RLazarus moved T411058: Can't deploy machinetranslation due to exceeding resource quotas from Backlog to Radar (Pending) on the ServiceOps new board.
Mar 12 2026, 6:35 PM · ServiceOps new, LPL Essential (FY2025-26 Q3&4), LPL Projects (Other), Unplanned-Sprint-Work, MinT, Prod-Kubernetes, SRE
RLazarus moved T416623: Decommission NodeJS IPoid service from Radar (Awareness) to Radar (Pending) on the ServiceOps new board.
Mar 12 2026, 5:13 PM · Essential-Work, Product Safety and Integrity, ServiceOps-Services-Oids, ServiceOps new, iPoid-Service (IPoid OpenSearch)
RLazarus assigned T419747: Possible hardware issues on wikikube-worker2332.codfw.wmnet to Scott_French.
Mar 12 2026, 5:10 PM · SRE, ops-codfw, DC-Ops, ServiceOps new
RLazarus assigned T419058: Prepare packages and production images for ICU 72 upgrade to Scott_French.
Mar 12 2026, 5:06 PM · Essential-Work, User-Raine, ServiceOps new, ServiceOps-Upgrades-Hardware, ServiceOps-Mediawiki
RLazarus changed the status of T419831: Create a memcached::mediawiki:wikifunctions role from Open to In Progress.
Mar 12 2026, 4:16 PM · ServiceOps-Datastores, ServiceOps new
RLazarus merged T419784: Change Wikifunctions k8s pods apparmor annotation to a config field, former is deprecated since k8s 1.30 into T367880: Set AppArmor profile via SecurityContext rather than annotations (k8s >=1.30).
Mar 12 2026, 2:33 AM · Patch-For-Review, ServiceOps new, Kubernetes, Prod-Kubernetes
RLazarus merged task T419784: Change Wikifunctions k8s pods apparmor annotation to a config field, former is deprecated since k8s 1.30 into T367880: Set AppArmor profile via SecurityContext rather than annotations (k8s >=1.30).
Mar 12 2026, 2:33 AM · Kubernetes, Abstract Wikipedia team, Essential-Work, function-orchestrator, function-evaluator
RLazarus updated subscribers of T419781: Lots of Wikifunctions k8s pods in staging stuck in "Terminating", some for 14 days+.

Unsurprisingly /var/log/kern.log on kubestage1006 (hosting the above example pod) is full of lines like:

Mar 12 2026, 2:30 AM · Patch-For-Review, Prod-Kubernetes, ServiceOps-Services-Oids, Kubernetes, ServiceOps new, Abstract Wikipedia team
RLazarus changed the status of T419781: Lots of Wikifunctions k8s pods in staging stuck in "Terminating", some for 14 days+ from Open to In Progress.
Mar 12 2026, 2:10 AM · Patch-For-Review, Prod-Kubernetes, ServiceOps-Services-Oids, Kubernetes, ServiceOps new, Abstract Wikipedia team
RLazarus added a comment to T419781: Lots of Wikifunctions k8s pods in staging stuck in "Terminating", some for 14 days+.

The actual profiles are at modules/profile/files/kubernetes/node/wikifunctions-evaluator and .../wikifunctions-orchestrator (which is the same except s/evaluator/orchestrator/g). The staging and prod containers have the same apparmor annotation[1] so should have the same signals policy, and anyway it looks correct at a glance.

Mar 12 2026, 1:57 AM · Patch-For-Review, Prod-Kubernetes, ServiceOps-Services-Oids, Kubernetes, ServiceOps new, Abstract Wikipedia team
RLazarus added projects to T419781: Lots of Wikifunctions k8s pods in staging stuck in "Terminating", some for 14 days+: ServiceOps-Services-Oids, Prod-Kubernetes.

rzl@deploy2002:~$ kube-env wikifunctions staging
rzl@deploy2002:~$ kubectl describe pod function-evaluator-javascript-evaluator-58c586f4c5-zgzvp
[ output trimmed to just the relevant lines: ]
Containers:
  function-evaluator-javascript-evaluator:
    Container ID:    containerd://31d8b176b3d2c8055e2a6c0d353eb9cf3964a4deb8e322c22316098b6adc6eb9
    State:           Running
  function-evaluator-javascript-evaluator-tls-proxy:
    Container ID:    containerd://ea12e56b08b979b586254f08b8ac7fb9011d9adc3e4370c5d8fd0d0237be2ac3
    State:           Terminated
      Reason:        Completed
      Exit Code:     0
Events:
  Type     Reason         Age                    From     Message
  ----     ------         ----                   ----     -------
  Warning  FailedMount    54m (x303 over 10h)    kubelet  MountVolume.SetUp failed for volume "tls-certs-volume" : object "wikifunctions"/"function-evaluator-javascript-evaluator-tls-proxy-certs" not registered
  Warning  FailedMount    44m (x308 over 10h)    kubelet  MountVolume.SetUp failed for volume "envoy-config-volume" : object "wikifunctions"/"function-evaluator-javascript-evaluator-envoy-config-volume" not registered
  Normal   Killing        14m (x895 over 10h)    kubelet  Stopping container function-evaluator-javascript-evaluator
  Warning  FailedKillPod  4m48s (x908 over 10h)  kubelet  error killing pod: [failed to "KillContainer" for "function-evaluator-javascript-evaluator" with KillContainerError: "rpc error: code = Unknown desc = failed to kill container \"31d8b176b3d2c8055e2a6c0d353eb9cf3964a4deb8e322c22316098b6adc6eb9\": unknown error after kill: runc did not terminate successfully: exit status 1: unable to signal init: permission denied\n: unknown", failed to "KillPodSandbox" for "fb689e76-3315-4fa2-8157-772ff7a8a45d" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to stop container \"31d8b176b3d2c8055e2a6c0d353eb9cf3964a4deb8e322c22316098b6adc6eb9\": failed to kill container \"31d8b176b3d2c8055e2a6c0d353eb9cf3964a4deb8e322c22316098b6adc6eb9\": unknown error after kill: runc did not terminate successfully: exit status 1: unable to signal init: permission denied\n: unknown"]
Mar 12 2026, 1:31 AM · Patch-For-Review, Prod-Kubernetes, ServiceOps-Services-Oids, Kubernetes, ServiceOps new, Abstract Wikipedia team

Mar 11 2026

RLazarus removed a project from T419647: Eqiad: lsw1-d2-eqiad BGP maintenance: ServiceOps new.

Service Ops triage here: Agreed there's nothing for us to do, thanks @ayounsi - untagging us.

Mar 11 2026, 4:24 PM · netops, Infrastructure-Foundations, SRE

Mar 10 2026

RLazarus moved T419637: Upgrade Envoy to v1.35.9 from Inbox to Scheduled (this Q) on the ServiceOps new board.
Mar 10 2026, 11:51 PM · ServiceOps-Services-Oids, envoy, ServiceOps new
RLazarus created T419637: Upgrade Envoy to v1.35.9.
Mar 10 2026, 11:50 PM · ServiceOps-Services-Oids, envoy, ServiceOps new
RLazarus added a comment to T417163: Noise in #wikimedia-operations is making incident response more difficult.

This makes tracking response difficult, and in some situations recently we have had to move to #wikimedia-sre in order to communicate properly. If we wish to pursue this as an official pattern, it needs to be documented and recorded.

Mar 10 2026, 9:01 PM · SRE-Unowned, SRE, Sustainability (Incident Followup)
RLazarus added a comment to F32455073: find_collations.py.

Here you go: https://gitlab.wikimedia.org/repos/sre/serviceops-kitchensink/-/merge_requests/28

Mar 10 2026, 6:44 PM