
Scott_French (Scott French)
User

Projects (9)

User Details

User Since: Jan 18 2024, 5:33 PM (107 w, 1 d)
Availability: Available
LDAP User: Scott French
MediaWiki User: SFrench-WMF

Recent Activity

Today

Scott_French created T416752: WE6.2.9: Adopt node.js service-utils.
Fri, Feb 6, 11:14 PM · ServiceOps-SharedInfra, Epic, ServiceOps new
Scott_French added a comment to T397685: helmfile/scap does not reliably bootstrap mediawiki.

So, I'd say the main problem is really that we've introduced tight coupling between releases, which makes bootstrapping challenging since it forces sequencing. Investing in loosening that coupling (best possible solution), or ensuring that the appropriate tooling understands those constraints, seems like the right path here.

Fri, Feb 6, 6:39 PM · ServiceOps-Mediawiki, ServiceOps new, MW-on-K8s, Release-Engineering-Team, Scap

Wed, Feb 4

Scott_French added a comment to T410296: Significant increase in wikifeeds latency and mobileapps error rate since 2025/11/13.

Roughly 9h after switching back to node 22 with --max-old-space-size=4096 and --max-semi-space-size=16, we're seeing some interesting results:

Wed, Feb 4, 2:46 AM · Wikimedia-production-error, ServiceOps new, Wikipedia-Android-App-Backlog, Content-Transform-Team, Wikifeeds

Tue, Feb 3

Scott_French closed T403220: Introduce known-client identity objects and integrate with requestctl as Resolved.

Yes, I believe we can mark this resolved, now that the feature is in wider use and seems to be working as expected.

Tue, Feb 3, 3:06 PM · Patch-For-Review, Traffic, Hiddenparma
Scott_French closed T403220: Introduce known-client identity objects and integrate with requestctl, a subtask of T400100: FY 25/26 WE 5.4.2: Known bots / clients, as Resolved.
Tue, Feb 3, 3:06 PM · Epic, ServiceOps new, SRE
Scott_French added a comment to T397685: helmfile/scap does not reliably bootstrap mediawiki.

Alright, I've updated the task description on T405703: Update wikikube eqiad to kubernetes 1.31 to reflect two points:

  • During the "Deploy mediawiki" phase, the sequencing constraints we've discussed here, together with example commands for bringing up the support releases.
  • During the "Deploy all the services" phase, charlie in its current form will operate on all mediawiki services as well, which is probably not what we want in practice if we want to do that via scap (or, if we do want to operate on them in this phase, that should be possible if we move the support-release bring-up earlier to ensure it happens first).
Tue, Feb 3, 1:11 AM · ServiceOps-Mediawiki, ServiceOps new, MW-on-K8s, Release-Engineering-Team, Scap
Scott_French updated the task description for T405703: Update wikikube eqiad to kubernetes 1.31.
Tue, Feb 3, 12:58 AM · Discovery-Search (2025.09.26 - 2025.10.17), Data-Platform-SRE (2025.09.26 - 2025.10.17), Patch-For-Review, collaboration-services, Kubernetes, Prod-Kubernetes, serviceops
Scott_French updated the task description for T405703: Update wikikube eqiad to kubernetes 1.31.
Tue, Feb 3, 12:46 AM · Discovery-Search (2025.09.26 - 2025.10.17), Data-Platform-SRE (2025.09.26 - 2025.10.17), Patch-For-Review, collaboration-services, Kubernetes, Prod-Kubernetes, serviceops

Mon, Feb 2

Scott_French added a comment to T412951: Move the docker registry's /restricted prefix to Docker Distribution backed up by Ceph.

Thank you, @elukey!

Mon, Feb 2, 10:56 PM · Patch-For-Review, Epic, Kubernetes, ServiceOps new, Release-Engineering-Team (Radar), Ceph, SRE-swift-storage
Scott_French added a comment to T410296: Significant increase in wikifeeds latency and mobileapps error rate since 2025/11/13.

Thanks for the additional investigation, @Jgiannelos.

Mon, Feb 2, 5:58 PM · Wikimedia-production-error, ServiceOps new, Wikipedia-Android-App-Backlog, Content-Transform-Team, Wikifeeds
Scott_French moved T410296: Significant increase in wikifeeds latency and mobileapps error rate since 2025/11/13 from In Progress to Radar on the ServiceOps new board.
Mon, Feb 2, 3:27 PM · Wikimedia-production-error, ServiceOps new, Wikipedia-Android-App-Backlog, Content-Transform-Team, Wikifeeds

Fri, Jan 23

Scott_French added a comment to T390251: docker-registry.wikimedia.org keeps serving bad blobs.

I've now also merged T406392, for the same reason.

Fri, Jan 23, 2:18 AM · ServiceOps new, Patch-For-Review
Scott_French merged T406392: failed to push docker-registry.discovery.wmnet/repos/data-engineering/airflow-dags:airflow-2.10.5-py3.11-2025-10-03-192132-3003d4328df66a0086a350fdd2ba1dbd80a235c5: unknown: blob upload invalid into T390251: docker-registry.wikimedia.org keeps serving bad blobs.
Fri, Jan 23, 12:56 AM · ServiceOps new, Patch-For-Review
Scott_French merged task T406392: failed to push docker-registry.discovery.wmnet/repos/data-engineering/airflow-dags:airflow-2.10.5-py3.11-2025-10-03-192132-3003d4328df66a0086a350fdd2ba1dbd80a235c5: unknown: blob upload invalid into T390251: docker-registry.wikimedia.org keeps serving bad blobs.
Fri, Jan 23, 12:56 AM · Kubernetes, ServiceOps new, GitLab (CI & Job Runners)
Scott_French added a comment to T406392: failed to push docker-registry.discovery.wmnet/repos/data-engineering/airflow-dags:airflow-2.10.5-py3.11-2025-10-03-192132-3003d4328df66a0086a350fdd2ba1dbd80a235c5: unknown: blob upload invalid.

Since this is fundamentally the same class of failure mode as already tracked in T390251, I am going to duplicate this into the latter as canonical.

Fri, Jan 23, 12:56 AM · Kubernetes, ServiceOps new, GitLab (CI & Job Runners)

Thu, Jan 22

Scott_French added a comment to T410296: Significant increase in wikifeeds latency and mobileapps error rate since 2025/11/13.

A couple of hours after @Jgiannelos set --max-old-space-size (and deployed the new node 22-based image), we're seeing cyclic latency excursions as measured from Envoy's view on the Wikifeeds side of things, which again seem to correlate with bumps in CPU and memory (note: these are totals, not per-pod behavior).

Thu, Jan 22, 10:59 PM · Wikimedia-production-error, ServiceOps new, Wikipedia-Android-App-Backlog, Content-Transform-Team, Wikifeeds
Scott_French added a comment to T390251: docker-registry.wikimedia.org keeps serving bad blobs.

I've merged T412265: Pushing to the docker registry fails with 500 Internal Server Error into this task, as we believe it's another manifestation of the same class of failure modes discussed here.

Thu, Jan 22, 4:18 PM · ServiceOps new, Patch-For-Review
Scott_French merged T412265: Pushing to the docker registry fails with 500 Internal Server Error into T390251: docker-registry.wikimedia.org keeps serving bad blobs.
Thu, Jan 22, 4:10 PM · ServiceOps new, Patch-For-Review
Scott_French merged task T412265: Pushing to the docker registry fails with 500 Internal Server Error into T390251: docker-registry.wikimedia.org keeps serving bad blobs.
Thu, Jan 22, 4:09 PM · ServiceOps-SharedInfra, ServiceOps new, SRE, MW-on-K8s
Scott_French added a comment to T412265: Pushing to the docker registry fails with 500 Internal Server Error.

Since this is fundamentally the same class of failure mode as already tracked and reported in T390251, I am going to duplicate this into the latter as canonical.

Thu, Jan 22, 4:09 PM · ServiceOps-SharedInfra, ServiceOps new, SRE, MW-on-K8s
Scott_French assigned T398592: Review the behaviour of foreachwikiindblist in mw-cron to Urbanecm_WMF.

@Urbanecm_WMF - Could you please confirm whether #1 from T398592#11539714 is correct or not? We'd like to confirm that the immediate-term need has been met. Longer term, we would try to prioritize #2 once that functionality exists in the relevant maintenance scripts. Please unassign once you've responded.

Thu, Jan 22, 4:00 PM · ServiceOps-Mediawiki, ServiceOps new, MW-on-K8s
Scott_French removed projects from T415169: Transcode jobs failing with Wikimedia\Rdbms\DBTransactionError: Transaction round stage must be 'cursory' (not 'within-commit'): ServiceOps new, serviceops.
Thu, Jan 22, 3:26 PM · Reader Growth Team, TimedMediaHandler, MW-Interfaces-Team, Wikimedia-production-error

Wed, Jan 21

Scott_French added a comment to T415169: Transcode jobs failing with Wikimedia\Rdbms\DBTransactionError: Transaction round stage must be 'cursory' (not 'within-commit').

Ah, of course ... the "inner" commitPrimaryChanges call (i.e., from within WebVideoTranscodeJob::actuallyRun) raises DBTransactionError due to the name mismatch. This is then caught in run which returns (rather than re-raises), while trxRoundStage is left in a bad state, leading to the effect seen here. Good find!
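
The failure pattern described in this comment can be illustrated with a minimal sketch. This is a toy Python model, not the MediaWiki code: the class, the function names, and the owner strings are all hypothetical, and the only behavior taken from the comment is that an inner commit raises before the round stage is reset, the caller swallows the exception, and a later commit then finds the stage in a bad state.

```python
CURSORY = "cursory"
WITHIN_COMMIT = "within-commit"


class TransactionRoundError(Exception):
    """Hypothetical stand-in for an error such as DBTransactionError."""


class RoundManager:
    """Toy model of a transaction-round state machine (assumed design)."""

    def __init__(self):
        self.stage = CURSORY
        self.round_owner = None

    def begin(self, owner):
        self.round_owner = owner

    def commit(self, owner):
        if self.stage != CURSORY:
            raise TransactionRoundError(
                f"Transaction round stage must be '{CURSORY}' (not '{self.stage}')"
            )
        self.stage = WITHIN_COMMIT
        if owner != self.round_owner:
            # Raised after the stage was advanced and before it is reset,
            # so the manager stays in WITHIN_COMMIT.
            raise TransactionRoundError(
                f"round owned by {self.round_owner!r}, not {owner!r}"
            )
        self.stage = CURSORY


def run_job(manager):
    """Hypothetical job body: catches the inner error and returns."""
    try:
        manager.commit("inner-name")  # name mismatch with the outer round
    except TransactionRoundError:
        return False  # swallowed rather than re-raised
    return True


def execute(manager):
    """Hypothetical executor: commits again after the job returns."""
    manager.begin("outer-name")
    run_job(manager)
    manager.commit("outer-name")  # fails: stage was never reset
```

In this sketch the second error is only a symptom: the root cause is the swallowed exception in `run_job`, which hides the name mismatch and leaves the state machine stuck.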

Wed, Jan 21, 11:25 PM · Reader Growth Team, TimedMediaHandler, MW-Interfaces-Team, Wikimedia-production-error
Scott_French moved T397685: helmfile/scap does not reliably bootstrap mediawiki from Inbox to In Progress on the ServiceOps new board.
Wed, Jan 21, 9:27 PM · ServiceOps-Mediawiki, ServiceOps new, MW-on-K8s, Release-Engineering-Team, Scap
Scott_French claimed T397685: helmfile/scap does not reliably bootstrap mediawiki.

Revisiting this, I believe we understand what happened. What remains to be decided is what we plan to do about it, if anything.

Wed, Jan 21, 9:27 PM · ServiceOps-Mediawiki, ServiceOps new, MW-on-K8s, Release-Engineering-Team, Scap
Scott_French moved T397684: Build something capable of deploying all services to a cluster from Inbox to Needs Info / Blocked on the ServiceOps new board.
Wed, Jan 21, 8:59 PM · ServiceOps new, Kubernetes, Prod-Kubernetes
Scott_French assigned T397684: Build something capable of deploying all services to a cluster to Clement_Goubert.

I suspect this need is now met by Charlie, but am not certain.

Wed, Jan 21, 8:59 PM · ServiceOps new, Kubernetes, Prod-Kubernetes
Scott_French triaged T397684: Build something capable of deploying all services to a cluster as Medium priority.
Wed, Jan 21, 8:53 PM · ServiceOps new, Kubernetes, Prod-Kubernetes
Scott_French added a comment to T397874: Assess switchover behavior for mw-wikifunctions.

Speculatively moving this to Scheduled, since it would be good to make the respective documentation changes prior to the upcoming switchover.

Wed, Jan 21, 8:51 PM · ServiceOps new, Datacenter-Switchover
Scott_French moved T397874: Assess switchover behavior for mw-wikifunctions from Inbox to Scheduled (this Q) on the ServiceOps new board.
Wed, Jan 21, 8:45 PM · ServiceOps new, Datacenter-Switchover
Scott_French triaged T397874: Assess switchover behavior for mw-wikifunctions as Low priority.
Wed, Jan 21, 8:43 PM · ServiceOps new, Datacenter-Switchover
Scott_French moved T397937: Some wikifeeds endpoints very sensitive to mobileapps latency from Inbox to Radar on the ServiceOps new board.
Wed, Jan 21, 8:41 PM · ServiceOps-Services-Oids, ServiceOps new, Wikifeeds
Scott_French edited projects for T397937: Some wikifeeds endpoints very sensitive to mobileapps latency, added: ServiceOps new, ServiceOps-Services-Oids; removed serviceops.
Wed, Jan 21, 8:40 PM · ServiceOps-Services-Oids, ServiceOps new, Wikifeeds
Scott_French renamed T375285: sre.discovery.datacenter should handle depooled dnsbox hosts from sre.discovery.datacenter should handle depooled authdns hosts to sre.discovery.datacenter should handle depooled dnsbox hosts.
Wed, Jan 21, 8:06 PM · ServiceOps new, Patch-For-Review, Datacenter-Switchover
Scott_French moved T375285: sre.discovery.datacenter should handle depooled dnsbox hosts from Needs Info / Blocked to Backlog on the ServiceOps new board.
Wed, Jan 21, 7:57 PM · ServiceOps new, Patch-For-Review, Datacenter-Switchover
Scott_French placed T375285: sre.discovery.datacenter should handle depooled dnsbox hosts up for grabs.

Triaging as "Low" since, in practice, the main issue we've run into historically is DNS hosts that do not respond at all, which is (now) addressed by setting proper query timeouts. Also moving to backlog.

Wed, Jan 21, 7:57 PM · ServiceOps new, Patch-For-Review, Datacenter-Switchover
Scott_French added a parent task for T375014: Support listing pooled / active authdns hosts (rather than all): T375285: sre.discovery.datacenter should handle depooled dnsbox hosts.
Wed, Jan 21, 7:54 PM · Patch-For-Review, Infrastructure-Foundations, Spicerack, SRE-tools
Scott_French added a subtask for T375285: sre.discovery.datacenter should handle depooled dnsbox hosts: T375014: Support listing pooled / active authdns hosts (rather than all).
Wed, Jan 21, 7:54 PM · ServiceOps new, Patch-For-Review, Datacenter-Switchover
Scott_French reopened T375285: sre.discovery.datacenter should handle depooled dnsbox hosts as "Open".

Reopening, since we'll likely need to make changes to sre.discovery.datacenter to adopt the functionality discussed in T375014 once it lands in Spicerack. I'm making that dependency explicit now, and will update the description shortly.

Wed, Jan 21, 7:52 PM · ServiceOps new, Patch-For-Review, Datacenter-Switchover
Scott_French added a comment to T415169: Transcode jobs failing with Wikimedia\Rdbms\DBTransactionError: Transaction round stage must be 'cursory' (not 'within-commit').

Confirmed that transcodes are completing once again following the rollback.

Wed, Jan 21, 6:29 PM · Reader Growth Team, TimedMediaHandler, MW-Interfaces-Team, Wikimedia-production-error
Scott_French added a comment to T415169: Transcode jobs failing with Wikimedia\Rdbms\DBTransactionError: Transaction round stage must be 'cursory' (not 'within-commit').

From a quick scan of the code, we're somehow entering the commitPrimaryChanges call right after WebVideoTranscodeJob::run returns in JobExecutor::execute while still in LBFactory::ROUND_COMMITTING. That feels like a call to commitPrimaryChanges from within WebVideoTranscodeJob threw before resetting trxRoundStage, but it was caught.

Wed, Jan 21, 6:17 PM · Reader Growth Team, TimedMediaHandler, MW-Interfaces-Team, Wikimedia-production-error
Scott_French added a comment to T413803: 1.46.0-wmf.12 deployment blockers.

Following up from T415169, I suspect video transcodes are no longer completing due to this error. Can we roll wmf.12 back to group 0? (i.e., get commons back to wmf.11)

Wed, Jan 21, 5:52 PM · Release-Engineering-Team (Doing 😎), Essential-Work, Release, Train Deployments
Scott_French added a project to T415169: Transcode jobs failing with Wikimedia\Rdbms\DBTransactionError: Transaction round stage must be 'cursory' (not 'within-commit'): TimedMediaHandler.

These are indeed all WebVideoTranscodeJob. From a quick spot-check of Special:NewFiles on commons, together with the videoscaling error rates over the last 12h, I don't think transcodes are succeeding.

Wed, Jan 21, 5:48 PM · Reader Growth Team, TimedMediaHandler, MW-Interfaces-Team, Wikimedia-production-error
Scott_French added a comment to T398611: Migrate all memcached* clusters to nftables.

Updated the task description to reflect the point about the hardware refresh (and mark the gutter pool complete). This will still need info from @jijiki on feasibility of substantially completing the entire scope of work in Q3.

Wed, Jan 21, 3:48 PM · User-jijiki, Serviceops-easywins, ServiceOps-Datastores, ServiceOps new
Scott_French renamed T398611: Migrate all memcached* clusters to nftables from Migrate all memcached* clusters to nftables to Migrate all memcached* clusters to nftables.
Wed, Jan 21, 3:45 PM · User-jijiki, Serviceops-easywins, ServiceOps-Datastores, ServiceOps new

Tue, Jan 20

Scott_French moved T398073: Ensure DPE SRE can receive alerts for applications hosted in wikikube from Inbox to Needs Info / Blocked on the ServiceOps new board.
Tue, Jan 20, 11:22 PM · Data-Platform-SRE (2026.01.23 - 2026.02.13), Kubernetes, ServiceOps new, Essential-Work, SRE Observability (FY2025/2026-Q1)
Scott_French added a comment to T398073: Ensure DPE SRE can receive alerts for applications hosted in wikikube.

My understanding is that the discussion here has converged on a "sketch of an implementation" - i.e., expose a namespace-to-team mapping by way of kube-state-metrics, which is then available to join with for alerts that require team-level (non-default) routing. However, the former is currently blocked on T303744.

Tue, Jan 20, 11:22 PM · Data-Platform-SRE (2026.01.23 - 2026.02.13), Kubernetes, ServiceOps new, Essential-Work, SRE Observability (FY2025/2026-Q1)
Scott_French edited projects for T398073: Ensure DPE SRE can receive alerts for applications hosted in wikikube, added: ServiceOps new, Kubernetes; removed serviceops.
Tue, Jan 20, 11:00 PM · Data-Platform-SRE (2026.01.23 - 2026.02.13), Kubernetes, ServiceOps new, Essential-Work, SRE Observability (FY2025/2026-Q1)
Scott_French added a comment to T398592: Review the behaviour of foreachwikiindblist in mw-cron.

It seems like there are two different time horizons to this task, as initially framed:

  1. In the near term, restoring the earlier behavior on mwmaint hosts (i.e., Problem #4) as an error-handling model that may be more appropriate for Growth-Team's use case. My understanding is that this is possible using FOREACHWIKI_IGNORE_ERRORS per T398592#10971912, but please correct me if I'm wrong here.
  2. Introducing a more sophisticated error-handling model that, for example, allows different error conditions / exceptions to surface different exit status codes (i.e., changes to maintenance scripts themselves and / or the abstractions they're built on) and accompanying support in mw-cron for status-dependent handling (Problem #3).
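
The second item can be sketched roughly as follows. This is an illustrative Python model only: the real maintenance scripts and mw-cron are not written this way, and every name here (the `ExitStatus` values, the exception classes, `run_script`, `foreach_wiki`) is an assumption made for the example. It shows different exception types mapping to distinct exit statuses, with a per-wiki loop that continues past statuses marked as ignorable and stops on anything else.

```python
from enum import IntEnum


class ExitStatus(IntEnum):
    # Hypothetical status codes chosen for illustration.
    OK = 0
    GENERIC_FAILURE = 1
    TRANSIENT = 75
    BAD_CONFIG = 78


class TransientError(Exception):
    """Hypothetical: a failure that is expected to clear on retry."""


class ConfigError(Exception):
    """Hypothetical: a failure that needs human attention."""


def run_script(task):
    """Run one task and translate its outcome into an exit status."""
    try:
        task()
    except TransientError:
        return ExitStatus.TRANSIENT
    except ConfigError:
        return ExitStatus.BAD_CONFIG
    except Exception:
        return ExitStatus.GENERIC_FAILURE
    return ExitStatus.OK


def foreach_wiki(wikis, task_for, ignorable=frozenset({ExitStatus.TRANSIENT})):
    """Run a task per wiki, stopping at the first non-ignorable failure."""
    results = {}
    for wiki in wikis:
        status = run_script(task_for(wiki))
        results[wiki] = status
        if status != ExitStatus.OK and status not in ignorable:
            break
    return results
```

Passing an `ignorable` set that contains every failure status approximates the "ignore all errors" behavior from the first item, while an empty set stops at the first failure.
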
Tue, Jan 20, 10:54 PM · ServiceOps-Mediawiki, ServiceOps new, MW-on-K8s
Scott_French moved T398592: Review the behaviour of foreachwikiindblist in mw-cron from Inbox to Needs Info / Blocked on the ServiceOps new board.
Tue, Jan 20, 10:43 PM · ServiceOps-Mediawiki, ServiceOps new, MW-on-K8s
Scott_French edited projects for T398592: Review the behaviour of foreachwikiindblist in mw-cron, added: ServiceOps new, ServiceOps-Mediawiki; removed serviceops.
Tue, Jan 20, 10:42 PM · ServiceOps-Mediawiki, ServiceOps new, MW-on-K8s
Scott_French assigned T397916: Provide functionality to apply specific patches/gerrit changes to mw-experimental to jijiki.

@jijiki - Do you think you might have enough info at this point to enable a more detailed description? (i.e., sufficient to at least drive sizing / scoping, particularly if there's an opportunity for T386246 to also benefit from this)

Tue, Jan 20, 10:34 PM · ServiceOps-Mediawiki, ServiceOps new
Scott_French moved T397916: Provide functionality to apply specific patches/gerrit changes to mw-experimental from Inbox to Needs Info / Blocked on the ServiceOps new board.
Tue, Jan 20, 10:30 PM · ServiceOps-Mediawiki, ServiceOps new
Scott_French edited projects for T397916: Provide functionality to apply specific patches/gerrit changes to mw-experimental, added: ServiceOps new, ServiceOps-Mediawiki; removed serviceops.
Tue, Jan 20, 10:29 PM · ServiceOps-Mediawiki, ServiceOps new
Scott_French moved T397735: mobileapps returns 501 on unsupported language from Inbox to Radar on the ServiceOps new board.
Tue, Jan 20, 10:22 PM · ServiceOps-Services-Oids, ServiceOps new, Page Content Service
Scott_French edited projects for T397735: mobileapps returns 501 on unsupported language, added: ServiceOps new, ServiceOps-Services-Oids; removed serviceops.
Tue, Jan 20, 10:22 PM · ServiceOps-Services-Oids, ServiceOps new, Page Content Service
Scott_French assigned T398611: Migrate all memcached* clusters to nftables to jijiki.

@jijiki - Is this something you anticipate we might be picking back up in Q3? Also, is there anything blocking this other than just finding time? (I see the gutter-pool change was merged and doesn't appear to have been reverted.)

Tue, Jan 20, 10:19 PM · User-jijiki, Serviceops-easywins, ServiceOps-Datastores, ServiceOps new
Scott_French moved T398611: Migrate all memcached* clusters to nftables from Inbox to Needs Info / Blocked on the ServiceOps new board.
Tue, Jan 20, 10:16 PM · User-jijiki, Serviceops-easywins, ServiceOps-Datastores, ServiceOps new
Scott_French edited projects for T398611: Migrate all memcached* clusters to nftables, added: ServiceOps new, ServiceOps-Datastores; removed serviceops.
Tue, Jan 20, 10:15 PM · User-jijiki, Serviceops-easywins, ServiceOps-Datastores, ServiceOps new
Scott_French added a comment to T412951: Move the docker registry's /restricted prefix to Docker Distribution backed up by Ceph.

Thank you very much @elukey - that's great news!

Tue, Jan 20, 10:01 PM · Patch-For-Review, Epic, Kubernetes, ServiceOps new, Release-Engineering-Team (Radar), Ceph, SRE-swift-storage
Scott_French placed T384294: Add support for JIT in PHP 8.4 images up for grabs.
Tue, Jan 20, 4:35 PM · ServiceOps-Upgrades-Hardware, ServiceOps new, MediaWiki-Platform-Team (Radar), Wikimedia-Performance-recommendation, Patch-For-Review
Scott_French moved T384294: Add support for JIT in PHP 8.4 images from Needs Info / Blocked to Backlog on the ServiceOps new board.
Tue, Jan 20, 4:35 PM · ServiceOps-Upgrades-Hardware, ServiceOps new, MediaWiki-Platform-Team (Radar), Wikimedia-Performance-recommendation, Patch-For-Review
Scott_French added a comment to T384294: Add support for JIT in PHP 8.4 images.

@MLechvien-WMF - Thanks for checking. Agreed with Timo in T384294#10494399 that this should wait until we're on PHP 8.4 or later, which is when the re-written JIT compiler lands. That means some time during / after H1 FY26-27. I've updated the task description to highlight this and am moving it to backlog.

Tue, Jan 20, 4:34 PM · ServiceOps-Upgrades-Hardware, ServiceOps new, MediaWiki-Platform-Team (Radar), Wikimedia-Performance-recommendation, Patch-For-Review
Scott_French updated the task description for T384294: Add support for JIT in PHP 8.4 images.
Tue, Jan 20, 3:24 PM · ServiceOps-Upgrades-Hardware, ServiceOps new, MediaWiki-Platform-Team (Radar), Wikimedia-Performance-recommendation, Patch-For-Review

Thu, Jan 15

Scott_French triaged T414649: service-utils .d.ts files elide doc comments for certain re-exported declarations as Low priority.
Thu, Jan 15, 1:00 AM · service-utils
Scott_French created T414649: service-utils .d.ts files elide doc comments for certain re-exported declarations.
Thu, Jan 15, 1:00 AM · service-utils

Wed, Jan 14

Scott_French moved T368096: mediawiki: migrate from image-suggestion to data-gateway from Inbox to Scheduled (this Q) on the ServiceOps new board.
Wed, Jan 14, 9:52 PM · ServiceOps-Services-Oids, ServiceOps new, MW-1.45-notes (1.45.0-wmf.15; 2025-08-19), Patch-For-Review, Growth-Team, Cassandra
Scott_French edited projects for T368096: mediawiki: migrate from image-suggestion to data-gateway, added: ServiceOps new, ServiceOps-Services-Oids; removed serviceops.
Wed, Jan 14, 9:52 PM · ServiceOps-Services-Oids, ServiceOps new, MW-1.45-notes (1.45.0-wmf.15; 2025-08-19), Patch-For-Review, Growth-Team, Cassandra
Scott_French closed T405125: Clean up Shellbox PHP 8.1 service image builds as Resolved.
Wed, Jan 14, 9:48 PM · ServiceOps-Services-Oids, ServiceOps new, Shellbox
Scott_French added a comment to T405125: Clean up Shellbox PHP 8.1 service image builds.

This happened in the patch series starting at https://gerrit.wikimedia.org/r/c/mediawiki/libs/Shellbox/+/1211261.

Wed, Jan 14, 9:47 PM · ServiceOps-Services-Oids, ServiceOps new, Shellbox
Scott_French edited projects for T405125: Clean up Shellbox PHP 8.1 service image builds, added: ServiceOps new, ServiceOps-Services-Oids; removed serviceops.
Wed, Jan 14, 9:40 PM · ServiceOps-Services-Oids, ServiceOps new, Shellbox
Scott_French moved T412586: rest gateway: implement cost-based rate limits from Needs Info / Blocked to Scheduled (this Q) on the ServiceOps new board.
Wed, Jan 14, 4:07 PM · MediaWiki-Platform-Team (Radar), ServiceOps new, Traffic, Epic, OKR-Work, MW-Interfaces-Team, FY2025-26 KR 5.1
Scott_French moved T410038: Alert Management Review and Improvement for ServiceOps from Needs Info / Blocked to Scheduled (this Q) on the ServiceOps new board.
Wed, Jan 14, 4:04 PM · User-jijiki, Incident Tooling, Epic, ServiceOps new
Scott_French triaged T412586: rest gateway: implement cost-based rate limits as Low priority.
Wed, Jan 14, 4:04 PM · MediaWiki-Platform-Team (Radar), ServiceOps new, Traffic, Epic, OKR-Work, MW-Interfaces-Team, FY2025-26 KR 5.1

Tue, Jan 13

Scott_French moved T412524: New WMF docker registry credentials from Inbox to Radar on the ServiceOps new board.
Tue, Jan 13, 8:16 PM · Kubernetes, ServiceOps new, SRE
Scott_French added a project to T412524: New WMF docker registry credentials: Kubernetes.
Tue, Jan 13, 8:16 PM · Kubernetes, ServiceOps new, SRE
Scott_French edited projects for T412524: New WMF docker registry credentials, added: ServiceOps new; removed serviceops.
Tue, Jan 13, 8:15 PM · Kubernetes, ServiceOps new, SRE
Scott_French moved T412585: Epic: Enforce API rate limits (WE5.1.3c) from Inbox to Radar on the ServiceOps new board.
Tue, Jan 13, 8:08 PM · MediaWiki-Platform-Team (Radar), ServiceOps new, Traffic, Epic, OKR-Work, MW-Interfaces-Team, FY2025-26 KR 5.1
Scott_French edited projects for T412585: Epic: Enforce API rate limits (WE5.1.3c), added: ServiceOps new; removed serviceops.
Tue, Jan 13, 8:07 PM · MediaWiki-Platform-Team (Radar), ServiceOps new, Traffic, Epic, OKR-Work, MW-Interfaces-Team, FY2025-26 KR 5.1
Scott_French added a comment to T412586: rest gateway: implement cost-based rate limits.

@Clement_Goubert @daniel - If you could provide more detail on sizing, timing, and priority at your convenience, that would be greatly appreciated.

Tue, Jan 13, 8:05 PM · MediaWiki-Platform-Team (Radar), ServiceOps new, Traffic, Epic, OKR-Work, MW-Interfaces-Team, FY2025-26 KR 5.1
Scott_French moved T412586: rest gateway: implement cost-based rate limits from Inbox to Needs Info / Blocked on the ServiceOps new board.
Tue, Jan 13, 8:03 PM · MediaWiki-Platform-Team (Radar), ServiceOps new, Traffic, Epic, OKR-Work, MW-Interfaces-Team, FY2025-26 KR 5.1
Scott_French edited projects for T412586: rest gateway: implement cost-based rate limits, added: ServiceOps new; removed serviceops.
Tue, Jan 13, 7:57 PM · MediaWiki-Platform-Team (Radar), ServiceOps new, Traffic, Epic, OKR-Work, MW-Interfaces-Team, FY2025-26 KR 5.1
Scott_French added projects to T412693: Ensure all Chart.yaml files include required metadata fields: Kubernetes, ServiceOps-Services-Oids.
Tue, Jan 13, 7:50 PM · ServiceOps-Services-Oids, Kubernetes, ServiceOps new
Scott_French triaged T412693: Ensure all Chart.yaml files include required metadata fields as Medium priority.
Tue, Jan 13, 7:46 PM · ServiceOps-Services-Oids, Kubernetes, ServiceOps new
Scott_French moved T412693: Ensure all Chart.yaml files include required metadata fields from Inbox to Backlog on the ServiceOps new board.
Tue, Jan 13, 7:45 PM · ServiceOps-Services-Oids, Kubernetes, ServiceOps new
Scott_French edited projects for T412693: Ensure all Chart.yaml files include required metadata fields, added: ServiceOps new; removed serviceops.
Tue, Jan 13, 7:43 PM · ServiceOps-Services-Oids, Kubernetes, ServiceOps new

Sun, Jan 11

Scott_French added a comment to T414173: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS).

@Benwing2 - Thanks for calling our attention to the Retry-After response header format issue. We've made a change that we believe should address this, which should be live everywhere as of roughly 20:30 UTC today. Please let us know if you continue to see the unexpected float-like format.
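
For clients that want to be defensive about this header, a minimal sketch follows. It assumes the standard Retry-After forms (a non-negative integer number of seconds, or an HTTP-date) and additionally tolerates a float-like value by rounding up. The function name `parse_retry_after` is hypothetical, and the exact float-like format that was being served is not stated in the comment, so the tolerant branch is an assumption.

```python
import math
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: str, now: Optional[datetime] = None) -> Optional[float]:
    """Return the number of seconds to wait, or None if unparseable."""
    value = value.strip()

    # Standard form: a non-negative integer number of seconds.
    if value.isascii() and value.isdigit():
        return float(int(value))

    # Tolerated non-standard form: a float-like value, rounded up.
    try:
        seconds = float(value)
    except ValueError:
        pass
    else:
        if math.isfinite(seconds) and seconds >= 0:
            return float(math.ceil(seconds))
        return None

    # Standard form: an HTTP-date.
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        # Treat a date without an offset as UTC.
        when = when.replace(tzinfo=timezone.utc)
    current = now or datetime.now(timezone.utc)
    return max(0.0, (when - current).total_seconds())
```

Rounding up rather than down errs on the side of waiting slightly longer than requested, which avoids retrying before the server's limit has reset.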

Sun, Jan 11, 8:34 PM · SRE, Traffic

Fri, Jan 9

Scott_French added a comment to T412211: Move EXCLUDED_SERVICES attribute from sre.discovery.datacentre to service catalog.

After some discussion on https://gerrit.wikimedia.org/r/1224041 and a chat earlier on Thursday with @Blake, I wanted to follow up here about the state of the sre.switchdc.services cookbook and deprecation of the EXCLUDED_SERVICES constant.

Fri, Jan 9, 2:51 AM · serviceops

Dec 19 2025

Scott_French added a comment to T360995: Migrate Wikimedia production from PHP 8.1 to PHP 8.3.

I've made a handful of (mostly minor) additional revisions to the SRE-driven items in the checklist:

  • (diff) Preparation for WMF: Extend SRE tasks with 8.3 migration examples
  • (diff) Preparation for WMF: Add note about coordination with MW for PHP extension versions
    • This makes explicit that although SRE drives the package build process, there is some amount of coordination with MediaWiki Engineering to review the selected PHP extension versions (and more rarely, assist with migrating to appropriate alternatives when an extension is no longer supported, as in the case of tideways this time around).
  • (diff) Preparation for WMF: Explicitly refer to maintenance scripts to avoid possible confusion with mwscript
  • (diff) Rollout: Add context on rollout sequence, references to T405955 and hypothetical schedule sheet
    • This adds a bit more detail on the high-level structure of the production rollout and links example artifacts from the 8.3 migration (e.g., the rollout task that reflects "what happened" and a sheet containing a hypothetical progressive rollout schedule).
  • (diff) Post-rollout: Add previous examples for SRE items and note that title-case mapping cleanup may be skipped if none was necessary
  • (diff) Preparation for WMF: Add note that title-case mapping may not be needed in some cases
Dec 19 2025, 10:05 PM · MediaWiki-Platform-Team (Q3 Kanban Board), Epic, serviceops

Dec 18 2025

Scott_French added a comment to T413008: Cross-datacenter Docker Registry replication broken since 2025-04-27.

Many thanks for investigating this @MatthewVernon and @elukey.

Dec 18 2025, 3:07 PM · Release-Engineering-Team (Radar), SRE, Infrastructure-Foundations, serviceops, SRE-swift-storage, Kubernetes
Scott_French updated subscribers of T412265: Pushing to the docker registry fails with 500 Internal Server Error.

@thcipriani - Thanks for pulling together T412265#11471277. Indeed, your understanding here is correct.

Dec 18 2025, 2:36 PM · ServiceOps-SharedInfra, ServiceOps new, SRE, MW-on-K8s

Dec 5 2025

Scott_French closed T405955: MediaWiki on PHP 8.3 production workload migration as Resolved.

Thank you for taking care of the mediawiki-dumps-legacy toolbox release, @BTullis.

Dec 5 2025, 4:03 PM · Patch-For-Review, serviceops
Scott_French closed T405955: MediaWiki on PHP 8.3 production workload migration, a subtask of T360995: Migrate Wikimedia production from PHP 8.1 to PHP 8.3, as Resolved.
Dec 5 2025, 4:03 PM · MediaWiki-Platform-Team (Q3 Kanban Board), Epic, serviceops

Dec 4 2025

Scott_French added a comment to T405955: MediaWiki on PHP 8.3 production workload migration.

Summarizing discussion with @BTullis:

Dec 4 2025, 7:10 PM · Patch-For-Review, serviceops
Scott_French added a comment to T390251: docker-registry.wikimedia.org keeps serving bad blobs.

T406392 is a good reminder of the fact that the bandaids we might otherwise fall back on (e.g., sleeps, internal retries) are not available in all contexts, so even though we've largely focused on the MediaWiki image use case here, we really need a more systematic solution.

Dec 4 2025, 6:53 PM · ServiceOps new, Patch-For-Review

Dec 3 2025

Scott_French updated subscribers of T405955: MediaWiki on PHP 8.3 production workload migration.

Following up on T405955#11408087, I stand corrected - it seems that the reference to mediawiki-multiversion-cli:2025-07-23-203525-publish-81 is in fact used, namely by the mediawiki-dumps-legacy "resources" helmfile release, which includes the "toolbox" deployment and does not inherit the scap-managed overrides.

Dec 3 2025, 9:09 PM · Patch-For-Review, serviceops
Scott_French closed T352245: Migrate the etcd main cluster to cfssl-based PKI as Resolved.

Although there may be some further cleanups to remove the now-unused certificates, no further action is planned as part of this task.

Dec 3 2025, 8:42 PM · Patch-For-Review, serviceops

Dec 2 2025

Scott_French added a comment to T360029: Integrate dbctl IP changes as part of VLAN changes. .

Thanks for the heads-up, @Marostegui.

Dec 2 2025, 8:01 PM · conftool, Data-Persistence, SRE, Infrastructure-Foundations
Scott_French updated the task description for T352245: Migrate the etcd main cluster to cfssl-based PKI.
Dec 2 2025, 7:13 PM · Patch-For-Review, serviceops
Scott_French added a comment to T352245: Migrate the etcd main cluster to cfssl-based PKI.

Alright, that should be everything. All configcluster hosts have had etcd migrated to use cfssl-based PKI, which should unblock migration to Puppet 7.

Dec 2 2025, 7:11 PM · Patch-For-Review, serviceops