Scott_French (Scott French)
User

User Details

User Since
Jan 18 2024, 5:33 PM (124 w, 2 d)
Availability
Available
LDAP User
Scott French
MediaWiki User
SFrench-WMF [ Global Accounts ]

Recent Activity

Fri, Jun 5

Scott_French updated the task description for T427668: Turn up the Pretrain MVP environment.
Fri, Jun 5, 7:24 PM · MW-on-K8s, ServiceOps-Mediawiki, ServiceOps new
Scott_French updated the task description for T427666: Route testwiki traffic to the Pretrain MVP environment.
Fri, Jun 5, 6:43 PM · MW-on-K8s, ServiceOps-Mediawiki, ServiceOps new
Scott_French moved T428174: Standard helm chart for simple service-utils nodejs apps from Needs Info / Blocked to Next quarter on the ServiceOps new board.
Fri, Jun 5, 3:16 PM · ServiceOps new (Next quarter), ServiceOps-SharedInfra, Data-Engineering
Scott_French triaged T428174: Standard helm chart for simple service-utils nodejs apps as Medium priority.

Great, thank you both, then!

Fri, Jun 5, 3:16 PM · ServiceOps new (Next quarter), ServiceOps-SharedInfra, Data-Engineering

Thu, Jun 4

Scott_French added a comment to T428174: Standard helm chart for simple service-utils nodejs apps.

Thanks for opening this, @Ottomata.

Thu, Jun 4, 9:12 PM · ServiceOps new (Next quarter), ServiceOps-SharedInfra, Data-Engineering
Scott_French added a comment to T428013: TypeError: Cannot assign null to property Shellbox\Command\Command::$includeStderr of type bool in Shellbox (server) v4.5.0.

Upon closer inspection, I suspect the only Command property in the serialized client data that was not otherwise guaranteed to take a value that will satisfy type checks upon assignment in setClientData on v4.5.0 is includeStderr. Meaning, as long as that's the only issue, we probably could get away with just changing L470 in setClientData to $value ?? false. That would make that function compatible with v4.4.0-serialized client data, at the expense of introducing what feels like kind of a hack.

Thu, Jun 4, 7:55 PM · ServiceOps-Services-Oids, ServiceOps new, Shellbox
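
The fallback discussed in the comment above can be illustrated with a minimal sketch. This is not Shellbox code: Shellbox is written in PHP, and the `Command` class, the `set_client_data` function, and the `includeStderr` dictionary key below are hypothetical Python stand-ins. The sketch only shows the general idea of coalescing a null serialized value to `False` before assigning it to a strictly typed boolean field, which is roughly what `$value ?? false` would do.

```python
from dataclasses import dataclass


@dataclass
class Command:
    """Hypothetical stand-in for a command object with a strict bool field."""

    include_stderr: bool = False

    def __setattr__(self, name, value):
        # Mimic a typed property: refuse anything that is not a real bool.
        if name == "include_stderr" and not isinstance(value, bool):
            raise TypeError(
                "Cannot assign %r to include_stderr of type bool" % (value,)
            )
        super().__setattr__(name, value)


def set_client_data(command, data):
    """Apply serialized client data, tolerating a missing or null flag.

    Assumption: older clients may serialize the flag as null (None here)
    or omit it entirely.
    """
    value = data.get("includeStderr")
    # Python analogue of PHP's `$value ?? false`.
    command.include_stderr = False if value is None else value
    return command
```

With this fallback, client data that carries a null or missing flag yields `include_stderr == False` rather than raising a `TypeError`. The trade-off is the one the comment already notes: the default is applied silently.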

Wed, Jun 3

Scott_French added a comment to T427820: Migrate Shellbox image to Bookworm.

Following up on one item we discussed today:

Wed, Jun 3, 8:14 PM · ServiceOps new, ServiceOps-Upgrades-Hardware, ServiceOps-Mediawiki
Scott_French added a comment to T427820: Migrate Shellbox image to Bookworm.

Alas, an attempt to deploy newly built Shellbox images (i.e., reflecting all code changes through v4.5.0) on Tuesday surfaced T428013. Unless that's resolved in the interim, we'll need to use a different approach to trigger image builds after the php8.3 production images switch to bookworm. For that, I've found the job-replay approach described in Shellbox#Deploying_a_new_version to work reasonably well.

Wed, Jun 3, 3:38 AM · ServiceOps new, ServiceOps-Upgrades-Hardware, ServiceOps-Mediawiki

Tue, Jun 2

Scott_French added a comment to T428013: TypeError: Cannot assign null to property Shellbox\Command\Command::$includeStderr of type bool in Shellbox (server) v4.5.0.

Moving this to our Radar (Pending) since it should not block work in T427820: Migrate Shellbox image to Bookworm (i.e., we have options available for triggering rebuilds at older commits), but it will have implications for how we proceed.

Tue, Jun 2, 11:25 PM · ServiceOps-Services-Oids, ServiceOps new, Shellbox
Scott_French moved T428013: TypeError: Cannot assign null to property Shellbox\Command\Command::$includeStderr of type bool in Shellbox (server) v4.5.0 from Inbox to Radar (Pending) on the ServiceOps new board.
Tue, Jun 2, 11:04 PM · ServiceOps-Services-Oids, ServiceOps new, Shellbox
Scott_French created T428013: TypeError: Cannot assign null to property Shellbox\Command\Command::$includeStderr of type bool in Shellbox (server) v4.5.0.
Tue, Jun 2, 11:03 PM · ServiceOps-Services-Oids, ServiceOps new, Shellbox
Scott_French created P93613 Shellbox TypeError (2026-06-02-125930).
Tue, Jun 2, 6:47 PM

Fri, May 29

Scott_French added a comment to T423168: aqs-http-gateway services at risk due to inaccessible cassandra hosts.

So, it's entirely possible that I'm mistaken, but if not, then I have good news:

Fri, May 29, 7:06 PM · ServiceOps-Services-Oids, ServiceOps new, Cassandra, User-Eevans
Scott_French triaged T427668: Turn up the Pretrain MVP environment as Medium priority.
Fri, May 29, 5:59 PM · MW-on-K8s, ServiceOps-Mediawiki, ServiceOps new
Scott_French created T427668: Turn up the Pretrain MVP environment.
Fri, May 29, 5:59 PM · MW-on-K8s, ServiceOps-Mediawiki, ServiceOps new
Scott_French triaged T427666: Route testwiki traffic to the Pretrain MVP environment as Medium priority.
Fri, May 29, 5:13 PM · MW-on-K8s, ServiceOps-Mediawiki, ServiceOps new
Scott_French created T427666: Route testwiki traffic to the Pretrain MVP environment.
Fri, May 29, 5:12 PM · MW-on-K8s, ServiceOps-Mediawiki, ServiceOps new
Scott_French changed the status of T427659: WE6.7.2 (FY25-26) Pretrain MVP deployment environment from Open to In Progress.
Fri, May 29, 4:07 PM · MW-on-K8s, ServiceOps-Mediawiki, Epic, ServiceOps new
Scott_French created T427659: WE6.7.2 (FY25-26) Pretrain MVP deployment environment.
Fri, May 29, 4:05 PM · MW-on-K8s, ServiceOps-Mediawiki, Epic, ServiceOps new

Thu, May 28

Scott_French closed T427312: Build PHP 8.3 packages for bookworm, a subtask of T418200: Migrate Service Ops Docker images running in production away from Bullseye, as Resolved.
Thu, May 28, 4:43 PM · Patch-For-Review, ServiceOps new, ServiceOps-Upgrades-Hardware, ServiceOps-Mediawiki
Scott_French closed T427312: Build PHP 8.3 packages for bookworm as Resolved.

I'll post the production-images patch in T427312#11964352 shortly, which we can reuse for the actual switch (wherever that ends up being tracked). For now, I think that's everything tracked here.

Thu, May 28, 4:43 PM · ServiceOps new, ServiceOps-Upgrades-Hardware, ServiceOps-Mediawiki
Scott_French added a comment to T427312: Build PHP 8.3 packages for bookworm.

Thanks for the pointers, Moritz!

Thu, May 28, 4:41 PM · ServiceOps new, ServiceOps-Upgrades-Hardware, ServiceOps-Mediawiki
Scott_French moved T423168: aqs-http-gateway services at risk due to inaccessible cassandra hosts from Radar (Pending) to Needs Info / Blocked on the ServiceOps new board.

In terms of what's explicitly tracked in this task, I believe all that remains is:

  1. Verify that the proof-of-concept improvements in hoarde work as expected.
  2. Summarize those improvements in a way that AQS 2.0 service owners can easily replicate (maybe this requires applying the same to kask and data-gateway in order to assess how well it generalizes?).
  3. Open follow-on tasks for that work.
Thu, May 28, 4:10 PM · ServiceOps-Services-Oids, ServiceOps new, Cassandra, User-Eevans
Scott_French added a comment to T427312: Build PHP 8.3 packages for bookworm.

Alright, once I fixed the pristine-tar init on my import-dsc, everything works smoothly. I've now built, but not yet fully included (i.e., only those necessary to satisfy inter-package dependencies), all of the necessary packages.

Thu, May 28, 12:31 AM · ServiceOps new, ServiceOps-Upgrades-Hardware, ServiceOps-Mediawiki

Wed, May 27

Scott_French added a comment to T427312: Build PHP 8.3 packages for bookworm.

Made some progress today - package builds are progressing, and indeed there have been no build-time surprises thus far.

Wed, May 27, 10:00 PM · ServiceOps new, ServiceOps-Upgrades-Hardware, ServiceOps-Mediawiki

Tue, May 26

Scott_French changed the status of T427312: Build PHP 8.3 packages for bookworm, a subtask of T418200: Migrate Service Ops Docker images running in production away from Bullseye, from Open to In Progress.
Tue, May 26, 6:09 PM · Patch-For-Review, ServiceOps new, ServiceOps-Upgrades-Hardware, ServiceOps-Mediawiki
Scott_French changed the status of T427312: Build PHP 8.3 packages for bookworm from Open to In Progress.
Tue, May 26, 6:09 PM · ServiceOps new, ServiceOps-Upgrades-Hardware, ServiceOps-Mediawiki
Scott_French created T427312: Build PHP 8.3 packages for bookworm.
Tue, May 26, 6:09 PM · ServiceOps new, ServiceOps-Upgrades-Hardware, ServiceOps-Mediawiki

Fri, May 22

Scott_French added a comment to T427024: Move MediaWiki envoy drain configuration from mesh.extra_env to mesh.admin.

@MLechvien-WMF - Fair question! Given the amount of other high-priority work slated for this quarter that is currently blocked, but may be ready to resume soon (e.g., conf* node work), it's probably safer to push this out to next (as much as it pains me to leave this debt around longer, heh). Done.

Fri, May 22, 5:02 PM · ServiceOps new (Next quarter), MW-on-K8s, ServiceOps-Mediawiki
Scott_French moved T427024: Move MediaWiki envoy drain configuration from mesh.extra_env to mesh.admin from Scheduled (this Q) to Next quarter on the ServiceOps new board.
Fri, May 22, 5:02 PM · ServiceOps new (Next quarter), MW-on-K8s, ServiceOps-Mediawiki
Scott_French placed T364245: Recentchanges and cu_changes tables are occasionally missing revisions on multiple wikis up for grabs.

I've updated the task description to (1) reflect a broader description of the problem as we understand it and (2) reflect a recent observation by @tstarling that a straightforward mechanism to accurately measure the rate of DeferredUpdates "loss" would be the most valuable next step - i.e., it would both function to identify potential triggers and assess effectiveness of any fixes.

Fri, May 22, 12:26 AM · Patch-For-Review, ServiceOps new, MW-on-K8s, MediaWiki-Recent-changes
Scott_French triaged T427024: Move MediaWiki envoy drain configuration from mesh.extra_env to mesh.admin as Medium priority.
Fri, May 22, 12:14 AM · ServiceOps new (Next quarter), MW-on-K8s, ServiceOps-Mediawiki
Scott_French created T427024: Move MediaWiki envoy drain configuration from mesh.extra_env to mesh.admin.
Fri, May 22, 12:13 AM · ServiceOps new (Next quarter), MW-on-K8s, ServiceOps-Mediawiki
Scott_French updated the task description for T364245: Recentchanges and cu_changes tables are occasionally missing revisions on multiple wikis.
Fri, May 22, 12:02 AM · Patch-For-Review, ServiceOps new, MW-on-K8s, MediaWiki-Recent-changes

Wed, May 20

Scott_French added a comment to T417704: [FY25-26 WE6.1.4] Establish Pretrain production design for MVP.

Thank you for doing so, and apologies for the delayed response. I'll give some thought to how we can distill the most useful details from those docs.

Wed, May 20, 12:31 AM · ServiceOps new, Goal, Release-Engineering-Team (Doing 😎)

Mon, May 18

Scott_French added a comment to T390573: Consider removing envvars.inc from MediaWiki images.

The change on my end is that the PHP version embedded in the ConfigMap-volume mount path is now provided as metadata alongside the MediaWiki image label (i.e., in helmfile values).

Mon, May 18, 11:52 PM · ServiceOps new, ServiceOps-Mediawiki
Scott_French assigned T426688: Integrate wikikube-worker2331 to jasmine_.

@jasmine_ has kindly offered to take this. Thank you!

Mon, May 18, 9:51 PM · Patch-For-Review, ServiceOps-Upgrades-Hardware, ServiceOps new
Scott_French triaged T426688: Integrate wikikube-worker2331 as Medium priority.
Mon, May 18, 9:50 PM · Patch-For-Review, ServiceOps-Upgrades-Hardware, ServiceOps new
Scott_French created T426688: Integrate wikikube-worker2331.
Mon, May 18, 9:50 PM · Patch-For-Review, ServiceOps-Upgrades-Hardware, ServiceOps new
Scott_French added a comment to T426683: wikikube-worker2190.codfw.wmnet failure at reboot.

Thank you very much, @Jhancock.wm! Looks good - feel free to close this out. I'll repool the host shortly.

Mon, May 18, 9:27 PM · SRE, ServiceOps new, ops-codfw, DC-Ops
Scott_French moved T426683: wikikube-worker2190.codfw.wmnet failure at reboot from Inbox to Radar (Pending) on the ServiceOps new board.
Mon, May 18, 7:35 PM · SRE, ServiceOps new, ops-codfw, DC-Ops
Scott_French created T426683: wikikube-worker2190.codfw.wmnet failure at reboot.
Mon, May 18, 7:34 PM · SRE, ServiceOps new, ops-codfw, DC-Ops

Apr 30 2026

Scott_French added a comment to T364245: Recentchanges and cu_changes tables are occasionally missing revisions on multiple wikis.

Thank you very much for the additional analysis, @tstarling - both the reanalysis of enwiki and digging into the nlwiktionary example.

Apr 30 2026, 11:56 PM · Patch-For-Review, ServiceOps new, MW-on-K8s, MediaWiki-Recent-changes
Scott_French added a comment to T418200: Migrate Service Ops Docker images running in production away from Bullseye.

For MediaWiki and Shellbox, there are a couple of steps involved. Here's a quick overview of what I'm imagining.

Apr 30 2026, 9:42 PM · Patch-For-Review, ServiceOps new, ServiceOps-Upgrades-Hardware, ServiceOps-Mediawiki
Scott_French added a comment to T424975: Certain deployment logs cause Spiderpig to crash the browser.

Under the hood, helmfile apply will call helm diff upgrade, which (1) communicates to helmfile whether there is in fact a diff (exit code) and (2) reports said diff via stdout for helmfile to in turn display to the user.

Apr 30 2026, 4:29 PM · Release-Engineering-Team (Doing 😎), Scap (SpiderPig 🕸️)
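
The exit-code-plus-stdout contract described in the comment above can be sketched as follows. This is an illustration only, not helmfile's implementation: the `interpret_diff` function and `DiffOutcome` type are hypothetical, and the specific exit codes (0 for "no changes", 2 for "changes present", anything else an error) are an assumed convention rather than something stated in the comment.

```python
from typing import NamedTuple


class DiffOutcome(NamedTuple):
    has_changes: bool
    report: str


def interpret_diff(returncode: int, stdout: str) -> DiffOutcome:
    """Turn a diff command's exit code and stdout into a decision.

    Assumed convention: 0 = no changes, 2 = changes detected,
    any other value = the diff command itself failed.
    """
    if returncode == 0:
        return DiffOutcome(False, "")
    if returncode == 2:
        # The diff text is passed through so the caller can display it.
        return DiffOutcome(True, stdout)
    raise RuntimeError(f"diff command failed with exit code {returncode}")
```

A caller would print `report` to the user only when `has_changes` is true, which is why the size of that stdout payload matters to anything rendering the deployment log.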

Apr 29 2026

Scott_French added a comment to T422967: Investigate DNS query improvements in MediaWiki-on-k8s.

Ah, got it - thanks, @JMeybohm. For some reason, I thought there was something else beyond the dot-suffixing being proposed. In any case, +1 to starting with just the dot-suffixing, as it's clearly beneficial and implies minimal behavior change.

Apr 29 2026, 11:57 PM · ServiceOps new (Next quarter), ServiceOps-SharedInfra
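
As a minimal illustration of the dot-suffixing idea mentioned above: a hostname that ends in a trailing dot is treated by resolvers as fully qualified, so it is looked up directly without first trying each entry in the resolver's search list. The helper name below is hypothetical.

```python
def to_absolute_name(hostname: str) -> str:
    """Append a trailing dot so the name is treated as fully qualified."""
    return hostname if hostname.endswith(".") else hostname + "."
```

For example, `to_absolute_name("apus.discovery.wmnet")` yields `"apus.discovery.wmnet."`, and a name that already ends in a dot is returned unchanged.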
Scott_French added a comment to T417166: Improve scap/deploy rollback turnaround times.

This is a challenging action item to make actionable, given the generic problem statement and level of detail available in the linked doc.

Apr 29 2026, 7:45 PM · Release-Engineering-Team, Scap, Sustainability (Incident Followup)

Apr 28 2026

Scott_French updated the task description for T424671: Migrate ServiceOps Envoy TLS proxy services to the 2026 discovery intermediate.
Apr 28 2026, 9:58 PM · ServiceOps new, Infrastructure-Foundations
Scott_French updated the task description for T424671: Migrate ServiceOps Envoy TLS proxy services to the 2026 discovery intermediate.
Apr 28 2026, 9:00 PM · ServiceOps new, Infrastructure-Foundations
Scott_French added a comment to T423619: Should we skip some directories from deploy backups?.

Thanks for updating the fileset @jcrespo.

Apr 28 2026, 3:42 PM · User-Raine, ServiceOps-SharedInfra, ServiceOps new, DC-Ops
Scott_French added a comment to T424266: Develop a plan for integrating conf200[7-9].

[...]
You don't need to import anything :-) The jmx exporter is already available for trixie and in fact already used on various roles (e.g. the IDPs and kafka-text). It's the same package as also used on bullseye and bookworm (which works since it only contains four JAR archives). At some point we should update it (that work is tracked at https://phabricator.wikimedia.org/T341439), but it hasn't been a priority so far.

Apr 28 2026, 2:52 PM · ServiceOps new, ServiceOps-Upgrades-Hardware
Scott_French added a comment to T371069: Add helm rollback functionality to scap.

Very nice description of the problem statement in T371069#11853857 @RLazarus.

Apr 28 2026, 1:08 AM · ServiceOps new, Release-Engineering-Team (Priority Backlog 📥), MW-on-K8s, Scap

Apr 27 2026

Scott_French added a comment to T424266: Develop a plan for integrating conf200[7-9].

I was able to do some testing today with the new 3.4.13-6+deb11u1~wmf13u1 package (basic functionality, mixed bullseye / trixie cluster compatibility, etc.) and everything seems to work as expected.

Apr 27 2026, 9:36 PM · ServiceOps new, ServiceOps-Upgrades-Hardware
Scott_French added a comment to T424266: Develop a plan for integrating conf200[7-9].

Thank you very much, @MoritzMuehlenhoff. I'll test the new package early this week.

Apr 27 2026, 2:31 PM · ServiceOps new, ServiceOps-Upgrades-Hardware
Scott_French moved T424266: Develop a plan for integrating conf200[7-9] from Inbox to In Progress on the ServiceOps new board.
Apr 27 2026, 2:17 PM · ServiceOps new, ServiceOps-Upgrades-Hardware

Apr 23 2026

Scott_French added a comment to T422967: Investigate DNS query improvements in MediaWiki-on-k8s.

Great, thanks @JMeybohm - Do you have a sense of what the highest-priority changes to the mesh configuration may be? i.e., some combination of dns_refresh_rate or respect_dns_ttl?

Apr 23 2026, 10:15 PM · ServiceOps new (Next quarter), ServiceOps-SharedInfra
Scott_French moved T422955: Detect elevated rates of EtcdConfig fetch failures from Needs Info / Blocked to Scheduled (this Q) on the ServiceOps new board.

Thank you both! Optimistically moving this to "this Q" given the relative implementation cost vs. benefit.

Apr 23 2026, 9:56 PM · ServiceOps new (Next quarter), ServiceOps-SharedInfra
Scott_French created T424280: Consider surfacing curl error details in MultiHttpClient.
Apr 23 2026, 9:53 PM · MediaWiki-libs-HTTP
Scott_French triaged T424266: Develop a plan for integrating conf200[7-9] as Medium priority.
Apr 23 2026, 7:49 PM · ServiceOps new, ServiceOps-Upgrades-Hardware
Scott_French created T424266: Develop a plan for integrating conf200[7-9].
Apr 23 2026, 7:48 PM · ServiceOps new, ServiceOps-Upgrades-Hardware
Scott_French added a comment to T418915: conf200[7-9] implementation tracking.

That's amazing - thank you very much, @MoritzMuehlenhoff! Yes, if you could import those into a dedicated component (component/zookeeper34 sounds good), that would be perfect.

Apr 23 2026, 4:09 PM · ServiceOps new, ServiceOps-Upgrades-Hardware, DC-Ops

Apr 20 2026

Scott_French added a comment to T420500: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE.

@AnnieKim_WMDE - Please see https://wikitech.wikimedia.org/wiki/SRE/Production_access#Access_Request_Process for details on the information you will need to provide in the task description. I've included two TODOs there for you to insert your new production public key and record the specific level of access you are now requesting.

Apr 20 2026, 2:25 PM · Patch-For-Review, Data-Engineering-Radar, Data-Engineering, SRE, SRE-Access-Requests
Scott_French updated the task description for T420500: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE.
Apr 20 2026, 2:23 PM · Patch-For-Review, Data-Engineering-Radar, Data-Engineering, SRE, SRE-Access-Requests

Apr 16 2026

Scott_French moved T423168: aqs-http-gateway services at risk due to inaccessible cassandra hosts from Inbox to Radar (Pending) on the ServiceOps new board.
Apr 16 2026, 6:43 PM · ServiceOps-Services-Oids, ServiceOps new, Cassandra, User-Eevans
Scott_French added projects to T423168: aqs-http-gateway services at risk due to inaccessible cassandra hosts: ServiceOps new, ServiceOps-Services-Oids.
Apr 16 2026, 6:43 PM · ServiceOps-Services-Oids, ServiceOps new, Cassandra, User-Eevans
Scott_French added a comment to T423619: Should we skip some directories from deploy backups?.

It seems a little surprising to me that we've not already excluded this path, so I'm wondering if there's some historical context that led to this being a conscious / intentional choice. I'll ask around in the meantime.

Apr 16 2026, 4:13 PM · User-Raine, ServiceOps-SharedInfra, ServiceOps new, DC-Ops
Scott_French added a project to T423619: Should we skip some directories from deploy backups?: ServiceOps-SharedInfra.
Apr 16 2026, 3:57 PM · User-Raine, ServiceOps-SharedInfra, ServiceOps new, DC-Ops
Scott_French added a project to T423619: Should we skip some directories from deploy backups?: ServiceOps new.

Thanks for raising this.

Apr 16 2026, 3:56 PM · User-Raine, ServiceOps-SharedInfra, ServiceOps new, DC-Ops
Scott_French added a comment to T422955: Detect elevated rates of EtcdConfig fetch failures.

@Clement_Goubert @JMeybohm - Does this sound reasonable to you? I think it should be relatively low effort (e.g., exporter rule -> task-severity alert), and is a surprising enough gap in our monitoring that it probably makes sense to prioritize soon (i.e., errorlog is basically /dev/null).

Apr 16 2026, 2:38 PM · ServiceOps new (Next quarter), ServiceOps-SharedInfra
Scott_French added a comment to T422967: Investigate DNS query improvements in MediaWiki-on-k8s.

@JMeybohm - Do you think this is something we could make meaningful progress on this quarter?

Apr 16 2026, 2:31 PM · ServiceOps new (Next quarter), ServiceOps-SharedInfra

Apr 15 2026

Scott_French added a comment to T422166: scap can’t deploy (blob upload unknown) after apus.discovery.wmnet is repooled in codfw.

Thanks, @Blake! Two thoughts:

Apr 15 2026, 8:06 PM · Ceph, SRE-swift-storage, Patch-For-Review, ServiceOps new, Datacenter-Switchover, SRE
Scott_French added a comment to T423168: aqs-http-gateway services at risk due to inaccessible cassandra hosts.

Alright, for now we're in a holding pattern until we decide how to approach the remaining item in the task description - i.e., improving the interaction between unreachable Cassandra hosts, gocql client session initialization, and service initialization (i.e., liveness and readiness).

Apr 15 2026, 6:05 PM · ServiceOps-Services-Oids, ServiceOps new, Cassandra, User-Eevans
Scott_French updated the task description for T423168: aqs-http-gateway services at risk due to inaccessible cassandra hosts.
Apr 15 2026, 6:03 PM · ServiceOps-Services-Oids, ServiceOps new, Cassandra, User-Eevans
Scott_French updated the task description for T423168: aqs-http-gateway services at risk due to inaccessible cassandra hosts.
Apr 15 2026, 6:02 PM · ServiceOps-Services-Oids, ServiceOps new, Cassandra, User-Eevans

Apr 14 2026

Scott_French added a comment to T422166: scap can’t deploy (blob upload unknown) after apus.discovery.wmnet is repooled in codfw.

@MLechvien-WMF - I've updated the task description to capture the discussion here. My vote is that we get the "near-term mitigation" work done this quarter, while the issue is fresh in our minds.

Apr 14 2026, 2:48 PM · Ceph, SRE-swift-storage, Patch-For-Review, ServiceOps new, Datacenter-Switchover, SRE
Scott_French updated the task description for T422166: scap can’t deploy (blob upload unknown) after apus.discovery.wmnet is repooled in codfw.
Apr 14 2026, 2:44 PM · Ceph, SRE-swift-storage, Patch-For-Review, ServiceOps new, Datacenter-Switchover, SRE
Scott_French updated the task description for T423168: aqs-http-gateway services at risk due to inaccessible cassandra hosts.
Apr 14 2026, 1:11 AM · ServiceOps-Services-Oids, ServiceOps new, Cassandra, User-Eevans

Apr 13 2026

Scott_French updated the task description for T423168: aqs-http-gateway services at risk due to inaccessible cassandra hosts.
Apr 13 2026, 11:20 PM · ServiceOps-Services-Oids, ServiceOps new, Cassandra, User-Eevans
Scott_French removed a project from T423168: aqs-http-gateway services at risk due to inaccessible cassandra hosts: SRE.
Apr 13 2026, 8:42 PM · ServiceOps-Services-Oids, ServiceOps new, Cassandra, User-Eevans
Scott_French added a comment to T423168: aqs-http-gateway services at risk due to inaccessible cassandra hosts.

[...]
These almost always occur in batches (i.e. hardware refreshes, expansions, etc), usually on the order of between 3-9 hosts at a time. The new hosts go up over a period of days, or even weeks, one-by-one. That's a lot of ServiceOps pings, no?

Apr 13 2026, 8:38 PM · ServiceOps-Services-Oids, ServiceOps new, Cassandra, User-Eevans
Scott_French added a comment to T364245: Recentchanges and cu_changes tables are occasionally missing revisions on multiple wikis.

Thanks for the report, @MarcoSwart, and for confirming that this appears similar to how the issue manifests, @matej_suchanek.

Apr 13 2026, 8:02 PM · Patch-For-Review, ServiceOps new, MW-on-K8s, MediaWiki-Recent-changes
Scott_French updated the task description for T423168: aqs-http-gateway services at risk due to inaccessible cassandra hosts.
Apr 13 2026, 7:35 PM · ServiceOps-Services-Oids, ServiceOps new, Cassandra, User-Eevans
Scott_French lowered the priority of T423168: aqs-http-gateway services at risk due to inaccessible cassandra hosts from High to Medium.

@Eevans - Could I ask you to pick up the documentation change for Cassandra host turn-up?

Apr 13 2026, 7:32 PM · ServiceOps-Services-Oids, ServiceOps new, Cassandra, User-Eevans
Scott_French updated the task description for T423168: aqs-http-gateway services at risk due to inaccessible cassandra hosts.
Apr 13 2026, 7:24 PM · ServiceOps-Services-Oids, ServiceOps new, Cassandra, User-Eevans
Scott_French renamed T423168: aqs-http-gateway services at risk due to inaccessible cassandra hosts from aqs-http-gateway services at risk from defunct hosts in cassandra_hosts to aqs-http-gateway services at risk due to inaccessible cassandra hosts.
Apr 13 2026, 5:51 PM · ServiceOps-Services-Oids, ServiceOps new, Cassandra, User-Eevans
Scott_French added a comment to T423168: aqs-http-gateway services at risk due to inaccessible cassandra hosts.

So, once the external-services network policy changes were applied, the crash-looping pod in editor-analytics was able to start successfully.

Apr 13 2026, 5:45 PM · ServiceOps-Services-Oids, ServiceOps new, Cassandra, User-Eevans
Scott_French added a comment to T423168: aqs-http-gateway services at risk due to inaccessible cassandra hosts.

Plot twist:

Apr 13 2026, 5:29 PM · ServiceOps-Services-Oids, ServiceOps new, Cassandra, User-Eevans
Scott_French updated the task description for T423168: aqs-http-gateway services at risk due to inaccessible cassandra hosts.
Apr 13 2026, 4:57 PM · ServiceOps-Services-Oids, ServiceOps new, Cassandra, User-Eevans
Scott_French triaged T423168: aqs-http-gateway services at risk due to inaccessible cassandra hosts as High priority.
Apr 13 2026, 4:53 PM · ServiceOps-Services-Oids, ServiceOps new, Cassandra, User-Eevans
Scott_French added a comment to T423168: aqs-http-gateway services at risk due to inaccessible cassandra hosts.

I've verified that manually deleting an editor-analytics pod in staging will trigger crash looping, and then setting initialDelaySeconds on the liveness probe (in this case 40s) will resolve it.

Apr 13 2026, 4:52 PM · ServiceOps-Services-Oids, ServiceOps new, Cassandra, User-Eevans
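
To show why raising `initialDelaySeconds` can stop a crash loop like the one described above, here is a deliberately simplified model of liveness probing. It assumes the first probe fires at `initialDelaySeconds`, later probes fire every `periodSeconds`, and the container is restarted after `failureThreshold` consecutive failures. It ignores probe timeouts and scheduling jitter. The function name and the 25-second startup time are hypothetical; the default values are intended to mirror the Kubernetes defaults.

```python
def restarted_before_ready(startup_seconds, initial_delay_seconds=0,
                           period_seconds=10, failure_threshold=3):
    """Return True if the container would be killed before it can respond.

    Simplified model: probes run at initial_delay_seconds and then every
    period_seconds; a restart happens once failure_threshold consecutive
    probes have failed.
    """
    # Moment at which the failure_threshold-th consecutive probe would fail.
    kill_time = initial_delay_seconds + (failure_threshold - 1) * period_seconds
    # If the service is still starting at that moment, every probe so far
    # has failed and the container is restarted.
    return startup_seconds > kill_time
```

Under this model, a service that needs 25 seconds to start is restarted with the default settings (probes at 0, 10, and 20 seconds all fail), but survives with `initial_delay_seconds=40` because the first probe arrives after startup has completed.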
Scott_French created T423168: aqs-http-gateway services at risk due to inaccessible cassandra hosts.
Apr 13 2026, 4:51 PM · ServiceOps-Services-Oids, ServiceOps new, Cassandra, User-Eevans

Apr 10 2026

Scott_French closed T422455: Massive increase in "EtcdConfig failed to fetch data: Timeout was reached" warnings and errors since March 17th as Resolved.

Many thanks @Clement_Goubert and @JMeybohm. I've opened the two follow-up tasks and we can shift further discussion there.

Apr 10 2026, 6:15 PM · Patch-For-Review, MediaWiki-Engineering, ServiceOps new, Wikimedia-production-error
Scott_French triaged T422967: Investigate DNS query improvements in MediaWiki-on-k8s as Medium priority.

Similar to T422955, if this sounds reasonable, let's try to schedule it for this quarter.

Apr 10 2026, 6:13 PM · ServiceOps new (Next quarter), ServiceOps-SharedInfra
Scott_French created T422967: Investigate DNS query improvements in MediaWiki-on-k8s.
Apr 10 2026, 6:12 PM · ServiceOps new (Next quarter), ServiceOps-SharedInfra
Scott_French added a comment to T419212: Upgrade ServiceOps roles from Bullseye to Debian Trixie.

@MLechvien-WMF - Yes, exactly. My plan is for the new conf* hosts being racked in codfw to run Trixie from day 1, which will pave the way for upgrading eqiad to Trixie as well (T419212#11703456). The actual process is going to be rather involved, and I'll need to spend a bit of time to iron out the procedure first.

Apr 10 2026, 5:51 PM · Patch-For-Review, User-Raine, ServiceOps new, ServiceOps-Upgrades-Hardware
Scott_French triaged T422955: Detect elevated rates of EtcdConfig fetch failures as Medium priority.

Moving this to Needs Info while we converge on whether this sounds reasonable. If it does, I'd propose we schedule it for this quarter.

Apr 10 2026, 4:44 PM · ServiceOps new (Next quarter), ServiceOps-SharedInfra
Scott_French created T422955: Detect elevated rates of EtcdConfig fetch failures.
Apr 10 2026, 4:43 PM · ServiceOps new (Next quarter), ServiceOps-SharedInfra
Scott_French added a comment to T422166: scap can’t deploy (blob upload unknown) after apus.discovery.wmnet is repooled in codfw.

I've eyeballed the discussion here - AFAICT apus is behaving as expected? I haven't seen persistent lag between the two clusters, but during bursts of activity replication between the DCs is asynchronous (by design). The problem is the registry (due to caching connections) writing to both clusters at the same time but assuming it's only writing to one, and thus being thrown by asynchronous replication between the clusters.

Apr 10 2026, 3:14 PM · Ceph, SRE-swift-storage, Patch-For-Review, ServiceOps new, Datacenter-Switchover, SRE

Apr 9 2026

Scott_French added a comment to T422455: Massive increase in "EtcdConfig failed to fetch data: Timeout was reached" warnings and errors since March 17th.

Ah, thanks for highlighting that @JMeybohm!

Apr 9 2026, 8:53 PM · Patch-For-Review, MediaWiki-Engineering, ServiceOps new, Wikimedia-production-error
Scott_French updated subscribers of T422166: scap can’t deploy (blob upload unknown) after apus.discovery.wmnet is repooled in codfw.

Thanks, @Blake. Yes, excluding the apus service from switchover day 1 seems like the right approach for now given the special handling required (by analogy with other services we've excluded).

Apr 9 2026, 7:26 PM · Ceph, SRE-swift-storage, Patch-For-Review, ServiceOps new, Datacenter-Switchover, SRE