User Details
- User Since: Jan 18 2024, 5:33 PM (124 w, 2 d)
- Availability: Available
- LDAP User: Scott French
- MediaWiki User: SFrench-WMF
Fri, Jun 5
Great, thank you both, then!
Thu, Jun 4
Thanks for opening this, @Ottomata.
Upon closer inspection, I suspect the only Command property in the serialized client data that was not otherwise guaranteed to take a value that will satisfy type checks upon assignment in setClientData on v4.5.0 is includeStderr. Meaning, as long as that's the only issue, we probably could get away with just changing L470 in setClientData to $value ?? false. That would make that function compatible with v4.4.0-serialized client data, at the expense of introducing what feels like kind of a hack.
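To make the `$value ?? false` idea concrete, here is a minimal, language-neutral sketch in Python. It is not Shellbox code: the `Command` class, the `set_client_data` method, and the payload shape are hypothetical stand-ins, and only the `includeStderr` property name comes from the comment above. It shows how substituting a default for a null value lets data serialized by an older version pass a strict type check on assignment.

```python
from dataclasses import dataclass


# Hypothetical stand-in for a command object with strictly typed properties.
@dataclass
class Command:
    command: str = ""
    include_stderr: bool = False

    def set_client_data(self, data: dict) -> None:
        """Assign serialized client data, tolerating a null includeStderr."""
        command = data.get("command", "")
        if not isinstance(command, str):
            raise TypeError("command must be a string")
        self.command = command

        # Assumption for illustration: older serializations carry None here.
        # Falling back to False mirrors the `$value ?? false` suggestion.
        value = data.get("includeStderr")
        include_stderr = value if value is not None else False
        if not isinstance(include_stderr, bool):
            raise TypeError("includeStderr must be a bool")
        self.include_stderr = include_stderr


# Payload shaped like data from an older serializer (hypothetical).
old_payload = {"command": "echo hi", "includeStderr": None}
cmd = Command()
cmd.set_client_data(old_payload)
```

The trade-off is the one noted above: the default is applied silently, so a null that signals a real bug would also be accepted.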
Wed, Jun 3
Following up on one item we discussed today:
Alas, an attempt to deploy newly built Shellbox images (i.e., reflecting all code changes through v4.5.0) on Tuesday surfaced T428013. Unless that's resolved in the interim, we'll need to use a different approach to trigger image builds after the php8.3 production images switch to bookworm. For that, I've found the job-replay approach described in Shellbox#Deploying_a_new_version to work reasonably well.
Tue, Jun 2
Moving this to our Radar (Pending) since it should not block work in T427820: Migrate Shellbox image to Bookworm (i.e., we have options available for triggering rebuilds at older commits), but it will have implications for how we proceed.
Fri, May 29
So, it's entirely possible that I'm mistaken, but if not, then I have good news:
Thu, May 28
I'll post the production-images patch in T427312#11964352 shortly, which we can reuse for the actual switch (wherever that ends up being tracked). For now, I think that's everything tracked here.
Thanks for the pointers, Moritz!
In terms of what's explicitly tracked in this task, I believe all that remains is:
- Verify that the proof-of-concept improvements in hoarde work as expected.
- Summarize those improvements in a way that AQS 2.0 service owners can easily replicate (maybe this requires applying the same to kask and data-gateway in order to assess how well it generalizes?).
- Open follow-on tasks for that work.
Alright, once I fixed the pristine-tar init on my import-dsc, everything works smoothly. I've now built all of the necessary packages, though so far I've only included those needed to satisfy inter-package dependencies.
Wed, May 27
Made some progress today - package builds are progressing, and indeed there have been no build-time surprises thus far.
Tue, May 26
Fri, May 22
@MLechvien-WMF - Fair question! Given the amount of other high-priority work slated for this quarter that is currently blocked, but may be ready to resume soon (e.g., conf* node work), it's probably safer to push this out to next quarter (as much as it pains me to leave this debt around longer, heh). Done.
I've updated the task description to (1) reflect a broader description of the problem as we understand it and (2) reflect a recent observation by @tstarling that a straightforward mechanism to accurately measure the rate of DeferredUpdates "loss" would be the most valuable next step - i.e., it would both function to identify potential triggers and assess effectiveness of any fixes.
Wed, May 20
Thank you for doing so, and apologies for the delayed response. I'll give some thought to how we can distill the most useful details from those docs.
Mon, May 18
The change on my end is that the PHP version embedded in the ConfigMap-volume mount path is now provided as metadata alongside the MediaWiki image label (i.e., in helmfile values).
@jasmine_ has kindly offered to take this. Thank you!
Thank you very much, @Jhancock.wm! Looks good - feel free to close this out. I'll repool the host shortly.
Apr 30 2026
Thank you very much for the additional analysis, @tstarling - both the reanalysis of enwiki and digging into the nlwiktionary example.
For MediaWiki and Shellbox, there are a couple of steps involved. Here's a quick overview of what I'm imagining.
Under the hood, helmfile apply will call helm diff upgrade, which (1) communicates to helmfile whether there is in fact a diff (exit code) and (2) reports said diff via stdout for helmfile to in turn display to the user.
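The two-channel pattern described here (exit code for "is there a diff", stdout for the diff itself) can be sketched as follows. This is an illustration only, not helmfile's implementation: the `run_diff` helper is hypothetical, and the exit-code convention (0 for no changes, 2 for changes, anything else an error) is an assumption made for this sketch. A small Python one-liner stands in for the real diff command so the example runs anywhere.

```python
import subprocess
import sys
from typing import List, Tuple

# Assumed convention for this sketch: 0 = no changes, 2 = changes found,
# any other code = error.
NO_DIFF, HAS_DIFF = 0, 2


def run_diff(argv: List[str]) -> Tuple[bool, str]:
    """Run a diff command; return (has_diff, diff_text) or raise on error."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    if proc.returncode == NO_DIFF:
        return False, proc.stdout
    if proc.returncode == HAS_DIFF:
        return True, proc.stdout
    raise RuntimeError(
        f"diff command failed ({proc.returncode}): {proc.stderr.strip()}"
    )


# Stand-in for the real diff command (hypothetical output).
fake_diff = [sys.executable, "-c",
             "import sys; print('+ replicas: 3'); sys.exit(2)"]
has_diff, text = run_diff(fake_diff)
if has_diff:
    # The caller relays the diff to the user, as described above.
    print(text, end="")
```

Keeping the "is there a diff" signal in the exit code means the caller never has to parse the diff text to decide whether to proceed.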
Apr 29 2026
Ah, got it - thanks, @JMeybohm. For some reason, I thought there was something else beyond the dot-suffixing being proposed. In any case, +1 to starting with just the dot-suffixing, as it's clearly beneficial and implies minimal behavior change.
This is a challenging action item to make actionable, given the generic problem statement and level of detail available in the linked doc.
Apr 28 2026
Thanks for updating the fileset @jcrespo.
Very nice description of the problem statement in T371069#11853857 @RLazarus.
Apr 27 2026
I was able to do some testing today with the new 3.4.13-6+deb11u1~wmf13u1 package (basic functionality, mixed bullseye / trixie cluster compatibility, etc.) and everything seems to work as expected.
Thank you very much, @MoritzMuehlenhoff. I'll test the new package early this week.
Apr 23 2026
Great, thanks @JMeybohm - Do you have a sense of what the highest-priority changes to the mesh configuration may be? e.g., some combination of dns_refresh_rate and respect_dns_ttl?
Thank you both! Optimistically moving this to "this Q" given the relative implementation cost vs. benefit.
That's amazing - thank you very much, @MoritzMuehlenhoff! Yes, if you could import those into a dedicated component (component/zookeeper34 sounds good), that would be perfect.
Apr 20 2026
@AnnieKim_WMDE - Please see https://wikitech.wikimedia.org/wiki/SRE/Production_access#Access_Request_Process for details on the information you will need to provide in the task description. I've included two TODOs there for you to insert your new production public key and record the specific level of access you are now requesting.
Apr 16 2026
It seems a little surprising to me that we've not already excluded this path, so I'm wondering if there's some historical context that led to this being a conscious / intentional choice. I'll ask around in the meantime.
Thanks for raising this.
@Clement_Goubert @JMeybohm - Does this sound reasonable to you? I think it should be relatively low effort (e.g., exporter rule -> task-severity alert), and is a surprising enough gap in our monitoring that it probably makes sense to prioritize soon (i.e., errorlog is basically /dev/null).
@JMeybohm - Do you think this is something we could make meaningful progress on this quarter?
Apr 15 2026
Thanks, @Blake! Two thoughts:
Alright, for now we're in a holding pattern until we decide how to approach the remaining item in the task description - i.e., improving the interaction between unreachable Cassandra hosts, gocql client session initialization, and service initialization (i.e., liveness and readiness).
Apr 14 2026
@MLechvien-WMF - I've updated the task description to capture the discussion here. My vote is that we get the "near-term mitigation" work done this quarter, while the issue is fresh in our minds.
Apr 13 2026
Thanks for the report, @MarcoSwart, and for confirming that this appears similar to how this issue manifests, @matej_suchanek.
@Eevans - Could I ask you to pick up the documentation change for Cassandra host turn-up?
So, once the external-services network policy changes were applied, the crash-looping pod in editor-analytics was able to start successfully.
Plot twist:
I've verified that manually deleting an editor-analytics pod in staging will trigger crash looping, and then setting initialDelaySeconds on the liveness probe (in this case 40s) will resolve it.
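A simplified model shows why a slow-starting pod can crash-loop without an initial delay and recover with one. The numbers are assumptions for illustration: the 45 s startup time, 10 s probe period, and failure threshold of 3 are hypothetical, and only the 40 s `initialDelaySeconds` comes from the test above. The `survives_startup` function is likewise a hypothetical helper, not part of any Kubernetes API, and it ignores details such as probe timeouts.

```python
def survives_startup(startup_s: float, initial_delay_s: float,
                     period_s: float = 10, failure_threshold: int = 3) -> bool:
    """Return True if the container becomes healthy before the liveness
    probe fails failure_threshold times in a row (simplified model)."""
    failures = 0
    t = initial_delay_s  # time of the first probe
    while failures < failure_threshold:
        if t >= startup_s:
            return True  # probe succeeds: the service is up by time t
        failures += 1
        t += period_s
    return False  # threshold reached: the container would be restarted


STARTUP_S = 45  # hypothetical time the service needs before it can respond

without_delay = survives_startup(STARTUP_S, initial_delay_s=0)
with_delay = survives_startup(STARTUP_S, initial_delay_s=40)
print("no initial delay:", without_delay)
print("40s initial delay:", with_delay)
```

In this model, each restart resets the clock, so a container that cannot finish starting within the probe budget never gets the chance to, which is the crash loop.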
Apr 10 2026
Many thanks @Clement_Goubert and @JMeybohm. I've opened the two follow-up tasks and we can shift further discussion there.
Similar to T422955, if this sounds reasonable, let's try to schedule it for this quarter.
@MLechvien-WMF - Yes, exactly. My plan is for the new conf* hosts being racked in codfw to run Trixie from day 1, which will pave the way for upgrading eqiad to Trixie as well (T419212#11703456). The actual process is going to be rather involved, and I'll need to spend a bit of time to iron out the procedure first.
Moving this to Needs Info while we converge on whether this sounds reasonable. If it does, I'd propose we schedule it for this quarter.
Apr 9 2026
Ah, thanks for highlighting that @JMeybohm!
Thanks, @Blake. Yes, excluding the apus service from switchover day 1 seems like the right approach for now given the special handling required (by analogy with other services we've excluded).