User Details
- User Since
- Jan 18 2024, 5:33 PM (107 w, 1 d)
- Availability
- Available
- LDAP User
- Scott French
- MediaWiki User
- SFrench-WMF
Today
So, I'd say the main problem is really that we've introduced tight coupling between releases, which makes bootstrapping challenging since it forces sequencing. Investing in loosening that coupling (best possible solution), or ensuring that the appropriate tooling understands those constraints, seems like the right path here.
Wed, Feb 4
Roughly 9h after switching back to node 22 with --max-old-space-size=4096 and --max-semi-space-size=16, we're seeing some interesting results:
Tue, Feb 3
Yes, I believe we can mark this resolved, now that the feature is in wider use and seems to be working as expected.
Alright, I've updated the task description on T405703: Update wikikube eqiad to kubernetes 1.31 to reflect two points:
- During the "Deploy mediawiki" phase, the sequencing constraints we've discussed here, together with example commands for bringing up the support releases.
- During the "Deploy all the services" phase, charlie in its current form will operate on all mediawiki services as well, which is probably not what we want in practice if we want to do that via scap (or, if we do want to operate on them in this phase, that should be possible if we move the support-release bring-up earlier to ensure it happens first).
Mon, Feb 2
Thank you, @elukey!
Thanks for the additional investigation, @Jgiannelos.
Fri, Jan 23
I've now also merged T406392, for the same reason.
Since this is fundamentally the same class of failure mode as already tracked in T390251, I am going to duplicate this into the latter as canonical.
Thu, Jan 22
A couple of hours after @Jgiannelos set --max-old-space-size (and deployed the new node 22-based image), we're seeing cyclic latency excursions as measured from Envoy's view on the Wikifeeds side, which again seem to correlate with bumps in CPU and memory usage (note: these are totals, not per-pod behavior).
I've merged T412265: Pushing to the docker registry fails with 500 Internal Server Error into this task, as we believe it's another manifestation of the same class of failure modes discussed here.
Since this is fundamentally the same class of failure mode as already tracked and reported in T390251, I am going to duplicate this into the latter as canonical.
@Urbanecm_WMF - Could you please confirm whether #1 from T398592#11539714 is correct or not? We'd like to confirm that the immediate-term need has been met. Longer term, we would try to prioritize #2 once that functionality exists in the relevant maintenance scripts. Please unassign yourself once you've responded.
Wed, Jan 21
Ah, of course ... the "inner" commitPrimaryChanges call (i.e., from within WebVideoTranscodeJob::actuallyRun) raises DBTransactionError due to the name mismatch. This is then caught in run, which returns (rather than re-raising), while trxRoundStage is left in a bad state, leading to the effect seen here. Good find!
Revisiting this, I believe we understand what happened. What remains to be decided is what we plan to do about it, if anything.
I suspect this need is now met by Charlie, but am not certain.
Speculatively moving this to Scheduled, since it would be good to make the respective documentation changes prior to the upcoming switchover.
Triaging as "Low" since, in practice, the main issue we've run into historically is DNS hosts that do not respond at all, which is (now) addressed by setting proper query timeouts. Also moving to backlog.
Reopening, since we'll likely need to make changes to sre.discovery.datacenter to adopt the functionality discussed in T375014 once it lands in Spicerack. I'm making that dependency explicit now, and will update the description shortly.
Confirmed that transcodes are completing once again following the rollback.
From a quick scan of the code, we're somehow entering the commitPrimaryChanges call right after WebVideoTranscodeJob::run returns in JobExecutor::execute while still in LBFactory::ROUND_COMMITTING. That suggests a call to commitPrimaryChanges from within WebVideoTranscodeJob threw before resetting `trxRoundStage`, but was caught.
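To make the hypothesized sequence concrete, here's a minimal Python sketch of the pattern (the class and method names are illustrative, not MediaWiki's actual code): an inner commit throws mid-round, the caller swallows the exception, and the next commit then trips over the stale round stage.

```python
class TransactionError(Exception):
    pass

class RoundState:
    """Toy stand-in for a transaction-round state machine."""
    def __init__(self):
        self.stage = "within-round"

    def commit(self):
        if self.stage != "within-round":
            raise TransactionError(f"commit called in stage {self.stage!r}")
        self.stage = "committing"
        self._apply()        # raises mid-commit...
        self.stage = "idle"  # ...so the stage is never reset

    def _apply(self):
        raise TransactionError("name mismatch during inner commit")

state = RoundState()

# The job's run() catches the error and returns instead of re-raising:
try:
    state.commit()
except TransactionError:
    pass  # swallowed; state.stage is stuck at "committing"

# The executor's outer commit then fails on the stale stage:
second_error = None
try:
    state.commit()
except TransactionError as e:
    second_error = str(e)
print(second_error)
```

The key point is that the first exception is caught but the state transition it interrupted is never rolled back, so the failure only surfaces one call later, in a different caller.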
Following up from T415169, I suspect video transcodes are no longer completing due to this error. Can we roll back wmf.12 on group 0 (i.e., get commons back to wmf.11)?
These are indeed all WebVideoTranscodeJob. From a quick spot-check of Special:NewFiles on commons, together with the videoscaling error rates over the last 12h, I don't think transcodes are succeeding.
Updated the task description to reflect the point about the hardware refresh (and mark the gutter pool complete). This will still need input from @jijiki on the feasibility of substantially completing the entire scope of work in Q3.
Tue, Jan 20
My understanding is that the discussion here has converged on a "sketch of an implementation" - i.e., expose a namespace-to-team mapping by way of kube-state-metrics, which is then available to join with for alerts that require team-level (non-default) routing. However, the former is currently blocked on T303744.
It seems like there are two different time horizons to this task, as initially framed:
- In the near term, restoring the earlier behavior on mwmaint hosts (i.e., Problem #4) as an error-handling model that may be more appropriate for Growth-Team's use case. My understanding is that this is possible using FOREACHWIKI_IGNORE_ERRORS per T398592#10971912, but please correct me if I'm wrong here.
- Introducing a more sophisticated error-handling model that, for example, allows different error conditions / exceptions to surface as different exit status codes (i.e., changes to the maintenance scripts themselves and / or the abstractions they're built on), with accompanying support in mw-cron for status-dependent handling (Problem #3).
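As a rough illustration of the longer-term model, here's a hypothetical Python sketch of mapping exception classes to distinct exit codes so a scheduler can react differently to transient vs. fatal failures. The class names and code values are illustrative only, not part of any existing MediaWiki or mw-cron API.

```python
import sys

class TransientError(Exception):
    """Worth retrying (e.g., replica lag, timeout)."""

class FatalError(Exception):
    """Not worth retrying (e.g., bad arguments)."""

# Illustrative codes, loosely following BSD sysexits.h conventions.
EXIT_CODES = {
    TransientError: 75,  # cf. EX_TEMPFAIL
    FatalError: 64,      # cf. EX_USAGE
}

def run_script(main):
    """Run a maintenance-script entry point, translating known
    exceptions into distinct exit statuses for the scheduler."""
    try:
        main()
    except tuple(EXIT_CODES) as e:
        return EXIT_CODES[type(e)]
    return 0

def flaky():
    raise TransientError("replica lag too high")

status = run_script(flaky)
print(status)
```

A cron-side policy could then, for instance, retry on 75 but alert and stop on 64, rather than treating every nonzero exit uniformly.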
@jijiki - Is this something you anticipate we might be picking back up in Q3? Also, is there anything blocking this other than just finding time? (I see the gutter-pool change was merged and doesn't appear to have been reverted.)
Thank you very much @elukey - that's great news!
@MLechvien-WMF - Thanks for checking. Agreed with Timo in T384294#10494399 that this should wait until we're on PHP 8.4 or later, which is when the re-written JIT compiler lands. That means some time during / after H1 FY26-27. I've updated the task description to highlight this and am moving it to backlog.
Thu, Jan 15
Wed, Jan 14
This happened in the patch series starting at https://gerrit.wikimedia.org/r/c/mediawiki/libs/Shellbox/+/1211261.
Tue, Jan 13
@Clement_Goubert @daniel - If you could provide more detail on sizing, timing, and priority at your convenience, that would be greatly appreciated.
Sun, Jan 11
@Benwing2 - Thanks for calling our attention to the Retry-After response header format issue. We've made a change that we believe should address this, which should be live everywhere as of roughly 20:30 UTC today. Please let us know if you continue to see the unexpected float-like format.
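For context on the class of fix involved: Retry-After must be either an HTTP-date or an integer number of seconds (RFC 7231 §7.1.3), so a float-like value such as "1.5" is invalid and can trip strict clients. A minimal sketch of spec-compliant rendering (illustrative only, not the deployed change) is:

```python
import math

def retry_after_header(delay_seconds: float) -> str:
    """Render a delay as RFC 7231 delta-seconds: a non-negative
    integer, rounding up so the intended back-off is preserved."""
    return str(max(0, math.ceil(delay_seconds)))

print(retry_after_header(1.5))  # "2"
print(retry_after_header(0.2))  # "1"
```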
Fri, Jan 9
After some discussion on https://gerrit.wikimedia.org/r/1224041 and a chat earlier on Thursday with @Blake, I wanted to follow up here about the state of the sre.switchdc.services cookbook and deprecation of the EXCLUDED_SERVICES constant.
Dec 19 2025
I've made a handful of (mostly minor) additional revisions to the SRE-driven items in the checklist:
- (diff) Preparation for WMF: Extend SRE tasks with 8.3 migration examples
- (diff) Preparation for WMF: Add note about coordination with MW for PHP extension versions
- This makes explicit that although SRE drives the package build process, there is some amount of coordination with MediaWiki Engineering to review the selected PHP extension versions (and more rarely, assist with migrating to appropriate alternatives when an extension is no longer supported, as in the case of tideways this time around).
- (diff) Preparation for WMF: Explicitly refer to maintenance scripts to avoid possible confusion with mwscript
- (diff) Rollout: Add context on rollout sequence, references to T405955 and hypothetical schedule sheet
- This adds a bit more detail on the high-level structure of the production rollout and links example artifacts from the 8.3 migration (e.g., the rollout task that reflects "what happened" and a sheet containing a hypothetical progressive rollout schedule).
- (diff) Post-rollout: Add previous examples for SRE items and note that title-case mapping cleanup may be skipped if none was necessary
- (diff) Preparation for WMF: Add note that title-case mapping may not be needed in some cases
Dec 18 2025
Many thanks for investigating this @MatthewVernon and @elukey.
@thcipriani - Thanks for pulling together T412265#11471277. Indeed, your understanding here is correct.
Dec 5 2025
Thank you for taking care of the mediawiki-dumps-legacy toolbox release, @BTullis.
Dec 4 2025
Summarizing discussion with @BTullis:
T406392 is a good reminder of the fact that the bandaids we might otherwise fall back on (e.g., sleeps, internal retries) are not available in all contexts, so even though we've largely focused on the MediaWiki image use case here, we really need a more systematic solution.
Dec 3 2025
Following up on T405955#11408087, I stand corrected - the reference to mediawiki-multiversion-cli:2025-07-23-203525-publish-81 is indeed in use: it's referenced by the mediawiki-dumps-legacy "resources" helmfile release, which includes the "toolbox" deployment and does not inherit the scap-managed overrides.
Although there may be some further cleanups to remove the now-unused certificates, no further action is planned as part of this task.
Dec 2 2025
Thanks for the heads-up, @Marostegui.
Alright, that should be everything. All configcluster hosts have had etcd migrated to use cfssl-based PKI, which should unblock migration to Puppet 7.