In T418521#11807710, @jnuche wrote:
Yesterday
Fri, Apr 10
Thu, Apr 9
As a first step, I'll test how risky the v1.28 -> v1.31 jump is, both locally and on catalyst-dev.
jnuche moved T400077: Upgrade K3s cluster to most recent stable version from Ready to In progress on the Catalyst (Luka Ijo Pimeja Jan) board.
jnuche closed T421181: patchdemo staging: Creation of new demos with catalyst backend fails on connecting to db as Resolved.
I can't reproduce the issue on staging anymore after my changes. No error during the creation of ~20 envs.
Wed, Apr 8
jnuche moved T400077: Upgrade K3s cluster to most recent stable version from Backlog to Ready on the Catalyst (Luka Ijo Pimeja Jan) board.
Tested on catalyst-dev:
- Created a new pod "A"
- Used the new service to create backups on all hosts
- Created a new pod "B"
- Messed up the cluster by deleting several cluster directories on two of the nodes, including /mnt/k3s-data/k3s/data and /mnt/k3s-data/k3s/server on the primary host
- Stopped all K3s systemd services across the cluster
- Rsync'd back from the backups on all hosts
- Restarted systemd services
- Cluster is healthy: Pod "A" is back and pod "B" is not
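The recovery sequence above can be sketched as an ordered command plan. This is a minimal sketch, not the actual Catalyst tooling: the host names, the backup path `/srv/k3s-backup`, the use of `ssh`, the `k3s`/`k3s-agent` unit names per role, and the choice to start servers before agents are all illustrative assumptions. Only the `/mnt/k3s-data/k3s` data directory comes from the notes above. The function only builds the commands and does not execute anything.

```python
from typing import Dict, List

DATA_DIR = "/mnt/k3s-data/k3s"   # data directory mentioned in the test notes
BACKUP_DIR = "/srv/k3s-backup"   # hypothetical backup location


def restore_plan(hosts: Dict[str, str]) -> List[List[str]]:
    """Build the stop -> rsync -> start command sequence for a cluster.

    `hosts` maps a host name to its role, either "server" or "agent".
    """
    units = {"server": "k3s", "agent": "k3s-agent"}  # assumed systemd unit names
    plan: List[List[str]] = []
    # 1. Stop K3s everywhere before touching any data directory.
    for host, role in hosts.items():
        plan.append(["ssh", host, "sudo", "systemctl", "stop", units[role]])
    # 2. Restore each host's data directory from its own backup.
    for host in hosts:
        plan.append(["ssh", host, "sudo", "rsync", "-a", "--delete",
                     f"{BACKUP_DIR}/", f"{DATA_DIR}/"])
    # 3. Start servers first, then agents (an assumed ordering).
    for role in ("server", "agent"):
        for host, host_role in hosts.items():
            if host_role == role:
                plan.append(["ssh", host, "sudo", "systemctl", "start", units[role]])
    return plan


if __name__ == "__main__":
    for cmd in restore_plan({"k3s-primary": "server", "k3s-worker-1": "agent"}):
        print(" ".join(cmd))
```

Restoring every host to its backed-up state is consistent with the observed result: pod "A" existed when the backups were taken and comes back, while pod "B" was created afterwards and does not.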
jnuche closed T419580: Disaster recovery for k8s upgrade, a subtask of T400077: Upgrade K3s cluster to most recent stable version, as Resolved.
jnuche moved T419580: Disaster recovery for k8s upgrade from In progress to Done on the Catalyst (Luka Ijo Pimeja Jan) board.
Tue, Apr 7
jnuche added a comment to T422455: Massive increase in "EtcdConfig failed to fetch data: Timeout was reached" warnings and errors since March 17th.
- We could try temporarily reverting (at least group1 and group2) to php-1.46.0-wmf.21 to confirm the correlation described above. This may of course end badly if anyone has already assumed .22 is rollback-safe and made dependent changes.
That could be problematic at this point. A quick glance already shows backports for 1.46.0-wmf.22, e.g. https://gerrit.wikimedia.org/r/c/mediawiki/skins/MinervaNeue/+/1267214 or https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1268281. The latter fixes a mobile login bug (T422320).
Thu, Apr 2
jnuche updated the task description for T422154: Error streaming logs: A chunk passthru must yield an "isFirst()" chunk before any content chunk..
jnuche updated the task description for T422112: PHP Warning: Trying to access array offset on null.
jnuche lowered the priority of T422027: TypeError: MediaWiki\Api\ApiAuthManagerHelper::formatMessage(): Argument #3 ($message) must be of type MediaWiki\Message\Message, null given from Unbreak Now! to Needs Triage.
I've backported the fix. Thank you for the patch @matmarex
Wed, Apr 1
jnuche closed T421988: Failing deployment checks: URLs in Location header expected to be absolute, but relative found as Resolved.
Tests now passed:
13:02:05 Started check-testservers
13:02:05 Executing check 'check_testservers_k8s-1_of_2'
13:02:05 Executing check 'check_testservers_k8s-2_of_2'
13:02:28 Finished check-testservers (duration: 00m 22s)
Thank you!
jnuche added a comment to T421988: Failing deployment checks: URLs in Location header expected to be absolute, but relative found.
In T421988#11777517, @SomeRandomDeveloper wrote: (AFAICS the tests live in the puppet repo as well: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/abc1f4f57135615c2aef77c793f34a21c58fd199/modules/profile/files/httpbb/appserver)
jnuche added a comment to T421988: Failing deployment checks: URLs in Location header expected to be absolute, but relative found.
In T421988#11777426, @matmarex wrote: It was an intended change. Can you point me to where these tests live, so I can update them?
However, note that the header for es.wikibooks.org seems to be an absolute URL missing the schema
That's expected, it's a protocol-relative URL. We only serve wikis over HTTPS these days, but we still have a lot of legacy configs from back in the day when we supported HTTP too.
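To illustrate the distinction the deployment check is drawing, here is a small sketch that classifies a `Location` value as absolute, protocol-relative, or relative, and shows how a client resolves a protocol-relative reference against the request URL. The helper name `classify_location` and the example path are made up for illustration; this is not the actual httpbb test logic.

```python
from urllib.parse import urljoin, urlsplit


def classify_location(value: str) -> str:
    """Classify a Location header value by which URL parts it carries."""
    parts = urlsplit(value)
    if parts.scheme and parts.netloc:
        return "absolute"
    if parts.netloc:
        return "protocol-relative"  # starts with "//": host but no scheme
    return "relative"


if __name__ == "__main__":
    request_url = "https://es.wikibooks.org/"      # illustrative request URL
    location = "//es.wikibooks.org/wiki/Portada"   # illustrative header value
    print(classify_location(location))
    # The client resolves the reference against the URL it requested,
    # so the scheme is inherited from the request.
    print(urljoin(request_url, location))
```

Because the scheme is inherited from the request, a protocol-relative redirect reached over HTTPS stays on HTTPS.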
jnuche renamed T421988: Failing deployment checks: URLs in Location header expected to be absolute, but relative found from Failing deployment checks: URLs in Location header exepcted to be absolute, but relative found to Failing deployment checks: URLs in Location header expected to be absolute, but relative found.
jnuche removed a parent task for T421828: PHP Warning: Undefined array key "user_identifier_type": T420480: 1.46.0-wmf.22 deployment blockers.
Tue, Mar 31
jnuche lowered the priority of T421828: PHP Warning: Undefined array key "user_identifier_type" from Unbreak Now! to Needs Triage.
In T421828#11773269, @cjming wrote: backported UBN fix to wmf.22 and it appears errors are dropping 😌
- mwversion: 1.46.0-wmf.22
- timestamp: 2026-03-31T07:54:04.828Z
- phpversion: 8.3.30
- reqId: 95100eec-b69f-4dc3-8c79-e3d42a5124ae
- Find reqId in Logstash
- mwversion: 1.46.0-wmf.22
- timestamp: 2026-03-31T07:50:02.298Z
- phpversion: 8.3.30
- reqId: 2d0e0e52-7f9c-4a9b-8e06-cca6d99b99b4
- Find reqId in Logstash
jnuche updated the task description for T421828: PHP Warning: Undefined array key "user_identifier_type".
jnuche added a parent task for T421828: PHP Warning: Undefined array key "user_identifier_type": T420480: 1.46.0-wmf.22 deployment blockers.
jnuche triaged T421828: PHP Warning: Undefined array key "user_identifier_type" as Unbreak Now! priority.
Fri, Mar 27
jnuche moved T421394: Patchdemo timing out while streaming env creation logs from In progress to Done on the Catalyst (Luka Ijo Pimeja Jan) board.
Thu, Mar 26
jnuche updated the task description for T421394: Patchdemo timing out while streaming env creation logs.
jnuche set the point value for T421394: Patchdemo timing out while streaming env creation logs to 1.
jnuche moved T421394: Patchdemo timing out while streaming env creation logs from Backlog to In progress on the Catalyst (Luka Ijo Pimeja Jan) board.
jnuche renamed T421394: Patchdemo timing out while streaming env creation logs from Patchdemo timing out while stream env creation logs to Patchdemo timing out while streaming env creation logs.
Fri, Mar 20
jnuche moved T420596: Fix access check for multiple env deletion endpoint from In progress to Done on the Catalyst (Luka Ijo Pimeja Jan) board.
Fix is in production
Thu, Mar 19
jnuche moved T420596: Fix access check for multiple env deletion endpoint from Backlog to In progress on the Catalyst (Luka Ijo Pimeja Jan) board.
jnuche added a comment to T406850: TypeError: trim(): Argument #1 ($string) must be of type string, MediaWiki\Extension\Math\WikiTexVC\Nodes\Fun1nb given.
@jnuche would you be able to review the suggested change?
Wed, Mar 18
The team went with the MariaDB operator as a replacement.
Possibly related to T419092
Tue, Mar 17
Disk quotas for both catalyst and catalyst-dev will need to be raised by 760GB from the current 1200GB to a total of 1520GB
jnuche moved T419580: Disaster recovery for k8s upgrade from Backlog to In progress on the Catalyst (Luka Ijo Pimeja Jan) board.
CI pipelines from Abstract Wikipedia now have a limit of 15 non-deleted environments.
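A limit like this can be pictured as a simple count of non-deleted environments per creator. The sketch below is only an illustration of that idea: the `Env` fields, the `can_create` helper, and the owner names are hypothetical and do not reflect the Catalyst API's actual data model.

```python
from dataclasses import dataclass
from typing import Iterable

CI_ENV_LIMIT = 15  # limit stated in the comment above


@dataclass
class Env:
    owner: str     # hypothetical field: which pipeline created the environment
    deleted: bool  # hypothetical field: whether the environment was deleted


def can_create(envs: Iterable[Env], owner: str, limit: int = CI_ENV_LIMIT) -> bool:
    """Return True if `owner` is still below its non-deleted environment limit."""
    active = sum(1 for e in envs if e.owner == owner and not e.deleted)
    return active < limit
```

Deleted environments do not count toward the limit, so a pipeline that cleans up after itself can keep creating new ones.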
jnuche closed T417304: Put a limit on demos created by ci, a subtask of T419188: Use Catalyst API to deploy Ultraviolet in CI for E2E testing, as Resolved.
jnuche moved T417304: Put a limit on demos created by ci from In progress to Done on the Catalyst (Luka Ijo Pimeja Jan) board.
Mar 13 2026
As an interesting note, here's a histogram of Catalyst environment usage by CI over time.
All of our MariaDB databases have now been migrated away from bitnami.
jnuche closed T408115: Migration to MariaDB operator: Shared environment DB, a subtask of T408114: Migrate bitnami MariaDB charts to MariaDB operator, as Resolved.
jnuche moved T408115: Migration to MariaDB operator: Shared environment DB from In progress to Done on the Catalyst (Luka Ijo Pimeja Jan) board.
Mar 11 2026
We know from T405224 that we can recover the entire cluster from the K3s data volumes. Additionally, the folks over at Cloud have told us in the past that it's not possible to automate/schedule the creation of OpenStack snapshots natively.
Mar 10 2026
As usual, we will announce a maintenance window for the migration.
Mar 9 2026
jnuche moved T408115: Migration to MariaDB operator: Shared environment DB from Backlog to In progress on the Catalyst (Luka Ijo Pimeja Jan) board.
Mar 5 2026
jnuche moved T417455: Blast of broken pipe errors from catalyst-api container log stream from Backlog to Done on the Catalyst (Luka Ijo Pimeja Jan) board.
Operation times in production have significantly improved and gone back to the performance levels we had by the end of November 2025.
jnuche moved T417304: Put a limit on demos created by ci from Ready to In progress on the Catalyst (Luka Ijo Pimeja Jan) board.
jnuche moved T417689: Investigate causes of Catalyst slowness from In progress to Done on the Catalyst (Luka Ijo Pimeja Jan) board.
Mar 4 2026
jnuche closed T417455: Blast of broken pipe errors from catalyst-api container log stream as Resolved.
jnuche set the point value for T417455: Blast of broken pipe errors from catalyst-api container log stream to 1.
Feb 24 2026
jnuche moved T418230: Deploy second worker to catalyst-dev project from Backlog to Done on the Catalyst (Luka Ijo Pimeja Jan) board.
Feb 23 2026
Similarly to what happened with secrets, recreated envs have been leaving a significant number of replica sets behind:
kubectl -n cat-env get rs | grep 750c4d946d
wiki-750c4d946d-3895-mediawiki-66bccd4d77 0 0 0 15d
wiki-750c4d946d-3895-mediawiki-8669b898df 0 0 0 15d
wiki-750c4d946d-3895-mediawiki-668466c6b8 0 0 0 14d
wiki-750c4d946d-3895-mediawiki-c556c45c5 0 0 0 14d
wiki-750c4d946d-3895-mediawiki-76c5fb6d87 0 0 0 14d
wiki-750c4d946d-3895-mediawiki-6968d99c4d 0 0 0 14d
wiki-750c4d946d-3895-mediawiki-6b5b86ff6f 0 0 0 14d
wiki-750c4d946d-3895-mediawiki-68cd949987 0 0 0 14d
wiki-750c4d946d-3895-mediawiki-75749d974b 0 0 0 14d
wiki-750c4d946d-3895-mediawiki-5d46bb4dc5 0 0 0 14d
wiki-750c4d946d-3895-mediawiki-68c5994db9 1 1 1 11d
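As a rough way to quantify the leftovers, the listing above can be parsed to pick out replica sets scaled to zero. This is a sketch that assumes the five-column `kubectl get rs` layout (NAME, DESIRED, CURRENT, READY, AGE); the function name is made up. Kubernetes retains old replica sets for rollback according to a Deployment's `revisionHistoryLimit`, but the notes do not say whether that setting is involved here.

```python
from typing import List

# Rows copied from the `kubectl get rs` output above.
SAMPLE = """\
wiki-750c4d946d-3895-mediawiki-66bccd4d77 0 0 0 15d
wiki-750c4d946d-3895-mediawiki-8669b898df 0 0 0 15d
wiki-750c4d946d-3895-mediawiki-668466c6b8 0 0 0 14d
wiki-750c4d946d-3895-mediawiki-c556c45c5 0 0 0 14d
wiki-750c4d946d-3895-mediawiki-76c5fb6d87 0 0 0 14d
wiki-750c4d946d-3895-mediawiki-6968d99c4d 0 0 0 14d
wiki-750c4d946d-3895-mediawiki-6b5b86ff6f 0 0 0 14d
wiki-750c4d946d-3895-mediawiki-68cd949987 0 0 0 14d
wiki-750c4d946d-3895-mediawiki-75749d974b 0 0 0 14d
wiki-750c4d946d-3895-mediawiki-5d46bb4dc5 0 0 0 14d
wiki-750c4d946d-3895-mediawiki-68c5994db9 1 1 1 11d
"""


def stale_replica_sets(output: str) -> List[str]:
    """Return the names of replica sets whose DESIRED count is 0."""
    stale = []
    for line in output.splitlines():
        fields = line.split()
        # Expect NAME DESIRED CURRENT READY AGE; skip anything else.
        if len(fields) == 5 and fields[1] == "0":
            stale.append(fields[0])
    return stale


if __name__ == "__main__":
    print(len(stale_replica_sets(SAMPLE)))  # 10 of the 11 rows are scaled to zero
```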
Feb 20 2026
Some of the helm commands had started slowing to a crawl; as it turns out, we had 912 secrets in the cat-env namespace. The majority of those originated in helm history revisions. Many of those revisions could be safely removed, and after doing so helm commands became noticeably more responsive. Env creation times have also gone back down to the levels we had a few months ago.
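For context, Helm 3 records each release revision as a secret named `sh.helm.release.v1.<release>.v<revision>`, which is how history can inflate the secret count. The sketch below groups such names by release and lists the revisions older than the newest few. The `prunable_revisions` helper and the default of keeping three revisions are assumptions for illustration, not the cleanup that was actually run.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Naming scheme Helm 3 uses for release revision secrets.
HELM_SECRET = re.compile(r"^sh\.helm\.release\.v1\.(?P<release>.+)\.v(?P<rev>\d+)$")


def prunable_revisions(secret_names: Iterable[str], keep: int = 3) -> Dict[str, List[int]]:
    """Map each release to the revisions older than its newest `keep`."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    revisions: Dict[str, List[int]] = defaultdict(list)
    for name in secret_names:
        match = HELM_SECRET.match(name)
        if match:  # ignore secrets that are not Helm release records
            revisions[match["release"]].append(int(match["rev"]))
    return {
        release: sorted(revs)[:-keep]
        for release, revs in revisions.items()
        if len(revs) > keep
    }
```

The input could come from something like `kubectl -n cat-env get secrets -o name` with the `secret/` prefix stripped. Helm's `--history-max` option on `helm upgrade` caps the number of stored revisions going forward; the notes do not say whether it is used here.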
Feb 19 2026
