Upgrade httpd images to bullseye or bookworm
Closed, ResolvedPublic

Description

Background

The httpd production images, along with the dependent httpd-fcgi and mediawiki-httpd images, are currently based on buster (apache2 2.4.59).

Since upgrading to either bullseye or bookworm gives us apache2 2.4.62, there's no strong argument to step to bullseye first, so let's aim for bookworm.

For PHP services, we can use this as a comparatively low-risk test for parts of the procedure we'll use to migrate the app images to 8.1 (e.g., for MediaWiki, it could be used to validate scap's ability to build multiple image "flavors" from different base images, once available).

Edit: We did not end up using this to test multi-flavor builds, and instead deferred until after the 8.1 migration.

In any case, to do this in a controlled way, we'll need buster- and bookworm-based production images to coexist for a time. I'd propose we do this by introducing a transitional "bookworm" track for the three production images, which would later merge back into the existing ones (which would really just be changing the Dockerfile template in httpd and bumping the changelogs).
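A minimal sketch of the naming scheme this implies, assuming the transitional track is expressed purely as a "-bookworm" suffix on the three image names. The helper names are hypothetical and exist only for illustration.

```python
# Hypothetical helpers illustrating the transitional "-bookworm" image track.
# The three base image names come from the task description; everything else
# here is an assumption made for the sake of the example.
BASE_IMAGES = ("httpd", "httpd-fcgi", "mediawiki-httpd")
TRANSITIONAL_SUFFIX = "-bookworm"


def transitional_name(image):
    """Name of the bookworm-track image that coexists with the buster one."""
    return image + TRANSITIONAL_SUFFIX


def merged_name(image):
    """Name to return to once the plain image has been rebased on bookworm."""
    if image.endswith(TRANSITIONAL_SUFFIX):
        return image[: -len(TRANSITIONAL_SUFFIX)]
    return image


if __name__ == "__main__":
    for base in BASE_IMAGES:
        print(base, "->", transitional_name(base))
```

Consumers would switch to the suffixed name during the transition and switch back after the merge, at which point the suffixed images can be retired.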

The other notable (static content) use case we'll have to coordinate is updating the various miscweb deployments (https://gitlab.wikimedia.org/repos/sre/miscweb).

Migration

As of late June 2025, the migration is in progress. A high-level overview of the process we're using and migration status can be found in T378128#10906441.

Event Timeline

@Scott_French the procedure you outline seems good to me, but I would add, after merging the first patch, a deployment for shellbox. That should give you the first smoke-test of how bookworm's apache works operationally there.

Procedure and images lgtm, let's go! Agreed with @Joe that we can smoke test using shellboxens.

This won't work as moving to Bookworm would require an ICU transition first? We're currently on ICU 67 and Bookworm comes with ICU 72

Given this is just the image running apache, that shouldn't be a problem. PHP is not being executed here.

Or am I missing something?

Ignore, I had misread the task description and assumed this was for main wiki PHP images.

Thanks, all!

Inverting the order and piloting on shellbox early on sounds good. The only downside to that is the necessary change to the chart, but that's really quite easy.

One additional point of note which I forgot to mention: We're already using the target apache2 version (2.4.62), albeit the bullseye build thereof, on the recently reimaged mwdebug hosts. Thus, we do at least have some miles on this (e.g., httpbb checks consistently pass).

Change #1081989 merged by Scott French:

[operations/docker-images/production-images@master] httpd: introduce -bookworm track and cascade

https://gerrit.wikimedia.org/r/1081989

Change #1156354 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] shellbox: define httpd image name in values

https://gerrit.wikimedia.org/r/1156354

Change #1156354 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox: define httpd image name in values

https://gerrit.wikimedia.org/r/1156354

Alright, we now have the ability to override the httpd image name easily. I'd propose we start with a pilot on a single shellbox service in two steps (fraction of traffic -> all traffic), then expand to the remaining services, similar to what we did for the PHP 8.1 migration (although we can and should go much faster here).

Somewhat arbitrarily, we can start with syntaxhighlight, loosely driven by a combination of familiarity (i.e., for troubleshooting), a low-but-consistent baseline request rate (e.g., a fractional pilot actually provides useful signal), and relatively low impact of deployments (e.g., vs. shellbox-video).

Change #1156442 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] shellbox-syntaxhighlight: pilot bookworm-based httpd image (1 replica)

https://gerrit.wikimedia.org/r/1156442

Change #1156443 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] shellbox-syntaxhighlight: migrate to bookworm-based httpd image

https://gerrit.wikimedia.org/r/1156443

Change #1156442 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox-syntaxhighlight: pilot bookworm-based httpd image (1 replica)

https://gerrit.wikimedia.org/r/1156442

Change #1156443 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox-syntaxhighlight: migrate to bookworm-based httpd image

https://gerrit.wikimedia.org/r/1156443

Mentioned in SAL (#wikimedia-operations) [2025-06-17T17:19:17Z] <swfrench-wmf> migrated shellbox-syntaxhighlight to bookworm-based httpd images - T378128

After about an hour of soak with 1 replica per DC on the new httpd images and no issues observed, I've now moved all of syntaxhighlight forward. I've been keeping an eye on general service health in grafana (eqiad, codfw), httpd container logs (manual tailing with kubectl), mediawiki exec-channel errors (logstash), and ShellboxError exceptions (logstash), and will be checking in periodically throughout the day.

Change #1160223 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] shellbox: migrate to bookworm-based httpd image

https://gerrit.wikimedia.org/r/1160223

swfrench opened https://gitlab.wikimedia.org/repos/releng/release/-/merge_requests/184

Draft: make-container-image: introduce webserver-bookworm flavour

Change #1160223 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox: migrate to bookworm-based httpd image

https://gerrit.wikimedia.org/r/1160223

Mentioned in SAL (#wikimedia-operations) [2025-06-18T17:39:31Z] <swfrench-wmf> migrated all shellbox instances to bookworm-based httpd images in codfw - T378128

Mentioned in SAL (#wikimedia-operations) [2025-06-18T17:58:53Z] <swfrench-wmf> migrated all shellbox instances to bookworm-based httpd images in eqiad - T378128

After no issues were uncovered for shellbox-syntaxhighlight with ~ 24h on the new images, the remaining (5) shellbox instances have now been updated as well (staggered by datacenter by ~ 20m). Validating using the same graphs and logs as in T378128#10925040, no issues have been uncovered so far, though again I'll check in periodically throughout the day.

Edit: A couple of hours in, still no issues encountered. In the unlikely event that issues do arise while I'm out tomorrow, a ready-made revert patch can be found in https://gerrit.wikimedia.org/r/1161058.

Change #1162030 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/docker-images/production-images@master] httpd: Rebase on bookworm and cascade

https://gerrit.wikimedia.org/r/1162030

Change #1162036 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/puppet@production] deployment_server: use bookworm httpd in mw-debug/next mw-*/migration

https://gerrit.wikimedia.org/r/1162036

Change #1162962 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] mw-(api-ext|web): pilot 5% of traffic on new httpd images

https://gerrit.wikimedia.org/r/1162962

Change #1162036 merged by Scott French:

[operations/puppet@production] deployment_server: use bookworm httpd in mw-debug/next mw-*/migration

https://gerrit.wikimedia.org/r/1162036

Mentioned in SAL (#wikimedia-operations) [2025-06-23T17:15:06Z] <swfrench@deploy1003> Started scap sync-world: Deploy bookworm httpd images to mw-debug/next - T378128

Mentioned in SAL (#wikimedia-operations) [2025-06-23T17:16:34Z] <swfrench@deploy1003> swfrench: Deploy bookworm httpd images to mw-debug/next - T378128 synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-06-23T17:22:30Z] <swfrench@deploy1003> Finished scap sync-world: Deploy bookworm httpd images to mw-debug/next - T378128 (duration: 08m 00s)

The webserver-bookworm image flavour is now live in mw-debug/next, passing httpbb checks and manual kicking-of-tires by me. No errors / issues surfaced in httpd container logs. None of this is surprising, given that apache 2.4.62 has been live on the mwdebug hosts for some time without issue.

The next step would be to pilot some real production traffic on the new images, ideally in mw-api-ext and mw-web, as they'll see the greatest range of variation (i.e., to flush out edge cases, which are presumably where issues will lurk).

https://gerrit.wikimedia.org/r/1162962 proposes to do just that, directing ~ 5% of traffic to the migration releases, which are now also configured to use webserver-bookworm.
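As a rough illustration of sizing such a pilot, the sketch below assumes traffic is spread evenly across pods, so that a release's share of traffic tracks its share of replicas. That assumption, the function name, and the replica counts are all hypothetical and are not taken from the patch.

```python
# Hypothetical sizing helper: how many replicas the migration release needs to
# receive roughly `fraction` of traffic, ASSUMING traffic is balanced evenly
# across all pods of the main and migration releases combined.
def migration_replicas(total_replicas, fraction):
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Keep at least one replica so the pilot receives some traffic, and never
    # exceed the total.
    return min(total_replicas, max(1, round(total_replicas * fraction)))


if __name__ == "__main__":
    total = 100  # illustrative number, not a real deployment size
    pilot = migration_replicas(total, 0.05)
    print(f"migration: {pilot} replicas, main: {total - pilot} replicas")
```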

Change #1162962 merged by jenkins-bot:

[operations/deployment-charts@master] mw-(api-ext|web): pilot 5% of traffic on new httpd images

https://gerrit.wikimedia.org/r/1162962

Mentioned in SAL (#wikimedia-operations) [2025-06-24T17:11:18Z] <swfrench-wmf> serving ~ 5% of mw-api-ext and mw-web traffic in codfw via bookworm-based httpd image - T378128

Mentioned in SAL (#wikimedia-operations) [2025-06-24T17:25:02Z] <swfrench-wmf> serving ~ 5% of mw-api-ext and mw-web traffic in eqiad via bookworm-based httpd image - T378128

As of ~ 17:30 UTC today, both mw-api-ext and mw-web are serving ~ 5% of traffic via the migration releases, which are in turn using the bookworm webserver image.

No obvious issues surfaced so far, where I'm mainly looking at (1) httpd container logs (e.g., errors / warnings suggesting config compatibility issues), (2) mediawiki error channels (e.g., errors that suggest incorrect rewrites emitted by pods with -migration servergroups), and (3) roughly consistent distribution of non-2xx response status codes between the main and migration releases.
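Check (3) can be approximated mechanically. The following is only a sketch, assuming the response status codes for each release have already been exported as plain lists of integers; the function names and the 1% tolerance are illustrative choices, not part of the actual validation tooling.

```python
from collections import Counter


def status_class_shares(status_codes):
    """Share of responses per status class ('2xx', '4xx', ...)."""
    counts = Counter(f"{code // 100}xx" for code in status_codes)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()} if total else {}


def diverging_classes(main_codes, migration_codes, tolerance=0.01):
    """Non-2xx classes whose share differs by more than `tolerance`.

    `tolerance` is an arbitrary illustrative threshold (absolute difference
    in share), not a value taken from the migration procedure.
    """
    main = status_class_shares(main_codes)
    migration = status_class_shares(migration_codes)
    classes = (set(main) | set(migration)) - {"2xx"}
    return sorted(
        cls
        for cls in classes
        if abs(main.get(cls, 0.0) - migration.get(cls, 0.0)) > tolerance
    )


if __name__ == "__main__":
    main = [200] * 95 + [404] * 4 + [500] * 1
    migration = [200] * 90 + [404] * 4 + [500] * 6
    print(diverging_classes(main, migration))
```

Comparing shares rather than raw counts matters here because the migration releases see only a small fraction of the traffic.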

Rollback: In the event something goes wrong, the rollback procedure is simple: Revert https://gerrit.wikimedia.org/r/1162962 and then helmfile apply both mw-api-ext and mw-web in both codfw and eqiad.

Change #1164236 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/puppet@production] deployment_server: use bookworm httpd everywhere

https://gerrit.wikimedia.org/r/1164236

Change #1164242 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] Revert "mw-(api-ext|web): pilot 5% of traffic on new httpd images"

https://gerrit.wikimedia.org/r/1164242

Change #1164236 merged by Scott French:

[operations/puppet@production] deployment_server: use bookworm httpd in all mediawiki releases

https://gerrit.wikimedia.org/r/1164236

Mentioned in SAL (#wikimedia-operations) [2025-06-26T17:09:22Z] <swfrench@deploy1003> Started scap sync-world: Migrate all mediawiki releases to bookworm httpd images - T378128

Mentioned in SAL (#wikimedia-operations) [2025-06-26T17:10:19Z] <swfrench@deploy1003> swfrench: Migrate all mediawiki releases to bookworm httpd images - T378128 synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-06-26T17:21:58Z] <swfrench@deploy1003> Finished scap sync-world: Migrate all mediawiki releases to bookworm httpd images - T378128 (duration: 13m 01s)

Change #1164242 merged by jenkins-bot:

[operations/deployment-charts@master] Revert "mw-(api-ext|web): pilot 5% of traffic on new httpd images"

https://gerrit.wikimedia.org/r/1164242

As of 17:20 UTC, all mediawiki releases have now migrated to the bookworm-based webserver image.

As before, no notable changes have been observed in the distribution of non-2xx status codes. Similarly, mediawiki error channels do not contain new errors that suggest, e.g., incorrect rewrites. Basic functional testing of cases where stand-alone apache configs are used (e.g., noc.wikimedia.org) also does not surface any issues.

Rollback: In the event something goes wrong, the rollback procedure can be found in https://gerrit.wikimedia.org/r/1164279. Simply merge that patch, and follow the instructions in the commit message.

Scott_French added a subscriber: Jelto.

I was chatting with @Jelto earlier today about migrating miscweb, and it sounds like it should be doable / preferable to migrate in two steps, similar to what we're doing with shellbox and mediawiki - i.e., switch to httpd-bookworm and deploy / verify, then switch back to httpd once the latter has been rebased on bookworm.

This work also overlaps to some degree with T384595: Upgrade Collab hosts to Bookworm. Adding collaboration-services as well.

FYI, I will be out next week and intend to pick this back up when I return. Assuming all goes smoothly with the miscweb migration, the next step is to rebase the "plain" httpd image stack on bookworm via https://gerrit.wikimedia.org/r/1162030 and deprecate the -bookworm track.

All miscweb images are running httpd-bookworm now. From our side the httpd image can be bumped to bookworm.

dancy raised the priority of this task from Medium to Unbreak Now!. Jul 15 2025, 5:16 PM
dancy subscribed.

Since apt-get update no longer works for buster-based images (T397209#11003387), and mediawiki-httpd is still referenced in https://gitlab.wikimedia.org/repos/releng/release/-/blob/main/make-container-image/build-images.py?ref_type=heads#L29, the periodic job which validates that a single-version production image of mediawiki can be built and published is now failing, which is an indication that the mediawiki deployments in production are at risk.

Thanks for the heads-up, Ahmon!

Alright, since nothing references the webserver flavour anymore, it should be safe to simply remove its definition from build-images.py as a quick fix.

Once we rebase the mediawiki-httpd production images on bookworm - which we should do ASAP since presumably they will fail to build if based on buster - we can sort out the desired naming for the flavour we return to.

Mentioned in SAL (#wikimedia-operations) [2025-07-15T17:58:44Z] <swfrench@deploy1003> Started scap sync-world: Stop building buster-based webserver flavour images - T378128

Mentioned in SAL (#wikimedia-operations) [2025-07-15T18:01:05Z] <swfrench@deploy1003> Finished scap sync-world: Stop building buster-based webserver flavour images - T378128 (duration: 02m 21s)

Scott_French lowered the priority of this task from Unbreak Now! to High. Jul 15 2025, 6:10 PM

Alright, build-images.py now only builds the bookworm-based webserver-bookworm flavour, which should prevent any impact to production scap deployments related to archival of buster.

Given that, I'm dropping this to High while I investigate the status of production image builds.

I've confirmed that weekly rebuilds of the "plain" (buster-based) httpd image stack are indeed failing and have been for roughly the last two weeks (note: individual build failures are non-fatal for the overall weekly rebuild process).

Rebasing them on bookworm via https://gerrit.wikimedia.org/r/1162030 will fix that, which we plan to do shortly before retiring the transitional -bookworm image track.

This of course assumes we stick with the original plan of maintaining the latter only temporarily, rather than adopting the idiom of always embedding the debian release in the image name (and retiring the "plain" ones). I still think that makes sense (i.e., having multiple tracks should be a temporary thing), at the expense of some extra cleanup.

Change #1170173 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] shellbox: revert to httpd-fcgi image

https://gerrit.wikimedia.org/r/1170173

Change #1162030 merged by Scott French:

[operations/docker-images/production-images@master] httpd: Rebase on bookworm and cascade

https://gerrit.wikimedia.org/r/1162030

The "plain" httpd image stack has now been rebuilt on bookworm, so we should be good to move miscweb, Shellbox, and MediaWiki back to them.

Change #1170173 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox: revert to httpd-fcgi image

https://gerrit.wikimedia.org/r/1170173

Mentioned in SAL (#wikimedia-operations) [2025-07-17T17:05:04Z] <swfrench@deploy1003> Started scap sync-world: Migrate webserver-bookworm flavour back to (bookworm) mediawiki-httpd images - T378128

Mentioned in SAL (#wikimedia-operations) [2025-07-17T17:06:41Z] <swfrench@deploy1003> swfrench: Migrate webserver-bookworm flavour back to (bookworm) mediawiki-httpd images - T378128 synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-07-17T17:14:28Z] <swfrench@deploy1003> Finished scap sync-world: Migrate webserver-bookworm flavour back to (bookworm) mediawiki-httpd images - T378128 (duration: 09m 56s)

Change #1170405 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/docker-images/production-images@master] httpd: clean up transitional -bookworm track

https://gerrit.wikimedia.org/r/1170405

Alright, both Shellbox and MediaWiki are back on the "normal" httpd images. Once miscweb does the same, we can retire the transitional images (https://gerrit.wikimedia.org/r/1170405).

I still need to look into how to clean up the retired images. I believe the procedure is https://wikitech.wikimedia.org/wiki/Docker-registry#Deleting_images.

Change #1170405 merged by Scott French:

[operations/docker-images/production-images@master] httpd: clean up transitional -bookworm track

https://gerrit.wikimedia.org/r/1170405

I'll return to this next week to clean up the no-longer-maintained -bookworm images from the registry, at which point this will finally be done.

Mentioned in SAL (#wikimedia-operations) [2025-07-23T17:22:41Z] <swfrench-wmf> deleted tags for docker-registry.discovery.wmnet/mediawiki-httpd-bookworm - T378128

Mentioned in SAL (#wikimedia-operations) [2025-07-23T17:23:59Z] <swfrench-wmf> deleted tags for docker-registry.discovery.wmnet/httpd-fcgi-bookworm - T378128

Mentioned in SAL (#wikimedia-operations) [2025-07-23T17:25:00Z] <swfrench-wmf> deleted tags for docker-registry.discovery.wmnet/httpd-bookworm - T378128

The tags have been deleted for all three transitional images, which should be the very last of the lingering cleanup here. Thanks, all!