Migrate production Shellbox services to PHP 8.3
Closed, ResolvedPublic

Description

In advance of starting the production MediaWiki migration to PHP 8.3 (targeting the start of Q2), it would be good to put some production miles on 8.3 by migrating Shellbox.

See T377038 for prior art from the migration to PHP 8.1. Importantly, this does not change the underlying Debian version (and in turn, dependency package versions), in contrast to the 8.1 migration where this was a significant source of risk.


Services to migrate:


Monitoring:

Event Timeline

Reviewing the changes merged between the current production Shellbox image version (2025-07-28-151806) and the first version available with PHP 8.3 images (2025-08-29-172844), most are fairly straightforward code modernization or dev-dependency bumps.

The one change I'll look at more closely is https://gerrit.wikimedia.org/r/1179005, which is a major version bump for wikipeg (5.x.x to 6.x.x) and regeneration of the parser.

The ShellParser tests pass, so this should be fine, but I'd still like to understand which production-relevant code paths may exercise the parser. From a very quick scan of the code, it might(?) not actually be used in our server deployment - we do not configure any routeSpecs (i.e., parsing for the purposes of spec validation is skipped) and we do not use Firejail in this specific context (in favor of other isolation mechanisms).


Separately, I've done some very basic local testing of an 8.3 production variant for the shell action using a production-like Shellbox config, and have encountered no issues.

If we can resolve the question of whether the wikipeg update poses some non-obvious risk, I think we're in a good spot to pilot traffic on 8.3 for one of the production services - ideally one with a high enough baseline request rate to provide signal quickly (e.g., syntaxhighlight).

Change #1184177 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] shellbox-syntaxhighlight: pilot 1 replica on 8.3

https://gerrit.wikimedia.org/r/1184177

After a bit of discussion, it seems wikipeg 6.0.0 primarily brings performance improvements to the base parser, which necessitate code regeneration (thus the major version bump). Those improvements are less critical for the Shellbox use case, and more so for Parsoid, where this version has now been in production use for weeks. In any case, as noted in T403284#11141844, given that the ShellParser tests pass, this is unlikely to be a source of issues.

Moving forward with a limited pilot seems like the right next step at this point.

Change #1184177 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox-syntaxhighlight: pilot 1 replica on 8.3

https://gerrit.wikimedia.org/r/1184177

Mentioned in SAL (#wikimedia-operations) [2025-09-04T16:39:19Z] <swfrench-wmf> started single-replica PHP 8.3 pilot on shellbox-syntaxhighlight in codfw - T403284

Mentioned in SAL (#wikimedia-operations) [2025-09-04T16:52:58Z] <swfrench-wmf> started single-replica PHP 8.3 pilot on shellbox-syntaxhighlight in eqiad - T403284

We're now serving ~ 8% of traffic on PHP 8.3 in shellbox-syntaxhighlight in both DCs. No issues observed so far:

  • General service health (errors, latency) looks good (codfw, eqiad).
  • No evidence of syntaxhighlight-related issues surfaced as WARNING-and-higher log events on the MediaWiki exec channel (logstash).
  • No evidence of syntaxhighlight-related issues surfaced as ShellboxError or implicating SyntaxHighlight code paths in mediawiki-errors logstash.

I'll check in periodically throughout the day, and plan to revert the pilot by the end of Americas business hours.

Change #1184950 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] shellbox-syntaxhighlight: revert single-replica 8.3 pilot

https://gerrit.wikimedia.org/r/1184950

Change #1184950 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox-syntaxhighlight: revert single-replica 8.3 pilot

https://gerrit.wikimedia.org/r/1184950

Mentioned in SAL (#wikimedia-operations) [2025-09-04T23:42:31Z] <swfrench-wmf> finished single-replica PHP 8.3 pilot on shellbox-syntaxhighlight - T403284

Alright, after running with ~ 8% of traffic on 8.3 for the last ~ 7 hours in shellbox-syntaxhighlight, no issues have surfaced by way of the metrics and logs mentioned in T403284#11149151.

I think this puts us in a good spot to prepare and execute the wider rollout. Next steps:

  • Upgrade all shellbox instances to the 2025-08-29-172844 image version, regardless of PHP version.
  • Begin the rollout, likely starting with the complete migration of syntaxhighlight. shellbox-constraints may be an interesting next candidate, as it will test out the call action case.

Change #1185961 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] shellbox: update to 2025-08-29-172844 image

https://gerrit.wikimedia.org/r/1185961

Change #1185961 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox: update to 2025-08-29-172844 image

https://gerrit.wikimedia.org/r/1185961

Mentioned in SAL (#wikimedia-operations) [2025-09-08T17:23:26Z] <swfrench-wmf> updated all shellbox services to 2025-08-29-172844 (+ envoy 1.26.8-1) in codfw - T403284

Mentioned in SAL (#wikimedia-operations) [2025-09-08T17:38:19Z] <swfrench-wmf> updated all shellbox services to 2025-08-29-172844 (+ envoy 1.26.8-1) in eqiad - T403284

No issues encountered thus far after updating to the 2025-08-29-172844 images (nor the 1.26.8-1 envoy image, as part of an ongoing fleet-wide upgrade effort), using roughly the same signals as described in T403284#11149151. I'll check in periodically throughout the day, but unless anything comes up, the next step is to begin the 8.3 upgrades.

Change #1186009 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] shellbox-syntaxhighlight: upgrade to PHP 8.3

https://gerrit.wikimedia.org/r/1186009

Change #1186009 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox-syntaxhighlight: upgrade to PHP 8.3

https://gerrit.wikimedia.org/r/1186009

Mentioned in SAL (#wikimedia-operations) [2025-09-09T17:15:21Z] <swfrench-wmf> migrated shellbox-syntaxhighlight to PHP 8.3 in codfw - T403284

Mentioned in SAL (#wikimedia-operations) [2025-09-09T17:38:16Z] <swfrench-wmf> migrated shellbox-syntaxhighlight to PHP 8.3 in eqiad - T403284

As of a bit before 17:40 UTC, shellbox-syntaxhighlight has been migrated to PHP 8.3 in both DCs. So far, everything is looking good in terms of:

I've also manually action=purge'd some pages to trigger some amount of re-highlighting that I could visually inspect.

As usual, I'll keep an eye on things throughout the day.

Edit: Roughly 6 hours on, still no issues have arisen. However, if an issue does surface while I'm not around, and it is believed that the switch to 8.3 is the culprit, simply:

  1. Revert https://gerrit.wikimedia.org/r/1186009 and merge.
  2. Run helmfile -e $DC -i apply --context 5 for each of codfw and eqiad in helmfile.d/services/shellbox-syntaxhighlight.
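
For whoever ends up doing this, step 2 can be sketched as a small loop over both DCs. This is only a sketch: it assumes the current directory is a checkout of deployment-charts with the revert from step 1 already merged, and the rollback_apply function and DRY_RUN variable are hypothetical names, not existing tooling. By default it only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch only: apply an already-merged revert in both data centers.
# Assumption: run from the root of a deployment-charts checkout.
# rollback_apply and DRY_RUN are hypothetical names used for illustration.

rollback_apply() {
  local service="$1"
  local dc
  for dc in codfw eqiad; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      # Default: print the command instead of running it.
      echo "(cd helmfile.d/services/${service} && helmfile -e ${dc} -i apply --context 5)"
    else
      # -i keeps helmfile interactive, so each apply still asks for confirmation.
      # Stop at the first failure rather than moving on to the next DC.
      (cd "helmfile.d/services/${service}" && helmfile -e "${dc}" -i apply --context 5) || return 1
    fi
  done
}

rollback_apply "${1:-shellbox-syntaxhighlight}"
```

Setting DRY_RUN=0 runs the applies for real. Step 1 (reverting and merging the change in Gerrit) is not covered here and still has to happen first.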

Scott_French changed the task status from Open to In Progress. Sep 9 2025, 6:22 PM

Change #1186576 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] shellbox-constraints: pilot 1 replica on 8.3

https://gerrit.wikimedia.org/r/1186576

In terms of workload diversity, focusing on shellbox-constraints for the next pilot and migration would be ideal - i.e., a service for which clients exercise the PHP call action rather than shell execution.

Reviewing PCRE-related migration notes for 8.2 and 8.3 given the WikibaseQualityConstraints use case, the only item that appears relevant is this one related to NUL characters in pattern strings, which seems low-risk.

I'll ask around to see whether folks with expertise in that extension have additional concerns, but if not, I think we're in a good spot to proceed with a pilot.

Sounds good to me – AFAICT there are no format constraints with \0 in the pattern anyway. (And I’m pretty sure Wikibase would block actual NUL characters from being saved in the pattern, so it would have to be specified as \0.)

Thank you very much, @Lucas_Werkmeister_WMDE. I'll get the pilot started today.

Change #1186576 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox-constraints: pilot 1 replica on 8.3

https://gerrit.wikimedia.org/r/1186576

Mentioned in SAL (#wikimedia-operations) [2025-09-10T16:55:32Z] <swfrench-wmf> started single-replica PHP 8.3 pilot on shellbox-constraints - T403284

A bit more than 5 hours into the pilot, no issues observed so far with a fraction of shellbox-constraints traffic serving on 8.3. This is based on the same set of monitoring signals as used previously to validate syntaxhighlight (T403284#11164363), although now including MediaWiki exceptions implicating WikibaseQualityConstraints-related code paths.

I'll continue the pilot through the rest of Americas business hours today, at which point I'll revert to the prior state. Then, as long as no new issues emerge, we can proceed with the full migration tomorrow (Thursday).

Change #1187153 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] shellbox-constraints: end single-replica 8.3 pilot

https://gerrit.wikimedia.org/r/1187153

Change #1187153 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox-constraints: end single-replica 8.3 pilot

https://gerrit.wikimedia.org/r/1187153

Mentioned in SAL (#wikimedia-operations) [2025-09-11T00:45:48Z] <swfrench-wmf> finished single-replica PHP 8.3 pilot on shellbox-constraints - T403284

As planned, I've reverted the pilot at the end of the day today. Still no issues apparent after nearly 8 hours with ~ 9% of traffic serving on 8.3. Unless something surfaces upon closer inspection in the interim, I think we're in a good spot to migrate the service tomorrow.

Change #1187162 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] shellbox-constraints: upgrade to PHP 8.3

https://gerrit.wikimedia.org/r/1187162

Change #1187162 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox-constraints: upgrade to PHP 8.3

https://gerrit.wikimedia.org/r/1187162

Mentioned in SAL (#wikimedia-operations) [2025-09-11T14:26:37Z] <swfrench-wmf> migrated shellbox-constraints to PHP 8.3 - T403284

After a couple of hours with shellbox-constraints fully migrated, no issues encountered so far, based on:

I'll continue to monitor throughout the day. Barring any issues that arise, no further action is expected until early next week when we pick up the remaining migrations, which we can likely pursue fairly expeditiously.

Change #1188364 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] shellbox-media: upgrade to PHP 8.3

https://gerrit.wikimedia.org/r/1188364

Change #1188364 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox-media: upgrade to PHP 8.3

https://gerrit.wikimedia.org/r/1188364

Mentioned in SAL (#wikimedia-operations) [2025-09-15T17:41:34Z] <swfrench-wmf> migrated shellbox-media to PHP 8.3 - T403284

We're now between 1.5 and 2 hours after the switch, and so far things are looking good for shellbox-media on 8.3.

This is a lower-traffic service than the others we've already migrated, so we'll need more time to state with high confidence that all is well. That said, I'm not seeing errors implicating code paths relevant to the three core use cases for this instance (the PdfHandler and PagedTiffHandler extensions, and core's DjVuImage). Similarly, the mix of non-INFO level exec-channel log events has not changed noticeably since the switch, in that these are all "typical" errors we see at steady state (the vast majority being warnings emitted during TIFF processing, e.g., due to unrecognized tags).

As usual, I'll be checking in intermittently throughout the day for new / novel errors.

Edit: Roughly 6 hours on, still no issues have arisen. However, if an issue does surface while I'm not around, and it is believed that 8.3 is at fault:

  1. Revert https://gerrit.wikimedia.org/r/1188364 and merge.
  2. Run helmfile -e $DC -i apply --context 5 for each of codfw and eqiad in helmfile.d/services/shellbox-media.

Change #1188817 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] shellbox: upgrade to PHP 8.3

https://gerrit.wikimedia.org/r/1188817

Change #1188818 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] shellbox-timeline: upgrade to PHP 8.3

https://gerrit.wikimedia.org/r/1188818

Change #1188819 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] shellbox-video: upgrade to PHP 8.3

https://gerrit.wikimedia.org/r/1188819

About 22 hours on, still no issues identified after migrating shellbox-media to 8.3. Given that we're ~ 50% through shellbox instances (>> 90% of traffic) and the wide range of use cases bundled into -media (e.g., this is where we finally surfaced issues during the 8.1 migration), accelerating a bit for the remaining instances seems reasonable.

Change #1188817 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox: upgrade to PHP 8.3

https://gerrit.wikimedia.org/r/1188817

Mentioned in SAL (#wikimedia-operations) [2025-09-16T20:00:12Z] <swfrench-wmf> migrated shellbox (score) to PHP 8.3 - T403284

Although the codfw and eqiad updates got spread out by some work on the docker registry hosts (and then later, the train window), so far all is looking well after the migration of shellbox (score) to 8.3. As usual, this is based on general service health and logged exceptions on Shellbox-related code paths, as well as exceptions involving the Score extension specifically.

In any case, I will continue to keep an eye on things throughout the day today.

Edit: Still no issues by the end of the day. If anything does surface while I'm not around, and it is believed that 8.3 is at fault:

  1. Revert https://gerrit.wikimedia.org/r/1188817 and merge.
  2. Run helmfile -e $DC -i apply --context 5 for each of codfw and eqiad in helmfile.d/services/shellbox.

Change #1188818 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox-timeline: upgrade to PHP 8.3

https://gerrit.wikimedia.org/r/1188818

Mentioned in SAL (#wikimedia-operations) [2025-09-17T14:57:46Z] <swfrench-wmf> migrated shellbox-timeline to PHP 8.3 - T403284

A bit more than 90 minutes on, no issues encountered for shellbox-timeline on 8.3, going by the usual set of service health and generic Shellbox-related error logs, as well as exceptions involving Timeline extension code paths.

I plan to proceed with the next and final migration, shellbox-video, shortly.

Edit: Still no issues by the end of the day. If anything does surface while I'm not around, and it is believed that 8.3 is at fault:

  1. Revert https://gerrit.wikimedia.org/r/1188818 and merge.
  2. Run helmfile -e $DC -i apply --context 5 for each of codfw and eqiad in helmfile.d/services/shellbox-timeline.

Change #1188819 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox-video: upgrade to PHP 8.3

https://gerrit.wikimedia.org/r/1188819

Mentioned in SAL (#wikimedia-operations) [2025-09-17T17:37:41Z] <swfrench-wmf> migrated shellbox-video to PHP 8.3 - T403284

A bit more than 30 minutes on from migrating shellbox-video, we're still waiting for new webVideoTranscode* jobs to confirm things work as expected (mercurius dashboard). Once that happens, I'll be keeping an eye on exceptions involving TimedMediaHandler extension code paths.

Edit: Around 18:30 UTC, the first transcode jobs started coming in post-migration, and indeed these appear to be completing successfully. I'll check in again throughout the day to confirm things stay that way.

Edit: Still no issues by the end of the day. If anything does surface while I'm not around, and it is believed that 8.3 is at fault:

  1. Revert https://gerrit.wikimedia.org/r/1188819 and merge.
  2. Run helmfile -e $DC -i apply --context 5 for each of codfw and eqiad in helmfile.d/services/shellbox-video.

Change #1189336 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] shellbox*: update flavour override comments

https://gerrit.wikimedia.org/r/1189336

I've just made another pass over the logs and service health dashboards, and things are still looking good. I've also updated the task description to consolidate links to what exactly I've been monitoring.

Unless new issues arise, I'll plan to wrap up work / monitoring this week.

Change #1189336 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox*: update flavour override comments

https://gerrit.wikimedia.org/r/1189336

Another day has passed, and still no issues have surfaced that are plausibly the result of the 8.3 migration. I am going to optimistically resolve this. Follow-on work to clean up 8.1 image builds in the Shellbox repository will be tracked separately.