Page MenuHomePhabricator

Migrate production Shellbox variants to PHP 8.1
Closed, ResolvedPublic

Description

The 8.1-based Shellbox deployments are being turned up and undergoing some basic testing in staging in T375243.

As those become ready, this task will track migration of production traffic to those deployments using a simple capacity-based migration that's roughly proportional to traffic fraction (i.e., progressively shifting replica counts from the "main" release to the "migration" release, which route via the same Service object).

As noted in T375243#10203654, this will start with syntaxhighlight, as it both (1) sees a good amount of traffic at steady state (i.e., should provide good signal) and (2) has low exposure to changes in Debian package dependencies affecting the wrapped commands (i.e., switching from python 3.7 to 3.9, while holding the version of Pygments fixed).

Details

SubjectRepoBranchLines +/-
mediawiki/libs/Shellboxmaster+20 -171
mediawiki/libs/Shellboxmaster+2 -37
operations/deployment-chartsmaster+49 -37
operations/deployment-chartsmaster+1 -15
operations/deployment-chartsmaster+11 -3
operations/deployment-chartsmaster+11 -2
operations/docker-images/production-imagesmaster+21 -0
operations/docker-images/production-imagesmaster+18 -2
operations/docker-images/production-imagesmaster+21 -0
mediawiki/libs/Shellboxmaster+1 -1
operations/deployment-chartsmaster+2 -12
operations/deployment-chartsmaster+12 -2
operations/deployment-chartsmaster+1 -1
operations/deployment-chartsmaster+6 -6
operations/deployment-chartsmaster+2 -2
operations/deployment-chartsmaster+2 -2
operations/deployment-chartsmaster+3 -3
operations/deployment-chartsmaster+6 -2
operations/deployment-chartsmaster+1 -1
operations/deployment-chartsmaster+1 -1
operations/deployment-chartsmaster+1 -4
operations/deployment-chartsmaster+2 -5
operations/deployment-chartsmaster+1 -4
operations/deployment-chartsmaster+3 -3
operations/deployment-chartsmaster+4 -1
operations/deployment-chartsmaster+3 -0
operations/deployment-chartsmaster+3 -0
operations/deployment-chartsmaster+1 -7
operations/deployment-chartsmaster+3 -0
operations/deployment-chartsmaster+3 -3
operations/deployment-chartsmaster+3 -0
Show related patches Customize query in gerrit

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

We'll be moving ahead with at least the pilots for both constraints and video today (shortly), as described in T377038#10482179. If no issues arise, we'll proceed through the remaining phases over the next 24h or so.

Change #1113217 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox-constraints: 1 eqiad replica on 8.1 (change 1/3)

https://gerrit.wikimedia.org/r/1113217

Mentioned in SAL (#wikimedia-operations) [2025-01-28T18:04:23Z] <swfrench-wmf> starting shellbox-constraints pilot on PHP 8.1 (1 replica, eqiad only) - T377038

Change #1113213 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox-video: 3 codfw replicas on 8.1 (change 1/4)

https://gerrit.wikimedia.org/r/1113213

Mentioned in SAL (#wikimedia-operations) [2025-01-28T18:18:10Z] <swfrench-wmf> starting shellbox-video pilot on PHP 8.1 (3 replicas, codfw only) - T377038

Alright, we're up and running.

shellbox-constraints: No issues observed so far - error rates remain ~ 0 for now, latency is comparable to 7.4. I've been keeping an eye on the exception log channel for wikis that use WikibaseQualityConstraints (e.g., for unexpected ShellboxError), and again things are looking clean.

Note that the pilot is specifically in eqiad - i.e., if there's a problem that's localized there, then that potentially implicates the pilot (while a problem that also exists in codfw should not).

In the event that any issues arise, the preferred rollback strategy at this point is to revert https://gerrit.wikimedia.org/r/1113217 and helmfile -e eqiad -i apply --context 5.

shellbox-video: Given the irregular arrival rate of transcodes, this is going to require a bit of a longer burn-in to validate. What I can say at this point is that we have seen successful transcodes executed on the pilot pods, no errors have been emitted to logs from mw-videoscaler jobs, and in general the service appears healthy.

In the event that any issues arise, the preferred rollback strategy at this point is to revert https://gerrit.wikimedia.org/r/1113213 and helmfile -e codfw -i apply --context 5.

End-of-day edit: Still no issues observed over the course of the day today for either service, so I think we're good to move forward with the next increments on Wednesday.

Change #1113214 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox-video: 50% of codfw replicas to 8.1 (change 2/4)

https://gerrit.wikimedia.org/r/1113214

Change #1113218 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox-constraints: all eqiad replicas on 8.1 (change 2/3)

https://gerrit.wikimedia.org/r/1113218

Change #1113219 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox-constraints: all replicas on PHP 8.1 (change 3/3)

https://gerrit.wikimedia.org/r/1113219

shellbox-constraints: Fully migrated, nothing standing out so far
shellbox-video: 50% migrated, we are standing by for potential transcoding issues.

In case of errors, please revert the relevant patches

Change #1113215 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox-video: all codfw replicas to 8.1 (change 3/4)

https://gerrit.wikimedia.org/r/1113215

Change #1113216 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox-video: all replicas on PHP 8.1 (change 4/4)

https://gerrit.wikimedia.org/r/1113216

Change #1115135 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] [DNM] shellbox-constraints: revert to PHP 8.1

https://gerrit.wikimedia.org/r/1115135

Change #1115136 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] [DNM] shellbox-video: revert to PHP 8.1

https://gerrit.wikimedia.org/r/1115136

shellbox-constraints: Thanks to @jijiki, this has been running on 8.1 since ~ 11:30 UTC today, with no issues detected thus far.

In the event that an issue arises that implicates 8.1, merge https://gerrit.wikimedia.org/r/1115135 and helmfile-apply in both sites.

shellbox-video: No issues have been discovered since moving to 50% on 8.1 at ~ 11:15 UTC today, so I've moved forward to 100% on 8.1 as of ~ 23:20 UTC.

In the event that an issue arises that implicates 8.1, merge https://gerrit.wikimedia.org/r/1115136 and helmfile-apply in both sites.

Note: I have not yet scaled down the "migration" release of shellbox-video in codfw, as I'm waiting for some running transcodes to complete. I hope to do that shortly.

Mentioned in SAL (#wikimedia-operations) [2025-01-30T02:45:51Z] <swfrench-wmf> scaled down shellbox-video/migration after switch to PHP 8.1 - T377038

Change #1116837 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] shellbox: 1 replica on 8.1 for each DC

https://gerrit.wikimedia.org/r/1116837

Change #1116838 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] shellbox-media: 1 replica on 8.1 for each DC

https://gerrit.wikimedia.org/r/1116838

Change #1116837 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox: 1 replica on 8.1 for each DC

https://gerrit.wikimedia.org/r/1116837

Change #1117506 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] shellbox: all replicas on PHP 8.1 (score)

https://gerrit.wikimedia.org/r/1117506

Change #1117506 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox: all replicas on PHP 8.1 (score)

https://gerrit.wikimedia.org/r/1117506

jijiki updated the task description. (Show Details)

Change #1116838 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox-media: 1 replica on 8.1 for each DC

https://gerrit.wikimedia.org/r/1116838

Change #1122547 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] shellbox-timeline: 1 replica on 8.1 for each DC

https://gerrit.wikimedia.org/r/1122547

Change #1122547 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox-timeline: 1 replica on 8.1 for each DC

https://gerrit.wikimedia.org/r/1122547

We have deployed 1 php8.1 replica per DC for shellbox-timeline and shellbox-media. If everything is uneventful until tomorrow, we will move forward and complete the migration

Change #1122923 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] shellbox: all replicas on PHP 8.1 (media & timeline)

https://gerrit.wikimedia.org/r/1122923

Change #1122923 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox: all replicas on PHP 8.1 (media & timeline)

https://gerrit.wikimedia.org/r/1122923

Change #1123033 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] shellbox-media: revert to PHP 7.4

https://gerrit.wikimedia.org/r/1123033

Roughly correlated with the switch to 8.1, we're now seeing a trickle of ShellboxErrors like the following:

normalized_message
[{reqId}] {exception_url}   Shellbox\ShellboxError: Shellbox server returned incorrect Content-Type
FrameLocationCall
from/srv/mediawiki/php-1.44.0-wmf.17/vendor/wikimedia/shellbox/src/Client.php(191)
#0/srv/mediawiki/php-1.44.0-wmf.17/vendor/wikimedia/shellbox/src/Command/RemoteBoxedExecutor.php(83)Shellbox\Client->sendRequest(string, array, array, array)
#1/srv/mediawiki/php-1.44.0-wmf.17/vendor/wikimedia/shellbox/src/Command/BoxedExecutor.php(20)Shellbox\Command\RemoteBoxedExecutor->executeValid(Shellbox\Command\BoxedCommand)
#2/srv/mediawiki/php-1.44.0-wmf.17/vendor/wikimedia/shellbox/src/Command/BoxedCommand.php(406)Shellbox\Command\BoxedExecutor->execute(Shellbox\Command\BoxedCommand)
#3/srv/mediawiki/php-1.44.0-wmf.17/extensions/PdfHandler/includes/PdfImage.php(129)Shellbox\Command\BoxedCommand->execute()
#4/srv/mediawiki/php-1.44.0-wmf.17/extensions/PdfHandler/includes/PdfHandler.php(280)MediaWiki\Extension\PdfHandler\PdfImage->retrieveMetaData()
#5/srv/mediawiki/php-1.44.0-wmf.17/includes/media/MediaHandler.php(228)MediaWiki\Extension\PdfHandler\PdfHandler->getSizeAndMetadata(UploadStashFile, string)
#6/srv/mediawiki/php-1.44.0-wmf.17/includes/filerepo/file/UnregisteredLocalFile.php(203)MediaHandler->getSizeAndMetadataWithFallback(UploadStashFile, string)
#7/srv/mediawiki/php-1.44.0-wmf.17/includes/filerepo/file/UnregisteredLocalFile.php(193)UnregisteredLocalFile->getSizeAndMetadata()
#8/srv/mediawiki/php-1.44.0-wmf.17/includes/api/ApiQueryImageInfo.php(633)UnregisteredLocalFile->getMetadataArray()
#9/srv/mediawiki/php-1.44.0-wmf.17/includes/api/ApiUpload.php(227)MediaWiki\Api\ApiQueryImageInfo::getInfo(UploadStashFile, array, MediaWiki\Api\ApiResult)
#10/srv/mediawiki/php-1.44.0-wmf.17/includes/jobqueue/jobs/AssembleUploadChunksJob.php(147)MediaWiki\Api\ApiUpload->getUploadImageInfo(UploadFromChunks)
#11/srv/mediawiki/php-1.44.0-wmf.17/extensions/EventBus/includes/JobExecutor.php(80)AssembleUploadChunksJob->run()
#12/srv/mediawiki/rpc/RunSingleJob.php(60)MediaWiki\Extension\EventBus\JobExecutor->execute(array)
#13{main}

Example logstash search: https://logstash.wikimedia.org/goto/7ec03953b7b850ec7b7480ae814c5a37

The interesting part is that only shellbox-media seems to be affected - i.e., all stack traces (43 exceptions in total) implicate code paths via PagedTiffHandler (7), PdfHandler like the above (33), or Core's use of DjVu (3).

Notably, if we're getting to this point, shellbox has returned a 200 status. Unfortunately, the exception does not actually indicate what the erroneous Content-Type header was. Also, since we raise a ShellboxError early in processing, we never make it back up to https://gerrit.wikimedia.org/g/mediawiki/libs/Shellbox/+/5f561cb8ee22cd90acef07a7b936f24a22ff1b99/src/Command/RemoteBoxedExecutor.php#95 in order to see what the output of the command was.

In any case, I'm going to revert shellbox-media to 7.4 to confirm the errors stop.

Change #1123033 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox-media: revert to PHP 7.4

https://gerrit.wikimedia.org/r/1123033

Mentioned in SAL (#wikimedia-operations) [2025-02-26T22:06:25Z] <swfrench-wmf> shellbox-media reverted to PHP 7.4 - T377038

Last known error was at 22:01:19.025 UTC. I'm going to hold for a bit to see whether the errors have indeed stopped (which may take a little while to state confidently).

Stepping back, seeing as how we've not seen this in any other shellbox instance, it's puzzling that we would see it pop up now, and further affect all three shellbox-media use cases - i.e., at least superficially, it does not appear workload specific.

It has now been over 4h since reverting shellbox-media to 7.4, and no further errors of the kind in T377038#10584869 have been observed.

I'll return to this tomorrow and start figuring out how to debug this.

One notable difference between the shellbox-media workload vs. the other instances is the non-zero rate at which the wrapped commands fail (e.g., due to malformed media files) as can be seen in the mediawiki exec logging channel. This is true on either or 7.4 or 8.1.

What I wonder is whether, when running on 8.1, something is subtly different about how shellbox responds in some of these cases.

@tstarling - Does any of this sound like something you've seen before?

FWIW, “Shellbox server returned incorrect Content-Type” can mean a lot of different errors – @AudreyPenven_WMDE and I ran into it once, and it turned out that I’d forgotten the <?php at the beginning of a PHP file that was being sent to shellbox for evaluation. “Unable to parse Shellbox response as MIME multipart message” might be a more accurate message.

Thanks, Lucas!

Agreed, yeah, given the way the client error handling works, there's a lot that's potentially masked by "Shellbox server returned incorrect Content-Type” and (together with some improved logging) rephrasing along the lines of what you're suggesting would better capture what's actually happening.

Also, that's an interesting observation about execution failure (due to a malformed PHP source file) manifesting the same way. Was that a call action that executed the PHP code directly (like in WikibaseQualityConstraints) or a shell action that invoked the PHP CLI as a (sub)command?

I spent some time this afternoon looking more closely at this, and I suspect we're only going to be able to get at this with more data than we have now.

A couple of ideas:

Logging in shellbox: In normal operation, we delegate logging to the client by way of flushing the buffered logs to the json-data response part. Since we're apparently not getting a multipart response, there's not much hope in trying to learn what's happening on the client side.

However, it's not just logToClient that's true in the production shellbox config - we also set jsonLogToStderr true, but this information goes nowhere - i.e., php-fpm sends it to /dev/null in the absence of setting catch_workers_output to true. We can do that temporarily relatively easily via the FPM__catch_workers_output environment variable.

Logging in envoy: By default, we do not log non-5xx responses in envoy's access log. We know that we must be getting a 200 response in order to fail here.

We can set that temporarily to a less restrictive value via the mesh.local_access_log_min_code helmfile value (as well as vary it dynamically). With a bit more effort (requires a new mesh.configuration module, etc.), we could consider augmenting the log format with more interesting fields (e.g., including the Content-Type header in the response), given the fact that the default format is not terribly informative for our purposes.

Improved client error message and supporting logs: Aside from making the exception message more precise, we can also consider providing contextual information when this error occurs - e.g., logged to the exec channel like everything else. If we can figure out a minimal combination of details about the original request that would be sufficient to reproduce (e.g., identity of the specific file), that would be pretty amazing. Of course, given that these are client code changes that would apply to all shellbox use cases, this might be hard to get right.

So, where does that leave us?

We should be able to catch at least one of these in ~ hours running with a single pod migrated to 8.1. I think it makes sense to try doing exactly that as the next step - an experiment with a single pod 8.1 that has both catch_workers_output set to true and envoy access logging turned up. In parallel (and if needed), we can consider what client changes to pursue, if any.

Change #1123690 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] shellbox-media: serve 1/8 of requests on 8.1 with more logging

https://gerrit.wikimedia.org/r/1123690

Also, that's an interesting observation about execution failure (due to a malformed PHP source file) manifesting the same way. Was that a call action that executed the PHP code directly (like in WikibaseQualityConstraints) or a shell action that invoked the PHP CLI as a (sub)command?

It was the PHP file added in this change (i.e. sent via 'classes' => […]), during earlier experimentation trying to see if that approach was viable (I don’t think the broken version even ended up on Gerrit). Without the opening <?php, I expect the full PHP source got dumped to stdout and intermixed with the “real” HTTP response data.

Change #1123690 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox-media: serve 1/8 of requests on 8.1 with more logging

https://gerrit.wikimedia.org/r/1123690

Mentioned in SAL (#wikimedia-operations) [2025-03-03T15:51:29Z] <swfrench-wmf> started shellbox-media PHP 8.1 pilot with increased logging - T377038

Ah, thanks Lucas! Yes, that makes sense then, and indeed I may have seen something similar before, when shellbox is somehow induced to emit content "early" (default 200 response code, unexpected content-type, etc.).

I've started the one-pod-per-DC test described in T377038#10589628 and will be checking in throughout the day for errors.

Wow, that was quick. Got one!

Logstash: https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-deploy-1-7.0.0-1-2025.03.03?id=lwG5XJUBuXzFNByTNemC

[03-Mar-2025 15:56:33] WARNING: [pool www] child 10 said into stderr: "PHP message: PHP Warning:  Unknown: POST Content-Length of 289293710 bytes exceeds the limit of 104857600 bytes in Unknown on line 0"
[03-Mar-2025 15:56:33] WARNING: [pool www] child 10 said into stderr: "PHP message: PHP Warning:  Cannot modify header information - headers already sent in Unknown on line 0"
[03-Mar-2025 15:56:33] WARNING: [pool www] child 10 said into stderr: "{"message":"Creating base path /tmp/shellbox-b7754a4c941633b6","context":{},"level":100,"level_name":"DEBUG","channel":"shellbox","datetime":"2025-03-03T15:56:33.201988+00:00","extra":{}}"
[03-Mar-2025 15:56:33] WARNING: [pool www] child 10 said into stderr: "{"message":"Creating subdirectory scripts","context":{},"level":100,"level_name":"DEBUG","channel":"shellbox","datetime":"2025-03-03T15:56:33.202123+00:00","extra":{}}"
[03-Mar-2025 15:56:36] WARNING: [pool www] child 10 said into stderr: "{"message":"Executing: /bin/bash '/srv/app/src/Command/limit.sh' ''\\''/bin/bash'\\'' '\\''scripts/retrieveMetaData.sh'\\''' 'SB_INCLUDE_STDERR=;SB_CPU_LIMIT=50; SB_CGROUP='\\'''\\''; SB_MEM_LIMIT=1073741824; SB_FILE_SIZE_LIMIT=536870912; SB_WALL_CLOCK_LIMIT=60; SB_USE_LOG_PIPE=yes'","context":{},"level":200,"level_name":"INFO","channel":"shellbox","datetime":"2025-03-03T15:56:36.245131+00:00","extra":{}}"
[03-Mar-2025 15:56:36] WARNING: [pool www] child 10 said into stderr: "{"message":"Removed file \"/tmp/shellbox-b7754a4c941633b6/pages\"","context":{},"level":100,"level_name":"DEBUG","channel":"shellbox","datetime":"2025-03-03T15:56:36.297090+00:00","extra":{}}"
[03-Mar-2025 15:56:36] WARNING: [pool www] child 10 said into stderr: "{"message":"Removed file \"/tmp/shellbox-b7754a4c941633b6/text\"","context":{},"level":100,"level_name":"DEBUG","channel":"shellbox","datetime":"2025-03-03T15:56:36.297165+00:00","extra":{}}"
[03-Mar-2025 15:56:36] WARNING: [pool www] child 10 said into stderr: "{"message":"Removed file \"/tmp/shellbox-b7754a4c941633b6/text_exit_code\"","context":{},"level":100,"level_name":"DEBUG","channel":"shellbox","datetime":"2025-03-03T15:56:36.297194+00:00","extra":{}}"
[03-Mar-2025 15:56:36] WARNING: [pool www] child 10 said into stderr: "{"message":"Removed file \"/tmp/shellbox-b7754a4c941633b6/file.pdf\"","context":{},"level":100,"level_name":"DEBUG","channel":"shellbox","datetime":"2025-03-03T15:56:36.361993+00:00","extra":{}}"
[03-Mar-2025 15:56:36] WARNING: [pool www] child 10 said into stderr: "{"message":"Removed file \"/tmp/shellbox-b7754a4c941633b6/meta\"","context":{},"level":100,"level_name":"DEBUG","channel":"shellbox","datetime":"2025-03-03T15:56:36.362080+00:00","extra":{}}"
[03-Mar-2025 15:56:36] WARNING: [pool www] child 10 said into stderr: "{"message":"Removed file \"/tmp/shellbox-b7754a4c941633b6/scripts/retrieveMetaData.sh\"","context":{},"level":100,"level_name":"DEBUG","channel":"shellbox","datetime":"2025-03-03T15:56:36.362159+00:00","extra":{}}"
[03-Mar-2025 15:56:36] WARNING: [pool www] child 10 said into stderr: "{"message":"Removed directory \"/tmp/shellbox-b7754a4c941633b6/scripts\"","context":{},"level":100,"level_name":"DEBUG","channel":"shellbox","datetime":"2025-03-03T15:56:36.362191+00:00","extra":{}}"
[03-Mar-2025 15:56:36] WARNING: [pool www] child 10 said into stderr: "{"message":"Removed directory \"/tmp/shellbox-b7754a4c941633b6\"","context":{},"level":100,"level_name":"DEBUG","channel":"shellbox","datetime":"2025-03-03T15:56:36.362234+00:00","extra":{}}"
[03-Mar-2025 15:56:36] WARNING: [pool www] child 10 said into stderr: "{"message":"Exception of class Shellbox\\ShellboxError: PHP error in /srv/app/src/Action/MultipartAction.php line 265: Cannot modify header information - headers already sent","context":{"trace":"#0 [internal function]: Shellbox\\Server->handleError(2, 'Cannot modify h...', '/srv/app/src/Ac...', 265)\n#1 /srv/app/src/Action/MultipartAction.php(265): header('Content-Type: m...')\n#2 /srv/app/src/Action/ShellAction.php(44): Shellbox\\Action\\MultipartAction->writeResult(Array, Array, Array)\n#3 /srv/app/src/Action/MultipartAction.php(68): Shellbox\\Action\\ShellAction->execute(Array)\n#4 /srv/app/src/Server.php(137): Shellbox\\Action\\MultipartAction->baseExecute(Array)\n#5 /srv/app/src/Server.php(78): Shellbox\\Server->guardedExecute('/srv/app/config...')\n#6 /srv/app/src/Server.php(67): Shellbox\\Server->execute('/srv/app/config...')\n#7 /srv/app/index.php(3): Shellbox\\Server::main('/srv/app/config...')\n#8 {main}"},"level":400,"level_n"
[03-Mar-2025 15:56:36] WARNING: [pool www] child 10 said into stderr: "ame":"ERROR","channel":"shellbox","datetime":"2025-03-03T15:56:36.362285+00:00","extra":{}}"

Change #1124140 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] Revert "shellbox-media: serve 1/8 of requests on 8.1 with more logging"

https://gerrit.wikimedia.org/r/1124140

Change #1124140 merged by jenkins-bot:

[operations/deployment-charts@master] Revert "shellbox-media: serve 1/8 of requests on 8.1 with more logging"

https://gerrit.wikimedia.org/r/1124140

Mentioned in SAL (#wikimedia-operations) [2025-03-03T16:10:49Z] <swfrench-wmf> finished shellbox-media PHP 8.1 pilot - T377038

Alright, progress.

After a bit of debugging, there are two things going on here.

One is that we're clearly running into post_max_size in some cases (100M), and this is true on both 7.4 and 8.1.

The other is the 8.1-specific bit: after some additional testing in staging and then reproducing locally, I'm pretty sure this is the change in default behavior for display_startup_errors on 8.0 [0] - i.e., this was previously defaulting to "0" but is now "1". This means error messages like those from running into post_max_size are displayed, which as a side effect of course means we'll send headers, etc. - e.g.

HTTP/1.1 200
 [ ... snip ... ]
Content-Type: text/html; charset=UTF-8
 [ ... snip ... ]
<br />                                                                                                                                
<b>Warning</b>:  Unknown: POST Content-Length of 50601 bytes exceeds the limit of 1024 bytes in <b>Unknown</b> on line <b>0</b><br /> 
<br />                        
<b>Warning</b>:  Cannot modify header information - headers already sent in <b>Unknown</b> on line <b>0</b><br />

Then, later on when we call header in MultipartAction::writeResult, this will fail and emit a warning, which at this point is picked up by the custom error handler and promoted to a ShellboxError, which also terminates execution.

Anyway, the least-surprising choice is likely to update our 8.1 production images to consistently set display_startup_errors and display_errors, where the latter already defaults to "stderr".

One thing that I don't quite understand is why exceeding post_max_size actually seems to be OK in practice - a POST that exceeds the limit doesn't appear to, e.g., truncate the request.

[0] https://www.php.net/manual/en/errorfunc.configuration.php#errorfunc.configuration

Change #1124194 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/docker-images/production-images@master] php8.1: Default display_startup_errors to "stderr"

https://gerrit.wikimedia.org/r/1124194

Change #1124388 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] shellbox-media: serve 1/8 of requests on 8.1 with more logging (2)

https://gerrit.wikimedia.org/r/1124388

Change #1124194 merged by Effie Mouzeli:

[operations/docker-images/production-images@master] php8.1: Set display_startup_errors consistent with display_errors

https://gerrit.wikimedia.org/r/1124194

Change #1124405 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[mediawiki/libs/Shellbox@master] Bump to rebuild

https://gerrit.wikimedia.org/r/1124405

Change #1124405 merged by jenkins-bot:

[mediawiki/libs/Shellbox@master] Bump to rebuild (php 8.1.34-1-s3)

https://gerrit.wikimedia.org/r/1124405

Change #1124444 had a related patch set uploaded (by Scott French; author: Effie Mouzeli):

[operations/docker-images/production-images@master] Revert "php8.1: Set display_startup_errors consistent with display_errors"

https://gerrit.wikimedia.org/r/1124444

Change #1124444 merged by Clément Goubert:

[operations/docker-images/production-images@master] Revert "php8.1: Set display_startup_errors consistent with display_errors"

https://gerrit.wikimedia.org/r/1124444

Change #1124476 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/docker-images/production-images@master] php8.1: Set display_startup_errors consistent with display_errors

https://gerrit.wikimedia.org/r/1124476

Change #1124476 merged by Scott French:

[operations/docker-images/production-images@master] php8.1: Set display_startup_errors consistent with display_errors

https://gerrit.wikimedia.org/r/1124476

Mentioned in SAL (#wikimedia-operations) [2025-03-04T17:31:47Z] <swfrench-wmf> built php8.1 production images with 'php8.1: Set display_startup_errors consistent with display_errors' - T377038

Mentioned in SAL (#wikimedia-operations) [2025-03-04T17:33:36Z] <swfrench@deploy2002> Started scap sync-world: Use latest php8.1 images - T377038

Alright, progress.

After a bit of debugging, there are two things going on here.

One is that we're clearly running into post_max_size in some cases (100M), and this is true on both 7.4 and 8.1.

The other is the 8.1-specific bit: after some additional testing in staging and then reproducing locally, I'm pretty sure this is the change in default behavior for display_startup_errors on 8.0 [0] - i.e., this was previously defaulting to "0" but is now "1". This means error messages like those from running into post_max_size are displayed, which as a side effect of course means we'll send headers, etc. - e.g.

<snip>

Thank you for looking into this!

One thing that I don't quite understand is why exceeding post_max_size actually seems to be OK in practice - a POST that exceeds the limit doesn't appear to, e.g., truncate the request.

We could always change the max size in the proxies before, to ensure that at least they will not make it to the app which will fail it anyway, at least for shellbox-media

Mentioned in SAL (#wikimedia-operations) [2025-03-04T17:58:05Z] <swfrench@deploy2002> Finished scap sync-world: Use latest php8.1 images - T377038 (duration: 24m 53s)

Change #1124388 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox-media: serve 1/8 of requests on 8.1 with more logging (2)

https://gerrit.wikimedia.org/r/1124388

Mentioned in SAL (#wikimedia-operations) [2025-03-04T23:53:08Z] <swfrench-wmf> started shellbox-media PHP 8.1 pilot with increased logging and display_startup_errors fix - T377038

Now that the display_startup_errors fix is available, I've started a second single-replica-per-DC pilot for shellbox-media.

I'll be checking throughout the remainder of the afternoon for Shellbox server returned incorrect Content-Type errors, which I am cautiously optimistic we should not see anymore.

Aside: One odd observation while testing with catch_workers_output enabled, is that we see errors like ERROR: Connection disallowed: IP address '10.192.5.27' has been dropped. every 5 seconds, where the IP is k8s worker node IP. The period (and source IP) suggests this is the liveness check hitting port 9000 on the php-fpm (app) container, though clearly the app is indeed considered live by the kubelet, so the check must be passing. In any case, I'll have to look into that later on.

About two hours in, we've seen a handful of POST Content-Length of {size} bytes exceeds the limit of 104857600 bytes in Unknown on line 0 warnings pop up in in stderr via catch_workers_output, but no sign of associated Shellbox server returned incorrect Content-Type errors.

What remains curious - and indeed is consistent with what we've seen on 7.4 historically - is that there's no well-correlated signal of these requests failing in the exec logs. It's almost as if php://input somehow still works despite the POST request body reading code ostensibly bailing out when emitting the warning.

Change #1124767 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] shellbox-media: switch main to 8.1

https://gerrit.wikimedia.org/r/1124767

Change #1124767 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox-media: switch main to 8.1

https://gerrit.wikimedia.org/r/1124767

Many thanks to @jijiki for moving the shellbox-media migration forward today. I've confirmed again that there are no further Shellbox server returned incorrect Content-Type exceptions (last one in logstash was the test in T377038#10596730 on Monday).

In addition to cleanup (e.g., turning catch_workers_output back off), there are a couple of accumulated follow-ups from what we've learned here (e.g., (re)consider limits on POST request size, whether the Connection disallowed errors are something we should address, etc.) which I'll aim to pull together tomorrow.

I spent a bit of time looking at the PHP source code yesterday evening to try to understand what's happening with post_max_size - specifically, why we see the POST Content-Length of {size} bytes exceeds the limit of 104857600 bytes in Unknown on line 0 warnings, but in practice do not see errors due to missing POST body content (e.g., MultipartError being thrown by MultipartReader).

To recap, shellbox reads multipart/mixed POST body content via php://input in its MultipartReader. It has no use for the $_POST or $_FILES superglobals.

Now, the warning is emitted early in request processing, in our case in sapi_read_standard_form_data (main/SAPI.c). We end up there by way of the default POST reader, since we're not using one of the special-cased content-types (application/x-www-form-urlencoded or multipart/form-data) with handlers to populate the superglobals (though noting it's the same reader we would have used for the former content-type).

While sapi_read_standard_form_data returns early on the above warning, what this means for us in practice is a bit more involved. Specifically, the net effect seems to be that we do not eagerly read all POST content into a temp stream that is later used to back php://input.

However, I'm relatively confident that the logic in php_stream_url_wrap_php (ext/standard/php_fopen_wrapper.c), which implements operations on php:// streams, is what saves us here.

If the temp stream for POST data is not already initialized, as in the case here, then attempting to open php://input will now do so, with php_stream_input_read (defined here) registered as the stream read op. That in turn uses sapi_read_post_block - just like we would have in sapi_read_standard_form_data - to read the POST body block-by-block.

So, to summarize: It seems post_max_size doesn't apply to "on-demand" use of php://input - just eager reading that's necessary for use cases like the handlers that populate the superglobals. This sort of makes sense in the context of the docs, which mainly feature the effect on the superglobals.

Change #1125194 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] shellbox-media: clean up extra debug logging

https://gerrit.wikimedia.org/r/1125194

Change #1125194 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox-media: clean up extra debug logging

https://gerrit.wikimedia.org/r/1125194

Thanks for flagging @Reedy!

So, I'm confident this is unrelated to the migration of shellbox-constraints to 8.1 (for reference, that started on the 28th of January).

Instead, this is what happens when mediawiki requests that in turn issue requests to shellbox-constraints don't terminate within the grace period we've set for envoy (i.e., what's listening on localhost:6025) termination when a pod is shutdown, particularly during mediawiki deployments.

If you zoom out to 3 months or so (logstash):

image.png (658×1 px, 121 KB)

you can see that it's nearly entirely limited to deployment days (e.g., weekends and end-of-year holiday are quiet). Also, there's no specific changepoint correlated with the shellbox-constraints migration.

There is perhaps a weak upward trend week-over-week, but one would need to normalize that to the rate of deployments to detect an actual increase in likelihood.

In any case, I believe T371633 also has discussion on this effect.

Change #1134098 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] shellbox*: normalize migration releases

https://gerrit.wikimedia.org/r/1134098

Change #1134099 had a related patch set uploaded (by Scott French; author: Scott French):

[mediawiki/libs/Shellbox@master] Stop building PHP 7.4-based images

https://gerrit.wikimedia.org/r/1134099

Change #1134245 had a related patch set uploaded (by Scott French; author: Scott French):

[mediawiki/libs/Shellbox@master] Remove PHP 7.4-based image variants

https://gerrit.wikimedia.org/r/1134245

Change #1134098 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox*: normalize migration releases

https://gerrit.wikimedia.org/r/1134098

Change #1134099 merged by jenkins-bot:

[mediawiki/libs/Shellbox@master] Stop building PHP 7.4-based images

https://gerrit.wikimedia.org/r/1134099

Change #1134245 merged by jenkins-bot:

[mediawiki/libs/Shellbox@master] Remove PHP 7.4-based image variants

https://gerrit.wikimedia.org/r/1134245

Now that 7.4-based images are no longer being built and the 'migration' releases have been normalized for their next use (8.3), no further cleanups are tracked here.

The one remaining cleanup item is tracked separately in T388260: Converge / update image version across shellbox service instances (cleanup).