Earlier today, we tried to deploy Shellbox server using a newly built image, reflecting what has now been released as Shellbox v4.5.0. While updates to 5 of the 6 server variants showed no obvious issues issues, shellbox-syntaxhighlight very much did, returning nearly all 500 errors while running the new image.
So, what's up?
While we assumed these were almost definitely the result of errors handled by Server::handleException (i.e., returned as 500 with json body containing exception details) there were no corresponding exceptions / errors surfaced in MediaWiki error channels logstash, which was puzzling. I'll get to why shortly.
In any case, without obvious logs to work from, I briefly redeployed the suspect image to a single pod in codfw with catch_workers_output=1 in FPM ini settings, which allowed me to recover (P93613):
Exception of class TypeError: Cannot assign null to property Shellbox\\Command\\Command::$includeStderr of type bool
#0 /srv/app/src/Command/BoxedCommand.php(474): Shellbox\Command\Command->setClientData(Array)
#1 /srv/app/src/Action/ShellAction.php(92): Shellbox\Command\BoxedCommand->setClientData(Array)
#2 /srv/app/src/Action/ShellAction.php(27): Shellbox\Action\ShellAction->createCommand(Array)
#3 /srv/app/src/Action/MultipartAction.php(60): Shellbox\Action\ShellAction->execute(Array)
#4 /srv/app/src/Server.php(134): Shellbox\Action\MultipartAction->baseExecute(Array)
#5 /srv/app/src/Server.php(75): Shellbox\Server->guardedExecute('/srv/app/config...')
#6 /srv/app/src/Server.php(64): Shellbox\Server->execute('/srv/app/config...')
#7 /srv/app/index.php(3): Shellbox\Server::main('/srv/app/config...')
#8 {main}In short, I believe this is a combination of https://gerrit.wikimedia.org/r/c/mediawiki/libs/Shellbox/+/1267941 and https://gerrit.wikimedia.org/r/c/mediawiki/libs/Shellbox/+/1285399.
- MediaWiki vendor contains Shellbox v4.4.0, and thus Command::includeStderr is not initialized with its (new) default value (false).
- Pygments::highlight (i.e., the majority of syntaxhighlight Shellbox invocations) executes the Shellbox call without explicitly setting includeStderr, which then serializes to null in the command data sent to Shellbox server (json).
- The call to Command::setClientData on the server using the deserialized data attempts to set a bool property to null, and sadness ensues.
Why did we only see this in syntaxhighlight?
While there are presumably other BoxedCommand execution code paths that don't explicitly set includeStderr, I suspect they're just a lot more rare than Pygments::highlight. Which is to say, if we'd not rolled back within minutes, we'd probably see other examples.
Why didn't we see anything in logstash?
The ShellboxError thrown by the client will be converted to a PygmentsException which is then only logged via wfWarn.
What should we do about it?
I suspect the simplest option is to just update MediaWiki vendor to Shellbox v4.5.0 before updating the server.
That should ensure that the Command (and BoxedCommand) properties are always serialized by the client with values that will satisfy type checks when deserialized at the server. That should also be safe to do before the server is updated (i.e., those values should indeed be backward compatible).
The way that would fail us is if we also have the same problem in reverse - i.e., if the serialized response values from a server running older code will not type check when used in newer client code. I'm cautiously optimistic that's less likely, but if it does happen, we'll need to do something more involved.