Page MenuHomePhabricator

Pages with Pygments or Timeline intermittenly fail to render (Shellbox server returned status code 503)
Closed, ResolvedPublicPRODUCTION ERROR

Description

Error
normalized_message
[{reqId}] {exception_url}   Shellbox\ShellboxError: Shellbox server returned status code 503
exception.trace
from /srv/mediawiki/php-1.38.0-wmf.2/vendor/wikimedia/shellbox/src/Client.php(184)
#0 /srv/mediawiki/php-1.38.0-wmf.2/vendor/wikimedia/shellbox/src/Command/RemoteBoxedExecutor.php(80): Shellbox\Client->sendRequest(string, array, array, array)
#1 /srv/mediawiki/php-1.38.0-wmf.2/vendor/wikimedia/shellbox/src/Command/BoxedExecutor.php(20): Shellbox\Command\RemoteBoxedExecutor->executeValid(Shellbox\Command\BoxedCommand)
#2 /srv/mediawiki/php-1.38.0-wmf.2/vendor/wikimedia/shellbox/src/Command/BoxedCommand.php(183): Shellbox\Command\BoxedExecutor->execute(Shellbox\Command\BoxedCommand)
#3 /srv/mediawiki/php-1.38.0-wmf.2/extensions/timeline/includes/Timeline.php(178): Shellbox\Command\BoxedCommand->execute()
#4 /srv/mediawiki/php-1.38.0-wmf.2/extensions/timeline/includes/Timeline.php(78): Timeline::renderTimeline(string, array)
#5 /srv/mediawiki/php-1.38.0-wmf.2/includes/parser/Parser.php(3948): Timeline::onTagHook(string, array, Parser, PPFrame_Hash)
#6 /srv/mediawiki/php-1.38.0-wmf.2/includes/parser/PPFrame_Hash.php(350): Parser->extensionSubstitution(array, PPFrame_Hash)
#7 /srv/mediawiki/php-1.38.0-wmf.2/includes/parser/Parser.php(2905): PPFrame_Hash->expand(PPNode_Hash_Tree, integer)
#8 /srv/mediawiki/php-1.38.0-wmf.2/includes/parser/Parser.php(1573): Parser->replaceVariables(string)
#9 /srv/mediawiki/php-1.38.0-wmf.2/includes/parser/Parser.php(653): Parser->internalParse(string)
#10 /srv/mediawiki/php-1.38.0-wmf.2/includes/content/WikitextContent.php(299): Parser->parse(string, Title, ParserOptions, boolean, boolean, integer)
#11 /srv/mediawiki/php-1.38.0-wmf.2/includes/content/AbstractContent.php(548): WikitextContent->fillParserOutput(Title, integer, ParserOptions, boolean, ParserOutput)
#12 /srv/mediawiki/php-1.38.0-wmf.2/includes/Revision/RenderedRevision.php(266): AbstractContent->getParserOutput(Title, integer, ParserOptions, boolean)
#13 /srv/mediawiki/php-1.38.0-wmf.2/includes/Revision/RenderedRevision.php(235): MediaWiki\Revision\RenderedRevision->getSlotParserOutputUncached(WikitextContent, boolean)
#14 /srv/mediawiki/php-1.38.0-wmf.2/includes/Revision/RevisionRenderer.php(217): MediaWiki\Revision\RenderedRevision->getSlotParserOutput(string, array)
#15 /srv/mediawiki/php-1.38.0-wmf.2/includes/Revision/RevisionRenderer.php(154): MediaWiki\Revision\RevisionRenderer->combineSlotOutput(MediaWiki\Revision\RenderedRevision, array)
#16 [internal function]: MediaWiki\Revision\RevisionRenderer->MediaWiki\Revision\{closure}(MediaWiki\Revision\RenderedRevision, array)
#17 /srv/mediawiki/php-1.38.0-wmf.2/includes/Revision/RenderedRevision.php(197): call_user_func(Closure, MediaWiki\Revision\RenderedRevision, array)
#18 /srv/mediawiki/php-1.38.0-wmf.2/includes/poolcounter/PoolWorkArticleView.php(137): MediaWiki\Revision\RenderedRevision->getRevisionParserOutput()
#19 /srv/mediawiki/php-1.38.0-wmf.2/includes/poolcounter/PoolCounterWork.php(162): PoolWorkArticleView->doWork()
#20 /srv/mediawiki/php-1.38.0-wmf.2/includes/page/ParserOutputAccess.php(281): PoolCounterWork->execute()
#21 /srv/mediawiki/php-1.38.0-wmf.2/includes/page/Article.php(697): MediaWiki\Page\ParserOutputAccess->getParserOutput(WikiPage, ParserOptions, MediaWiki\Revision\RevisionStoreRecord, integer)
#22 /srv/mediawiki/php-1.38.0-wmf.2/includes/page/Article.php(509): Article->generateContentOutput(User, ParserOptions, integer, OutputPage, array)
#23 /srv/mediawiki/php-1.38.0-wmf.2/includes/actions/ViewAction.php(74): Article->view()
#24 /srv/mediawiki/php-1.38.0-wmf.2/includes/MediaWiki.php(538): ViewAction->show()
#25 /srv/mediawiki/php-1.38.0-wmf.2/includes/MediaWiki.php(320): MediaWiki->performAction(Article, Title)
#26 /srv/mediawiki/php-1.38.0-wmf.2/includes/MediaWiki.php(925): MediaWiki->performRequest()
#27 /srv/mediawiki/php-1.38.0-wmf.2/includes/MediaWiki.php(559): MediaWiki->main()
#28 /srv/mediawiki/php-1.38.0-wmf.2/index.php(53): MediaWiki->run()
#29 /srv/mediawiki/php-1.38.0-wmf.2/index.php(46): wfIndexMain()
#30 /srv/mediawiki/w/index.php(3): require(string)
#31 {main}
Notes
  • The same pages render file when re-tried.
  • The payloads are insignificant.
  • The shellbox service does not appear to be under notable load.
Impact

4 of these recently, all in 1.38.0-wmf.2.

Event Timeline

Just saw another one while looking elsewhere in logstash https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-deploy-2021.10.08?id=bvbwYXwBf67HOmRff8lJ (reqId: 43b1a7a9-028a-43fe-8413-cccc39309c4a), this one was for SyntaxHighlight though.

We get a handful of these errors every day (for example 19 in the last 24 hours). There don't seem to be any corresponding errors logged that provide more detail about what is happening.

Something must have regressed with envoy, I'll need to look into those logs.

@Legoktm A question that came up during train log triage today: Is there a logstash dashboard for Shellbox?

I think these are real 503s generated by Shellbox, from the shellbox-timeline httpd logs:

2021-10-02T23:51:01	1119	127.0.0.1	proxy:fcgi://127.0.0.1:9000/503	299	POST	http://localhost:6028/shell/easytimeline	-	text/html	-	-	MediaWiki/1.38.0-wmf.2	-	-	-	-	127.0.0.1	afbe7b30-7810-47ed-ae21-d171f5f2c67e	-
2021-09-30T19:38:31	130859	127.0.0.1	proxy:fcgi://127.0.0.1:9000/503	299	POST	http://localhost:6028/shell/easytimeline	-	text/html	-	-	MediaWiki/1.38.0-wmf.2	-	-	-	-	127.0.0.1	cb0d6244-7d9c-4a8f-8b36-c2fecc48ab7d	-

@Legoktm A question that came up during train log triage today: Is there a logstash dashboard for Shellbox?

No, just what I've written down at https://wikitech.wikimedia.org/wiki/Shellbox#Logs

I believe this task was presumed less urgent due to an assumption that it would only affect requestst that involve rendering of wiki pages that need syntax highlighting, or perhaps more specifically that syntax highlighting falls back to rendering source code on articles as plain text.

Both of these are incorrect.

  • Based on very crude Logstash analysis, I believe the vast majority of requestst that fatal due to Shellbox 503, are requests that do not involve any wiki page or involve wiki pages that don't contain any <syntaxhighlight> block. Instead, the reason they fatal is due to Pygments::version() fatalling early on which is used in various cache keys. It's also used on api.php help guides, formatted api.php responses, and Special:Version.
  • Based on modifying vendor/wikimedia/shellbox/src/Client.php:172 to throw its ShellboxError unconditionally even on HTTP 200, and trying this on mwdebug1002, I've confirmed that the various code paths in SyntaxHighlight that error handling doesn't work (e.g. catching PygmentsException after non-zero shell exit codes). As part of the transition to Shellbox, while these presumably do catch cases of successfull Shellbox responses that wrap a non-zero remote exit code, they fail to handle the (now) far more common error of Shellbox itself not completing the request.

The latter means that api.php and Special:Version regularly fatal in their entirely, which seems unintended.

Krinkle renamed this task from Shellbox\ShellboxError: Shellbox server returned status code 503 to Pages with Pygments or Timeline intermittenly fail to render (Shellbox server returned status code 503).May 4 2023, 4:47 PM
Krinkle updated the task description. (Show Details)
Krinkle added a project: SyntaxHighlight.

@Krinkle based on the retitling, am I correct in assuming that you're looking for a fix on the SyntaxHighlight/Timeline side that more gracefully handles these errors? (And not fixing Shellbox 503s necessarily)

@Legoktm No, not per se. I retitled to signify impact and indicate scope (two extensions, not one).

We probably should also improve fallback in MediaWiki, since requiring perfect availability isn't reasonable. The fallback seems to have stopped working when the code was transitioned to Shellbox because there are now two failure modes (shell fail over http success, and low level http fail) but only the former is handled. But the service also seems in in need of a fix, or at least a better understanding (and maybe retry policy) of why it is as unstable as it appears.

A service should imho not continually fail by default under low load for trivial getVersion calls that succeed on re-try. It seems reasonable to expect backend services to at least occasionally get 1h stretches a day of 100% success under low-load conditions, given small uncontentious payloads that are known to succeed. When a service doesn't even reach that, it poisons confidence in the intermediary services (Shellbox, Envoy, Kubernetes) until we understand why.

Logstash: https://logstash.wikimedia.org/goto/927538f8c5d1a937cc71c7909b907560

Grafana: shellbox

Change 921357 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] shellbox: Add service mesh envoy retries

https://gerrit.wikimedia.org/r/921357

Change 921357 merged by Alexandros Kosiaris:

[operations/puppet@production] shellbox: Add service mesh envoy retries

https://gerrit.wikimedia.org/r/921357

Notes from looking at it with @jijiki and @akosiaris:

  • The 503s returned to MediaWiki from local envoy / php-curl do not reach the shellbox service in k8s, as measured by the envoy-tls terminator inside the k8s pod. The stdout from the k8s pod shows no HTTP errors. This confirms what we saw on Grafana: shellbox.
  • The 503s are seen in the local envoy logs on each appserver. For example, on appserver, parsoid and api appserver we see things like:
krinkle@parse1018:~$ tail /var/log/envoy/shellbox-syntaxhighlight.log 
[2023-05-18T15:20:10.915Z] "POST /shell/syntaxhighlight-pygments HTTP/1.1" 503 UF 903 91 251 - "-" "MediaWiki/1.41.0
[2023-05-18T15:21:35.669Z] "POST /shell/syntaxhighlight-pygments HTTP/1.1" 503 UF 928 91 252 - "-" "MediaWiki/1.41.0
[2023-05-18T20:12:08.416Z] "POST /shell/syntaxhighlight-pygments HTTP/1.1" 503 UF 1068 91 248 - "-" "MediaWiki/1.41.0
[2023-05-19T14:13:56.228Z] "POST /shell/syntaxhighlight-pygments HTTP/1.1" 503 UF 897 91 249 - "-" "MediaWiki/1.41.0
  • Alex's theory at first was that there might be a bug in Envoy with it killing reusable connections after a certain amount of time from a separate thread and doing so even if it's mid-flight. We ruled this out after realizing that the errors above take well over 200ms before the failure is reached. (The access log above is in Envoy access log format). 200ms is far above the normal latency for this service. So it looks like it might be a timeout after all. Not a timeout in the k8s/shellbox service, but a connection timeout.
  • There is apparently a separate Grafana dashboard that plots data about the non-k8s appserver/outgoing Envoy: Grafana: Envoy Telemetry where we can see the 503s.

So far these connection timeouts do not appear to correlate with anything happening on the host, E.g. looking at the Host overview dashboard for one of the MW servrs (parse1018), there's not a correlating spike in socket errors, or TCP retransmits.

@akosiaris It's dropped a lot. But.. it's not completely gone either. And.. still with no errors reported on the pod-side (Grafana: shellbox).

https://logstash.wikimedia.org/goto/1fe28c6d94112319b556ec237409d764

It's now down to about 10-20 a day. Impact the same as before, i.e. the MediaWiki backend pageview request responds with HTTP 500 to the browser, after the Shelbox HTTP 503 being turned into a PHP exception.

Change 926661 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/extensions/SyntaxHighlight_GeSHi@master] Pygmentize: Treat Shellbox network loss like non-zero exit code

https://gerrit.wikimedia.org/r/926661

Krinkle triaged this task as Medium priority.

Change 926661 merged by jenkins-bot:

[mediawiki/extensions/SyntaxHighlight_GeSHi@master] Pygmentize: Treat Shellbox network loss like non-zero exit code

https://gerrit.wikimedia.org/r/926661

Due to a typo, the patch that fixed this issue for the other shellboxes is not visible here

https://gerrit.wikimedia.org/r/c/operations/puppet/+/929354

Change #1130568 had a related patch set uploaded (by Hashar; author: Hashar):

[mediawiki/extensions/SyntaxHighlight_GeSHi@master] Gracefully fallback when Shellbox is not available

https://gerrit.wikimedia.org/r/1130568

Change #1130568 merged by jenkins-bot:

[mediawiki/extensions/SyntaxHighlight_GeSHi@master] Gracefully fallback when Shellbox is not available

https://gerrit.wikimedia.org/r/1130568