Shellbox error rate increased from 100/d to 1000/d, 2022-07-12
Closed, Resolved, Public

Description

The Shellbox error rate as seen by MediaWiki increased from ~100 per day to ~1000/d, starting 2022-07-11 or 2022-07-12.

Per-file counts from zgrep -c 'Shellbox server returned status code' archive/exception-json.log*, plotted on a log scale:

shellbox-errors.png (708×964 px, 19 KB)

You can see there were two bad days exceeding 10,000 errors per day; the counts were 36994 and 30085. These are user-visible errors. I used the dates from the log filenames.
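For anyone who wants to reproduce the per-file counts without zgrep, here is a minimal Python sketch of the same idea. The function names are made up for illustration, and it assumes compressed logs can be recognised by a .gz suffix (zgrep itself inspects the file contents rather than the name), so treat it as an approximation of the command above rather than an exact equivalent.

```python
import gzip
from pathlib import Path

# The string searched for in the description above.
NEEDLE = "Shellbox server returned status code"


def count_matches(path, needle=NEEDLE):
    """Count lines containing `needle` in a plain or gzip-compressed log file.

    Assumption: compressed files end in ".gz"; anything else is read as text.
    """
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8", errors="replace") as fh:
        return sum(1 for line in fh if needle in line)


def counts_per_file(directory, pattern="exception-json.log*"):
    """Return {file name: matching line count}, similar to `zgrep -c` output."""
    directory = Path(directory)
    return {p.name: count_matches(p) for p in sorted(directory.glob(pattern))}


if __name__ == "__main__":
    # "archive" is the directory used in the zgrep command above.
    for name, count in counts_per_file("archive").items():
        print(f"{name}:{count}")
```

As with the zgrep command, this yields one count per log file, so the per-day figures still depend on the dates in the log filenames.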

Typical envoy log entry

[2022-07-20T04:13:22.531Z] "POST /shell/syntaxhighlight-pygments HTTP/1.1" 503 UF 1750 91 251 - "-" "MediaWiki/1.39.0-wmf.19" "84179aea-bf17-430c-b3e3-1ae8f0adea41" "localhost:6027" "10.2.2.65:4014"

UF means "Upstream connection failure in addition to 503 response code".
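To pick such entries out of a larger log, the line can be split into named fields. The sketch below assumes the entry follows Envoy's default access log layout (start time, request line, response code, response flags, byte counts, duration, upstream service time, then five quoted fields); the field and function names are chosen here for illustration and are not taken from any Wikimedia tooling.

```python
import re

# Assumed layout: Envoy's default access log format, which the entry above
# appears to match. Group names are illustrative.
LINE_RE = re.compile(
    r'^\[(?P<start_time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d+) (?P<flags>\S+) '
    r'(?P<bytes_received>\d+) (?P<bytes_sent>\d+) (?P<duration_ms>\d+) '
    r'(?P<upstream_service_time>\S+) '
    r'"(?P<forwarded_for>[^"]*)" "(?P<user_agent>[^"]*)" '
    r'"(?P<request_id>[^"]*)" "(?P<authority>[^"]*)" "(?P<upstream_host>[^"]*)"$'
)


def parse_entry(line):
    """Return the fields of one access log line as a dict, or None if it
    does not match the assumed layout."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def is_upstream_connection_failure(entry):
    """True if the response flags include UF (multiple flags are comma-separated)."""
    return "UF" in entry["flags"].split(",")


if __name__ == "__main__":
    sample = (
        '[2022-07-20T04:13:22.531Z] "POST /shell/syntaxhighlight-pygments HTTP/1.1" '
        '503 UF 1750 91 251 - "-" "MediaWiki/1.39.0-wmf.19" '
        '"84179aea-bf17-430c-b3e3-1ae8f0adea41" "localhost:6027" "10.2.2.65:4014"'
    )
    entry = parse_entry(sample)
    print(entry["status"], entry["flags"], entry["upstream_host"])
```

For the entry quoted above, this separates the 503 status, the UF flag, and the upstream address 10.2.2.65:4014, which makes it easier to count connection failures per upstream host.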

The Shellbox dashboard in Grafana shows zero errors. It uses errors as reported by the envoy inside Kubernetes, but evidently the requests are not successfully received by the pod.

There is T292663, but it reports a handful of errors per day, not 100 or 1000.

Event Timeline

Looking at logstash (https://logstash.wikimedia.org/goto/8b05ef476e1c74f8cb625fda5af6a81c), it seems we had some issues in July (I have not cross-checked), but before and after that we see ~100 errors per day. The ~100 per day level seems to have started on 2022-06-10 00:01Z, and the rate has been back at that level since 2022-07-21.

The absolute majority of those ~100 errors per day are indeed for extensions/SyntaxHighlight_GeSHi/includes/SyntaxHighlight.php(322): MediaWiki\SyntaxHighlight\Pygmentize::highlight(string, string, array).

Joe claimed this task.

Resolving, as it seems the problem has been analyzed and is not currently occurring.