
Score occasionally gets a 503 response from Shellbox
Closed, Resolved (Public)

Description

On large scores, Shellbox is returning a 503 error that bubbles up into a user-facing exception.

I think these are mostly timeouts, but I still need to verify.

We still have a bit of performance tuning work to do: T286384: Benchmark Shellbox

If these are timeouts (can Shellbox tell?), it would ideally be a 504 and a "TimeoutException" or similar, so it's obvious to people looking at the logs what the issue is.
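For illustration, a minimal sketch of that idea; the exception class, the handler and the $result shape here are all hypothetical, not Shellbox's actual API:

```php
<?php
// Hypothetical sketch only; Shellbox's real request handling differs.
// The idea: a command that hit its time limit becomes a 504 plus a
// distinctly named exception, instead of a generic 503.

class ShellboxTimeoutException extends RuntimeException {
}

function buildErrorResponse( array $result ): array {
    // Assume the command runner sets $result['timedOut'] when the limit was hit.
    if ( !empty( $result['timedOut'] ) ) {
        return [
            'status' => 504,
            'error' => ShellboxTimeoutException::class,
            'message' => 'Command exceeded the configured time limit',
        ];
    }
    return [
        'status' => 503,
        'error' => 'ShellboxError',
        'message' => $result['message'] ?? 'Command failed',
    ];
}

// A timed-out command then shows up in the logs as a 504, not a generic 503.
var_dump( buildErrorResponse( [ 'timedOut' => true ] )['status'] ); // int(504)
```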

Event Timeline

Legoktm renamed this task from Score ocassiontally gets a 503 response from Shellbox to Score occasionally gets a 503 response from Shellbox. Jul 23 2021, 8:36 PM

Well, this error appears quite often, on almost any non-tiny score, in most cases immediately after the request, and often multiple times when retrying.

  1. With 2.22: I made a test displaying a set of pages in sequence in a browser, and then did the same in private mode to skip the browser cache (and as a non-logged-in user). The error appeared about 6 times across about 40 pages. Note: the scores were already pregenerated and should be cached.
  2. With 2.18, the error often appeared multiple times even when the resulting score image was an old one, not generated in Shellbox.

Score needs to check the LilyPond version in order to determine whether the requested image, generated by this version of LilyPond, already exists. Maybe such checks overload Shellbox? Maybe they can be done in a static way, without executing lilypond --version?

It is also possible that there are a few different problems hidden here. I sometimes (rarely) observed this error after some delay rather than immediately after a request; this may be different from the T257066#7233684 version check.

Well, this error appears quite often, on almost any non-tiny score, in most cases immediately after the request, and often multiple times when retrying.

  1. With 2.22: I made a test displaying a set of pages in sequence in a browser, and then did the same in private mode to skip the browser cache (and as a non-logged-in user). The error appeared about 6 times across about 40 pages. Note: the scores were already pregenerated and should be cached.
  2. With 2.18, the error often appeared multiple times even when the resulting score image was an old one, not generated in Shellbox.

Thank you, this is all helpful.

Score needs to check the LilyPond version in order to determine whether the requested image, generated by this version of LilyPond, already exists. Maybe such checks overload Shellbox? Maybe they can be done in a static way, without executing lilypond --version?

When you wrote your comment, we were caching the version for 60s; I've now bumped it to an hour (see https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Score/+/707430).
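For context, this is roughly the shape of that caching on the MediaWiki side; a simplified sketch using WANObjectCache, not the actual Score code, and the cache key, TTL constant and lookUpLilyPondVersion() helper are illustrative:

```php
<?php
// Simplified sketch of the version caching, not the actual Score extension code.
// getWithSetCallback() only runs the callback (and hence `lilypond --version`,
// via Shellbox) on a cache miss, so raising the TTL from 60s to an hour cuts
// down the number of version-check requests sent to Shellbox.
use MediaWiki\MediaWikiServices;

function getCachedLilyPondVersion(): string {
    $cache = MediaWikiServices::getInstance()->getMainWANObjectCache();
    return $cache->getWithSetCallback(
        $cache->makeGlobalKey( 'score', 'lilypond-version' ),
        $cache::TTL_HOUR, // previously this was effectively 60 seconds
        static function () {
            // Placeholder for the real lookup, which shells out through
            // Shellbox and parses the first line of `lilypond --version`.
            return lookUpLilyPondVersion();
        }
    );
}

function lookUpLilyPondVersion(): string {
    return '2.22.0'; // stand-in value for the sketch
}
```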

It is also possible that there are a few different problems hidden here. I sometimes (rarely) observed this error after some delay rather than immediately after a request; this may be different from the T257066#7233684 version check.

OK, I finally had some time to dig into the logs. As far as I can tell from the httpd logs, there are no 503 responses coming from Shellbox to MediaWiki: https://logstash.wikimedia.org/goto/9f5733a8712a53454389b36c953520b1

I searched for one example, reqId:"78157f46-c040-4690-8cb1-1ecad464aeae", which happened at Aug 5, 2021 @ 18:44:11.735 UTC on https://en.wikisource.org/wiki/A_Dictionary_of_Music_and_Musicians/Passion_Music.

And then in the envoy log on mw2389:

[2021-08-05T18:44:11.731Z] "POST /shell/score-lilypond HTTP/1.1" 503 UC 5780 95 0 - "-" "MediaWiki/1.37.0-wmf.17" "78157f46-c040-4690-8cb1-1ecad464aeae" "localhost:6024" "10.2.1.51:4008"

The envoy docs say "UC: Upstream connection termination in addition to 503 response code." But I'm not really sure what that means...

This is a problem we've encountered before - what I think happens is that envoy (MW side) establishes a persistent connection to envoy (Shellbox side), which in turn makes a connection to the local Apache on the Shellbox host.

Now, while envoy will keep the connection open indefinitely by default, Apache will terminate it once it has been idle for longer than the KeepAliveTimeout. When this happens, there is sometimes a race condition where the envoy-envoy connection doesn't get closed when the backend connection does, which causes this.

What we do in these situations is add an idle timeout, slightly below Apache's own, to the envoy connection on the MW side. That should mostly eliminate such cases.
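For concreteness, a sketch of the kind of setting involved; the cluster name and values below are illustrative (the real change lives in the services_proxy puppet module), and the only point is that the envoy-side idle timeout stays below Apache's KeepAliveTimeout:

```yaml
# Illustrative envoy cluster snippet, not the actual WMF services_proxy config.
# The upstream idle timeout (4.5s here) sits just below Apache's default
# KeepAliveTimeout of 5s, so envoy drops idle connections before Apache can
# close them from the other end and trigger the UC/503 race described above.
clusters:
  - name: shellbox-score            # hypothetical cluster name
    type: STATIC
    connect_timeout: 1s
    common_http_protocol_options:
      idle_timeout: 4.5s            # keep this below Apache's KeepAliveTimeout
    load_assignment:
      cluster_name: shellbox-score
      endpoints:
        - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: 10.2.1.51   # upstream address from the log line above
                    port_value: 4008
```

Apache's default KeepAliveTimeout is 5 seconds; whatever the Apache on the Shellbox hosts is actually configured with, the envoy-side idle timeout just needs to be lower.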

Change 710420 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] services_proxy: add keepalive for shellboxes

https://gerrit.wikimedia.org/r/710420

Change 710420 merged by Giuseppe Lavagetto:

[operations/puppet@production] services_proxy: add keepalive for shellboxes

https://gerrit.wikimedia.org/r/710420

Also, interestingly, we had quite a few actual errors from Shellbox (see https://logstash.wikimedia.org/goto/2a88f42efee0ce1edea973e0dacc2fb4) that apparently stopped two days ago.

This is a problem we've encountered before - what I think happens is that envoy (MW side) establishes a persistent connection to envoy (Shellbox side), which in turn makes a connection to the local Apache on the Shellbox host.

Now, while envoy will keep the connection open indefinitely by default, Apache will terminate it once it has been idle for longer than the KeepAliveTimeout. When this happens, there is sometimes a race condition where the envoy-envoy connection doesn't get closed when the backend connection does, which causes this.

What we do in these situations is add an idle timeout, slightly below Apache's own, to the envoy connection on the MW side. That should mostly eliminate such cases.

Ahhh, thanks :D

Also, interestingly, we had quite a few actual errors from Shellbox (see https://logstash.wikimedia.org/goto/2a88f42efee0ce1edea973e0dacc2fb4) that apparently stopped two days ago.

Yeah, but as far as I can see those are all monitoring failures; real Shellbox requests have a UA of "MediaWiki/1.37.0-wmf.XX". It's still concerning that the fpm-status page is returning 503s, but I didn't see anything in the logs related to that 503 error.

Screenshot 2021-08-06 at 23-59-57 Discover - Elastic.png

Looks good so far (coincidentally, the last error logged was a minute after the puppet patch was merged, though not yet deployed, I presume). I'll check again on Monday and close this if nothing else crops up.

Legoktm assigned this task to Joe.

Screenshot 2021-08-09 at 09-17-34 Discover - Elastic.png

Calling this fixed.