Starting around March 24, many requests to production (the action API, the REST API, or plain HTML requests) have been timing out or returning empty responses. In my experience, this by itself is not unusual. What is unusual is the rate: normally it happens a handful of times a day, but now it is many hundreds or thousands of times a day.
Symptoms:
- A simple request (e.g. fetching the HTML for a page, the prop=info API, or the pageviews API) may either time out after 30 seconds or return an empty response.
- This is intermittent; if I make a request for [[Foo]], it may time out, but usually the very next request to [[Foo]] gives a quick, successful response as expected.
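For anyone who wants to try reproducing this outside of XTools or the bots, here is a minimal probe sketch in Python. It is not what XTools itself runs. The URL, the User-Agent string, and the names `classify_exception`, `fetch_once`, and `probe` are all assumptions made up for illustration. It repeats the same request and tallies how often it succeeds, times out (roughly cURL error 28), or gets the connection closed with no data (roughly cURL error 52).

```python
import http.client
import socket
import time
import urllib.error
import urllib.request
from collections import Counter

# Assumed endpoint for illustration only; substitute the request you care about.
URL = "https://en.wikipedia.org/w/api.php?action=query&prop=info&titles=Foo&format=json"
TIMEOUT_SECONDS = 30  # matches the 30-second timeout seen in the reports


def classify_exception(exc):
    """Map an exception from urllib to a coarse outcome label."""
    if isinstance(exc, urllib.error.HTTPError):
        return "http_error"
    # urllib wraps connection-phase errors in URLError; unwrap to the cause.
    if isinstance(exc, urllib.error.URLError) and isinstance(exc.reason, BaseException):
        exc = exc.reason
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "timeout"  # roughly cURL error 28
    if isinstance(exc, (http.client.RemoteDisconnected, ConnectionResetError)):
        return "empty"  # roughly cURL error 52
    return "other"


def fetch_once(url=URL, timeout=TIMEOUT_SECONDS):
    """Make one request and return (outcome, elapsed_seconds)."""
    start = time.monotonic()
    try:
        # Hypothetical User-Agent; use one that identifies your own tool.
        req = urllib.request.Request(url, headers={"User-Agent": "timeout-probe/0.1"})
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            body = resp.read()
        outcome = "ok" if body else "empty"
    except Exception as exc:
        outcome = classify_exception(exc)
    return outcome, time.monotonic() - start


def probe(attempts, fetch=fetch_once, pause=0.0):
    """Call fetch() repeatedly and count how often each outcome occurs."""
    counts = Counter()
    for _ in range(attempts):
        outcome, _elapsed = fetch()
        counts[outcome] += 1
        if pause:
            time.sleep(pause)
    return counts


# Example: print(probe(100, pause=0.5))
```

Running this from Toolforge, VPS, and a machine outside of Wikimedia infrastructure, and comparing the counts, might help narrow down where the requests are being lost.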
How I know this:
- XTools (runs on VPS) sends me an email whenever there are fatal exceptions in the application. In this case there are cURL timeouts when it tries to make requests to the wiki. Our ArticleInfo API (queries the prop=info API) gets about 100 req/minute, and the Prose API (makes HTML requests) receives about 20 req/minute.
- My continuously running bots (running on Toolforge) are also reporting occasional timeouts. See the on-wiki error logs (towards the bottom): 1, 2, 3
Recent examples, with UTC timestamps:
- [2020-03-31 16:29:40] Pageviews API got cURL error 28: Operation timed out after 30002 milliseconds with 0 out of 0 bytes received
- [2020-03-31 16:52:55] prop=info API got cURL error 28: Operation timed out after 30001 milliseconds with 0 out of 0 bytes received
- [2020-03-31 17:08:37] prop=info API got cURL error 52: Empty reply from server
Screenshot of my inbox as of March 31, at 2:28 PM local time (note the counts for each thread):
Given that these tools make relatively few requests compared to what production receives overall, I believe the issue is probably more widespread. The strange thing is that vital signs seem to be OK, and there haven't been many complaints from users that I'm aware of. My attempts to search Logstash have not been successful; it's as if the requests never made it to the production servers (or I just don't know how to use Logstash :). Please see "Recent examples" above.
At this point I'm confident it's a production issue, probably network-related. For a while I thought it was my applications, or something with Toolforge and VPS, but requests to other external services are not timing out, only requests to production Wikimedia services. I am tagging Cloud-Services anyway just in case.
Any tips on debugging?