
persistent toolforge 502/503 errors
Closed, Resolved | Public | BUG REPORT

Description

My Toolforge service (https://author-disambiguator.toolforge.org/) keeps becoming unavailable a few minutes after I restart it, with hangs, 502 Bad Gateway, or other server errors, and I can't see what could be causing this. These errors don't show up in the error log, and the 502 responses don't show up in the access log (which has had very little traffic anyway, usually one request per minute at most).

I can connect to the Kubernetes pod with kubectl and everything looks normal: there are only a few processes listed in /proc, etc. But I can't get a response via the web after the first few minutes.

The problem seems to have started mid-day yesterday - see the monitor data here:

https://grafana-labs.wikimedia.org/d/toolforge-k8s-namespace-resources/kubernetes-namespace-resources?orgId=1&refresh=5m&var-namespace=tool-author-disambiguator

with the surge in 4xx and 5xx status codes on 1/3 (by the way, I don't see the surge in 4xx status codes in access.log recently either - there are 2 from this morning and none yesterday, nothing like the multiple per second indicated in that grafana chart!)
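The access.log comparison above can be sketched roughly as follows. This is only an illustration: it assumes the log is in Common/Combined Log Format (I have not confirmed that this is what the Toolforge webservice writes), and the function name `tally_status_classes` and the sample lines are made up for the example, not taken from the real log.

```python
import re
from collections import Counter

# Assumption: Common/Combined Log Format, e.g.
#   10.0.0.1 - - [04/Jan/2021:08:15:02 +0000] "GET / HTTP/1.1" 200 5120
# Captures the day part of the timestamp and the 3-digit status code.
LINE_RE = re.compile(
    r'\[(?P<day>[^:\]]+):[^\]]*\]\s+"[^"]*"\s+(?P<status>\d{3})\b'
)


def tally_status_classes(lines):
    """Count requests per (day, status class), e.g. ('04/Jan/2021', '4xx')."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # skip lines that do not look like access-log entries
        status_class = match.group("status")[0] + "xx"
        counts[(match.group("day"), status_class)] += 1
    return counts


# Hypothetical sample lines, for illustration only (not real log data).
SAMPLE_LINES = [
    '10.0.0.1 - - [03/Jan/2021:12:00:00 +0000] "GET / HTTP/1.1" 200 5120',
    '10.0.0.2 - - [04/Jan/2021:07:10:00 +0000] "GET /missing HTTP/1.1" 404 312',
    '10.0.0.3 - - [04/Jan/2021:07:45:00 +0000] "GET /other HTTP/1.1" 403 298',
    "this line is not an access-log entry",
]

for (day, status_class), n in sorted(tally_status_classes(SAMPLE_LINES).items()):
    print(day, status_class, n)
```

To run this against the real file, pass an open file handle instead of `SAMPLE_LINES`, for example `tally_status_classes(open("access.log"))`. The per-day 4xx and 5xx totals can then be set next to what the Grafana chart reports for the same days.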

I've restarted the service three times today; each time it worked for maybe 2 minutes, then froze up like this again. There's almost no traffic in the logs from anybody else (essentially the tool is broken right now).

Any ideas what's going on? This looks like some sort of upstream issue with nginx maybe?

I am seeing a "You have run out of local ports" error in the error logs from earlier today (but it hasn't repeated recently) which is maybe a clue? I don't think that could possibly be from anything my service is doing though!

I queried the cloud services mailing list on this and Brooke suggested T271063 might be related? The timing is at least coincidental.

Steps to Reproduce:
Go to https://author-disambiguator.toolforge.org/

Actual Results:
no response, then 502 or 503 server error

Expected Results:
The home page of the tool!

Event Timeline

Hello @ArthurPSmith, the tool is working for me.


I've also tried logging in, and it works without errors.


Can you please check whether this is still happening for you?

Just FYI: I made a request to the main page, and the first time it took 11 min to load; subsequent requests took on the order of seconds, though.
Maybe one of the examples it tries to load randomly on the first screen makes it misbehave?

@Kizule and @dcaro - thanks for checking. Yes, the problem appears to have gone away now (8:52 AM EST). But the 11 min load @dcaro saw sounds like a real problem - the examples come from a simple Wikidata Query Service query and it does nothing with them other than list them as links, so there's no way they cause trouble. I'm still wondering why there seem to be a large number of 4xx errors shown in grafana for the last day when I see only 3 in the access log on toolforge.

ArthurPSmith claimed this task.

Things still seem to be working properly and at least the 5xx errors have disappeared, so I think we can say this is resolved now. It would be nice to know what caused it though!