
wdreconcile.toolforge.org acting as though HTTP 502 Gateway errors are cached
Open, Needs Triage, Public

Description

I am trying to understand why my toolforge tool (https://wdreconcile.toolforge.org) returns HTTP 502 errors for some URLs.

The tool is written in Python 3.7 and deployed on Kubernetes, served via WSGI.
In my experience, HTTP 502 errors signal a lack of connectivity between the HTTP server and the WSGI process. This normally happens when the service is overloaded or simply times out. But this time there are specific URLs which reliably return 502 errors for an extended period of time, while any slight variation on them succeeds.

For instance, the following URL returns an HTTP 502 error at the moment:
https://wdreconcile.toolforge.org/en/api?queries=%7B%22q0%22%3A%7B%22query%22%3A%22Ujjwal%22%2C%22type%22%3A%22Q202444%22%2C%22type_strict%22%3A%22should%22%7D%2C%22q1%22%3A%7B%22query%22%3A%22Ant%C3%B3nio%22%2C%22type%22%3A%22Q202444%22%2C%22type_strict%22%3A%22should%22%7D%2C%22q2%22%3A%7B%22query%22%3A%22Milan%22%2C%22type%22%3A%22Q202444%22%2C%22type_strict%22%3A%22should%22%7D%2C%22q3%22%3A%7B%22query%22%3A%22Sevag%22%2C%22type%22%3A%22Q202444%22%2C%22type_strict%22%3A%22should%22%7D%2C%22q4%22%3A%7B%22query%22%3A%22Magdalena%22%2C%22type%22%3A%22Q202444%22%2C%22type_strict%22%3A%22should%22%7D%2C%22q5%22%3A%7B%22query%22%3A%22John%22%2C%22type%22%3A%22Q202444%22%2C%22type_strict%22%3A%22should%22%7D%2C%22q6%22%3A%7B%22query%22%3A%22Shelby%22%2C%22type%22%3A%22Q202444%22%2C%22type_strict%22%3A%22should%22%7D%2C%22q7%22%3A%7B%22query%22%3A%22Nicolas%22%2C%22type%22%3A%22Q202444%22%2C%22type_strict%22%3A%22should%22%7D%2C%22q8%22%3A%7B%22query%22%3A%22Earl%22%2C%22type%22%3A%22Q202444%22%2C%22type_strict%22%3A%22should%22%7D%2C%22q9%22%3A%7B%22query%22%3A%22Dan%22%2C%22type%22%3A%22Q202444%22%2C%22type_strict%22%3A%22should%22%7D%7D
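
For readers who prefer not to decode the URL by hand, here is a small helper (my own sketch, not part of the original report) that extracts the queries parameter and prints the underlying JSON batch - ten reconciliation queries against the same type (Q202444):

# Decode the percent-encoded `queries` parameter of a reconciliation API URL.
import json
import urllib.parse

def decode_queries(url):
    query_string = urllib.parse.urlsplit(url).query
    params = urllib.parse.parse_qs(query_string)
    return json.loads(params["queries"][0])

# failing_url = "<the URL quoted above>"
# print(json.dumps(decode_queries(failing_url), indent=2, ensure_ascii=False))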

I am puzzled by a few things:

  • the URL reliably returns HTTP 502 errors, and does so instantly (no timeout)
  • if I change the query slightly (for instance by changing "Magdalena" to "Magdalen", or changing any character in the query) then the problem disappears (see the comparison sketch below)
  • the query runs fine on a local instance of the service
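
To make the second observation concrete, here is a hedged reproduction sketch (mine; the names, the type Q202444 and the endpoint are taken from the URL above, everything else is assumption). It sends the exact batch that was failing and a variant with a single character changed, and prints both status codes:

# Compare the failing batch with a one-character variant
# ("Magdalena" -> "Magdalen"). At the time of the report the first
# reliably returned 502 and the second succeeded.
import json
import urllib.error
import urllib.parse
import urllib.request

NAMES = ["Ujjwal", "António", "Milan", "Sevag", "Magdalena",
         "John", "Shelby", "Nicolas", "Earl", "Dan"]

def batch(names):
    return {
        "q%d" % i: {"query": n, "type": "Q202444", "type_strict": "should"}
        for i, n in enumerate(names)
    }

def status_for(queries):
    payload = json.dumps(queries, separators=(",", ":"), ensure_ascii=False)
    url = ("https://wdreconcile.toolforge.org/en/api?queries="
           + urllib.parse.quote(payload, safe=""))
    try:
        with urllib.request.urlopen(url, timeout=60) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code

variant = NAMES.copy()
variant[4] = "Magdalen"
print(status_for(batch(NAMES)), status_for(batch(variant)))  # e.g. 502 200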

Therefore I am starting to wonder whether HTTP 502 errors could be cached by some layer between the WSGI process and the HTTP server.
I also observe 502 errors on POST requests, but it is not clear to me whether they could also be affected by caching issues (I hope not, since caching POST responses would obviously be a problem of its own).

For reference, the text returned with the 502 error is:

502 Bad Gateway
openresty/1.15.8.1

Event Timeline

Pintoch updated the task description.

It is likely that this is a new behaviour (2 user reports about this over the past 12 hours).

This comment was removed by Pintoch.

uWSGI logs are attached (they do not contain personal information).

Mentioned in SAL (#wikimedia-cloud) [2020-07-08T16:36:40Z] <wm-bot> <root> Hard restart to reset Ingress objects (T257405)

There are two layers of nginx reverse proxy between your uwsgi service and the internet, but neither layer (the dynamic proxy and the Kubernetes Ingress) has response caching enabled. That being said, I see the same behavior you mention: one particular query returns a 502 status very quickly, but adding a "cache busting" component to the query string (like &ts=1) triggers actual execution.
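
For illustration, the "cache busting" check can be reproduced with something like the following sketch (assuming the extra parameter is simply ignored by the API):

# Request the same URL twice: once as-is, once with a throwaway
# "cache busting" parameter appended, and compare the status codes.
import time
import urllib.error
import urllib.request

def status(url):
    try:
        with urllib.request.urlopen(url, timeout=60) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code

failing_url = "https://wdreconcile.toolforge.org/en/api?queries=..."  # paste the full URL from the description here
print(status(failing_url))                           # 502 in the observed case
print(status(failing_url + "&ts=%d" % time.time()))  # executes normally, per the observation above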

Now that I have used the cache busting URL, the query that was failing is working. A mystery?

bd808 renamed this task from Are HTTP 502 Gateway errors cached? to wdreconcile.toolforge.org actin as though HTTP 502 Gateway errors are cached. Jul 8 2020, 4:45 PM
bd808 renamed this task from wdreconcile.toolforge.org actin as though HTTP 502 Gateway errors are cached to wdreconcile.toolforge.org acting as though HTTP 502 Gateway errors are cached.

Thanks for investigating this @bd808. This problem is still occurring and I have no idea what I can do about it. It seems that this problem appeared when the last steps of the toolforge redirection were put in place - is there any config change on your side that could potentially be linked to that?

Not that I have been able to think of or track down, no. The final cutover added a third nginx server to the web flow, but only for the legacy tools.wmflabs.org hostname. The current web request flow is something like:
Internet → tools.wmflabs.org redirector nginx → toolforge.org ingress nginx → Kubernetes ingress nginx → Kubernetes Service locator → Kubernetes Pod → uwsgi container → python code

I am not aware of any response caching mechanism from the point where traffic enters the Toolforge space down to the uwsgi container.
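
One way to probe for a caching layer from the outside (a suggestion, not something verified here) is to repeat the failing request and compare response headers: a proxy serving a cached response would typically repeat the same Date header, while freshly generated 502s should carry a new one each time.

# Fetch the failing URL repeatedly and compare the Date and Server headers;
# identical Date values across requests would hint at a cached response
# somewhere in the proxy chain.
import urllib.error
import urllib.request

def headers_for(url):
    try:
        with urllib.request.urlopen(url, timeout=60) as resp:
            return resp.status, resp.headers
    except urllib.error.HTTPError as err:
        return err.code, err.headers

# failing_url = "<the URL from the task description>"
# for _ in range(2):
#     code, hdrs = headers_for(failing_url)
#     print(code, hdrs.get("Server"), hdrs.get("Date"))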

Ok, thanks!

I am unable to mitigate this, so I am migrating the project out of the toolforge infrastructure now. The new service is at https://wikidata.reconci.link/.

I will keep the toolforge project running for a while, marking it as deprecated. I am not planning to redirect users to the external service directly, as I suspect this would violate the toolforge policy.

Lydia_Pintscher subscribed.

This is a rather important service for Wikidata so it'd be <3 if we could figure this out.

Thanks a lot @Lydia_Pintscher !

I have also used this opportunity to optimize the service and deploy it with ASGI, which Toolforge does not support as far as I know. Perhaps this is an indication that this should rather be a Cloud VPS project, where we would have more control over the deployment. But it would still live behind an HTTPS proxy. At the moment my priority is to make sure there is one reliable instance out there, where I can fix any issue directly.

I realized today that I have uptime statistics for this service since I have been monitoring it for a few years (with downnotifier.com).

Year | Outages    | Uptime
2018 | 67         | 99.79%
2019 | 324        | 99.09%
2020 | 241 so far | 98.02% so far

These are obtained by polling the service every 10 minutes with the same query (and retrying every minute in case of failure).
The large majority of the outages detected by this polling strategy are 502 errors which last for a few minutes.
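
For reference, the polling strategy corresponds roughly to the following sketch (the real monitoring is done with downnotifier.com, not with this script; the probe URL is a placeholder):

# Poll the service every 10 minutes, retrying every minute while it is down,
# and log one UP/DOWN line per check.
import time
import urllib.error
import urllib.request

CHECK_URL = "https://wdreconcile.toolforge.org/en/api"  # placeholder probe URL
NORMAL_INTERVAL = 600  # 10 minutes
RETRY_INTERVAL = 60    # 1 minute while the service is failing

def is_up(url):
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

while True:
    up = is_up(CHECK_URL)
    print(time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()), "UP" if up else "DOWN")
    time.sleep(NORMAL_INTERVAL if up else RETRY_INTERVAL)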

In comparison, a simple deployment of the service on https://wikidata.reconci.link/ (with just Apache as a reverse proxy to the tool) achieves 99.99+% availability so far (the only outage registered so far was when the service was initially set up). So perhaps the complexity of the deployment in Toolforge comes at a reliability cost? I would not be able to pin down where, sadly…

Yesterday the https://wdreconcile.toolforge.org/ service returned 504 errors for all URLs (after a waiting time - they did not appear "cached" as in this ticket). Restarting the service seems to temporarily mitigate the issue, so it's probably a different sort of outage. I do not commit to restarting the service when this happens - I no longer receive notifications when the service is down. If people want to volunteer to do it, I can give them access to the tool. To restart the service, a single command is needed: webservice --backend kubernetes python3.7 restart. It is very simple!

Does it use tools-redis by chance? That had about 10 min of downtime during migrations, in case that helps anyone who decides to troubleshoot.

It does use tools-redis indeed!
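
If someone wants to check that dependency, a minimal connectivity test from inside the tool's pod could look like the sketch below (it assumes the redis-py client and the conventional tools-redis hostname):

# Ping tools-redis and report whether it is reachable.
import redis

client = redis.Redis(host="tools-redis", port=6379, socket_timeout=5)
try:
    print("redis reachable:", client.ping())
except redis.exceptions.RedisError as err:
    print("redis unreachable:", err)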

To restart the service, a single command is needed: webservice --backend kubernetes python3.7 restart. It is very simple!

Here's a way to make it even a bit easier: create a $HOME/service.template file with your default settings in it and then just type webservice [start|stop|restart] as needed.

$HOME/service.template
backend: kubernetes
type: python3.7

I'm doing some cleanup of OpenRefine tasks in preparation for new development related to StructuredDataOnCommons (see this page for more context) and I wonder if this issue is still relevant, or if this task can be closed?

https://wdreconcile.toolforge.org is deprecated and https://wikidata.reconci.link has been its replacement for a while already.

@Spinster I'd rather keep this open, since this is a problem that is likely to be relevant to other tools and still has not been solved as far as I can tell. It seems similar to T282732, which is apparently still current.