
wdreconcile.toolforge.org acting as though HTTP 502 Gateway errors are cached
Open, Needs Triage, Public

Description

I am trying to understand why my toolforge tool (https://wdreconcile.toolforge.org) returns HTTP 502 errors for some URLs.

The tool is written in Python 3.7 and deployed on Kubernetes, served via WSGI.
In my experience, HTTP 502 errors signal a lack of connectivity between the HTTP server and the WSGI process. This normally happens when the service is overloaded or simply times out. But this time there are specific URLs which reliably return 502 errors for an extended period of time, while any slight variation on them succeeds.

For instance, the following URL returns an HTTP 502 error at the moment:
https://wdreconcile.toolforge.org/en/api?queries=%7B%22q0%22%3A%7B%22query%22%3A%22Ujjwal%22%2C%22type%22%3A%22Q202444%22%2C%22type_strict%22%3A%22should%22%7D%2C%22q1%22%3A%7B%22query%22%3A%22Ant%C3%B3nio%22%2C%22type%22%3A%22Q202444%22%2C%22type_strict%22%3A%22should%22%7D%2C%22q2%22%3A%7B%22query%22%3A%22Milan%22%2C%22type%22%3A%22Q202444%22%2C%22type_strict%22%3A%22should%22%7D%2C%22q3%22%3A%7B%22query%22%3A%22Sevag%22%2C%22type%22%3A%22Q202444%22%2C%22type_strict%22%3A%22should%22%7D%2C%22q4%22%3A%7B%22query%22%3A%22Magdalena%22%2C%22type%22%3A%22Q202444%22%2C%22type_strict%22%3A%22should%22%7D%2C%22q5%22%3A%7B%22query%22%3A%22John%22%2C%22type%22%3A%22Q202444%22%2C%22type_strict%22%3A%22should%22%7D%2C%22q6%22%3A%7B%22query%22%3A%22Shelby%22%2C%22type%22%3A%22Q202444%22%2C%22type_strict%22%3A%22should%22%7D%2C%22q7%22%3A%7B%22query%22%3A%22Nicolas%22%2C%22type%22%3A%22Q202444%22%2C%22type_strict%22%3A%22should%22%7D%2C%22q8%22%3A%7B%22query%22%3A%22Earl%22%2C%22type%22%3A%22Q202444%22%2C%22type_strict%22%3A%22should%22%7D%2C%22q9%22%3A%7B%22query%22%3A%22Dan%22%2C%22type%22%3A%22Q202444%22%2C%22type_strict%22%3A%22should%22%7D%7D
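
For readers who prefer not to decode the URL by hand, here is a small helper (my own sketch, not part of the original report) that extracts the queries parameter and prints the underlying JSON batch - ten reconciliation queries against the same type (Q202444):

# Decode the percent-encoded `queries` parameter of a reconciliation API URL.
import json
import urllib.parse

def decode_queries(url):
    query_string = urllib.parse.urlsplit(url).query
    params = urllib.parse.parse_qs(query_string)
    return json.loads(params["queries"][0])

# failing_url = "<the URL quoted above>"
# print(json.dumps(decode_queries(failing_url), indent=2, ensure_ascii=False))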

I am puzzled by a few things:

  • the URL reliably returns HTTP 502 errors, and does so instantly (no timeout)
  • if I change the query slightly (for instance by changing "Magdalena" to "Magdalen", or changing any character in the query) then the problem disappears (see the comparison sketch below)
  • the query runs fine on a local instance of the service
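
To make the second observation concrete, here is a hedged reproduction sketch (mine; the names, the type Q202444 and the endpoint are taken from the URL above, everything else is assumption). It sends the exact batch that was failing and a variant with a single character changed, and prints both status codes:

# Compare the failing batch with a one-character variant
# ("Magdalena" -> "Magdalen"). At the time of the report the first
# reliably returned 502 and the second succeeded.
import json
import urllib.error
import urllib.parse
import urllib.request

NAMES = ["Ujjwal", "António", "Milan", "Sevag", "Magdalena",
         "John", "Shelby", "Nicolas", "Earl", "Dan"]

def batch(names):
    return {
        "q%d" % i: {"query": n, "type": "Q202444", "type_strict": "should"}
        for i, n in enumerate(names)
    }

def status_for(queries):
    payload = json.dumps(queries, separators=(",", ":"), ensure_ascii=False)
    url = ("https://wdreconcile.toolforge.org/en/api?queries="
           + urllib.parse.quote(payload, safe=""))
    try:
        with urllib.request.urlopen(url, timeout=60) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code

variant = NAMES.copy()
variant[4] = "Magdalen"
print(status_for(batch(NAMES)), status_for(batch(variant)))  # e.g. 502 200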

Therefore I am starting to wonder whether HTTP 502 errors could be cached by some layer between the WSGI process and the HTTP server.
I also observe 502 errors on POST requests, but it is not clear to me whether they could also be affected by caching issues (I hope not, since caching POST responses would obviously be a problem of its own).

For reference, the text returned with the 502 error is:

502 Bad Gateway
openresty/1.15.8.1

Event Timeline

Pintoch updated the task description.

It is likely that this is a new behaviour (2 user reports about this over the past 12 hours).

This comment was removed by Pintoch.

uWSGI logs are attached (they do not contain personal information).

Mentioned in SAL (#wikimedia-cloud) [2020-07-08T16:36:40Z] <wm-bot> <root> Hard restart to reset Ingress objects (T257405)

There are two layers of nginx reverse proxy between your uwsgi service and the internet, but neither layer (the dynamic proxy and the Kubernetes Ingress) has response caching enabled. That being said, I see the same behavior you mention: one particular query returns a 502 status very quickly, but adding a "cache busting" component to the query string (like &ts=1) triggers actual execution.
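
For illustration, the "cache busting" check can be reproduced with something like the following sketch (assuming the extra parameter is simply ignored by the API):

# Request the same URL twice: once as-is, once with a throwaway
# "cache busting" parameter appended, and compare the status codes.
import time
import urllib.error
import urllib.request

def status(url):
    try:
        with urllib.request.urlopen(url, timeout=60) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code

failing_url = "https://wdreconcile.toolforge.org/en/api?queries=..."  # paste the full URL from the description here
print(status(failing_url))                           # 502 in the observed case
print(status(failing_url + "&ts=%d" % time.time()))  # executes normally, per the observation above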

Now that I have used the cache busting URL, the query that was failing is working. A mystery?

bd808 renamed this task from Are HTTP 502 Gateway errors cached? to wdreconcile.toolforge.org actin as though HTTP 502 Gateway errors are cached. Jul 8 2020, 4:45 PM
bd808 renamed this task from wdreconcile.toolforge.org actin as though HTTP 502 Gateway errors are cached to wdreconcile.toolforge.org acting as though HTTP 502 Gateway errors are cached.

Thanks for investigating this @bd808. This problem is still occurring and I have no idea what I can do about it. It seems that this problem appeared when the last steps of the toolforge redirection were put in place - is there any config change on your side that could potentially be linked to that?

Not that I have been able to think of or track down, no. The final cutover added a third nginx server to the web flow, but only for the legacy tools.wmflabs.org hostname. The current web request flow is something like:
Internet → tools.wmflabs.org redirector nginx → toolforge.org ingress nginx → Kubernetes ingress nginx → Kubernetes Service locator → Kubernetes Pod → uwsgi container → python code

I am not aware of any response caching mechanism from the point where traffic enters the Toolforge space down to the uwsgi container.
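
One way to probe for a caching layer from the outside (a suggestion, not something verified here) is to repeat the failing request and compare response headers: a proxy serving a cached response would typically repeat the same Date header, while freshly generated 502s should carry a new one each time.

# Fetch the failing URL repeatedly and compare the Date and Server headers;
# identical Date values across requests would hint at a cached response
# somewhere in the proxy chain.
import urllib.error
import urllib.request

def headers_for(url):
    try:
        with urllib.request.urlopen(url, timeout=60) as resp:
            return resp.status, resp.headers
    except urllib.error.HTTPError as err:
        return err.code, err.headers

# failing_url = "<the URL from the task description>"
# for _ in range(2):
#     code, hdrs = headers_for(failing_url)
#     print(code, hdrs.get("Server"), hdrs.get("Date"))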

Ok, thanks!

I am unable to mitigate this, so I am migrating the project out of the toolforge infrastructure now. The new service is at https://wikidata.reconci.link/.

I will keep the toolforge project running for a while, marking it as deprecated. I am not planning to redirect users to the external service directly, as I suspect this would violate the toolforge policy.

Lydia_Pintscher subscribed.

This is a rather important service for Wikidata so it'd be <3 if we could figure this out.

Thanks a lot @Lydia_Pintscher !

I have also used this opportunity to optimize the service and deploy it with ASGI, which Toolforge does not support as far as I know. Perhaps this is an indication that this should rather be a Cloud VPS project, where we would have more control over the deployment. But it would still live behind an HTTPS proxy. At the moment my priority is to make sure there is one reliable instance out there, where I can fix any issue directly.

I realized today that I have uptime statistics for this service since I have been monitoring it for a few years (with downnotifier.com).

Year | Outages    | Uptime
2018 | 67         | 99.79%
2019 | 324        | 99.09%
2020 | 241 so far | 98.02% so far

These are obtained by polling the service every 10 minutes with the same query (and retrying every minute in case of failure).
The large majority of the outages detected by this polling strategy are 502 errors which last for a few minutes.
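
For reference, the polling strategy corresponds roughly to the following sketch (the real monitoring is done with downnotifier.com, not with this script; the probe URL is a placeholder):

# Poll the service every 10 minutes, retrying every minute while it is down,
# and log one UP/DOWN line per check.
import time
import urllib.error
import urllib.request

CHECK_URL = "https://wdreconcile.toolforge.org/en/api"  # placeholder probe URL
NORMAL_INTERVAL = 600  # 10 minutes
RETRY_INTERVAL = 60    # 1 minute while the service is failing

def is_up(url):
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

while True:
    up = is_up(CHECK_URL)
    print(time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()), "UP" if up else "DOWN")
    time.sleep(NORMAL_INTERVAL if up else RETRY_INTERVAL)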

In comparison, a simple deployment of the service on https://wikidata.reconci.link/ (with just Apache as a reverse proxy to the tool) achieves 99.99+% availability so far (the only outage registered so far was when the service was initially set up). So perhaps the complexity of the deployment in Toolforge comes at a reliability cost? I would not be able to pin down where, sadly…

Yesterday the https://wdreconcile.toolforge.org/ service returned 504 errors for all URLs (after a waiting time - they did not appear "cached" as in this ticket). Restarting the service seems to temporarily mitigate the issue, so it's probably a different sort of outage. I do not commit to restarting the service when this happens - I no longer receive notifications when the service is down. If people want to volunteer to do it, I can give them access to the tool. To restart the service, a single command is needed: webservice --backend kubernetes python3.7 restart. It is very simple!

Does it use tools-redis by chance? That had about 10 min of downtime during migrations, in case that helps anyone who decides to troubleshoot.

It does use tools-redis indeed!
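
If someone wants to check that dependency, a minimal connectivity test from inside the tool's pod could look like the sketch below (it assumes the redis-py client and the conventional tools-redis hostname):

# Ping tools-redis and report whether it is reachable.
import redis

client = redis.Redis(host="tools-redis", port=6379, socket_timeout=5)
try:
    print("redis reachable:", client.ping())
except redis.exceptions.RedisError as err:
    print("redis unreachable:", err)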

To restart the service, a single command is needed: webservice --backend kubernetes python3.7 restart. It is very simple!

Here's a way to make it even a bit easier: create a $HOME/service.template file with your default settings in it and then just type webservice [start|stop|restart] as needed.

$HOME/service.template
backend: kubernetes
type: python3.7

I'm doing some cleanup of OpenRefine tasks in preparation for new development related to StructuredDataOnCommons (see this page for more context) and I wonder if this issue is still relevant, or if this task can be closed?

https://wdreconcile.toolforge.org is deprecated and https://wikidata.reconci.link has been its replacement for a while already.

@Spinster I'd rather keep this open, since this is a problem that is likely to be relevant to other tools and still has not been solved as far as I can tell. It seems similar to T282732, which is apparently still current.