
Some of my tools (eg wikidata-todo) just start throwing 504 errors
Closed, Resolved | Public | BUG REPORT

Description

Some of my tools, mostly "wikidata-todo" (see T341190), randomly start throwing 504 errors. This last one was less than a day after I restarted the webserver manually.

This tool is just a collection of PHP/HTML/JS, nothing crazy. The webservice should not start returning 504s, or at the very least it should restart itself if that happens.

I will try and move some of the sub-tools to "full" tools, but I will have to leave redirects in place, which will be subject to the 504 errors again.

Please let me know if I can do anything on my side, but this looks like a server issue to me.

Event Timeline

While @Magnus is listed as an author for geohack, he won't be able to do much about osm4wiki.

At the moment, the https://petscan.wmflabs.org homepage responds. It might be that the query is too complex or that it depends on the availability of another server.

https://geohack.toolforge.org doesn't respond at all at the moment.

@fnegri this seems to have knocked out a significant number of toolforge tools, if not all of them. I see a generic NGINX 504 error. You couldn't shed any light on this, could you?

@Brycehughes we had some DNS issues over the past couple of days (T346177), but those have now been fixed. Some tools might need a restart. If you see tools misbehaving even after a restart, do let me know and I'll have a look.

@aborrero @cmooney I'm wondering if T346177 was resolved prematurely, since most if not all of the Toolforge tools are throwing 504s now. Any chance you could look into this?

@Brycehughes that issue was resolved, but other changes have been made since. They should not have caused any issues, but I can't guarantee the problems you're seeing are unrelated.

Some of the links reported above (e.g. https://geohack.toolforge.org) appear to be working, but as I'm not a regular user I can't fully tell (other than that the site loads and looks functional).

The https://commonsapi.toolforge.org site doesn't load right now. A curl shows that TLS negotiation completes ok, but then the site just stops responding. So to me it looks like the request gets so far, and then probably something on the back-end breaks / times out, and the user trying to connect is just left waiting. But because the error isn't reported back I can't really tell what's going on behind the scenes.
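The distinction drawn above (the proxy answering with a 504 versus the connection simply hanging after TLS) can be checked from any client. What follows is a minimal sketch, not something used in this thread: the function names, the labels and the 10-second timeout are assumptions chosen for illustration.

```python
import socket
import urllib.error
import urllib.request


def classify(status, timed_out):
    """Map a probe outcome to a coarse label (labels are illustrative)."""
    if timed_out:
        return "hang"           # no HTTP response arrived before the deadline
    if status in (502, 503, 504):
        return "gateway-error"  # the front proxy answered, the backend did not
    if status is not None and 200 <= status < 400:
        return "ok"
    return "other"


def probe(url, timeout=10.0):
    """Fetch url once and return a label from classify()."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify(resp.status, timed_out=False)
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx responses, including 504
        return classify(err.code, timed_out=False)
    except (socket.timeout, TimeoutError):
        return classify(None, timed_out=True)
    except urllib.error.URLError as err:
        timed_out = isinstance(err.reason, (socket.timeout, TimeoutError))
        return classify(None, timed_out=timed_out)
    except OSError:
        return classify(None, timed_out=False)
```

Calling probe("https://commonsapi.toolforge.org") would return "hang" for the behaviour described above and "gateway-error" for the generic NGINX 504 page.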

Looking at commonsapi.toolforge.org, it's using internal IP 172.16.0.17, and has an external IP of 185.15.56.11. Checking for DNS traffic from it I can see it seems to be resolving names ok:

root@cloudnet1005:~# tcpdump -i qr-defc9d1d-40 -l -p -nn host 172.16.0.17 and port 53 
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on qr-defc9d1d-40, link-type EN10MB (Ethernet), snapshot length 262144 bytes
23:05:40.835546 IP 172.16.0.17.39310 > 208.80.154.143.53: 37405+ A? syslogaudit2.svc.eqiad1.wikimedia.cloud. (57)
23:05:40.835856 IP 208.80.154.143.53 > 172.16.0.17.39310: 37405 1/0/0 A 172.16.5.118 (73)
root@cloudnet1005:~# tcpdump -vvvv -i qr-defc9d1d-40 -l -p -nn host 172.16.0.17 and port 53 
tcpdump: listening on qr-defc9d1d-40, link-type EN10MB (Ethernet), snapshot length 262144 bytes
23:11:52.499669 IP (tos 0x0, ttl 64, id 44550, offset 0, flags [DF], proto UDP (17), length 80)
    172.16.0.17.45914 > 172.20.255.1.53: [udp sum ok] 27755+ AAAA? tools-proxy-06.tools.eqiad.wmflabs. (52)
23:11:52.504245 IP (tos 0x0, ttl 61, id 41898, offset 0, flags [none], proto UDP (17), length 168)
    172.20.255.1.53 > 172.16.0.17.45914: [udp sum ok] 27755 q: AAAA? tools-proxy-06.tools.eqiad.wmflabs. 0/1/0 ns: eqiad.wmflabs. [30s] SOA ns0.openstack.eqiad1.wikimediacloud.org. root.wmflabs.org. 1694540600 3600 600 86400 3600 (140)

Sorry, that's probably all too low-level to give you much insight. Without knowing the upper layers of the cloud stack or having access to those systems, it's hard for me to troubleshoot what's going on above that. The bits I could check are working ok. At this point we may need to wait until more folks are online in the EU morning to shed light on the situation.
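For readers less used to tcpdump output: in the DNS response lines above, the three slash-separated numbers are the answer, authority and additional record counts. The A lookup came back as 1/0/0 (one answer), while the AAAA lookup came back as 0/1/0 (no answer, only an SOA record in the authority section), which looks like the name simply having no AAAA record rather than a resolver failure. A small sketch for pulling those counts out of a captured line follows; the helper name is hypothetical and the regular expression only covers the line shapes shown in this capture.

```python
import re

# Matches the "answers/authority/additional" counts tcpdump prints for DNS replies.
COUNTS = re.compile(r"\b(\d+)/(\d+)/(\d+)\b")


def dns_counts(line):
    """Return (answers, authority, additional) from a tcpdump DNS reply line, or None."""
    match = COUNTS.search(line)
    if match is None:
        return None
    return tuple(int(n) for n in match.groups())


a_reply = "IP 208.80.154.143.53 > 172.16.0.17.39310: 37405 1/0/0 A 172.16.5.118 (73)"
aaaa_reply = ("172.20.255.1.53 > 172.16.0.17.45914: [udp sum ok] 27755 "
              "q: AAAA? tools-proxy-06.tools.eqiad.wmflabs. 0/1/0 ns: eqiad.wmflabs.")
print(dns_counts(a_reply), dns_counts(aaaa_reply))
```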

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-15T08:36:52Z] <wm-bot2> dcaro@urcuchillay START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-76 (T346126)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-15T08:38:04Z] <wm-bot2> dcaro@urcuchillay END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-76 (T346126)

It was due to a non-responsive Kubernetes worker node. Rebooting it to force the pod to get rescheduled seems to have brought the service back online. I'm still looking into what caused the node to get stuck, but it seems unrelated to the DNS issues.

@Brycehughes can you verify that everything works as you expect?

@dcaro it works in my case. What a coincidence re the DNS stuff! Thanks all

aborrero claimed this task.

Thanks @dcaro for fixing the cluster!

Mentioned in SAL (#wikimedia-cloud) [2023-09-18T10:41:08Z] <dhinus> restarted stuck pod (webservice stop+start) T346126

@M2k_dewiki the Kubernetes pod was stuck. I restarted it manually with webservice stop followed by webservice start, and https://templatetransclusioncheck.toolforge.org/ is now working again.

Mentioned in SAL (#wikimedia-cloud) [2023-09-18T10:43:55Z] <dhinus> restarted stuck pod (webservice stop+start) T346126

fnegri added a subscriber: Chameleon222.

Hello,

https://templatetransclusioncheck.toolforge.org/

https://templatetransclusioncheck.toolforge.org/?lang=de&name=Vorlage:Navigationsleiste_Kader_der_KAA_Gent

returns a "504 Gateway Time-out" error again (as before on 17th of September 2023 above)

I have restarted the tool (become templatetransclusioncheck; webservice restart) and it's working again.

@Chameleon222 you are listed as a maintainer of this tool, can you please check if there is any underlying issue that is causing the tool to get stuck after a few days?
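Until the underlying cause is found, the manual restarts described above could in principle be automated by the maintainer. The sketch below is only an illustration under stated assumptions: it assumes it runs as the tool account (after become), that a separate health check supplies a label where "ok" means healthy, and that three consecutive failures is a sensible threshold. The function name and parameters are hypothetical; the only real command used is webservice restart, as quoted in this thread.

```python
import subprocess


def watchdog_step(label, failures, threshold=3, restart=None):
    """Track consecutive failed checks and restart once the threshold is reached.

    label:    result of one health check; "ok" means healthy (assumed convention)
    failures: consecutive failures seen before this check
    Returns the updated consecutive-failure count.
    """
    if restart is None:
        # Assumption: executed as the tool account, where `webservice` is available.
        def restart():
            subprocess.run(["webservice", "restart"], check=True)
    if label == "ok":
        return 0
    failures += 1
    if failures >= threshold:
        restart()
        return 0  # start counting afresh after a restart
    return failures
```

Requiring several consecutive failures avoids restarting on a single slow request. A restart only masks the symptom, so the question to the maintainer above still stands.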