Some of my tools (eg wikidata-todo) just start throwing 504 errors
Status: Closed, Resolved (Public, BUG REPORT)

Authored By: Magnus, Sep 12 2023, 7:55 AM

Description

Some of my tools, mostly "wikidata-todo" (see T341190), randomly start throwing 504 errors. This last one was less than a day after I restarted the webserver manually.

This tool is just a collection of PHP/HTML/JS, nothing crazy. The webservice should not start returning 504s, or at the very least it should restart itself if that happens.

I will try and move some of the sub-tools to "full" tools, but I will have to leave redirects in place, which will be subject to the 504 errors again.

Please let me know if I can do anything on my side, but this looks like a server issue to me.

Related Objects

Status	Subtype	Assigned	Task
Resolved	BUG REPORT	aborrero	T346126 Some of my tools (eg wikidata-todo) just start throwing 504 errors
Resolved	BUG REPORT	Magnus	T341190 duplicity returns "504 Gateway Time-out"
Resolved		Cyberpower678	T346158 Bot down

Event Timeline

Magnus created this task. Sep 12 2023, 7:55 AM

Restricted Application added a subscriber: Aklapper. Sep 12 2023, 7:56 AM

Magnus mentioned this in T341190: duplicity returns "504 Gateway Time-out". Sep 12 2023, 9:26 AM

M2k_dewiki subscribed. Sep 12 2023, 10:02 AM

thiemowmde added a subtask: T341190: duplicity returns "504 Gateway Time-out". Sep 12 2023, 2:39 PM

Also other tools seem to be affected, for example PetScan / QuickStatement:

https://petscan.wmflabs.org/?after=&project=wikipedia&since_rev0=&interface_language=en&ns%5B0%5D=1&categories=Living%20people%7C30&search_max_results=500&cb_labels_yes_l=1&sortby=date&edits%5Bbots%5D=both&cb_labels_any_l=1&cb_labels_no_l=1&wikidata_item=without&combination=union&edits%5Bflagged%5D=both&language=en&edits%5Banons%5D=both&show_redirects=no&negcats=Wikipedia:Borrar%7C30&doit=&al_commands=P31%3AQ5

or GeoHack:

https://geohack.toolforge.org/geohack.php?pagename=Mausoleum_im_Schlosspark_Gadow&language=de&params=53.076249_N_11.61431_E_region:DE-BB_type:landmark

or osm4wiki:

https://osm4wiki.toolforge.org/cgi-bin/wiki/wiki-osm.pl?project=de&article=Kategorie:Denkmal_in_Wien

While @Magnus is listed as an author for geohack, he won't be able to do much about osm4wiki.

At the moment, the https://petscan.wmflabs.org homepage responds. It might be that the query is too complex or depends on the availability of another server.

https://geohack.toolforge.org doesn't respond at all at the moment.

iabot is affected too.

Cyberpower678 added a subtask: T346158: Bot down. Sep 12 2023, 3:26 PM

Cyberpower678 closed subtask T346158: Bot down as Resolved. Sep 12 2023, 3:39 PM

Ditto for this commons api fork by @Sebkur

@fnegri this seems to have knocked out a significant number of toolforge tools, if not all of them. I see a generic NGINX 504 error. You couldn't shed any light on this, could you?

@Brycehughes we had some DNS issues over the past couple of days (T346177), but those have now been fixed. Some tools might need a restart. If you see tools misbehaving even after a restart, do let me know and I'll have a look.

@fnegri Thanks. In my personal case, this tool https://toolsadmin.wikimedia.org/tools/id/commonsapi (e.g. https://commonsapi.toolforge.org/?image=Fall_in_Yukon%27s_Tombstone_Territorial_Park_%E2%80%93_Protected_areas_in_Canada_Q844692.jpg&thumbwidth=200) needs a restart.

I assume at least the rest of these listed by others in this ticket need a restart too.

Brycehughes mentioned this in T346042: cloudservices1005: move to new setup. Sep 14 2023, 8:46 PM

Brycehughes added projects: SRE, Cloud-VPS.

@aborrero @cmooney I'm wondering if T346177 was resolved prematurely, since most if not all of the Toolforge tools are throwing 504s now. Any chance you could look into this?

@Brycehughes that issue was resolved; however, other changes have been made since. They should not have caused any issues, but I can't guarantee the problems you're seeing are unrelated.

Some of the links reported above (e.g. https://geohack.toolforge.org) appear to be working, but not being a regular user I can't fully tell (other than that the site loads and looks functional).

The https://commonsapi.toolforge.org site doesn't load right now. A curl shows that TLS negotiation completes ok, but then the site just stops responding. So to me it looks like the request gets so far, and then probably something on the back-end breaks / times out, and the user trying to connect is just left waiting. But because the error isn't reported back I can't really tell what's going on behind the scenes.
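The distinction drawn above, between a proxy that answers with a 504 and a back-end that accepts the connection and then goes silent, can be checked programmatically. The following is a minimal sketch using only the Python standard library; the function names `classify` and `probe`, the label strings, and the 10-second timeout are illustrative assumptions, not part of any Toolforge tooling.

```python
import socket
import urllib.error
import urllib.request


def classify(status=None, error=None):
    """Map a probe result to a coarse label (labels are illustrative)."""
    if isinstance(error, (socket.timeout, TimeoutError)):
        return "hang"              # connection made, but no reply in time
    if error is not None:
        return "connection_error"  # refused, DNS failure, TLS failure, ...
    if status == 504:
        return "gateway_timeout"   # the proxy gave up on the back-end
    if status is not None and 200 <= status < 400:
        return "ok"
    return "other_error"


def probe(url, timeout=10.0):
    """Fetch url once and classify the outcome."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify(status=resp.status)
    except urllib.error.HTTPError as exc:
        # HTTP error statuses (including 504) arrive as HTTPError.
        return classify(status=exc.code)
    except urllib.error.URLError as exc:
        # URLError usually wraps the underlying cause in .reason.
        reason = exc.reason if isinstance(exc.reason, Exception) else exc
        return classify(error=reason)
    except (socket.timeout, TimeoutError) as exc:
        # A timeout while reading the response is raised unwrapped.
        return classify(error=exc)
```

Calling `probe("https://commonsapi.toolforge.org/")` would be expected to return `"hang"` for the behaviour described above and `"gateway_timeout"` for the NGINX 504 page reported earlier, though the exact result depends on the proxy's own timeout relative to the one passed here.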

Looking at commonsapi.toolforge.org, it's using internal IP 172.16.0.17, and has an external IP of 185.15.56.11. Checking for DNS traffic from it I can see it seems to be resolving names ok:

root@cloudnet1005:~# tcpdump -i qr-defc9d1d-40 -l -p -nn host 172.16.0.17 and port 53 
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on qr-defc9d1d-40, link-type EN10MB (Ethernet), snapshot length 262144 bytes
23:05:40.835546 IP 172.16.0.17.39310 > 208.80.154.143.53: 37405+ A? syslogaudit2.svc.eqiad1.wikimedia.cloud. (57)
23:05:40.835856 IP 208.80.154.143.53 > 172.16.0.17.39310: 37405 1/0/0 A 172.16.5.118 (73)

root@cloudnet1005:~# tcpdump -vvvv -i qr-defc9d1d-40 -l -p -nn host 172.16.0.17 and port 53 
tcpdump: listening on qr-defc9d1d-40, link-type EN10MB (Ethernet), snapshot length 262144 bytes
23:11:52.499669 IP (tos 0x0, ttl 64, id 44550, offset 0, flags [DF], proto UDP (17), length 80)
    172.16.0.17.45914 > 172.20.255.1.53: [udp sum ok] 27755+ AAAA? tools-proxy-06.tools.eqiad.wmflabs. (52)
23:11:52.504245 IP (tos 0x0, ttl 61, id 41898, offset 0, flags [none], proto UDP (17), length 168)
    172.20.255.1.53 > 172.16.0.17.45914: [udp sum ok] 27755 q: AAAA? tools-proxy-06.tools.eqiad.wmflabs. 0/1/0 ns: eqiad.wmflabs. [30s] SOA ns0.openstack.eqiad1.wikimediacloud.org. root.wmflabs.org. 1694540600 3600 600 86400 3600 (140)
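For readers less familiar with this output: in a tcpdump DNS reply, the `a/n/au` triple gives the number of answer, authority, and additional records. The first capture shows `1/0/0` (one A record returned), while the second shows `0/1/0` with only an SOA in the authority section, which appears consistent with the name simply having no AAAA record rather than with a resolver failure. A small sketch for extracting those counts from captured lines follows; the helper name `dns_reply_counts` is hypothetical.

```python
import re

# Matches the answer/authority/additional triple in a tcpdump DNS reply,
# e.g. "1/0/0". Lookarounds avoid matching inside longer digit/slash runs.
_COUNTS = re.compile(r"(?<![\d/])(\d+)/(\d+)/(\d+)(?![\d/])")


def dns_reply_counts(line):
    """Return the record counts from one tcpdump reply line, or None."""
    match = _COUNTS.search(line)
    if match is None:
        return None  # query lines carry no count triple
    answers, authority, additional = (int(g) for g in match.groups())
    return {
        "answers": answers,
        "authority": authority,
        "additional": additional,
    }


reply = ("23:05:40.835856 IP 208.80.154.143.53 > 172.16.0.17.39310: "
         "37405 1/0/0 A 172.16.5.118 (73)")
print(dns_reply_counts(reply))
```

A reply with `answers == 0` is worth a closer look, but as in the AAAA case above it does not by itself indicate broken resolution.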

Sorry, that's probably all too low-level to give you much insight. Not really knowing the upper layers of the cloud setup, and not having access to the systems, it's hard for me to troubleshoot what's going on above that. The bits I could check are working ok. At this point we may need to wait until more folks are online in the EU morning to shed light on the situation.

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-15T08:36:52Z] <wm-bot2> dcaro@urcuchillay START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-76 (T346126)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-15T08:38:04Z] <wm-bot2> dcaro@urcuchillay END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-76 (T346126)

It was due to a non-responsive Kubernetes worker node. Rebooting it to force the pod to be rescheduled seems to have brought the service back online. I'm looking into what caused the node to get stuck, but it seems unrelated to the DNS issues.

@Brycehughes can you verify that everything works as you expect?

@dcaro it works in my case. What a coincidence re the DNS stuff! Thanks all

Thanks @dcaro for fixing the cluster!

Hello,

https://templatetransclusioncheck.toolforge.org/

https://templatetransclusioncheck.toolforge.org/?lang=de&name=Vorlage:Navigationsleiste_Kader_der_KAA_Gent

returns a "504 Gateway Time-out" error.

https://de.wikipedia.org/w/index.php?title=Wikipedia%3AFragen_zur_Wikipedia&diff=237413922&oldid=237398558

Thanks a lot!

Mentioned in SAL (#wikimedia-cloud) [2023-09-18T10:41:08Z] <dhinus> restarted stuck pod (webservice stop+start) T346126

@M2k_dewiki the Kubernetes pod was stuck, I restarted it manually with webservice stop followed by webservice start, and https://templatetransclusioncheck.toolforge.org/ is now working again.

Mentioned in SAL (#wikimedia-cloud) [2023-09-18T10:43:55Z] <dhinus> restarted stuck pod (webservice stop+start) T346126

Brycehughes mentioned this in T347532: At least one commons tool timing out. Sep 27 2023, 10:48 PM

Brycehughes mentioned this in T347533: At least one commons tool timing out. Sep 27 2023, 11:17 PM

M2k_dewiki reopened subtask T341190: duplicity returns "504 Gateway Time-out" as Open. Oct 11 2023, 4:41 PM

M2k_dewiki closed subtask T341190: duplicity returns "504 Gateway Time-out" as Resolved. Oct 11 2023, 6:44 PM

Hello,

https://templatetransclusioncheck.toolforge.org/

https://templatetransclusioncheck.toolforge.org/?lang=de&name=Vorlage:Navigationsleiste_Kader_der_KAA_Gent

returns a "504 Gateway Time-out" error again (as before on 17th of September 2023 above)

Could you please check / restart the service?

Thanks a lot!

Also see

Mentioned in SAL (#wikimedia-cloud) [2023-11-27T10:53:13Z] <dhinus> restarted stuck pod T346126

Hello,

https://templatetransclusioncheck.toolforge.org/

https://templatetransclusioncheck.toolforge.org/?lang=de&name=Vorlage:Navigationsleiste_Kader_der_KAA_Gent

returns a "504 Gateway Time-out" error again (as before on 17th of September 2023 above)

I have restarted the tool (become templatetransclusioncheck; webservice restart) and it's working again.

@Chameleon222 you are listed as a maintainer of this tool, can you please check if there is any underlying issue that is causing the tool to get stuck after a few days?
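Until the underlying cause is found, a maintainer could automate the manual restart described above. The following is a rough sketch only: the class name `RestartWatchdog` and the threshold of three consecutive failures are assumptions, and the default action simply shells out to `webservice restart`, which would have to run under the tool account (after `become <tool>`). How and how often the health check runs is left to the caller.

```python
import subprocess


class RestartWatchdog:
    """Restart the webservice after several consecutive failed checks."""

    def __init__(self, threshold=3, restart=None):
        self.threshold = threshold
        # The restart action is injectable so the logic can be tested
        # without touching a real service.
        self.restart = restart or self._webservice_restart
        self.failures = 0

    @staticmethod
    def _webservice_restart():
        # Same command as the manual fix; assumes the Toolforge
        # `webservice` CLI is on PATH for the current (tool) user.
        subprocess.run(["webservice", "restart"], check=True)

    def record(self, healthy):
        """Record one check result; return True if a restart was triggered."""
        if healthy:
            self.failures = 0
            return False
        self.failures += 1
        if self.failures >= self.threshold:
            self.restart()
            self.failures = 0
            return True
        return False
```

Requiring several consecutive failures before restarting avoids bouncing the service on a single slow response. It is only a workaround and does not explain why the pod gets stuck in the first place.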
