
IA Upload 504 Gateway Timeouts
Open, Needs Triage, Public

Description

The tool frequently responds with the error "An error occurred: Server error: POST https://commons.wikimedia.org/w/api.php resulted in a 504 Gateway Timeout response: upstream request timeout". The 'Job queue' shows many failed attempts at uploading files.

Event Timeline

Pigsonthewing renamed this task from IS Upload GAteway Timeouts to IA Upload Gateway Timeouts. Nov 22 2020, 12:05 PM

I'm also getting these over the API (without IA-Upload) - I think it's that the Commons upload-by-URL call times out.

However, the process is actually still working in the background and the file does eventually appear, but you still get 504 errors even once the file exists.
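The behaviour described above (a 504 while the upload carries on in the background) could be handled on the client side roughly as follows. This is only a minimal sketch: `GatewayTimeout`, `start_upload` and `file_exists` are hypothetical stand-ins for whatever the HTTP client and API wrapper actually provide, and the attempt count and delay are arbitrary.

```python
import time
from typing import Callable


class GatewayTimeout(Exception):
    """Hypothetical stand-in for an HTTP 504 raised by the HTTP client."""


def upload_tolerating_timeout(
    start_upload: Callable[[], None],
    file_exists: Callable[[], bool],
    attempts: int = 5,
    delay_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once the file is believed to be on the server."""
    try:
        start_upload()
        return True  # the request completed without timing out
    except GatewayTimeout:
        pass  # the server may still be finishing the upload in the background

    # Instead of re-sending the file, check a few times whether it appeared.
    for _ in range(attempts):
        if file_exists():
            return True
        sleep(delay_seconds)
    return False
```

Checking for the file rather than re-uploading avoids a second long-running request that could time out in the same way.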

Inductiveload renamed this task from IA Upload Gateway Timeouts to IA Upload 504 Gateway Timeouts. Apr 6 2021, 8:38 AM

Based on the symptom and the time this started to become a real issue, I think this could well be the same root cause as T129216: the chunked upload needs to be async, or it times out.

If that is true, porting https://gerrit.wikimedia.org/r/c/pywikibot/core/+/679021 to addwiki would be needed.
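For anyone unfamiliar with what an async chunked upload involves, here is a rough sketch of the request parameters for each step. It is not taken from the pywikibot change or from addwiki. The parameter names (`stash`, `filesize`, `offset`, `filekey`, `async`, `checkstatus`) are as I understand the MediaWiki `action=upload` module and should be checked against the API documentation, in particular which requests `async` belongs on. The helper names are made up, and the CSRF token and the chunk bytes themselves are left out.

```python
from typing import Dict, Iterator, Optional, Tuple


def chunk_offsets(filesize: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (offset, length) pairs that cover the whole file."""
    offset = 0
    while offset < filesize:
        length = min(chunk_size, filesize - offset)
        yield offset, length
        offset += length


def chunk_params(filename: str, filesize: int, offset: int,
                 filekey: Optional[str] = None,
                 use_async: bool = True) -> Dict[str, str]:
    """Parameters for uploading one chunk into the stash."""
    params = {
        "action": "upload", "format": "json", "stash": "1",
        "filename": filename, "filesize": str(filesize), "offset": str(offset),
    }
    if filekey is not None:
        # The filekey is returned by the first chunk request and reused after.
        params["filekey"] = filekey
    if use_async:
        params["async"] = "1"  # assumption: lets the server assemble chunks in a job
    return params


def publish_params(filename: str, filekey: str) -> Dict[str, str]:
    """Parameters for publishing the assembled, stashed file asynchronously."""
    return {"action": "upload", "format": "json",
            "filename": filename, "filekey": filekey, "async": "1"}


def status_params(filekey: str) -> Dict[str, str]:
    """Parameters for polling the state of an asynchronous upload."""
    return {"action": "upload", "format": "json",
            "checkstatus": "1", "filekey": filekey}
```

The idea is that no single HTTP request has to wait for assembly or publishing to finish: the client sends `status_params` repeatedly until the server reports that the work is done.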

MusikAnimal renamed this task from IA Upload 504 Gateway Timeouts to Investigate IA Upload downtime. May 7 2021, 7:51 PM
MusikAnimal updated the task description.
MusikAnimal removed a project: Commons.
MusikAnimal moved this task from New & TBD Tickets to Needs Discussion on the Community-Tech board.

For the time being, I've got a cron job running that restarts the webservice every 4 hours. This is not a fix.

NRodriguez subscribed.

It would be great to also get a complexity estimate on this ticket!

NRodriguez moved this task from Needs Discussion to Up Next on the Community-Tech board.
NRodriguez added a project: Spike.
Restricted Application changed the subtype of this task from "Task" to "Spike". May 10 2021, 7:40 PM

It looks like the downtimes began around April 12, and commit 1ba22eb9083f53c1118175648941a702e80b2a15, on March 29, was the most recent major commit before that.

The changes between then and the most recent non-i18n commit: https://github.com/wikisource/ia-upload/compare/1ba22eb9083f53c1118175648941a702e80b2a15..d1020ef5c61a0cb90052acf9b18897d6554986d9

It looks like the downtimes began around April 12

I opened this ticket on Nov 22 2020...

I think there is confusion here (at least I am confused). I thought this was about the IA Upload tool receiving a 504 from the Commons API and reporting it to the user during the upload process (my pet theory is that this is because AddWiki doesn't do async chunked uploads, and sync chunked uploads have recently stopped working well).

The more recent total IA-Upload outage(s) have been the IA-Upload tool itself returning 503s to the user.

Maybe the recent renaming of the task and rewording of the description hasn't helped?

Original title: IA Upload Gateway Timeouts.

Earlier title: IA Upload 504 Gateway Timeouts

Original description: For three days now, the tool has been responding with "An error occurred: Server error: POST https://commons.wikimedia.org/w/api.php resulted in a 504 Gateway Timeout response: upstream request timeout" error. The 'Job queue' shows many failed attempts at uploading files.

Maybe the recent renaming of the task and rewording of the description hasn't helped?

Sorry! I must have misread. Inductiveload's comments from April lined up with the general downtime we were seeing, and 503/504 are pretty close numerically, heh... We can re-open T268594: An error occurred: Server error: or T276222: Error: 504 Gateway Timeout about that issue since I incorrectly hijacked this task? Apologies again!

Yeah, sorry – I got confused too!

I've opened T282633 to deal with the current downtime problem.

MusikAnimal renamed this task from Investigate IA Upload downtime to IA Upload 504 Gateway Timeouts. May 12 2021, 4:07 PM
MusikAnimal removed a project: Spike.
MusikAnimal updated the task description.
MusikAnimal removed the point value for this task.
MusikAnimal changed the subtype of this task from "Spike" to "Task". May 12 2021, 4:09 PM

Is this perhaps simply another symptom of T292954?

@Pigsonthewing Are you still getting these errors? The changes announced in the latest Tech News should have essentially eliminated these if it's the same cause.

@Xover I haven't seen one of these for a while, but they're theoretically still possible until T295009 moves.

However, they're often "bogus" in that you get a timeout, but the file does eventually appear on the server. The first you know of that is often an "exists" error on a subsequent retry (à la T293435).
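One way a client could avoid reporting these "bogus" failures is to treat an "exists" answer on the retry as success. The sketch below assumes the response has already been parsed from JSON and that its shape is either an `upload` object with `result` and `warnings`, or an `error` object with a `code`. Those shapes and the `fileexists` code prefix are assumptions to verify against real API responses, and `classify_retry` is a made-up name.

```python
from typing import Any, Dict


def classify_retry(response: Dict[str, Any]) -> str:
    """Map the parsed response of a retried upload to a coarse outcome."""
    upload = response.get("upload", {})
    if upload.get("result") == "Success":
        return "uploaded"
    # Assumed shape: warnings keyed by name, e.g. {"exists": "File.djvu"}.
    if "exists" in upload.get("warnings", {}):
        return "already-present"
    # Assumed shape: {"error": {"code": "..."}}.
    code = response.get("error", {}).get("code", "")
    if code.startswith("fileexists"):
        return "already-present"
    return "failed"
```

A job queue could then mark "already-present" jobs as done after confirming that the existing file is the one it was trying to upload, rather than listing them as failed.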