
503 errors on API file upload
Closed, Duplicate (Public)

Description

See https://commons.wikimedia.org/wiki/Commons:Village_pump#Missing_Commons_image_text_pages

This may be an operational glitch which started on 25 September, but uploads through the API appear to have been suffering 503 drop-outs at an unusually high rate (observed up to the date of creating this task, 27 September). A consequence is more empty image text pages for uploaded images, resulting in more volunteer work to repair these retrospectively. It is unknown whether uploads using other processes, such as the standard Upload Wizard, were affected.

It would be good to have confirmation that this was a transient issue which either has been fixed or will go away by itself, and, if it is related to a system change on the 25th, that this can be noted for similar future changes.

Event Timeline

Fae raised the priority of this task from to Needs Triage.
Fae updated the task description. (Show Details)
Fae added a project: Commons.
Fae subscribed.
Fae set Security to None.

For reference, some examples:

  • File:The_Boston_Cooking_School_magazine_of_culinary_science_and_domestic_economics_(1905)_(14773190165).jpg
  • File:Matabele_land_and_the_Victoria_Falls_-_a_naturalist's_wanderings_in_the_interior_of_South_Africa,_from_the_letters_and_journals_of_the_late_Frank_Oates_(1881)_(14728367516).jpg
  • File:Life_and_letters_of_Maggie_Benson_(1917)_(14597477079).jpg

The interesting bit is that all these files are missing both the page text and the edit summary (and img_description, and log_comment).

In the event anyone actually investigates this (which, given how these things go, don't get your hopes up), it may be useful to know the exact API requests used. (If you don't know the exact API requests, do you know whether it was a stashed upload, chunked upload, URL upload, or just a plain direct upload using the file parameter? Also, was the comment parameter set to something?)

I'm given to understand the course of events is:

  • You start upload
  • You get a 503
  • At some point, you try uploading again

Did you retry immediately, or did you wait some time? Do the timestamps on the file look like they correspond with the initial 503'd request or the second attempt? If you're using a library, are you sure that the library isn't just buggy when it gets a 503 and retries, but without the page text?

(Responses to Bawolff's questions)

The example uploads all use an older but very stable version of Pywikibot (the same one I've used for >500,000 uploads, 300,000 of them in this latest project). This same routine worked without this type of error until 25 September, so that's a lot of history indicating that the external system has changed, not anything I ought to be fixing in my code.

I do attempt a re-upload with a 5-second delay after the 503 failure times out (the delay increases between each successive failure), and these seem to succeed on the second attempt. The second call is identical to the first, passing up(url, pagetitle, description), so the description should be passed again. I have not checked timestamps, nor have I started to work around the failure.
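The retry behaviour described above (fixed initial delay, growing between attempts, re-sending the same description each time) looks roughly like this sketch. `upload_once` and `ServerError` are hypothetical stand-ins, not the actual Pywikibot API; the point is only that the description is resubmitted unchanged on every attempt.

```python
import time


class ServerError(Exception):
    """Stand-in for an HTTP 5xx (such as 503) response from the server."""


def upload_with_retry(upload_once, url, pagetitle, description,
                      attempts=4, initial_delay=5):
    """Retry an upload after 503-style failures, re-sending the same
    description each time so the file page text is never dropped.

    upload_once is a hypothetical callable standing in for the real
    Pywikibot upload routine; the delay grows between attempts, as in
    the workflow described above.
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return upload_once(url, pagetitle, description)
        except ServerError:
            if attempt == attempts:
                raise
            time.sleep(delay)
            delay *= 2  # back off further on each successive failure
```

If the library being used did *not* resubmit the description on retry, that would reproduce exactly the "file exists but page text is empty" symptom, which is why Bawolff asks about buggy retry behaviour.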

Even though new, the error rate is *low*: something like 1 in 200 files being uploaded.

Example failures from today are:

Though as I've been knocked offline for a while today, I have attempted far fewer uploads than I achieved over the weekend.

Testing my uploads for the whole of September shows this problem did not exist until 25 September (i.e. zero matches before then), even though the identical upload process was being used. On average, 0.3% of uploads using similar API processes may be affected:
<pre>
+-------------+------------+
| empty/total | Date       |
+-------------+------------+
| 29/10352    | 2015-09-25 |
| 36/10222    | 2015-09-26 |
| 28/ 8928    | 2015-09-27 |
|  7/ 6532    | 2015-09-28 |
| 29/ 9470    | 2015-09-29 |
| 28/10845    | 2015-09-30 |
+-------------+------------+
</pre>
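A throwaway snippet (not part of the bug report) showing how these empty/total counts translate into the percentage ratios used in the later tables:

```python
# Daily (empty, total) counts from the table above.
data = [
    ("2015-09-25", 29, 10352),
    ("2015-09-26", 36, 10222),
    ("2015-09-27", 28, 8928),
    ("2015-09-28", 7, 6532),
    ("2015-09-29", 29, 9470),
    ("2015-09-30", 28, 10845),
]

for date, empty, total in data:
    # Failure ratio as a percentage, e.g. 29/10352 -> 0.28%
    print(f"{date}: {empty}/{total} = {100 * empty / total:.2f}%")
```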

One in 200 doesn't seem much. The problem is that Fae is uploading tons of stuff; current count 3 million edits (see Fae's SUL). I just found 5, no idea how many are still hidden somewhere. This needs to be fixed now.


I looked through the upload code in MediaWiki. There really wasn't anything there that looked like it could cause this issue. I think there is a chance it could be the bot's code at fault (or at least I think that's the most logical thing to investigate next).

Of course the most useful thing would be being able to reproduce this issue in a controlled situation.

One in 200 doesn't seem much. The problem is that Fae is uploading tons of stuff

It's not 1 out of 200 across the board. The problem is that Fae uploads a ton of stuff, and the issue mostly just affects him. Here are the numbers of images uploaded in September that have symptoms similar to this bug:

<pre>
MariaDB [commonswiki_p]> select rev_user_text, count(*) from revision
    inner join page on rev_page = page_id
    inner join image on img_name = page_title
    where rev_parent_id = 0
      and rev_sha1 = 'phoiac9h4m842xq45sp7s6u21eteeq1'
      and rev_timestamp > '20150900000000'
      and page_namespace = 6
      and img_description = ''
    group by rev_user_text
    order by count(*) desc;
+---------------------+----------+
| rev_user_text       | count(*) |
+---------------------+----------+
| Fæ                  |      245 |
| Senapa              |        9 |
| Ilaha Shbb          |        6 |
| Bob.                |        6 |
| Ikshvaaku           |        4 |
| Dsprc               |        4 |
| ಸುಮಂತ               |        3 |
| Frank Behnsen       |        2 |
| Ernest-Mtl          |        1 |
| MovingWPdeToCommons |        1 |
| Ant 222             |        1 |
+---------------------+----------+
11 rows in set (30.45 sec)
</pre>

In that time period, there were about 969,855 pages created in the image namespace, of which 287,438 were created by Fae. That's about 8.5 out of every 10,000 images for Fae, but only 0.5 out of every 10,000 images for everyone else. So the issue seems to predominantly affect Fae.

I just found 5, no idea how many are still hidden somewhere. This needs to be fixed now.

At this precise moment there is 1 file with a page_length of exactly 0.

<pre>
MariaDB [commonswiki_p]> select page_title from page where page_namespace = 6 and page_len = 0;
+----------------------------+
| page_title                 |
+----------------------------+
| Manastiri_i_Deçanit_04.jpg |
+----------------------------+
1 row in set (2.20 sec)
</pre>

(Of course bots add missing stuff templates pretty quickly afterwards. Anyways, Fae has been doing queries looking for files without much info on them)

FYI I have been tracking down and repairing "empties" a couple of times a day using SQL like:

<pre>
SELECT DATE_FORMAT(img_timestamp, "%Y-%m-%d") AS Date,
       GROUP_CONCAT(LEFT(RIGHT(page_title, 16), 11)) AS PID
FROM page
JOIN image ON img_name = page_title
LEFT JOIN oldimage ON oi_name = page_title
WHERE img_timestamp > 20151003000000
  AND (img_user_text = "Fæ" OR oi_user_text = "Fæ")
  AND (img_description = "" OR oi_description = "")
  AND img_timestamp > DATE_FORMAT(DATE_ADD(NOW(), INTERVAL -2 DAY), "%Y%m%d%H%i%s")
  AND page_len < 400
ORDER BY img_timestamp;
</pre>

The PID thing is because my uploads right now are all from Flickr, so this is the Flickr Photo_ID, which is all I need to run my script in reverse. Using page_len and catering for old images means it can still find an image even if it was overwritten with a rotate. My upload function has been precisely the same as it was before 25 September. I have changed the failure behaviour, so that the previously mentioned retry in 5 seconds is now a retry in 35 seconds when an upload failure is reported by the API. My function uses upload.UploadRobot() in Pywikibot compat (which has been stable for years).
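The LEFT(RIGHT(page_title, 16), 11) trick in the query pulls the 11-digit Flickr Photo_ID out of titles ending in "_(<photoid>).jpg" (11 digits plus the 5 characters of ").jpg" makes 16). A Python equivalent, as a sketch assuming that same title shape and a 3-character extension:

```python
def flickr_photo_id(page_title: str) -> str:
    """Extract the 11-digit Flickr Photo_ID from a page title ending in
    "_(<photoid>).jpg", mirroring the SQL LEFT(RIGHT(page_title, 16), 11):
    the last 16 characters are "<11 digits>).jpg", and the first 11 of
    those are the ID. Assumes an 11-digit ID and 3-character extension.
    """
    return page_title[-16:][:11]
```

For example, `flickr_photo_id("Life_and_letters_of_Maggie_Benson_(1917)_(14597477079).jpg")` yields `"14597477079"`.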

The error rate seems to have dropped from 0.30% to 0.17%, but this might be statistical variation rather than a trend.

<pre>
+-------+-------+-------+------------+
| Empty | Total | Ratio | Date       |
+-------+-------+-------+------------+
|    29 | 10246 | 0.28% | 2015-09-25 |
|    34 | 10046 | 0.34% | 2015-09-26 |
|    28 |  8764 | 0.32% | 2015-09-27 |
|     6 |  6393 | 0.09% | 2015-09-28 |
|    28 |  9336 | 0.30% | 2015-09-29 |
|    28 | 10656 | 0.26% | 2015-09-30 |
|    28 |  9202 | 0.30% | 2015-10-01 |
|    30 |  9512 | 0.32% | 2015-10-02 |
|    27 |  8351 | 0.32% | 2015-10-03 |
|     6 |  6766 | 0.09% | 2015-10-04 |
|     1 |  5462 | 0.02% | 2015-10-05 |
|     9 |  5249 | 0.17% | 2015-10-06 |
|     8 |  4619 | 0.17% | 2015-10-07 |
|    23 |  8376 | 0.27% | 2015-10-08 |
+-------+-------+-------+------------+
</pre>

The only other thing that has changed is that I swapped my router at home; the Mac mini processing these connects via my old router working as a repeater, connected via a powerline adapter. Though it would be a bit obscure if a change to my topology at the client end made any operational difference.

Update:

As the uploading process remains unchanged and stable, the table below shows that something changed on the WMF server side on 25 September to introduce more upload failures, then by 9 October something else changed to reduce the error rate (but not eliminate it) by over 90%. Though 1 empty error per 5,000 uploads seems manageable, I remain concerned that there has been no explanation of the underlying cause, so there seems to be nothing to stop it randomly recurring.

<pre>
+-------+-------+-------+------------+
| Empty | Total | Ratio | Date       |
+-------+-------+-------+------------+
|     0 | 13384 | 0.00% | 2015-09-20 |
|     0 | 13400 | 0.00% | 2015-09-21 |
|     0 | 11532 | 0.00% | 2015-09-22 |
|     0 |  9863 | 0.00% | 2015-09-23 |
|     0 | 10946 | 0.00% | 2015-09-24 |
|    29 | 10176 | 0.28% | 2015-09-25 |
|    34 |  9938 | 0.34% | 2015-09-26 |
|    27 |  8518 | 0.32% | 2015-09-27 |
|     6 |  6318 | 0.09% | 2015-09-28 |
|    28 |  9273 | 0.30% | 2015-09-29 |
|    28 | 10418 | 0.27% | 2015-09-30 |
|    28 |  9072 | 0.31% | 2015-10-01 |
|    30 |  9411 | 0.32% | 2015-10-02 |
|    26 |  8139 | 0.32% | 2015-10-03 |
|     6 |  6649 | 0.09% | 2015-10-04 |
|     1 |  5326 | 0.02% | 2015-10-05 |
|     8 |  4969 | 0.16% | 2015-10-06 |
|     8 |  4638 | 0.17% | 2015-10-07 |
|    21 |  7741 | 0.27% | 2015-10-08 |
|     4 |  4824 | 0.08% | 2015-10-09 |
|     0 |  3664 | 0.00% | 2015-10-10 |
|     0 |  6034 | 0.00% | 2015-10-11 |
|     0 |  3172 | 0.00% | 2015-10-12 |
|     1 |  3373 | 0.03% | 2015-10-13 |
|     1 |  5039 | 0.02% | 2015-10-14 |
+-------+-------+-------+------------+
</pre>
MarkTraceur subscribed.

Pretty sure this is no longer an issue, but I'll leave this open until someone comes up with an explanation.

Update: Here are the figures for 1–24 January for uploads of "blanks". The numbers have remained small. As my uploads double-check for blanks and then re-write the text page, this has basically become invisible for my uploads, but it is still a regular failure and quite a bad thing to have built into batch upload scripts without flagging it somewhere:

<pre>
+----------------+-------+
| User           | Count |
+----------------+-------+
| Fæ             |    48 |
| Nghwaya        |     3 |
| Koreahdhwwe    |     2 |
| 風中的刀劍     |     1 |
| فیروز اردووالا |     1 |
| AbhiRiksh      |     1 |
| Benard Okello  |     1 |
| Vit.riviera16  |     1 |
+----------------+-------+
</pre>
An example file that experienced the problem today is https://commons.wikimedia.org/wiki/File:Explicit_(NYPL_b12455533-426847).tif. For large images like this TIFF, the error means a second upload, which can be time-consuming over my home broadband, and also means we have more redundant duplicates in the system.
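The double-check-for-blanks workaround mentioned above (re-read the file page after upload and re-write the description if it came back empty) can be sketched like this. `get_page_text` and `save_page_text` are hypothetical stand-ins for the wiki client's read/save calls, not actual Pywikibot functions:

```python
def repair_if_blank(pagetitle, description, get_page_text, save_page_text):
    """After an upload, re-read the file page and re-write the intended
    description if the page came back empty (the symptom of this bug).

    get_page_text and save_page_text are hypothetical callables standing
    in for the wiki client's page-read and page-save operations.
    """
    if not get_page_text(pagetitle).strip():
        save_page_text(pagetitle, description)
        return True   # page was blank and has been repaired
    return False      # page already had text; nothing to do
```

Building a check like this into batch upload scripts makes the failure invisible to the uploader, which is exactly why it deserves flagging somewhere rather than silent repair.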