
Could GWToolset upload faster please?
Closed, Declined · Public

Description

  • Could GWToolset upload donated files faster again please?

GWToolset has been very useful for mass uploading large image donations by museums to Wikimedia Commons. However, the upload speed is erratic: sometimes I could upload, say, 2000 images per 24 hours (for instance three weeks ago with 300 kB files, and on 1 May 2015 with 50 MB files); at other times (end of July to beginning of August 2015) only around 200 files of 300 kB per 24 hours, or fewer. This has been the case again for the past 10 days.

Judging by https://commons.wikimedia.org/wiki/Special:NewFiles "Recent uploads", the upload server could handle a higher priority for GWToolset jobs. GWT was expected to run a background job "at least every 5 minutes". At present, it can take more than three hours before GWT resumes (my job this morning paused from 05:01 to 08:12), regardless of the throttle setting.

I am currently the temporary Wikipedian in Residence at Naturalis Biodiversity Center in Leiden (May-November 2015). At this upload speed I cannot fulfill my obligation to the donor, Naturalis, to upload many more thousands of their images, so I would be very grateful for a higher upload speed.

Thank you for considering this request, best regards

hans muller
https://commons.wikimedia.org/wiki/User:Hansmuller

Project sites:
https://en.wikipedia.org/wiki/Wikipedia:GLAM/Naturalis
https://nl.wikipedia.org/wiki/Wikipedia:GLAM/Naturalis

Event Timeline

Hansmuller raised the priority of this task from to Needs Triage.
Hansmuller updated the task description.
Hansmuller subscribed.
Restricted Application added subscribers: Steinsplitter, Aklapper.
Hansmuller set Security to None.
Hansmuller raised the priority of this task from Medium to Needs Triage. Sep 29 2015, 1:44 PM

@Hansmuller : Just to clarify - How many images are you wanting to upload, what timeframe do you want to do it in, and what rate would you like to have?


"at least every 5 minutes"

I know it says that somewhere in the Commons docs. That's mostly a lie, and the reality of when GWToolset gets called is significantly more complicated.


So how fast GWToolset does things seems to be based on:

  • GWToolset\Config::$metadata_job_delay (currently 1 minute)
  • $wgJobBackoffThrottling['gwtoolsetUploadMetadataJob'] (currently 5/3600). I suspect this is the limiting factor.
  • GWToolset\Config::$mediafile_job_throttle_max (currently 20)

I think it's safe to temporarily increase these limits (within reason). A major reason for them, I think, is to avoid saturating other people's bandwidth.

I think setting $wgJobBackoffThrottling['gwtoolsetUploadMetadataJob'] to 10/3600 would be reasonable. (However, I'd like others' opinions.)


Just as a note, @Hansmuller is correct that GWToolset is all over the place. For example, on Sept 10 there were 1600 UploadMetadataJobs processed, which seems to be many more than there should have been. At current settings, there should be about 5 metadata jobs an hour, 120 a day (= ~2400 total images a day). See P2119.
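The arithmetic behind that estimate can be sketched roughly as below. This is a back-of-the-envelope calculation in Python, not GWToolset or MediaWiki code: the function name is hypothetical, and it assumes that the backoff throttle is the only limit and that every metadata job queues exactly `mediafile_job_throttle_max` (20) media files, neither of which is guaranteed in practice.

```python
SECONDS_PER_DAY = 86400


def expected_images_per_day(jobs_per_period, period_seconds, files_per_job):
    """Rough upper bound on images/day if the backoff throttle were the only limit.

    jobs_per_period / period_seconds mirrors the "5 / 3600" style of the
    throttle value quoted above; files_per_job stands in for
    mediafile_job_throttle_max. Both mappings are assumptions.
    """
    jobs_per_day = jobs_per_period * SECONDS_PER_DAY / period_seconds
    return jobs_per_day * files_per_job


# Current setting quoted in the thread: 5 metadata jobs per 3600 s, 20 files each.
current = expected_images_per_day(5, 3600, 20)

# Proposed setting: 10 metadata jobs per 3600 s, same 20 files each.
proposed = expected_images_per_day(10, 3600, 20)

print(f"current:  {current:.0f} images/day")
print(f"proposed: {proposed:.0f} images/day")
```

Under these assumptions, doubling the throttle simply doubles the ceiling. It says nothing about the multi-hour pauses reported above, which would have to come from some other bottleneck.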

@Bawolff: Thanks for your interest.

  • How many images? Naturalis has offered 8 million images, but of course not all are of encyclopedic value. I have metadata at hand for 240,000 images, but some large collections still have to be explored.
  • Timeframe: before December 1st, so 60 days to go.
  • Rate: I think I can't reasonably ask for more than 2500 images/day. What would be reasonable? As a rule, Naturalis images are about 300 kB.
  • Bandwidth: I have not experienced delays when using UploadWizard during quite heavy uploading sessions by other users.

Best regards, Hansmuller
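For scale, the figures in this comment can be put side by side with a short calculation. This is only an illustrative sketch in Python: the helper names are hypothetical, and the 200 images/day figure is the slow rate reported in the task description, treated here as constant.

```python
import math


def days_needed(total_images, images_per_day):
    """Whole days required to upload total_images at a constant daily rate."""
    return math.ceil(total_images / images_per_day)


def images_in_window(images_per_day, days):
    """Images uploaded in a fixed window at a constant daily rate."""
    return images_per_day * days


METADATA_READY = 240_000   # images with metadata at hand
DAYS_LEFT = 60             # days until December 1st

print(days_needed(METADATA_READY, 2500))   # days to finish at the requested rate
print(images_in_window(2500, DAYS_LEFT))   # uploads in 60 days at 2500/day
print(images_in_window(200, DAYS_LEFT))    # uploads in 60 days at the slow rate
```

At the requested rate the 240,000 images would take 96 days, so roughly 150,000 would fit in the remaining 60 days, against about 12,000 at the slow rate.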

2500 images/day? By my guesstimate, this is the rate that can be achieved not by uploading through GWToolset but by running a script, which I can't do. Of course, I can't upload all the images mentioned above before December 1, but I should upload a fair share of the various collections, which cannot be done at GWToolset's present upload rate. So I would be very grateful for improvements. Regards, hansmuller

My GWToolset job halted last night from 04:24 to 07:40 on 5 October. Sometimes it helps to submit a tiny job (5 records, throttle 1), which I did at 07:57. Repeating this procedure at around 11:15 with a 2-record job (throttle 1), after a halt of more than an hour, did not work. (Perhaps it had the opposite effect.) Regards, hansmu

@aaron Do you think increasing $wgJobBackoffThrottling['gwtoolsetUploadMetadataJob'] sounds reasonable?

It would be nice to have logging to know when throttling is being applied vs some other bottleneck.

MarkTraceur subscribed.
MarkTraceur lowered the priority of this task from Medium to Low. Dec 5 2016, 10:17 PM
MarkTraceur moved this task from Untriaged to Tracking on the Multimedia board.