
Uploads via Lingualibre-Commons are hitting an upload ratelimit
Closed, Declined · Public

Description

We have documented massive loss of files: Lingua Libre editors record 1,000 files, 380 upload fine, and 620 are lost due to the enforced 380 uploads / 72 minutes rate limit. We need a long-term solution that allows Lingua Libre folks to contribute via an OAuth 2.0 connection yet pass this rate limit. Commons, Security-team: please give us some guidance.

The current rate limits are defined in wmf-config/InitialiseSettings.php as:

		'upload' => [
			// 380 uploads per 72 minutes
			'user' => [ 380, 4320 ],
			// Effectively no upload rate limit for members of these groups
			'image-reviewer' => [ 999, 1 ],
			'patroller' => [ 999, 1 ],
			'autopatrolled' => [ 999, 1 ],
		],
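For context, the arithmetic behind those numbers (a quick back-of-the-envelope check, not part of wmf-config):

    # Effective pace implied by the 'user' limit above.
    limit_uploads = 380          # uploads allowed per window
    window_seconds = 4320        # 72 minutes
    print(window_seconds / limit_uploads)   # ~11.4 s between uploads, sustained
    # A 1,000-file session hits the cap after 380 files; the remaining
    # 620 must wait for the window to expire before retrying.
    print(1000 - limit_uploads)             # 620 files blocked on the first pass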

See also: T60224, T260649, T245214.

Event Timeline

Yug added a project: Commons.
Yug updated the task description.
Yug updated the task description.
AntiCompositeNumber renamed this task from "Ratelimit : Lingualibre-Commons require better integration via whitelisting or else." to "Uploads via Lingualibre-Commons are hitting an upload ratelimit". Mar 9 2021, 11:27 PM

Is there a reason you need to be uploading files every 11 seconds?

Try :lingualibre:Special:RecordWizard (a recording studio). The tool allows a human to record 1,000 audio files per hour, or one every 3 seconds. That is the service Wikimedia's Lingua Libre provides: making rapid audio recording possible and easy. The rate limit prevents us from delivering that added value to most Wikimedia contributors. We need a workaround. Any ideas? Any whitelisting procedure to follow?

The user can record up to a few thousand words. Once the recordings are all "approved" by the user, Lingua Libre uploads them to Commons with the user's OAuth token. We've seen people recording a thousand words, and then getting "blocked" at upload time due to the 380 uploads / 72 minutes limit. And in most cases, they lose two thirds of their work, because they aren't able to wait 72 minutes before retrying the upload.

That is an extremely high upload rate that will flood Special:RecentChanges and Special:NewFiles with unpatrolled new files. Aside from users qualifying for and being granted autopatrolled status in accordance with https://commons.wikimedia.org/wiki/Commons:Autopatrolled, I don't know of a way to exempt a particular tool from the upload limit, and I would be opposed to the creation of one.

We've seen people recording a thousand words, and then getting "blocked" at upload time due to the 380 uploads / 72 minutes limit. And in most cases, they lose two thirds of their work, because they aren't able to wait 72 minutes before retrying the upload.

Is there a reason the tool isn't batching the uploads and doing them slowly in the background? Then the user wouldn't need to wait around in their session.

Please upload the files slowly, in the background. Storing the OAuth tokens for some time is fine for that purpose.

Note that the background upload process should also respect maxlag, and otherwise behave like a non-interactive process.
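To make that concrete, here is a minimal sketch of such a non-interactive uploader, assuming a stored OAuth 2.0 bearer token and the standard action=upload API. The pacing constant and the upload_one/drain names are illustrative, not Lingua Libre's actual code:

    import time
    import requests

    API = "https://commons.wikimedia.org/w/api.php"
    MAXLAG = 5    # standard maxlag for non-interactive clients
    DELAY = 12    # seconds between uploads; stays under 380 uploads / 4320 s

    def upload_one(session, filename, data, csrf_token):
        """Upload a single file, backing off politely when servers are lagged."""
        while True:
            r = session.post(API, data={
                "action": "upload",
                "format": "json",
                "filename": filename,
                "token": csrf_token,
                "maxlag": MAXLAG,
            }, files={"file": data})
            resp = r.json()
            code = resp.get("error", {}).get("code")
            if code == "maxlag":
                # Replication lag is high: wait as instructed, then retry.
                time.sleep(int(r.headers.get("Retry-After", 5)))
            elif code == "ratelimited":
                # Hit the rate limit anyway: sleep it out instead of dropping files.
                time.sleep(60)
            else:
                return resp

    def drain(queue, bearer_token):
        """Walk the approved recordings and upload them at a polite pace."""
        session = requests.Session()
        session.headers["Authorization"] = "Bearer " + bearer_token
        csrf = session.get(API, params={
            "action": "query", "meta": "tokens", "format": "json",
        }).json()["query"]["tokens"]["csrftoken"]
        for filename, data in queue:
            upload_one(session, filename, data, csrf)
            time.sleep(DELAY)    # pace uploads instead of bursting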

Thanks for these avenues.

A few things:

  • Most of our users make recording sessions of 50~400 items (no more), review them, then upload them via OAuth 2.0.
  • Only some users occasionally do a heavy session of 1,000+ items, review them, and upload via OAuth 2.0; currently an estimated 4 times a week altogether.
  • We are looking for solutions that handle all usages, including these occasional heavier ones.
  • Our users recently represent an estimated 1/8 to 1/4 of Commons autopatrolled user-right requests.
  • From Jan. 2020 to Jan. 2021 we had 500% growth in monthly user registrations.
  • Lingua Libre currently represents about 2% of Commons monthly uploads.
  • Reasonable growth estimates require us to prepare for 3x~10x growth in the next 2 years.
  • A special flag could then be helpful to avoid flooding Special:RecentChanges and Special:NewFiles.

It's also possible that Lingua Libre joins or represents a new kind of hybrid tool:

  • Lingua Libre assists humans in occasional one-hour sessions with bot-like productivity, from 200 audio files per 10-minute session up to 1,000 audio files per hour. It is neither a bot nor simply a human.
  • Lingua Libre strongly constrains contributions to well-formatted audio files and nothing else.
  • This tight constraint makes the contributions trustworthy. The trust is not based on the time-based, human-experience-based path of previous user-rights requests.

We may be looking at an emerging computer-assisted use case, much as when bots first appeared, which may require a dedicated status and handling.

If this working process reminds you of other computer-assisted editing systems on Wikimedia wikis, I would be interested to learn more and see whether there is an emerging tool pattern there.
We will investigate whether your recommendations match our constraints and do our best.

Is there a reason the tool isn't batching the uploads and doing them slowly in the background? Then the user wouldn't need to wait around in their session.

Not sure about this one (Poslovitch, any idea?).
I believe everything is done client-side. So after recording and validation, when the green light is given by the contributor, the contributor's browser tab sends the files and metadata to Commons. We are looking for a balance between respecting Commons (going slow) and respecting the contributors and managing risk (not taking hours, and not accidentally losing the data to a dead battery, a closed tab, or the like).

sbassett moved this task from Incoming to Watching on the Security-Team board.
sbassett subscribed.

The Security-Team would currently rate any increase in rate limits for file uploads for this use case as a medium risk, given several of the concerns around resource exhaustion and auditability mentioned in previous comments.

I don't think there is currently any willingness to raise upload rate limits for Lingualibre-Commons. As mentioned above, the ideal way to fix this issue would be to introduce a queue for uploads and process it outside of the webrequest context, storing the OAuth tokens temporarily.
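For illustration only, such a queue could be as simple as a table written during the web request and drained by a separate worker; every name here (the table layout, upload_one_with_token) is hypothetical, and the token is discarded as soon as the job completes:

    import sqlite3
    import time

    # Jobs are enqueued during the web request and drained by a separate
    # worker process, outside the webrequest context.
    db = sqlite3.connect("upload_queue.sqlite")
    db.execute("""CREATE TABLE IF NOT EXISTS queue (
                      id INTEGER PRIMARY KEY,
                      oauth_token TEXT,   -- stored only until the job completes
                      filename TEXT,
                      payload BLOB,
                      done INTEGER DEFAULT 0)""")

    def enqueue(token, filename, payload):
        """Called at approval time, inside the web request: cheap and instant."""
        db.execute("INSERT INTO queue (oauth_token, filename, payload) VALUES (?, ?, ?)",
                   (token, filename, payload))
        db.commit()

    def worker():
        """Separate process: uploads slowly, then discards the stored token."""
        while True:
            row = db.execute("SELECT id, oauth_token, filename, payload "
                             "FROM queue WHERE done = 0 LIMIT 1").fetchone()
            if row is None:
                time.sleep(30)    # queue empty; poll again later
                continue
            job_id, token, filename, payload = row
            upload_one_with_token(token, filename, payload)   # maxlag-aware upload
            # Mark done and null the token so it is only held temporarily.
            db.execute("UPDATE queue SET done = 1, oauth_token = NULL WHERE id = ?",
                       (job_id,))
            db.commit()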

If you need any help or assistance with implementing this solution, let us know.

I don't have the skills to understand your whole proposal.
Wouldn't this solution require the user to leave their web browser tab open for a long time (1,000 audio files ≈ 5,000 seconds ≈ 1h30min?) and create a higher risk of losing contributors' work?
We currently don't have the development capacity to lead such a project.
(Our historic dev left 6 months ago for a sabbatical year. We gathered ~4 junior devs who slowly took the project back up. We saw +15% growth in the month of February alone. On March 9th, OVH's data center in Europe had a massive fire; 150,000 websites were affected or destroyed. We lost some data, code, wiki pages, and documentation. Our dev and wikimedian capacity is focused on restoring those assets right now, then on onboarding, events, and restoring trust for the next 4 months, after which we will likely need a break. We are in a situation similar to the Suez Canal today: digging to restore the flow :D )
But thanks for offering assistance; this remains a major priority for us and we will come back to it later this year.

So I was chatting with @Yug at the hackathon about this. He was hoping this could be reconsidered.

My understanding of the situation:

  • There is concern that having the tool delay uploads would confuse users, who are non-technical and won't understand if the upload doesn't happen instantly.
  • In the past, users of this tool have asked Commons for autopatrolled status (which increases rate limits), and such requests have generally been granted pretty easily.
  • The use case doesn't need unlimited rate limits; double the current limit would probably be sufficient.

Personally, I wonder how Commons feels about this. A simple doubling of the limit for uploads from an established tool's IP doesn't seem that scary; we already do similar things for the WikiEdu dashboard (T308702). It seems mostly an RC-flooding risk, but only a slight one when the proposed limit is less than the autopatrolled limit. I kind of think we should defer to the Commons community on this.