
Large number of uploads from DPLA bot
Closed, Resolved · Public

Description

Since March 19 ~15 UTC we've been observing a 3x increase in hourly uploads to swift:

https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?orgId=1&var-DC=codfw&from=1584607747251&to=1584693619919&fullscreen&panelId=26

Looking at recent files, it seems that a large share of the uploads are from https://commons.wikimedia.org/wiki/User:DPLA_bot (see https://commons.wikimedia.org/wiki/Special:Log/DPLA_bot), specifically book uploads at one file per page. This was discovered because our "too many uploads per hour" alert tripped.

A few questions I have:

  1. Is there a way to inspect the bot's upload progress? e.g. is the current batch about to finish? Will it go on for much longer? What is the total expected batch upload size?
  2. Is the one-file-per-page approach to uploading books OK? It produces a whole lot of files on the Commons side, given that we have multi-page format support (PDF, DjVu, etc.)
  3. Can we rate limit uploads from DPLA bot (or from all bots in general)?

All of the above aims to guide our (SRE) expectations for media storage capacity planning.
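As a side note on question 1, a user's upload activity can be inspected through the MediaWiki action API (`list=logevents` with `letype=upload`), which returns per-event timestamps. A minimal sketch of bucketing such timestamps into hourly counts (the sample timestamps below are made up for illustration, not actual DPLA bot data):

```python
from collections import Counter
from datetime import datetime

def hourly_upload_counts(timestamps):
    """Bucket ISO-8601 upload timestamps (the format returned by the
    MediaWiki logevents API) into per-hour counts."""
    buckets = Counter()
    for ts in timestamps:
        t = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")
        buckets[t.replace(minute=0, second=0)] += 1
    return dict(buckets)

# Made-up sample data:
sample = [
    "2020-03-19T15:02:11Z",
    "2020-03-19T15:40:59Z",
    "2020-03-19T16:05:00Z",
]
counts = hourly_upload_counts(sample)
```

Comparing these per-user counts against the Swift dashboard above would show how much of the hourly increase is attributable to a single bot.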

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Mar 20 2020, 9:08 AM

Hi, this is me! 😳 If it's easier, I can get on Telegram or IRC to chat with you about my project. Obviously, I've been going at a high rate, but I don't really want to break Wikimedia!

Hi! For sure, please join #wikimedia-sre on freenode, I'm godog there.

Summary of the IRC chat: the current batch of uploads is about halfway finished and will likely be done by early next week, although no byte-size estimates are available. Bots don't seem to have upload rate limits enforced (thanks @Reedy), which I filed as T248177. The one-file-per-page approach is fine as is; depending on the source, we do get cases like that.

In terms of next steps, this upload batch is part of a bigger project that @Dominicbm is working on to upload ~1.5 million items. Such an effort is very welcome, although it might impact our capacity planning if the total bytes to upload are in the hundreds or thousands of gigabytes range, so it is important to get at least a rough estimate of the size of the whole dataset.
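One way to get such a rough estimate is to extrapolate from a sample of already-uploaded files. A sketch of the arithmetic (all numbers below are placeholders, not actual DPLA figures):

```python
def estimate_total_bytes(sample_sizes, total_items, pages_per_item=1):
    """Extrapolate total upload volume from a sample of per-file sizes.
    With one file per page, multiply by the expected pages per item."""
    mean_size = sum(sample_sizes) / len(sample_sizes)
    return int(mean_size * total_items * pages_per_item)

# Placeholder assumptions: ~1.5M items, ~200 pages per item,
# ~50 KB per page scan (sampled file sizes in bytes).
est = estimate_total_bytes([48_000, 52_000, 50_000],
                           total_items=1_500_000,
                           pages_per_item=200)
```

Even with crude inputs like these, the result lands in the terabyte range, which is the kind of figure SRE would want before the full batch starts.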

fgiunchedi moved this task from Backlog to Radar on the User-fgiunchedi board. · Mar 30 2020, 9:54 AM

@fgiunchedi Is there anything left for this ticket? Can it be closed?

fgiunchedi closed this task as Resolved. · Apr 7 2020, 9:24 AM
fgiunchedi claimed this task.

I'd like to better understand how big a dataset we're talking about for all the uploads @Dominicbm is working on; I'll follow up in another task.