Produce dumps of commons thumbnail URLs
Open, Medium, Public

Description

We have been considering for some time creating image dumps (T73405), but have never been able to get around to it due to various constraints. Something that seems much easier to implement is producing a dump of valid thumbnail URLs. These URLs are generally of the following form (some ancient URLs are different, and there can be other data prepended to the px segment):

https://upload.wikimedia.org/wikipedia/commons/thumb/c/cb/The_Green_and_Golden_Bell_Frog.jpg/750px-The_Green_and_Golden_Bell_Frog.jpg

By producing a dump of resized thumbnail URLs, external users can choose on a per-image basis which existing thumbnail is close enough to their needs and download the appropriate images from our existing infrastructure. Use cases that only want a few thousand images should still go to the public APIs, but any use case that would like to have all ~60M images will be better served by this dump than by hitting our public APIs tens of millions of times (and if they follow our rate limit guidelines, that will take many months).

This dump can be generated relatively easily by paginating the swift container listings for the 255 commons thumbnail containers and transforming all the internal swift URLs into public external URLs. There is an open question of whether there are thumbnails or files inside swift that should have been deleted but were not (there are no known cases, but that is far from a guarantee; a first draft of whitelisting should tell us whether it is actually necessary). For this reason any dump will need to be whitelisted against the set of valid pages. There are only ~60M valid pages on commons, so an implementation could likely build a set of all the known files in memory and check all the thumbnails (~1.3B) against it.
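For illustration, a minimal sketch of that set-based check (function names, file paths, and the exact listing format are assumptions, not the eventual implementation):

```
# Hypothetical sketch: build an in-memory set of known file titles and filter
# a swift thumb listing against it. Paths and formats are illustrative.

def thumb_to_title(thumb_key: str) -> str:
    """Extract the original file name from a thumb key such as
    'c/cb/The_Green_and_Golden_Bell_Frog.jpg/750px-The_Green_and_Golden_Bell_Frog.jpg'."""
    # shard/shard/<original file name>/<thumb file name>
    return thumb_key.split('/')[2]

def whitelist_thumbs(known_titles_path: str, thumb_listing_path: str, out_path: str) -> None:
    with open(known_titles_path, encoding='utf-8') as f:
        known = {line.rstrip('\n') for line in f}      # ~60M titles held in memory
    with open(thumb_listing_path, encoding='utf-8') as thumbs, \
         open(out_path, 'w', encoding='utf-8') as out:
        for line in thumbs:                            # ~1.3B thumb keys streamed
            key = line.rstrip('\n')
            if thumb_to_title(key) in known:
                out.write(key + '\n')
```

Holding all ~60M titles in a Python set is simple but costs several gigabytes of RAM, which is what motivates the sorted-merge alternative discussed further down.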

To be investigated:

  • What kind of purging is going on? How long will the URLs in the dumps be valid?
  • Should the dump be taken from the dumps infrastructure, or from analytics and shipped to dumps? In particular, analytics has on-demand compute resources which might simplify the work, but scheduling is more complicated.
  • Where to get the list of known files on commons? We could extract them from CirrusSearch dumps in analytics, or directly from the search clusters on dumps infra, but perhaps there are better ways.
  • Can we provide guidelines for how external users should rate limit their retrieval of thumbnail images?

Details

Related Gerrit Patches:
operations/puppet (production): Perform weekly dumps of all public media urls

Event Timeline

EBernhardson triaged this task as Medium priority.

We dump a list of media filenames (namespace 6) for each wiki every day. These files reside here: https://dumps.wikimedia.org/other/mediatitles/

EBernhardson added a comment. (Edited) Dec 12 2019, 5:39 PM

A quick investigation of the simplest implementation: loading the dump of filenames into a Python set needs just shy of 9GB of memory. This is a bit heavy to run on the dumps infra, which already has many other things running, but it could be done in analytics.

Iterating two sorted lists in parallel would be a memory-efficient way of implementing the whitelist. We could sort the set of known files and store that on disk. Then for each swift container (255 of them, each containing ~5M thumbs) we can sort that listing as well, and iterate through both lists in parallel to see which thumbnails do not correspond to a known file. This is perhaps a bit more complex to implement than using a set, but would likely require under a GB of memory.
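A rough sketch of that merge, assuming the thumb listing has first been rewritten as `title<TAB>thumb_key` lines and both inputs have been sorted with the same byte collation (e.g. LC_ALL=C sort); names are illustrative:

```
def merge_whitelist(sorted_titles_path: str, sorted_thumbs_path: str, out_path: str) -> None:
    """Stream two sorted files in parallel, keeping only thumbs whose original
    title appears in the known-titles list. Memory use stays constant."""
    with open(sorted_titles_path, 'rb') as titles, \
         open(sorted_thumbs_path, 'rb') as thumbs, \
         open(out_path, 'wb') as out:
        title = titles.readline().rstrip(b'\n')
        for line in thumbs:
            # each thumb line is assumed to be 'original_title<TAB>thumb_key'
            want, _, thumb_key = line.rstrip(b'\n').partition(b'\t')
            while title and title < want:              # advance the title stream
                title = titles.readline().rstrip(b'\n')
            if title == want:
                out.write(thumb_key + b'\n')
```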

What do the internal swift urls look like? I'm not sure why we need two lists.
Also, while Python works as a baseline, I don't expect it to be the most efficient option. A better one could be constructed directly in C, or it could even use a Bloom filter rather than a set.

ArielGlenn added a comment. (Edited) Dec 13 2019, 8:11 AM

The idea is to let people download thumbs that already exist, when grabbing bulk thumbs as a sample set for research; asking MW to generate thumbs turns out to be a big slowdown for thumb downloads.

Even though we are talking about a lot of urls, it's still only liable to take 10 minutes or so total to produce the file list. Even 30 minutes wouldn't be an issue, and I'd prefer Python for maintainability in that case.

Still thinking about ways to do it (without a set in memory). I do think comparing sorted lists in some fashion is going to be the way to go. How many swift containers are there for thumbs on any given wiki? Or across all wikis, for that matter?

What do the internal swift urls look like? I'm not sure why we need two lists.
Also, while Python works as a baseline, I don't expect it to be the most efficient option. A better one could be constructed directly in C, or it could even use a Bloom filter rather than a set.

Internal urls from swift look like (from the wikipedia-commons-local-thumb.c6 container): c/c6/View_southwest_from_Ben_Lawers,_Scottish_Highlands,_Scotland.jpg/300px-View_southwest_from_Ben_Lawers,_Scottish_Highlands,_Scotland.jpg

A bloom filter isn't going to be appropriate. To paraphrase the wiki page, a bloom filter tells us that the element either definitely is not in the set or may be in the set. For a security whitelist we need to know concretely that the element is in the set, a guarantee a bloom filter can't give.
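Based on the example key above and the public URL in the task description (and as confirmed later in this thread), converting an internal key to a public URL is essentially prepending the public thumb prefix. A sketch; whether any additional percent-encoding is needed for the public form is an assumption to verify:

```
# Sketch based on the example paths in this thread; encoding handling is assumed.
THUMB_URL_PREFIX = "https://upload.wikimedia.org/wikipedia/commons/thumb/"

def to_public_url(swift_thumb_key: str) -> str:
    # 'c/c6/View_southwest_from_Ben_Lawers,_Scottish_Highlands,_Scotland.jpg/300px-...'
    # -> 'https://upload.wikimedia.org/wikipedia/commons/thumb/c/c6/...'
    return THUMB_URL_PREFIX + swift_thumb_key
```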

The idea is to let people download thumbs that already exist, when grabbing bulk thumbs as a sample set for research; asking MW to generate thumbs turns out to be a big slowdown for thumb downloads.
Even though we are talking about a lot of urls, it's still only liable to take 10 minutes or so total to produce the file list. Even 30 minutes wouldn't be an issue, and I'd prefer Python for maintainability in that case.
Still thinking about ways to do it (without a set in memory). I do think comparing sorted lists in some fashion is going to be the way to go. How many swift containers are there for thumbs on any given wiki? Or across all wikis, for that matter?

Thinking about the simplest way to do this:

  • Dump both lists into files (the commons file titles are already dumped; the swift command can dump the container listings)
  • Use the unix sort command to get them in order. This natively handles files of any size.
  • Use a python script to read both sequentially and write only the whitelisted URIs to a new file
  • Delete intermediate data files

I'm ok with that, at least to try it out. If it turns out to be unworkable for some reason, there wouldn't be a huge amount of time sunk into a PoC. Just gotta be sure that the format of the titles in both lists is the same.

Then the internal urls are really no different than the public ones. Converting them would simply mean prepending "https://upload.wikimedia.org/wikipedia/commons/thumb/"

Note: you mention the commons-local-thumb.c6 container. But what if the original image size was smaller than 300px? I would expect that you would need to go to a non-thumb container.

I would probably tend to implement it the other way around: instead of starting with the swift urls, start with the image table, split it into 256 files (lists), then for each of them check whether the thumb is already cached, which could use bulk requests to stat many files at once, or be done locally from a list of all swift urls.

Then the internal urls are really no different than the public ones. Converting them would simply mean prepending "https://upload.wikimedia.org/wikipedia/commons/thumb/"
Note: you mention the commons-local-thumb.c6 container. But what if the original image size was smaller than 300px? I would expect that you would need to go to a non-thumb container.
I would probably tend to implement it the other way around: instead of starting with the swift urls, start with the image table, split it into 256 files (lists), then for each of them check whether the thumb is already cached, which could use bulk requests to stat many files at once, or be done locally from a list of all swift urls.

If the goal was to generate a single dataset to solve a single group's request, that might be viable. I'm not really looking to build and run custom dumps for each use case though. I'm thinking that most use cases can be solved by providing regularly generated, very direct dumps of what is available, and letting external users parse those down into what is useful for their case. Likely I can write this generically enough to dump any mediawiki-owned swift media container with public wmf urls. My goal is to provide the list of all valid urls, and let others figure out how to best select the URLs that match their use case.

Change 561356 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/puppet@production] Perform weekly dumps of all public media urls

https://gerrit.wikimedia.org/r/561356

ArielGlenn added a comment. (Edited) Jan 3 2020, 3:04 PM

A couple questions as I read through the patch:

How long does it take to list one of these swift containers, say the one for en wiki thumbs, which is probably among the largest?
The script as written will also produce a listing for commonswiki, do we want that? How long would those containers take to list?
Do we have any sort of timeout on swift commands anyways, or thoughts about retries if something is unavailable/broken?
Which python dbm module is used here? Do we want python3-gdbm?

I'll have more questions and/or comments in the next day or so.

How long does it take to list one of these swift containers, say the one for en wiki thumbs, which is probably among the largest?

This seems to get urls from swift at about 20k/sec; for the 1.3B commons thumbs that works out to about 18 hours. I didn't check enwiki, assuming commons would be an order of magnitude more than the others, but I could look into it. If we want things to take less time, this could be parallelized over the list of containers to dump (255); we could probably do 4 at a time or some such.

The script as written will also produce a listing for commonswiki, do we want that? How long would those containers take to list?

Commonswiki was the primary goal; as above, it takes around 18 hours. Compressed, the output is around 7GB.

Do we have any sort of timeout on swift commands anyways, or thoughts about retries if something is unavailable/broken?

I'm mostly relying on the internal retries of the swift CLI client. The client itself is requesting paginated results from swift at 10k results per api request, so the internal timeouts are per-10k results as opposed to a full end-to-end listing.
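The paginated listing described here could look roughly like the following, using python-swiftclient rather than the CLI (the library call shown and the auth/connection handling are assumptions for illustration, not what the patch actually does):

```
# Rough sketch of marker-based container listing with python-swiftclient.
from swiftclient.client import Connection

def list_container(conn: Connection, container: str, page_size: int = 10000):
    """Yield object names from one container, one API request per `page_size` results."""
    marker = None
    while True:
        _headers, objects = conn.get_container(container, marker=marker, limit=page_size)
        if not objects:
            break
        for obj in objects:
            yield obj['name']
        marker = objects[-1]['name']    # resume the next page after the last name seen
```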

I'm not entirely sure what to do if one of the dumps were to fail. We could check the script return code and make sure to delete the failed output, but there isn't a good way I'm aware of to retry specific dumps.

Which python dbm module is used here? Do we want python3-gdbm?

I tested this from stat1007, where dbm.whichdb('...') tells me it used ndbm; the unix file command against the output db reports Berkeley DB. python3-gdbm doesn't seem to be installed, suggesting it's not necessary?

How long does it take to list one of these swift containers, say the one for en wiki thumbs, which is probably among the largest?

This seems to get urls from swift at about 20k/sec; for the 1.3B commons thumbs that works out to about 18 hours. I didn't check enwiki, assuming commons would be an order of magnitude more than the others, but I could look into it. If we want things to take less time, this could be parallelized over the list of containers to dump (255); we could probably do 4 at a time or some such.

The script as written will also produce a listing for commonswiki, do we want that? How long would those containers take to list?

Commonswiki was the primary goal; as above, it takes around 18 hours. Compressed, the output is around 7GB.

How does swift hold up under this? Can we run one process doing commons and another/others doing the rest, or is that going to be a noticeable load on the servers? I'm going to add @fgiunchedi for comments on this.

Do we have any sort of timeout on swift commands anyways, or thoughts about retries if something is unavailable/broken?

I'm mostly relying on the internal retries of the swift CLI client. The client itself is requesting paginated results from swift at 10k results per api request, so the internal timeouts are per-10k results as opposed to a full end-to-end listing.

Sounds good.

I'm not entirely sure what to do if one of the dumps were to fail. We could check the script return code and make sure to delete the failed output, but there isn't a good way I'm aware of to retry specific dumps.

Cleanup will work just fine for this. I would suggest doing a few passes over the process-specific list (commons, or all, or whatever it is) and if the output is not there, rerun it; if output is there for all, we call it a success and go home, otherwise we fail out after the specified number of passes.
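A hypothetical sketch of that pass-based retry (the real wrapper in the related patch is a bash script; the function and variable names here are made up):

```
import os
import subprocess

def run_with_passes(targets: dict, max_passes: int = 3) -> bool:
    """targets maps an expected output path to the command (argv list) that produces it."""
    for _ in range(max_passes):
        missing = {out: cmd for out, cmd in targets.items() if not os.path.exists(out)}
        if not missing:
            return True                     # all outputs present: success, go home
        for out, cmd in missing.items():
            result = subprocess.run(cmd)
            if result.returncode != 0 and os.path.exists(out):
                os.remove(out)              # drop partial output so a later pass retries it
    return all(os.path.exists(out) for out in targets)
```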

Which python dbm module is used here? Do we want python3-gdbm?

I tested this from stat1007, where dbm.whichdb('...') tells me it used ndbm; the unix file command against the output db reports Berkeley DB. python3-gdbm doesn't seem to be installed, suggesting it's not necessary?

Yeesssss but here's what I'm thinking, though it's not a great thought: the dbm module uses 'whatever is available' by trying imports and taking the first one that works. If the python gdbm module winds up on the snapshot hosts in the future for some independent reason, that will switch the format unbeknownst to us. I don't mind using dbm, I'd just like to lock in the format. Thoughts?

How long does it take to list one of these swift containers, say the one for en wiki thumbs, which is probably among the largest?

This seems to get urls from swift at about 20k/sec; for the 1.3B commons thumbs that works out to about 18 hours. I didn't check enwiki, assuming commons would be an order of magnitude more than the others, but I could look into it. If we want things to take less time, this could be parallelized over the list of containers to dump (255); we could probably do 4 at a time or some such.

The script as written will also produce a listing for commonswiki, do we want that? How long would those containers take to list?

Commonswiki was the primary goal; as above, it takes around 18 hours. Compressed, the output is around 7GB.

How does swift hold up under this? Can we run one process doing commons and another/others doing the rest, or is that going to be a noticeable load on the servers? I'm going to add @fgiunchedi for comments on this.

Just listing containers isn't going to pose problems for swift even with a little bit of concurrency, so +1 on my end!

Can we run one process doing commons and another/others doing the rest,

I've adjusted the script to parallelize checking the containers, and adjusted the bash script to invoke it with 4 workers. The workers coordinate based on an output lock so only one writes at a time, with a reasonable per-thread buffer. Seems to work reasonably well.
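A rough sketch of that coordination pattern using the multiprocessing module (the listing helper and the container name generation are placeholders, not the actual script):

```
import multiprocessing
import sys

def init_worker(lock):
    global write_lock
    write_lock = lock

def flush(lines):
    with write_lock:                                   # only one worker writes at a time
        sys.stdout.write('\n'.join(lines) + '\n')
        sys.stdout.flush()

def dump_container(container, buffer_size=10000):
    buf = []
    for name in list_thumb_container(container):       # hypothetical per-container listing helper
        buf.append(name)
        if len(buf) >= buffer_size:
            flush(buf)
            buf = []
    if buf:
        flush(buf)

if __name__ == '__main__':
    # shard naming assumed from the example container earlier in the thread;
    # the real list would come from swift itself
    containers = ['wikipedia-commons-local-thumb.%02x' % i for i in range(256)]
    lock = multiprocessing.Lock()
    with multiprocessing.Pool(4, initializer=init_worker, initargs=(lock,)) as pool:
        pool.map(dump_container, containers)
```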

I'm not entirely sure what to do if one of the dumps were to fail. We could check the script return code and make sure to delete the failed output, but there isn't a good way I'm aware of to retry specific dumps.

Cleanup will work just fine for this. I would suggest doing a few passes over the process-specific list (commons, or all, or whatever it is) and if the output is not there, rerun it; if output is there for all, we call it a success and go home, otherwise we fail out after the specified number of passes.

I've adjusted the bash script to only run the dump if the output doesn't already exist, and to delete the temp file if there is a problem with the dump script. This should now reasonably handle being run multiple times, as long as it's not being run multiple times concurrently.

Which python dbm module is used here? Do we want python3-gdbm?

I tested this from stat1007, where dbm.whichdb('...') tells me it used ndbm; the unix file command against the output db reports Berkeley DB. python3-gdbm doesn't seem to be installed, suggesting it's not necessary?

Yeesssss but here's what I'm thinking, though it's not a great thought: the dbm module uses 'whatever is available' by trying imports and taking the first one that works. If the python gdbm module winds up on the snapshot hosts in the future for some independent reason, that will switch the format unbeknownst to us. I don't mind using dbm, I'd just like to lock in the format. Thoughts?

I suppose I'm optimistic that, since we are using very basic functionality of dbm, it should work regardless of which backend is selected. We can certainly install gdbm, though, to have a little more certainty around what it is selecting.
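For what it's worth, one way to pin the format is to open the file through a specific dbm submodule instead of letting the dbm package auto-select a backend (paths below are made up; dbm.gnu requires the python3-gdbm package):

```
import dbm          # dbm.open() auto-selects whichever backend is importable
import dbm.gnu      # explicit GNU dbm backend (needs python3-gdbm installed)

# Reports which backend wrote an existing file ('dbm.ndbm', 'dbm.gnu', ...)
print(dbm.whichdb('/path/to/existing-output.db'))

# Opening via a specific submodule locks in the on-disk format regardless of
# what other backends get installed on the host later.
db = dbm.gnu.open('/tmp/media-urls.gdbm', 'c')
db[b'c/cb/Example.jpg/750px-Example.jpg'] = b'1'
db.close()
```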

ArielGlenn added a comment. (Edited) Jan 14 2020, 9:03 AM

<snip>

I've adjusted the script to parallelize checking the containers, and adjusted the bash script to invoke it with 4 workers. The workers coordinate based on an output lock so only one writes at a time, with a reasonable per-thread buffer. Seems to work reasonably well.

Have you verified that the locking works with nfs (v3)? That's what we have backing the filesystem where output files are written. Never mind, I see this is based on the multiprocessing module locks.

<snip2>

I tested this from stat1007, where dbm.whichdb('...') tells me it used ndbm; the unix file command against the output db reports Berkeley DB. python3-gdbm doesn't seem to be installed, suggesting it's not necessary?

Yeesssss but here's what I'm thinking, though it's not a great thought: the dbm module uses 'whatever is available' by trying imports and taking the first one that works. If the python gdbm module winds up on the snapshot hosts in the future for some independent reason, that will switch the format unbeknownst to us. I don't mind using dbm, I'd just like to lock in the format. Thoughts?

I suppose I'm optimistic that, since we are using very basic functionality of dbm, it should work regardless of which backend is selected. We can certainly install gdbm, though, to have a little more certainty around what it is selecting.

It's not about the functionality but the binary format; ndbm and gdbm are not byte-compatible as far as the db files themselves go.