
Improve access to Commons image data for research and development
Open, Medium, Public

Description

The Commons image data (i.e. the image files with pixels) are not available on stats machines.
To do research on image analysis, we have to download Commons images from the API to the stats machines.
We would like to start a discussion with other tech staff to identify ways to gain easy access to the Commons pixel data.

Current known blockers

  • Getting a read-only account on swift to be able to access the data.
  • Identifying the intended use cases (e.g. originals and/or thumbnails only?) and what easy access means in this context.

Potential Use Cases

  • Train large-scale machine learning models for image quality scoring
    • Minimum image size: 256px by 256px
    • Minimum number of images: 200K
  • Train machine learning models for Commons category prediction
    • Minimum image size: 227px by 227px
    • Minimum number of images: 1K per category

Event Timeline

leila triaged this task as Medium priority. Jan 11 2018, 5:32 PM

@fgiunchedi there is no urgency about this task at the moment, and I know Miriam has already talked with you about this data. When you have some time, can you help us expand the task description with some of the current blockers to making this happen?

leila renamed this task from "Discuss about accessing the Commons image data" to "Improve access to Commons image data for research and development". Jan 11 2018, 5:36 PM

@leila for sure, one blocker is certainly getting a read-only account on swift to be able to access the data. Another one I can think of is identifying the intended use cases (e.g. originals and/or thumbnails only?) and what easy access means in this context.

@fgiunchedi makes sense.

@Miriam regarding the use-cases, let's chat to see how to collect these use-cases. When you have some extra time, you can start listing some of the use-cases in the task Description here.

@leila @fgiunchedi Thanks for your comments!
I updated the task description with potential use-cases, I hope they make sense.

Re - easy access: this would mean having a folder on the stats machines (e.g. /mnt/data/images) containing all the Commons image files, and a mapping (table or rule) between the file path on the server and the image filename, for example:

<path_on_stats_machines>,<image_name>
/mnt/data/images/a/a2/Colosseum_Roma.jpg,Colosseum_Roma.jpg

Does it make sense?

It does! I'm assuming here that the list of filenames to download is already known. I see at least a couple of ways to have said images on something that looks like a filesystem:

  1. Copy/download the actual data like a regular client would, assuming there's enough space available and download times are reasonable. This is something anyone could do, since the images are public.
  2. Don't copy/download the data, but access swift directly and expose it as a filesystem (i.e. what FUSE does). This requires an account to talk directly to swift, and is thus not generally available.

Given that the subset of images seems reasonably limited, option 1 looks easier to start with. If resources become a problem (i.e. not enough space available, downloads taking too long, etc.) then solution 2 could be investigated, but it is IMO significantly more complex.

WRT solution 2 consider also that the backing system is still object-storage with a lot of files, thus certain operations (e.g. listing all files) would not be practical to achieve.

HTH!

Thanks @fgiunchedi

The ideal scenario would be to have the whole Commons repository available on the machines. Possibly updated say once a month.

This would be useful in case we want, for example, to score or tag with our models a large portion or all images in the Commons. Or in case we need easy access to image data for updating or improving the models.

40M images at 256px resolution is not a huge amount of data. And since this is not urgent, we could do the transfer through option (1), provided we have resources available and permission to use them.

Do you think this would be feasible?

I believe that would be feasible, yeah; especially at 256 pixels width the whole archive shouldn't take a lot of space (on the order of 2.6TB if the images were 256x256, i.e. roughly 65kB per image across ~40M files). I'm happy to code review the related scripts to fetch images when the time comes!

I'd like to see this available for download by the public. This would not replace the media tarballs we used to provide, but it would be a good start.

WRT solution 2 consider also that the backing system is still object-storage with a lot of files, thus certain operations (e.g. listing all files) would not be practical to achieve.

This is good to keep in mind in general, but for the listing all files example, we can use the mediawiki databases to get images currently on a wiki:

[commonswiki]> select page_title from page where page_namespace = 6 limit 10;

As well as deleted images (are these still in Swift, or do we delete them for good?):

[commonswiki]> select distinct ar_title from archive where ar_namespace = 6 limit 10;

This will probably be useful in both option 1 and option 2. Also, depending on how deleted images are handled, and where the processing is being done, and how many wikis we're going over, it might be easier to query mediawiki_page_history in hive (https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_page_history). This has all the wikis in one place and information about whether a page was deleted and when it was deleted. This latter bit might be interesting if you want to classify what kind of images get deleted and how that happens over time (if we have the deleted images still available in object storage, that is).
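
For instance, a hedged sketch of that hive route via Spark SQL; the table and column names below (wmf.mediawiki_page_history with snapshot, wiki_db, page_namespace, page_title and page_is_deleted) are my reading of the linked docs and may need adjusting:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("commons-file-pages").getOrCreate()

# File pages (namespace 6) on commonswiki, including whether they were deleted.
file_pages = spark.sql("""
    SELECT page_title, page_is_deleted
    FROM wmf.mediawiki_page_history
    WHERE snapshot = '2018-01'      -- assumed snapshot partition name
      AND wiki_db = 'commonswiki'
      AND page_namespace = 6
""")
file_pages.show(10)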

I agree this would be the best way, namely obtain from mediawiki a list of interesting filenames and their metadata and then fetch the image bytes from swift.

As well as deleted images (are these still in Swift, or do we delete them for good?):

[commonswiki]> select distinct ar_title from archive where ar_namespace = 6 limit 10;

I believe the archive is kept, so those should be available as well.

238482n375 lowered the priority of this task from Medium to Lowest.
238482n375 moved this task from Next Up to In Code Review on the Analytics-Kanban board.
238482n375 edited subscribers, added: 238482n375; removed: Aklapper.

Aklapper raised the priority of this task from Lowest to Medium.
leila moved this task from In Progress to Backlog on the Research board.
Miriam removed subscribers: dr0ptp4kt, Milimetric, zhuyifei1999.

Just removing the Research tag, I'll still be the POC for this in the future.

Hi, revisiting this subject! With T220081 the swift cluster is reachable from analytics; does this allow us to proceed with one or both of the options described?

For the use case I have in mind - a spark pipeline that enriches the image bytes with other features about the image (e.g. descriptions/captions/categories/usages) and stores a dataset on hdfs as input for machine learning pipelines - option 1 (download) might be more suitable? In terms of querying the swift cluster, are there good practices for how many queries per second we can safely execute? At first the pipeline will run on small datasets on a single thread to create a proof of concept. @fgiunchedi any input you might have is appreciated, especially pointers to code that queries swift or docs would be super useful. Thank you.
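
A minimal sketch of what such a pipeline could look like, assuming a dataframe of (file_name, url) pairs is already available (how to build those URLs is discussed in the comments below); the input/output paths are placeholders:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import BinaryType
import urllib.request

spark = SparkSession.builder.appName("commons-image-dataset").getOrCreate()

@udf(returnType=BinaryType())
def fetch_bytes(url):
    # Download the image bytes for one row; failures simply yield NULL here.
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.read()
    except Exception:
        return None

# `images` is assumed to have columns file_name and url, plus whatever
# description/caption/category features get joined in.
images = spark.read.parquet("/user/example/commons_image_list")      # hypothetical input path
dataset = images.withColumn("image_bytes", fetch_bytes("url"))
dataset.write.parquet("/user/example/commons_image_dataset")         # hypothetical output path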

Hello @fkaelin, thanks for reaching out!

Yes, that's correct: you should be able to reach swift (i.e. ms-fe.svc.eqiad.wmnet) from the analytics cluster.

Agreed, downloading the images from swift with option (1) is the way to go in this case. WRT download rate, anything below, say, 80-100 downloads/s should pose no problems.

When downloading public thumbnails you can use any HTTP client, provided of course you have the list of commons (or other wikis really) file names you are interested in.

This task has a little more context, but the tl;dr is that the mediawiki database(s) or APIs can be queried for a list of filenames (e.g. for File:US_Capitol_west_side.JPG the original lives at /wikipedia/commons/4/4f/US_Capitol_west_side.JPG) and from there you can derive a predictable URL to download the 256px-wide thumbnails from (in the example above, the thumbnail is at /wikipedia/commons/thumb/4/4f/US_Capitol_west_side.JPG/256px-US_Capitol_west_side.JPG).
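
As a small illustration of that path-to-thumbnail rewrite (just a sketch built from the example paths above):

import posixpath

def thumb_url_from_original(original_path, width=256):
    # e.g. /wikipedia/commons/4/4f/US_Capitol_west_side.JPG
    prefix, name = posixpath.split(original_path)
    # insert /thumb after the project prefix, keep the md5 shard directories as-is
    shard = prefix.replace("/wikipedia/commons", "/wikipedia/commons/thumb", 1)
    return f"{shard}/{name}/{width}px-{name}"

assert thumb_url_from_original("/wikipedia/commons/4/4f/US_Capitol_west_side.JPG") == \
    "/wikipedia/commons/thumb/4/4f/US_Capitol_west_side.JPG/256px-US_Capitol_west_side.JPG"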

Hope that helps!

Thanks for the information @fgiunchedi.

How are the ACLs for images configured? I was able to query containers owned by the analytics:admin swift user that I am testing with, but not any of the image containers, though it is very possible that the request is not configured correctly. Or do I need to use specific credentials?

Assuming I have a collection of filenames, is the "predictable URL" you mention the sharding done in FileBackendStore.php#1659 with this config?

Thank you!

For public images (e.g. commons) I think it might be simpler to download from swift the same way that e.g. the frontend cache would, in other words URLs like this (following the example above):

https://ms-fe.svc.eqiad.wmnet/wikipedia/commons/4/4f/US_Capitol_west_side.JPG

The code refers to container sharding, which is specific to swift and hidden from e.g. the URL above. When the input is a list of filenames (assuming all from commons), I think the easiest way to get a file's path is to ask the MW action API via prop=imageinfo&iiprop=url, for example for the US Capitol image. That's to the best of my knowledge at least; perhaps there are better/different ways to go from a filename to its URL.

Note that the API will return the public hostname (i.e. upload.wikimedia.org), which is cached; for internal usage you should access swift directly (hence the ms-fe.svc.eqiad.wmnet hostname above).
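
For instance, a minimal sketch of that API lookup against the standard commons.wikimedia.org action API; the iiurlwidth parameter additionally returns a thumbnail URL at the requested width:

import requests

API = "https://commons.wikimedia.org/w/api.php"

def image_urls(title, thumb_width=256):
    params = {
        "action": "query",
        "format": "json",
        "titles": "File:" + title,
        "prop": "imageinfo",
        "iiprop": "url",
        "iiurlwidth": thumb_width,
    }
    pages = requests.get(API, params=params).json()["query"]["pages"]
    info = next(iter(pages.values()))["imageinfo"][0]
    # info["url"] is the original (upload.wikimedia.org), info["thumburl"] the thumbnail;
    # for internal use, rewrite the hostname to ms-fe.svc.eqiad.wmnet as noted above.
    return info["url"], info.get("thumburl")

print(image_urls("US_Capitol_west_side.JPG"))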

Hope that helps!

Hi! Thanks @fgiunchedi for this info! This is what I have for getting/downloading image URLs given the filename.
Take this URL: https://upload.wikimedia.org/wikipedia/commons/thumb/a/a8/Tour_Eiffel_Wikimedia_Commons.jpg/600px-Tour_Eiffel_Wikimedia_Commons.jpg

  • The first part is always the same: https://upload.wikimedia.org/wikipedia/commons/; this should be replaced with the swift URL.
  • The second part is the first character of the MD5 hash of the file name. For example, the MD5 hash of Tour_Eiffel_Wikimedia_Commons.jpg is a85d416ee427dfaee44b9248229a9cdd, so we get /a.
  • The third part is the first two characters of the MD5 hash from above: /a8.
  • The fourth part is the file name: /Tour_Eiffel_Wikimedia_Commons.jpg
  • Then you have the thumbnail size, e.g. 600px, and again the file name /600px-Tour_Eiffel_Wikimedia_Commons.jpg

My script:

#!/bin/bash
# Download 600px thumbnails for quality images, using the query results from https://quarry.wmflabs.org/query/22050
# Usage: <script> <file_with_one_image_name_per_line> <output_directory>
thumbnailsize=600

while read -r name
do
	# MediaWiki shards uploads by the MD5 hash of the file name: /<first hex char>/<first two hex chars>/<file name>
	long=$(echo -n "$name" | md5sum)
	first=$(echo "$long" | head -c1)
	second=$(echo "$long" | head -c2)
	wget -q "https://upload.wikimedia.org/wikipedia/commons/thumb/$first/$second/$name/${thumbnailsize}px-$name" -O "$2/$name"
done < "$1"

Hope this helps :)

Thanks @Miriam, that's indeed even better as no API access is required!

Thanks for the pointers @fgiunchedi and @Miriam. This approach works well, including scaling the download on spark.

For reference, this is the swift middle layer that rewrites a request from the URL (e.g. https://ms-fe.svc.eqiad.wmnet/wikipedia/commons/4/4f/US_Capitol_west_side.JPG) into the swift object request (i.e. container, object). It is not clear to me why the sharding implementation (i.e. how the images are distributed over many swift containers) is visible outside of that middle layer. It would seem preferable to have the URL be https://ms-fe.svc.eqiad.wmnet/wikipedia/commons/US_Capitol_west_side.JPG, and have the md5-based sharding applied when the request is rewritten into a swift object request.

I use a configurable number of workers that download images sequentially. Some basic measurements show one worker can download ~5 images/second for 256px thumbnails. I assume caching can have quite an impact on this, and I only tested with small datasets (~2000 images) with a low number of workers querying swift (20 threads, so ~100 qps total). I was looking at the swift grafana dashboard, but the load was too small to be visible, or I am looking in the wrong place. @fgiunchedi, is this the dashboard to monitor the swift cluster when running a larger job?
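
A rough sketch of that worker-based download step (assumptions: the md5-sharded thumbnail URL layout described earlier, direct HTTPS access to ms-fe.svc.eqiad.wmnet, and a plain thread pool standing in for the spark executors):

import concurrent.futures
import hashlib
import os
from urllib.parse import quote
import requests

SWIFT = "https://ms-fe.svc.eqiad.wmnet"

def thumb_url(name, width=400):
    # Shard directories come from the md5 of the (underscored) file name.
    md5 = hashlib.md5(name.encode("utf-8")).hexdigest()
    return f"{SWIFT}/wikipedia/commons/thumb/{md5[0]}/{md5[:2]}/{quote(name)}/{width}px-{quote(name)}"

def fetch(name, out_dir="thumbs", timeout=5):
    resp = requests.get(thumb_url(name), timeout=timeout)
    resp.raise_for_status()
    with open(os.path.join(out_dir, name), "wb") as f:
        f.write(resp.content)
    return name

# Example names from this task; normally the list would come from the mediawiki page table.
names = ["Colosseum_Roma.jpg", "Tour_Eiffel_Wikimedia_Commons.jpg"]
os.makedirs("thumbs", exist_ok=True)
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
    for downloaded in pool.map(fetch, names):
        print("downloaded", downloaded)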

It is a long story :) I don't have time to dig into it, but the tl;dr is that, as far as I know, the filename sharding is exposed for historical reasons.

You are looking in the right spot yes!

I picked this up again last week, and ran a more substantial test job: 50 workers downloading ~1 million commons images (400px thumbnails) using a spark job. Some more questions before I run a job on the full dataset (~53M image files). Looking at the grafana dashboard:

  • what does the increase in PUT 201s in the object state-changing chart mean? Cache misses for thumbnails that get filled in?
  • client errors chart: we do expect to see a lot of 404s since some images we query for will have been deleted. However, I also notice a high number of timeouts with a timeout of 5 seconds. Is this to be expected? I am doing retries and will increase the timeout, but it seems high.

@fgiunchedi Does this dashboard and approach look ok to you from the swift perspective? If so, I will kick off the main job this week; it is expected to run for ~6 days.

Messed up the link to the grafana dashboard above (edited), adding it as a screenshot:

swift_grafana.png (screenshot, 534 KB)

Possible, but hard to say from that graph; when did the job start/finish? I'm assuming ~23:30 to ~1:40, but best to confirm.
Something else to check for thumbnailing activity is the Thumbor dashboard (for the same timeframe):
https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor?orgId=1&from=1615330800000&to=1615341600000

Yes, some timeouts are to be expected for sure; were the timeouts for a certain kind of file type? It might be a thumb miss plus a long thumb regeneration time, or it might be Swift timing out while fetching the image. Indeed, timeout + exponential retries should get you basically all the way there.
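
A simple sketch of that timeout-plus-exponential-retries pattern (nothing WMF-specific, just a generic helper):

import time
import requests

def fetch_with_retries(url, timeout=5, attempts=4, base_delay=1.0):
    for attempt in range(attempts):
        try:
            resp = requests.get(url, timeout=timeout)
            if resp.status_code == 200:
                return resp.content
            if resp.status_code == 404:
                return None                        # genuinely missing, don't retry
        except requests.exceptions.Timeout:
            pass                                   # fall through to the backoff below
        time.sleep(base_delay * (2 ** attempt))    # exponential backoff: 1s, 2s, 4s, ...
    raise RuntimeError(f"giving up on {url} after {attempts} attempts")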

It seems to generally be fine, although I'm surprised at the thumbnailing activity being higher than expected, as 400px thumbnails should be pre-generated at upload time by MW for most wikis ($wgThumbLimits in mediawiki-config).

Thanks for that dashboard, that is useful to look at. I should have clarified the period; your assumption is right, ~23:30 to ~1:40.

Looking at the errors, I noticed that I actually used a 3s timeout, but still there are many timeouts: with no retries, almost 25% of requests fail. All errors are timeout errors, and the distribution of file types of successful and failed attempts is roughly the same. Is it possible that the servers are somehow overloaded?

  • the 75th and 90th percentile latencies seem very high at almost 15 seconds for two of the four hosts (thumbor1003, thumbor1004). Plus it seems like the max latency is 15 seconds?
  • there is a spike of 429 codes around 1:30 as well; is swift rate limiting internally too?

image.png (screenshot, 870 KB)

It is certainly possible: during Swift rebalances, when we add/remove hardware, latencies are bound to go up (although there were no rebalances when the job ran). And of course Thumbor can only process so many thumbnails at a time. I'm surprised, though, that there's so much thumbnail activity; I would have expected the 400px images to already be in swift.

Yes, swift does rate limit to some extent, and thumbor does reply with 429 for thumbnails that repeatedly fail to render.