
Allow GWT uploads from ggpht.com
Closed, Invalid (Public)

Description

Please add the following domain(s) to the wgCopyUploadsDomains whitelist:
*.ggpht.com

This is to support uploads from the Rijksmuseum, for example:
http://lh4.ggpht.com/PwdJop7AQKAOvtiEZnfLkLezQKOyO8le69XKOxVYwtLoF0hAfkAa2o_u5eA8CAW_tk4Dm0gfjU8kTyDA6TW8hIAFXg=s0
is the image that the Rijksmuseum returns via their API for the artefact at:
http://www.rijksmuseum.nl/collectie/AK-MAK-127


Version: wmf-deployment
Severity: enhancement

Details

Reference
bz64907

Event Timeline

bzimport raised the priority of this task from to Normal.
bzimport set Reference to bz64907.
bzimport added a subscriber: Unknown Object (MLST).
Fae created this task. May 5 2014, 5:58 PM
Fae added a comment. May 5 2014, 6:03 PM

From a small test set, I see that lh3, lh4, lh5 and lh6 are all subdomains used by the RM (Rijksmuseum) for images hosted at ggpht.com. However, I feel that a more complex regex restriction is likely to be unnecessary.

tomasz added a comment. May 5 2014, 6:33 PM

Apparently ggpht.com is a domain "Google uses to host data for YouTube", according to the internets; I'm unsure whether whitelisting the whole domain is a good idea.

Fae added a comment. May 5 2014, 7:03 PM

Perhaps we can whitelist "lh\d*.ggpht.com" as a more limited regex?
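
For illustration, here is a minimal Python sketch of what a hostname check along these lines might look like. It is not how MediaWiki evaluates wgCopyUploadsDomains; the function name `is_lh_host` and the tightened pattern are assumptions made for this example. As quoted, the pattern has an unescaped dot (which matches any character) and `\d*` (which also accepts no digits at all), so the sketch escapes the dots, requires at least one digit, and matches the whole hostname.

```python
import re
from urllib.parse import urlparse

# Assumption: a stricter variant of the pattern quoted above, with the dots
# escaped and at least one digit required. This is not a MediaWiki setting.
LH_HOST = re.compile(r"lh[0-9]+\.ggpht\.com", re.IGNORECASE)


def is_lh_host(url: str) -> bool:
    """Return True if the URL's hostname is exactly lh<digits>.ggpht.com."""
    host = urlparse(url).hostname or ""
    # fullmatch anchors both ends, so look-alike hosts with extra suffixes fail.
    return LH_HOST.fullmatch(host) is not None
```

Matching the full hostname rather than searching inside the URL avoids accepting hosts such as lh4.ggpht.com.evil.example, which merely start with the expected name.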

(In reply to Fæ from comment #3)

Perhaps we can whitelist "lh\d*.ggpht.com" as more limited regex?

I don't think that would make a significant difference (vs. a wildcard).

I don't know Commons policy well, but I guess Flickr is OK because there are bots that check what the licensing is at Flickr and record the value in an edit to the file description page. (And even then we still sometimes have to worry about Flickr washing.)

(In reply to Tomasz W. Kozlowski from comment #2)

Apparently ggpht.com is a domain "Google uses to host data for YouTube",

It seems to be more widespread than that, e.g. it also hosts Picasa pictures.

This would allow essentially the same range of content and uploaders as Google Drive, unless we had a bot that somehow checked the license metadata associated with a given URL (as we do with Flickr).

Fae added a comment. May 5 2014, 11:32 PM

Apart from a more complex regex, like the "lh\d" or maybe "lh[1-9]" domain limitation, I am unsure what else to recommend.

I welcome other eyes on the example at https://www.rijksmuseum.nl/nl/collectie/BK-1968-212. This shows an artefact image which is broken into tiles, and each tile appears to be hosted at ggpht.com. The API call I get my data from for the same artefact is https://www.rijksmuseum.nl/api/en/collection/BK-1968-212?key=xxxxxxxx&format=xml (with my API key blanked out). This gives some interesting values, including a link to the full image:
<guid>4a53f0d0-9e70-4d00-b4e4-8f6ac028d276</guid>
<url>http://lh3.ggpht.com/HMIugFrj7Ostdj-FshnLkVcb7WQhL-mUEeJKS5ODQtexbsfaKb2jaMroIN7s7W_HV2RbenFGhbxSymNdEJJVGzjfed7-=s0</url>
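
As a rough sketch of how the full image link could be pulled out of such a response in Python: only the `<guid>` and `<url>` elements are taken from the values above. The `<image>` wrapper element and the function name `full_image_url` are invented for this example, and the real response layout may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper: only <guid> and <url> come from the API values quoted
# above; the <image> root element is invented for this illustration.
SAMPLE = """<image>
  <guid>4a53f0d0-9e70-4d00-b4e4-8f6ac028d276</guid>
  <url>http://lh3.ggpht.com/HMIugFrj7Ostdj-FshnLkVcb7WQhL-mUEeJKS5ODQtexbsfaKb2jaMroIN7s7W_HV2RbenFGhbxSymNdEJJVGzjfed7-=s0</url>
</image>"""


def full_image_url(xml_text):
    """Return the text of the first nested <url> element, or None if absent."""
    root = ET.fromstring(xml_text)
    node = root.find(".//url")  # searches descendants of the root element
    if node is None or not node.text:
        return None
    return node.text.strip()
```

Calling `full_image_url(SAMPLE)` returns the lh3.ggpht.com link shown above.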

If there is a way of adding some suitable verification to the image page, which we might make a requirement for using this tricky Google domain, I would be happy to look into it.

There is the alternative of using the images available at Europeana. However, this limits us to whatever subset Europeana happens to be hosting (it is not simply a mirror) and in truth adds no value, as the Rijksmuseum images there were taken from the same source that I am trying to enable the GWT to read for itself.

Fae added a comment. May 11 2014, 2:08 AM

Some more research has led me to an alternative (which was not in the least bit obvious from their API).

In the previous example of artefact "BK-1968-212", I can upload from http://www.rijksmuseum.nl/media/assets/BK-1968-212 and not have to rely on the hosted version at Google.

I presume that the RM are using a Google mirror when serving images to end users to reduce their server traffic. Unfortunately, even their API does not provide the "internal" link as an alternative source. It has to be deduced and does not appear in the public-facing documentation.
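
A minimal Python sketch of that deduction, assuming the URL pattern generalises beyond the one example in this thread (it is inferred, not documented, so it may not hold for every object). The names `ASSET_BASE` and `asset_url` are hypothetical.

```python
from urllib.parse import quote

# Assumption: pattern inferred from the single BK-1968-212 example above;
# it does not appear in the museum's public documentation.
ASSET_BASE = "http://www.rijksmuseum.nl/media/assets/"


def asset_url(object_number: str) -> str:
    """Build the deduced direct image URL for a Rijksmuseum object number."""
    number = object_number.strip()
    if not number:
        raise ValueError("object number must not be empty")
    # Percent-encode anything unusual; letters, digits and hyphens pass through.
    return ASSET_BASE + quote(number, safe="")
```

For the example above, `asset_url("BK-1968-212")` yields http://www.rijksmuseum.nl/media/assets/BK-1968-212, which sits on a rijksmuseum.nl host rather than on ggpht.com.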

I am marking this request as resolved as I can apply this work-around.