Block hotlinking
Open, Normal, Public

Description

Before we do, we need to double-check what kind of referrers will be affected. But when I looked at 302 requests in T148410: Investigate source of thumbnail 302 redirects, the vast majority of hits were coming from garbage blogs and websites that just copied stale HTML from our projects with no added value. Most of the time those pages were filled with ads. If the referrer breakdown for 200s looks the same as for 302s, I see no reason to justify the ongoing expense.

The only semi-legit uses I stumbled upon (http://thewikigame.com/, http://www.buildyourmap.com/wikipediaimport.htm, http://www.wikiwand.com/) can certainly work around the hotlinking limitation we would introduce. Or we could act nice and whitelist them first, reach out to ask them to stop hotlinking, then get rid of the temporary whitelist.

If legit uses that align with our mission needed this to function, I think it would manifest itself in the access log. I think that allowing hotlinking is a "failed experiment" that led to nothing other than a silent waste of donor money.

Event Timeline

Gilles created this task. Dec 1 2016, 12:52 PM
Restricted Application added a subscriber: Aklapper. Dec 1 2016, 12:52 PM
Gilles updated the task description. Dec 1 2016, 12:53 PM
BBlack added a comment. Dec 1 2016, 2:15 PM

I've made this argument before. I'm not fond of Commons/upload images/thumbs being hotlinkable. In my mind, Commons exists to serve the multimedia needs of the encyclopedic content, and hotlinking from it has nothing to do with that. The hotlinked traffic is also a fairly large chunk of our overall bandwidth.

However, when I've made this argument in the past, I've been corrected that Commons does not exist solely to serve the multimedia needs of the encyclopedic content, as I would've imagined or hoped. It exists on its own justification as a library of public domain multimedia works so long as they could conceivably provide any educational value to anyone, as laid out here: https://commons.wikimedia.org/wiki/Commons:Project_scope .

There are many legitimate uses of hotlinks to our images for reasonable purposes which include appropriate attribution/licensing text, and sorting good from bad hotlinking would be a pretty heavy and ongoing task for humans if we tried to whitelist or blacklist on referrer headers.

I think it's a legitimate debate to have, but the scope of the debate and the weight of the decision is huge. It's not something we can unilaterally decide in a phab task. This is the kind of debate that would require the input and involvement of the Commons (and broader) community and upper management at the WMF, I think.

Gilles added a comment. Dec 1 2016, 2:26 PM

Except I've looked at the data for 302s and there was no legitimate use. Even on the rare instances of a blog that seemed to have an educative purpose, there was no attribution and the text was probably copy/pasted as well. This is what I mean, the reality is that the hypothetical good use never materialized and it's exclusively junk that doesn't respect the basic principle of attribution Commons expects.

I think the same study of the 200s has to happen, along with a budget calculation of how much we're spending on supporting this.
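A minimal sketch of what such a referrer breakdown could look like, assuming the access logs can be reduced to (status, referer) pairs. The record shape, the function name and the sample hosts are all hypothetical; the real log pipeline is not described in this task.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def referrer_breakdown(records):
    """Count requests per referrer host, grouped by HTTP status.

    `records` is an iterable of (status, referer) pairs. This shape is an
    assumption made for illustration, not the format of the real logs.
    """
    by_status = defaultdict(Counter)
    for status, referer in records:
        try:
            host = urlsplit(referer).hostname if referer else None
        except ValueError:
            host = None  # malformed Referer header
        by_status[status][host or "(none)"] += 1
    return by_status


# Made-up sample data, only to show the shape of the result.
sample = [
    (200, "https://en.wikipedia.org/wiki/Amsterdam"),
    (200, "http://blog.example/post/1"),
    (302, "http://blog.example/post/2"),
    (302, "http://blog.example/post/3"),
    (200, ""),
]
breakdown = referrer_breakdown(sample)
for status in sorted(breakdown):
    print(status, breakdown[status].most_common())
```

Comparing the top hosts for 200s against those for 302s would show whether the two populations of referrers really look alike.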

As for the scope of this discussion, I'm not sure. Some people will probably cling onto the strawman legit use case, ignoring how much we're wasting on this right now. Should a single hotlinker produce a huge amount of traffic and cause an incident, we wouldn't run a consultation to block them.

And like I've mentioned, we could whitelist what we consider "legit". In fact there could be a process created to apply for hotlinking rights. This would flip the default and still support the theoretical use cases people are likely to fight for. In fact the error happening when hotlinking could include a link to a page explaining the default and how to apply for an exception.
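To make the "flip the default" idea concrete, here is a rough sketch of a referrer check with an allowlist. Everything in it is an assumption for illustration: the function name, the decision to let requests without a Referer header through, the list of "own" domain suffixes, and the allowlist entries (borrowed from the sites mentioned in the task description). It is not how the caches are configured today.

```python
from urllib.parse import urlsplit

# Assumption: requests referred by our own projects are always fine.
OWN_SUFFIXES = ("wikipedia.org", "wikimedia.org")
# Illustrative allowlist, using hosts mentioned in the task description.
ALLOWLIST = {"thewikigame.com", "www.wikiwand.com"}


def hotlink_allowed(referer, allowlist=ALLOWLIST):
    """Return True if an image request with this Referer should be served."""
    if not referer:
        # Assumption: no Referer means a direct visit or a privacy tool.
        return True
    try:
        host = urlsplit(referer).hostname
    except ValueError:
        return False  # malformed Referer header
    if host is None:
        return False
    # Match the domain itself or any subdomain, but not look-alike names.
    if any(host == s or host.endswith("." + s) for s in OWN_SUFFIXES):
        return True
    return host in allowlist


print(hotlink_allowed("https://en.wikipedia.org/wiki/Amsterdam"))
print(hotlink_allowed("http://blog.example/post/1"))
```

A denied request could then return an error that links to the page explaining the default and how to apply for an exception.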

BBlack added a comment. Edited Dec 1 2016, 2:57 PM

Keep in mind I fundamentally agree with you from personal POV, but I feel the need to play devil's advocate for the existing stance today here:

> Except I've looked at the data for 302s and there was no legitimate use.

What are the 302 cases commonly? (as in, why are they redirects at all, what kind of links are they?)

> Even on the rare instances of a blog that seemed to have an educative purpose, there was no attribution and the text was probably copy/pasted as well. This is what I mean, the reality is that the hypothetical good use never materialized and it's exclusively junk that doesn't respect the basic principle of attribution Commons expects.
> I think the same study of the 200s has to happen

I know I've seen legitimate hotlinking of our images in many legitimate places (even just in random blogs and news articles I read), with attribution, which would fall into the 200 logs. I don't know how you'd define or categorize legitimacy today (other than proper attribution, most other measures are fairly subjective), but there have to be a lot of legitimate use cases out there that we'd be killing by default.

> along with a budget calculation of how much we're spending on supporting this.

Eliminating image hotlinks would not allow us, for example, to downgrade our Transport links (but maybe Transit a little). We pay the same cost for them whether they're more or less utilized. You could make an argument that it might save us a fraction of our cache or swift infrastructure, but that's a relatively small budget figure: at best saving a handful of machines whose cost is spread over a period of years. Probably the most legitimate argument isn't on cost savings, but on performance benefits (it's easy to have higher cache hitrates if we block most hotlinking, as the active set size gets drastically reduced and the patterns are better). It doesn't get us away from caring about breaking old links and APIs unless we never whitelist any significant number of sites.
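As a toy illustration of the hit-rate point, the sketch below runs a fixed-size LRU cache against two request streams that differ only in the size of the active set. The LRU policy, the capacity and the cyclic access pattern are all simplifying assumptions chosen to make the effect obvious; real cache behaviour and real traffic are far less clear-cut.

```python
from collections import OrderedDict


def lru_hit_rate(requests, capacity):
    """Replay `requests` through an LRU cache and return the hit ratio."""
    cache = OrderedDict()
    hits = 0
    total = 0
    for key in requests:
        total += 1
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits / total if total else 0.0


def cyclic(n_objects, n_requests):
    """Request objects 0..n_objects-1 in a repeating cycle (toy pattern)."""
    return [i % n_objects for i in range(n_requests)]


# Same cache, same request count, different active set sizes.
small_set = lru_hit_rate(cyclic(50, 1000), capacity=100)
large_set = lru_hit_rate(cyclic(200, 1000), capacity=100)
print(small_set, large_set)
```

When the active set fits in the cache, only the first request for each object misses; when it does not, this particular access pattern evicts every object before it is requested again.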

> As for the scope of this discussion, I'm not sure. Some people will probably cling onto the strawman legit use case, ignoring how much we're wasting on this right now.

How much time are we wasting on it right now?

> Should a single hotlinker produce a huge amount of traffic and cause an incident, we wouldn't run a consultation to block them.

Right. We always reserve the right to block abusive traffic at the techops level, but this has little to do with the general image-hotlinking debate. If we saw a single IP address or netblock causing 70% of our wiki pageviews on the text cluster and spiking traffic / hurting infra, we'd block those IPs from viewing wiki articles as a quick-fix for abuse as well.

> And like I've mentioned, we could whitelist what we consider "legit". In fact there could be a process created to apply for hotlinking rights. This would flip the default and still support the theoretical use cases people are likely to fight for. In fact the error happening when hotlinking could include a link to a page explaining the default and how to apply for an exception.

There may be thousands of "illegit" uses, but there are probably also tens of thousands of long-tail legit uses. Commons/upload.wm.o has been a free source of hotlinkable images on the Internet for a very long time now...

Gilles added a comment. Dec 1 2016, 2:59 PM

The 302s are links to thumbnails that have moved because the original was moved. MediaWiki honors those redirects on misses, figuring out what the new thumbnail location is.
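Roughly, the redirect behaviour described above could be sketched as follows. This is not MediaWiki's actual code: the `moved` mapping and both function names are hypothetical, and the path layout (an MD5-derived `x/xy` directory prefix followed by `<width>px-<name>`) is assumed from the shape of the public thumbnail URLs.

```python
import hashlib
from urllib.parse import quote


def thumb_path(filename, width):
    """Build a thumbnail path, assuming the hashed upload directory layout."""
    name = filename.replace(" ", "_")
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    encoded = quote(name)
    return (
        f"/wikipedia/commons/thumb/{digest[0]}/{digest[:2]}/"
        f"{encoded}/{width}px-{encoded}"
    )


def resolve_thumb(filename, width, moved):
    """Return (status, path) for a thumbnail request that missed the cache.

    `moved` maps old file names to new ones. It is a stand-in for however
    file moves are actually recorded.
    """
    target = moved.get(filename)
    if target is None:
        return 200, thumb_path(filename, width)
    # The original was moved: redirect to the thumbnail's new location.
    return 302, thumb_path(target, width)


status, path = resolve_thumb(
    "Old name.jpg", 512, {"Old name.jpg": "New name.jpg"}
)
print(status, path)
```

Every stale link copied onto a third-party page keeps producing such a 302, which is why the redirects are a useful signal for finding pages that hotlink old HTML.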

Gilles added a comment. Edited Dec 1 2016, 3:09 PM

The examples you've provided for legitimate use cases aren't compelling examples of us providing a free CDN being a necessity. The examples I've seen on blogspot could host the images there, and it would be the natural reaction of people who maintain the content hotlinking to us to start hosting it themselves and fix their content when we break it.

As for the downlink not being downgraded: sure, but what about future upgrades that won't have to happen thanks to the freed capacity?

When Wikipedia started, sure, hosting your own site with images might have been costly. But so far I have yet to see an example of a legit use case, in the generous sense of the term, that couldn't host the images itself. Clearly they were hosting a whole lot of other images on the pages I've seen that weren't hotlinked.

I really feel like we're talking about continuing support for something that was only relevant in the 90s. Nowadays hosting images should prove no challenge even to people who just started blogging. And I don't see how it's our mission to avoid breaking abandoned 3rd-party blog articles.

faidon added a subscriber: faidon. Dec 1 2016, 3:23 PM

It seems that you're objecting to this feature on two different grounds: one is the legality of how it's being used by users (copyvios, mainly missing attribution when the content's license requests it) and the other one is the infrastructure cost of operating it. Is this the case? If so, these two are very different objections and we (ops/traffic, but engineering in general I'd say) can mainly have an opinion on the latter, I think.

I haven't run the numbers, but my gut tells me that the infrastructure & network costs are probably not going to play a major factor here.

I'm personally on the other side of the argument and see this as a useful thing to have (and I certainly see explicit blocking of hotlinking, as some other sites do, as super annoying). In any case, I otherwise fully agree with what @BBlack said, that this isn't something that we can just unilaterally decide on a whim.

fgiunchedi triaged this task as Normal priority. Dec 1 2016, 6:58 PM

> The examples you've provided for legitimate use cases aren't compelling examples of us providing a free CDN being a necessity. The examples I've seen on blogspot could host the images there, and it would be the natural reaction of people who maintain the content hotlinking to us to start hosting it themselves and fix their content when we break it.

Embedding images in a forum post is a very clear example of a legitimate use case where hotlinking is beneficial (the alternative is re-uploading it to e.g. imgur, but that is guaranteed to lose information on the author of the image). We even actively encourage this use case with 'share and embed' codes such as:

<p><a href="https://commons.wikimedia.org/wiki/File:RijksmuseumAmsterdamMuseumplein2.50,1.jpg#/media/File:RijksmuseumAmsterdamMuseumplein2.50,1.jpg"><img src="https://upload.wikimedia.org/wikipedia/commons/e/e6/RijksmuseumAmsterdamMuseumplein2.50%2C1.jpg" alt="RijksmuseumAmsterdamMuseumplein2.50,1.jpg" height="1200" width="3000"></a><br>By <a href="//commons.wikimedia.org/wiki/User:Massimo_Catarinella" title="User:Massimo Catarinella">Massimo Catarinella</a> - <span class="int-own-work" lang="en">Own work</span>, <a href="http://creativecommons.org/licenses/by-sa/3.0" title="Creative Commons Attribution-Share Alike 3.0">CC BY-SA 3.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=37643098">Link</a></p>

(from mediaviewer)

or

[url=https://commons.wikimedia.org/wiki/File%3ARijksmuseumAmsterdamMuseumplein2.50%2C1.jpg][img]https://upload.wikimedia.org/wikipedia/commons/thumb/e/e6/RijksmuseumAmsterdamMuseumplein2.50%2C1.jpg/512px-RijksmuseumAmsterdamMuseumplein2.50%2C1.jpg[/img][/url]
[url=https://commons.wikimedia.org/wiki/File%3ARijksmuseumAmsterdamMuseumplein2.50%2C1.jpg]RijksmuseumAmsterdamMuseumplein2.50,1[/url] [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], by Massimo Catarinella (Own work), from Wikimedia Commons

(Commons 'use this file' button).

Furthermore, on the attribution side, allowing hotlinking is actually better, as it uses an image url that allows someone to easily trace back the origin (https://upload.wikimedia.org/wikipedia/commons/e/e6/RijksmuseumAmsterdamMuseumplein2.50%2C1.jpg versus e.g. imgur.com/randomcharacters.jpg)

Great points @valhallasw. It'd be nice to know how much it really costs, though, even if it's already priced into the current infrastructure. When I heard that it represented a large share of our traffic it struck me as something with a non-trivial cost. While we wouldn't reclaim the freed capacity now if we stopped doing that, it might mean that the next upgrades will be further in the future, which would generate some amount of savings. The question is, how much?

I don't disagree with the utility in the grand scheme of things. To me the question is: should we be the ones providing it? We make the content available, but that doesn't mean that we have to make it available in that particular way.

Using the Media Viewer blog embed codes as an example, they could point to a copy of the file on the Internet Archive (should they be inclined to host that stuff), or behind a free CDN like Cloudflare, etc. As in, we wouldn't allow direct hotlinking, but would facilitate hotlinking to the same content hosted by a 3rd party.

But the question of cost really comes first, because it determines whether alternatives are worth pursuing or not.

ema moved this task from Triage to Watching on the Traffic board. Dec 5 2016, 11:16 AM
Yann added a subscriber: Yann. Jun 26 2019, 2:25 PM