
Allow upload-by-URL from upload.wikimedia.org
Open, MediumPublic

Description

This might seem ridiculous at first glance, but it would be incredibly useful for writing Commons transfer scripts (similar in concept to CommonsHelper, but calling the API from JavaScript).

It may be as simple as adding upload.wikimedia.org to $wgCopyUploadsDomains in InitialiseSettings.php. However, I don't know if the server configuration will allow this to work straight away.
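As a sketch of what that configuration change might look like (hypothetical: the exact keying and the existing entries in InitialiseSettings.php are illustrative, not the actual production diff):

```php
// InitialiseSettings.php (sketch only, not the real production config):
// allow upload-by-URL to also fetch from upload.wikimedia.org.
'wgCopyUploadsDomains' => [
	'default' => [
		'*.flickr.com',          // existing entries shown for illustration
		'upload.wikimedia.org',  // proposed addition
	],
],
```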

See also T22512.

Details

Reference
bz42473

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 1:01 AM
bzimport set Reference to bz42473.
bzimport added a subscriber: Unknown Object (MLST).

See also my comment at bug 14919 comment 5

Adding dependency to tracking bug 37883 (Wikimedia Commons features).

Adding reedy as CC to get feedback on the server configuration issue.

[feature => severity "enhancement"]

Adding Ryan; he's the man, apparently

One issue with this is that the proxy server currently handling upload-by-URL requests can't do HTTPS. So we would either need to fix that bug, or give some warning that HTTPS requests will error out.

Is there already a bug "add HTTPS capability to the proxy server"?

If so, please add a dependency.

Could we now enable this feature or is there another blocker?

(In reply to comment #9)

Could we now enable this feature or is there another blocker?

I guess it should be enabled on testwiki and confirmed to work first...

Could someone please go ahead and enable this on testwiki?

(In reply to comment #11)

Could someone please go ahead and enable this on testwiki?

https://gerrit.wikimedia.org/r/47299

Thanks, however it doesn't seem to work for me. I ran a test from test2wiki (this was easier because my JS code is set up for CORS):

HTTP POST to http://test.wikipedia.org/w/api.php

action=upload
filename=0.28589522187660577.png
text=this is a test file
comment=upload comment
token=<VALID EDIT TOKEN>
url=http%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Ftest2%2F5%2F53%2F0.28589522187660577.png
ignorewarnings=true
format=json
origin=http%3A%2F%2Ftest2.wikipedia.org

This is the response:

{"servedby":"srv193","error":{"code":"http-bad-status","info":"Error fetching file from remote source","0":"403","1":"Forbidden"}}
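For reference, a request like the one above can be assembled from JavaScript roughly as follows (a sketch: the endpoint and origin come from the test described above, token acquisition is omitted, and `buildCopyUploadParams` is a made-up helper name, not part of any library):

```javascript
// Build the POST body for a cross-wiki upload-by-URL API call.
// "token" must be a valid CSRF/edit token fetched from the API first.
function buildCopyUploadParams(fileUrl, filename, token) {
  return new URLSearchParams({
    action: 'upload',
    format: 'json',
    filename,
    url: fileUrl,            // the server fetches this URL (upload-by-URL)
    token,
    ignorewarnings: '1',
    origin: 'https://test2.wikipedia.org', // required for CORS requests
  });
}

// Hypothetical usage:
// fetch('https://test.wikipedia.org/w/api.php', {
//   method: 'POST',
//   credentials: 'include',
//   body: buildCopyUploadParams(
//     'https://upload.wikimedia.org/wikipedia/test2/5/53/example.png',
//     'Example.png',
//     'EDIT_TOKEN'
//   ),
// }).then(r => r.json()).then(console.log);
```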

(In reply to comment #13)

{"servedby":"srv193","error":{"code":"http-bad-status","info":"Error fetching file from remote source","0":"403","1":"Forbidden"}}

acl to-wikimedia dst 208.80.152.0/22
acl to-wikimedia dst 91.198.174.0/24
acl to-wikimedia dst 10.0.0.0/16
acl to-wikimedia dst 10.64.0.0/16

# Do not allow any fetches from our own IP ranges
http_access deny to-wikimedia

I'm not sure if the answer is to make squid serve those requests, or add a list of sites that shouldn't use $wgCopyUploadProxy
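If the squid route were chosen, one conceivable (untested) relaxation would be to whitelist the upload host before the blanket deny. This is only a sketch against the excerpt above, not a reviewed production change, and whether it is acceptable would still be an ops/security call:

```
# Sketch: permit fetches from the public upload host only
acl upload-wm dstdomain upload.wikimedia.org
http_access allow upload-wm

# Existing rule: do not allow any fetches from our own IP ranges
http_access deny to-wikimedia
```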

Suspect that's a question for ops whether they're ok with letting the proxy read from the cluster..

No, an upload-by-url proxy is the wrong way to do it. If we want to copy files within the upload.wm.org realm, then we should use efficient server-side copies (e.g. Swift's X-Copy-From header), not go through the application servers and upload-by-URL proxies.

Moreover, copying files internally seems wrong to me in general. It's probably okay if it's a limited use case, but if it's something that's going to get popular, then some other way of multiple reference to the same file should be found, rather than having the same contents copied over and over in the media storage backends.

Maybe so. However, Commons transfer has always been done by a download-upload process (this is what CommonsHelper on toolserver does, for example). Fixing this bug would allow this tried-and-true approach to continue at a faster rate. Or, we could wait an indefinite amount of time for the file storage backend to be complexified, convoluted, etc...

(In reply to comment #14)

Suspect that's a question for ops whether they're ok with letting the proxy
read from the cluster..

Were ops ever contacted about this?

(In reply to comment #17)

Were ops ever contacted about this?

See answer in comment 15 by Faidon.

(In reply to comment #18)

See answer in comment 15 by Faidon.

My bad, I didn't realise Faidon was part of the ops team.

It seems we've reached a stalemate: ops is refusing to fulfil the request, but no alternative is being suggested.

(In reply to comment #15)

It's probably
okay if it's a limited use case, but if it's something that's going to get
popular

Just so you are aware, Faidon... I daresay hundreds of thousands of files have already been copied from WMF wikis to Commons, leading already to massive duplication on the servers. So this process is already rather popular, and this bug is a way to streamline the process.

To be clear, I would welcome an alternative internal approach, or a rationalisation of the file storage backend, but I don't see those things happening anytime soon. Going ahead and reconfiguring the proxy can be done now (as far as I can tell) and would make the process as it already exists a lot simpler.

[CC'ing Fabrice as this covers Uploading/Multimedia]

  • Bug 62820 has been marked as a duplicate of this bug.

RfC is running at Commons: https://commons.wikimedia.org/wiki/Commons:Requests_for_comment/Allow_transferring_files_from_other_Wikimedia_Wikis_server_side

I didn't conceal that it's possibly not going to be implemented, but I hope that the strong consensus and some of the community's comments will motivate the responsible people to reconsider their position. The way files are currently transferred likely adds more load to the WMF servers than if the proxies were allowed to fetch from WMF directly.

Status update: On [[Commons:Commons:Requests for comment/Allow transferring files from other Wikimedia Wikis server side]], we have unanimous consensus.

(In reply to Faidon Liambotis from comment #15)

No, an upload-by-url proxy is the wrong way to do it. If we want to copy
files within the upload.wm.org realm, then we should use efficient
server-side copies (e.g. Swift's X-Copy-From header), not go through the
application servers and upload-by-URL proxies.

Moreover, copying files internally seems wrong to me in general. It's
probably okay if it's a limited use case, but if it's something that's going
to get popular, then some other way of multiple reference to the same file
should be found, rather than having the same contents copied over and over
in the media storage backends.

Actually, we already do that with manual bots and tools that transfer media from local Wikimedia wikis to Commons once the files have been cleared as freely licensed or in the public domain.

So I offer to enable it, as it won't create more copies than we currently have, and then open a new bug to work on a better solution.

tomasz set Security to None.

So I offer to enable it, as it won't create more copies than we currently have, and then open a new bug to work on a better solution.

That would be great, indeed. Can you enable that now?

So I offer to enable it, as it won't create more copies than we currently have, and then open a new bug to work on a better solution.

@Dereckson Just asking about the status :-). I read the discussion again and it looks like it should be possible to enable this now. Or not? Does it need some special config? There is also T78167. Thanks in advance.

It would need the ACLs in the squid config for url-downloader.wikimedia.org to be changed. Someone (@csteipp?) would probably need to assess the security risk of such a change.

Stale for nearly a year. Any news about this?

It sounds like someone needs to create a new ticket out of T44473#1198327, assign it to Ops and Security, and add it as a blocker to this bug.

No, an upload-by-url proxy is the wrong way to do it. If we want to copy files within the upload.wm.org realm, then we should use efficient server-side copies (e.g. Swift's X-Copy-From header), not go through the application servers and upload-by-URL proxies.

Moreover, copying files internally seems wrong to me in general. It's probably okay if it's a limited use case, but if it's something that's going to get popular, then some other way of multiple reference to the same file should be found, rather than having the same contents copied over and over in the media storage backends.

So assuming that @faidon's comment still stands, what is the way forward here?

How about having a config variable with a regex that converts URLs to mwstore:// virtual URLs? Then, if MW sees a URL matching that regex, instead of doing an HTTP request to copy the file, it would do an internal Swift copy.

Part of the problem is that the Upload class is very rigid and difficult to modify. However, I think this is doable.
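That idea could be sketched as follows (hypothetical throughout: the rewrite table, the `local-swift` backend name, and the container naming scheme are illustrative, not MediaWiki's real configuration or API):

```javascript
// Hypothetical config: regexes mapping public upload URLs to
// mwstore:// virtual URLs, as suggested above.
const copyUploadRewrites = [
  {
    pattern: /^https?:\/\/upload\.wikimedia\.org\/wikipedia\/([a-z0-9-]+)\/(.+)$/,
    replacement: 'mwstore://local-swift/$1-public/$2',
  },
];

// Return a virtual URL if the source URL matches a rewrite rule,
// signalling that an internal backend copy could replace the HTTP fetch.
function toVirtualUrl(url) {
  for (const { pattern, replacement } of copyUploadRewrites) {
    if (pattern.test(url)) {
      return url.replace(pattern, replacement);
    }
  }
  return null; // no match: fall back to the normal HTTP copy-upload path
}
```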

So assuming that @faidon's comment still stands, what is the way forward here?

How about having a config variable with a regex that converts URLs to mwstore:// virtual URLs? Then, if MW sees a URL matching that regex, instead of doing an HTTP request to copy the file, it would do an internal Swift copy.

@faidon: Any opinion on that approach?

See T140462 and T190716; has this problem been solved in another way?

Looking at the description of this ticket, FileImporter cannot be called from scripts right now; it is just a special page.
I don't think there is a ticket for this.

Any progress on this? Upload by URL has literally been around for years, and being unable to import from *.wikimedia.org is a pretty massive oversight.

edit:
I just tried this on testwiki and it works fine. What exactly is the hold up?