WS-Export is not available to users on blocked IP addresses
Open, Needs Triage, Public, BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

  • Use a VPN, e.g. the free VPN available in Opera
  • Choose any Wikisource page in the main namespace (desktop view)
  • Click the "Download" button at the top

What happens?:

What should have happened instead?:

  • Content download through this tool should be available to anybody, as Wikimedia policy is not to block accessing content by anybody

Software version (on Special:Version page; skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.):

Event Timeline

Wikimedia policy is not to block accessing content by anybody

@Ankry: Citation needed. Of course we block abusive traffic.

https://meta.wikimedia.org/wiki/No_open_proxies#Policy

"No restrictions are placed on reading Meta or another Wikimedia project through an open or anonymous proxy. "

Downloading eBooks (which should be cached) is read-only access.

I'm not quite sure what the solution is here. The above policy says that people "may freely use proxies until those are blocked", and in the case above the IP has obviously already been blocked. WS Export uses the globalblocks data from Meta, as a way to limit resource usage (there are lots of spam bots crawling links from Wikisources).
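
As an illustration of the kind of check described above, here is a minimal sketch of testing a client address against a list of blocked ranges. It is not WS Export's actual code: the `BLOCKED_RANGES` sample and the `is_blocked` helper are hypothetical, and the real block data would come from the global block list on Meta rather than a hard-coded list.

```python
import ipaddress

# Hypothetical sample data using documentation-only address ranges.
# Real entries would be fetched from the global block list on Meta.
BLOCKED_RANGES = ["192.0.2.0/24", "2001:db8::/32"]


def is_blocked(ip, blocked_ranges=BLOCKED_RANGES):
    """Return True if `ip` falls inside any of the blocked CIDR ranges."""
    addr = ipaddress.ip_address(ip)
    for cidr in blocked_ranges:
        network = ipaddress.ip_network(cidr)
        # Compare only within the same IP version (IPv4 vs IPv6).
        if addr.version == network.version and addr in network:
            return True
    return False
```

A check like this cannot tell a spam bot from a human reader who happens to share a blocked address, which is the situation reported in this task.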

Samwilson renamed this task from ePub download through WS-Export is not available to anonymous VPN users to WS-Export is not available to users on blocked IP addresses. Oct 9 2024, 6:05 AM

I've updated the title, because it doesn't look like this is actually anything to do with VPNs.

I'm not quite sure what the solution is here. The above policy says that people "may freely use proxies until those are blocked", and in the case above the IP has obviously already been blocked. WS Export uses the globalblocks data from Meta, as a way to limit resource usage (there are lots of spam bots crawling links from Wikisources).

I focused on the phrase that no restrictions are placed on reading.

While I agree that bots crawling the generator (and ignoring the norobots clause) may be problematic, I think that accessing the already generated books in cache should be OK.
Maybe they could even be linked directly from the wiki, if there is an appropriate API to check for their presence?
If mass access to the cache on wmfcloud is still a problem, maybe we should think of another location for the cache? Do wiki pages and Commons files have similar access limits?
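
The suggestion above could be sketched roughly as follows: the block list is consulted only when a request would trigger generation, never when a finished file is already available. This assumes complete generated files are cached, which may not match how the service works today; `handle_request` and the dict-based `cache` are hypothetical names used only for illustration.

```python
def handle_request(book_key, client_blocked, cache):
    """Decide how to answer a download request.

    book_key: identifier for a book in a given format (hypothetical).
    client_blocked: result of the block-list check for the client.
    cache: dict mapping book keys to already generated file bytes.
    """
    if book_key in cache:
        # Read-only path: no generation work is needed,
        # so the block status is not taken into account.
        return ("serve_cached", cache[book_key])
    if client_blocked:
        # Generation is the expensive step, so it stays gated.
        return ("refuse", None)
    return ("generate", None)
```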

And, as far as I know, VPNs are generally blocked due to user identification problems and LTAs. Isn't it a bit overkill to use a proxy blocklist to prevent bots from accessing?

I've updated the title, because it doesn't look like this is actually anything to do with VPNs.

Unsure. The problem is related to specific global blocks only.

accessing the already generated books in cache should be OK.

I totally agree, but part of the issue is that we only cache parts of books, so every request still results in some work being done to combine those (and often to refresh the cache), and most notably to generate any derivative formats (anything other than Epub). I have wondered if we should move to a queue system (T345406), and also cache entire Epubs/PDFs/etc. That'd mean we could split up the generation process onto multiple servers, and upload the generated files to the new object storage service.
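
To make the queue idea concrete, here is a single-process sketch under stated assumptions: whole generated files are cached under a key derived from title and format, duplicate requests for a book already in the queue are not enqueued twice, and a plain dict stands in for object storage. `ExportQueue`, `cache_key`, and the `generate` callback are hypothetical and not taken from T345406 or from WS Export.

```python
import hashlib
import queue


def cache_key(title, fmt):
    """Derive a stable key for one book in one output format."""
    return hashlib.sha256(f"{title}|{fmt}".encode("utf-8")).hexdigest()


class ExportQueue:
    def __init__(self):
        self.jobs = queue.Queue()  # pending generation jobs
        self.pending = set()       # keys queued but not yet generated
        self.store = {}            # stand-in for object storage

    def request(self, title, fmt):
        """Register a download request and report what happened to it."""
        key = cache_key(title, fmt)
        if key in self.store:
            return "cached"
        if key in self.pending:
            return "already queued"
        self.pending.add(key)
        self.jobs.put((key, title, fmt))
        return "queued"

    def work_once(self, generate):
        """Process one job; `generate(title, fmt)` returns the file bytes."""
        key, title, fmt = self.jobs.get_nowait()
        self.store[key] = generate(title, fmt)
        self.pending.discard(key)
        return key
```

With several workers on separate servers, the in-memory queue and dict would need to be replaced by shared services; this sketch only shows the deduplication and caching logic.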

Isn't it a bit overkill to use a proxy blocklist to prevent bots from accessing?

It might well be! We might have our hand forced soon, as IP addresses are going to be inaccessible to unprivileged accounts, so we might not be able to look up the block list anyway. It'd be nicer if we didn't have to, for sure. It's just that, before we added that, the service would get overwhelmed quite often.