
Wikisource Ebooks: investigate if we can prevent automated downloads (to improve reliability) [8H]
Closed, Resolved · Public · Nov 4 2020

Description

As a Wikisource user, I want the team to see if we can prevent web crawlers from downloading books, so the ebook exports can only be done by real users (and, therefore, the queue will be smaller & more efficient).

Background: This is a follow-up of T256018. As we have discussed, if we keep the download links in the sidebar, we will still have web crawlers. However, if we add a download button at the top right of the book, web crawlers will not have access to it. This brings up a question: Can we replace all current download links with the new system, so we can prevent automated downloads and therefore increase reliability?

Acceptance Criteria:

  • Investigate if we can prevent automated downloads via bots & webcrawlers
  • Investigate how we can prevent automated downloads via bots & webcrawlers
  • Investigate the main challenges, risks, and dependencies associated with such work
  • Provide a general estimate/idea, if possible, of the potential impact it may have on ebook export reliability
  • Provide a general estimation/rough sense of the level of difficulty and effort required in doing such work
  • Can there be a system to allow approved bots to download books?
  • Discuss with @Prtksxna how UX changes may prevent bots from downloading books
  • When discussing potential options and solutions, consider that people may also want a way to download books in bulk (rather than only being able to download books one at a time).
  • Share findings with the team

Details

Due Date
Nov 4 2020, 5:00 AM

Event Timeline

ifried renamed this task from Wikisource Ebooks: investigate if we can distinguish between user-generated vs. web crawler downloads [placeholder] to Wikisource Ebooks: investigate if we can distinguish between user-generated vs. web crawler downloads. (Oct 1 2020, 10:56 PM)
ifried updated the task description.
ifried renamed this task from Wikisource Ebooks: investigate if we can distinguish between user-generated vs. web crawler downloads to Wikisource Ebooks: investigate if we can prevent automated downloads (to improve reliability). (Oct 6 2020, 8:22 PM)
ifried updated the task description.
ifried updated the task description.
ifried updated the task description.
ifried updated the task description.
ifried added a subscriber: Prtksxna.
ARamirez_WMF renamed this task from Wikisource Ebooks: investigate if we can prevent automated downloads (to improve reliability) to Wikisource Ebooks: investigate if we can prevent automated downloads (to improve reliability) [8H]. (Oct 8 2020, 6:01 PM)
ARamirez_WMF changed the subtype of this task from "Task" to "Deadline".

Investigate if we can prevent automated downloads via bots & webcrawlers

To some extent we can, but it's unlikely to be completely foolproof. However, we're not looking to block absolutely all automated downloads, just to make sure they don't put undue load on our systems.

Investigate how we can prevent automated downloads via bots & webcrawlers

We already block some bots by looking at the user agent. This will never be completely effective, however, because only well-behaved bots set a meaningful user agent; the others just pretend to be normal web browsers. We can also look at whether the referer URL (of which we only get the domain name, not the full path) helps, i.e. look at the frequency of hits with the same referer, and maybe they'll be infrequent enough to always allow through – until the bots start providing matching referers…
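As a rough illustration only (a minimal Python sketch, not the actual wsexport code; the pattern list and counters are assumptions for the example), the two checks above could look like this:

```python
# Illustrative sketch: flag requests whose User-Agent matches a known-crawler
# pattern, and count hits per referer domain so unusually busy referers can
# be reviewed. The pattern list is a placeholder, not what we actually use.
import re
from collections import Counter

CRAWLER_PATTERNS = [r"bot", r"crawler", r"spider", r"curl", r"wget"]
referer_hits = Counter()

def looks_like_crawler(user_agent: str) -> bool:
    """Treat an empty or crawler-like User-Agent as automated."""
    if not user_agent:
        return True
    return any(re.search(p, user_agent, re.IGNORECASE) for p in CRAWLER_PATTERNS)

def record_referer(referer_domain: str) -> int:
    """Count hits per referer domain; a very high count may indicate a bot."""
    referer_hits[referer_domain] += 1
    return referer_hits[referer_domain]
```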

A better system would be to implement request throttling, and ask users to log in if they need to avoid the throttle. The rate limit would need to be set at some level that's still useful to most users, because the tool is mainly for readers and we don't want to assume they have Wikimedia accounts. Figuring out a rate limit that's low enough to reduce load while still being high enough for legitimate users would be tricky.
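A minimal sketch of what such a throttle could look like, assuming we key anonymous clients on the only signals we have (User-Agent plus referer domain) and let logged-in users bypass it; the window and limit values are placeholders, not tuned numbers:

```python
# Sketch of a fixed-window throttle with a login bypass. Values are placeholders.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 3600
MAX_REQUESTS_PER_WINDOW = 5  # would need real tuning against actual traffic
_requests = defaultdict(deque)

def allow_request(client_key, is_logged_in, now=None):
    """Return True if the request may proceed; logged-in users skip the throttle."""
    if is_logged_in:
        return True
    now = time.time() if now is None else now
    window = _requests[client_key]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()  # discard hits older than the window
    if len(window) >= MAX_REQUESTS_PER_WINDOW:
        return False
    window.append(now)
    return True
```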

Investigate the main challenges, risks, and dependencies associated with such work

The fundamental issue we have here is that (as far as I know) we've only got two pieces of information about users: the user agent and the referer. Even combined, these aren't really enough to accurately fingerprint a user, so blocking based on them carries a risk of false positives: we may end up blocking human users.
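To show why that fingerprint is weak (illustrative Python only): two unrelated readers running the same browser and arriving from the same wiki collapse to exactly the same key, so throttling or blocking that key affects both of them.

```python
# Two distinct human users can produce an identical User-Agent + referer pair,
# so any key derived from those signals cannot tell them apart.
import hashlib

def fingerprint(user_agent, referer_domain):
    return hashlib.sha256(f"{user_agent}|{referer_domain}".encode()).hexdigest()

ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/81.0"
reader_a = fingerprint(ua, "en.wikisource.org")
reader_b = fingerprint(ua, "en.wikisource.org")
print(reader_a == reader_b)  # True: the two readers are indistinguishable
```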

For throttling we'd want to try to only throttle one user, but in fact we might only be able to throttle the whole service. That might be fine, and once we have a job queue we'll be able to display how long the queue is and how quickly it's being processed. My feeling is that we won't need to throttle anything yet. When we do, we could limit how fast the queue is processed, or various parts of it, or how jobs are added to it (e.g. in times of high load only allow logged-in users to add jobs), or maybe even truncate certain parts of the service (e.g. leave out credits information if the queue is over a certain size).
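The last two ideas could be combined into a simple admission policy on the future queue. A sketch under assumed thresholds (the numbers are invented for the example and would depend on observed load):

```python
# Sketch of a load-shedding admission policy for the future job queue.
# Thresholds are invented; real values would depend on observed load.
from dataclasses import dataclass

LOGIN_ONLY_THRESHOLD = 100   # above this queue length, only logged-in users may enqueue
SKIP_CREDITS_THRESHOLD = 50  # above this queue length, skip the credits section

@dataclass
class Admission:
    accepted: bool
    include_credits: bool = True

def admit_job(queue_length, is_logged_in):
    if queue_length >= LOGIN_ONLY_THRESHOLD and not is_logged_in:
        return Admission(accepted=False)
    return Admission(accepted=True,
                     include_credits=queue_length < SKIP_CREDITS_THRESHOLD)
```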

Provide a general estimate/idea, if possible, of the potential impact it may have on ebook export reliability

Blocking automated downloads will help reliability if there are lots of them relative to legitimate requests, but if we can't reliably tell the difference then we'll have to err on the side of only blocking the obvious ones.

Provide a general estimation/rough sense of the level of difficulty and effort required in doing such work

Once we have the job queue functioning, adding extra features to it to limit rates, or to only accept jobs from authenticated users, shouldn't be *too* hard.

Can there be a system to allow approved bots to download books?

We can add a system of requiring authentication, which offloads the "prove you're human" stuff to the normal Wikimedia account-creation process. Whether we'd also need to be able to tell the difference between an authenticated user and an authenticated bot, I'm not sure, but it wouldn't be too hard if required.
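If we did need to distinguish them, a rough sketch could be as simple as the following (the approved-bot list below is hypothetical configuration, not something that exists in wsexport today):

```python
# Hypothetical sketch: classify an account after the normal Wikimedia login
# flow. APPROVED_BOTS is an invented configuration value for illustration.
APPROVED_BOTS = {"ExampleArchiveBot"}  # hypothetical allow-list

def classify_account(username):
    if username is None:
        return "anonymous"      # subject to the normal throttle
    if username in APPROVED_BOTS:
        return "approved-bot"   # could be allowed a higher download rate
    return "user"               # regular authenticated reader
```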

Discuss with @Prtksxna how UX changes may prevent bots from downloading books

My understanding of this idea is that on-wiki links may be what leads bots to crawl wsexport and trigger the downloads. As part of T256392 we'll be improving the way these links are constructed, and we can add rel="nofollow" to them. That might help with bots that behave themselves, but others will just ignore it.

Also, there are lots of other places that link to wsexport, and we can't control all of them. A solution at the wsexport end is preferable.
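For illustration, the link-building change could look something like this (Python sketch; the endpoint URL and parameters here are illustrative, not the exact wsexport URL format):

```python
# Illustrative only: build an export link marked rel="nofollow" so that
# well-behaved crawlers skip it. The URL format below is an example.
from html import escape
from urllib.parse import urlencode

def export_link(lang, page, fmt="epub"):
    query = urlencode({"lang": lang, "format": fmt, "page": page})
    url = f"https://ws-export.wmcloud.org/?{query}"  # example endpoint
    return f'<a rel="nofollow" href="{escape(url)}">Download {fmt.upper()}</a>'

print(export_link("en", "Alice's Adventures in Wonderland"))
```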

When discussing potential options and solutions, consider that people may also want a way to download books in bulk (rather than only being able to download books one at a time).

This is especially true as we continue to move to systems that consume Wikidata metadata, because they'll easily be able to get large sets of titles that they want to export. We might want to look at a feature that, for example, allows a SPARQL query to be used to create a list of books to export. That sort of feature could reasonably be limited to authenticated users, I think (e.g. GLAM institutions who want to extract particular categories of works).
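As a sketch of how that selection step could work (assumptions: the Wikidata Query Service endpoint and a deliberately generic example query listing English Wikisource pages; a real feature would accept the user's own query):

```python
# Sketch: fetch a candidate list of Wikisource pages from the Wikidata Query
# Service. The query is a generic example of what a GLAM user might supply.
import requests

QUERY = """
SELECT ?item ?page WHERE {
  ?page schema:about ?item ;
        schema:isPartOf <https://en.wikisource.org/> .
}
LIMIT 10
"""

def fetch_export_candidates():
    response = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "wsexport-bulk-export-example/0.1"},
        timeout=30,
    )
    response.raise_for_status()
    return [row["page"]["value"] for row in response.json()["results"]["bindings"]]
```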


In conclusion, I think we should hold off on doing anything more to block bots until we've done the following:

  1. Started to cache generated books – T222936
  2. Added a job queue, so that multiple closely-timed requests for the same book won't increase resource usage – T253283
  3. Determined that the request load is impacting performance
  4. Determined which part of the stack is causing the most trouble
  5. Added an easy way to view user agents and their request numbers – T261480 will help

Then we should:

  1. Add a queue-processing throttle (and tell users where they are in the queue and how fast it's moving)
  2. Add an authentication system with which users can bypass the throttle, or perhaps just jump the queue (sketched below)
  3. Investigate adding a 'bulk generation' option (although that'd probably also be on the queue and so impact general performance)
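A minimal sketch of the queue-jumping idea, assuming a simple in-process priority queue; a real version would live in whatever backend T253283 ends up using:

```python
# Sketch: authenticated users' jobs get a higher priority, so they are served
# before anonymous jobs, while equal-priority jobs stay in FIFO order.
import heapq
import itertools

_counter = itertools.count()  # tie-breaker keeps FIFO order within a priority
_queue = []

def enqueue(book_title, is_authenticated):
    priority = 0 if is_authenticated else 1  # lower number = served sooner
    heapq.heappush(_queue, (priority, next(_counter), book_title))

def next_job():
    return heapq.heappop(_queue)[2] if _queue else None
```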
ARamirez_WMF changed Due Date from Oct 21 2020, 4:00 AM to Nov 4 2020, 5:00 AM. (Oct 22 2020, 7:39 PM)

I’m marking this investigation as Done.

Overall, we have decided not to take any action on this topic for now, for the reasons outlined in the findings above: we already block some bots, we'll never be able to block all bots, we risk blocking good actors (false positives), the throttling solution may not be necessary, and there is other work we should focus on first.

It should be noted that some of the proposed steps of this investigation are already done or in progress. We have already done the cache work (step 1) and added an easy way to view user agents (step 5). Meanwhile, we have investigated adding a job queue, but we're not sure yet whether we will implement it. Instead, we have first focused on other improvements to reliability and performance that are smaller in scope and more manageable for the team. We have seen positive results from this work so far, so we'll continue to focus on it, and we can revisit this investigation later if we decide we want to do more to block bots.