Filing this in Phabricator so we can look into it together.
Question: Can we determine the percentage of automated vs. human downloads of books on Wikisource?
In order to determine this, we would need the following information:
- How are the download events logged? Is there a data table capturing these events?
- If we have a data table, does it log: 1) user ID/user name, 2) timestamp of each download event, 3) wiki name?
- Are crawlers marked as bots? If not, we may be able to estimate the split from their behavior patterns.
What we have now:
- We only have the access log on WSExport. It records the time, the book downloaded, and the user agent of the downloader. If the user agent is too old, it is probably a crawler.
- We don't have any of the other information, but we can add event logging. We cannot log IPs.
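
As a starting point, the user-agent heuristic above could be sketched roughly as follows. This is only a sketch under assumptions: the `LogEntry` fields, the bot marker list, and the version cutoffs for "too old" are all hypothetical and not taken from the actual WSExport access log, so they would need adjusting to whatever the log really contains.

```python
import re
from dataclasses import dataclass

# Hypothetical record shape; the real WSExport access log format may differ.
@dataclass
class LogEntry:
    timestamp: str
    book: str
    user_agent: str

# Assumed substrings that commonly appear in self-identified bots and scripts.
BOT_MARKERS = ("bot", "crawler", "spider", "curl", "wget", "python-requests")

# Assumed cutoffs for "too old"; pick real values after looking at the data.
MIN_MAJOR_VERSION = {"Chrome": 80, "Firefox": 70}

_VERSION_RE = re.compile(r"(Chrome|Firefox)/(\d+)")

def looks_automated(user_agent: str) -> bool:
    """Guess whether a user agent belongs to a crawler or script."""
    ua = user_agent.strip().lower()
    if not ua:
        return True  # an empty user agent is treated as automated
    if any(marker in ua for marker in BOT_MARKERS):
        return True
    match = _VERSION_RE.search(user_agent)
    if match:
        browser, major = match.group(1), int(match.group(2))
        return major < MIN_MAJOR_VERSION[browser]
    return False  # unknown agents are counted as human by default

def automated_share(entries) -> float:
    """Return the percentage of entries that look automated."""
    entries = list(entries)
    if not entries:
        return 0.0
    automated = sum(looks_automated(e.user_agent) for e in entries)
    return 100.0 * automated / len(entries)
```

Since we cannot log IPs, this only uses the user agent. It will miss crawlers that send a current browser user agent, so the result is better read as a lower bound than as an exact percentage.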