Page MenuHomePhabricator

Wikisource Ebooks: Can we determine the percentage of automated vs. human downloads of books on Wikisource?
Open, Needs TriagePublic

Description

Getting this in Phabricator, so we can look into this together...

Question: Can we determine the percentage of automated vs. human downloads of books on Wikisource?

In order to determine this, we would need the following information:

  • How are the download events logged? Is there a data table capturing these events?
  • If we have data table, does the data log: 1) user id/user name 2) timestamp of the download events 3) wiki name
  • Is the crawler marked as a bot? If not, maybe we can estimate based on their behavior pattern.

What we have now:

  • We only have the access log (time, book downloading, and user agent of person downloading book → and if user agent is too old, it is probably a crawler) on WSExport.
  • We don’t have any of the other stuff. But we can add event logging. We cannot log IPs.

Event Timeline

@dmaza When you get the chance, Jennifer would like the page of the table definition added to this ticket, so she can investigate. Thanks in advance!

Each export is logged in a table with this structure:

			CREATE TABLE `books_generated` (
				`time` DATETIME(2) NOT NULL,
				INDEX time (`time`),
				`lang` VARCHAR(10) CHARACTER SET utf8mb4 NOT NULL,
				INDEX lang (`lang`),
				`title` VARBINARY(255) NOT NULL,
				`format` VARCHAR(10) CHARACTER SET utf8mb4 NOT NULL
			) DEFAULT CHARSET=utf8mb4;'

And dumps of the data are available at https://wsexport.wmflabs.org/logs/

Thanks, @Samwilson!

@jwang We have provided information on the table structure (see above comment). Let us know if you need anything else. Thanks!

I want to point out that the user-agent is the only piece of identifiable data that we could "use" to tie different requests to one user/bot and it is a bit unreliable since multiple users could potentially have the same one.

Furthermore, user-agent is only available in the server's access log and not stored in the database. Unless I'm missing something, looking at the logs would be the only way to try and figure this out

Yes, that's a very good point, I was wondering that too. After T261480 we can look at the user agents more easily; should this task wait till that's done?

Also, are we allowed to log user agents? Aren't they sort of user-identifing?

Yes, that's a very good point, I was wondering that too. After T261480 we can look at the user agents more easily; should this task wait till that's done?

Also, are we allowed to log user agents? Aren't they sort of user-identifing?

This is tricky and I do not know the answer. AFAIK you can log that kind of data as long as it has a 90 days TTL and it is only available to users with a NDA. Technically, user-agent is already stored in the server so anyone with sudo access to the VPS can see it (which is fine 'cause you sign an NDA to get access)
@aezell @Niharika you've recently dealt with this in Checkuser/IP-Info/etc; could you provide more info?

Yes, that's a very good point, I was wondering that too. After T261480 we can look at the user agents more easily; should this task wait till that's done?

Also, are we allowed to log user agents? Aren't they sort of user-identifing?

This is tricky and I do not know the answer. AFAIK you can log that kind of data as long as it has a 90 days TTL and it is only available to users with a NDA. Technically, user-agent is already stored in the server so anyone with sudo access to the VPS can see it (which is fine 'cause you sign an NDA to get access)
aezell Niharika you've recently dealt with this in Checkuser/IP-Info/etc; could you provide more info?

Cloud Services is not the same as production. Anyone with a developer account can create a tool and log user agents on Toolforge, you just need to turn it on https://wikitech.wikimedia.org/wiki/Help:Toolforge/Web/Lighttpd#Web_logs. IPs are anonymized, however. The same is true with VPS.

The terms of use doesn't mention anything about user agents. We were given green light on awstats some years ago for XTools (see T169590#3423594), so I think that is fine because again IPs are the really sensitive bit and those are scrubbed. We of course don't want to keep our logs indefinitely anyway because that would eventually fill up the disk. The log rotation that is in place now for Wikisource Export will delete files older than 7 days. I'm not sure if the log rotation will interfere with awstats or not, but regardless I would think a week's worth of data is plenty for our purposes.

TIL. Thank you for the links and explanation. For some reason I was under the impression that you needed a NDA to be able to access a VPS.

For some reason I was under the impression that you needed a NDA to be able to access a VPS.

You do need to request a VPS project, but no I don't think you have sign an NDA. Under default configuration (as is the case with wsexport), VPS conveys no more user information than Toolforge does, and it's apparent that UAs are okay (within reason). Tool-Pageviews for instance used to automatically create Phabricator pastes whenever a user encountered an error, containing the stacktrace and user agent. This was done with permission from WMCS.

Anyway, I don't know how much you're going to gather from the UAs. In my experience with XTools, a lot of crawlers have very human-like UAs. The ones that don't can easily be blocked via apache config. What I usually do to find those is filter out keywords common to human UAs, such as with:

sudo cat /var/log/apache2/access.log | grep -v "Gecko" | grep -v "Safari" | grep -v "UptimeRobot"

Doing this I see there are a few bots in there, but not many. I did block one obvious crawler that was continually hitting us, but it didn't appear to be going at problematic speeds.

Crawlers as you know tend to click on every link on a page. I took advantage of this and built a crawler-detection thing in XTools, since some of the first links visible to a bot are the language dropdown links at the top-right https://xtools.wmflabs.org. No human will want to view XTools in all those different languages, one by one, in quick succession. So, in the application I throttle requests that contain ?uselang= in the URL and log the user agents/URIs/etc to a file. Almost always when I check this file the UAs look perfectly human. Basically all I'm saying is going by UAs isn't a super effective way to stop crawlers since only the responsible crawlers will use an identifiable UA.

That said, a quick look at our access logs still shows the occasional user (human or not) making the request repeatedly, over a short period of time. Just run sudo less /var/log/apache2/access.log, hold down [Enter] and look for patterns. You won't have to look far. I think the short-term caching proposed in T222936 along with the job queue (T253283) will do a great deal towards stability.

Thank you all for the inputs. I finally got toolforge access and is able to query the data. And note what I have found with the data we have now.
Currently table books_generated captures timestamp, language, title and format. With these info, we could not determine automated vs. human downloads yet.

SQL

SELECT * FROM books_generated

Table example

timelangtitleformat
2017-10-26 04:50:02.00slsomethingepub-3