
[Open question] Improve bot identification at scale
Open, MediumPublic


Better identification of bots would give us more reliable human pageview numbers. This would improve the numbers used by Comms teams (T117221) and help address the issues that Wikinews and some other communities experienced (T136084).

This task requires collaboration between Research-and-Data and Analytics.

Related Objects

Resolved elukey

Event Timeline

leila created this task.Jun 20 2016, 10:25 AM
Restricted Application added subscribers: Zppix, Aklapper.Jun 20 2016, 10:25 AM
Milimetric triaged this task as Medium priority.Aug 1 2016, 4:45 PM
Milimetric moved this task from Incoming to Modern Event Platform on the Analytics board.

@Tbayer: this ticket is about identifying bots that do not identify themselves as such in their user agent.

leila added a comment.Feb 18 2017, 3:17 PM
This comment was removed by leila.
leila added a comment.Feb 18 2017, 3:18 PM

@Nuria please ping us before locking down this as a goal for Q1 so we can set aside enough time for it.

Nuria added a comment.Feb 19 2017, 6:10 AM

It will be at some point next fiscal year.

Nuria added a subtask: Restricted Task.Apr 25 2017, 3:11 PM
leila edited projects, added Research; removed Research-Backlog.Apr 25 2017, 3:41 PM
leila added a comment.May 29 2017, 5:49 PM

@Nuria and team: I see that there is a tag for this task to be picked up in July-September 2017. If that is the case, please let me know and I will set aside time for it.

Nuria added a comment.May 29 2017, 6:08 PM

@leila: it will probably get bumped to after September 2017.

leila added a comment.May 29 2017, 6:28 PM

@Nuria I got you. Then Q2 it is, it seems. :)

leila renamed this task from Improve bot identification at scale to [Open question] Improve bot identification at scale.May 29 2017, 6:28 PM
Akeron added a subscriber: Akeron.Jul 22 2017, 8:00 AM
leila added a comment.Nov 28 2017, 6:13 PM

@Nuria This task came up again in a recent discussion with the Search Platform team. They are looking in a few directions to improve search, and it's hard to measure metric changes for these initiatives when the search data can be diluted with bot search data. Any chance your team wants to allocate some resources to this in Q3 and Q4, starting January 2018? If yes, I'm happy to spend some time pitching this as a research collaboration to a few people who can help us move forward on this front. We would need some of your time to sit with the student/faculty, though: some of your time in Q3, and maybe more in Q4 if we want to adopt an updated technology.

Nuria moved this task from Wikistats Production to Bots on the Analytics board.Feb 5 2018, 5:45 PM

I linked this task from Analytics goals, as it has been the oldest and acts like a parent task to all others.

Is this going to be carried forward into the 2018-19 annual plan? Improvements in this area would be very valuable for reader analytics in Audiences, too.

(I now understand from @Nuria that this is indeed the intention, after the project had to be postponed this FY due to major unplanned work in other areas.)

Nuria added a comment.Jul 10 2018, 3:45 PM

Note to self: consider referrer as a predictor of whether a request is from a bot (a request with no referrer is most likely a direct hit). Completely unrelated but interesting read:

After conversation with @leila:

  • Will calculate probability per request of whether this is a bot
  • It is hard to create a negative set. We will need localized events spiky in nature, soccer matches?
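
A minimal sketch of the per-request probability idea, using the missing-referrer signal from the note above as the only feature. The weights, the `referer` field name, and the convention that an absent referrer appears as missing, empty, or "-" are all assumptions made for illustration; a real model would learn its weights from labeled requests and use more features.

```python
import math

# Hypothetical weights, for illustration only; a real model would learn
# them from labeled requests and use many more features.
BIAS = -2.0
W_NO_REFERRER = 1.5


def bot_probability(request: dict) -> float:
    """Return a score in (0, 1) that a single request comes from a bot."""
    referrer = request.get("referer")
    # Assumption: an absent referrer shows up as missing, empty, or "-".
    no_referrer = referrer in (None, "", "-")
    z = BIAS + (W_NO_REFERRER if no_referrer else 0.0)
    return 1.0 / (1.0 + math.exp(-z))  # logistic function
```

With these placeholder weights, a request without a referrer scores higher than one with a referrer, but neither score is a hard label; any threshold would still need to be chosen.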
leila added a comment.Jul 17 2018, 2:21 AM

@Nuria a few more thoughts:

  • You should decide whether to label bot-like human activity as bot or not. For example, if I'm playing a Wikipedia game where the goal is to get from article A to article B as fast as possible, are the webrequests associated with my activity bot requests or human requests? :) The answer is perhaps context dependent: if you want to report human pageviews, you will likely want to keep these requests out, but if you're reporting human consumption vs. machine consumption, or if you consider offering different levels of service to users, you'd perhaps want to count these as human requests because there is human cognition and learning involved.
  • Regarding building a balanced training set: for finding negative samples (human requests), you can use session information (you would need to build the sessions from webrequest logs) and check whether at least one request within the session includes Edit information. Of course, edits can be made by bots as well, so some fine-tuning is needed there. For positive samples, you can use the userAgent information of bots that report themselves as such.
  • Unless you find a creative and better way to create labels for a training set, at best you will be left with a sample of requests that you can be sure are bots and a sample that you can be sure are humans (if you can tell bot from non-bot edits). If you can't differentiate bot from non-bot edits, you may end up with a second sample that has mixed labels. There are ways to handle this.
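
The labeling scheme in the bullets above could be sketched roughly as follows. This assumes sessions have already been built from webrequest logs as lists of request dicts; the `user_agent` and `is_edit` keys and the self-identification pattern are hypothetical names chosen for the example, and the sketch does not attempt the bot-edit fine-tuning mentioned above.

```python
import re
from typing import Optional

# Assumed pattern for user agents that report themselves as bots.
SELF_IDENTIFIED_BOT = re.compile(r"bot|crawler|spider", re.IGNORECASE)


def label_session(requests: list[dict]) -> Optional[str]:
    """Label one session as 'bot', 'human', or None (left unlabeled)."""
    # Positive sample: any request whose user agent self-identifies as a bot.
    if any(SELF_IDENTIFIED_BOT.search(r.get("user_agent") or "") for r in requests):
        return "bot"
    # Negative sample: at least one request in the session is an edit.
    # Caveat from the thread: bots edit too, so this label is noisy.
    if any(r.get("is_edit") for r in requests):
        return "human"
    # Everything else stays out of the training set.
    return None
```

Sessions that return `None` are the ones without a trustworthy label; they would be excluded from training or handled with a method that tolerates mixed labels.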
Nuria added a subtask: Restricted Task.Oct 4 2018, 9:39 PM
leila added a comment.Nov 6 2018, 7:59 PM

@Nuria I recently ran into this paper which we may want to read for our next session:

(I should be able to read it before the middle of next week, fyi.)

leila moved this task from Staged to In Progress on the Research board.Jan 22 2019, 4:32 PM
leila moved this task from In Progress to Staged on the Research board.Apr 1 2019, 11:36 PM
leila edited projects, added Research-Backlog; removed Research.Jul 11 2019, 3:47 PM
Nuria closed subtask Restricted Task as Declined.Nov 27 2019, 8:57 PM