
[Open question] Improve bot identification at scale
Open, Medium, Public

Description

Better identification of bots can help us produce more reliable human-pageview numbers. This can improve the numbers used by the Comms team (T117221) and address the issues that Wikinews and some other communities experienced (T136084).

This task requires collaboration between Research-and-Data and Analytics.


Event Timeline

There are a very large number of changes, so older changes are hidden.
Milimetric triaged this task as Medium priority. Aug 1 2016, 4:45 PM
Milimetric moved this task from Incoming to Event Platform on the Analytics board.

@Tbayer: this ticket is for bot identification measures for bots that do not identify themselves as such in the user agent.

This comment was removed by leila.

@Nuria please ping us before locking this down as a goal for Q1 so we can set aside enough time for it.

It will be at some point next fiscal.

@Nuria and team: I see that there is a tag for this task to be picked up in July-September 2017. If that is the case, please let me know and I will set aside time for it.

@leila: it will probably get bumped to after September 2017.

@Nuria I got you. Then Q2 it is, it seems. :)

leila renamed this task from Improve bot identification at scale to [Open question] Improve bot identification at scale. May 29 2017, 6:28 PM

@Nuria This task came up again, in a recent discussion with Search Platform team. They are looking in a few directions to improve search and it's hard to measure metrics changes for these initiatives when the search data can be diluted with bot search data. Any chance your team wants to allocate some resources to this on Q3 and Q4, starting January 2018? If yes, I'm happy to spend some time to pitch this as a research collaboration to a few people who can help us move forward on this front. We need some of your time to sit with the student/faculty though, so I'd say, some of your time in Q3, and maybe more of your time in Q4 if we want to adopt an updated technology.

I linked this task from Analytics goals (https://www.mediawiki.org/wiki/Wikimedia_Technology/Goals/2017-18_Q4#Program_7._Smart_tools_for_better_data), as it has been the oldest and acts like a parent task to all others.

Is this going to be carried forward into the 2018-19 annual plan? Improvements in this area would be very valuable for reader analytics in Audiences, too.

(I now understand from @Nuria that this is indeed the intention, after the project had to be postponed this FY due to major unplanned work in other areas.)

Note to self: consider referrer as a predictor of whether a request is from a bot (no referrer most likely indicates a direct hit). Completely unrelated but interesting read: https://bit.ly/2NCImQW
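A minimal sketch of the referrer idea above: treat a missing or empty referrer as one binary bot signal. The field name `referer` and the `-` placeholder for "no referrer" are assumptions about the log format, not the actual webrequest schema.

```python
def referrer_feature(request: dict) -> int:
    """Return 1 if the request has no referrer (a possible bot signal), else 0.

    Assumes a log record dict with an optional "referer" key, where an
    absent, empty, or "-" value means no referrer was sent.
    """
    referer = (request.get("referer") or "").strip()
    return 1 if referer in ("", "-") else 0


# Illustrative records, not real webrequest data:
requests = [
    {"referer": "https://en.wikipedia.org/wiki/Main_Page"},  # has referrer
    {"referer": "-"},                                        # placeholder
    {},                                                      # field absent
]
print([referrer_feature(r) for r in requests])  # [0, 1, 1]
```

On its own this feature is weak (many humans also arrive via direct hits, e.g. from bookmarks), so it would only make sense as one input among several.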

After conversation with @leila:

  • Will calculate a probability per request of whether it comes from a bot
  • It is hard to create a negative set. We will need localized events that are spiky in nature; soccer matches?
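To make the "probability per request" idea concrete, here is a toy logistic scorer over binary features. The feature names and weights are invented for illustration; in practice the weights would be learned from labeled data, which is exactly the training-set problem discussed in this thread.

```python
import math

# Hand-picked illustrative weights; a real model would learn these.
WEIGHTS = {"no_referrer": 1.5, "high_request_rate": 2.0}
BIAS = -2.0  # prior pushing toward "human" when no signals fire


def bot_probability(features: dict) -> float:
    """Logistic score in (0, 1): estimated probability the request is a bot.

    `features` maps feature names to 0/1 values, e.g.
    {"no_referrer": 1, "high_request_rate": 0}.
    """
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


print(bot_probability({"no_referrer": 0, "high_request_rate": 0}))  # low score
print(bot_probability({"no_referrer": 1, "high_request_rate": 1}))  # high score
```

A per-request probability, rather than a hard label, leaves the thresholding decision to each downstream use (pageview reporting vs. rate limiting, say).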

@Nuria a few more thoughts:

  • You should decide whether you want to label bot-like human activity as bot. For example, if I'm playing a Wikipedia game such as https://thewikigame.com/ where the goal is to get from article A to article B as fast as possible, are the webrequests associated with my activity bot requests or human requests? :) The answer is perhaps context dependent: if you want to report human pageviews, you will likely want to keep these requests out, but if you're reporting human vs. machine consumption, or considering offering different levels of service to users, you may want to count these as human requests, because there is human cognition and learning involved.
  • Regarding building a balanced training set: for negative samples (human requests), there is information in sessions (which you should build from the webrequest logs), e.g., whether at least one request within the session includes edit information. Of course, edits can be made by bots as well, so some fine-tuning is needed there. For positive samples, you can use the user-agent information of bots that report themselves as such.
  • Unless you find a creative, better way to create labels for a training set, at best you will be left with a sample of requests you can be sure are bots and a sample you can be sure are humans (if you can tell bot edits from non-bot edits). If you cannot make that differentiation, you may end up with a second sample that has mixed labels. There are ways to handle this.
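The labeling heuristic described in the bullets above could be sketched as follows. The user-agent regex and the record fields (`user_agent`, `session_has_edit`) are illustrative assumptions, not the real schema or the real detection rules.

```python
import re

# Crude pattern for bots that self-identify in the user agent (assumption:
# real detection would use a maintained UA-parsing list, not one regex).
SELF_REPORTED_BOT = re.compile(r"bot|crawler|spider", re.IGNORECASE)


def label(request: dict):
    """Return 'bot', 'human', or None (unlabeled) for a training sample."""
    if SELF_REPORTED_BOT.search(request.get("user_agent", "")):
        return "bot"            # positive sample: UA self-identifies as a bot
    if request.get("session_has_edit"):
        return "human"          # negative sample: session contains an edit
                                # (caveat from above: edits can be by bots too)
    return None                 # leave unlabeled rather than guess


print(label({"user_agent": "Googlebot/2.1"}))                           # bot
print(label({"user_agent": "Mozilla/5.0", "session_has_edit": True}))   # human
print(label({"user_agent": "Mozilla/5.0"}))                             # None
```

Returning `None` for the ambiguous middle matches the point above: most traffic would stay unlabeled, and the model would be trained only on the two confident samples.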
Nuria added a subtask: Restricted Task. Oct 4 2018, 9:39 PM

@Nuria I recently ran into this paper which we may want to read for our next session:

. (I should be able to read it before the middle of next week, fyi.)

Nuria closed subtask Restricted Task as Declined. Nov 27 2019, 8:57 PM

Update on this: the code has been deployed and is running in shadow mode (meaning that end users do not yet see the results of the bot/non-bot classification).

Nuria closed subtask Restricted Task as Resolved. May 21 2020, 2:56 PM