
[Open question] Improve bot identification at scale
Open, Medium, Public

Description

Better identification of bots can help us obtain more reliable counts of pageviews by humans. This can improve the numbers used by the Comms team (T117221) and address the issues that Wikinews and some other communities experienced (T136084).

This task requires collaboration between Research-and-Data and Analytics.

Related Objects

Status     | Subtype    | Assigned
Open       |            | None
Resolved   |            | Nuria
Resolved   |            | None
Resolved   |            | Mholloway
Resolved   |            | Nuria
Resolved   |            | None
Duplicate  |            | None
Declined   |            | None
Declined   |            | Milimetric
Resolved   |            | elukey
Resolved   |            | Ottomata
Resolved   |            | JAllemandou
Declined   |            | Nuria
Duplicate  |            | Nuria
Open       |            | None
Resolved   |            | None
Declined   |            | None
Resolved   |            | Addshore
Open       |            | None
Resolved   |            | Nuria
Resolved   |            | Nuria
Resolved   |            | Nuria
Resolved   |            | None
Resolved   |            | None
Resolved   |            | Nuria
Resolved   |            | Nuria
Resolved   |            | JAllemandou
Resolved   |            | JAllemandou
Resolved   |            | JAllemandou
Resolved   |            | None
Resolved   |            | JAllemandou
Resolved   |            | JAllemandou
Resolved   |            | MusikAnimal
Resolved   |            | LGoto
Resolved   |            | JoeWalsh
Resolved   |            | None
Resolved   |            | None
Open       |            | None
Open       |            | None
Open       | BUG REPORT | None
Open       |            | None
Resolved   |            | SNowick_WMF
Open       |            | None

Event Timeline


@Tbayer: this ticket is for bot identification measures for bots that do not identify as such in their user agent.


@Nuria please ping us before locking this down as a goal for Q1 so we can set aside enough time for it.

It will be at some point next fiscal year.

@Nuria and team: I see that there is a tag for this task to be picked up in July-September 2017. If that is the case, please let me know and I will set aside time for it.

@leila: it will probably get bumped to after September 2017.

@Nuria I got you. Then Q2 it is, it seems. :)

leila renamed this task from "Improve bot identification at scale" to "[Open question] Improve bot identification at scale". May 29 2017, 6:28 PM

@Nuria This task came up again in a recent discussion with the Search Platform team. They are looking in a few directions to improve search, and it's hard to measure metric changes for these initiatives when the search data can be diluted with bot search data. Any chance your team wants to allocate some resources to this in Q3 and Q4, starting January 2018? If yes, I'm happy to spend some time pitching this as a research collaboration to a few people who can help us move forward on this front. We will need some of your time to sit with the student/faculty, though, so I'd say some of your time in Q3, and maybe more of your time in Q4 if we want to adopt an updated technology.

I linked this task from the Analytics goals (https://www.mediawiki.org/wiki/Wikimedia_Technology/Goals/2017-18_Q4#Program_7._Smart_tools_for_better_data), as it is the oldest and acts as a parent task to all the others.

Is this going to be carried forward into the 2018-19 annual plan? Improvements in this area would be very valuable for reader analytics in Audiences, too.


(I now understand from @Nuria that this is indeed the intention, after the project had to be postponed this FY due to major unplanned work in other areas.)

Note to self: consider referrer as a predictor of whether a request is from a bot (no referrer is most likely a direct hit). Completely unrelated but interesting read: https://bit.ly/2NCImQW
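On the referrer note above: a minimal sketch of how that signal could be turned into a model feature. The field name referer is an assumption mirroring typical webrequest logs, not the actual schema:

```python
# Hypothetical referrer feature for a bot classifier: direct hits
# (no referrer) are weak evidence of automated traffic.
def referrer_feature(request: dict) -> int:
    """Return 1 if the request has no referrer, else 0."""
    referer = (request.get("referer") or "").strip()
    return 1 if not referer else 0

print(referrer_feature({"referer": ""}))                         # 1 (bot-like)
print(referrer_feature({"referer": "https://www.google.com/"}))  # 0
```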

After conversation with @leila:

  • We will calculate a per-request probability that the request comes from a bot
  • It is hard to create a negative (human) set. We will need localized events that are spiky in nature (soccer matches? see the sketch below)
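To illustrate the spiky-events idea referenced above: assuming we had hourly pageview counts for a page tied to a localized event, hours whose traffic far exceeds the typical hour could serve as windows rich in human requests. The column names, numbers, and threshold are purely illustrative:

```python
import pandas as pd

def likely_human_windows(hourly: pd.DataFrame, spike_factor: float = 3.0) -> pd.Series:
    """Flag hours whose view count far exceeds the typical (median) hour."""
    return hourly["views"] > spike_factor * hourly["views"].median()

hourly = pd.DataFrame({
    "hour": pd.date_range("2018-06-01", periods=8, freq="h"),
    "views": [120, 115, 130, 118, 900, 950, 125, 122],  # spike during a match
})
print(hourly[likely_human_windows(hourly)])  # the two match hours
```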

@Nuria a few more thoughts:

  • You should decide whether you want to label bot-like human activity as bot or not. For example, if I'm playing a Wikipedia game such as https://thewikigame.com/, where the goal is to get from article A to article B as fast as possible, are the webrequests associated with my activity bot requests or human requests? :) The answer is likely context dependent: if you want to report human pageviews, you will probably want to keep these requests out, but if you are reporting human consumption vs. machine consumption, or considering different levels of service for users, you may want to count these as human requests, since human cognition and learning are involved.
  • Regarding building a balanced training set: for negative samples (human requests), there is information to be used in sessions (which you should construct from the webrequest logs), such as whether at least one request within the session includes edit information. Of course, edits can be made by bots as well, so some fine-tuning is needed there. For positive samples, you can use the userAgent information of bots that report themselves as such (see the sketch after this list).
  • Unless you find a creative and better way to create labels for a training set, at best you will be left with a sample of requests you can be sure are bots and a sample you can be sure are humans (if you can tell bot edits from non-bot edits). If you cannot differentiate bot from non-bot edits, you may end up with a second sample that has mixed labels. There are ways to handle this.
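A minimal sketch of the two labeling heuristics above, assuming simplified field names (user_agent, uri_query) and sessions already grouped from the webrequest logs; these names are illustrative, not the production schema:

```python
import re

# Bots that identify themselves usually carry "bot", "crawler", "spider",
# or a contact URL in the user agent.
BOT_UA = re.compile(r"bot|crawler|spider|https?://", re.IGNORECASE)

def label_session(requests: list[dict]) -> str | None:
    """Return 'bot', 'human', or None (unlabeled) for one session."""
    if any(BOT_UA.search(r.get("user_agent", "")) for r in requests):
        return "bot"    # positive sample: UA self-identifies as a bot
    if any("action=edit" in r.get("uri_query", "") for r in requests):
        return "human"  # negative sample: the session contains an edit
                        # (caveat: bot edits still need to be filtered out)
    return None         # neither signal: leave unlabeled

session = [
    {"user_agent": "Mozilla/5.0", "uri_query": "action=edit&title=Foo"},
    {"user_agent": "Mozilla/5.0", "uri_query": ""},
]
print(label_session(session))  # human
```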

@Nuria I recently ran into a paper which we may want to read for our next session. (I should be able to read it before the middle of next week, fyi.)

Nuria closed subtask Restricted Task as Declined.Nov 27 2019, 8:57 PM

Update on this: the code has been deployed and is running in shadow mode (meaning that end users do not yet see the results of the bot/no-bot classification).

Nuria closed subtask Restricted Task as Resolved.May 21 2020, 2:56 PM

I'm sorry if I'm stating the obvious, but this really looks like a classic case of anomaly detection, which is a dedicated subfield of ML. The aforementioned paper utilizes classic anomaly detection as well. There are several ways to look at it, such as fitting a multivariate Gaussian distribution, or doing a dimensionality reduction (via your favorite tool) and then measuring the distance between each original vector and its reconstruction (this is outlined in depth in Hands-On Unsupervised Learning Using Python and Feature Engineering for Machine Learning).
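To make the dimensionality-reduction variant concrete, a hedged sketch using PCA reconstruction error; the feature matrix here is random stand-in data, whereas in practice X would hold engineered per-request or per-session features (request rate, referrer presence, and so on):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))  # stand-in for engineered features
X[:5] += 8.0                     # inject a few "bot-like" outlier rows

# Fit on traffic presumed mostly normal, then score every row by how
# poorly it is reconstructed from the low-dimensional projection.
pca = PCA(n_components=5).fit(X[5:])
X_hat = pca.inverse_transform(pca.transform(X))
errors = np.linalg.norm(X - X_hat, axis=1)

threshold = np.percentile(errors, 99)      # flag the top 1% as anomalous
print(np.flatnonzero(errors > threshold))  # includes rows 0-4
```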

It should be rather doable to build a semi-automated pipeline: train a model (based on samples) and deploy it, then retrain and redeploy the next day, and so on, automatically.

@Ladsgroup indeed, this is by no means an unsolvable problem at the theoretical level (and there is an existing solution for it in place). However, depending on the specifications and requirements we put on the model (which mainly come from potential compute resource limitations and/or use-cases), updates will require careful model design and testing. A few scenarios that make this problem exciting are:

  • How fast do we need the result of the bot/no-bot model? More specifically: should we use a streaming algorithm on the requests as they come in, or can we afford to store the data and then analyze it? (With hundreds of thousands of requests per second, either approach has its own advantages and disadvantages; a hypothetical streaming sketch follows at the end of this comment.)
  • Labeled data: we know which webrequests are from bots that identify themselves as bots. The rest are effectively unlabeled. The training data-set has to be carefully built, and there will always be approximations and assumptions that will not hold true over the whole of the data. Building a good training data-set will continue to demand careful choices and considerations. :)
  • Currently the assumption is that whatever we do should work for all projects and languages.

These are a few considerations; there are more. Again, I emphasize that this is not unsolvable, and we have solved it in the past to some extent, but these models will need improvements, and that needs some time and focus. :)
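For the streaming scenario in the first bullet above, a purely hypothetical sketch of what per-client online state could look like: an exponentially decaying request rate, so memory stays bounded by the number of active clients rather than the full request log. The key format, half-life, and threshold are illustrative, not tuned values:

```python
import time
from collections import defaultdict

HALF_LIFE_S = 60.0  # a client's rate halves after a minute of silence

class DecayingRate:
    def __init__(self):
        self.rate = 0.0
        self.last = time.monotonic()

    def hit(self) -> float:
        """Decay the stored rate, count this request, and return the rate."""
        now = time.monotonic()
        self.rate *= 0.5 ** ((now - self.last) / HALF_LIFE_S)
        self.rate += 1.0
        self.last = now
        return self.rate

rates = defaultdict(DecayingRate)

def looks_bot_like(client_key: str, threshold: float = 100.0) -> bool:
    """True if this client is currently requesting implausibly fast."""
    return rates[client_key].hit() > threshold

# Example: hammering from one client trips the threshold quickly.
print(any(looks_bot_like("198.51.100.7#ua-hash") for _ in range(200)))  # True
```

The trade-off is that a streaming model only sees coarse, incremental features, while the store-then-analyze route can compute richer session-level features at the cost of storage and latency.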

@leila I totally agree that it's a complex problem, but I hope we can move forward with improving the quality of our data and metrics in some shape or form.

Regarding labelled data: for unsupervised ML you don't really need labelled data; you need some examples for the feature engineering, but nothing beyond that. Am I missing something?

@Ladsgroup I'm with you on the overall theme that better bot detection is possible and an important project to work on. :)

@odimitrijevic I saw your note in T333950#8765538. As I have mentioned to you before: I'm with you that we should tackle it. In the last Tech Steering prioritization meeting, the decision was communicated that we are not prioritizing this work for the April-June quarter (Q4). As a result, I did not prioritize this work with my team. Is it fair to say you want us to attempt to prioritize it in July-September (Q1), which means we will need Joseph or someone else from your team as well?

Gehel subscribed.

Removing DPE SRE until there is more clarity on what to do and whether we need to be involved.