
[Open question] Improve bot identification at scale
Open, Medium, Public

Description

Better identification of bots can help us obtain more reliable counts of pageviews by humans. This can improve the numbers used by the Comms team (T117221) and address the issues that Wikinews and some other communities experienced (T136084).

This task requires collaboration between Research-and-Data and Analytics.

Related Objects

Status     | Subtype    | Assigned
Open       |            | None
Resolved   |            | Nuria
Resolved   |            | None
Resolved   |            | Mholloway
Resolved   |            | Nuria
Resolved   |            | None
Duplicate  |            | None
Declined   |            | None
Declined   |            | Milimetric
Resolved   |            | elukey
Resolved   |            | Ottomata
Resolved   |            | JAllemandou
Declined   |            | Nuria
Duplicate  |            | Nuria
Open       |            | None
Resolved   |            | None
Declined   |            | None
Resolved   |            | Addshore
Open       |            | None
Resolved   |            | Nuria
Resolved   |            | Nuria
Resolved   |            | Nuria
Resolved   |            | None
Resolved   |            | None
Resolved   |            | Nuria
Resolved   |            | Nuria
Resolved   |            | JAllemandou
Resolved   |            | JAllemandou
Resolved   |            | JAllemandou
Resolved   |            | None
Resolved   |            | JAllemandou
Resolved   |            | JAllemandou
Resolved   |            | MusikAnimal
Resolved   |            | LGoto
Resolved   |            | JoeWalsh
Resolved   |            | None
Resolved   |            | None
Open       |            | None
Open       |            | None
Open       | BUG REPORT | None
Open       |            | None
Resolved   |            | SNowick_WMF
Open       |            | None

Event Timeline


@Tbayer: this ticket is for bot identification measures for bots that do not identify as such in their user agent.


@Nuria please ping us before locking this down as a goal for Q1 so we can set aside enough time for it.

It will be at some point next fiscal year.

@Nuria and team: I see that there is a tag for this task to be picked up in July-September 2017. If that is the case, please let me know and I will set aside time for it.

@leila: it will probably get bumped to after September 2017.

@Nuria I got you. Then Q2 it is, it seems. :)

leila renamed this task from "Improve bot identification at scale" to "[Open question] Improve bot identification at scale". May 29 2017, 6:28 PM

@Nuria This task came up again in a recent discussion with the Search Platform team. They are looking in a few directions to improve search, and it's hard to measure metric changes for these initiatives when the search data can be diluted with bot search data. Any chance your team wants to allocate some resources to this in Q3 and Q4, starting January 2018? If yes, I'm happy to spend some time pitching this as a research collaboration to a few people who can help us move forward on this front. We will need some of your time to sit with the student/faculty, though, so I'd say some of your time in Q3, and maybe more of your time in Q4 if we want to adopt an updated technology.

I linked this task from the Analytics goals (https://www.mediawiki.org/wiki/Wikimedia_Technology/Goals/2017-18_Q4#Program_7._Smart_tools_for_better_data), as it is the oldest and acts as a parent task to all the others.

Is this going to be carried forward into the 2018-19 annual plan? Improvements in this area would be very valuable for reader analytics in Audiences, too.


(I now understand from @Nuria that this is indeed the intention, after the project had to be postponed this FY due to major unplanned work in other areas.)

Note to self: consider referrer as a predictor of whether a request is from a bot (no referrer is most likely a direct hit). Completely unrelated but interesting read: https://bit.ly/2NCImQW
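On the referrer note above: a minimal sketch of how that signal could be turned into a model feature. The field name referer is an assumption mirroring typical webrequest logs, not the actual schema:

```python
# Hypothetical referrer feature for a bot classifier: direct hits
# (no referrer) are weak evidence of automated traffic.
def referrer_feature(request: dict) -> int:
    """Return 1 if the request has no referrer, else 0."""
    referer = (request.get("referer") or "").strip()
    return 1 if not referer else 0

print(referrer_feature({"referer": ""}))                         # 1 (bot-like)
print(referrer_feature({"referer": "https://www.google.com/"}))  # 0
```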

After conversation with @leila:

  • We will calculate a per-request probability that the request comes from a bot
  • It is hard to create a negative (human) set. We will need localized events that are spiky in nature (soccer matches? see the sketch below)
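To illustrate the spiky-events idea referenced above: assuming we had hourly pageview counts for a page tied to a localized event, hours whose traffic far exceeds the typical hour could serve as windows rich in human requests. The column names, numbers, and threshold are purely illustrative:

```python
import pandas as pd

def likely_human_windows(hourly: pd.DataFrame, spike_factor: float = 3.0) -> pd.Series:
    """Flag hours whose view count far exceeds the typical (median) hour."""
    return hourly["views"] > spike_factor * hourly["views"].median()

hourly = pd.DataFrame({
    "hour": pd.date_range("2018-06-01", periods=8, freq="h"),
    "views": [120, 115, 130, 118, 900, 950, 125, 122],  # spike during a match
})
print(hourly[likely_human_windows(hourly)])  # the two match hours
```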

@Nuria a few more thoughts:

  • You should decide whether you want to label bot-like human activity as bot or not. For example, if I'm playing a Wikipedia game such as https://thewikigame.com/, where the goal is to get from article A to article B as fast as possible, are the webrequests associated with my activity bot requests or human requests? :) The answer is likely context dependent: if you want to report human pageviews, you will probably want to keep these requests out, but if you are reporting human consumption vs. machine consumption, or considering different levels of service for users, you may want to count these as human requests, since human cognition and learning are involved.
  • Regarding building a balanced training set: for negative samples (human requests), there is information to be used in sessions (which you should construct from the webrequest logs), such as whether at least one request within the session includes edit information. Of course, edits can be made by bots as well, so some fine-tuning is needed there. For positive samples, you can use the userAgent information of bots that report themselves as such (see the sketch after this list).
  • Unless you find a creative and better way to create labels for a training set, at best you will be left with a sample of requests you can be sure are bots and a sample you can be sure are humans (if you can tell bot edits from non-bot edits). If you cannot differentiate bot from non-bot edits, you may end up with a second sample that has mixed labels. There are ways to handle this.
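A minimal sketch of the two labeling heuristics above, assuming simplified field names (user_agent, uri_query) and sessions already grouped from the webrequest logs; these names are illustrative, not the production schema:

```python
import re

# Bots that identify themselves usually carry "bot", "crawler", "spider",
# or a contact URL in the user agent.
BOT_UA = re.compile(r"bot|crawler|spider|https?://", re.IGNORECASE)

def label_session(requests: list[dict]) -> str | None:
    """Return 'bot', 'human', or None (unlabeled) for one session."""
    if any(BOT_UA.search(r.get("user_agent", "")) for r in requests):
        return "bot"    # positive sample: UA self-identifies as a bot
    if any("action=edit" in r.get("uri_query", "") for r in requests):
        return "human"  # negative sample: the session contains an edit
                        # (caveat: bot edits still need to be filtered out)
    return None         # neither signal: leave unlabeled

session = [
    {"user_agent": "Mozilla/5.0", "uri_query": "action=edit&title=Foo"},
    {"user_agent": "Mozilla/5.0", "uri_query": ""},
]
print(label_session(session))  # human
```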

@Nuria I recently ran into a paper which we may want to read for our next session. (I should be able to read it before the middle of next week, fyi.)

Nuria closed subtask Restricted Task as Declined.Nov 27 2019, 8:57 PM

Update on this: the code has been deployed and is running in shadow mode (meaning that end users do not yet see the results of the bot/no-bot classification).

Nuria closed subtask Restricted Task as Resolved.May 21 2020, 2:56 PM

I'm sorry if I'm stating the obvious, but this really looks like a classic case of anomaly detection, which is a dedicated subfield of ML. The aforementioned paper utilizes classic anomaly detection as well. There are several ways to look at it, such as fitting a multivariate Gaussian distribution, or doing a dimensionality reduction (via your favorite tool) and then measuring the distance between each original vector and its reconstruction (this is outlined in depth in Hands-On Unsupervised Learning Using Python and Feature Engineering for Machine Learning).
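To make the dimensionality-reduction variant concrete, a hedged sketch using PCA reconstruction error; the feature matrix here is random stand-in data, whereas in practice X would hold engineered per-request or per-session features (request rate, referrer presence, and so on):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))  # stand-in for engineered features
X[:5] += 8.0                     # inject a few "bot-like" outlier rows

# Fit on traffic presumed mostly normal, then score every row by how
# poorly it is reconstructed from the low-dimensional projection.
pca = PCA(n_components=5).fit(X[5:])
X_hat = pca.inverse_transform(pca.transform(X))
errors = np.linalg.norm(X - X_hat, axis=1)

threshold = np.percentile(errors, 99)      # flag the top 1% as anomalous
print(np.flatnonzero(errors > threshold))  # includes rows 0-4
```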

It should be rather doable to build a semi-automated pipeline: train a model (based on samples) and deploy it, then retrain and redeploy the next day, and so on, automatically.

@Ladsgroup indeed, this is by no means an unsolvable problem at the theoretical level (and there is an existing solution for it in place). However, depending on the specifications and requirements we put on the model (which mainly come from potential compute resource limitations and/or use-cases), updates will require careful model design and testing. A few scenarios that make this problem exciting are:

  • How fast do we need the result of the bot/no-bot model? More specifically: should we use a streaming algorithm on the requests as they come in, or can we afford to store the data and then analyze it? (With hundreds of thousands of requests per second, either approach has its own advantages and disadvantages; a hypothetical streaming sketch follows at the end of this comment.)
  • Labeled data: we know which webrequests are from bots that identify themselves as bots. The rest are effectively unlabeled. The training data-set has to be carefully built, and there will always be approximations and assumptions that will not hold true over the whole of the data. Building a good training data-set will continue to demand careful choices and considerations. :)
  • Currently the assumption is that whatever we do should work for all projects and languages.

These are a few considerations; there are more. Again, I emphasize that this is not unsolvable, and we have solved it in the past to some extent, but these models will need improvements, and that needs some time and focus. :)
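For the streaming scenario in the first bullet above, a purely hypothetical sketch of what per-client online state could look like: an exponentially decaying request rate, so memory stays bounded by the number of active clients rather than the full request log. The key format, half-life, and threshold are illustrative, not tuned values:

```python
import time
from collections import defaultdict

HALF_LIFE_S = 60.0  # a client's rate halves after a minute of silence

class DecayingRate:
    def __init__(self):
        self.rate = 0.0
        self.last = time.monotonic()

    def hit(self) -> float:
        """Decay the stored rate, count this request, and return the rate."""
        now = time.monotonic()
        self.rate *= 0.5 ** ((now - self.last) / HALF_LIFE_S)
        self.rate += 1.0
        self.last = now
        return self.rate

rates = defaultdict(DecayingRate)

def looks_bot_like(client_key: str, threshold: float = 100.0) -> bool:
    """True if this client is currently requesting implausibly fast."""
    return rates[client_key].hit() > threshold

# Example: hammering from one client trips the threshold quickly.
print(any(looks_bot_like("198.51.100.7#ua-hash") for _ in range(200)))  # True
```

The trade-off is that a streaming model only sees coarse, incremental features, while the store-then-analyze route can compute richer session-level features at the cost of storage and latency.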

@leila I totally agree that it's a complex problem, but I hope we can move forward with improving the quality of our data and metrics in some shape or form.

Regarding labelled data: for unsupervised ML you don't really need labelled data; you need some examples for the feature engineering, but nothing beyond that. Am I missing something?

@Ladsgroup I'm with you on the overall theme that better bot detection is possible and an important project to work on. :)

@odimitrijevic I saw your note in T333950#8765538. As I have mentioned to you before: I'm with you that we should tackle it. In the last Tech Steering prioritization meeting, the decision was communicated that we are not prioritizing this work for the April-June quarter (Q4). As a result, I did not prioritize this work with my team. Is it fair to say you want us to attempt to prioritize it in July-September (Q1), which means we will need Joseph or someone else from your team as well?

Gehel subscribed.

Removing DPE SRE until there is more clarity on what to do and whether we need to be involved.