
Change the agent_type UDF to have three possible outputs: spider, bot, user {hawk} [13 pts]
Closed, Resolved, Public

Description

Create a Convention for Constructive bots:

  • if user agent contains (eg. WikiBot) then mark agent_type = bot

Change the agent_type classification to include an exception for the convention above ^ (eg. WikiBot)
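A minimal sketch of the three-way classification described above, assuming Python for illustration. The marker `WikiBot` is only the example given in the description, and `SPIDER_PATTERN` is a hypothetical stand-in for the existing spider detection, not the actual rules used by the UDF.

```python
import re

# Hypothetical convention marker; the description only gives "WikiBot" as an example.
CONVENTION_MARKER = "WikiBot"

# Hypothetical stand-in for the existing spider detection rules.
SPIDER_PATTERN = re.compile(r"bot|crawler|spider", re.IGNORECASE)


def agent_type(user_agent):
    """Return 'bot', 'spider' or 'user' for a raw User-Agent string."""
    ua = user_agent or ""
    # The convention check runs first, so it acts as an exception
    # to the spider classification.
    if CONVENTION_MARKER in ua:
        return "bot"
    if SPIDER_PATTERN.search(ua):
        return "spider"
    return "user"
```

Checking the convention before the spider rules matters here: a user agent such as `DotNetWikiBot/3.14` would otherwise be caught by any case-insensitive `bot` rule and end up as a spider.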

Event Timeline

Milimetric raised the priority of this task from to Needs Triage.
Milimetric updated the task description.
Milimetric added a project: Analytics-Backlog.
Milimetric added a subscriber: Milimetric.
Milimetric triaged this task as Medium priority. (Aug 10 2015, 5:26 PM)
Milimetric updated the task description.
Milimetric set Security to None.
Milimetric moved this task from Incoming to Tasked on the Analytics-Backlog board.

Quick analysis over a recent hour using three regexps: '(?i).*bot.*', '(?i).*crawler.*', and the one defined by Bob (link in the task description).

  • Number of webrequests for that hour: 259664735
  • 8437330 (3.25%) are identified as agent_type = spider
  • 3288125 (1.27%) would newly be flagged as spiders by Bob's regexp
  • 1028759 (0.40%) would newly be flagged as spiders by '(?i).*bot.*' regexp
  • 55023 (0.02%) would newly be flagged as spiders by '(?i).*crawler.*'

Overall, if we were to use the three regexps in addition to what we currently have, the coverage would increase by 52%, from 3.25% to 4.93% of overall webrequests (still a small number).

For reference, I attach the top 200 user_agents that would be newly classified as bots if we were to use the three regexps (and which of the regexps each one matches).

Next thing to do: try to repackage ua-parser with new regexps and see if it changes something.

Longer term, we should invest in bot detection using behavioral patterns (heuristics and/or ML).

if user agent contains (eg. WikiBot) then mark agent_type = bot

Sounds good.

Overall, if we were to use the three regexps in addition to what we currently have, the coverage would increase by 52%, from 3.25% to 4.93% of overall webrequests (still a small number).

Mmm... trying to catch bots with regexes seems like trying to empty the sea with a bucket. In my opinion, adding new regexes does not buy us much.

Agreed and agreed. A couple more heuristics I've found useful:

  1. Looking for an email address and/or URL. This would need to be tested extensively but it's pretty robust.
  2. Even MORE robust: if there's no user agent. This is 99.9% of the time some ass who hasn't read our API guidelines.
  3. Regexes based around common language- or platform-specific default HTTP requestors (curl, wget, libwww-perl, Twisted PageGetter, etc). See https://github.com/wikimedia-research/ZeroPlusPlus/blob/70a9a15ebddc17398c696c16bc13756cde140120/is_automata.cpp for some experimental work around this.
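
The three heuristics above could be sketched roughly as follows. This is an illustration in Python, not the approach of the linked is_automata.cpp; the patterns, the function name `looks_automated`, and treating "-" as a missing user agent are all assumptions that would need the extensive testing mentioned above.

```python
import re

# Hypothetical patterns: an email address or an http(s) URL inside the user agent.
CONTACT_INFO = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+|https?://\S+", re.IGNORECASE)

# Hypothetical list of default HTTP clients, taken from the examples in the comment.
DEFAULT_CLIENTS = re.compile(r"curl|wget|libwww-perl|twisted pagegetter", re.IGNORECASE)


def looks_automated(user_agent):
    """Return the name of the heuristic that fired, or None if none did."""
    # Heuristic 2: no user agent at all ("-" as a placeholder is an assumption).
    if user_agent is None or user_agent.strip() in ("", "-"):
        return "missing user agent"
    # Heuristic 1: the user agent carries an email address and/or a URL.
    if CONTACT_INFO.search(user_agent):
        return "contact info"
    # Heuristic 3: default user agent of a common HTTP library or tool.
    if DEFAULT_CLIENTS.search(user_agent):
        return "default HTTP client"
    return None
```

For example, `looks_automated("WikiBot(http://www.servicesonrequest.com/)")` reports the contact-info heuristic, while a typical browser user agent returns `None`.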

Question: when you say "repackage", do you mean "grab the new regex file" or "add our own"? Because if the latter, you realise one of the ua-parser maintainers works here, right? ;p

Thanks to @Milimetric, I corrected a typo in my usage of Bob's regexp; it now covers more than in the previous report.
The numbers in the table below are additive (no overlap among them).

| Bot type | Number of webrequests | Percentage of webrequests | Distinct user_agents | Percentage of user agents | Note |
| --- | --- | --- | --- | --- | --- |
| Total (bot + no bot) | 259664735 | 100% | 420809 | 100% | |
| no bot match | 245756777 | 94.64% | 411032 | 97.68% | |
| agent_type = spider | 8437330 | 3.25% | 3932 | 0.93% | Currently applied method |
| Bob's regexp | 5390580 | 2.08% | 5643 | 1.34% | Including previous rows overlapping data: 5.22% of requests and 2.26% of user agents --> This is the method with the widest coverage in the test. |
| '(?i).*bot.*' regexp | 25026 | 0.01% | 169 | 0.04% | Including previous rows overlapping data: 3.18% of requests and 0.99% of user agents |
| '(?i).*crawler.*' regexp | 55022 | 0.02% | 33 | 0.01% | Including previous rows overlapping data: 0.10% of requests and 0.2% of user agents |

Overall, with Bob's regexp updated with the case-insensitive flag in some places, the coverage would increase:

  • by 64% (from 3.25% to 5.36%) of overall webrequests.
  • by 148% (from 0.93% to 2.32%) of distinct user agents.
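
As a cross-check, the coverage figures can be recomputed from the additive rows of the table. This is a sketch in Python; the request count for Bob's regexp is taken as 5390580, the value that makes the rows sum exactly to the total, and the variable names are illustrative.

```python
# (webrequests, distinct user agents) per additive row of the table.
total_requests, total_agents = 259664735, 420809
rows = {
    "no bot match": (245756777, 411032),
    "agent_type = spider": (8437330, 3932),
    "Bob's regexp": (5390580, 5643),
    "'(?i).*bot.*' regexp": (25026, 169),
    "'(?i).*crawler.*' regexp": (55022, 33),
}

# The rows are additive, so they must sum to the totals.
assert sum(r for r, _ in rows.values()) == total_requests
assert sum(a for _, a in rows.values()) == total_agents

current_requests, current_agents = rows["agent_type = spider"]
# Everything that is not "no bot match" would be flagged with the new regexps.
flagged_requests = total_requests - rows["no bot match"][0]
flagged_agents = total_agents - rows["no bot match"][1]


def pct(part, whole):
    """Share of `part` in `whole`, as a percentage."""
    return 100.0 * part / whole


request_increase = pct(flagged_requests - current_requests, current_requests)
agent_increase = pct(flagged_agents - current_agents, current_agents)

print(f"requests: {pct(current_requests, total_requests):.2f}% -> "
      f"{pct(flagged_requests, total_requests):.2f}% (+{request_increase:.1f}%)")
print(f"user agents: {pct(current_agents, total_agents):.2f}% -> "
      f"{pct(flagged_agents, total_agents):.2f}% (+{agent_increase:.1f}%)")
```

The relative increases come out slightly above the 64% and 148% quoted above, which appear to be truncated rather than rounded.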

Also, here is the new top 200 user_agents that would be newly classified as bots if we were to use the new regexps (and which of the regexps each one matches).

The only lines I have found that could be false positives contain "YJApp-IOS jp.co.yahoo.ipn.appli/". I haven't found any constructive information on that ...

As for the WikiBot convention: Joseph was just saying that MediawikiBot is used by Bing in the user agent, so we have to be more careful. How about \bWikimediaBot\b? I'm leaning towards that because it's less ambiguous and less likely to be used by third parties unaware of this convention.

No user agent matches '.*WikimediaBot.*' for the hour of analysis, whereas the list of user agents containing WikiBot is not empty:

DotNetWikiBot/2.101 (Microsoft Windows NT 6.2.9200.0; .NET CLR 4.0.30319.34209)
DotNetWikiBot/3.14 (Microsoft Windows NT 6.2.9200.0; .NET CLR 4.0.30319.42000)
DotNetWikiBot/3.11 (Microsoft Windows NT 6.2.9200.0; .NET CLR 2.0.50727.8009)
WikiBot/0.1 AppEngine-Google; (+http://code.google.com/appengine; appid: s~newnewwikipedia)
DotNetWikiBot/3.14 (Unix 3.18.11.0; Mono 3.2.8; .NET CLR 2.0.50727.1433)
DotNetWikiBot/3.14 (Microsoft Windows NT 6.1.7601 Service Pack 1; .NET CLR 4.0.30319.34209)
DotNetWikiBot/3.0 (Microsoft Windows NT 6.1.7601 Service Pack 1; .NET CLR 4.0.30319.34209)
DotNetWikiBot/3.14 (Microsoft Windows NT 6.1.7601 Service Pack 1; .NET CLR 2.0.50727.5485)
DotNetWikiBot/3.11 (Unix 3.2.0.75; .NET CLR 2.0.50727.1433)
DotNetWikiBot/2.101 (Microsoft Windows NT 6.1.7601 Service Pack 1; .NET CLR 4.0.30319.34209)
WikiBot(http://www.servicesonrequest.com/)
DotNetWikiBot/2.92 (Microsoft Windows NT 6.1.7600.0; .NET CLR 2.0.50727.4984)
WikiBot/0.1
User-Agent: WikiBot (http://www.servicesonrequest.com)
DotNetWikiBot/3.10 (Unix 3.2.0.75; .NET CLR 4.0.30319.1)
DotNetWikiBot/3.0 (Unix 3.13.0.62; .NET CLR 4.0.30319.17020)
PerlWikiBot/5.006002
WikiBots/1.0.1 CFNetwork/711.5.6 Darwin/14.0.0
DotNetWikiBot/2.102 (Microsoft Windows NT 6.1.7601 Service Pack 1; .NET CLR 2.0.50727.5485)
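
To illustrate the difference between the two candidate markers, here is a small Python sketch that runs both patterns over a few of the user agents listed above. The last two sample strings are hypothetical and only there to show the effect of the `\b` word boundaries.

```python
import re

WIKIBOT = re.compile(r"WikiBot")
WIKIMEDIABOT = re.compile(r"\bWikimediaBot\b")

samples = [
    # Taken from the list above.
    "DotNetWikiBot/3.14 (Unix 3.18.11.0; Mono 3.2.8; .NET CLR 2.0.50727.1433)",
    "WikiBot/0.1",
    "PerlWikiBot/5.006002",
    "WikiBots/1.0.1 CFNetwork/711.5.6 Darwin/14.0.0",
    # Hypothetical user agents, not observed in the data.
    "ExampleTool/1.0 WikimediaBot",
    "MyWikimediaBots/2.0",
]

wikibot_hits = [bool(WIKIBOT.search(ua)) for ua in samples]
wikimediabot_hits = [bool(WIKIMEDIABOT.search(ua)) for ua in samples]

for ua, a, b in zip(samples, wikibot_hits, wikimediabot_hits):
    print(f"WikiBot={a!s:5} WikimediaBot={b!s:5} {ua}")
```

The plain `WikiBot` substring matches every existing third-party agent in the sample, while `\bWikimediaBot\b` matches only agents that carry the marker as a standalone word.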

Change 237419 had a related patch set uploaded (by Joal):
[WIP] Update agent_type in webrequest

https://gerrit.wikimedia.org/r/237419

Change 237392 had a related patch set uploaded (by Joal):
Update bot filtering for webrequests.

https://gerrit.wikimedia.org/r/237392

Change 237392 merged by Ottomata:
Update bot filtering for webrequests.

https://gerrit.wikimedia.org/r/237392

Change 237419 merged by Joal:
Bump core and hive jar versions

https://gerrit.wikimedia.org/r/237419