We agreed to run the detection in silent mode for a while; we will do so in a "shadow" hourly pageview table.
|Open||None||T138207 [Open question] Improve bot identification at scale|
|Resolved||None||T238357 Label high volume bot spikes in pageview data as automated traffic|
|Resolved||None||T238358 Deploy high volume bot spike detector to hungarian wikipedia|
|Resolved||JAllemandou||T238363 Vet high volume bot spike detection code|
|Resolved||JAllemandou||T247342 Create UDF for actor id generation|
|Resolved||JAllemandou||T247344 Automated deletion of actor data for bot prediction after 90 days|
Vetting heuristic -- One day of manually computed automated actors yields exactly the same count as predictions.actor_label_hourly:
```
spark.sql("""
  SELECT
    md5(concat(
      ip,
      substr(user_agent, 0, 200),
      accept_language,
      uri_host,
      COALESCE(
        x_analytics_map['wmfuuid'],
        parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'appInstallID'),
        ''
      )
    )) AS actor_id,
    COUNT(1) AS pageview_count,
    CAST((COUNT(1) / (unix_timestamp(MAX(ts)) - unix_timestamp(MIN(ts))) * 60) AS INT) AS pageview_ratio_per_min,
    SUM(COALESCE(x_analytics_map['nocookies'], 0L)) AS nocookies,
    MAX(length(user_agent)) AS user_agent_length
  FROM wmf.webrequest
  WHERE webrequest_source = 'text'
    AND year = 2020 AND month = 1 AND day = 16
    AND is_pageview
    AND agent_type = 'user'
    AND user_agent NOT LIKE '%weblight%'
    AND COALESCE(pageview_info['project'], '') != ''
  GROUP BY md5(concat(
      ip,
      substr(user_agent, 0, 200),
      accept_language,
      uri_host,
      COALESCE(
        x_analytics_map['wmfuuid'],
        parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'appInstallID'),
        ''
      )
    ))
  HAVING pageview_count > 800
    OR (pageview_count >= 10
        AND (pageview_ratio_per_min >= 30
             OR nocookies > 10
             OR user_agent_length > 400
             OR user_agent_length < 25))
""").count
// 290559

spark.sql("""
  SELECT COUNT(1)
  FROM predictions.actor_label_hourly
  WHERE year = 2020 AND month = 1 AND day = 16 AND hour = 23
    AND label = 'automated'
""").show(10, false)
// 290559
```
This looks correct to me :)
Note the trick: pageview_count >= 10 must hold before the other signals can mark an actor as automated.
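To make the thresholds easy to eyeball, here is a hedged Python restatement of the HAVING clause from the vetting query above (a sketch, not the deployed code); the argument names mirror the SQL column aliases:

```python
def is_automated(pageview_count, pageview_ratio_per_min, nocookies, user_agent_length):
    """Restates the HAVING clause: very high volume is flagged outright;
    the weaker signals only apply once an actor has at least 10 pageviews,
    which avoids labelling tiny actors on noisy per-minute ratios."""
    if pageview_count > 800:
        return True
    return pageview_count >= 10 and (
        pageview_ratio_per_min >= 30
        or nocookies > 10
        or user_agent_length > 400
        or user_agent_length < 25
    )
```

For example, an actor with 9 pageviews is never flagged no matter how extreme its other signals, while 50 pageviews at 40/min would be.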
Per our conversation, we will take a look at:
- translation requests
- top pageview computation; for example, this recent problem should disappear: T247085: clear bot spam-scraping [[en:United States Senate]] not being detected as a bot
- number of actors with nocookies set, by access_type (desktop, mobile, mobile-app)
- overall percentage of pages flagged as automated per project
The context is different, but you will be evaluating traffic as actors on MediaWiki websites. There will also be 'actor*' table(s) that we sqoop from MediaWiki. I can imagine joining this data together someday. E.g. it might be interesting to compute scores about which MediaWiki user-actors generate more non-bot pageviews... a leaderboard for editors?
Is 'actor' a term usually used for classifying web traffic, or did we make this one up? Perhaps we can bikeshed a better name?
Per post-standup conversation:
- The intent of this "actor" identifier is more in line with the "bad actor" semantics used when assessing security risk
- actor_id is confusing because it is identical to the MediaWiki column name, which has an entirely different meaning
- We agree that actor_signature is a better name than actor_id, so we will rename the UDF and the Hive column accordingly
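To make the rename concrete, here is a hedged Python sketch of what the renamed signature computes, mirroring the md5(concat(...)) expression from the vetting query; this is not the actual Hive UDF, and the device_id argument stands in for the COALESCE over wmfuuid / appInstallID (empty string when neither is present):

```python
import hashlib

def actor_signature(ip, user_agent, accept_language, uri_host, device_id=""):
    """Fingerprint an actor as the MD5 of the concatenated request fields.
    user_agent is truncated to its first 200 characters, matching
    substr(user_agent, 0, 200) in the Hive expression."""
    raw = ip + user_agent[:200] + accept_language + uri_host + device_id
    return hashlib.md5(raw.encode("utf-8")).hexdigest()

sig = actor_signature("203.0.113.7", "Mozilla/5.0 ...", "en-US", "hu.wikipedia.org")
print(sig)  # 32-char hex digest, stable for identical inputs
```

Identical request fields always yield the same signature, so the value works as a GROUP BY key without storing the raw IP and user agent together.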