We agreed we need to run the detection in silent mode for a bit, will do so in a "shadow" pageview hourly table
Description
Details
| Subject | Repo | Branch | Lines +/- | |
|---|---|---|---|---|
| Add automated agent-type to pageview_hourly | analytics/refinery | master | +103 -48 | |
| Update actors for pageview labelling | analytics/refinery | master | +93 -71 |
| Status | Subtype | Assigned | Task | ||
|---|---|---|---|---|---|
| Open | None | T138207 [Open question] Improve bot identification at scale | |||
| Resolved | None | T238357 Label high volume bot spikes in pageview data as automated traffic | |||
| Resolved | None | T238358 Deploy high volume bot spike detector to hungarian wikipedia | |||
| Resolved | JAllemandou | T238363 Vet high volume bot spike detection code | |||
| Resolved | JAllemandou | T247342 Create UDF for actor id generation | |||
| Resolved | JAllemandou | T247344 Automated deletion of actor data for bot prediction after 90 days |
Event Timeline
Change 578373 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Add automated agent-type to pageview_hourly
Vetting heuristic -- One day of manually computed automated actors has the exact same number than the one in predictions.actor_label_hourly:
spark.sql("""
SELECT
md5(concat(ip, substr(user_agent,0,200), accept_language, uri_host, COALESCE(x_analytics_map['wmfuuid'],parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'appInstallID'),''))) AS actor_id,
COUNT(1) as pageview_count,
cast((count(1)/(unix_timestamp(max(ts)) - unix_timestamp(min(ts))) * 60) as int) as pageview_ratio_per_min,
sum(coalesce(x_analytics_map["nocookies"], 0L)) as nocookies,
max(length(user_agent)) as user_agent_length
FROM wmf.webrequest
WHERE webrequest_source = 'text'
AND year=2020
AND month=1
AND day=16
AND is_pageview
AND agent_type = "user"
AND user_agent not like "%weblight%"
AND COALESCE(pageview_info['project'], '') != ''
GROUP BY md5(concat(ip, substr(user_agent,0,200), accept_language, uri_host, COALESCE(x_analytics_map['wmfuuid'],parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'appInstallID'),'')))
HAVING pageview_count > 800
OR (pageview_count >= 10 AND (
pageview_ratio_per_min >= 30
OR nocookies > 10
OR user_agent_length > 400
OR user_agent_length < 25
))
""").count
// 290559
spark.sql("select count(1) from predictions.actor_label_hourly where year = 2020 and month = 1 and day = 16 and hour = 23 and label = 'automated'").show(10, false)
// 290559This looks correct to me :)
Note: The trick of pageview >= 10 before checking other values for automated.
Per our conversation, we will take a look at:
- translation requests
- top pageview computation, like for example, this recent problem should disappear: T247085: clear bot spam-scraping [[en:United States Senate]] not being detected as a bot
*number of actors with nocookies set by access_type (desktop, mobile, mobile-app)
- overall percentage of pages flagged as automated per project
Change 584602 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Update actors for pageview labelling
OO we should be careful with naming here. 'actor' from the Mediawiki perspective refers to the MW User performing a change. It is AKA a 'performer'.
See https://phabricator.wikimedia.org/T167246 and specifically https://phabricator.wikimedia.org/T167246#5093402
On my opinion I do not think we need to rename, the context is very different than the MW one.
The context is different, but you will be evaluating traffic as actors on mediawiki websites. There will also be 'actor*' table(s) that we sqoop from Mediawiki. I can imagine someday joining this data together. E.g. it might be interesting to compute some scores about what mediawiki user-actors generate more non-bot pageviews...a leaderboard for editors?
Is 'actor' a term usually used for classifying web traffic, or did we make this one up? Perhaps we can bikeshed a better name?
Per post-standup conversation:
- The intent of this "actor" identifier is more inline with the "bad actor" semantics used when assessing security risk
- Actor_id is confusing cause it is identical to the mw column that has an entirely different meaning
- we agree that actor_signature is a better name for actor_id thus we will be changing the name on UDF and hive column accordingly
Change 584602 merged by Joal:
[analytics/refinery@master] Update actors for pageview labelling
Change 578373 merged by Joal:
[analytics/refinery@master] Add automated agent-type to pageview_hourly
closing, all documented here: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/BotDetection