We agreed to run the detection in silent mode for a while; we will do so in a "shadow" hourly pageview table.
|Open||None||T138207 [Open question] Improve bot identification at scale|
|Resolved||None||T238357 Label high volume bot spikes in pageview data as automated traffic|
|Resolved||None||T238358 Deploy high volume bot spike detector to hungarian wikipedia|
|Resolved||JAllemandou||T238363 Vet high volume bot spike detection code|
|Resolved||JAllemandou||T247342 Create UDF for actor id generation|
|Resolved||JAllemandou||T247344 Automated deletion of actor data for bot prediction after 90 days|
Vetting heuristic -- One day of manually computed automated actors yields exactly the same count as predictions.actor_label_hourly:
```
spark.sql("""
  SELECT
    md5(concat(
      ip,
      substr(user_agent, 0, 200),
      accept_language,
      uri_host,
      COALESCE(
        x_analytics_map['wmfuuid'],
        parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'appInstallID'),
        ''
      )
    )) AS actor_id,
    COUNT(1) AS pageview_count,
    CAST((COUNT(1) / (unix_timestamp(MAX(ts)) - unix_timestamp(MIN(ts))) * 60) AS INT) AS pageview_ratio_per_min,
    SUM(COALESCE(x_analytics_map['nocookies'], 0L)) AS nocookies,
    MAX(length(user_agent)) AS user_agent_length
  FROM wmf.webrequest
  WHERE webrequest_source = 'text'
    AND year = 2020 AND month = 1 AND day = 16
    AND is_pageview
    AND agent_type = 'user'
    AND user_agent NOT LIKE '%weblight%'
    AND COALESCE(pageview_info['project'], '') != ''
  GROUP BY md5(concat(
      ip,
      substr(user_agent, 0, 200),
      accept_language,
      uri_host,
      COALESCE(
        x_analytics_map['wmfuuid'],
        parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'appInstallID'),
        ''
      )
    ))
  HAVING pageview_count > 800
    OR (pageview_count >= 10
        AND (pageview_ratio_per_min >= 30
             OR nocookies > 10
             OR user_agent_length > 400
             OR user_agent_length < 25))
""").count
// 290559

spark.sql("""
  SELECT COUNT(1)
  FROM predictions.actor_label_hourly
  WHERE year = 2020 AND month = 1 AND day = 16 AND hour = 23
    AND label = 'automated'
""").show(10, false)
// 290559
```
This looks correct to me :)
Note the trick: pageview_count >= 10 must hold before the other signals can mark an actor as automated.
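To make the thresholds easy to eyeball, here is a hedged Python restatement of the HAVING clause from the vetting query above (a sketch, not the deployed code); the argument names mirror the SQL column aliases:

```python
def is_automated(pageview_count, pageview_ratio_per_min, nocookies, user_agent_length):
    """Restates the HAVING clause: very high volume is flagged outright;
    the weaker signals only apply once an actor has at least 10 pageviews,
    which avoids labelling tiny actors on noisy per-minute ratios."""
    if pageview_count > 800:
        return True
    return pageview_count >= 10 and (
        pageview_ratio_per_min >= 30
        or nocookies > 10
        or user_agent_length > 400
        or user_agent_length < 25
    )
```

For example, an actor with 9 pageviews is never flagged no matter how extreme its other signals, while 50 pageviews at 40/min would be.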
Per our conversation, we will take a look at:
- translation requests
- top pageview computation; for example, this recent problem should disappear: T247085: clear bot spam-scraping [[en:United States Senate]] not being detected as a bot
- number of actors with nocookies set, by access_type (desktop, mobile, mobile-app)
- overall percentage of pages flagged as automated per project
The context is different, but you will be evaluating traffic as actors on MediaWiki websites. There will also be 'actor*' table(s) that we sqoop from MediaWiki. I can imagine joining this data together someday. E.g. it might be interesting to compute scores about which MediaWiki user-actors generate more non-bot pageviews... a leaderboard for editors?
Is 'actor' a term usually used for classifying web traffic, or did we make this one up? Perhaps we can bikeshed a better name?
Per post-standup conversation:
- The intent of this "actor" identifier is more in line with the "bad actor" semantics used when assessing security risk
- actor_id is confusing because it is identical to the MediaWiki column name, which has an entirely different meaning
- We agree that actor_signature is a better name than actor_id, so we will rename the UDF and the Hive column accordingly
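To make the rename concrete, here is a hedged Python sketch of what the renamed signature computes, mirroring the md5(concat(...)) expression from the vetting query; this is not the actual Hive UDF, and the device_id argument stands in for the COALESCE over wmfuuid / appInstallID (empty string when neither is present):

```python
import hashlib

def actor_signature(ip, user_agent, accept_language, uri_host, device_id=""):
    """Fingerprint an actor as the MD5 of the concatenated request fields.
    user_agent is truncated to its first 200 characters, matching
    substr(user_agent, 0, 200) in the Hive expression."""
    raw = ip + user_agent[:200] + accept_language + uri_host + device_id
    return hashlib.md5(raw.encode("utf-8")).hexdigest()

sig = actor_signature("203.0.113.7", "Mozilla/5.0 ...", "en-US", "hu.wikipedia.org")
print(sig)  # 32-char hex digest, stable for identical inputs
```

Identical request fields always yield the same signature, so the value works as a GROUP BY key without storing the raw IP and user agent together.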