Page MenuHomePhabricator

Vet high volume bot spike detection code
Closed, ResolvedPublic13 Estimated Story Points

Description

We agreed we need to run the detection in silent mode for a bit, will do so in a "shadow" pageview hourly table

Event Timeline

Nuria renamed this task from Vet high volume bot spike detection in hungarian wikipedia to Vet high volume bot spike detection code.Nov 14 2019, 10:51 PM
Nuria updated the task description. (Show Details)

Change 578373 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Add automated agent-type to pageview_hourly

https://gerrit.wikimedia.org/r/578373

Vetting heuristic -- One day of manually computed automated actors has the exact same number than the one in predictions.actor_label_hourly:

spark.sql("""
  SELECT
    md5(concat(ip, substr(user_agent,0,200), accept_language, uri_host, COALESCE(x_analytics_map['wmfuuid'],parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'appInstallID'),''))) AS actor_id,
    COUNT(1) as pageview_count,
    cast((count(1)/(unix_timestamp(max(ts)) - unix_timestamp(min(ts))) * 60) as int) as pageview_ratio_per_min,
    sum(coalesce(x_analytics_map["nocookies"], 0L)) as nocookies,
    max(length(user_agent)) as user_agent_length
  FROM wmf.webrequest
  WHERE webrequest_source = 'text'
    AND year=2020
        AND month=1
        AND day=16
        AND is_pageview
        AND agent_type = "user"
        AND user_agent not like "%weblight%"
        AND COALESCE(pageview_info['project'], '') != ''
  GROUP BY md5(concat(ip, substr(user_agent,0,200), accept_language, uri_host, COALESCE(x_analytics_map['wmfuuid'],parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'appInstallID'),'')))
  HAVING pageview_count > 800
    OR (pageview_count >= 10 AND (
      pageview_ratio_per_min >= 30
      OR nocookies > 10
      OR user_agent_length > 400
      OR user_agent_length < 25
    ))
""").count
// 290559 

spark.sql("select count(1) from predictions.actor_label_hourly where year = 2020 and month = 1 and day = 16 and hour = 23 and label = 'automated'").show(10, false)

// 290559

This looks correct to me :)
Note: The trick of pageview >= 10 before checking other values for automated.

Per our conversation, we will take a look at:

*number of actors with nocookies set by access_type (desktop, mobile, mobile-app)

  • overall percentage of pages flagged as automated per project

Change 584602 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Update actors for pageview labelling

https://gerrit.wikimedia.org/r/584602

OO we should be careful with naming here. 'actor' from the Mediawiki perspective refers to the MW User performing a change. It is AKA a 'performer'.

See https://phabricator.wikimedia.org/T167246 and specifically https://phabricator.wikimedia.org/T167246#5093402

@Ottomata : I see. @Nuria : shall we rename before recomputing the features, or do we keep that name?

On my opinion I do not think we need to rename, the context is very different than the MW one.

The context is different, but you will be evaluating traffic as actors on mediawiki websites. There will also be 'actor*' table(s) that we sqoop from Mediawiki. I can imagine someday joining this data together. E.g. it might be interesting to compute some scores about what mediawiki user-actors generate more non-bot pageviews...a leaderboard for editors?

Is 'actor' a term usually used for classifying web traffic, or did we make this one up? Perhaps we can bikeshed a better name?

Per post-standup conversation:

  • The intent of this "actor" identifier is more inline with the "bad actor" semantics used when assessing security risk
  • Actor_id is confusing cause it is identical to the mw column that has an entirely different meaning
  • we agree that actor_signature is a better name for actor_id thus we will be changing the name on UDF and hive column accordingly

Change 584602 merged by Joal:
[analytics/refinery@master] Update actors for pageview labelling

https://gerrit.wikimedia.org/r/584602

Change 578373 merged by Joal:
[analytics/refinery@master] Add automated agent-type to pageview_hourly

https://gerrit.wikimedia.org/r/578373

Nuria set the point value for this task to 13.