Hourly Feature extraction for bot detection from webrequest
Closed, Resolved | Public | 8 Estimated Story Points

Description

We decided to extract features hourly and to label data as "automated" using the last 24 hours of data. This data is an intermediate result, computed only to feed further computation, so it should probably live in wmf_raw rather than wmf.
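
A minimal sketch of how an hourly run might pick the 24 hourly partitions it needs for labeling. The function name `labeling_window` is hypothetical, and the choice to include the run hour itself in the window is an assumption; the task does not say where the 24-hour window starts or ends.

```python
from datetime import datetime, timedelta
from typing import List


def labeling_window(run_hour: datetime, hours: int = 24) -> List[datetime]:
    """Return the hourly partitions used for labeling, oldest first.

    Assumption: the window is the run hour plus the (hours - 1) hours
    before it. The task only says "the last 24 hours of data".
    """
    # Truncate to the start of the hour so every entry is a clean partition key.
    end = run_hour.replace(minute=0, second=0, microsecond=0)
    return [end - timedelta(hours=h) for h in range(hours - 1, -1, -1)]


window = labeling_window(datetime(2019, 11, 25, 13, 42))
print(window[0], "->", window[-1])
```

Each hourly run would then read only these partitions of the hourly features table, rather than rescanning raw webrequest data.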

Features to be computed:

sessionId
session_start 
session_end
session_length_secs
number_of_pageviews
pageview_ratio_per_min
nocookies 
user_agent_length

See: https://docs.google.com/document/d/1q14GH7LklhMvDh0jwGaFD4eXvtQ5tLDmw3UeFTmb3KM/edit#heading=h.eb32std206d
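
The features listed above could be computed along the following lines. This is an illustrative sketch in plain Python, not the refinery job itself. The `Request` record and its fields are hypothetical, and several definitions are assumptions: `pageview_ratio_per_min` is taken as pageviews divided by session length in minutes (with a one-minute floor), `nocookies` as the count of requests sent without cookies, and `user_agent_length` as the length of the longest user agent string seen in the session.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, List


@dataclass
class Request:
    # Hypothetical, simplified view of one webrequest row.
    session_id: str
    ts: datetime
    is_pageview: bool
    has_cookies: bool
    user_agent: str


def session_features(requests: Iterable[Request]) -> Dict[str, dict]:
    """Group requests by session and compute the per-session features."""
    by_session: Dict[str, List[Request]] = {}
    for r in requests:
        by_session.setdefault(r.session_id, []).append(r)

    features = {}
    for sid, reqs in by_session.items():
        start = min(r.ts for r in reqs)
        end = max(r.ts for r in reqs)
        length_secs = int((end - start).total_seconds())
        pageviews = sum(1 for r in reqs if r.is_pageview)
        # Assumption: floor the duration at one minute to avoid dividing by zero
        # for single-request sessions.
        minutes = max(length_secs / 60, 1)
        features[sid] = {
            "session_start": start,
            "session_end": end,
            "session_length_secs": length_secs,
            "number_of_pageviews": pageviews,
            "pageview_ratio_per_min": pageviews / minutes,
            "nocookies": sum(1 for r in reqs if not r.has_cookies),
            "user_agent_length": max(len(r.user_agent) for r in reqs),
        }
    return features


sample = [
    Request("a", datetime(2019, 11, 25, 10, 0, 0), True, False, "Mozilla/5.0"),
    Request("a", datetime(2019, 11, 25, 10, 1, 0), True, True, "Mozilla/5.0"),
    Request("a", datetime(2019, 11, 25, 10, 2, 0), True, True, "Mozilla/5.0"),
]
features = session_features(sample)
print(features["a"])
```

In production this would more naturally be a grouped aggregation in Hive or Spark over the webrequest table; the Python version only makes the per-session arithmetic explicit.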

Event Timeline

Change 552943 had a related patch set uploaded (by Nuria; owner: Nuria):
[analytics/refinery@master] Create table to hold calculations of session features

https://gerrit.wikimedia.org/r/552943

The code is WIP, but it would be very helpful if @mforns and @JAllemandou could take a look.

Doc looks great; I copy-edited it a bit as I went through it.

Nuria set the point value for this task to 8.

Change 552943 merged by Nuria:
[analytics/refinery@master] Table and workflow for features computations per actor per hour

https://gerrit.wikimedia.org/r/552943

Change 562957 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Fix oozie learning/features/actor/hourly

https://gerrit.wikimedia.org/r/562957

Change 562957 merged by Joal:
[analytics/refinery@master] Fix oozie learning/features/actor/hourly

https://gerrit.wikimedia.org/r/562957