
Hourly Feature extraction for bot detection from webrequest
Open, Needs Triage, Public

Description

We decided to extract features hourly and to label data as "automated" using the last 24 hours of data. This data is computed only as an intermediate step for further computation, so it should probably live in wmf_raw rather than wmf.

Features to be computed:

sessionId
session_start 
session_end
session_length_secs
number_of_pageviews
pageview_ratio_per_min
nocookies 
user_agent_length

See: https://docs.google.com/document/d/1q14GH7LklhMvDh0jwGaFD4eXvtQ5tLDmw3UeFTmb3KM/edit#heading=h.eb32std206d
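
As a rough illustration of the feature list above, here is a minimal Python sketch that aggregates per-session features from request records. It is not the refinery implementation. The `Request` class and its fields are hypothetical stand-ins for webrequest rows. How a session is identified is not specified in this task, so the sketch takes a ready-made session id as input. It also assumes that `nocookies` is a count of requests flagged as sent without cookies, that `user_agent_length` is the longest user agent string seen in the session, and that sessions shorter than one minute are treated as one minute long when computing `pageview_ratio_per_min`.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, List


@dataclass
class Request:
    """Hypothetical, simplified stand-in for one webrequest row."""
    session_id: str      # assumed to be precomputed upstream
    ts: datetime         # request timestamp
    is_pageview: bool
    nocookies: bool      # assumed flag: request arrived without cookies
    user_agent: str


def session_features(requests: Iterable[Request]) -> List[Dict]:
    """Group requests by session and compute the features listed above."""
    by_session: Dict[str, List[Request]] = {}
    for r in requests:
        by_session.setdefault(r.session_id, []).append(r)

    rows = []
    for session_id, reqs in by_session.items():
        start = min(r.ts for r in reqs)
        end = max(r.ts for r in reqs)
        length_secs = int((end - start).total_seconds())
        pageviews = sum(1 for r in reqs if r.is_pageview)
        # Assumption: floor the duration at one minute so that
        # single-request sessions do not divide by zero.
        minutes = max(length_secs / 60.0, 1.0)
        rows.append({
            "sessionId": session_id,
            "session_start": start,
            "session_end": end,
            "session_length_secs": length_secs,
            "number_of_pageviews": pageviews,
            "pageview_ratio_per_min": pageviews / minutes,
            "nocookies": sum(1 for r in reqs if r.nocookies),
            "user_agent_length": max(len(r.user_agent) for r in reqs),
        })
    return rows
```

In an hourly job, the input would be the requests for the hour being processed; the labeling step over the last 24 hours would then read these rows rather than the raw requests.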

Event Timeline

Nuria created this task. Thu, Nov 14, 7:29 PM
Ottomata moved this task from Incoming to Bots on the Analytics board. Mon, Nov 18, 4:44 PM
Nuria claimed this task. Fri, Nov 22, 4:21 PM

Change 552943 had a related patch set uploaded (by Nuria; owner: Nuria):
[analytics/refinery@master] Create table to hold calculations of session features

https://gerrit.wikimedia.org/r/552943

Nuria moved this task from Next Up to In Progress on the Analytics-Kanban board. Tue, Nov 26, 5:08 PM
Nuria added a comment. Wed, Nov 27, 1:34 AM

The code is WIP, but it would be very helpful if @mforns and @JAllemandou could take a look.

doc looks great, copy-edited a bit as I went through it