Unique devices, retrofit with bot detection code: the offset part of the metric can filter bots using the UDF actor_signature and its classification in actor_label.
|analytics/refinery||master||+70 -111||Update unique-devices jobs to use pageview_actor_hourly|
|Open||None||T138207 [Open question] Improve bot identification at scale|
|Resolved||None||T238357 Label high volume bot spikes in pageview data as automated traffic|
|Resolved||JAllemandou||T250744 Unique devices, retrofit with bot detection code|
|Resolved||JAllemandou||T255467 Create intermediate dataset: pageview with actor information|
Findings for one day of per-domain uniques, grouped by domain+country:
- Removing bot traffic has no effect on the offset: the offset counts actors that made a single call, while bot detection targets actors making recurring calls.
- On uniques-global (offset + last-visit):
  - 99.5% of domain+country rows vary by less than 1% when bots are removed.
  - 0.09% of rows (69 out of 78456) disappear. All of them were rows flagged as bots; only 1 of those had more than 1 actor (24, to be precise).
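For anyone wanting to redo this comparison, here is a minimal sketch of how the two figures above could be derived from a pair of per-(domain, country) counts. It is not the job that was actually run: the function name `compare_uniques`, the dict-based input, and the toy numbers are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (domain, country), hypothetical key shape


def compare_uniques(
    with_bots: Dict[Key, int],
    without_bots: Dict[Key, int],
    threshold: float = 0.01,
) -> dict:
    """Compare uniques computed with and without bot traffic.

    Returns the share of rows whose relative change is below `threshold`
    and the rows that vanish entirely once bots are removed.
    """
    stable = 0
    disappeared: List[Key] = []
    for key, before in with_bots.items():
        after = without_bots.get(key)
        if after is None:
            # Row exists only because of bot-flagged traffic.
            disappeared.append(key)
            continue
        if before == 0:
            # Avoid dividing by zero; such rows are not counted as stable.
            continue
        if abs(before - after) / before < threshold:
            stable += 1
    total = len(with_bots)
    return {
        "stable_share": stable / total,
        "disappeared_share": len(disappeared) / total,
        "disappeared": disappeared,
    }


# Toy data, invented for the example (not real uniques).
with_bots = {
    ("en.wikipedia.org", "US"): 1000,
    ("fr.wikipedia.org", "FR"): 500,
    ("de.wikipedia.org", "DE"): 200,
    ("xx.wikipedia.org", "ZZ"): 24,
}
without_bots = {
    ("en.wikipedia.org", "US"): 998,
    ("fr.wikipedia.org", "FR"): 500,
    ("de.wikipedia.org", "DE"): 150,
}

result = compare_uniques(with_bots, without_bots)

# Sanity check on the reported share of disappearing rows: 69 out of 78456.
reported_disappeared_pct = round(69 / 78456 * 100, 2)
```

On real data the two dicts would come from the uniques output for the same day, computed once on all traffic and once with bot-labelled traffic filtered out.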
Now, given the relatively small impact of removing bots, and the relatively big computational cost, I question whether we should do it or not :)
@Nuria: we relabel traffic from user to bots on the pageview table only, not on webrequest. Uniques are computed from webrequest data, because various PII fields are needed for fingerprinting and for computing the offset.
We could split the computation across sources, using webrequest for offsets and pageview for the underestimate (we'd need to push last-visit info to pageview, which is not complicated), but so far uniques have not changed at all.
The results above come from my recomputing one day of uniques with bots removed.