In T373630#10226978, @Hghani wrote:
Problem Statement
Based on further analysis, we've found that certain actor signatures listed in the findings below are generating highly suspicious traffic. Specifically, they send a request every few seconds, each to a different domain. Upon reviewing our automated filtering logic, it appears that we apply automata labelling at the actor_signature level. This means our current heuristics may fail to label traffic accurately when users frequently switch domains: each new domain visit generates a different actor signature, which resets metrics like pages per minute or total pageviews.
This oversight is likely a major factor behind the nearly 2000% increase in unique devices from Singapore, as these actor signatures are bypassing our automata filters. In addition to the redirect problem discussed earlier in this ticket, we believe this pattern accounts for a significant portion of the overall increase in unique devices we've observed.
To address this, we propose applying automata labelling at the actor_signature_project_family level. This would continue to capture automata behavior at the domain level while also including actors who evade detection by switching domains. We discussed this with @JAllemandou, who said the approach sounds reasonable but would require a thorough impact analysis before implementation.
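To illustrate why the grouping level matters, here is a minimal sketch. It is not the pipeline's actual signature logic: the `signature` and `project_family` helpers, the hashed fields, and the sample requests are all assumptions made for illustration.

```python
import hashlib
from collections import Counter


def signature(ip, user_agent, scope):
    """Hypothetical stand-in for an actor signature: a hash of client traits plus a scope."""
    return hashlib.md5(f"{ip}|{user_agent}|{scope}".encode()).hexdigest()


def project_family(domain):
    """Illustrative assumption: the family is the second-to-last label, e.g. 'wikipedia'."""
    return domain.split(".")[-2]


# One client sending each request to a different domain (made-up sample data).
domains = ["en.wikipedia.org", "fr.wikipedia.org", "de.wikipedia.org", "ja.wikipedia.org"]
requests = [("203.0.113.7", "bot/1.0", d) for d in domains]

# Requests counted per signature when the scope is the domain vs. the project family.
per_domain = Counter(signature(ip, ua, d) for ip, ua, d in requests)
per_family = Counter(signature(ip, ua, project_family(d)) for ip, ua, d in requests)

print(len(per_domain), max(per_domain.values()))  # 4 signatures, 1 request each
print(len(per_family), max(per_family.values()))  # 1 signature, 4 requests
```

With the domain in the scope, the client never accumulates more than one request per signature, so a rate-based heuristic has nothing to trigger on. With the project family in the scope, all four requests land on the same signature.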
What needs to be done
- In the Automata pipeline, starting with webrequest_actor_metrics_hourly, replace get_actor_signature at line#39 with the project family actor signature.
- This ensures that automata logic is applied at the project family level instead of the domain level. The project family level is a superset that filters automated bots at both the project and domain levels, so this change covers both the existing and the desired use cases.
- Next, the pageview actor table should use actor_signature_per_project_family to join with the automata label table, so that automated actors are filtered out at the project level (pageview_actor.hql#74).
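
The join in the last step can be sketched as follows. This is a simplified in-memory illustration, not the HQL in pageview_actor.hql: the `agent_label` field, the `label_pageviews` helper, the sample rows, and the default label for unmatched signatures are assumptions.

```python
# Hypothetical label table keyed by the per-project-family signature.
automata_labels = {"sig_a": "automated", "sig_b": "user"}

# Hypothetical pageview rows carrying the same per-project-family signature.
pageviews = [
    {"actor_signature_per_project_family": "sig_a", "domain": "en.wikipedia.org"},
    {"actor_signature_per_project_family": "sig_a", "domain": "fr.wikipedia.org"},
    {"actor_signature_per_project_family": "sig_b", "domain": "en.wikipedia.org"},
]


def label_pageviews(rows, labels):
    """Left-join each row to its label; unmatched signatures default to 'user' (an assumption)."""
    key = "actor_signature_per_project_family"
    return [{**row, "agent_label": labels.get(row[key], "user")} for row in rows]


labelled = label_pageviews(pageviews, automata_labels)

# Keep only rows that were not labelled as automated.
human = [row for row in labelled if row["agent_label"] != "automated"]
print(len(human))
```

Because both tables share the per-project-family key, an actor labelled as automated on one domain is filtered on every domain in the same family.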

