Improve pageview automated traffic detection heuristics
Open, Medium, Public

Description

Our current detection of automated pageview traffic could use some improvement to provide better data quality, most notably in top metrics.

Event Timeline

fdans triaged this task as Medium priority. Apr 19 2021, 4:21 PM
fdans moved this task from Incoming to Data Quality on the Analytics board.

@JAllemandou There are a couple of other tickets (T270784, T274823) that might be resolved if the automated traffic detection heuristics are improved; should I add them as subtasks?

Please @kzimmerman - Thank you!

Another recently discovered issue, T355608, could benefit from improved automated bot detection.

Unique devices metrics could also be made more reliable with better bot detection (T373630).

In November nl.wiktionary had 8 million pageviews from Singapore, which were not considered "spiders" or "bots". Something similar appears in the statistics of other Wiktionaries. The subdivision by user agent on Wiktionaries has now become disinformation, which is the opposite of what our movement is aiming for. If rapid improvement is impossible, this subdivision needs to be discontinued.

Hi @MarcoSwart, we identified this issue with Singapore traffic and are currently working on fixing it in the tasks under T373630.

Note that we have made two significant changes to the automated traffic heuristics:

  • included redirect pageviews (T376196)
  • applied the heuristics at the project family level (T377257)

These changes have shifted the pageview metrics somewhat, and we are seeing an increase in traffic labeled as automated now that we can distinguish automated from user traffic more accurately.