Our current detection of automated pageview traffic could use some improvement to provide better data quality, notably in top metrics.
Description
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Open | | None | T280565 Improve pageview automated traffic detection heuristics |
| Open | | None | T280011 Top read repeats |
| Open | | None | T280844 Too many views to Skathi (moon) on enwiki |
| Resolved | BUG REPORT | Dbrant | T270784 App describing Spanish Wikipedia article "Cleopatra" as trending (due to default voice search test for Hello Google) |
| Resolved | | cchen | T274823 Big increase in traffic for projects except 'wikipedia' family since Feb 14th |
| Restricted Task | | | |
| Restricted Task | | | |
| Resolved | | SNowick_WMF | T328127 Analyze possible bot traffic for enwiki article Index (statistics), Index & XXX:_Return_of_Xander_Cage |
| Declined | | None | T327027 Massive spike in pageviews for a few enwiki pages beginning with "Index" |
| Declined | | None | T328935 Spike: Pageview Anomaly Analysis |
Event Timeline
@JAllemandou There are a couple of other tickets (T270784, T274823) that might be resolved if the automated traffic detection heuristics are improved; should I add them as subtasks?
Another recently discovered issue, T355608, could also benefit from improved automated bot detection.
In November nl.wiktionary had 8 million pageviews from Singapore, which were not considered "spiders" or "bots". Something similar appears in the statistics of other Wiktionaries. The subdivision by user agent on Wiktionaries has now become disinformation, which is the opposite of what our movement is aiming for. If rapid improvement is impossible, this subdivision needs to be discontinued.
Hi @MarcoSwart, we identified this issue with Singapore traffic and are currently working on fixing it in the tasks under T373630.
Note that we have made 2 significant changes to the automated traffic heuristics:
These changes have shifted the pageview metrics a bit, and we are seeing an increase in automated traffic since we're able to label automated and user traffic more correctly.