
Remove spider traffic from "top" results
Closed, Duplicate (Public)

Description

There are some weird page titles appearing in top_articles. For 2 October, enwiki, all-platforms:

Special:BlankPage

Template:GeoTemplate
User:GoogleAnalitycsRoman/google-api

This suggests (a) our page-filtering needs some work (pretty sure GeoTemplate is a transclusion) and (b) our bot-filtering needs some work (the google-api traffic is clearly not humans).

Event Timeline

Ironholds raised the priority of this task from to Needs Triage.
Ironholds updated the task description.
Ironholds added a project: Analytics-Backlog.
Ironholds subscribed.

From digging into individual instances of these requests, it appears that many are, in fact, accurately identified as spiders or automata... and then included in top_articles anyway.

Top articles is going to need to either exclude spiders or take user-type as a parameter in the same way every other API element does. This data is pretty much unusable if random automata can distort it to this degree.
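
For illustration, a minimal sketch of the filtering described above, assuming each row carries a page title, an agent type, and a view count. The `PageviewRow` type, the `top_articles` function, and the `"spider"` label are hypothetical names chosen for this example, not the actual API implementation.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Optional, Tuple


class PageviewRow(NamedTuple):
    """Hypothetical row shape; field names are assumptions for this sketch."""
    page_title: str
    agent_type: str  # assumed labels: "user" or "spider"
    view_count: int


def top_articles(
    rows: Iterable[PageviewRow],
    limit: int = 10,
    agent_type: Optional[str] = "user",
) -> List[Tuple[str, int]]:
    """Rank pages by total views, counting only rows of the given agent type.

    Passing agent_type=None counts every row, which mirrors the unfiltered
    behaviour this task complains about.
    """
    totals: Counter = Counter()
    for row in rows:
        if agent_type is None or row.agent_type == agent_type:
            totals[row.page_title] += row.view_count
    return totals.most_common(limit)
```

Defaulting to `"user"` keeps automated traffic out of the ranking unless a caller explicitly asks for it.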

See also the top 200 list at T117945 :
These three pages are in the list of most viewed pages in pageview_hourly even when restricting the list to agent_type = "user".

Nuria renamed this task from Weirdnesses in top_articles to Remove spider traffic from "top" results. Nov 18 2015, 4:30 PM
Nuria set Security to None.

But this change is not going to be backfilled: removing self-inflicted traffic is a never-ending task, so we cannot recompute pageviews every time we find one of these pages.

  • User:GoogleAnalitycsRoman/google-api

We likely need to include this one on special pages (see link above)

  • Bot traffic will be removed from the top endpoint and (see child task) backfilled as we can.

Actually, IIRC from looking at that, the mass of traffic lacked a UA, so eliminating automata and spiders should do it.
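
A rough sketch of why a missing UA would be caught by that filter, assuming requests with no user agent are tagged as automated. The `classify_agent` function, the regex, the `"-"` placeholder handling, and the `"spider"`/`"user"` labels are all assumptions made for this example; this is not the production tagging logic.

```python
import re
from typing import Optional

# Deliberately crude pattern for self-identified bots (an assumption for
# this sketch; real bot detection uses a much longer list).
_BOT_PATTERN = re.compile(r"bot|crawler|spider", re.IGNORECASE)


def classify_agent(user_agent: Optional[str]) -> str:
    """Return "spider" for missing or bot-like user agents, else "user"."""
    # A missing, blank, or "-" placeholder UA is treated as automated traffic.
    if user_agent is None or user_agent.strip() in ("", "-"):
        return "spider"
    if _BOT_PATTERN.search(user_agent):
        return "spider"
    return "user"
```

Under this assumption, the UA-less requests hitting User:GoogleAnalitycsRoman/google-api would be labelled as spiders and dropped by any filter that keeps only user traffic.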