T141506 (Suddenly outrageous higher pageviews for main pages) is still open, even though these pageview spikes were long ago identified as artificial traffic that does not correspond to actual humans viewing content, and a workaround was implemented to prevent them from recurring (T141786). However, the incorrect data persists in the pageview_hourly table and in the data sources derived from it, in particular projectview_hourly and the public pageview stats tools (example, with no annotation or warning). WMF staff and others with Hive access can add a custom condition to their queries to filter the spikes out (T141506#2582628), but this is clumsy to maintain indefinitely in future trend analyses. Community members and the public do not have that option and are left with the faulty data.
I already talked about this with @JAllemandou some time ago (I've been meaning to file this task for a while) and got a sense that it should be relatively easy to implement.
I imagine we could proceed as follows:
- Set view_count to zero for all rows in pageview_hourly that match the condition in T141506#2582628 and that date from the weeks between the start of the incident and the implementation of the workaround (T141786#2558383). There are not very many such rows.
- Regenerate the data in projectview_hourly and other derived data sources for the affected timespan, emulating the routine aggregation processes.
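
To make the two steps above concrete, here is a minimal in-memory sketch in Python. It is not the actual Hive job. The row layout (project, page_title, hour, view_count) is a simplified assumption about pageview_hourly. The `suspect` flag and the `is_artificial` predicate are hypothetical placeholders for the real condition in T141506#2582628, which is not reproduced here. The dates are arbitrary and do not reflect the real incident window.

```python
from collections import defaultdict
from datetime import datetime


def zero_out_artificial(rows, is_artificial, start, end):
    """Step 1: return copies of the rows, with view_count set to 0 where the
    predicate matches and the row's hour falls in [start, end)."""
    cleaned = []
    for row in rows:
        row = dict(row)  # copy, so the input rows are left untouched
        if start <= row["hour"] < end and is_artificial(row):
            row["view_count"] = 0
        cleaned.append(row)
    return cleaned


def aggregate_projectviews(rows):
    """Step 2: re-aggregate per (project, hour), loosely emulating how
    projectview_hourly is derived from pageview_hourly."""
    totals = defaultdict(int)
    for row in rows:
        totals[(row["project"], row["hour"])] += row["view_count"]
    return dict(totals)


# Hypothetical placeholder for the real filter condition (an assumption).
def is_artificial(row):
    return row["suspect"]


# Arbitrary example data; "suspect" is an invented flag for illustration.
in_window = datetime(2016, 8, 1, 10)
after_window = datetime(2016, 9, 1, 10)
rows = [
    {"project": "en.wikipedia", "page_title": "Main_Page",
     "hour": in_window, "view_count": 900000, "suspect": True},
    {"project": "en.wikipedia", "page_title": "Main_Page",
     "hour": in_window, "view_count": 1200, "suspect": False},
    {"project": "en.wikipedia", "page_title": "Cat",
     "hour": in_window, "view_count": 30, "suspect": False},
    # Matches the predicate but lies outside the window, so it is kept as-is.
    {"project": "en.wikipedia", "page_title": "Main_Page",
     "hour": after_window, "view_count": 500, "suspect": True},
]

cleaned = zero_out_artificial(
    rows, is_artificial,
    start=datetime(2016, 7, 25), end=datetime(2016, 8, 15),
)
projectviews = aggregate_projectviews(cleaned)
```

Restricting the zeroing to a time window means that rows outside the affected timespan are never modified, even if they happen to match the filter condition.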
I would be happy to assist with the first step if needed.