
Correct pageview_hourly and derived data for T141506
Closed, Declined (Public)

Description

This issue is still open right now: T141506: Suddenly outrageous higher pageviews for main pages, even though these pageview spikes were long ago identified as artificial traffic that does not correspond to actual humans viewing content, and a workaround was implemented to prevent them from continuing to occur (T141786). Yet the incorrect data persists in the pageview_hourly table and in the data sources derived from it, in particular projectview_hourly and the public pageview stats tools (example, with no annotation or warning). While some WMF staff and others with Hive access can add a custom condition to their queries to filter the spikes out (T141506#2582628), this is quite clumsy to maintain indefinitely in future trend analyses etc. And community members and the public do not have that option and are thus left with the faulty data.
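
For illustration, that per-query workaround looks roughly like the sketch below. This is not the actual filter: the real predicate is the one given in T141506#2582628, and FALSE is used here purely as a syntactic placeholder for it; the month shown is only an example, not the real incident window.

  -- Sketch of the per-query workaround: exclude the anomalous rows by adding
  -- one extra predicate. Replace FALSE with the condition from T141506#2582628.
  SELECT
    project,
    SUM(view_count) AS views
  FROM wmf.pageview_hourly
  WHERE agent_type = 'user'
    AND year = 2016 AND month = 8        -- example month within the affected period
    AND NOT (FALSE)                      -- i.e. NOT (<T141506#2582628 condition>)
  GROUP BY project;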

I already talked about this with @JAllemandou some time ago (I've been meaning to file this task for a while) and got a sense that it should be relatively easy to implement.
I imagine we could proceed as follows:

  1. Set view_count to zero for all rows in pageview_hourly that match the condition in T141506#2582628 and that date from the weeks between the incident and the implementation of the workaround (T141786#2558383) - it's not very many rows.
  2. Regenerate the data in projectview_hourly and other derived data sources for the affected timespan, emulating the routine aggregation processes (a rough sketch of both steps follows below).
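
A rough sketch of what both steps could look like in HiveQL. This is only an illustration under simplifying assumptions, not a tested recipe: the column lists are heavily abridged (the real tables carry more columns, all of which would be selected through unchanged), FALSE is a placeholder for the actual anomaly condition from T141506#2582628, and the date restriction is an arbitrary example rather than the real incident window.

  -- Step 1 (sketch): rewrite the affected pageview_hourly partitions, zeroing
  -- view_count for rows matching the anomaly condition. Column list abridged;
  -- FALSE stands in for the condition from T141506#2582628.
  SET hive.exec.dynamic.partition.mode = nonstrict;

  INSERT OVERWRITE TABLE wmf.pageview_hourly
  PARTITION (year, month, day, hour)
  SELECT
    project,
    page_title,
    access_method,
    agent_type,
    IF(FALSE /* anomaly condition */, 0, view_count) AS view_count,
    year, month, day, hour
  FROM wmf.pageview_hourly
  WHERE year = 2016 AND month = 8;       -- illustrative restriction to the affected weeks

  -- Step 2 (sketch): re-aggregate the derived projectview_hourly for the same
  -- window, mirroring the routine hourly aggregation (columns again abridged).
  INSERT OVERWRITE TABLE wmf.projectview_hourly
  PARTITION (year, month, day, hour)
  SELECT
    project,
    access_method,
    agent_type,
    SUM(view_count) AS view_count,
    year, month, day, hour
  FROM wmf.pageview_hourly
  WHERE year = 2016 AND month = 8
  GROUP BY project, access_method, agent_type, year, month, day, hour;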

I would be happy to assist with step 1 if needed.

Event Timeline

We already discussed this issue on this ticket: https://phabricator.wikimedia.org/T141506#2575088 and I second @BBlack's opinion. In short: I do not think this traffic should be removed; it is real (if unintentional). We count real requests coming to our servers, and these are very real requests. I understand that the magnitude of this event is large, but it is really not the only one we have of this type (in 2015 there was a similar one that at some point measured 5% of overall traffic), and as mentioned on the ticket I am of the opinion that we should keep what we count as close to reality as possible, as that is the best way to make sense of the data.

> And community members and the public do not have that option and are thus left with the faulty data.

This is most certainly not true. Any dataset comes with caveats and issues, and this is just one of several that you should be aware of: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageview_hourly#Changes_and_known_problems_since_2015-06-16

>> And community members and the public do not have that option and are thus left with the faulty data.
>
> This is most certainly not true. Any dataset comes with caveats and issues, and this is just one of several that you should be aware of: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageview_hourly#Changes_and_known_problems_since_2015-06-16

You obviously misunderstood what "that option" referred to: the ability to correct these anomalies by using T141506#2582628 in a custom Hive query (please read the full task description, including the preceding sentence). And, as also already mentioned in the task, even a warning that the data is faulty is missing in prominent places, leading users of this data astray.

> We already discussed this issue on this ticket: https://phabricator.wikimedia.org/T141506#2575088 and I second @BBlack's opinion. In short: I do not think this traffic should be removed; it is real (if unintentional). We count real requests coming to our servers, and these are very real requests.

This indicates an understanding of the purpose of this data that diverges from that of most of the people who actually use it. They are not interested in an abstract platonic ideal of "real traffic" as in "bits came down a wire", but in the best possible estimate of human readership.

> I understand that the magnitude of this event is large, but it is really not the only one we have of this type (in 2015 there was a similar one that at some point measured 5% of overall traffic), and as mentioned on the ticket I am of the opinion that we should keep what we count as close to reality as possible, as that is the best way to make sense of the data.

Of course we can't detect and filter every instance of spurious traffic, and I understand your team's desire to limit the amount of work going into this area. But in this case the impact is so large (desktop traffic inflated by up to 30% or more) and the fix so readily available that I don't think that can be a justification for inaction here.

...

> On the Operations end of things, we look at data that is much closer to the wire-level view of traffic in raw HTTP request terms. Analytics is looking at the more human view of things. However, analytics output of human pageviews still has to be anchored to some technical explanation and derivation or it begins to lose all meaning and become arbitrary. IMHO, over the long term, if the high-level explanation of analytics pageviews reads like "This is HTTP requests to wiki content pages by known browser and mobileapp agents, but we've also expertly and silently applied a lot of other filtering and manipulation you'll never understand", the meaning becomes fuzzy to the data consumer.

The current web pageview definition is not based on a positive list of known browser agents; on the contrary, it already does "a lot of filtering" (https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/Webrequest.java#L76) to arrive at the human (more precisely: non-spider, agent_type = "user") numbers that this task is about.
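
To make this concrete: every pageview_hourly row already carries an agent_type assigned by those user-agent heuristics, so a "human" count is by construction one slice of a classified dataset rather than a raw wire-level tally. A sketch (the breakdown by agent_type is the point; the date shown is only illustrative):

  -- Sketch: the breakdown by agent_type shows that "user" counts are already
  -- the output of a classification step, not unfiltered request counts.
  SELECT
    agent_type,                                 -- e.g. 'user' or 'spider', as assigned by the refinery code
    SUM(view_count) AS views
  FROM wmf.pageview_hourly
  WHERE year = 2016 AND month = 8 AND day = 1   -- illustrative single day
  GROUP BY agent_type;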

> It's better that the baseline statistics have a solid and simple meaning,

As mentioned, that's not the case anyway for the numbers that this task is about.

> and that other one-off filtering and manipulation is applied on top of that in a way that's transparent to the consumer

Over a year later, this still hasn't happened for several widely used reports on this data (probably for none except my own), and people are led to wrong conclusions because of it. It would also create a lot of unnecessary extra work, and, as mentioned in the present task, is impossible for most consumers of this data anyway, because they don't have the required access. Applying the correction at the common source (pageview_hourly) seems the only practical solution.

> (e.g. flagging the requests as likely being related to a particular persistent class of automation or abuse, or linking them to a particular short-duration incident). The bar for removing such a flagged class from the baseline data (where it becomes an invisible-to-the-consumer filter, unless you add more complexity to the explanation of the baseline data) should be pretty high.

The filter will of course be publicly documented, like the above-mentioned spider filters are. I think most consumers will prefer having more accurate data, and avoiding wrong conclusions in their analyses, over the abstract capability of reconstructing the "real" count of bits that came down the wire.

While we do want to fix data when we have infrastructure problems, we want to approach this type of issue as a "Pageview definition" problem. So we are adding this to the broader category of improving bot identification, and we will fix the pageview definition such that it excludes similar traffic in the future. But just as we didn't retroactively fix pagecount data from 2007-2015 according to our new spider filtering, we don't want to do it here either. We don't want to have different Pageview definitions being measured at the same time.

Nuria moved this task from Geowiki to Datasets on the Analytics board.