Wikipedia.org Portal Dashboard: investigate increase in pageviews
Closed, Resolved | Public | 4 Estimated Story Points

Description

Starting in the middle of June 2016, we have seen an increase in page view counts for the wikipedia.org portal site. This doesn't seem to correlate with anything that the Discovery team released around that time.

I'd like to know more about what might have happened during the middle of June that caused our page views to go up and stay up (see image below).

wikipedia-portal-pageviews-increase.png (590×1 px, 123 KB)

There is another ticket T141506 that other teams are working on because of a different spike on / around July 20, 2016 that might be of interest.

Event Timeline

debt triaged this task as High priority. Aug 15 2016, 8:26 PM
mpopov moved this task from Backlog to In progress on the Discovery-Analysis (Current work) board.
mpopov set the point value for this task to 4.

GOOD NEWS! We haven't been massively over-counting webrequests as pageviews: https://github.com/wikimedia-research/Discovery-Research-Portal/tree/master/Analyses/Pageviews%20Rise#part-1 but BAD NEWS! Still, we've been over-reporting the pageview counts by about 44% because we were neither excluding known bots nor limiting the count to web requests with HTTP status codes 200 and 304.

HOWEVER, even with the filtering, that doesn't explain or correct for the doubling of pageviews that we've seen, so I have now started on part 2 of the analysis: digging in deeper to see if there are any particular users we're getting heavy traffic from.
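For anyone who wants to reproduce the filtering idea, here is a rough sketch in Python. It is not the code from the linked analysis. The `WebRequest` record, its field names (`http_status`, `agent_type`), and the `"spider"` label are assumptions made for illustration; only the two rules (drop known bots, keep status codes 200 and 304) come from the comment above.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class WebRequest:
    """Hypothetical request record; the field names are assumptions."""
    client_ip: str
    http_status: str
    agent_type: str  # assumed labels: "user" or "spider" (known bot)


# Status codes that count toward pageviews, per the comment above.
COUNTED_STATUSES = {"200", "304"}


def is_countable_pageview(req: WebRequest) -> bool:
    """Keep a request only if it has a counted status and is not a known bot."""
    return req.http_status in COUNTED_STATUSES and req.agent_type != "spider"


def filter_pageviews(requests: Iterable[WebRequest]) -> List[WebRequest]:
    """Return the requests that survive both filters."""
    return [r for r in requests if is_countable_pageview(r)]


def over_reporting_ratio(requests: List[WebRequest]) -> float:
    """Fraction by which the unfiltered count exceeds the filtered count."""
    filtered = len(filter_pageviews(requests))
    if filtered == 0:
        raise ValueError("no countable pageviews in the sample")
    return len(requests) / filtered - 1.0
```

With this definition, an over-reporting figure of about 44% would mean the unfiltered count is roughly 1.44 times the filtered count.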

Let's break out the 'weird' unknown IPs that are generating a large number of pageviews but aren't self-identified bots or other known entities like that.

That way, we can show what we're quite confident is a real user pageview and what is a bit of weird behavior that is inflating the overall count of pageviews but isn't something that we should filter out (because it's not a bot/spider).

Mhm! Want to make a ticket for that? And we can close this out because there's not much left to do due to lack of data. Part 2 summary:

  • There are over 50 IP addresses that are responsible for 15K-44K Wikipedia.org Portal pageviews a day.
  • They are responsible for 2%-4% of overall pageviews.
  • 99.58% of the IP addresses are responsible for less than 100 pageviews a day each, with the remaining 0.42% of IP addresses having more than 100 pageviews a day each, up to 44K PVs/day.
  • Not having pre-rise data from June and May makes it impossible to find out precisely why pageviews doubled.
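The per-IP breakdown in the summary above can be sketched roughly as follows. This is an illustrative Python sketch, not the code from the analysis. The function names and the input shape (one client IP string per pageview for a single day) are assumptions, and an IP with exactly 100 pageviews is treated as heavy here, since the summary does not say which side it falls on.

```python
from collections import Counter
from typing import Dict, Iterable


def count_pageviews_by_ip(client_ips: Iterable[str]) -> Counter:
    """Count one day's pageviews per IP (input: one IP string per pageview)."""
    return Counter(client_ips)


def summarize_ip_counts(counts: Dict[str, int], threshold: int = 100) -> Dict[str, float]:
    """Split IPs into light (< threshold) and heavy (>= threshold) groups."""
    if not counts:
        raise ValueError("no pageviews to summarize")
    total_ips = len(counts)
    total_pageviews = sum(counts.values())
    # Assumption: exactly `threshold` pageviews counts as heavy.
    heavy = {ip: n for ip, n in counts.items() if n >= threshold}
    return {
        "light_ip_fraction": (total_ips - len(heavy)) / total_ips,
        "heavy_ip_fraction": len(heavy) / total_ips,
        "heavy_pageview_share": sum(heavy.values()) / total_pageviews,
    }
```

Run per day, `light_ip_fraction` corresponds to the 99.58% figure and `heavy_ip_fraction` to the 0.42% figure. `heavy_pageview_share` is the share of pageviews from every IP at or above the threshold, which is a broader group than the 50-odd top IPs mentioned in the first two bullets.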

Hi @mpopov - can you publish your findings on Commons so we can link to them from the dashboard, to help explain why we don't really know why the increase in page views happened?

Also - please put a note on the dashboard on the pageviews page that links to the analysis done. :)

Thanks!

Change 306267 had a related patch set uploaded (by Bearloga):
Document PV events

https://gerrit.wikimedia.org/r/306267