Wikipedia.org Portal Dashboard: investigate increase in pageviews
Closed, Resolved · Public · 4 Story Points

Description

Starting in the middle of June 2016, we have seen an increase in page view counts for the wikipedia.org portal site. This doesn't seem to correlate with anything that the Discovery team released around that time.

I'd like to know more about what might have happened during the middle of June that caused our page views to go up and stay up (see image below).

Another ticket, T141506, which other teams are working on, covers a different spike on or around July 20, 2016 and might be of interest.

debt created this task. Aug 15 2016, 8:26 PM
Restricted Application added a subscriber: Aklapper. Aug 15 2016, 8:26 PM
debt triaged this task as "High" priority. Aug 15 2016, 8:26 PM
mpopov claimed this task.
mpopov set the point value for this task to 4.
Tbayer added a subscriber: Tbayer. Aug 18 2016, 8:52 PM

GOOD NEWS! We haven't been massively over-counting webrequests as pageviews: https://github.com/wikimedia-research/Discovery-Research-Portal/tree/master/Analyses/Pageviews%20Rise#part-1 but BAD NEWS! Still, we've been over-reporting the pageview counts by about 44% by not excluding known bots and by only including web requests with HTTP status codes of 200 and 304.

HOWEVER, even with the filtering, that doesn't explain or correct for the doubling of pageviews that we've seen, so I have now started on part 2 of the analysis: digging deeper to see whether we're getting heavy traffic from any particular users.
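
For anyone who wants to reproduce the filtering idea on their own sample, here is a minimal sketch in Python (not the code used in the linked analysis). It assumes each web request is a dict with hypothetical `agent_type` and `http_status` fields, that known bots are marked as `"spider"`, and that the overcount is measured relative to the filtered count; all of those are assumptions made for illustration.

```python
from typing import Iterable, Mapping

# Status codes that count toward pageviews, per the comment above.
ALLOWED_STATUSES = {"200", "304"}


def is_countable(request: Mapping[str, str]) -> bool:
    """True if the request is not a known bot and has an allowed status.

    The field names and the "spider" label are hypothetical.
    """
    return (
        request["agent_type"] != "spider"
        and request["http_status"] in ALLOWED_STATUSES
    )


def overcount_ratio(requests: Iterable[Mapping[str, str]]) -> float:
    """Return (unfiltered - filtered) / filtered for a batch of requests.

    Assumption: the overcount is expressed relative to the filtered count.
    """
    batch = list(requests)
    kept = sum(1 for r in batch if is_countable(r))
    if kept == 0:
        raise ValueError("no countable requests in this batch")
    return (len(batch) - kept) / kept


# Made-up sample: 5 requests, of which 3 pass the filter.
sample = [
    {"agent_type": "user", "http_status": "200"},
    {"agent_type": "user", "http_status": "304"},
    {"agent_type": "spider", "http_status": "200"},
    {"agent_type": "user", "http_status": "404"},
    {"agent_type": "user", "http_status": "200"},
]
ratio = overcount_ratio(sample)
```

On this made-up sample the ratio is 2/3, because two of the five requests are dropped and three are kept. The numbers are illustrative only and say nothing about the real Portal traffic.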

debt added a comment. Aug 19 2016, 8:31 PM

Let's break out the 'weird' unknown IPs that are generating a large number of pageviews but aren't self-identified bots or other known entities like that.

That way, we can show what we're quite confident is a real user pageview and what is a bit of weird behavior that is inflating the overall count of pageviews but isn't something that we should filter out (because it's not a bot/spider).

Mhm! Want to make a ticket for that? And we can close this out because there's not much left to do due to lack of data. Part 2 summary:

  • There are over 50 IP addresses that are responsible for 15K-44K Wikipedia.org Portal pageviews a day.
  • They are responsible for 2%-4% of overall pageviews.
  • 99.58% of the IP addresses are responsible for less than 100 pageviews a day each, with the remaining 0.42% of IP addresses having more than 100 pageviews a day each, up to 44K PVs/day.
  • Not having pre-rise data from June and May makes it impossible to find out precisely why pageviews doubled.
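
The kind of per-IP breakdown summarized above can be sketched roughly as follows. This is an illustrative Python sketch, not the analysis code; the function name, the input shape (one IP string per pageview for a single day), and the choice to treat strictly more than 100 pageviews as "heavy" are all assumptions.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize_ip_traffic(ips: Iterable[str], threshold: int = 100) -> Dict[str, float]:
    """Summarize one day of pageviews, given one IP string per pageview.

    Hypothetical helper: an IP is "heavy" if it has strictly more than
    `threshold` pageviews that day.
    """
    counts = Counter(ips)
    if not counts:
        raise ValueError("no pageviews to summarize")
    total_ips = len(counts)
    total_pageviews = sum(counts.values())
    heavy = {ip: n for ip, n in counts.items() if n > threshold}
    return {
        # Fraction of distinct IPs above the threshold.
        "heavy_ip_share": len(heavy) / total_ips,
        # Fraction of all pageviews that came from those IPs.
        "heavy_pageview_share": sum(heavy.values()) / total_pageviews,
        # Largest single-IP pageview count for the day.
        "max_pageviews": float(max(counts.values())),
    }


# Made-up day of traffic using documentation-range addresses.
day = ["192.0.2.1"] * 150 + ["192.0.2.2"] * 10 + ["192.0.2.3"] * 40
summary = summarize_ip_traffic(day)
```

In this toy input one of three IPs is above the threshold and it accounts for 150 of 200 pageviews, so the two shares are 1/3 and 0.75. Running the same summary on days before and after a rise would show whether heavy IPs grew, which is exactly the comparison the missing May and June data rules out here.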
debt added a comment. Aug 22 2016, 8:52 PM

Hi @mpopov - can you publish your findings on Commons so we can link to them from the dashboard, to help explain why we don't really know what caused the increase in page views?

Also - please put a note on the dashboard on the pageviews page that links to the analysis done. :)

Thanks!

Change 306267 had a related patch set uploaded (by Bearloga):
Document PV events

https://gerrit.wikimedia.org/r/306267

Change 306267 merged by Bearloga:
Document PV events

https://gerrit.wikimedia.org/r/306267

debt closed this task as "Resolved".