
Wikipedia.org Portal Dashboard: add "other" to pageviews page
Closed, Resolved | Public | 8 Estimated Story Points

Description

As detailed in T143045, we have a fairly large number of visits to the Wikipedia.org portal page that are not from actual individual users, self-identified bots, or malicious denial-of-service (DoS) attacks.

I'd like to track these 'other' pageviews on the Portal dashboard. They recur regularly (daily, weekly, monthly, or on some other set timeframe that we don't know about) and have a noticeable impact on our overall pageviews, even though many of these thousands of pageviews come from just a handful of individual IPs.

These pageviews appear to be some sort of speed test or 'are you alive' test. Such tests are not uncommon: many corporations (large and small) and individuals hit google.com for that exact purpose. They're not harmful and we shouldn't block them, but we should track their impact on our overall pageview counts.

We don't want to publish or otherwise identify these users and their IPs, but we need to have the ability to identify them automagically via our event logging so that as new ones appear (and old ones drop off) we don't have to manually add them to our 'other' listing on the dashboard.

Event Timeline

debt triaged this task as Medium priority.Aug 22 2016, 8:37 PM

I previously found that 99.58% of the IPs accounted for less than 100 pageviews each (on 17 Aug 2016). The remaining 0.42% of the IP addresses include IPs that generated upwards of 44K pageviews, although in aggregate they account for only about 1M pageviews.

So one thing we can do is count up PVs by client IP and then split those counts into 2 groups: a "low-volume clients" group with counts less than or equal to 100 and a "high-volume clients" group with counts greater than 100. Then:

Total pageviews = pageviews from low-volume clients + pageviews from high-volume clients

We can then have an option for the dashboard user to look at total pageviews or to look at pageviews for those groups separately.

Why 100? No particular reason; just seems like a reasonable threshold to set.
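For illustration, here is a minimal base-R sketch of the split described above. The `requests` data frame and its IPs (taken from the 192.0.2.0/24 documentation range) are made up for the example; the real counts would come from the server-side web request data, which isn't shown here.

```r
# Hypothetical sample: one row per pageview, keyed by client IP.
requests <- data.frame(
  client_ip = c(rep("192.0.2.1", 3), rep("192.0.2.2", 150), rep("192.0.2.3", 100))
)

threshold <- 100  # the cutoff proposed above

# Count pageviews per client IP.
pv_per_ip <- as.integer(table(requests$client_ip))

# Clients with <= threshold pageviews are low-volume; the rest are high-volume.
low_volume <- sum(pv_per_ip[pv_per_ip <= threshold])
high_volume <- sum(pv_per_ip[pv_per_ip > threshold])
total <- nrow(requests)

# The two groups must add back up to the total.
stopifnot(low_volume + high_volume == total)
print(c(pageviews = total, high_volume = high_volume, low_volume = low_volume))
```

Note that a client with exactly 100 pageviews lands in the low-volume group, matching the "less than or equal to 100" wording.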

@chelsyx: What do you think?

@mpopov I think that makes sense, although we have to wait until next time to see whether those high-volume clients are responsible for the spike.

@debt wrote:

we need to have the ability to identify them automagically via our event logging so that as new ones appear (and old ones drop off) we don't have to manually add them to our 'other' listing on the dashboard.

@debt: FYI, this is about server-side web requests, which are separate from client-side event logging.

mpopov set the point value for this task to 3.

Note to future self (and @chelsyx):

After the upcoming patch has been submitted, run the following R code on stat1002:

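# Read the existing daily pageview counts
x <- readr::read_tsv("/a/aggregate-datasets/portal/portal_pageviews.tsv")
# Backfill the two new columns with NA for dates before the split existed
x$high_volume <- NA
x$low_volume <- NA
# Overwrite the file so it has the same 4 columns the patch will output
readr::write_tsv(x, "/a/aggregate-datasets/portal/portal_pageviews.tsv")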

Because the patch will output 4 columns per day: "date", "pageviews" (total), "high_volume", and "low_volume", the above chunk of code will need to be run or there will be errors when golden/portal/pageviews.R tries to append the latest counts to the existing file.
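To make the failure mode concrete, here is a small base-R sketch. It assumes an rbind-style append of a 4-column row onto a 2-column table; how golden/portal/pageviews.R actually appends isn't shown in this task, and the file, dates, and counts below are hypothetical.

```r
# Hypothetical existing dataset with only the old 2 columns, in a temp file.
path <- tempfile(fileext = ".tsv")
old <- data.frame(date = c("2016-08-15", "2016-08-16"), pageviews = c(100, 120))
write.table(old, path, sep = "\t", quote = FALSE, row.names = FALSE)

# Hypothetical new daily row in the 4-column layout the patch will output.
new_row <- data.frame(date = "2016-08-17", pageviews = 130,
                      high_volume = 40, low_volume = 90)

existing <- read.table(path, sep = "\t", header = TRUE, stringsAsFactors = FALSE)

# Appending 4 columns onto 2 columns fails.
append_failed <- inherits(try(rbind(existing, new_row), silent = TRUE), "try-error")

# After backfilling the new columns with NA, the append goes through.
existing$high_volume <- NA
existing$low_volume <- NA
combined <- rbind(existing, new_row)
print(combined)
```

The older dates keep NA in the new columns, which is honest: the split simply wasn't computed for them.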

@ksmith mentioned that if we could see a histogram of pageviews per IP address, that would help us come to consensus on the appropriate threshold to use (rather than almost arbitrarily going with 100), so here it is: https://plot.ly/~bearloga/2/distribution-of-wikipediaorg-pvs-per-ip-address-on-17-aug-2016/

It's bad about showing the two observations that had 44K PVs that day, but it is interactive :)

Also, here's a cumulative distribution table (for 17 August 2016) that we can use for determining the threshold for splitting clients into low-volume/high-volume groups:

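threshold | % of IPs with pageviews <= threshold | low-volume client PVs | proportion of total PVs accounted for by low-volume clients | high-volume client PVs | proportion of total PVs accounted for by high-volume clients
--------- | ------------------------------------ | --------------------- | ----------------------------------------------------------- | ---------------------- | ------------------------------------------------------------
10        | 86.88233%                            | 4.64M                 | 25.17%                                                      | 13.791M                | 74.83%
100       | 99.58767%                            | 12.017M               | 65.20%                                                      | 6.413M                 | 34.80%
250       | 99.89693%                            | 13.052M               | 70.82%                                                      | 5.379M                 | 29.18%
500       | 99.95511%                            | 13.497M               | 73.23%                                                      | 4.934M                 | 26.77%
750       | 99.96447%                            | 13.63M                | 73.95%                                                      | 4.801M                 | 26.05%
1K        | 99.96717%                            | 13.683M               | 74.24%                                                      | 4.747M                 | 25.76%
2K        | 99.98183%                            | 14.169M               | 76.88%                                                      | 4.262M                 | 23.12%
5K        | 99.98367%                            | 14.3M                 | 77.59%                                                      | 4.13M                  | 22.41%
10K       | 99.99290%                            | 15.863M               | 86.07%                                                      | 2.568M                 | 13.93%
15K       | 99.99594%                            | 16.868M               | 91.52%                                                      | 1.563M                 | 8.48%
20K       | 99.99991%                            | 18.34M                | 99.51%                                                      | 90.814K                | 0.49%
40K       | 99.99991%                            | 18.34M                | 99.51%                                                      | 90.814K                | 0.49%
50K       | 100.00000%                           | 18.43M                | 100.00%                                                     | 0                      | 0.00%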
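A table like this can be derived from a vector of per-IP pageview counts. The sketch below is one way to do it in base R, not necessarily how the numbers above were produced; the function name `cumulative_table`, its column names, and the sample counts are all hypothetical.

```r
# Build one row per candidate threshold from per-IP pageview counts.
cumulative_table <- function(pv_per_ip, thresholds) {
  total <- sum(pv_per_ip)
  rows <- lapply(thresholds, function(th) {
    at_or_below <- pv_per_ip <= th
    low <- sum(pv_per_ip[at_or_below])
    data.frame(
      threshold = th,
      pct_ips_at_or_below = 100 * sum(at_or_below) / length(pv_per_ip),
      low_volume_pvs = low,
      low_volume_share = 100 * low / total,
      high_volume_pvs = total - low,
      high_volume_share = 100 * (total - low) / total
    )
  })
  do.call(rbind, rows)
}

# Hypothetical counts: 100 IPs, most with 1 PV, one with 44K PVs.
pv_per_ip <- c(rep(1, 90), rep(50, 8), 500, 44000)
tab <- cumulative_table(pv_per_ip, c(10, 100, 1000, 50000))
print(tab)
```

As in the real table, the share of IPs climbs toward 100% long before the share of pageviews does, because a few clients carry most of the volume.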

In our meeting, @chelsyx suggested determining the threshold dynamically, so that it separates the bottom 99.99% of clients from the top 0.01%. (Or even 99.9% vs 0.1%!)
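A rough sketch of what a dynamic threshold could look like, using base R's `quantile()` with its default interpolation. The helper name `split_by_quantile`, the `top_share` parameter, and the sample counts are hypothetical, and the merged patch may well compute this differently.

```r
# Split clients at a quantile of the per-IP pageview distribution.
# top_share is the fraction of clients to treat as high-volume (0.0001 = top 0.01%).
split_by_quantile <- function(pv_per_ip, top_share = 0.0001) {
  threshold <- unname(quantile(pv_per_ip, probs = 1 - top_share))
  is_high <- pv_per_ip > threshold
  list(
    threshold = threshold,
    n_high_clients = sum(is_high),
    high_volume = sum(pv_per_ip[is_high]),
    low_volume = sum(pv_per_ip[!is_high])
  )
}

# Hypothetical day: 10,000 IPs, a handful of which are heavy hitters.
pv_per_ip <- c(rep(1, 9000), rep(20, 990), rep(150, 9), 44000)

strict <- split_by_quantile(pv_per_ip, top_share = 0.0001)  # 99.99% vs 0.01%
loose <- split_by_quantile(pv_per_ip, top_share = 0.001)    # 99.9% vs 0.1%
print(c(strict = strict$n_high_clients, loose = loose$n_high_clients))
```

Because the threshold is recomputed from each day's data, it moves with the distribution instead of staying fixed at an arbitrary value like 100.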

This will be done as an experimental project, the same way that the "languages visited from portal" addition to the dashboard started out.

For my $0.02, I tend to think 99.99% sounds better as a threshold than 99.9%. There could be actual human outliers near that 99.9% mark (if you're awake 18 hours/day and average a pageview every 5 minutes).
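Spelling out the arithmetic behind that hypothetical heavy human user (variable names are just for illustration):

```r
hours_awake <- 18
minutes_per_pageview <- 5

# Pageviews per day for someone averaging one pageview every 5 minutes while awake.
human_daily_pvs <- hours_awake * 60 / minutes_per_pageview
print(human_daily_pvs)  # 216
```

In the 17 August table, 216 pageviews falls between the 100 and 250 thresholds, which cover 99.58767% and 99.89693% of IPs respectively, so such a person would indeed sit close to the 99.9% mark.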

mpopov changed the point value for this task from 3 to 8.Aug 25 2016, 5:24 PM

Change 309617 had a related patch set uploaded (by Bearloga):
Split Portal pageviews by high-volume & low-volume clients

https://gerrit.wikimedia.org/r/309617

Change 309617 merged by Bearloga:
Split Portal pageviews by high-volume & low-volume clients

https://gerrit.wikimedia.org/r/309617

Change 309707 had a related patch set uploaded (by Bearloga):
Document split-pageviews

https://gerrit.wikimedia.org/r/309707

Change 309707 merged by Bearloga:
Document split-pageviews

https://gerrit.wikimedia.org/r/309707

Looks good, @mpopov !

Adding a screenshot for reference.

Screen Shot 2016-09-12 at 1.46.04 PM.png (519×756 px, 87 KB)

Change 313870 had a related patch set uploaded (by Bearloga):
Deploy dashboard updates

https://gerrit.wikimedia.org/r/313870

Change 313870 merged by Bearloga:
Deploy dashboard updates

https://gerrit.wikimedia.org/r/313870