Determine if XXX is valid Top Read and potentially add to blacklist
Closed, Resolved (Public)

Description

It's possible XXX was Top Read because of a recent movie release, but I would like to check that there's no obvious bot-like traffic and to consider adding it to our blacklist.

Event Timeline

I wrote a couple of shell scripts (P5221) to see when it was showing up in trending, and also calculated the ratio of desktop views to total views for each occurrence. First the disambiguation page XXX trended, then XXX_(film_series). When I looked at a couple of dates in more detail, I noticed that both trended on some days but were eliminated due to an overly skewed ratio. Most of the traffic seems to come from mobile web; the desktop traffic is fairly small. The list of top views (from the PageView API) has many entries about movies and some TV shows in the top 50.

02/12 XXX: desktop views / all views = 27608 / 282834 = 9.7%
03/03 XXX: desktop views / all views = 27805 / 256418 = 10.8%
03/04 XXX: desktop views / all views = 25093 / 268532 = 9.3%
03/11 XXX: desktop views / all views = 15757 / 196951 = 8.0%
03/20 XXX_(film_series): desktop views / all views = 14774 / 138275 = 10.6%
03/25 XXX_(film_series): desktop views / all views = 13641 / 113932 = 11.9%
03/26 XXX_(film_series): desktop views / all views = 12164 / 85631 = 14.2%
03/27 XXX_(film_series): desktop views / all views = 11137 / 78002 = 14.2%
04/02 XXX_(film_series): desktop views / all views = 12513 / 102541 = 12.2%

On 2/12, 3/4, and 3/11 the ratio was below the threshold of 10%. I don't have a solid explanation for why the entry still showed up. When I run the endpoint on my local machine, the entries are not included for these days. We may need to add more debugging info to see why entries got included erroneously. I'm not sure where best to store this debugging info, though. One possible explanation is that the values from the PageView API were changed after the fact.
The entries after 03/11 pass the ratio filter (using current values, too).
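For reference, the ratio check described above can be sketched in a few lines of Python. This is only an illustration and not the actual service code: the function and constant names are hypothetical, and the only values taken from this task are the 10% threshold and the view counts listed above.

```python
# Hypothetical sketch of the desktop/all-views ratio check discussed above.
# Names are illustrative; only the 10% threshold and the counts come from this task.

MIN_DESKTOP_SHARE = 0.10  # entries below this desktop share are filtered out

# (date, title, desktop views, all views) as listed in the comment above
OBSERVATIONS = [
    ("02/12", "XXX", 27608, 282834),
    ("03/03", "XXX", 27805, 256418),
    ("03/04", "XXX", 25093, 268532),
    ("03/11", "XXX", 15757, 196951),
    ("03/20", "XXX_(film_series)", 14774, 138275),
    ("03/25", "XXX_(film_series)", 13641, 113932),
    ("03/26", "XXX_(film_series)", 12164, 85631),
    ("03/27", "XXX_(film_series)", 11137, 78002),
    ("04/02", "XXX_(film_series)", 12513, 102541),
]


def desktop_share(desktop_views, all_views):
    """Return desktop views as a fraction of all views."""
    if all_views <= 0:
        raise ValueError("all_views must be positive")
    return desktop_views / all_views


def passes_ratio_filter(desktop_views, all_views, min_share=MIN_DESKTOP_SHARE):
    """True if the desktop share is at or above the threshold."""
    return desktop_share(desktop_views, all_views) >= min_share


def filtered_dates(observations):
    """Dates whose entry would be dropped by the ratio check."""
    return [
        date
        for date, _title, desktop, total in observations
        if not passes_ratio_filter(desktop, total)
    ]


if __name__ == "__main__":
    for date, title, desktop, total in OBSERVATIONS:
        share = desktop_share(desktop, total)
        verdict = "pass" if passes_ratio_filter(desktop, total) else "filtered"
        print(f"{date} {title}: {share:.1%} {verdict}")
```

With the current values, only the 02/12, 03/04, and 03/11 entries fall below the threshold, which matches the three dates called out above.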

The way I understand this task, it is about a specific title (potentially including related titles linked from the disambig page) which trended long after the changes to resolve T143990 were deployed. If you just look at the desktop view ratios for the XXX disambig page, 3 out of the 4 occurrences so far this year should not have trended.

Oh, I was reading XXX as a placeholder rather than a specific title. Oops. ;) Thanks for explaining.

Correct @bearND. This is a specific case which looked like it slipped through and I wanted to:

  • Find out if it should have been caught by the ratio heuristic. If so, is there a bug there, or some special case we missed?
  • If it wasn't "supposed" to be filtered by the ratio heuristic, discuss adding it to a manual blacklist.

It was basically, "I noticed this weird unexpected item; can you investigate whether it's actually 'unexpected'?"

@JMinor Do you think this requires more investigation?

@bearND I'd lean towards no for now, given the difficulty of recreating the exact circumstances. Maybe a little follow-up task or thinking about logging or debugging would be helpful if it comes up again, but everything seems fine now, so probably not worth more work for now.

bearND claimed this task.

OK, then I'll resolve it for now. We can reopen it if needed.

I suppose it would aid my interpretation of bugs if I read the earlier comments ;)

On 2/12, 3/4, and 3/11 the ratio was below the threshold of 10%. I don't have a solid explanation for why the entry still showed up. When I run the endpoint on my local machine, the entries are not included for these days. We may need to add more debugging info to see why entries got included erroneously.

This is troubling to me. Yes, could be updates to the totals in the Pageview API but I'd want to know for sure.

FWIW I think we could be a lot more aggressive in filtering. Statistically speaking, any one article with pageview sources skewed beyond, say, 70/30 in either direction seems highly suspect.

Not trying to reopen this for now but just getting down my thoughts in case this becomes more of a problem in the future.
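To make the 70/30 idea concrete, here is a minimal Python sketch of a symmetric skew check. It is an assumption-laden illustration, not a proposal for the real filter: the function name and bounds are hypothetical, and "skewed beyond 70/30 in either direction" is read here as a desktop share below 30% or above 70%.

```python
# Hypothetical symmetric skew check; names and bounds are illustrative only.

MIN_SHARE = 0.30  # assumed lower bound on desktop share
MAX_SHARE = 0.70  # assumed upper bound on desktop share


def is_skewed(desktop_views, all_views, min_share=MIN_SHARE, max_share=MAX_SHARE):
    """True if the desktop share lies outside [min_share, max_share]."""
    if all_views <= 0:
        raise ValueError("all_views must be positive")
    share = desktop_views / all_views
    return share < min_share or share > max_share


if __name__ == "__main__":
    # The 03/03 XXX entry (about 10.8% desktop) passes the 10% check
    # but would be flagged by this stricter rule.
    print(is_skewed(27805, 256418))
```

Under this reading, every XXX and XXX_(film_series) entry listed earlier in the task would be flagged, since none has a desktop share above 15%.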

FWIW I think we could be a lot more aggressive in filtering. Statistically speaking, any one article with pageview sources skewed beyond, say, 70/30 in either direction seems highly suspect.

Though this is not necessarily true for smaller wikis so we'd want to exercise some caution.

Edit to add: if we wanted to be data-driven about it, it wouldn't be hard in principle to calculate the per-wiki standard deviation on mobile vs. desktop pageviews for some period of time and then set the cutoff at something like 2 SD for that wiki.
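A rough Python sketch of the data-driven variant might look like the following. It assumes the statistic of interest is the per-article desktop share of page views, sampled over some period for one wiki; the function names and sample numbers are made up for illustration, and only the 2 SD cutoff comes from the comment above.

```python
# Hypothetical per-wiki cutoff at k standard deviations from the mean desktop share.
# The sample shares below are invented for illustration, not real data.
import statistics


def share_bounds(desktop_shares, k=2.0):
    """Return (low, high) bounds at mean +/- k sample standard deviations.

    desktop_shares: desktop share (0..1) per article for one wiki and period.
    The bounds are clamped to the valid range [0, 1].
    """
    if len(desktop_shares) < 2:
        raise ValueError("need at least two samples to estimate a deviation")
    mean = statistics.mean(desktop_shares)
    sd = statistics.stdev(desktop_shares)
    return max(0.0, mean - k * sd), min(1.0, mean + k * sd)


def is_outlier(desktop_share, bounds):
    """True if the share falls outside the per-wiki bounds."""
    low, high = bounds
    return desktop_share < low or desktop_share > high


if __name__ == "__main__":
    sample_shares = [0.4, 0.5, 0.6]  # invented sample for one wiki
    bounds = share_bounds(sample_shares)
    print(bounds, is_outlier(0.097, bounds))
```

Computing the bounds per wiki would address the caveat about smaller wikis, since each wiki's typical mobile/desktop mix sets its own cutoff.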