
[Feed] Establish criteria for blacklisting likely bot-inflated most-read articles
Closed, ResolvedPublic

Description

Surprisingly often, a seemingly un-noteworthy article makes it into the top five or so articles by pageview count[1] and remains there. Currently, for example, AMGTV[2] has been on the Trending feed card for English Wikipedia for some time. The persistence of articles like this in the upper ranks supports the suspicion that their rankings are being inflated by bot traffic.

Earlier, we blacklisted two other articles about TV stations for the same reason: https://gerrit.wikimedia.org/r/302720

From the start, we've incorporated a blacklist inherited from the iOS app's feed prototype. See related discussion at T124716, especially beginning with T124716#2024574.

Of course, we shouldn't be getting into the realm of censorship. We should make these blacklisting decisions as objectively as possible rather than making them on an arbitrary, case-by-case basis. Let's come up with some standard blacklisting criteria.

[1] As reported by the Wikimedia Pageview API.
[2] https://en.wikipedia.org/wiki/AMGTV

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptAug 26 2016, 11:35 AM
Mholloway renamed this task from [Feed] Establish a set of criteria for blacklisting likely bot-inflated most-read articles to [Feed] Establish criteria for blacklisting likely bot-inflated most-read articles.Aug 26 2016, 11:35 AM
Mholloway updated the task description. (Show Details)Aug 26 2016, 11:45 AM
Mholloway updated the task description. (Show Details)Aug 26 2016, 11:47 AM
Mholloway updated the task description. (Show Details)Aug 26 2016, 11:50 AM
Mholloway updated the task description. (Show Details)

@Dbrant is this more Open Questions/Discussions?

@Nuria, we discussed this briefly on the first day of the product/tech onsite meeting so I thought I'd follow up. Any ideas on what we can/should do here? We're consuming the Pageviews API and unfortunately (for us) it doesn't expose much info we could use for any more sophisticated filtering, just title, rank, and count.

Thank you!

Restricted Application added a project: Analytics. · View Herald TranscriptOct 14 2016, 6:19 PM

@JAllemandou
Wouldn't your pageviews/userAgent ratio help in identifying this kind of page?

This is definitely the idea, @mforns! However, the last tests I did using this methodology were not positive: a simple pageview_count / distinct_user_agent ratio doesn't do a proper job.
It'll need a cleverer way to filter.
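
A minimal sketch of that ratio, for illustration, assuming access to per-page request records that carry a user-agent string (the data shape below is hypothetical, not the real pageview pipeline schema):

```python
from collections import defaultdict

def views_per_distinct_ua(requests):
    """Compute pageviews / distinct user agents per page title.

    `requests` is assumed to be an iterable of (title, user_agent) pairs;
    in reality this would come from the webrequest/pageview pipeline.
    """
    views = defaultdict(int)
    agents = defaultdict(set)
    for title, user_agent in requests:
        views[title] += 1
        agents[title].add(user_agent)
    return {title: views[title] / len(agents[title]) for title in views}

# Pages with an unusually high ratio (many views from few distinct agents)
# would be candidates for filtering -- though, per the comment above, this
# simple ratio did not separate bot-inflated pages cleanly in practice.
```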

JMinor added a subscriber: JMinor.Oct 20 2016, 8:08 PM

Has anyone actually looked at using the community-developed criteria I have been using in the iOS app?

https://phabricator.wikimedia.org/T124716#2080575

This is a simple metric which looks at the ratio of mobile vs. desktop views; it is used by the Signpost and is how I have been determining items for the iOS blacklist.

I'm not sure why this suggestion, generated by editors and used effectively (though ad hoc) in the iOS client, has not been considered.

@JMinor Where are you finding info on mobile vs. desktop views?

Are you looking at the results for the various platform-specific Pageview API endpoints and then comparing the results?

JMinor added a comment.EditedOct 20 2016, 8:20 PM

I've been using the pageviews web tool to check things that look suspicious (long, sustained high pageviews without any news/cultural reason) and comparing the "web" and "mobile web" platforms. It looks like that UI passes the "platform=" parameter to switch between the two.

https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-20&pages=Cat|Dog
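
For reference, roughly how the same desktop vs. mobile-web comparison could be pulled from the per-article Pageview API instead of the web tool (a sketch only; the article, dates, and thresholds are illustrative):

```python
import requests

API = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

def total_views(project, access, article, start, end):
    """Sum daily 'user' pageviews for one article on one access method."""
    url = f"{API}/{project}/{access}/user/{article}/daily/{start}/{end}"
    resp = requests.get(url, headers={"User-Agent": "platform-split-check (example)"})
    resp.raise_for_status()
    return sum(item["views"] for item in resp.json()["items"])

def desktop_share(project, article, start, end):
    """Fraction of desktop + mobile-web views that came from desktop."""
    desktop = total_views(project, "desktop", article, start, end)
    mobile = total_views(project, "mobile-web", article, start, end)
    return desktop / (desktop + mobile) if desktop + mobile else None

# A share very close to 0.0 or 1.0 (e.g. outside 0.05-0.95) is the kind of
# lopsided split being discussed as suspicious. Dates are YYYYMMDD.
print(desktop_share("en.wikipedia.org", "AMGTV", "20161001", "20161020"))
```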

Nuria added a comment.Oct 20 2016, 8:29 PM

This is a simple metric which looks at the ratio of mobile vs. desktop views; it is used by the Signpost and is how I have been determining items for the iOS blacklist.

As much as I like low-tech solutions, this one seems kind of brittle. It's just a matter of bots moving to the mobile UI, and then this criterion will no longer work.

I think if we exclude all requests marked as "nocookies=1" we might have a better shot.

Mholloway added a comment.EditedOct 20 2016, 8:32 PM

Thanks, @JMinor. One of the main concerns behind this task is that, practically speaking, relying on our own sense of titles that look suspicious really only helps out for English.

Looking at the various Pageview endpoints and your method, though, I think we could write this heuristic into the most-read pages endpoint in a way that would work for all languages. I'm a little concerned because that's already a fairly processing-intensive endpoint and this will make it much more so, but it's worth coding up to test out, at least.

Do you have a specific cutoff for how many views on one platform vs. another counts as suspicious? Probably we wouldn't expect much deviation from 50/50, and it seems like in practical terms the ones that look suspicious end up being closer to 100/0.

Thanks. Yes, my whole push here is exactly to have a heuristic that is not based on editorial intuition. I believe looking at this traffic split is something both the en wiki editors and the iOS team have used to reduce the potential for arbitrary editorial interference in blacklisting.

I believe the metric the Signpost uses is 95/5, though it might help to start by creating some kind of "gold standard" (maybe the existing blacklist) and then trying different thresholds to measure precision and recall against it. I can help set that up in a Google sheet. Maybe you and I can set up a call to chat about the easiest way to verify that this is a valid heuristic.
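
A sketch of that threshold evaluation, using a hypothetical desktop-share table and a hand-curated gold standard (all titles and numbers below are made up for illustration):

```python
def evaluate(threshold, desktop_share, gold_standard):
    """Flag titles whose desktop share falls outside [threshold, 1 - threshold]
    and score the flags against a hand-curated set of known-bad titles."""
    flagged = {title for title, share in desktop_share.items()
               if share < threshold or share > 1 - threshold}
    true_pos = flagged & gold_standard
    precision = len(true_pos) / len(flagged) if flagged else 0.0
    recall = len(true_pos) / len(gold_standard) if gold_standard else 0.0
    return precision, recall

# Hypothetical inputs: desktop share of total views per title, plus a
# hand-curated "gold standard" of titles believed to be bot-inflated.
desktop_share = {"AMGTV": 0.97, "Cat": 0.48, "Dog": 0.52, "Example_TV": 0.99}
gold_standard = {"AMGTV", "Example_TV"}

# Try a few cutoffs (95/5, 90/10, 85/15) and compare.
for threshold in (0.05, 0.10, 0.15):
    p, r = evaluate(threshold, desktop_share, gold_standard)
    print(f"{threshold:.2f}: precision={p:.2f} recall={r:.2f}")
```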

One way to reduce the computational cost would be to run the comparison only once a day against the top N items, then store any hits in a blacklist by language. This would also allow us to manually remove or add items which are missed or shown to be incorrectly blacklisted.
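
And a sketch of that once-a-day variant: check only the top N titles for a language and persist the hits, so they can also be reviewed or corrected by hand (the JSON file store and the input shape are hypothetical):

```python
import json
from datetime import date
from pathlib import Path

def update_blacklist(lang, top_titles_with_share, path="blacklist.json",
                     lower=0.10, upper=0.90):
    """Append today's suspicious titles for one language to a JSON store.

    `top_titles_with_share` is assumed to be a list of (title, desktop_share)
    pairs for the top N most-read articles, computed elsewhere once a day.
    """
    store = Path(path)
    blacklist = json.loads(store.read_text()) if store.exists() else {}
    hits = [title for title, share in top_titles_with_share
            if share < lower or share > upper]
    blacklist.setdefault(lang, {})[str(date.today())] = hits
    store.write_text(json.dumps(blacklist, indent=2))
    return hits
```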

This is a simple metric which looks at the ratio of mobile vs. desktop views; it is used by the Signpost and is how I have been determining items for the iOS blacklist.

As much as I like low-tech solutions, this one seems kind of brittle. It's just a matter of bots moving to the mobile UI, and then this criterion will no longer work.

I think if we exclude all requests marked as "nocookies=1" we might have a better shot.

A couple of thoughts on this: any transparent system will eventually be gamed. We don't know why these people are doing this, so we don't know whether just removing them from the API will even be noticed by them (I seriously doubt it's about seeing their name in the apps, for example). Like security, this will likely be a never-ending game of cat and mouse.

All I'm suggesting is that we analyze this approach against our current data and see if it is helpful. In manually doing so both the en wiki editors and the iOS team have found it to be a useful, if incomplete, way to remove the worst offenders. That doesn't preclude other filters (such as nocookies). This traffic split may not carry over to smaller wikis or may have terrible recall. I'd just like to see it evaluated before we discard it.

The ideal solution here would be to create a "gold standard" of editor-selected "bad apples" and then evaluate the precision and recall of potential filters against them. In my wildest dreams, we would have a machine classifier trained using these types of heuristics as candidate features, and then let the classifier tell us which articles are suspicious. However, I've got users to keep happy, and, like you, I love a cheap, low-tech 80% solution.

Change 317095 had a related patch set uploaded (by Mholloway):
[Demo] Filter most-read pages with <10% or >90% of total views on desktop

https://gerrit.wikimedia.org/r/317095

@JMinor I took a stab at incorporating your heuristic into the most-read endpoint with the patch above and it seems to work pretty well. It actually does the filtering on the fly so there's no need to maintain a blacklist.

This is (in principle) a request that only needs to run once per language per day and then get stored in RESTBase, and anyway came out less computationally complex than I was afraid it might be, so I'm not so worried about melting down the servers after all.

I used 90/10 as the cutoff since it knocks out AMGTV and 95/5 didn't. ;)

^ This is pretty neat

@Jdlrobson pointed out an interesting alternative approach on https://gerrit.wikimedia.org/r/#/c/317095/, namely, sorting by day-over-day changes. That probably better reflects the idea of "trending" than just taking yesterday's top-viewed articles and applying some filters, but it would represent a substantial change to what the card represents, so I'm mentioning it here for product-related discussion.
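
For comparison, a rough sketch of the day-over-day ranking @Jdlrobson suggested, with hypothetical inputs (two dicts mapping title to daily view count):

```python
def rank_by_day_over_day(views_today, views_yesterday, top_n=5):
    """Rank titles by relative growth in views since the previous day."""
    growth = {}
    for title, today in views_today.items():
        yesterday = views_yesterday.get(title, 0)
        growth[title] = (today - yesterday) / max(yesterday, 1)
    return sorted(growth, key=growth.get, reverse=True)[:top_n]
```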

Took a quick look at the results and this seems like a good first cut.

While @Jdlrobson is spot on that it would be better to look at velocity (of reads or edits) to support a trending endpoint, this is not a trending endpoint; it's just a ranked list of top-read pages. I fully support implementing and testing a trending system based more on velocity, which would likely be less affected by these types of pageview bots, but as you said, @Mholloway, that's a discussion for another ticket.

@Mholloway what are you thinking about this patch? Is it viable - or do we want to still rely on the hand curated system for now?

Personally, I think this is a much better way; I just want to see whether it's OK for us to move forward and deploy.

I'm all for it if someone is willing to give it a +2!

I'll follow up with a separate patch removing the blacklist, either now or maybe once the new algorithm is validated if people are feeling cautious.

Change 317095 merged by jenkins-bot:
Filter most-read pages with <10% or >90% of total views on desktop

https://gerrit.wikimedia.org/r/317095
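
For readers following along, the heuristic the merged change applies boils down to something like the following. This is a Python paraphrase for illustration only; the actual change lives in the feed service, and the field names here are made up:

```python
def filter_most_read(pages, lower=0.10, upper=0.90):
    """Keep only pages whose desktop share of total views is within bounds.

    Each page is assumed to be a dict with `views_desktop` and `views_total`
    already summed from the per-platform Pageview API responses.
    """
    kept = []
    for page in pages:
        total = page["views_total"]
        if total == 0:
            continue
        share = page["views_desktop"] / total
        if lower <= share <= upper:
            kept.append(page)
    return kept
```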

Mholloway closed this task as Resolved.Nov 3 2016, 8:56 PM
Mholloway claimed this task.
bearND added a subscriber: bearND.Nov 7 2016, 9:42 PM

deploy/2016-11-07/f276982

Nuria added a comment.Sep 3 2018, 11:41 PM

Picking this up again. Actually, I take back my earlier comment: the "equally-spread-views-on-mobile-and-desktop" criterion seems like a good basic heuristic.