
Trending articles is showing pages that had fake traffic
Closed, ResolvedPublic

Description

Spun from T232992: Manipulation of pageview statistics German Wikipedia

It appears that for the German Wikipedia, several not-so-notable articles on musicians are regularly listed under "trending", yet it is quite obviously fake traffic: https://tools.wmflabs.org/pageviews/?project=de.wikipedia.org&platform=all-access&agent=user&range=latest-120&pages=Tobias_Sammet|Avantasia|Edguy

Could this be a bad actor intentionally getting exposure through the app?

I see there is already some filtering logic in the apps; maybe it's excluding pages that had no edits during that time period (which is generally a safe assumption unless the page is protected)? In this case the false traffic was consistently to desktop and mobile-web. Mobile-app seems to be the only one showing genuine traffic, and the spike there is probably because visitors got to the articles from the "trending" card in the app: https://tools.wmflabs.org/pageviews/?project=de.wikipedia.org&platform=mobile-app&agent=user&start=2019-06-23&end=2019-10-21&pages=Tobias_Sammet%7CAvantasia|Edguy

I was told there was a promising solution for T123442: Pageview API: Better filtering of bot traffic on top enpoints, but I'm not sure if Analytics is working on it right now.

Event Timeline

Restricted Application added a subscriber: Aklapper.

I'm not sure what the apps can do about this on the client side. The apps merely consume an API that tells us what the trending articles are. If we want to clean up trending results that are being artificially inflated, we should do this on the server side (in the task that you linked).

Closing this task, since the issue is not on the client side and will be resolved by the other tickets linked in the description.

Hello @Charlotte and @Dbrant! Allow me to raise this issue once again. It seems the German Wikipedia is getting a lot of complaints through OTRS and other venues about the apparent abuse of the "trending" list (T232992).

By the way, the current plan for T123442: Pageview API: Better filtering of bot traffic on top enpoints would not necessarily catch these examples of false traffic.

Might I suggest querying for the "mobile-app" platform instead of "all-access"? The mobile app is used enough that the pageviews data is meaningful, and you're not showing any exact figures anyway. Most importantly, mobile-app is not subject to the same disruptive automated traffic, so I think it may be a better option for the "trending" list in the short-term.
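
To make the suggestion concrete, here is a minimal sketch of what switching the platform would look like at the URL level. It assumes the trending list is derived from the public Pageviews REST API "top" endpoint, which has not been confirmed in this task; the function name `top_pageviews_url` is hypothetical.

```python
# Sketch only: assumes the "top" endpoint of the public Pageviews REST API
# is the data source. top_pageviews_url is a hypothetical helper name.
PAGEVIEWS_TOP = "https://wikimedia.org/api/rest_v1/metrics/pageviews/top"

# Access values accepted by the Pageviews API.
ACCESS_VALUES = {"all-access", "desktop", "mobile-app", "mobile-web"}


def top_pageviews_url(project, year, month, day, access="mobile-app"):
    """Build the URL for the most-viewed pages of one project on one day."""
    if access not in ACCESS_VALUES:
        raise ValueError(f"unknown access value: {access!r}")
    return f"{PAGEVIEWS_TOP}/{project}/{access}/{year:04d}/{month:02d}/{day:02d}"


if __name__ == "__main__":
    # Same day as the feed example elsewhere in this task, but mobile-app only.
    print(top_pageviews_url("de.wikipedia", 2019, 11, 6))
```

The only change compared to an "all-access" query is the access segment of the path, so this would be a small change wherever the query is actually made.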

Alternatively, you could use the crowd-sourced data from https://tools.wmflabs.org/topviews to exclude pages with artificially inflated traffic. The reports are verified by me personally, usually but not always by examining private referrer and location-based data from the analytics cluster. There is an undocumented API for this tool, e.g. https://tools.wmflabs.org/topviews/api.php?project=de.wikipedia&date=2019-09&platform=all-access, which returns the excluded pages for the given project, date range and platform. The date can be any day (YYYY-MM-DD), month (YYYY-MM) or year (YYYY). I wouldn't normally recommend using this service in production (or anything on Toolforge for that matter), but if you put a modest timeout on your GET request along with some simple fallbacks, it should be safe to use.
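
A rough sketch of the "modest timeout plus simple fallback" pattern, in Python for illustration. Since the API is undocumented, the sketch only assumes that it returns JSON and hands the parsed body back unchanged. The 2-second timeout, the function names and the injectable `opener` parameter are all hypothetical choices, not part of the tool.

```python
import json
import urllib.parse
import urllib.request

TOPVIEWS_API = "https://tools.wmflabs.org/topviews/api.php"


def build_url(project, date, platform="all-access"):
    """Build the query URL; date may be YYYY-MM-DD, YYYY-MM or YYYY."""
    query = urllib.parse.urlencode(
        {"project": project, "date": date, "platform": platform}
    )
    return f"{TOPVIEWS_API}?{query}"


def fetch_exclusions(project, date, platform="all-access",
                     timeout=2.0, fallback=None,
                     opener=urllib.request.urlopen):
    """Return the parsed JSON response, or `fallback` on any failure.

    Assumption: the response body is JSON. Its exact shape is not
    documented, so it is returned as-is for the caller to interpret.
    """
    url = build_url(project, date, platform)
    try:
        with opener(url, timeout=timeout) as response:
            return json.load(response)
    except (OSError, ValueError):
        # OSError covers network errors and timeouts (URLError is a subclass);
        # ValueError covers malformed JSON (JSONDecodeError is a subclass).
        return fallback


if __name__ == "__main__":
    # Falls back to an empty list if Toolforge is slow or unreachable.
    print(fetch_exclusions("de.wikipedia", "2019-09", fallback=[]))
```

With a fallback like an empty list, a slow or unavailable Toolforge simply means nothing extra is excluded for that request, rather than the feed failing.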

Hi @Charlotte and @Dbrant,

I strongly support changing something about the mobile apps here, even temporarily.

If filtering the trending topics is not feasible in the short term, the above proposal or the removal of the section from the app should be considered. On the latter point, there is currently a survey on the German Wikipedia in which the authors almost unanimously support removing that section: https://de.wikipedia.org/wiki/Wikipedia:Umfragen/Umgang_mit_der_Anzeige_von_beliebten_Artikeln

This is due to increasing complaints from users of the app about the obvious and massive manipulations of the trending topics that have been taking place for months.

By the way, this also applies to the iPhone app, so please inform the people responsible for it. Thank you!

Just to clarify one thing:
Both the Android and iOS apps get the trending articles from a common API, namely wikifeeds:
https://de.wikipedia.org/api/rest_v1/feed/featured/2019/11/06
...so if we want to address this issue, it should be fixed at the level of that backend service (cc #product-infrastructure-team-backlog).

But I actually just had an idea for how we can filter out articles with inflated pageviews. For an article that has inflated pageviews, the daily numbers are suspiciously similar from one day to the next, i.e. within 2%, whereas if an article is trending organically, the daily pageviews are much more variable. So perhaps we could check the pageviews for the last three days or so, and if they're within 2% of each other, exclude the page from the list?
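
As a minimal sketch of that idea (not code from any patch), the check could look roughly like this. It reads "within 2%" as the change relative to the previous day, which is only one possible interpretation. The function name `looks_inflated` and the sample numbers are hypothetical.

```python
def looks_inflated(daily_views, tolerance=0.02):
    """Return True if every day-to-day change stays within `tolerance`.

    daily_views: pageview counts for consecutive days, oldest first.
    Assumption: "within 2%" is measured relative to the previous day.
    """
    if len(daily_views) < 2:
        return False  # not enough data to judge
    for previous, current in zip(daily_views, daily_views[1:]):
        if previous <= 0:
            return False  # cannot compute a relative change
        if abs(current - previous) / previous > tolerance:
            return False  # at least one organic-looking swing
    return True


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    print(looks_inflated([10000, 10100, 10050]))  # suspiciously flat
    print(looks_inflated([1000, 5000, 3000]))     # varies a lot
```

Pages for which the check returns True over the last three days would then be dropped from the list.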

Change 551627 had a related patch set uploaded (by Dbrant; owner: Dbrant):
[mediawiki/services/wikifeeds@master] Introduce additional heuristic filter for trending articles.

https://gerrit.wikimedia.org/r/551627

@Dbrant Do you think 2% will do it? Spammers could adapt. Or the traffic generated by the display in the "Trending Topics" could exceed 2%.

The patch I submitted above follows a similar but slightly different idea, which might be more fool-proof:
For an article to be considered "trending", its pageviews on the current day must be at least 1.5x the pageviews on the previous day. I tested this scheme on the trending lists from the last several days, and it seems to weed out the inflated items pretty effectively, while keeping the items that are trending for real.
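
For readers who do not want to open the Gerrit change, the rule can be sketched as follows. This is an illustration in Python, not the code from the patch; the names `is_trending` and `filter_trending`, the tuple layout of the candidates, and the handling of a zero previous-day count are all assumptions.

```python
def is_trending(today_views, yesterday_views, min_ratio=1.5):
    """Apply the 1.5x rule: today's views must be >= min_ratio * yesterday's."""
    if yesterday_views <= 0:
        # Assumption: with no views yesterday, any views today count as growth.
        return today_views > 0
    return today_views >= min_ratio * yesterday_views


def filter_trending(candidates, min_ratio=1.5):
    """Keep only candidates that pass the rule.

    candidates: iterable of (title, today_views, yesterday_views) tuples,
    a hypothetical layout chosen for this sketch.
    """
    return [
        title
        for title, today, yesterday in candidates
        if is_trending(today, yesterday, min_ratio)
    ]


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    sample = [
        ("Steady page", 10100, 10000),  # nearly flat, filtered out
        ("Spiking page", 30000, 8000),  # strong growth, kept
    ]
    print(filter_trending(sample))
```

Note that a page with steadily inflated traffic fails this rule every day, since its count barely moves from one day to the next.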

Hey, this seems like a drastic change to the definition of trending @Dbrant

A few thoughts:

  1. Before we do this based on one engineer eyeballing a few days of data in a language or two, let's define a broader acceptance test/plan.
  2. If this is an issue with the pageviews API, rather than hack up the service at the REST layer, we should figure out the best place for this upstream, since we're not the only ones affected.
  3. We already have both a blacklist and another heuristic we defined four years ago (based on the split between mobile and web traffic), which may no longer be relevant or should be considered along with this new redefinition.

Is the /api/rest_v1/feed/featured endpoint returning pre-stored data or is it doing the filtering in real-time? If the latter, I think a short-term fix is acceptable, where we don't necessarily change the definition of trending. Regardless, doing whatever you can to take out the fake items is probably still better than a feed that's missing a few genuinely trending articles.

T123442: Pageview API: Better filtering of bot traffic on top enpoints is the upstream bug where we might want a more robust, long-term solution.

Be mindful that some false traffic is isolated to a single day or two, e.g. List of awards and nominations received by Meryl Streep, which was the most-viewed article for November 18. It seems the feed/featured endpoint already knows to filter this out, though. Also, some genuine traffic can be sustained, say after a recent death, where it takes a few days before the pageviews start to taper off.

From a report in OTRS, looks like a similar issue on Hungarian Wikipedia. (Attached screenshots provided by Android app user.)

Screenshot_2019-11-17-01-00-38.png (960×540 px, 78 KB)

Screenshot_2019-11-19-23-08-43.png (960×540 px, 80 KB)

Change 551627 abandoned by Dbrant:
Introduce additional heuristic filter for trending articles.

https://gerrit.wikimedia.org/r/551627

Pageview data reported on top endpoints now excludes 'automated views', which should remove the majority of this problem. Please see: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/BotDetection

LGoto claimed this task.