
Article on Carles Puigdemont has inflated pageviews in many projects
Open, Needs Triage, Public

Description

Hello! Some articles (especially [[:eu:Carles_Puigdemont]]) are showing anomalous visit patterns at euwiki: https://pageviews.toolforge.org/pageviews/?project=eu.wikipedia.org&platform=all-access&agent=user&redirects=0&start=2020-07-05&end=2020-09-25&pages=Carles_Puigdemont

I think there's a bot that is not being detected, creating false statistics for the project.

Event Timeline

MusikAnimal added a subscriber: MusikAnimal.

This is not an issue with the Pageviews tool, but rather with the data that it serves. Either way, I agree that this is almost certainly a bot.

Mr. Puigdemont is also (supposedly) the second most read article in the Norwegian Nynorsk Wikipedia, the third most read in Icelandic, third most read in Afrikaans, second most read in Alemannic, most read in Western Armenian … and the list probably goes on; I stopped checking there.

Could his article(s) be excluded from the lists, please? It really is polluting the results.

jhsoby renamed this task from "Anomalous visit pattern on some articles at euwiki" to "Article on Carles Puigdemont has inflated pageviews in many projects". Mar 4 2021, 2:12 PM

@MusikAnimal What can be done about this? It has been almost 1.5 years since this was reported (and probably even longer since the problem started), and this pageview manipulation is still as rampant as ever. I just noticed that the article is given a quite prominent spot when you open the Wikipedia Android app because of the high pageview count – maybe that's one of the motivations for whoever is doing this? Either way, the article should be excluded from the most-viewed lists in all languages.

+1

Even better would be to exclude those robots, obviously, but it's really frustrating seeing it every time I use the Android app. Currently the top 5 articles on the Welsh Wicipedia are: Carles Puigdemont (4k views), Microsoft (51 views), Katarzyna Kobro (47 views), Wyn Calvin (43 views), and Cymraeg, the Welsh language (40 views).

Can we please get Senyor Puigdemont excluded from our views across all these projects?

Yes, that would do it. I don't know how many visits are coming from bots, but if exclusion is impossible, then we should find another way.

What can be done about this?

This would fall onto the Analytics team. I don't believe there is a system to exclude pages from the /top/ ("most viewed pages") endpoint, and it would be up to them to add one.
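To make the idea concrete, here is a minimal sketch of what an exclusion step could look like if it were applied to a "most viewed pages" list before the list is served. It is not how the Analytics team's pipeline works; the function name `drop_excluded` is hypothetical, and the entry shape (`article`, `views`, `rank`) is assumed to resemble the entries returned by the /top/ endpoint. The sample figures are the Welsh Wikipedia numbers quoted earlier in this task, with "4k" rounded to 4000.

```python
def drop_excluded(articles, excluded):
    """Return a re-ranked copy of `articles` without the excluded titles.

    `articles` is a list of dicts with "article", "views" and "rank" keys
    (an assumed shape, not a confirmed API contract).
    """
    # Normalise titles so "Carles Puigdemont" and "Carles_Puigdemont" match.
    blocked = {title.replace(" ", "_") for title in excluded}
    kept = [a for a in articles if a["article"].replace(" ", "_") not in blocked]
    # Re-rank the remaining pages by views, highest first.
    kept.sort(key=lambda a: a["views"], reverse=True)
    return [{**a, "rank": i} for i, a in enumerate(kept, start=1)]


# Figures quoted in this task for the Welsh Wikipedia ("4k" taken as 4000).
sample = [
    {"article": "Carles_Puigdemont", "views": 4000, "rank": 1},
    {"article": "Microsoft", "views": 51, "rank": 2},
    {"article": "Katarzyna_Kobro", "views": 47, "rank": 3},
    {"article": "Wyn_Calvin", "views": 43, "rank": 4},
    {"article": "Cymraeg", "views": 40, "rank": 5},
]

cleaned = drop_excluded(sample, {"Carles Puigdemont"})
for entry in cleaned:
    print(entry["rank"], entry["article"], entry["views"])
```

Doing this once at the data layer, rather than in each client, is what would let every consumer of the list see the same result.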

I just noticed that the article is given a quite prominent spot when you open the Wikipedia Android app because of the high pageview count – maybe that's one of the motivations for whoever is doing this?

Possibly. Since it seems to have been a problem for so long now, with no end in sight, perhaps the Android team could remove it. You could create a task under Wikipedia-Android-App-Backlog requesting it, but I have no idea if they'll follow through. It would be better to remove the page from the actual underlying data, so that all the clients (Android/iOS apps, Topviews, Hatnote, etc.) have the page removed.

Even better would be to exclude those robots, obviously

The Analytics team improved bot detection a while back, and it seemed to take care of most of the artificially inflated data. Without having investigated the eu:Carles_Puigdemont pageviews data myself, my guess is that this bot is doing a very good job at appearing to be human.


When I first launched Topviews, I added a system to crowd-source a list of false positives. The idea was that someone would review the false-positive reports, examine the private analytics data, and, once a report was verified, remove the page from Topviews. I did this all on my own for years, until I ran out of steam. The Analytics team will surely have the same problem. An "exclusion" list system just doesn't seem to scale well; otherwise we need a better way to verify false positives than manually examining the pageviews data (IP address, etc.), since this ability is limited to a small group of people with privileged access to private data. If we want a crowd-sourced system to work, we need an easier way to verify the reports.
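As a rough illustration of what an easier pre-screening step could look like using only public view counts, the sketch below flags a top list for review when its first entry dwarfs the rest. This is a hypothetical heuristic, not something Topviews or the Analytics team is known to use; the names `dominance_ratio` and `needs_review` and the threshold of 20 are assumptions chosen for the example. It can only prioritise reports for a human to check, since a genuinely popular page would also score highly.

```python
from statistics import median


def dominance_ratio(views):
    """Top entry's views divided by the median of the remaining entries."""
    ordered = sorted(views, reverse=True)
    if len(ordered) < 2:
        return 0.0  # nothing to compare against
    rest = median(ordered[1:])
    return ordered[0] / rest if rest > 0 else float("inf")


def needs_review(views, threshold=20.0):
    """Flag a top list whose first entry is `threshold` times the typical entry."""
    return dominance_ratio(views) >= threshold


# Welsh Wikipedia top 5 as quoted in this task ("4k" taken as 4000).
welsh_top5 = [4000, 51, 47, 43, 40]
print(needs_review(welsh_top5))
```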

At any rate, I suspect TheDJ's edit at T263908#7656544 will be enough to grab the attention of the Analytics team. So for now, we'll just have to be patient. Sorry I cannot help any further!