
Article on Carles Puigdemont has inflated pageviews in many projects
Open, Needs Triage, Public

Description

Hello! Some articles (especially [[:eu:Carles_Puigdemont]]) are giving anomalous visit patterns at euwiki: https://pageviews.toolforge.org/pageviews/?project=eu.wikipedia.org&platform=all-access&agent=user&redirects=0&start=2020-07-05&end=2020-09-25&pages=Carles_Puigdemont

I think there's a bot that is not being detected, creating false statistics for the project.

Event Timeline

MusikAnimal subscribed.

Not an issue with the Pageviews tool, rather the data that it serves. Anyway I agree that this is most surely a bot.
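
For anyone who wants to look at the underlying data directly rather than through the tool, here is a minimal sketch that builds a request URL for the public Wikimedia REST API per-article pageviews endpoint. The helper name `per_article_url` and its defaults are illustrative assumptions; the parameter values mirror the query in the task description.

```python
from urllib.parse import quote

# Base path of the public Wikimedia REST API per-article pageviews endpoint.
API_ROOT = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def per_article_url(project, article, start, end,
                    access="all-access", agent="user", granularity="daily"):
    """Build a per-article pageviews URL. Dates are YYYYMMDD strings.

    Hypothetical helper for illustration; it only assembles the URL and
    does not perform the request.
    """
    # Titles use underscores instead of spaces and must be URL-encoded.
    title = quote(article.replace(" ", "_"), safe="")
    return "/".join([API_ROOT, project, access, agent, title,
                     granularity, start, end])


url = per_article_url("eu.wikipedia.org", "Carles Puigdemont",
                      "20200705", "20200925")
print(url)
```

Switching `agent` to `automated` or `spider` should show how much of the traffic is already being classified as non-human.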

Mr. Puigdemont is also (supposedly) the second most read article in the Norwegian Nynorsk Wikipedia, the third most read in Icelandic, third most read in Afrikaans, second most read in Alemannic, most read in Western Armenian … and the list goes on and on and on (probably), I stopped checking there.

Could his article(s) be excluded from the lists, please? It really is polluting the results.

jhsoby renamed this task from Anomalous visit pattern on some articles at euwiki to Article on Carles Puigdemont has inflated pageviews in many projects. Mar 4 2021, 2:12 PM

@MusikAnimal What can be done about this? Soon 1.5 years since this was reported (probably even longer since the problem started), and this pageviews manipulation is still going on as rampant as ever. I just noticed that the article is given a quite prominent spot when you open the Wikipedia Android app because of the high pageview count – maybe that's one of the motivations for whoever is doing this? Either way, the article should be excluded from the most viewed lists in all languages.

+1

Even better would be to exclude those robots, obviously, but it's really frustrating seeing it every time I use the Android app — currently the top 5 articles on the Welsh Wicipedia are Carles Puigdemont: 4k views, Microsoft 51 views, Katarzyna Kobro 47 views, Wyn Calvin 43 views, Cymraeg (the Welsh language) 40 views.

Can we please get Senyor Puigdemont excluded from our views across all these projects?

Yes, that would do it. I don't know how many visits are coming from bots, but if exclusion is impossible, then we should find another way.

What can be done about this?

This would fall onto the Analytics team. I don't believe there is a system to exclude pages from the /top/ ("most viewed pages") endpoint, and it would be up to them to add one.
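
To make the request concrete, here is a rough sketch of what a client-side exclusion filter over a "most viewed" list could look like. The exclusion list and the `filter_top` helper are hypothetical and not part of any existing endpoint; the sample data is the Welsh top five quoted above, with "4k" approximated as 4000.

```python
# Top-five list for cy.wikipedia.org as quoted earlier in this task.
# "4k" is approximated as 4000 for illustration.
top_pages = [
    ("Carles Puigdemont", 4000),
    ("Microsoft", 51),
    ("Katarzyna Kobro", 47),
    ("Wyn Calvin", 43),
    ("Cymraeg", 40),
]

# Hypothetical exclusion list; as far as this thread knows, no such
# list exists in the underlying data service.
EXCLUDED_TITLES = {"Carles Puigdemont"}


def filter_top(pages, excluded):
    """Drop excluded titles and re-rank the remainder by views, descending."""
    kept = [(title, views) for title, views in pages if title not in excluded]
    kept.sort(key=lambda page: page[1], reverse=True)
    return [(rank, title, views)
            for rank, (title, views) in enumerate(kept, start=1)]


filtered = filter_top(top_pages, EXCLUDED_TITLES)
for rank, title, views in filtered:
    print(rank, title, views)
```

Doing this in every client would mean duplicating the list, which is why filtering in the underlying data would be preferable.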

I just noticed that the article is given a quite prominent spot when you open the Wikipedia Android app because of the high pageview count – maybe that's one of the motivations for whoever is doing this?

Possibly. Since it seems to have been a problem for so long now, with no end in sight, perhaps the Android team could remove it. You could create a task under Wikipedia-Android-App-Backlog requesting it, but I have no idea if they'll follow through. It would be better to remove the page from the actual underlying data, so that all the clients (Android/iOS apps, Topviews, Hatnote, etc.) have the page removed.

Even better would be to exclude those robots, obviously

The Analytics team improved bot detection a while back, and it seemed to take care of most of the artificially inflated data. Without having investigated the eu:Carles_Puigdemont pageviews data myself, my guess is that this bot is doing a very good job at appearing to be human.


When I first launched Topviews, I added a system to crowd-source a list of false positives. The idea was that someone would review the reported false positives, examine the private analytics data, and once verified, remove the pages from Topviews. I did this all on my own for years, until I ran out of steam. The Analytics team will surely have the same problem. An "exclusion" list system just doesn't seem to scale well; or we need a better way to verify false positives than manually examining the pageviews data (IP address, etc.), since this ability is limited to a small group of people with privileged access to private data. If we want a crowd-sourced system to work, we need an easier way to verify the reports.

At any rate, I suspect TheDJ's edit at T263908#7656544 will be enough to grab the attention of the Analytics team. So for now, we'll just have to be patient. Sorry I cannot help any further!

@JArguello-WMF: Any chance of an update sometime soon, please? Even an update as simple as “This is on our backlog at priority X and is unlikely to be addressed before January 2024” or something would be more than we currently know 😅

I started a Google doc with pageviews data and stats from more than 20 WP language versions that you can access here. Please, feel free to contribute.

Greetings from Bulgaria,
Pelajanela

Thank you @Pelajanela; I have added screenshots for cywiki (Welsh)

Thank you @Pelajanela; I have added screenshots for cywiki (Welsh)

Thank you for your quick reaction, contribution, and involvement, @OwenBlacker!
I am ambitious for all of us here to move this case towards some solution :)

Warmly,
Pelajanela

@OwenBlacker thanks for the ping. We are currently going through the process of reviewing requests. That is why we removed the tag analytics and passed it to Data Engineering. By Wednesday the 7th I'll provide an answer with future steps.

This is a bot that was crawling this page, realized that it was being detected as a bot, and shifted its pattern a bit (notice that these were classified as automated pageviews a while back):

Screen Shot 2022-09-01 at 3.53.45 PM.png (1×2 px, 2 MB)

Now it is detected as non-automated traffic. If you look at its traffic pattern, it is moving pageviews up bit by bit, trying to figure out at what point it will be classified as automated again:

Screen Shot 2022-09-01 at 3.53.25 PM.png (1×2 px, 2 MB)

The reality is that this bot's spammy pageviews amount to only a few hundred per day, which is why it can only affect small Wikipedias with a low number of pageviews.

See how in Icelandic it hovers around 800-1000 pageviews as well: https://pageviews.wmcloud.org/pageviews/?project=is.wikipedia.org&platform=all-access&agent=user&redirects=0&start=2020-07-05&end=2020-09-25&pages=Carles_Puigdemont

Overall, the effort needed to detect bot traffic in such small quantities (less than 1 pageview per minute) is quite big, as it requires strategies that are fundamentally different from the ones used on large-traffic Wikipedias (bugs in the current algorithm aside).
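
As a quick sanity check on the "less than 1 pageview per minute" figure, the arithmetic can be sketched as follows. This assumes the 800-1000 range quoted for Icelandic is a daily count, which the comment does not state explicitly; the helper name is illustrative.

```python
MINUTES_PER_DAY = 24 * 60  # 1440 minutes in a day


def views_per_minute(daily_views):
    """Convert a daily pageview count into an average per-minute rate."""
    return daily_views / MINUTES_PER_DAY


# Assumption: the 800-1000 range is a per-day figure.
low = views_per_minute(800)
high = views_per_minute(1000)
print(f"{low:.2f} to {high:.2f} views per minute")
```

Both ends of the range stay below one view per minute, so a detector keyed on request rate alone would have very little signal to work with.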

@JArguello-WMF: Any update? :)

*still interested in the topic and following developments as well*

This is my first WM-posting ever - pls tell me if something is wrong or in the wrong place.
Regarding the inflated pageviews of e.g. https://sv.wikipedia.org/wiki/Carles_Puigdemont - according to what I can read above there is not yet any solution, and it is argued above that it is unfeasible to detect spammy pageviews of this quantity.
I have the following idea to do something in the analytics- or pageviews- part of the process:
At least for sv:Carles Puigdemont, the ratio of views from "Computer", "Mobile app" and "Mobile web" is very different from that of pages that do not suffer from obvious inflation of viewing statistics.
Looking at February 2023, the [https://pageviews.wmcloud.org/pageviews/?project=sv.wikipedia.org&platform=all-access&agent=user&redirects=0&start=2023-02-01&end=2023-02-28&pages=Carles_Puigdemont total number of views] is 11034, of which [https://pageviews.wmcloud.org/pageviews/?project=sv.wikipedia.org&platform=desktop&agent=user&redirects=0&start=2023-02-01&end=2023-02-28&pages=Carles_Puigdemont Computer] accounts for 10988, [https://pageviews.wmcloud.org/pageviews/?project=sv.wikipedia.org&platform=mobile-app&agent=user&redirects=0&start=2023-02-01&end=2023-02-28&pages=Carles_Puigdemont mobile app] for 2, and [https://pageviews.wmcloud.org/pageviews/?project=sv.wikipedia.org&platform=mobile-web&agent=user&redirects=0&start=2023-02-01&end=2023-02-28&pages=Carles_Puigdemont mobile web] for 44 views. This is a very odd distribution of viewing methods, with more than 99% from "Computer". For the somewhat related page https://sv.wikipedia.org/wiki/Joan_Laporta the corresponding figures for February 2023 are: total 140, Computer 28, mobile app 1, and mobile web 111, i.e. approx. 20% comes from Computer. Just from occasional viewing of statistics, I consider this distribution much more "normal".
My idea is that viewing figures with more than, say, 75% from "Computer" could be considered anomalous and filtered out, or at least tagged as "unreliable", in the pageviews.wmcloud.org software. My proposed cutoff could of course be adjusted based on a more rigorous analysis than my ad hoc observations. / Regards ~~~~
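
A minimal sketch of the proposed heuristic, using the February 2023 svwiki numbers quoted in this comment. The function names and the 0.75 default are illustrative assumptions taken from the proposal, not an existing feature of the pageviews tooling.

```python
def desktop_share(desktop, mobile_app, mobile_web):
    """Fraction of views that came from desktop ("Computer")."""
    total = desktop + mobile_app + mobile_web
    return desktop / total if total else 0.0


def is_suspicious(desktop, mobile_app, mobile_web, cutoff=0.75):
    """Flag a page whose desktop share exceeds the cutoff.

    The 0.75 cutoff is the ad hoc value proposed above, not a tuned one.
    """
    return desktop_share(desktop, mobile_app, mobile_web) > cutoff


# February 2023 figures for sv.wikipedia.org, as quoted in this comment.
puigdemont = dict(desktop=10988, mobile_app=2, mobile_web=44)
laporta = dict(desktop=28, mobile_app=1, mobile_web=111)

for name, counts in [("Carles_Puigdemont", puigdemont),
                     ("Joan_Laporta", laporta)]:
    share = desktop_share(**counts)
    print(f"{name}: {share:.1%} desktop, suspicious={is_suspicious(**counts)}")
```

With these numbers the Puigdemont article has a desktop share above 99% and is flagged, while Joan Laporta comes out at 28/140 = 20% and is not. A real implementation would probably need a minimum-views threshold as well, since pages with very few views can have a skewed platform split by chance.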