Special characters showing up as question marks in /pageviews/top endpoint
Closed, Duplicate · Public

Description

Originally reported here: https://phabricator.wikimedia.org/T128295#2074796 but I guess this was an unrelated bug.

I'm still seeing Sp?cial:Search instead of Spécial:Search, for instance: https://wikimedia.org/api/rest_v1/metrics/pageviews/top/fr.wikipedia/all-access/2016/02/22

This makes it difficult to show only mainspace pages, where you would loop through the results and remove any page whose title starts with a namespace name listed in the siteinfo: https://fr.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces
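
For illustration, the mainspace filter described above could look roughly like the following Python sketch. It assumes the namespace names have already been fetched from the siteinfo query; the function name `is_mainspace` and the short namespace list are illustrative only, not part of any API. It also shows why the mangled title is a problem: `Sp?cial` matches no namespace name, so the page slips through.

```python
def is_mainspace(title, namespace_names):
    """Return True if `title` does not start with a known namespace prefix."""
    prefix, sep, _ = title.partition(":")
    if not sep:
        return True  # no colon, so there is no namespace prefix at all
    # Titles from the top endpoint use underscores; siteinfo names use spaces.
    normalized = prefix.replace("_", " ").casefold()
    return normalized not in {name.casefold() for name in namespace_names}


# Illustrative subset of namespace names (an assumption, not the full siteinfo list).
namespaces = ["Spécial", "Special", "Wikipédia", "Utilisateur", "Modèle", "Template"]

# Titles taken from the endpoint output discussed in this task.
titles = [
    "Wikipédia:Accueil_principal",
    "Sp?cial:Search",
    "Spécial:Recherche",
    "Annie_Girardot",
]

mainspace = [t for t in titles if is_mainspace(t, namespaces)]
print(mainspace)  # ['Sp?cial:Search', 'Annie_Girardot']
```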

Event Timeline

The request for that page is real, confirmed from Hive:

SELECT
    page_title,
    SUM(view_count) AS c
FROM pageview_hourly
WHERE year = 2016
    AND month = 2
    AND day = 22
    AND project = 'fr.wikipedia'
GROUP BY page_title
ORDER BY c DESC
LIMIT 5;

Wikipédia:Accueil_principal	1639451
Sp?cial:Search	226331
Spécial:Recherche	219676
-	151233
Annie_Girardot	69793

The interesting thing is:

SELECT
    user_agent_map,
    SUM(view_count) AS c
FROM pageview_hourly
WHERE year = 2016
    AND month = 2
    AND day = 22
    AND project = 'fr.wikipedia'
    AND page_title = 'Sp?cial:Search'
GROUP BY user_agent_map
ORDER BY c DESC
LIMIT 10
;

{"browser_major":"4","os_family":"Mac OS X","os_major":"10","device_family":"Other","browser_family":"Safari","os_minor":"6","wmf_app_version":"-"}	226331

--> All the pageviews for Sp?cial:Search come from the same user agent, which makes me think it is a non-declared bot.

Idea: Filter from top list the pages having a ratio (# views / # distinct user_agents) too high:

SELECT
    page_title,
    SUM(view_count) AS views,
    COUNT(DISTINCT user_agent_map) as dua,
    (SUM(view_count) / CAST(COUNT(DISTINCT user_agent_map) AS DOUBLE)) AS ratio
FROM pageview_hourly
WHERE year = 2016
    AND month = 2
    AND day = 22
    AND project = 'fr.wikipedia'
GROUP BY page_title
ORDER BY views DESC
LIMIT 50
;

page_title	views	dua	ratio
Wikipédia:Accueil_principal	1639451	7855	208.71432208784213
Sp?cial:Search	226331	1	226331.0
Spécial:Recherche	219676	2375	92.49515789473685
-	151233	4823	31.356624507567904
Annie_Girardot	69793	3675	18.9912925170068
Special:Search	67964	1625	41.824
Spécial:Recherche_de_lien	34943	10	3494.3
Le_Secret_d'Élise	33910	2230	15.20627802690583
Alain_Delon	24149	2090	11.554545454545455
Spécial:Connexion	23420	1746	13.413516609392898
Pour_la_peau_d'un_flic	22680	1815	12.49586776859504
Christiane_Taubira	21212	266	79.74436090225564
Anne_Parillaud	20106	1880	10.69468085106383
Wallis-et-Futuna	19612	1749	11.213264722698685
Sophie_Marceau	19520	1526	12.791612057667104
Samuel_Étienne	19381	1767	10.968307866440295
Spécial:Livre	18977	461	41.164859002169194
Bataille_de_Verdun_(1916)	18211	1498	12.156875834445929
Spécial:Liste_de_suivi	18166	282	64.41843971631205
Myriam_El_Khomri	17848	1757	10.15822424587365
Accueil	15766	1106	14.25497287522604
Julien_Lepers	15114	1719	8.792321116928447
Fastlane_(2016)	15011	1630	9.20920245398773
Template:GeoTemplate	14580	1	14580.0
Trapped_(série_télévisée)	14049	1377	10.202614379084967
Saison_6_de_The_Walking_Dead	13872	1129	12.286979627989371
The_Walking_Dead_(série_télévisée)	13705	1089	12.584940312213039
Andrzej_Żuławski	13684	1177	11.626168224299066
Philippe_Etchebest	13661	1403	9.736992159657875
Andrija_Mohorovičić	12793	12	1066.0833333333333
Bernard_Fresson	12658	1427	8.870357393132446
Umberto_Eco	12480	1217	10.254724732949876
Claude_Michaud	12427	1011	12.291790306627101
Deadpool	12016	1311	9.165522501906942
Renato_Salvatori	11818	1452	8.139118457300276
Persona_non_grata	11772	65	181.1076923076923
Donald_Trump	10903	1239	8.799838579499596
Franck_Gastambide	10650	1154	9.228769497400346
Julia_Piaton	10338	1199	8.622185154295247
Bob_Decout	10041	1273	7.887666928515318
France	9988	1458	6.850480109739369
Kalidou_Koulibaly	9980	1124	8.87900355871886
Twilight	9574	1183	8.092983939137785
Monsieur_Léon_(téléfilm)	9357	1076	8.696096654275093
Traitement_de_choc_(film,_1973)	9343	1149	8.131418624891209
Julie_de_Bona	9170	1130	8.11504424778761
Première_Guerre_mondiale	9095	1125	8.084444444444445
Giulia_Salvatori	8614	1123	7.670525378450579
Z_(film,_1969)	8493	1029	8.253644314868804
Deadpool_(film)	8358	1018	8.210216110019646

By putting a limit of 100 on this ratio for pages having more than 1000 views, we could really provide better quality tops.
Maybe we could provide 2 endpoints, one filtered, one not?
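
A minimal Python sketch of that filter, reading the proposal as "drop pages with more than 1000 views whose views per distinct user agent exceed 100". The function name `filter_top` and its parameter names are made up for this example; the rows are copied from the query output above.

```python
def filter_top(rows, min_views=1000, max_ratio=100.0):
    """Keep titles unless they have more than `min_views` views AND
    a views / distinct-user-agents ratio above `max_ratio`."""
    kept = []
    for title, views, distinct_uas in rows:
        ratio = views / distinct_uas
        if views > min_views and ratio > max_ratio:
            continue  # looks like traffic from very few user agents
        kept.append(title)
    return kept


# (page_title, views, distinct user agents) for fr.wikipedia, 2016-02-22.
rows = [
    ("Wikipédia:Accueil_principal", 1639451, 7855),
    ("Sp?cial:Search", 226331, 1),
    ("Spécial:Recherche", 219676, 2375),
    ("Annie_Girardot", 69793, 3675),
    ("Template:GeoTemplate", 14580, 1),
    ("Andrija_Mohorovičić", 12793, 12),
]

print(filter_top(rows))  # ['Spécial:Recherche', 'Annie_Girardot']
```

With these thresholds the main page (ratio of about 208.7) is dropped along with the suspicious entries, so the cut-off values would need tuning before real use.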

Interestingly, it also gives a not-too-bad result for task T144715:

SELECT
    page_title,
    SUM(view_count) AS views,
    COUNT(DISTINCT user_agent_map) as dua,
    (SUM(view_count) / CAST(COUNT(DISTINCT user_agent_map) AS DOUBLE)) AS ratio
FROM pageview_hourly
WHERE year = 2016
    AND month = 8
    AND day = 16
    AND project = 'en.wikipedia'
GROUP BY page_title
ORDER BY views DESC
LIMIT 50
;


page_title  views   dua ratio
Main_Page   56513868    45878   1231.8293735559528
Special:Search  2145420 19995   107.29782445611403
-   1138825 31739   35.88093512713066
Usain_Bolt  320416  14126   22.68271272830242
2016_Summer_Olympics    297641  13278   22.41610182256364
AMGTV   219511  458 479.2816593886463
Okto    196279  254 772.7519685039371
Proyecto_40 193646  179 1081.8212290502793
XHamster    189509  5198    36.45806079261254
Suicide_Squad_(film)    180701  10614   17.024778594309403
Michael_Phelps  179218  8881    20.179934692039186
Special:Book    168696  1176    143.44897959183675
Simone_Biles    167429  8087    20.703474712501546
Laura_Trott 160713  4188    38.37464183381089
Jason_Kenny 157506  3945    39.9254752851711
Calc    130973  20  6548.65
Stranger_Things_(TV_series) 130296  5288    24.639939485627835
Rustom_(film)   123595  9362    13.201773125400555
Allyson_Felix   113832  5062    22.48755432635322
2012_Summer_Olympics_medal_table    110527  6414    17.232148425319615
Keirin  104904  4906    21.382796575621686
Special:CreateAccount   101427  3561    28.48272957034541
Elvis_Presley   99985   5132    19.48265783320343
Deaths_in_2016  95849   3776    25.38373940677966
Template:GeoTemplate    93990   41  2292.439024390244
Olympic_Games   93139   8504    10.952375352775165
2016_Summer_Olympics_medal_table    92303   4595    20.087704026115343
Mohenjo_Daro_(film) 91105   8321    10.948804230260786
Joanna_Rowsell  89652   2834    31.634438955539874
Omnium  89462   4056    22.056706114398423
K._M._Nanavati_v._State_of_Maharashtra  84921   7080    11.99449152542373
Mohenjo-daro    84818   7862    10.788349020605445
Killing_of_Harambe  79925   5086    15.714707038930397
The_Get_Down    78830   4112    19.170719844357976
Sausage_Party   78774   4395    17.923549488054608
All-time_Olympic_Games_medal_table  77804   5701    13.647430275390283
404.php 77226   4574    16.88369042413642
Athletics_at_the_2016_Summer_Olympics   76644   4754    16.122002524190155
List_of_Olympic_Games_host_cities   75995   5093    14.921460828588259
Special:Watchlist   74267   789 94.12801013941699
User:GoogleAnalitycsRoman/google-api    73554   20  3677.7
Philip_J._Fry   73143   1610    45.43043478260869
2020_Summer_Olympics    72667   5426    13.392370070033174
Laurie_Hernandez    72165   3957    18.237300985595148
SummerSlam_(2016)   70477   6695    10.526811053024645
File:Donald_Trump_August_19,_2015_(cropped).jpg 70095   125 560.76
List_of_Steven_Universe_episodes    69418   3750    18.511466666666667
India_at_the_2016_Summer_Olympics   68641   5735    11.968788142981692
Aly_Raisman 67745   3657    18.52474706043205
Wikipedia:Contact_us    67406   582 115.81786941580756

@Tbayer: Comments / ideas?

It looks like a very interesting idea! Thoughts on it:

  1. There are the mobile pageviews generated in countries whose ISP proxies all requests, which end up having the same user agent. Could it be that, for certain pages, there is no traffic except those proxied mobile pageviews? Would it be possible to whitelist those?
  2. Some pages may be partially affected by those bot anomalies, but would still belong in the top even after removing the requests generated by the bots. See Main_Page in @JAllemandou's previous comment. If we filter them out completely, we are removing a legit top member, no? Maybe the top algorithm can filter only the pageviews belonging to the most frequent user agent?
  3. Regarding having 2 tops, one raw and one filtered: if we can solve 1) and 2), I think we do not need the two versions of the top. We would be asking the users to understand a non-trivial technical concept, which is not ideal. But if we cannot solve 1) and 2), then I think it would be best to have 2 tops.
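
To make the second point concrete, here is a rough Python sketch of discarding only the pageviews of the most frequent user agent instead of dropping the page. The function name `views_without_top_agent` and the per-agent breakdown of the second page are hypothetical; only the Sp?cial:Search figure comes from this task.

```python
def views_without_top_agent(views_by_agent):
    """Total views left after discarding the single most frequent user agent."""
    if not views_by_agent:
        return 0
    return sum(views_by_agent.values()) - max(views_by_agent.values())


# From this task: every view of Sp?cial:Search came from one user agent.
special_search = {"Safari 4 / Mac OS X 10.6": 226331}

# Hypothetical page: one dominant agent plus a little organic traffic.
mixed_page = {"ua_a": 5000, "ua_b": 40, "ua_c": 35}

print(views_without_top_agent(special_search))  # 0
print(views_without_top_agent(mixed_page))      # 75
```

One caveat with this approach: it also subtracts the most common agent's views from pages with perfectly ordinary traffic, so it would probably have to be combined with a ratio check rather than applied to every page.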

whose ISP proxies all requests, which end up having the same user agent. I

Mmm... an example for this? ISPs might proxy requests, but (while I have seen ISPs changing cookies) I doubt changing UAs is common practice.

FWIW, https://fr.wikipedia.org/wiki/Sp?cial:Search was already among the top viewed pages in November 2015, see T117945

Milimetric triaged this task as Medium priority.

I have another question. Would it be possible to include logic that ensures the page actually exists at the time the pageview is recorded? In addition to Sp?cial:Search, that would take care of many of the false positives we see, for instance User:GoogleAnalitycsRoman1/google-api. There are about 10 variations of some nonexistent google-api pages consistently showing up in the enwiki /top endpoint.

I have another question. Would it be possible to include logic that ensures the page actually exists at the time the pageview is recorded?

In theory we only record pageviews for requests that come back with a 200 or 304 status code. See: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/PageviewDefinition.java

Making this issue its own ticket: https://phabricator.wikimedia.org/T145922

Judging by the editing activity and mobile vs desktop views, it seems many of the most-viewed lately have been false positives on enwiki. For desktop, AMGTV, Okto, Proyecto 40, XHamster, and In vino veritas all dominate the top 10 but on mobile they are very low if at all present in the top 1,000. I do not have database access, but I'm interested to know what the ratio of distinct user agents vs views are for these pages, following @JAllemandou's trick: https://phabricator.wikimedia.org/T145043#2618517

If filtering based on this idea proves successful I think we should adopt it as the standard for the primary /top endpoint, perhaps moving the data we see now to /top/raw. Currently significant effort is put into determining false positives (at least for enwiki) when compiling publications like The Signpost, and again many of the pages that are being recorded do not exist, and should be excluded. I see value in knowing what the raw most-viewed pages are, whether it be an undeclared bot or not, existing page or not, but I believe most people are taking the raw data to heart, scratching their heads.

Judging by the editing activity and mobile vs desktop views, it seems many of the most-viewed lately have been false positives on enwiki. For desktop, AMGTV, Okto, Proyecto 40, XHamster, and In vino veritas all dominate the top 10 but on mobile they are very low if at all present in the top 1,000. I do not have database access, but I'm interested to know what the ratio of distinct user agents vs views are for these pages, following @JAllemandou's trick: https://phabricator.wikimedia.org/T145043#2618517

Good points. There is already a bug about this: T144715, it's best to continue this part of the discussion there.

Any issue like "many of the pages that are being recorded do not exist, and should be excluded" should be fixed on the MediaWiki end, not the analytics end. We should drive fixes closest to the root cause when possible, and in this case the root cause is that non-existing pages need to return 404, not 200. This is already the case in MediaWiki for the most part.

See bugfix going into mediawiki next week regarding some requests that were returning 200 when they should return 404: https://gerrit.wikimedia.org/r/#/c/312561/

Filter from top list the pages having a ratio (# views / # distinct user_agents) too high:

This might run into false positives, for example "trending pages visited by many users having the same device, say an iPhone 6". Without doing some testing as to how common false positives are, I do not think we can go ahead with this type of filtering.

At least for the examples I provided, this should be resolved by T146496, I believe

At least for the examples I provided, this should be resolved by T146496, I believe

Yes, but the original issue that prompted this ticket (special characters ...) still remains.