Why is the top-views data from different sources not the same?
Closed, Resolved · Public

Description

see https://wikimedia.org/api/rest_v1/metrics/pageviews/top/zh.wikipedia/all-access/2017/08/02
and
https://zh.wikipedia.org/api/rest_v1/feed/featured/2017/08/02

In the metrics API, the top two articles are "Wiki" and "維基媒體基金會", but in the feed they are not.

Event Timeline

elukey subscribed.

Because they are different APIs showing different content :)

Please re-open if you need more info (or just follow up in the #wikimedia-analytics IRC channel)

Why is the top-views data different?

I'm afraid there are bugs that leave RESTBase data poorly synced with the Pageviews API.

Also, @elukey, I would point out that both were the same until about a month ago.

mobrovac subscribed.

AFAIK, MCS manipulates the data received from the Pageviews API, so perhaps that's why there is a difference? Could MCS people take a look?

MCS does some filtering.

  • It removes the main page since that would always be the same.
  • It removes pages that are not in the main-namespace. (We mainly want to get rid of Special pages. One could argue that Category pages might be OK. That might be up for discussion.)
  • It removes pages where we think the traffic is bot-inflated.
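
The three rules above could be sketched roughly as follows. This is only an illustration, not MCS's actual code: the function name, the two predicate parameters, the namespace prefixes, and the set of suspected titles are all hypothetical stand-ins, and entries are assumed to be dicts with `article` and `views` keys.

```python
from typing import Callable, Iterable


def filter_top_pages(
    entries: Iterable[dict],
    main_page_title: str,
    is_main_namespace: Callable[[str], bool],
    looks_bot_inflated: Callable[[dict], bool],
) -> list:
    """Apply the three described rules in order (hypothetical sketch)."""
    kept = []
    for entry in entries:
        title = entry["article"]
        if title == main_page_title:
            continue  # rule 1: drop the main page
        if not is_main_namespace(title):
            continue  # rule 2: drop pages outside the main namespace
        if looks_bot_inflated(entry):
            continue  # rule 3: drop pages with suspected bot-inflated traffic
        kept.append(entry)
    return kept


# Illustrative input only: "Main_Page" and "Special:Search" are placeholder
# titles with made-up view counts; the last two rows use counts from this task.
sample = [
    {"article": "Main_Page", "views": 100000},
    {"article": "Special:Search", "views": 90000},
    {"article": "Wiki", "views": 66134},
    {"article": "楚乔传", "views": 53127},
]

# Assumption: a title prefix is enough to recognise a non-article namespace.
NON_ARTICLE_PREFIXES = ("Special:", "Category:")
# Assumption: suspected bot-inflated titles are known in advance.
SUSPECTED_BOT_TITLES = {"Wiki"}

result = filter_top_pages(
    sample,
    main_page_title="Main_Page",
    is_main_namespace=lambda t: not t.startswith(NON_ARTICLE_PREFIXES),
    looks_bot_inflated=lambda e: e["article"] in SUSPECTED_BOT_TITLES,
)
print([e["article"] for e in result])
```

Passing the namespace and bot checks in as functions keeps the sketch honest about what is unknown here: how MCS actually decides that traffic is bot-inflated is not described in this task.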

Also I should mention that MCS responds with the previous day's data. So, you'd want to compare https://zh.wikipedia.org/api/rest_v1/feed/featured/2017/08/02 with https://wikimedia.org/api/rest_v1/metrics/pageviews/top/zh.wikipedia/all-access/2017/08/01. That is because it's part of the feed data calculated on 2017/08/02. At that time (shortly after 00:00 UTC) it only knows what the top pages of the previous day were. It cannot predict the current day's top viewed articles.
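
As a small sketch of that one-day offset, the pair of URLs to compare for a given feed date could be built like this. The URL patterns are copied from the links in this task; the function name and its default parameters are hypothetical.

```python
from datetime import date, timedelta

FEED_URL = "https://{domain}/api/rest_v1/feed/featured/{d:%Y/%m/%d}"
TOP_URL = (
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/top/"
    "{project}/all-access/{d:%Y/%m/%d}"
)


def urls_to_compare(feed_date, project="zh.wikipedia", domain="zh.wikipedia.org"):
    """Return (feed URL, pageviews URL) that describe the same day of traffic."""
    # The feed for day D carries the most-read list of day D - 1.
    views_date = feed_date - timedelta(days=1)
    return (
        FEED_URL.format(domain=domain, d=feed_date),
        TOP_URL.format(project=project, d=views_date),
    )


feed_url, top_url = urls_to_compare(date(2017, 8, 2))
print(feed_url)
print(top_url)
```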

I think @bearND explained it pretty well. Data from the pageview API is subject to fluctuations due to bot traffic (not self-reported as such and thus labelled as "user"), and it also includes Main_Page, which is not an interesting entry. The Android app massages that data and creates a more interesting feed.

Shizhao reopened this task as Open. Edited Aug 8 2017, 2:33 AM

MCS does some filtering.

  • It removes the main page since that would always be the same.

"Wiki" and "維基媒體基金會" are not the main page.

  • It removes pages that are not in the main-namespace. (We mainly want to get rid of Special pages. One could argue that Category pages might be OK. That might be up for discussion.)

"Wiki" and "維基媒體基金會" are in the main namespace.

  • It removes pages where we think the traffic is bot-inflated.

The data is user-only (no bots).

Also I should mention that MCS responds with the previous day's data. So, you'd want to compare https://zh.wikipedia.org/api/rest_v1/feed/featured/2017/08/02 with https://wikimedia.org/api/rest_v1/metrics/pageviews/top/zh.wikipedia/all-access/2017/08/01. That is because it's part of the feed data calculated on 2017/08/02. At that time (shortly after 00:00 UTC) it only knows what the top pages of the previous day were. It cannot predict the current day's top viewed articles.

https://zh.wikipedia.org/api/rest_v1/feed/featured/2017/08/02:
No1. "楚乔传": 2017-08-01Z views 53127
No2. "賭城群英會": 2017-08-01Z views 22306
......

https://wikimedia.org/api/rest_v1/metrics/pageviews/top/zh.wikipedia/all-access/2017/08/01
No1. "Wiki": 66134
No2. "維基媒體基金會": 61413
No3. "楚乔传": 53127
No4. "賭城群英會": 22306
......

But "https://zh.wikipedia.org/api/rest_v1/feed/featured/2017/08/02" has no data for "Wiki" and "維基媒體基金會"?

and see

https://zh.wikipedia.org/api/rest_v1/feed/featured/2017/08/03
https://wikimedia.org/api/rest_v1/metrics/pageviews/top/zh.wikipedia/all-access/2017/08/02
https://wikimedia.org/api/rest_v1/metrics/pageviews/top/zh.wikipedia/all-access/2017/08/03

same problem.

See https://tools.wmflabs.org/pageviews/?project=zh.wikipedia.org&platform=all-access&agent=user&range=latest-10&pages=Wiki

For "Wiki", the user-only data (no bots) for 20170801 also shows 66134 views.

https://tools.wmflabs.org/pageviews/?project=zh.wikipedia.org&platform=all-access&agent=user&range=latest-10&pages=%E7%B6%AD%E5%9F%BA%E5%AA%92%E9%AB%94%E5%9F%BA%E9%87%91%E6%9C%83

For "維基媒體基金會", the views for 20170801 are also 61413.

@Shizhao but where does it say that both should be the same?

@Liuxinyu970226 The view counts for the same article on the same day are identical, but one of the sources lacks data for "Wiki" and "維基媒體基金會".

If the answers above don't help you, then I'm not sure what your aim is. @MusikAnimal's tool, Tool-Pageviews, says in its FAQ:

Why can't I view data for today's date?

The Wikimedia pageviews API generally takes a full 24 hours to populate, sometimes longer. In some situations you may see data missing for yesterday's date as well, which will be left blank rather than showing a count of zero views.

Since this is your second re-open, should I assume that you want this data to be populated dynamically?

@Liuxinyu970226 I mean the data is for the same day, but one source lacks data for "Wiki" and "維基媒體基金會".

https://zh.wikipedia.org/api/rest_v1/feed/featured/2017/08/02 contains 20170801 data (without "Wiki" and "維基媒體基金會"); "楚乔传" is 53127.

https://wikimedia.org/api/rest_v1/metrics/pageviews/top/zh.wikipedia/all-access/2017/08/01 also contains 20170801 data; "楚乔传" is also 53127.

The dates in the URLs differ, though. (This is not yesterday's data! Yesterday was 20170807.)

@Shizhao: Some pageviews that "appear" to be from users are actually from bots. Some of those are removed on the feeds endpoint but not on the pageview API endpoint. That is the case for "wiki" and likely for "維基媒體基金會" (Wikimedia Foundation).

So these diffs are probably the reason that both are not shown in feeds but are shown in pageviews (note that "維基媒體基金會", a zh-hant title, is a redirect to zh-hans "维基媒体基金会"):

diff | username/bot name
39703037 | Antigng-bot (@Antigng)
42891150 | 59.126.69.221
42891194 | 星耀晨曦 (@RazeSoldier)
43675263 | 小躍 (@EagerLin, afair)
43831579 | 2607:fb90:405:f60e:e7c3:9984:ce81:60a5
43831659 | @Tegel
44175497 | @Zhanglide
44294167 | @RabbitMeow
45046375 | @Fxqf
45231054 | InternetArchiveBot (@Cyberpower678)
45298628 | @John123521
45298778 | @John123521
45463533 | 126.212.84.78
45463675 | @A2093064
42658624 | Yinweichen-bot (@Yinweichen)
44703655 | InternetArchiveBot (@Cyberpower678)
45047212 | @南极熊
At least one of the non-bot edits listed here is treated as a (fake?) bot edit for some reason (did they use AWB? Wikiplus? Third-party scripts?).

Does this situation occur in other languages?

Actually yes, except for those that are rarely active. Taking the 2017-08-07 data as an example:

...
Given those examples, I don't think that this task should remain open. It's expected that main pages and special pages are counted as articles in pageviews but not in feeds, and there's nothing to be resolved.

Also @Shizhao, your comments are still foggy, at least:

It removes pages where we think the traffic is bot-inflated.

only user data

Haven't you seen T172379#3510994? There are at least Antigng-bot and InternetArchiveBot listed there, and you say those are "only user data"?!

see https://tools.wmflabs.org/pageviews/?project=zh.wikipedia.org&platform=all-access&agent=user&range=latest-10&pages=Wiki

rank | article | views data (source 1) | views data (source 2) | views data (source 3)
#1 | Wiki | 66134 | none | 66134
#2 | 維基媒體基金會 | 61431 | none | 61431
#3 | 楚乔传 | 53127 | 53127 | 53127
#4 | 賭城群英會 | 22306 | 22306 | 22306
#5 | 河伯的新娘_2017 | 18716 | 18716 | 18716
#6 | 三生三世十里桃花_(电视剧) | 17094 | 17094 | 17094
#7 | 超時空男臣 | 16636 | 16636 | 16636
#8 | 劉文雄_(基隆市) | 16248 | 16248 | 16248
#9 | 浪漫醫生金師傅 | 14282 | 14282 | 14282
#10 | 稍息立正我愛你 | 13872 | 13872 | 13872
#11 | 終極三國_(2017年) | 12413 | 12413 | 12413
#12 | 孤單又燦爛的神-鬼怪 | 11040 | 11040 | 11040

Why does only source 2 lack data for #1 and #2, while all other data is the same? Source 2 and source 3, at least, are both user-only data.
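
The comparison behind that question can be reproduced with a few lines. This is a minimal sketch: the helper name `missing_from` is hypothetical, and the lists hold only the first four rows of the table above, typed in by hand rather than fetched from any endpoint.

```python
def missing_from(reference, other):
    """Return (title, views) pairs that appear in `reference` but not in `other`."""
    other_titles = {title for title, _ in other}
    return [(title, views) for title, views in reference if title not in other_titles]


# First four rows of the table, as (article, views) pairs.
source_1 = [
    ("Wiki", 66134),
    ("維基媒體基金會", 61431),
    ("楚乔传", 53127),
    ("賭城群英會", 22306),
]
# Source 2 shows "none" for the first two articles, so they are left out.
source_2 = [
    ("楚乔传", 53127),
    ("賭城群英會", 22306),
]

print(missing_from(source_1, source_2))
```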

Also @Liuxinyu970226

..... ....

Your comparison is wrong. The first link's data is from 20170807, and the second link's data is from 20170806.

@Shizhao I can only answer one question for now:

Why does only source 2 lack data for #1 and #2, while all other data is the same?

See T172379#3509744 from nuria.

Suppose it just removes some bot data. For "wiki", the view count is 66134 (rank #1 on the pageview API endpoint, which includes some bot data), while "六四事件" has 6621 (rank #30 in the feeds endpoint's top 30, which removes some bot data; "wiki" is not in the top 30). So 66134 - 6621 = 59513. Does this mean that bots visited the "wiki" page at least 59k times on 20170801? Really? 59k bot views; I really can't believe it. Who can provide details on these 59k bot views? Was it just one bot, or many?
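
The estimate above can be written out as a short calculation. It rests on an assumption that this task does not confirm: that the feed ranks pages purely by the views left after bot views are subtracted, so "wiki" would need to drop below the #30 entry to disappear from the list.

```python
wiki_views = 66134     # "wiki" on the pageview API endpoint, 20170801
rank_30_views = 6621   # "六四事件", rank #30 on the feeds endpoint

# Assumption: a page leaves the feed's top 30 only if its remaining views
# fall below the #30 entry, so at least this many views were discounted.
min_removed = wiki_views - rank_30_views
print(min_removed)  # 59513
```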

I doubt the feed algorithm considers pageviews when removing a page. Very likely, pages that show up a lot (over and over), regardless of traffic, will not get displayed.

That being said, the "wiki" article has a lot of bot traffic not identified as such. The user agent below is responsible for a lot of pageviews (~60,000) on August 1st.

{"browser_major":"9","os_family":"Windows 7","os_major":"-","device_family":"Other","browser_family":"IE","os_minor":"-","wmf_app_version":"-"}

Thanks! It seems that the "wiki" data has not been normal for the last month: https://tools.wmflabs.org/pageviews/?project=zh.wikipedia.org&platform=all-access&agent=user&range=latest-60&pages=Wiki

Does the same seem to be true for "維基媒體基金會"?

Sorry, but I do not understand what the question is. I think we have provided quite a bit of info about why things differ between those endpoints, and we will be closing the ticket shortly.