
Count Wikidata page views per page type
Closed, ResolvedPublic

Description

Motivation
Right now, we know how many edits are made in Wikidata, but we don't really know how many times pages of each type are actually being viewed.

Task

  • Get the total number of page views per day for
    • item pages
    • property pages
    • lexeme pages
    • entity schemas
  • Get the average number of views per day for
    • item pages
    • property pages
    • lexeme pages
    • entity schemas
  • Split all numbers by device type (desktop / mobile)

Notes
If done on grafana, this should be a query in the daily* namespace, which means that we don't get minutely but daily data, which is persisted for(ever?)

Event Timeline

Restricted Application added a subscriber: Aklapper. (Nov 2 2018, 8:55 AM)
Lea_WMDE changed the task status from Open to Stalled. (Nov 2 2018, 8:55 AM)

We need to define the dashboard output first

Lea_WMDE updated the task description. (Nov 2 2018, 8:57 AM)
Addshore moved this task from incoming to in progress on the Wikidata board. (Nov 3 2018, 10:20 AM)
Lea_WMDE changed the task status from Stalled to Open. (Nov 7 2018, 9:29 AM)
Lea_WMDE triaged this task as Normal priority.
Lea_WMDE updated the task description.
Lea_WMDE updated the task description.
Addshore removed GoranSMilovanovic as the assignee of this task. (Jan 29 2019, 9:21 AM)
Addshore moved this task from Data Analytics to Product on the WMDE-Analytics-Engineering board.
Addshore added a subscriber: GoranSMilovanovic.
Lea_WMDE moved this task from Backlog to Other on the Wikidata-Termbox board. (Apr 12 2019, 11:43 AM)
Lea_WMDE updated the task description. (Jul 3 2019, 11:02 AM)

@Lea_WMDE I guess 640 is the EntitySchema namespace (figured this out from this Gerrit patch, since it is not documented in the Wikidata namespaces), right?

  • data set production - completed.
  • Next steps:
    • orchestrate Pyspark from an R environment where post-processing will take place;
    • prepare data for visualizations and export to published data sets;
    • visualizations + dashboard;
    • test, deploy;
  • data set review requested from Analytics in T227905;
  • next steps:
    • visualizations + dashboard;
    • test, deploy.

@Lea_WMDE You can now test your new dashboard: http://wmdeanalytics.wmflabs.org/WD_pageviewsPerNamespace/

  • Next steps:
    • introduce client-side dependency, and
    • put on a regular daily update as soon as
    • T227905 (data review) is resolved.

@GoranSMilovanovic great, thanks! Would it be possible to add decimal points to the labels? I always have a tough time deciding whether I look at hundred thousands or millions :)

@Lea_WMDE I am on it, putting the dashboard on regular updates + fixing the labels to include decimal points.

@Lea_WMDE

  • The dashboard is now running a regular daily update;
  • fixing the axis labels now.

@Lea_WMDE

  • On the vertical axes the dashboard now uses K, M, and B for thousands, millions, and billions of pageviews, respectively.
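For readers who want to see what such K/M/B axis labels amount to, here is a minimal Python sketch. It is not the dashboard's code (the dashboard is built with RStudio Shiny); the function name `abbreviate` and the choice of one decimal place are assumptions made for illustration only.

```python
# Hypothetical helper: abbreviate a pageview count with K, M, or B.
# One decimal place is an assumption, not the dashboard's actual setting.
SUFFIXES = [(1_000_000_000, "B"), (1_000_000, "M"), (1_000, "K")]


def abbreviate(count):
    """Return count as a short label, e.g. thousands as K, millions as M."""
    for threshold, suffix in SUFFIXES:
        if abs(count) >= threshold:
            return f"{count / threshold:.1f}{suffix}"
    # Values below one thousand are shown unchanged.
    return str(count)


for value in (999, 544500, 5764558):
    print(value, "->", abbreviate(value))
```

With labels like these, the difference between hundreds of thousands (K) and millions (M) is visible at a glance, which is the point of the request above.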

@GoranSMilovanovic thanks! We currently have quite differing numbers from https://stats.wikimedia.org/v2/#/wikidata.org/reading/total-page-views/normal|bar|2-year|~total|daily
And I think that is because they focus on human traffic, whereas we have the total number, right?
Would it be possible to show separate lines for human and bot traffic per graph?

I like that one can see the split, but I do miss the average number you displayed before :)

I also tried to check the numbers in comparison with stats2, and for the sample day I looked at numbers look super different :/
June 13 was ~5.7 million page views on stats2, but if we add up all user page views from our boards, we have about 10 times fewer page views, with ~545 K page views.

GoranSMilovanovic added a comment. (Edited, Aug 1 2019, 3:12 PM)

@Lea_WMDE So that is one order of magnitude, which looks outright impossible. Please let me check.
I guess a difference of this magnitude could not be a consequence of the fact that we have picked only four namespaces (Item, Property, Lexeme, and EntitySchema)? Please confirm.

GoranSMilovanovic added a subscriber: Milimetric. (Edited, Aug 1 2019, 3:49 PM)

@Lea_WMDE Ok, here is a direct test (Pyspark code against the wmf.pageview_hourly table):

pw = sqlContext.sql('SELECT namespace_id, access_method, agent_type, SUM(view_count) AS pageviews \
                        FROM wmf.pageview_hourly\
                        WHERE  year = ' + str(d.year) + ' AND month = ' + str(d.month) + ' AND day = ' + str(d.day) + \
                        ' AND project = "wikidata" \
                        AND (namespace_id = 0 OR namespace_id = 120 OR namespace_id = 146 OR namespace_id = 640) \
                        GROUP BY namespace_id, access_method, agent_type ORDER BY namespace_id, access_method, agent_type')

where d is June 13, 2019:

In [31]: d
Out[31]: datetime.datetime(2019, 6, 13, 15, 29, 14, 874165)

The query results in the pw DataFrame:

[Row(namespace_id=0, access_method='desktop', agent_type='spider', pageviews=3713136),
 Row(namespace_id=0, access_method='desktop', agent_type='user', pageviews=413537),
 Row(namespace_id=0, access_method='mobile web', agent_type='spider', pageviews=408138),
 Row(namespace_id=0, access_method='mobile web', agent_type='user', pageviews=115864),
 Row(namespace_id=120, access_method='desktop', agent_type='spider', pageviews=7084),
 Row(namespace_id=120, access_method='desktop', agent_type='user', pageviews=11586),
 Row(namespace_id=120, access_method='mobile web', agent_type='spider', pageviews=1418),
 Row(namespace_id=120, access_method='mobile web', agent_type='user', pageviews=3193),
 Row(namespace_id=146, access_method='desktop', agent_type='spider', pageviews=938),
 Row(namespace_id=146, access_method='desktop', agent_type='user', pageviews=179),
 Row(namespace_id=146, access_method='mobile web', agent_type='spider', pageviews=167),
 Row(namespace_id=146, access_method='mobile web', agent_type='user', pageviews=8),
 Row(namespace_id=640, access_method='desktop', agent_type='spider', pageviews=1086),
 Row(namespace_id=640, access_method='desktop', agent_type='user', pageviews=133),
 Row(namespace_id=640, access_method='mobile web', agent_type='spider', pageviews=3)]

which matches exactly what we get for June 13, 2019 from our new Dashboard.
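As a quick cross-check of the rows above, the following plain-Python sketch sums them by agent type and, for user traffic only, by page type. The rows are copied from the query result; the helper name `totals_by` and the `PAGE_TYPES` labels are introduced here for illustration (640 as EntitySchema follows the earlier comment in this task).

```python
from collections import defaultdict

# (namespace_id, access_method, agent_type, pageviews) for June 13, 2019,
# copied from the pw DataFrame shown above.
ROWS = [
    (0, "desktop", "spider", 3713136),
    (0, "desktop", "user", 413537),
    (0, "mobile web", "spider", 408138),
    (0, "mobile web", "user", 115864),
    (120, "desktop", "spider", 7084),
    (120, "desktop", "user", 11586),
    (120, "mobile web", "spider", 1418),
    (120, "mobile web", "user", 3193),
    (146, "desktop", "spider", 938),
    (146, "desktop", "user", 179),
    (146, "mobile web", "spider", 167),
    (146, "mobile web", "user", 8),
    (640, "desktop", "spider", 1086),
    (640, "desktop", "user", 133),
    (640, "mobile web", "spider", 3),
]

# Illustrative labels for the four namespaces selected in the query.
PAGE_TYPES = {0: "item", 120: "property", 146: "lexeme", 640: "entity schema"}


def totals_by(rows, key):
    """Sum pageviews grouped by key(row)."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[3]
    return dict(totals)


by_agent = totals_by(ROWS, key=lambda r: r[2])
user_rows = [r for r in ROWS if r[2] == "user"]
by_user_page_type = totals_by(user_rows, key=lambda r: PAGE_TYPES[r[0]])

print(by_agent)
print(by_user_page_type)
```

The user total comes to 544,500, which agrees with the ~545 K figure read off the dashboard earlier in this thread.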

Moreover, let's have a look at the total number of pageviews with agent_type user (i.e. spiders are excluded, as in Wikistats2) for June 13, 2019:

pw = sqlContext.sql('SELECT SUM(view_count) AS pageviews \
                            FROM wmf.pageview_hourly\
                            WHERE  year = ' + str(d.year) + ' AND month = ' + str(d.month) + ' AND day = ' + str(d.day) + \
                        ' AND project = "wikidata" \
                        AND agent_type = "user"')

results in

Row(pageviews=1420740)

which is far below the number reported on Wikistats2 for June 13, 2019, which is: 5,764,558. This is also the number reported by the Pageviews API (no wonder, since it powers Wikistats2); API call:

https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/wikidata/all-access/all-agents/daily/2019061300/2019061300

result:

{"items":[{"project":"wikidata","access":"all-access","agent":"all-agents","granularity":"daily","timestamp":"2019061300","views":5764558}]}
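To put the two figures side by side, here is a small Python sketch that parses the API response quoted above and compares it with the Hive total. It makes no network request; the response string and the 1,420,740 total are copied from this comment, and the helper name `daily_views` is hypothetical. Note that the two numbers are not filtered identically: the API call requests all-agents, while the Hive query restricts agent_type to "user".

```python
import json

# API response body as quoted above (all-access, all-agents, daily).
RESPONSE = (
    '{"items":[{"project":"wikidata","access":"all-access",'
    '"agent":"all-agents","granularity":"daily",'
    '"timestamp":"2019061300","views":5764558}]}'
)

# Total from the Hive query above (project = "wikidata", agent_type = "user").
HIVE_USER_TOTAL = 1420740


def daily_views(payload):
    """Map each timestamp in a Pageviews API response to its view count."""
    return {item["timestamp"]: item["views"] for item in json.loads(payload)["items"]}


api_views = daily_views(RESPONSE)["2019061300"]
ratio = api_views / HIVE_USER_TOTAL

print(f"API: {api_views:,}  Hive (user only): {HIVE_USER_TOTAL:,}  ratio: {ratio:.2f}")
```

The ratio only quantifies the gap; it does not explain it.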

@Milimetric I am looking at the pageviews data from Wikidata for June 13, 2019, at: https://stats.wikimedia.org/v2/#/wikidata.org/reading/total-page-views/normal|bar|1-month|~total|daily and I can't seem to be able to reproduce it. Could you let me know what could be the possible source of differences? Thank you.

GoranSMilovanovic added a comment. (Edited, Aug 1 2019, 8:39 PM)

@Milimetric Thanks for the clarification, Dan.
@Lea_WMDE This implies that

  • the difference between the numbers reported on our Pageviews per namespace Dashboard and Wikistats2 is a consequence of our narrowing down the selection of namespaces to
(namespace_id = 0 OR namespace_id = 120 OR namespace_id = 146 OR namespace_id = 640)

Everything seems to fall into place now. Please review and let me know if any additional work is needed here. N.B. I will see to getting the average number of pageviews (not the user/spider split averages) back, no worries.

@GoranSMilovanovic cool, thanks! Could you add the info why wikistats2 data differs from these graphs to the explanatory text?
Apart from that I cannot see the total average yet, it is overlapped by the detailed info about the currently focussed data point

@Lea_WMDE

Could you add the info why wikistats2 data differs from these graphs to the explanatory text?

Done: dashboard.

Apart from that I cannot see the total average yet

I can see it; maybe CTRL+F5 is what you need.

it is overlapped by the detailed info about the currently focussed data point.

I am not sure if I understand: what is it that you cannot see yet and that is overlapped by the detailed info?
Also, the detailed info on particular points does not overlap with anything on my screen, as you can see from the following screenshot:

so it could be your screen.

Interesting. For me the dashboard looks e.g. like this


and the position of data point info is always the same, no matter which point in the graph I highlight

@Lea_WMDE

Strange.
Lea, please let me know which browser you are using. I have tested the dashboard on Chromium and Mozilla Firefox under Ubuntu; the WDCM system, which uses the same front-end technology (RStudio Shiny), was tested over an even broader range of browsers (including on macOS), and at this point I cannot really tell what is causing the problem - but I will do my best to figure it out.

@Lea_WMDE Hm, this might be the solution - the dygraph is now set to dyLegend(show = 'follow'), please check: http://wmdeanalytics.wmflabs.org/WD_pageviewsPerNamespace/
Note. This was the initial solution and there is one thing I don't like about it in spite of the fact that it solves the problem that you were facing (focus overlapping title).
Let me know if you like this approach better, please.

Lea_WMDE closed this task as Resolved. (Aug 9 2019, 6:19 PM)

yes, this solves it for me! (For reference, I am using Mozilla Firefox under macOS). I'm putting the ticket to resolved since from my side there is nothing to do here anymore :) Thanks @GoranSMilovanovic !