
Add pageviews total counts to WDQS
Open, Medium, Public


At this point, the only way to rank various Wikidata results is to order them by sitelink-count. This offers a fairly good indicator of how many different languages/cultures are interested in a topic, but it is not very accurate, especially when a topic is mostly related to a single language.

I propose we introduce a new type of entry to WDQS:

# Naming is TBD
<>   prefix:total_page_views   [integer] .
<>   prefix:last_24h_page_views   [integer] .

A script would download the hourly files from the pageview dumps and increment the counters once an hour. The updates should happen in bulk (stackoverflow). Each file contains about 5 million entries (<40 MB gzipped). See also the dump info.
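
As a sketch of what such a script could do with one hourly file (the line format in the docstring follows the public pageview dumps, but verify it against the current dump layout before relying on it):

import gzip
from collections import Counter

def parse_hourly_dump(path):
    """Aggregate one hourly pageview dump into {(project, title): views}.

    Each line looks like "<project> <title> <views> <bytes>",
    e.g. "en Main_Page 12345 0".
    """
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.rstrip("\n").split(" ")
            if len(parts) != 4:
                continue  # skip malformed lines
            project, title, views, _size = parts
            counts[(project, title)] += int(views)
    return counts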

Additionally, we may want to keep a running total for the last 24 hours - a bit trickier, but also doable - e.g. by keeping the totals of the last 24 files in memory and uploading the deltas every hour. On restart, the service would re-download the last 24 files, delete all existing 24h totals, and re-upload them.
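
And a minimal sketch of that rolling 24-hour window, assuming the per-hour Counters from the parser above fit in memory:

from collections import Counter, deque

class RollingDayTotals:
    """Keep the last 24 hourly Counters and emit per-page deltas."""

    def __init__(self):
        self.window = deque()    # at most 24 hourly Counters
        self.totals = Counter()  # current 24h total per (project, title)

    def add_hour(self, hourly):
        deltas = Counter(hourly)
        if len(self.window) == 24:
            # The oldest hour falls out of the window; subtract it.
            deltas.subtract(self.window.popleft())
        self.window.append(hourly)
        self.totals.update(deltas)
        # Only non-zero deltas need to be uploaded each hour.
        return {page: d for page, d in deltas.items() if d != 0}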

P.S. I am hacking on it at the moment (Python). Need naming suggestions for the predicate.

Event Timeline

Yurik created this task. · Sep 5 2017, 3:23 AM
Restricted Application added projects: Wikidata, Discovery. · Sep 5 2017, 3:23 AM
Restricted Application added a subscriber: Aklapper.
Yurik updated the task description. · Sep 5 2017, 3:23 AM
Yair_rand rescinded a token.
Yair_rand awarded a token.
Yurik updated the task description. · Sep 5 2017, 3:35 AM
Yurik updated the task description. · Sep 5 2017, 5:48 AM

Lydia_Pintscher added a comment.

I would prefer that we concentrate on T143424 instead and have that include page views as one indicator.

Esc3300 added a subscriber: Esc3300. · Sep 5 2017, 9:55 AM

The simplicity of this approach seems convincing. It could complement the number of sitelinks and statements.

Yurik added a comment. · Sep 5 2017, 11:28 AM

@Lydia_Pintscher, having a built-in ranking system is awesome, but that's a search-optimization problem - just as the other ticket suggests, it will be part of the search drop-down.

Exposing the raw view counts via WDQS is very different - it allows query authors to correlate data, e.g. to find entities of the same subclass with roughly the same view counts, or to get the geodistance to the closest item that gets 50% more views than the current one. I'm sure users will come up with much better examples too.
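
To make the first of those concrete, here is the shape such a query could take; the wdpv: prefix and predicate name are placeholders, since naming is still TBD:

import requests

WDQS = "https://query.wikidata.org/sparql"
QUERY = """
PREFIX wdpv: <https://example.org/pageviews/>  # hypothetical prefix
SELECT ?item ?views WHERE {
  wd:Q243 wdpv:total_page_views ?ref .         # reference item (Eiffel Tower)
  ?item wdt:P31 wd:Q570116 ;                   # same class: tourist attraction
        wdpv:total_page_views ?views .
  FILTER(?views >= ?ref * 0.9 && ?views <= ?ref * 1.1)
}
LIMIT 50
"""

r = requests.get(WDQS, params={"query": QUERY, "format": "json"}, timeout=60)
for row in r.json()["results"]["bindings"]:
    print(row["item"]["value"], row["views"]["value"])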

I'm almost done with the simple Python importer implementation, and will test it tomorrow.

Personally, I am not convinced this is a good match for a graph database. This looks like something that is better served by a generic API/database. But I am not sure I understand the use case properly yet.

At this point, the only way to rank various Wikidata results is to order them by sitelink-count

Not sure what you mean here. Which results - do you mean SPARQL query results? Search results? In any case, you can rank them by several criteria; the question is what the purpose of the ranking is.

We also have the "popularity score" field in the Elastic index, which AFAIK is already auto-updated. It is used for ranking already.

Yurik added a comment. · Sep 6 2017, 6:18 AM

I would like to solicit more community feedback on how useful this would be. Perhaps this is not needed at all, or not worth the hassle. As an already-working example on a test server, here is a query that lists Wikidata items without French labels but with French articles, ordered by the popularity of the French articles.

You may also view other examples related to "pageviews" (in the examples list).

mforns moved this task from Incoming to Radar on the Analytics board. · Sep 28 2017, 3:48 PM
Smalyshev triaged this task as Medium priority. · Feb 12 2018, 8:06 AM
Lydia_Pintscher moved this task from incoming to monitoring on the Wikidata board. · Mar 5 2018, 4:15 PM
Bovlb added a subscriber: Bovlb. · Sep 20 2018, 8:32 PM
Bovlb added a comment. · Sep 20 2018, 8:46 PM

@Yurik asked:

I would like to solicit more community feedback on how useful this would be.

I would find this extremely useful. What can I do to help make this happen?

christophbraun added a comment.

Adding page views to WDQS would be highly beneficial to the GLAMwiki community. Exploring the impact of cultural partnerships with galleries, libraries, archives and museums, with a tool beyond the capability of treeviews, could give us an edge in negotiating with institutional partners. Successful projects and content donations depend on meaningful metrics that are quantifiable and can measure the impact of our efforts.

Is this proposal from @Yurik technically possible, and if so, what hurdles do we have to overcome to get it done?

Elya added a subscriber: Elya. · Jan 3 2020, 8:26 PM

I'm still not sure what possibilities can be achieved here, but it looks like a whole lot of new things to explore and analyse. One thing, of course, is the GLAM cooperations @christophbraun mentioned, but I'm sure there will be a lot of other ideas … edit-a-thons, etc.

Possible downside: Wikimedia organisations will use this for our "beloved" metrics, which could make community projects even more of a "countable" and to-be-controlled thing, at the expense of quality, sustainability and - last but not least - fun.

Tagishsimon added a comment.

Wikidata currently gets ~660k edits per day.

This proposal - if I understand it properly - requires an additional ~5 million edits per day, or perhaps 5 million edits per hour ("and increment the counters once an hour") ... who knows.

And that gives us only a couple of nuggets of data per sitelinked item. It provides no support for users who want some other time base for pageviews.

It is ... not ... a happening proposal. Even if there were only a weekly update, that would still more than double the number of Wikidata edits, with all the concomitant lag problems.

Perhaps worth noting another route: provision of a facility to make calls from WDQS to the REST API, in much the same way as WDQS facilitates (some?) MWAPI calls. This would, ideally, be much more general-purpose and give WDQS users access to the wide array of metadata available there, not least edit dates & editor IDs.
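
For reference, this is the kind of per-article data the REST API already serves (a real endpoint; the article and date range here are just examples):

import requests

url = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
       "en.wikipedia/all-access/user/Douglas_Adams/daily/20200101/20200131")
resp = requests.get(url, headers={"User-Agent": "pageviews-example/0.1"})
for item in resp.json()["items"]:
    print(item["timestamp"], item["views"])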

Yurik added a comment. · Jan 4 2020, 3:02 AM

@Tagishsimon, this proposal would not edit Wikidata. Instead, as part of the WDQS import process, it would upload pageviews in bulk from the pageview dump files directly into the Blazegraph index. It could do this every hour, and computation-wise it would be relatively inexpensive (I ran it as part of Sophox a few times).
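
A minimal sketch of that bulk upload, assuming a local Blazegraph SPARQL endpoint and the same placeholder predicate as above (both names are assumptions, not the deployed configuration):

import requests

# Default Blazegraph namespace endpoint; adjust to the actual WDQS instance.
ENDPOINT = "http://localhost:9999/bigdata/namespace/wdq/sparql"

def upload_totals(batch):
    """batch: iterable of (entity_uri, total_views) pairs."""
    values = "\n".join(f"(<{uri}> {views})" for uri, views in batch)
    update = f"""
PREFIX wdpv: <https://example.org/pageviews/>
DELETE {{ ?item wdpv:total_page_views ?old }}
INSERT {{ ?item wdpv:total_page_views ?new }}
WHERE {{
  VALUES (?item ?new) {{ {values} }}
  OPTIONAL {{ ?item wdpv:total_page_views ?old }}
}}"""
    requests.post(ENDPOINT, data={"update": update}, timeout=300).raise_for_status()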

christophbraun added a comment.

Thanks for your input @Elya, @Yurik and @Tagishsimon.
Do you know who has to greenlight/authorise the upload to the Blazegraph index?
Assuming there is a huge backlog for this kind of request, where can I find it, and how is it prioritised?

Yurik added a comment. · Jan 4 2020, 6:52 PM

I would guess this is mostly a devops task - orchestrating the execution of an updating script. Here's the working implementation -

Simply run it locally near the Blazegraph server.

Nuria added a subscriber: Nuria. · Jan 5 2020, 7:20 PM

Updating WDQS (a relational query engine) with metadata about pageviews (by definition a time series) does not seem like the best idea from a data-modeling standpoint. The GLAM use case is much better served by an API that returns pageviews across time; I would put the engineering effort into building such an API.

christophbraun added a comment.

Thanks for your comment @Yurik and @Nuria.
The GLAM use case applies to queries beyond the scope of existing tools like or
WDQS would allow us to select a specific set of information that would be difficult to get by just intersecting categories (e.g. all paintings by a painter that were painted in a certain timeframe and are now part of the collection of a particular GLAM). Even more so for information that is not reflected in our category system (e.g. the provenance of a painting).
Can we achieve this with an API, or is WDQS mandatory for this?

Nuria added a comment. · Jan 5 2020, 10:59 PM

@christophbraun I think it would help to start a ticket describing your use case in detail. Keep in mind that pageviews (defined as content consumed by humans) do not really "apply" to Wikidata items. The bulk of the HTTP-request activity on the site has a lot to do with bots creating/consuming content; bot requests account for about 70-80% of the total.

Zache added a subscriber: Zache. · Tue, Feb 4, 7:13 AM