
Add pageviews total counts to WDQS
Closed, Declined · Public

Description

At this point, the only way to rank various Wikidata results is to order them by sitelink-count. This offers a fairly good indicator of how many different languages/cultures are interested in a topic, but is not very accurate, especially when a topic is mostly related to a single language.

I propose we introduce a new type of entry to WDQS:

# Naming is TBD
<https://en.wikipedia.org/wiki/Albert_Einstein>   prefix:total_page_views   [integer] .
<https://en.wikipedia.org/wiki/Albert_Einstein>   prefix:last_24h_page_views   [integer] .

A script would download files from the dumps and increment the counters once an hour. The updates should happen in bulk (stackoverflow). Each file has about 5 million entries (<40MB gz). See also dump info.
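As a minimal sketch of that hourly import step (assuming the public hourly dumps under https://dumps.wikimedia.org/other/pageviews/ and their "domain_code page_title view_count response_size" line format; the function name is illustrative, not the actual importer):

import gzip
import io
from collections import Counter
from urllib.request import urlopen

DUMP_URL = ("https://dumps.wikimedia.org/other/pageviews/"
            "{y}/{y}-{m:02d}/pageviews-{y}{m:02d}{d:02d}-{h:02d}0000.gz")

def fetch_hourly_counts(year, month, day, hour, project="en"):
    # Return a Counter of page title -> views for one project and one hour.
    url = DUMP_URL.format(y=year, m=month, d=day, h=hour)
    counts = Counter()
    with urlopen(url) as resp:
        with gzip.open(io.BytesIO(resp.read()), mode="rt",
                       encoding="utf-8", errors="replace") as f:
            for line in f:
                parts = line.rstrip("\n").split(" ")
                if len(parts) != 4 or parts[0] != project:
                    continue
                counts[parts[1]] += int(parts[2])
    return counts

# Example: counts = fetch_hourly_counts(2018, 2, 12, 8)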

Additionally, we may want to keep the running total for the last 24 hours - a bit trickier, but also doable - e.g. by keeping the totals of the last 24 files in memory, and uploading the deltas every hour. On restart, the service would re-download the last 24 files, delete all existing 24h totals, and re-upload them.
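A sketch of that 24-hour window, building on the fetch_hourly_counts sketch above (the class and method names are mine, not part of the actual importer): keep the last 24 hourly Counters in a deque, and each hour upload only the difference between the newest file and the one that just fell out of the window.

from collections import Counter, deque

class Rolling24h:
    def __init__(self):
        self.window = deque(maxlen=24)   # the last 24 hourly Counters
        self.totals = Counter()          # current 24h totals per title

    def push_hour(self, hourly_counts):
        # Add one hourly file and return the per-title deltas to upload.
        deltas = Counter(hourly_counts)
        if len(self.window) == self.window.maxlen:
            deltas.subtract(self.window[0])   # file about to drop out
        self.window.append(hourly_counts)     # evicts the oldest entry
        self.totals.update(deltas)
        return {t: d for t, d in deltas.items() if d != 0}

On restart, re-running push_hour over the re-downloaded last 24 files rebuilds the same state before fresh deltas are uploaded.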

P.S. I am hacking on it at the moment (in Python). Need naming suggestions for the predicate.

Event Timeline


I would prefer if we concentrate on T143424 instead and have that include page views as one indicator.

The simplicity of this approach seems convincing. It could complement number of sitelinks and statements.

@Lydia_Pintscher, having a built-in ranking system is awesome, but that's a search-optimization problem - just as the other ticket suggests, it will be part of the search drop-down.

Exposing the raw views value via WDQS is very different - it allows query authors to correlate data, e.g. find entities of the same subclass whose view counts are about the same, or compute the geodistance to the closest item that gets 50% more views than the current one. I'm sure users will come up with much better examples too.
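For illustration only, such a query might take the shape sketched below. The pv: prefix and pv:total_page_views predicate are placeholders (naming was still TBD and the predicate was never deployed), so this shows the shape of a possible query rather than something that runs against the live endpoint today.

import requests

# wd:/wdt:/schema: are the standard WDQS prefixes; pv: is a placeholder
# for the proposed (never deployed) pageview predicate.
QUERY = """
PREFIX pv: <https://example.org/pageviews/>
SELECT ?item ?views WHERE {
  ?item wdt:P31 wd:Q5 .                          # e.g. humans
  ?article schema:about ?item ;
           schema:isPartOf <https://en.wikipedia.org/> ;
           pv:total_page_views ?views .          # hypothetical predicate
}
ORDER BY DESC(?views)
LIMIT 100
"""

def run(endpoint="https://query.wikidata.org/sparql"):
    resp = requests.get(endpoint, params={"query": QUERY, "format": "json"})
    resp.raise_for_status()
    return resp.json()["results"]["bindings"]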

I'm almost done with the simple Python importer implementation and will test it tomorrow.

Personally I am not convinced this is a good match for a graph database. This looks like something that is better as a generic API/database. But I am not sure I understand the use-case properly as of yet.

At this point, the only way to rank various Wikidata results is to order them by sitelink-count

Not sure what you mean here. Which results - do you mean SPARQL query results? Search results? In any case, you can rank them by several criteria; the question is what the purpose of the ranking is.

We also have the "popularity score" field in the Elastic index, which is already auto-updated AFAIK. That is already used for ranking.

I would like to solicit more community feedback on how useful this would be. Perhaps this is not needed at all, or not worth the hassle. As an already-working example on a test server, here is a query that lists Wikidata items without French labels but with French articles, ordered by the popularity of the French articles.

http://tinyurl.com/y7sr9j5g

You may also view other examples related to "pageviews" (in the examples list).

Smalyshev triaged this task as Medium priority. · Feb 12 2018, 8:06 AM

@Yurik asked:

I would like to solicit more community feedback on how useful this would be.

I would find this extremely useful. What can I do to help make this happen?

Adding page views to WDQS would be highly beneficial to the GLAMwiki community. Exploring the impact of cultural partnerships with galleries, libraries, archives and museums with a tool beyond the capability of treeviews could give us an edge in negotiating with institutional partners. Successful projects and content donations depend on meaningful metrics that are quantifiable and can measure the impact of our efforts.

Is this proposal from @Yurik technically possible, and if yes, what hurdles do we have to overcome to get it done?

I'm still not sure what can be achieved here, but it looks like a whole lot of new things to explore and analyse. One thing, of course, is the GLAM collaborations @christophbraun mentioned, but I'm sure there will be a lot of other ideas … edit-a-thons, etc.

Possible downside: Wikimedia organisations will use this for our "beloved" metrics, which could make community projects even more of a "countable", to-be-controlled thing instead of being about quality, sustainability and – last but not least – fun.

Wikidata currently gets ~660k edits per day.

This proposal - if I understand it properly - requires an additional ~5 million edits per day, or perhaps 5 million edits per hour ("and increment the counters once an hour") ... who knows.

And that gives us a couple of nuggets of data per sitelinked item. It provides no support for users who want some other time base for pageviews.

It is ... not ... a happening proposal. Even if there were only a weekly update, that would still more than double the number of Wikidata edits, with all the concomitant lag problems.

Perhaps worth noting another route: provision of a facility to make calls from WDQS to the REST API - https://wikimedia.org/api/rest_v1/#/Pageviews%20data - in much the same way as WDQS facilitates (some?) MWAPI calls. This would, ideally, be much more general purpose and give WDQS users access to the wide array of metadata available there, not least edit dates and editor IDs.
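For comparison, here is what fetching those numbers from the existing REST API looks like, using the documented per-article pageviews endpoint (the project, access and agent values are just examples):

import requests

BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

def daily_views(article, project="en.wikipedia",
                start="20180201", end="20180207"):
    # Per-article daily pageviews from the REST API; counts only human
    # ("user" agent) traffic across all access methods.
    url = f"{BASE}/{project}/all-access/user/{article}/daily/{start}/{end}"
    resp = requests.get(url, headers={"User-Agent": "pageviews-example/0.1"})
    resp.raise_for_status()
    return {item["timestamp"]: item["views"] for item in resp.json()["items"]}

# Example: daily_views("Albert_Einstein")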

@Tagishsimon this proposal would not edit Wikidata. Instead, as part of the WDQS import process, it would upload pageviews in bulk from the pageview dump files directly into the Blazegraph index. It could do this every hour, and computation-wise it would be relatively inexpensive (I ran it as part of Sophox a few times).
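Roughly, that bulk step could look like the sketch below: batch the aggregated counts into SPARQL UPDATE requests and POST them to the local Blazegraph endpoint. The endpoint path and the predicate IRI are assumptions (they depend on the local WDQS/Blazegraph setup and on the still-undecided predicate naming), and a real importer would also delete the previous values rather than only insert new ones.

import requests

ENDPOINT = "http://localhost:9999/bigdata/namespace/wdq/sparql"  # assumed local path
PREDICATE = "<https://example.org/pageviews/total_page_views>"   # placeholder IRI

def upload_counts(counts, batch_size=10000):
    # counts: dict mapping an en.wikipedia page title to its view count.
    # Titles are assumed to be IRI-safe; a real importer would escape them.
    titles = list(counts)
    for i in range(0, len(titles), batch_size):
        triples = [
            f"<https://en.wikipedia.org/wiki/{t}> {PREDICATE} {counts[t]} ."
            for t in titles[i:i + batch_size]
        ]
        update = "INSERT DATA {\n" + "\n".join(triples) + "\n}"
        resp = requests.post(
            ENDPOINT, data=update.encode("utf-8"),
            headers={"Content-Type": "application/sparql-update"})
        resp.raise_for_status()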

Thanks for your input @Elya, @Yurik and @Tagishsimon.
Do you know who has to greenlight/authorise the upload to the Blazegraph index?
Assuming there is a huge backlog for this kind of request, where can I find it and how is it prioritised?

I would guess this is mostly a devops task - orchestrating the execution of an update script. Here's the working implementation - https://github.com/Sophox/sophox/blob/master/osm2rdf/updatePageViewStats.py

Simply run it locally near the Blazegraph server.

Updating WDQS (a graph query engine) with metadata about pageviews (by definition a time series) does not seem like the best idea from a data modeling standpoint. The GLAM use case is much better served by an API that returns pageviews over time; I would put the engineering effort into building such an API.

Thanks for your comments, @Yurik and @Nuria.
The GLAM use case applies to queries beyond the scope of existing tools like https://tools.wmflabs.org/glamtools/treeviews/ or https://tools.wmflabs.org/glamtools/glamorgan.html
WDQS would allow us to select a specific set of information that would be difficult to get by just intersecting categories (e.g. all paintings by a painter that were painted in a certain timeframe and are now part of the collection of a particular GLAM). Even more so for information that is not reflected in our category system (e.g. the provenance of a painting).
Can we achieve this with an API or is WDQS mandatory for this?

@christophbraun I think it would help to start a ticket describing your use case in detail. Keep in mind that pageviews (defined as content consumed by humans) do not really "apply" to Wikidata items. The bulk of the HTTP request activity on the site comes from bots creating/consuming content; bot requests account for about 70-80% of the total.

Adding this amount of data to WDQS does not seem to be a good idea. We might want to redefine the higher level problem that we are trying to address here, and maybe implement it in a different way.

@Gehel let's define "this amount of data", just for clarity. My back-of-the-envelope calculations:

  • each pageview statistics statement is a counter (8 bytes), a reference to the name of the article (8 bytes), and a property (8 bytes). In reality it might be a bit less (Blazegraph uses 7-bit packing), but we can ignore that for the sake of simplicity.
  • I do not count the page name as part of the statement because the page name is already stored for other statements - so no additional space is needed for it.
  • Blazegraph needs to store the same statement in 3 indexes.
  • Total - about 24*3*1M ~= 68MB per one million pages. And again, I suspect this number is about 4 times higher than the actual need because of the bit packing. Thus, if we just include the articles (rather than all redirects and talk pages) and assume there are about 25 million articles, we are looking at less than 2GB of disk space at most (a quick check follows below).
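A quick check of those numbers (each statement counted as three 8-byte components, each statement stored in three Blazegraph indexes):

BYTES_PER_STATEMENT = 8 * 3        # counter + article reference + property
INDEXES = 3                        # Blazegraph keeps each statement in 3 indexes
per_million = BYTES_PER_STATEMENT * INDEXES * 1_000_000
total = BYTES_PER_STATEMENT * INDEXES * 25_000_000
print(f"{per_million / 2**20:.1f} MB per million pages")   # ~68.7 MB
print(f"{total / 2**30:.2f} GB for 25M articles")          # ~1.68 GB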

How significant is a 2GB increase in the Blazegraph index?

I think before talking about bytes you need a use case. What is the use case here? As we mentioned earlier, the GLAM folks care about human pageviews (real eyeballs) on media files and pages; both cases are (and will be better) satisfied by existing analytics APIs. What is the use case for this request?

This is not about disk size or the number of bytes; it is about adding complexity to a system that already isn't stable. As @Nuria was saying, if we go back to a use case, we might find a way to provide a solution. I'm pretty sure that WDQS isn't the solution here.

One clear use case, for Wikimedia editors who aren't coders but who can write/modify SPARQL queries, is to sort and filter PetScan and Listeria results.

Most query results sets meant for human consumption would benefit from having the results sorted by pageviews. Needing to filter for a certain level of prominence is very common, and using the API isn't a workable solution for most people who would benefit from this.

@Nuria WDQS is currently used by the GLAM community to create queries that are beyond the scope of existing tools for a specific purpose, as mentioned above. Page views for Wikidata items, as well as page views for media files and articles linked to Wikidata items, selectable by time, would enable us to communicate the impact of our projects and efforts to institutional stakeholders. I don't have the technical know-how to debate the pros and cons of a WDQS-based versus an API-based solution. I think @Zache made a great point - in the end, what matters most is decent usability for the non-tech-savvy user. From my perspective, the best approach for highly individualised page view queries remains unclear. Are PagePiles a suitable approach for this problem? Can they be churned through the API?

Ordering by relevancy is a good use case. But I believe relevancy is about more than page views. We have T143424 to come up with a good measure for relevancy that can be used in queries as well.

Perhaps the QRank signal might be helpful here? The signal is computed in the Wikimedia cloud infrastructure (Toolforge) and gets periodically refreshed. It’s just aggregated pageviews, but I found it pretty useful in my own projects, which is why I contributed it to Toolforge.
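For anyone who wants to experiment with that signal, QRank publishes its scores as a downloadable gzipped CSV; the URL and the Entity/QRank column names below are assumptions based on the public dump and may change.

import csv
import gzip
import io
from urllib.request import urlopen

QRANK_URL = "https://qrank.wmcloud.org/download/qrank.csv.gz"  # assumed dump location

def load_qrank(wanted_qids):
    # Return {qid: rank} for the requested QIDs, e.g. {"Q937"}.
    # The full dump is large, so this simple sketch reads it into memory.
    ranks = {}
    with urlopen(QRANK_URL) as resp:
        with gzip.open(io.BytesIO(resp.read()), mode="rt") as f:
            for row in csv.DictReader(f):
                if row["Entity"] in wanted_qids:
                    ranks[row["Entity"]] = int(row["QRank"])
    return ranks

# Example: load_qrank({"Q937"})  # Albert Einstein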