
Get baseline measurements/expectations for splitting scholarly articles from Wikidata
Open, Needs Triage · Public

Description

As a product manager for Wikidata and WDQS, I want to know what quantifiable benefits to service reliability and quality I might expect to gain (or lose) by splitting scholarly articles out from the Wikidata graph, so that I can decide whether to move ahead with this plan and how to communicate it.

In order to move ahead with splitting out scholarly articles from WD, communicate this decision, and set expectations around the benefits of implementing this change, we should get some baseline measurements of the current state of scholarly articles in Wikidata and WDQS, and estimates about the effects of splitting them off.

AC:
Get the numbers for the following metrics:

  • percentage, number of Wikidata entities that are scholarly articles
  • percentage, number of WDQS queries per month that involve scholarly articles (including authors and publications)
    • percentage, number of the above queries that only involve scholarly articles (including authors and publications)
  • percentage, number of scientific papers that are connected to non-scientific paper items in WD (not including authors and publications)
  • given the current rate of growth of Wikidata, approximately how much time it would take for Wikidata to grow back to its current size if we removed scholarly articles
  • rate of growth of scholarly articles
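The "time to grow back" metric above reduces to simple arithmetic once a growth model is chosen. The following is a minimal sketch, not an agreed method: the function names are hypothetical, and the numbers in the usage lines are placeholders rather than measurements. It shows both a constant (linear) growth assumption and a compounding (exponential) one, since the two can give noticeably different answers.

```python
import math


def months_to_regrow_linear(removed_entities: float, growth_per_month: float) -> float:
    """Months until the removed volume is added back, assuming a constant
    absolute growth per month (an assumption, not a measured trend)."""
    if growth_per_month <= 0:
        raise ValueError("growth_per_month must be positive")
    return removed_entities / growth_per_month


def months_to_regrow_exponential(current_size: float, removed_entities: float,
                                 monthly_growth_rate: float) -> float:
    """Months until the remaining graph reaches current_size again, assuming
    the graph grows by a fixed fraction each month (also an assumption)."""
    remaining = current_size - removed_entities
    if remaining <= 0 or monthly_growth_rate <= 0:
        raise ValueError("remaining size and growth rate must be positive")
    return math.log(current_size / remaining) / math.log(1 + monthly_growth_rate)


# Placeholder inputs for illustration only; replace with measured values.
linear_months = months_to_regrow_linear(40_000_000, 1_000_000)
exponential_months = months_to_regrow_exponential(100_000_000, 40_000_000, 0.01)
print(linear_months, round(exponential_months, 1))
```

Which model fits better depends on the measured growth curve, which this ticket asks for separately.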

Event Timeline

PKM added a subscriber: PKM.

Can I get clarification about what is meant, practically, by "splitting scholarly articles out"?
Does this mean a backend change in how that content is stored and accessed by the query system (but otherwise invisible to the general reader of Wikidata)? Or does it mean removing these items from WD completely?

@LWyatt, "splitting scholarly articles out" here refers to separating the subgraph of scholarly articles (possibly copying over directly relevant items like authors) from the larger Wikidata graph, so that the two would be independent graphs. The articles would exist independently of WD, and queries that connect articles to non-articles would require functional federation. This would definitely affect some known workflows and use cases (e.g. Scholia), but part of this ticket is also to assess what percentage of queries might be affected by this change.

For larger context, this is not to say we're committed to this split yet, but we are exploring strategies for scaling Wikidata (and mitigating catastrophic failure) that are directly related to the max size that Blazegraph is able to handle.

"For larger context, this is not to say we're committed to this split yet, but we are exploring strategies for scaling Wikidata (and mitigating catastrophic failure) that are directly related to the max size that Blazegraph is able to handle."

What is that max size?

Also, there are lots of other relevant parameters that are often interdependent — we tried to start documenting them here — help most welcome.

Some of the requested statistics are listed on Scholia, currently on the front page: https://scholia.toolforge.org/

"percentage, number of Wikidata entities that are scholarly article":
37,246,721 scholarly articles. Taking the total as roughly 97 million items, that is 37/97 ≈ 38%, so close to 40% of Wikidata entities are scholarly articles.
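As a quick check of that ratio, here is a small sketch. The article count is the Scholia figure quoted above; the total of 97 million items is an assumption read off the "97" in the ratio, not a separately verified number, and the helper name is hypothetical.

```python
def share_percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


scholarly_articles = 37_246_721  # figure quoted from the Scholia front page
total_items = 97_000_000         # assumption: "97" read as ~97 million items

print(f"{share_percent(scholarly_articles, total_items):.1f}%")  # prints "38.4%"
```

So "~40%" is a slight round-up; the quoted figures give a little over 38%.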

"percentage, number of WDQS queries per month that involve scholarly article (including authors and publication)"
For Scholia, we have recently turned a number of queries into more templated queries and now automatically add "# tool: scholia" as a comment to the queries, so it should be possible for Wikimedia employees to count the number of Scholia queries (perhaps that was possible before by the referer field?). I have had the impression that Scholia's queries were a low number compared to Magnus Manske's tools.
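Counting queries by that comment could look roughly like the sketch below. It assumes the raw query texts are available as plain strings; how WDQS logs are actually stored and accessed is not described here, so the input format, the function names, and the sample queries are all hypothetical. Only the "# tool: scholia" comment convention comes from the comment above.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# Matches a SPARQL comment line such as "# tool: scholia" anywhere in the query.
TOOL_COMMENT = re.compile(r"^\s*#\s*tool:\s*(\S+)", re.IGNORECASE | re.MULTILINE)


def tool_of(query: str) -> Optional[str]:
    """Return the lower-cased tool name from a '# tool: NAME' comment line, if any."""
    match = TOOL_COMMENT.search(query)
    return match.group(1).lower() if match else None


def count_by_tool(queries: Iterable[str]) -> Counter:
    """Count queries per tool tag; queries without a tag are counted as 'untagged'."""
    return Counter(tool_of(q) or "untagged" for q in queries)


# Hypothetical sample queries, for illustration only.
sample_queries = [
    "# tool: scholia\nSELECT ?work WHERE { ?work wdt:P31 wd:Q13442814 } LIMIT 10",
    "SELECT ?item WHERE { ?item wdt:P31 wd:Q5 } LIMIT 10",
    "PREFIX wd: <http://www.wikidata.org/entity/>\n# tool: Scholia\nSELECT * WHERE { ?s ?p ?o } LIMIT 1",
]
counts = count_by_tool(sample_queries)
print(counts["scholia"], counts["untagged"])  # prints "2 1"
```

This only sees queries issued after the tagging was introduced, and only comments that sit on their own line; older Scholia traffic would need a different signal, such as the referer field mentioned above.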

"percentage, number of scientific papers that are connected to non-scientific paper items in WD"
Quite a lot of scholarly papers are connected to a journal item, to one or more topic items, and to a language item, and some to a notable author who has a Wikipedia article (and therefore needs an item in Wikidata). According to the statistics on Scholia, there are currently 14,211,431 topic links. Works may have multiple links, so perhaps fewer than 10,000,000 works have one or more topics. We should aim for most works to have a topic, so I suspect this number will grow.
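The difference between the number of topic links and the number of works that have at least one topic can be illustrated with a small sketch. It assumes the links are available as (work, topic) pairs; the identifiers below are made-up placeholders, not real Wikidata items, and the function name is hypothetical.

```python
from typing import Iterable, Tuple


def works_with_topic(topic_links: Iterable[Tuple[str, str]]) -> int:
    """Count distinct works that appear in at least one (work, topic) link."""
    return len({work for work, _topic in topic_links})


# Placeholder identifiers for illustration only.
links = [
    ("work-1", "topic-a"),
    ("work-1", "topic-b"),  # second topic on the same work
    ("work-2", "topic-a"),
]
print(len(links), works_with_topic(links))  # prints "3 2"
```

The link count is therefore only an upper bound on the number of connected works; the exact figure needs a distinct count over works.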

"rate of growth of scholarly articles"
These statistics are updated at wikicite.org (http://wikicite.org/statistics.html). I suppose it is Jakob Voß (@nichtich) who updates these numbers? The graph shows a bit of a plateau recently for publications, while there is a recent increase in citations. I would think that James Hare is adding the citations? As far as I remember, the citations have been mentioned as an issue of concern with respect to Wikidata's data size. They probably account for a good share of the triples.

Thanks everyone. For context: this is just one of many options we are currently investigating in order to build an overview. We think it is important to have a larger discussion about how to move forward with the Query Service, but we need to know more about each of the options we have. For each option, we are trying to determine what it actually means in terms of how much breathing room it buys us, how many people would be affected, and so on. This ticket is one of the tasks for that. We hope to have the larger overview ready for discussion soon.