
Get baseline measurements/expectations for splitting scholarly articles from Wikidata
Closed, ResolvedPublic

Description

As a product manager for Wikidata and WDQS, I want to know what quantifiable benefits to service reliability and quality I might expect to gain (or lose) by splitting scholarly articles out from the Wikidata graph, so that I can decide whether to move ahead with this plan and how to communicate it.

In order to move ahead with splitting out scholarly articles from WD, communicate this decision, and set expectations around the benefits of implementing this change, we should get some baseline measurements of the current state of scholarly articles in Wikidata and WDQS, and estimates about the effects of splitting them off.

AC:
Get the numbers for the following metrics:

  • percentage, number of Wikidata entities that are scholarly articles
    • number of triples in Wikidata related to scholarly articles
  • percentage, number of WDQS queries per month that involve scholarly articles (including authors and publications)
    • percentage, number of the above queries that only involve scholarly articles (including authors and publications)
  • percentage, number of scientific papers that are connected to non-scientific paper items in WD (not including authors and publications)
  • given the current rate of growth of Wikidata, approximately how much time it would take for Wikidata to grow back to its current size if we removed scholarly articles
  • rate of growth of scholarly articles
  • Identify the number of authors that were probably added solely for the purpose of being mentioned in scholarly articles (i.e. separating scholarly articles would also mean these author items become isolated)
    • Number of authors connected to other subgraphs in Wikidata vs only connected to scholarly articles

Event Timeline

PKM added a subscriber: PKM.

Can I get clarification about what is meant, practically, by "splitting scholarly articles out"?
Does this mean something in the backend about how that content is stored/accessed by the query system (but otherwise invisible to the general reader of Wikidata)? Or does it mean removing these items from WD completely?

@LWyatt , "splitting scholarly articles out" here refers to separating out the subgraph of scholarly articles -- possibly copying over directly relevant items like authors -- from the larger Wikidata graph so that they would be independent graphs. They would exist independent from WD, and queries that require connecting articles to non-articles would require functional federation. This would definitely affect some known workflows and use cases (i.e. Scholia), but part of this ticket is to also to assess what percentage of queries might be affected by this change.

For larger context, this is not to say we're committed to this split yet, but we are exploring strategies for scaling Wikidata (and mitigating catastrophic failure) that are directly related to the max size that Blazegraph is able to handle.

For larger context, this is not to say we're committed to this split yet, but we are exploring strategies for scaling Wikidata (and mitigating catastrophic failure) that are directly related to the max size that Blazegraph is able to handle.

What is that max size?

Also, there are lots of other relevant parameters that are often interdependent — we tried to start documenting them here — help most welcome.

Some of the statistics that are wanted are listed on Scholia, currently on the frontpage: https://scholia.toolforge.org/ (UPDATE: now here: https://scholia.toolforge.org/statistics)

"percentage, number of Wikidata entities that are scholarly article":
37.246.721 Scholarly articles, so 37/97 ~ 40% are scholarly articles.
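For reference, a headline count like this corresponds to a sketch along these lines (a count over ~37 million items may well time out on the public WDQS endpoint; a dump or an engine such as QLever is more forgiving):

```
# Count items that are an instance of (P31) "scholarly article" (Q13442814).
SELECT (COUNT(?item) AS ?scholarlyArticles) WHERE {
  ?item wdt:P31 wd:Q13442814 .
}
```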

"percentage, number of WDQS queries per month that involve scholarly article (including authors and publication)"
For Scholia, we have recently turned a number of queries into more templated queries and now automatically add "# tool: scholia" as a comment to the queries, so it should be possible for Wikimedia employees to count the number of Scholia queries (perhaps that was possible before via the referer field?). I have had the impression that Scholia's queries were a low number compared to those of Magnus Manske's tools.
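As a minimal illustration of that tagging convention (the query body is just a placeholder; only the leading comment matters for counting):

```
# tool: scholia
# The fixed comment above travels with the request to the WDQS endpoint,
# so server-side logs can be grouped and counted per tool.
SELECT ?work ?workLabel WHERE {
  ?work wdt:P50 wd:Q80 .   # example: works whose author (P50) is Tim Berners-Lee (Q80)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
```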

"percentage, number of scientific papers that are connected to non-scientific paper items in WD"
Quite a lot of scholarly papers are connected to a journal item, to one or more topic items, to a language item, and some to a notable author who is in Wikipedia (so we need the item in Wikidata). Currently, according to the statistics on Scholia, there are 14.211.431 topic links. Works may have multiple links, so perhaps only <10.000.000 works have one or more topics. We should aim for most works to have a topic, so I suspect this will grow.

"rate of growth of scholarly articles"
wikicite.org updates these statistics: http://wikicite.org/statistics.html I suppose it is Jakob Voß (@nichtich) who updates these numbers? The graph shows a bit of plateauing recently for publications, while there is a recent increase in citations. I would think that James Hare is doing the citations? As far as I remember, the citations have been mentioned as an issue of concern with respect to Wikidata data size. They probably account for a good deal of the number of triples.

Thanks everyone. For context: this is just one of many options we are currently investigating to create an overview of our options. We think it is important to have a larger discussion about how to move forward with the Query Service but we need to know more about each of the options we have. We are currently trying to determine for each option what it actually means in terms of how much breathing room it buys us, how many people would be affected, etc. That's one of the tasks for this. We hopefully have the larger overview for discussion soon.

This seems like an arbitrary way to cut up Wikidata. It very much smacks of "let's take the largest subset of our dataset and evict it," without consideration to why the dataset should be cut up this way.

What are the boundaries of these new projects? Is Wikidata a graph for everything except scholarly articles? What about books, or other forms of citable media (i.e., any and all media)? What about scholarly articles that are relevant to the Wikidata graph in ways other than WikiCite's massive citation graph?

I am very interested in the subgraph conversation and how we can envision Wikidata as part of a massive linked data ecosystem without itself being overly burdened. I think evicting arbitrary subsets of the data is just not good strategy.

If I were to suggest a change, perhaps we could divide the graph along "media" and "not media". (We can subsequently decide if we want to split "not media" further.) This I think would draw lines that are coherent and not arbitrary. The scholarly articles would be a part of the media graph project. And there would be free cross-referencing between the sites. Do you think this would achieve your goals?

Thanks for your thoughts here @Harej; it's really helpful to have these insights from someone closer to the (Wiki)data content itself.

You are correct that this specific ticket is identifying the largest subset of data to split off from the Wikidata graph. The primary rationale behind this is to explore mitigation strategies for a worst case scenario of catastrophic failure of Blazegraph, and to understand what options we have available to preserve limited functionality of WD(QS) rather than have no functionality in this scenario. In that regard, identifying the largest subset of data seems reasonable as it would directly address the potential problem of hitting Blazegraph's max size constraint.

To your other point though, it is only one way of splitting the graph. I am definitely interested in exploring other, more reasonable ways we could divide the graph and the potential benefits they may have. For your "(not) media" suggestion, did you have a clear heuristic in mind for identifying this distinction? If so, it'd be great to start a new ticket to investigate the possible benefits of splitting the graph along the lines you suggest -- hopefully, in the case that we do need/want to split the graph, we would then know that the solution works (better) for everyone!

I'm interested in others' opinions as well because I am far from the only perspective in the room.

First: at what levels would this graph division take place? Would this be something largely behind the scenes, not visible to the Wikidata community unless you're working directly with the graph query API? Or would this be a highly visible change, on the level of splitting Wikidata into distinct new Wikimedia projects? That I think could affect to what extent getting the details right "matters".

I am with @Harej here. Focusing on the largest data set is not the right approach. As I have indicated in similar discussions elsewhere, there will be a next large subset and this one will also be large. From the field of chemistry, 60M items is nothing. The number of species ever observed is in the millions. There are many things that easily go into the millions. At this moment, we have a small subset of chemicals in Wikidata (~1.2 million); because of the growing pains this is artificially low (real chemical databases have >102 M records of chemicals experimentally studied). I regularly run into missing content (even just looking at the English Wikipedia), and am very selective in what I add at this moment.

As soon as you remove one big blob, all that will happen is that the void will be very quickly filled by another big blob. Now, if a single database is not possible, then the overall design must simply change: everything should become a separate namespace, and we should make sure the federation works extremely well. The reason why Wikidata works so well is that I can move from one topic to underlying data sources because everything is integrated. Please take that into consideration.

In fact, if the point is just to split out a blob and see what happens, then please focus on something more volatile than the knowledge about reality, and remove for example things that change every year. For example, remove all humans, all of them, and organizations. There will be a new human tomorrow. When it comes to facts, who cares who did or studied it; just focus on what happened or what was discovered.

Wikidata is not Facebook.

[update] this is related to the question in the task description: "given the current rate of growth of Wikidata, approximately how much time it would take for Wikidata to grow back to its current size if we removed scholarly articles"

I would not call it evicting scholarly articles. Scholarly articles are currently a major driving force for Wikidata; however, their size is problematic because it is becoming more difficult to see other topics (sometimes unrelated to scholarly articles). I have thought about, and been working towards, a federated landscape of linked wikibases and other semantic web resources for a while now. Building such a federated landscape is already easy peasy. We have wbstack, wikibase docker, but also platforms like GraphDB, Virtuoso, Stardog (to just mention a few). It would take a simple hackathon and some motivated users to build a nice prototype.

But setting up such a federated landscape is the easy part. What is more difficult is to be able to map between the different endpoints (Wikidata, wikibases, other RDF stores). Given its size, the subgraph of scholarly articles simply deserves its own metal to excel beyond the current limitations. The main question then becomes how to align this new subgraph with the other parts of Wikidata, to which it intrinsically links (as @Daniel_Mietchen says).

So I am actually in favour of separating the subgraph of scholarly articles from Wikidata (the incubator) to a node in Wikidata (the linked knowledge graph) and the global semantic web.

I indeed said: Moving away from Wikidata to Wikidata :) We need a new term for the knowledge graph where the current Wikidata is an index or sort of DNS to other (semantic web) nodes.

Regarding the question of the "growth of scientific literature", there is a good bit of literature on this, and sometimes conflated with the topic of "growth of science". I started collecting some knowledge about this: https://scholia.toolforge.org/topic/Q107292942

@Andrawaag "it is becoming more difficult to see other topics (sometimes unrelated to scholarly articles)" Do you have concrete examples on this? It may sometimes be difficult to find out what is a topic and what is a scientific articles, but once a few scientific articles about the topic has been annotated with the "main topic" property then the topic usually shows up on the top.

Hi folks, please stick to the Phabricator etiquette as described at https://www.mediawiki.org/wiki/Bug_management/Phabricator_etiquette . This is not the place to discuss if these items should be moved out or not. @MPhamWMF, don't see these comments as any indicator of the community view.

This is not the place to discuss if these items should be moved out or not.

This is a confusing statement seeing as the task is explicitly about "splitting scholarly articles from Wikidata". If this is just from a backend perspective the task should be clarified as such.

Going back to the quantifiable: "percentage, number of scientific papers that are connected to non-scientific paper items in WD (not including authors and publications)"

We would hope that every scientific paper has a topic annotation with one or more of the Wikidata items - either non-scientific paper items or - in rare instances - scientific paper items. Currently we "only" have around 15 million of these links. "Links from works to their main subjects": https://scholia.toolforge.org/statistics

All scientific papers could also have the language set.

"percentage, number of WDQS queries per month that involve scholarly articles (including authors and publications)"

It is unclear to us Scholia people how much load we are putting on WDQS. We have a tendency to do multiple SPARQL queries on each page, and that might not be a problem or it might be a very bad thing. I recall a statistic on WDQS queries showing that it was mostly Magnus Manske's tools that put load on WDQS, but I might remember it wrongly.

In Scholia, we now add a "# tool: scholia" on top of most of our queries. I am not aware of other tools doing that. Perhaps it would be an idea for others to do that as well, so that Wikimedia Foundation people could more easily do statistics with respect to the tools (perhaps there should not be a space between "#" and "tool").

This is not the place to discuss if these items should be moved out or not.

This is a confusing statement seeing as the task is explicitly about "splitting scholarly articles from Wikidata". If this is just from a backend perspective the task should be clarified as such.

No it's not, please have a look at the task description. This is about getting metrics.

@Multichill the opening says "so that I can decide whether to move ahead with this plan and how to communicate it." -- it would help if that linked to a separate task, whose implementation depended on the outcome of this one. In the absence of that, this seems like the best and only? place in Phab to discuss the impacts of the split.

@MPhamWMF Is this being evaluated as a one-off / one-time split, or is it a more general eval of the performance considerations from switching from a monolithic WD graph to a set of graph shards, with some max size (what's the rough range you imagine beyond which things stop scaling)? Any thoughts on performance implications of the latter may also be of interest to many of the large wikibase users, who regularly want to query a combination of at least one specialist base and WD itself, mediated by some query interface.

No it's not, please have a look at the task description. This is about getting metrics.

Can you elaborate on the "this plan" in that description?

@Sj This is primarily being evaluated as a last resort mitigation in the case of catastrophic failure, specifically having to do with max size limitations of Blazegraph. The primary aim is to determine the best way of keeping WD/QS minimally functional in the event of this undesired scenario -- basically we're measuring out a parachute we hope to not have to use, and if we did, would intend it to be a temporary state while we resolve the underlying larger issues. If we discover along the way a better way of splitting the graph that improves both the technical performance of the machines and how users use it, we will consider incorporating these learnings into a more permanent scaling strategy.

With regard to a forum for discussion, we are in the process of preparing more official communications that provide an overview of the situation, and a more dedicated venue for discussion than phab tickets. We appreciate everyone's patience as we work on finalizing things.

There is a recent request to make items for scholarly articles more stand-alone, i.e. one that would ensure that items could be used without resolving author items. This would simplify storing them in a separate Wikibase.

I still have to go through T282139 in detail, but it seems it has mostly become an analysis over (the somewhat static corpus of) scholarly articles in Wikidata rather than Wikidata as a whole, given the numbers involved.

Some of the statistics that is wanted are listed on Scholia, currently on the frontpage: https://scholia.toolforge.org/ (UPDATE: now here: https://scholia.toolforge.org/statistics)

"percentage, number of Wikidata entities that are scholarly article":
37.246.721 Scholarly articles, so 37/97 ~ 40% are scholarly articles.

Could I get an idea of what the 97 was and where the number was listed maybe?

CBogen renamed this task from Get baseline measurements/expectations for splitting scholarly articles from Wikidata to [EPIC] Get baseline measurements/expectations for splitting scholarly articles from Wikidata.Aug 5 2021, 1:35 PM
CBogen added a project: Epic.
CBogen renamed this task from [EPIC] Get baseline measurements/expectations for splitting scholarly articles from Wikidata to Get baseline measurements/expectations for splitting scholarly articles from Wikidata.Aug 5 2021, 1:38 PM
CBogen removed a project: Epic.

"percentage, number of Wikidata entities that are scholarly article":
37.246.721 Scholarly articles, so 37/97 ~ 40% are scholarly articles.

Could I get an idea of what the 97 was and where the number was listed maybe?

Hmmm... Maybe I meant 94. On the Danish frontpage of Wikidata it states 94.564.779 data elements.

37321680 / 94564779 = 0.39466787100512335 ~ 39%

Wikicite.org (Jakob Voß), http://wikicite.org/statistics.html, states 39 994 937 = 43% for 2021-06-28. The Scholia statistics are only for the "scholarly article" item. I think Voß counts instances of scholarly + non-scholarly publications.

Wikicite.org uses an extremely broad definition of publication that includes far more than scholarly sources. There are some thousands of classes that are counted as subclasses of “publication”.
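A sketch of that broader count might look like the following; the root class Q732577 ("publication") is an assumption here -- Wikicite may well use an explicit list of classes instead -- and a subclass walk at this scale is likely to time out on WDQS:

```
# Count anything that is an instance of some subclass of "publication"
# (Q732577 -- assumed root class; likely needs a dump or QLever in practice).
SELECT (COUNT(?item) AS ?publications) WHERE {
  ?item wdt:P31/wdt:P279* wd:Q732577 .
}
```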

@AKhatun_WMF, when you write "authors connected to other subgraphs", do you mean subgraphs within Wikidata (so, excluding external identifiers), or also graphs from other resources part of, for example, the Linked Open Data Cloud?

@AKhatun_WMF, when you write "authors connected to other subgraphs", do you mean subgraphs within Wikidata (so, excluding external identifiers), or also graphs from other resources part of, for example, the Linked Open Data Cloud?

I mean within wikidata.

Thanks everyone. For context: this is just one of many options we are currently investigating to create an overview of our options. We think it is important to have a larger discussion about how to move forward with the Query Service but we need to know more about each of the options we have. We are currently trying to determine for each option what it actually means in terms of how much breathing room it buys us, how many people would be affected, etc. That's one of the tasks for this. We hopefully have the larger overview for discussion soon.

Is there a public version of that overview of the different options? There is WikiProject Limits of Wikidata for such purposes, and it would certainly welcome some more detailed information about the various known or suspected limits and how they interact with each other and with potential solutions.

1,939,738 authors -> https://w.wiki/3o2i

Trying to get all unique properties of these authors times out.

Sampling 50k authors for properties with an author as subject (https://w.wiki/3o3C) gives these results:

  • 96% are linked to a profession (P106)
  • 94% are linked to country of citizenship (P27)
  • 90% are linked to a place of birth (P19)
  • 36% are linked to an employer (P108)
  • 17% are linked to a notable work (P800)
  • 9% are linked to their doctoral advisor (P184)
  • 8% are linked to the political party they are a member of (P102)

These specific properties can be used to calculate the overall statistics. The inverse properties (where the author is the object) seem a bit trickier and I'm running into timeouts there. I hope this helps.
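The linked query is not reproduced here, but a rough sketch of this kind of coverage check over a (non-random) sample of author items, using occupation (P106) as the example property, could look like:

```
# Sample authors of scholarly articles and estimate what share of them
# carry an occupation (P106) statement. The 50 000 cut-off mirrors the
# sample size mentioned above and is not a random sample.
SELECT (COUNT(DISTINCT ?author) AS ?sampled)
       (COUNT(DISTINCT ?hasOccupation) AS ?withP106) WHERE {
  {
    SELECT DISTINCT ?author WHERE {
      ?article wdt:P31 wd:Q13442814 ;
               wdt:P50 ?author .
    } LIMIT 50000
  }
  OPTIONAL {
    ?author wdt:P106 ?occupation .
    BIND(?author AS ?hasOccupation)
  }
}
```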

We now have author names as detailed strings, so queries to P50 won't necessarily need to be considered.

The overall situation is comparable to Commons, where "depicts" statements link to Wikidata items.

Gehel added a subscriber: Gehel.

I'm closing this as the statistics have been collected and published. The larger discussion should probably continue on this talk page: https://www.wikidata.org/wiki/Wikidata:Query_Service_scaling_update_Aug_2021

"percentage, number of scientific papers that are connected to non-scientific paper items in WD"
Quite a lot of scholarly papers are connected to a journal item, to one or more topic items, to a language item, and some to a notable author who is in Wikipedia (so we need the item in Wikidata). Currently, according to the statistics on Scholia, there are 14.211.431 topic links. Works may have multiple links, so perhaps only <10.000.000 works have one or more topics. We should aim for most works to have a topic, so I suspect this will grow.

Since I wrote ItemSubjector, the number of links to topics via "main subject" has been increasing by about ½ mio. per week. Because of a timeout we don't know how many articles are currently missing at least one "main subject", but according to the data in QLever it was 27M out of 37M a few months ago when they last updated.
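For completeness, that missing-topic count corresponds to a sketch like the one below, which -- as noted -- times out on WDQS at the current scale and is better run against QLever or a dump:

```
# Scholarly articles without any main subject (P921) statement.
SELECT (COUNT(?article) AS ?withoutTopic) WHERE {
  ?article wdt:P31 wd:Q13442814 .
  FILTER NOT EXISTS { ?article wdt:P921 [] . }
}
```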