
Get estimates for size of astronomical objects and queries in Wikidata graph
Closed, Resolved, Public

Description

As a user, I want to know the potential cost/benefit estimates for splitting off astronomical objects from the Wikidata graph, so that WDQS can scale to my needs.

This ticket is part of WDQS disaster planning and reflects research into mitigation strategies for a catastrophic failure of Blazegraph, specifically the case in which the Wikidata graph becomes too big for Blazegraph to keep supporting. It is not a commitment to a long-term state of WDQS or Wikidata, but part of the disaster mitigation playbook for a worst-case scenario.

For astronomical objects in Wikidata (https://www.wikidata.org/wiki/Q6999):

  • Determine their size in Blazegraph (absolute and as a percentage), and how much runway time deleting them gains us at the current Wikidata growth rate.
  • Estimate how many queries reference/touch astronomical objects in the Wikidata graph.

Event Timeline

Astronomical objects are structured hierarchically, so not everything is a direct instance of Q6999 (unlike scholarly articles).

Considering all subclasses of Q6999, astronomical objects make up ~9% of all Wikidata entities. (sparql query)
The number of triples 'related to' these entities is approximately 7.5% (~1B) of all Wikidata triples. This figure is approximated from the top 10 subclasses (which account for 7% of all entities).
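For illustration, a count over all subclasses can be expressed with a SPARQL property path. The sketch below is not the query linked above; it only builds a query string of the usual Wikidata shape, using P31 (instance of) and P279 (subclass of). The helper name `build_count_query` is hypothetical, and a count this broad may well time out on a public endpoint.

```python
def build_count_query(class_qid: str) -> str:
    """Build a SPARQL query that counts items which are instances of
    class_qid or of any of its transitive subclasses.

    Assumption: the wd: and wdt: prefixes are predefined by the endpoint,
    as they are on the Wikidata Query Service.
    """
    return (
        "SELECT (COUNT(DISTINCT ?item) AS ?count) WHERE {\n"
        # P31 = instance of; P279* = zero or more subclass-of hops
        f"  ?item wdt:P31/wdt:P279* wd:{class_qid} .\n"
        "}"
    )


query = build_count_query("Q6999")
print(query)
```

The `wdt:P279*` step is what picks up items that are not direct instances of Q6999.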

Counts of queries and triples for astronomical objects, along with the top ~300 large subgraphs, were computed here: Wikidata_Subgraph_Query_Analysis.
For the specific case of astronomical objects (and only astronomical objects), a list of all subclasses of Q6999 was obtained and manually inspected for relevance. The list is transitive: it includes subclasses of subclasses, and so on.
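A minimal sketch of how such a transitive subclass list could be collected, assuming the direct subclass relations are already available as an in-memory mapping. The function name and the toy hierarchy are hypothetical (apart from Q6999, the IDs are placeholders, not real Wikidata items), and this is not necessarily how the analysis itself gathered the list.

```python
from collections import deque


def all_subclasses(root, direct_subclasses):
    """Return every class reachable from root via subclass links.

    direct_subclasses maps a class ID to its direct subclasses.
    The root itself is excluded, and cycles are handled by tracking
    which classes have already been seen.
    """
    seen = set()
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in direct_subclasses.get(current, ()):
            if child != root and child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Toy hierarchy with placeholder IDs (assumption, for illustration only).
toy_hierarchy = {
    "Q6999": ["A", "B"],
    "A": ["C"],
    "C": ["D", "A"],  # cycle back to A to show that traversal terminates
}

subclasses = all_subclasses("Q6999", toy_hierarchy)
print(sorted(subclasses))
```

The resulting set is what would then be inspected manually for relevance.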

  • Percent of triples: 8.7%
  • Percent of entities: 8.9%
  • Days to recover: 245
  • Query count: 2.5M
  • Percent of queries: 1.3%
  • Percent time of all queries: 0.5%
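As a rough sanity check, the figures above can be related to one another with simple arithmetic. The sketch below assumes linear growth, and assumes that "days to recover" means the time for the graph to grow back to its current size after the subgraph is removed. The function names are hypothetical and the derived values are back-of-the-envelope estimates, not numbers reported in this ticket.

```python
def implied_total(part, percent):
    """Total implied by a part that represents `percent` percent of it."""
    return part / (percent / 100.0)


def implied_daily_growth_percent(percent_removed, days_to_recover):
    """Daily growth, as a percent of the current graph size, under the
    assumption that growth is linear over the recovery period."""
    return percent_removed / days_to_recover


# ~1B triples quoted as 7.5% of all triples (the earlier approximation)
total_triples = implied_total(1e9, 7.5)

# 2.5M queries quoted as 1.3% of all queries
total_queries = implied_total(2.5e6, 1.3)

# 8.7% of triples quoted as buying 245 days of runway
daily_growth = implied_daily_growth_percent(8.7, 245)

print(f"implied total triples: {total_triples:.3g}")
print(f"implied total queries: {total_queries:.3g}")
print(f"implied daily growth: {daily_growth:.4f}% of current size")
```

Under these assumptions the totals come out at roughly 13 billion triples and roughly 190 million queries, with growth of about 0.036% of the current graph size per day.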