As a Wikidata user, I want Wikidata to function with limited data in the case of Blazegraph failure due to reaching maximum graph size, rather than being completely non-functional.
This ticket is part of WDQS disaster planning and reflects research into mitigation strategies for catastrophic failure of Blazegraph: specifically, the case in which the Wikidata graph becomes too big for Blazegraph to continue supporting. This is not a commitment to a long-term state of WDQS or Wikidata, but part of the disaster mitigation playbook for a worst-case scenario.
If Blazegraph reaches the maximum number of triples it can store, we may need to prioritize which data to keep and which to delete (temporarily, until we can reload it from dumps once our graph backend can support it). This task is to determine how much space we can save by deleting the candidates below. Note that these candidates are not ordered by priority, and this list does not take anything into account beyond the size of these data in Blazegraph.
For each candidate in the list, determine its size in Blazegraph (actual and as a percentage), and how much runway time deleting it gains us at the current Wikidata growth rate.
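The runway calculation itself is straightforward once the two inputs are measured. A minimal sketch (all figures below are placeholders, not measured values):

```python
# Sketch: runway time gained by deleting a candidate slice of the graph.
# Both inputs must come from actual measurements; the numbers in the
# example call are made up for illustration.

def runway_days(deleted_triples: float, growth_triples_per_day: float) -> float:
    """Days of growth headroom gained by freeing `deleted_triples`."""
    return deleted_triples / growth_triples_per_day

# Hypothetical example: deleting 1.2 B triples at a growth rate of
# 2 M triples/day buys ~600 days of runway.
print(runway_days(1.2e9, 2e6))  # 600.0
```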
Vertical data:
[] All non-English labels
[] All labels that are covered by fallbacks (e.g. the same name duplicated across 200 languages)
[] All labels
[] All descriptions
[] All aliases
[] All labels, descriptions, and aliases
[] External identifiers
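Sizing the vertical candidates amounts to COUNT queries over the relevant predicates in the Wikidata RDF model (rdfs:label for labels, schema:description for descriptions, skos:altLabel for aliases). A sketch that builds such query strings for the label candidates; the exact query shapes are assumptions and would be run against the WDQS endpoint, not executed here:

```python
# Sketch: build SPARQL COUNT queries for the label candidates above.
# The query shapes are assumptions about how these slices map onto the
# Wikidata RDF model; they are meant to be run against the WDQS endpoint.

def count_labels_query(exclude_lang=None):
    """COUNT of rdfs:label triples, optionally excluding one language.

    With exclude_lang="en", the query counts all non-English labels
    (the first vertical candidate).
    """
    filter_clause = (
        'FILTER(LANG(?label) != "%s")' % exclude_lang if exclude_lang else ""
    )
    return (
        "SELECT (COUNT(*) AS ?n) WHERE {\n"
        "  ?item rdfs:label ?label .\n"
        "  " + filter_clause + "\n"
        "}"
    )

# All labels vs. all non-English labels:
q_all = count_labels_query()
q_non_en = count_labels_query(exclude_lang="en")
```

The same pattern applies to descriptions and aliases by swapping in schema:description or skos:altLabel.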
Horizontal data:
[] Scholarly papers: http://wikicite.org/statistics.html
[] Scholarly papers + authors + scientific journals + identifiers
[] Astronomical objects: https://www.wikidata.org/wiki/Q6999
[] Items with fewer than 3 backlinks
[] Look at the distribution of backlink counts, and use it to determine a more sensible backlink threshold
[] All statements of a specific datatype: monolingual text (not important for querying, says Lydia)
[] Non-normalized values (units, dates, external IDs)
[] Non-top-ranked statements: https://grafana.wikimedia.org/d/000000175/wikidata-datamodel-statements?orgId=1&refresh=30m
[] Every item with ORES quality score lower than X or no ORES score
[] Items (without hot properties) that are not being queried
Other pieces that would be good to know even if we can't drop them:
[] All Properties
[] All sitelinks
[] All classifying statements (aka the ontology) - this would be all statements using the Property "subclass of", "instance of", "part of", "has part" or "parent taxon"
[] All humans
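For the classifying statements, the named properties correspond to well-known Wikidata property IDs (instance of = P31, subclass of = P279, part of = P361, has part = P527, parent taxon = P171). A hedged sketch of a COUNT query over those properties; note it only counts the truthy (wdt:) statements, so the full statement-node footprint would be larger:

```python
# Sketch: COUNT query for the classifying statements (the ontology).
# Property IDs are the standard ones for the properties named in the
# ticket; counting wdt: triples only captures the truthy statements.

CLASSIFYING_PROPS = ["P31", "P279", "P361", "P527", "P171"]

def count_classifying_query(props=CLASSIFYING_PROPS):
    """COUNT of truthy statements using any of the classifying properties."""
    values = " ".join("wdt:%s" % p for p in props)
    return (
        "SELECT (COUNT(*) AS ?n) WHERE {\n"
        "  VALUES ?prop { " + values + " }\n"
        "  ?item ?prop ?value .\n"
        "}"
    )

q = count_classifying_query()
```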