As a Wikidata user, I want Wikidata to function with limited data in the case of Blazegraph failure due to reaching maximum graph size, rather than being completely non-functional.
This ticket is a part of WDQS disaster planning, and reflects research into mitigation strategies for catastrophic failure of Blazegraph: specifically in the case that the Wikidata graph becomes too big for Blazegraph to continue supporting. This is not a commitment to a long term state of WDQS or Wikidata, but part of the disaster mitigation playbook in a worst case scenario.
In the case of Blazegraph reaching the maximum number of triples it can store, we may need to prioritize which data to keep and which to delete temporarily, until our graph backend can support it again and we can reload it from dumps. This task is to determine how much space we can save by deleting the candidates below. Note that these candidates are not ordered by priority, and this list does not take into account anything beyond the size of these data in Blazegraph.
For each candidate in the list, determine its size in Blazegraph (absolute and as a percentage of the graph), and how much runway time deleting it gains us at the current Wikidata growth rate.
Vertical data: No subtask ticket for this completed task; results here https://wikitech.wikimedia.org/wiki/User:AKhatun/Wikidata_Vertical_Analysis#Additional_Info
 All non-English labels
 All labels that are covered by fallbacks (names duplicated across 200 languages etc)
 All labels
 All descriptions
 All aliases
 All labels, descriptions and aliases
 External identifiers
 Scholarly papers: http://wikicite.org/statistics.html
* Scholarly papers + authors + scientific journals + identifiers
* Astronomical objects: https://www.wikidata.org/wiki/Q6999
 Items that don’t have 3 backlinks
 Look at the distribution of number of backlinks, and use that to determine how many backlinks might make more sense
 All statements of a specific datatype: monolingual text (not important for querying says Lydia)
 Non-normalized values (units, dates, external ids)
 Non-top-ranked statements https://grafana.wikimedia.org/d/000000175/wikidata-datamodel-statements?orgId=1&refresh=30m
 Every item with ORES quality score lower than X or no ORES score
* Items (without hot properties) that are being queried
Crucial pieces of data that would be good to know the size of even if we can't drop them (they may be community-curated if necessary):
 All Properties
 All sitelinks
 All classifying statements (aka the ontology) - this would be all statements using the Property "subclass of", "instance of", "part of", "has part" or "parent taxon"
 All humans
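
The size and runway figures requested for each candidate can be sketched as simple arithmetic. This is only an illustration, assuming that size is measured in triples and that the graph grows at a roughly constant number of triples per day; the function name, the field names and all numbers below are hypothetical placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass
class CandidateEstimate:
    name: str
    triples: int        # triples the candidate occupies in Blazegraph
    share_pct: float    # percentage of the whole graph
    runway_days: float  # days of growth that deleting it would buy


def estimate(name: str, candidate_triples: int, total_triples: int,
             growth_per_day: float) -> CandidateEstimate:
    """Size a deletion candidate and convert the freed triples into runway time.

    Assumes linear growth: runway = freed triples / triples added per day.
    """
    if total_triples <= 0 or growth_per_day <= 0:
        raise ValueError("total_triples and growth_per_day must be positive")
    share_pct = 100.0 * candidate_triples / total_triples
    runway_days = candidate_triples / growth_per_day
    return CandidateEstimate(name, candidate_triples, share_pct, runway_days)


# Placeholder inputs for illustration only.
example = estimate("hypothetical candidate", 2_000_000_000, 10_000_000_000, 5_000_000)
print(f"{example.name}: {example.triples} triples, "
      f"{example.share_pct:.1f}% of graph, ~{example.runway_days:.0f} days of runway")
```

Several candidates overlap (for example, "All non-English labels" is contained in "All labels"), so their individual estimates should not be added together without first accounting for the shared triples.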