
[EPIC] Get estimates for dropping data from Wikidata in case of Blazegraph catastrophic failure
Closed, Resolved · Public

Description

As a Wikidata user, I want Wikidata to function with limited data in the case of Blazegraph failure due to reaching maximum graph size, rather than being completely non-functional.

This ticket is a part of WDQS disaster planning, and reflects research into mitigation strategies for catastrophic failure of Blazegraph: specifically in the case that the Wikidata graph becomes too big for Blazegraph to continue supporting. This is not a commitment to a long term state of WDQS or Wikidata, but part of the disaster mitigation playbook in a worst case scenario.

In the case of Blazegraph reaching the maximum number of triples it can store, we may need to prioritize which data to keep and which to delete (temporarily, until we can reload it from dumps later, once our graph backend can support it). This task is to determine how much space we can save by deleting each of the candidates below. Note that these candidates are not ordered by priority, and this list does not take anything into account beyond the size of these data in Blazegraph.

For each candidate in the list, determine its size in Blazegraph (absolute and as a percentage of the graph), and how much runway time deleting it gains us at the current Wikidata growth rate.
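The runway estimate described above is simple arithmetic, assuming roughly linear graph growth. A minimal sketch (all figures below are illustrative assumptions, not measured values from the actual analysis):

```python
def share_of_graph(candidate_triples: int, total_triples: int) -> float:
    """Candidate's size as a percentage of the whole graph."""
    return 100.0 * candidate_triples / total_triples

def runway_gained_days(triples_freed: int, growth_per_day: int) -> float:
    """Days of headroom gained by freeing `triples_freed` triples,
    assuming the graph grows linearly at `growth_per_day` triples/day."""
    return triples_freed / growth_per_day

# Illustrative numbers only: freeing 1 billion triples from a
# hypothetical 13-billion-triple graph growing 2 million triples/day.
print(round(share_of_graph(1_000_000_000, 13_000_000_000), 1))  # 7.7
print(runway_gained_days(1_000_000_000, 2_000_000))             # 500.0
```

A linear-growth assumption is the simplest defensible model here; if growth is accelerating, these runway figures would be optimistic.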

Vertical data: no subtask ticket for this completed analysis; results at https://wikitech.wikimedia.org/wiki/User:AKhatun/Wikidata_Vertical_Analysis#Additional_Info

  • All non-English labels
  • All labels that are covered by fallbacks (names duplicated across 200 languages etc)
  • All labels
  • All descriptions
  • All aliases
  • All labels, descriptions, and aliases
  • External identifiers
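One way to size the vertical candidates is to stream an RDF dump and tally triples by predicate and language tag. A minimal pure-Python sketch over a few inline N-Triples lines (the sample data is illustrative; a real analysis would stream the full dump):

```python
import re
from collections import Counter

# A few illustrative N-Triples lines standing in for a full dump.
SAMPLE = """\
<http://www.wikidata.org/entity/Q42> <http://www.w3.org/2000/01/rdf-schema#label> "Douglas Adams"@en .
<http://www.wikidata.org/entity/Q42> <http://www.w3.org/2000/01/rdf-schema#label> "Douglas Adams"@de .
<http://www.wikidata.org/entity/Q42> <http://schema.org/description> "English writer"@en .
"""

# Matches a literal object with a language tag, e.g. "Douglas Adams"@en .
LANG_LITERAL = re.compile(r'"(?:[^"\\]|\\.)*"@([a-zA-Z-]+)\s*\.\s*$')

def tally(lines):
    """Count label/description triples, split into English vs. other languages."""
    counts = Counter()
    for line in lines:
        m = LANG_LITERAL.search(line)
        if not m:
            continue
        lang = m.group(1)
        if "rdf-schema#label" in line:
            counts["label_en" if lang == "en" else "label_other"] += 1
        elif "schema.org/description" in line:
            counts["desc_en" if lang == "en" else "desc_other"] += 1
    return counts

print(tally(SAMPLE.splitlines()))
# Counter({'label_en': 1, 'label_other': 1, 'desc_en': 1})
```

Comparing the `*_other` buckets against the totals gives the "all non-English labels" candidate's share; the duplicate-fallback candidate would additionally need to compare literal values across languages per entity.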

Horizontal data:

  • Scholarly papers
  • Astronomical objects
  • Items with fewer than 3 backlinks
  • Non-normalized values
  • Non-top-ranked statements
  • Every item with ORES quality score lower than X or no ORES score
  • All statements of a specific datatype: monolingual text (not important for querying, says Lydia)
  • Items (without hot properties) that are not being queried

Crucial pieces of data we can't drop, but which may be community-curated if necessary:

  • Properties, sitelinks, ontology, humans

Related Objects

  • 10 subtasks Resolved (assigned to AKhatun_WMF)
  • 6 subtasks Open (unassigned)

Event Timeline

Some of the vertical analyses were done as part of familiarizing myself with Wikidata. See the findings in Wikidata_Vertical_Analysis. Will get back to this ticket when done with T282139.

Thanks @AKhatun_WMF! The vertical analysis is helpful.

CBogen renamed this task from "Get estimates for dropping data from Wikidata in case of Blazegraph catastrophic failure" to "[EPIC] Get estimates for dropping data from Wikidata in case of Blazegraph catastrophic failure". Aug 5 2021, 1:37 PM
CBogen added a project: Epic.

Couldn't we save a lot by simplifying descriptions (i.e. storing which template applies instead of the actual string for every language)?

Also, dropping items for disambiguation might save a lot.

Closing this epic as the Blazegraph failure playbook has been published. Will leave the remaining subtickets open in case we want to investigate them later.