
Create aggregate list of potential Blazegraph data deletion sources in case of catastrophic failure
Closed, Resolved (Public)

Description

As a product manager, I want a single source of truth for all current candidates for data deletion from Blazegraph in the case of catastrophic failure, so that I can prioritize what is deleted.

So far we've investigated several candidates for deleting Wikidata data from Blazegraph in the case of a catastrophic failure, e.g. lexemes, scholarly articles, labels, etc. These have been documented across various analysis write-ups, but it would be helpful to have them in a single table so we can look at them all at once.

AC:

  • create a table including all current deletion candidates we've looked into so far
  • each candidate should include (approximations ok): number/% of entities, number/% of triples, number of days for Blazegraph to recover at current rate of growth, number/% of queries potentially affected

For reference, all the required information for scholarly articles is available at: https://wikitech.wikimedia.org/w/index.php?title=User:AKhatun/Wikidata_Scholarly_Articles_Subgraph_Analysis#TL;DR . This ticket is to tabulate all this information into a single table that includes the same information for all other data deletion candidates we've investigated.

Event Timeline

MPhamWMF triaged this task as High priority. Nov 5 2021, 5:57 PM
MPhamWMF created this task.
This comment was removed by AKhatun_WMF.

Sources:

Reference Stats:

  • Total number of entities: 95M
  • Total number of triples: 13B
  • Total number of queries: 220M
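
| Name | % of entities | % of triples | Number of days for Blazegraph to recover at current rate of growth | % of queries potentially affected (monthly) |
| --- | --- | --- | --- | --- |
| description | N/A | 20 | 518 | 12 |
| external id | N/A | 9 | 239 | 30 |
| label | N/A | 4 | 104 | 48 |
| altLabel | N/A | 0.8 | 21 | 16 |
| name | N/A | 0.6 | 16 | 8 |
| lexicographical entities | | | 8-10 | 0.09 |
| scholarly article | 40 | 50 | 1370 | 0.7 |
| astronomical object | 9 | 9 | 238 | 0.2 |
| human | 10 | 7 | 200 | 31 |
| Wikimedia category | 5 | 6 | 157 | 0.6 |
| taxon | 3.4 | 3 | 77 | 25 |
| family name | 0.5 | 1.4 | 40 | 2.5 |
| Wikimedia disambiguation page | 1.5 | 1.4 | 37 | 1.7 |
| gene | 1.3 | 0.9 | 25 | 0.3 |
| Wikimedia template | 0.9 | 0.9 | 23 | 0.1 |
| chemical compound | 1.3 | 0.7 | 19 | 0.6 |
| film | 0.3 | 0.4 | 10 | 2 |
| Wikimedia list article | 0.4 | 0.3 | 7 | 0.6 |
| business | 0.2 | 0.1 | 3 | 1.8 |
| language | 0.011 | 0.013 | 0.3 | 0.8 |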

The numbers are rounded. Only the top 10 subgraphs are listed, ranked both by subgraph size and by number of queries. More can be found in: Table_of_top_50_subgraph_information and Query count analysis for the subgraphs.
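
For anyone wanting to sanity-check the recovery-days column, here is a rough Python sketch of how the figures relate to the share of triples. It assumes that "days to recover" means the number of days of growth needed to refill the triples freed by a deletion. The daily growth rate is not stated in this task, so it is back-derived here from the scholarly article row (50% of 13B triples, 1370 days). The function names are illustrative and not taken from the analysis notebooks.

```python
# Rounded reference total quoted in the table comment.
TOTAL_TRIPLES = 13e9


def implied_daily_growth(pct_triples: float, days: float) -> float:
    """Triples added per day implied by one table row.

    Assumption: recovery days = freed triples / daily growth.
    This is a back-derived estimate, not an official figure.
    """
    return TOTAL_TRIPLES * pct_triples / 100 / days


def recovery_days(pct_triples: float, daily_growth: float) -> float:
    """Days of growth needed to refill pct_triples percent of all triples."""
    return TOTAL_TRIPLES * pct_triples / 100 / daily_growth


# Back-derive the growth rate from the scholarly article row (50%, 1370 days).
growth = implied_daily_growth(50, 1370)
print(f"implied growth: {growth / 1e6:.1f}M triples/day")

# Apply it to another row, e.g. a subgraph holding about 9% of triples.
print(f"~9% of triples: {recovery_days(9, growth):.0f} days")
```

Because the percentages in the table are rounded, results from this sketch will only roughly match the listed day counts.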

Thanks, @AKhatun_WMF. I know you're still working on some of these other estimates (e.g. query analysis). It'd be nice to have some more estimates filled in by end of month if at all possible.