Analysis: Property usage by items' P31
Closed, Resolved, Public

Assigned To

Authored By

	JAllemandou
	Sep 16 2021, 4:53 PM

Description

As a WDQS user, I don't want a large unrelated subgraph to affect the performance of my query when there is a shared property, so that my seemingly simple queries don't time out or take a long time to complete.

We want to understand how properties are used across different content subgraphs (for instance humans, scholarly articles, etc.). This would show how a property used in one query context can be affected performance-wise by its usage in other contexts. For instance, queries on the main-topic property for books could suffer because the same property is widely used for scholarly articles (a huge subgraph).
This analysis would use the P31 values of items to cluster items into groups (P279 might give even better groupings), then count property usage per group for further analysis.
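The grouping-and-counting step described above could look roughly like the following sketch. It is not the analysis that was actually run: the input shape (item id, P31 values, properties used), the function name `property_usage_by_group`, and the sample item ids are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def property_usage_by_group(items):
    """Count how often each property is used within each P31 group.

    `items` is an iterable of (item_id, p31_values, properties_used) tuples.
    This input shape is a hypothetical one chosen for the sketch.
    An item with several P31 values is counted once in every group it belongs to.
    """
    usage = defaultdict(Counter)
    for _item_id, p31_values, properties in items:
        for group in p31_values:
            usage[group].update(properties)
    return usage


# Illustrative data only: item ids are made up; Q571 (book) and
# Q13442814 (scholarly article) stand in for two subgraphs.
sample_items = [
    ("item_book_1", ["Q571"], ["P31", "P921", "P50"]),
    ("item_article_1", ["Q13442814"], ["P31", "P921", "P2093"]),
    ("item_article_2", ["Q13442814"], ["P31", "P921"]),
]

usage = property_usage_by_group(sample_items)
for group, counts in usage.items():
    print(group, dict(counts))
```

On real data the same aggregation would more likely be a group-by over a triples table, but the per-group counts it produces are the input the further analysis needs.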

Related Objects

Status	Assigned	Task
Resolved	AKhatun_WMF	T282790 [EPIC] Get estimates for dropping data from Wikidata in case of Blazegraph catastrophic failure
Resolved	AKhatun_WMF	T293628 Get baseline measurements/expectations for splitting various subgraphs from Wikidata
Resolved	AKhatun_WMF	T291205 Analysis: Property usage by items' P31

Event Timeline

JAllemandou created this task. Sep 16 2021, 4:53 PM

Restricted Application added a subscriber: Aklapper. Sep 16 2021, 4:53 PM

JAllemandou updated the task description. Sep 16 2021, 4:53 PM

Maintenance_bot added a project: Wikidata. Sep 16 2021, 5:45 PM

MPhamWMF moved this task from Incoming to Analysis on the Wikidata-Query-Service board. Sep 20 2021, 2:01 PM

Isn't that what property suggestor does?

AKhatun_WMF updated the task description. Sep 24 2021, 11:55 AM

MPhamWMF updated the task description. Sep 24 2021, 6:32 PM

AKhatun_WMF claimed this task. Sep 27 2021, 10:27 AM

AKhatun_WMF moved this task from Analysis to Current work on the Wikidata-Query-Service board.

AKhatun_WMF added a parent task: T293628: Get baseline measurements/expectations for splitting various subgraphs from Wikidata. Oct 18 2021, 1:53 PM

Some analysis was done here:

Property usage across subgraphs: Predicates_across_subgraphs
Top predicates also used in scholarly articles: Top_properties_used_in_other_subgraphs

Suggested analysis:

Categorize usage type of properties:
- Similar distribution of use across subgraphs
- Have X% usage in Y subgraphs
- Used in lots of small subgraphs, used in small quantity in all subgraphs
- Entropy over the power-law distribution of the property across subgraphs (Spark UDF for entropy)
  - This will give us a single number to represent the distribution of a property
  - Will incorporate the distribution as well as the variability of property usage
The distribution of entropy values across properties will tell us what kinds of properties we have on hand.
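As a rough illustration of the entropy idea, here is a minimal sketch in plain Python. It assumes Shannon entropy in bits, which the comment does not specify, and the function name `usage_entropy` and the sample counts are hypothetical. The same function body could be wrapped as a Spark UDF.

```python
import math


def usage_entropy(counts):
    """Shannon entropy (in bits) of one property's usage across subgraphs.

    `counts` maps a subgraph label to the number of times the property
    is used in that subgraph. Base 2 is an assumption of this sketch.
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


# Made-up counts: a property confined to one subgraph versus one
# spread evenly over four subgraphs.
concentrated = {"scholarly articles": 1000}
spread = {"humans": 250, "books": 250, "films": 250, "taxa": 250}

print(usage_entropy(concentrated))  # 0.0
print(usage_entropy(spread))        # 2.0
```

Low entropy means the property lives almost entirely in one subgraph, while high entropy means it is shared across many, which is the single-number summary the list above asks for.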

The suggested analysis could be done through a new ticket if required later on.

AKhatun_WMF moved this task from Incoming to Needs Reporting on the Discovery-Search (Current work) board. Nov 9 2021, 1:27 AM

Gehel closed this task as Resolved. Nov 15 2021, 2:14 PM

MPhamWMF mentioned this in T295779: Automatically identify WD properties that are potentially risky based on subgraph distribution. Nov 16 2021, 2:56 PM
