
WDCM Structure Dashboard: Refining the WDCM Taxonomy
Closed, Invalid · Public

Description

Refine the existing WDCM taxonomy:

  • A WDCM Taxonomy is a selection of semantic categories, each defined by a respective SPARQL query, that is used as a main categorical tool in WDCM analyses;
  • The current WDCM Taxonomy suffers from many atomic problems, e.g. Work of Art includes reference manuals because every book counts as a work of art, and similar;
  • TASK: "clean up" the existing Taxonomy by figuring out SPARQL queries that result in the selection of categories that are as close to some intuitive human semantics as possible.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Sep 3 2017, 11:01 PM
  • We need to address this thing with the community.

@Lydia_Pintscher I have had enough time to work with the WDCM now and inspect many empirical findings on Wikidata usage.

This task is absolutely critical. I highly suspect that most of the statistical results we obtain are heavily influenced by a huge overlap in the semantic categories that we currently use. Please let me know when you have some time to discuss this. Thanks.

Would a call next week Thursday or Friday work?

@Lydia_Pintscher Anytime, just let me know when it is best for you.

In the meantime, I will try to solve the problem in a general way (if a general solution is possible at all; that depends on the current structure of Wikidata), so that you can decide what we want to do and I can start implementing.

GoranSMilovanovic added a comment. · Edited · Nov 15 2017, 1:12 PM

@Lydia_Pintscher To start, I will try to clean up what we already have, just to remove the most pressing issues (for example, removing many books from Work of Art, removing Geographical Objects from Organization). I will do that before running the first productionized update which is almost ready to run.

GoranSMilovanovic triaged this task as High priority. · Nov 16 2017, 1:14 AM

Fix for (1) Work of Art MINUS Book and (2) Organization MINUS Geographical Object applied (note: this is not the SPARQL MINUS, but an internal WDCM operation done in R).

Leaving the ticket open for some better times when the fundamental solution to the problem, if possible at all, will receive a proper discussion.
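As an illustration only, the "A MINUS B" fix can be thought of as a plain set difference over the item IDs returned for each category. The sketch below is in Python rather than R, and the function name `exclude_category` and the item IDs are hypothetical placeholders, not WDCM code or real query results.

```python
def exclude_category(items, excluded):
    """Return the items of one category that do not appear in the excluded category."""
    return set(items) - set(excluded)


# Hypothetical placeholder IDs, not real Wikidata query results.
work_of_art = {"Q_painting_1", "Q_novel_1", "Q_manual_1"}
book = {"Q_novel_1", "Q_manual_1"}

# Work of Art MINUS Book, done client-side after both queries have returned.
work_of_art_clean = exclude_category(work_of_art, book)
print(sorted(work_of_art_clean))
```

Doing the subtraction client-side keeps each category's SPARQL query unchanged; whether that matches the actual WDCM implementation is an assumption.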

GoranSMilovanovic added a comment. · Edited · Nov 26 2017, 2:37 PM

@Lydia_Pintscher

Starting to figure out the semantic nature of the Wikidata items that are currently not covered by the WDCM taxonomy; I am on this as of November 26, 2017 (today). Goal: to formulate categories that would encompass as many Wikidata items as possible (WDCM now works with approximately 14M).

@Lydia_Pintscher Ok, first thing

  • I have been using ?item wdt:P31/wdt:P279* wd:Q5. in WDCM
  • in place of ?item (wdt:P31|(wdt:P31/wdt:P279*)) wd:Q5.
  • wrongly assuming that (at least a vast majority of) items are sub-classes of something less abstract than what we use in our taxonomy (Human, Geographical Object, etc).

While the first one returns approximately 3.7M Q5 items, the second (which also includes direct instances of Q5) returns 7.4M items.
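To compare the two patterns side by side, one option is to wrap each in a COUNT query and run both against WDQS. The Python sketch below only builds the query strings; the helper name `count_query` is hypothetical, and actually running these counts may well hit the WDQS timeout.

```python
# Standard Wikidata prefixes (WDQS predefines them; they are spelled out here for portability).
PREFIXES = (
    "PREFIX wd: <http://www.wikidata.org/entity/>\n"
    "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
)


def count_query(path: str, target: str = "wd:Q5") -> str:
    """Build a SPARQL query counting distinct items linked to `target` via `path`."""
    return (
        PREFIXES
        + "SELECT (COUNT(DISTINCT ?item) AS ?n) WHERE {\n"
        + f"  ?item {path} {target} .\n"
        + "}"
    )


first = count_query("wdt:P31/wdt:P279*")
second = count_query("(wdt:P31|(wdt:P31/wdt:P279*))")
print(first)
print(second)
```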

Next steps:

  • introducing the change to the WDCM Search Module
  • re-running the current update T181035
  • inspecting what remains out of scope in WDCM while the update runs.

As of the November 30 WDCM update:

  • Books are no longer encompassed by Work of Art;
  • Geographical Objects are removed from Organizations;
  • and Architectural Structure needs to be removed from Geographical Objects.

Out of 39,674,683 Wikidata items (estimated as the number of content pages from the Wikidata wiki), the WDCM now searches for 31,269,756, or almost 79% of all items.
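The coverage figure can be checked directly from the two counts quoted above; a short Python check (variable names are illustrative):

```python
total_items = 39_674_683      # content pages on the Wikidata wiki, as quoted above
covered_items = 31_269_756    # items the WDCM now searches for

remaining_items = total_items - covered_items
coverage_pct = 100 * covered_items / total_items

print(f"{coverage_pct:.1f}% covered, {remaining_items:,} items remaining")
# prints "78.8% covered, 8,404,927 items remaining"
```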

The remaining roughly 8.4 million items will now be sampled, as previously discussed, and studied in an attempt to figure out the rest of the hierarchical organization of Wikidata through P279 relations, to see if it makes sense to incorporate any of them. If those items tend to cluster (in the logical, relational, Wikidata meaning of "to cluster"), then it also makes sense to include them in the WDCM. If they don't, it is questionable whether the WDCM distributional semantics machinery would discover any sensible regularity in their usage patterns, or simply spit out the best possible semantic models, which would not necessarily offer a meaningful interpretation.

GoranSMilovanovic lowered the priority of this task from High to Medium. · Dec 1 2017, 2:12 PM
GoranSMilovanovic added a comment. · Edited · Dec 4 2017, 11:16 AM

@Lydia_Pintscher

I've developed an R function that performs a recursive, top-down search for P279 "subclass of" relations across the Wikidata structure, starting from entity (Q35120), for a given search depth (the stopping criterion, i.e. after how many P279 steps down from entity the recursive call should break and return the items discovered).

Currently, the function has been running against WDQS for more than 48h, reporting back the following:

[1] "I'm going deeper underground... depth: 1... and with 45 classes..."
[1] "I'm going deeper underground... depth: 2... and with 416 classes..."
[1] "I'm going deeper underground... depth: 3... and with 3486 classes..."
[1] "I'm going deeper underground... depth: 4... and with 11049 classes..."
[1] "I'm going deeper underground... depth: 5... and with 476358 classes..."

The run is experimental and meant to determine the growth in the number of items discovered across the P279 search tree, so no stopping criterion is imposed and I will break the function's run if it doesn't hit depth = 6 soon. In any case, with 11049 items discovered at depth = 5 from entity (Q35120) I think we might have enough insight to start figuring out what structure the WDCM now misses: the approx. 8 million items that it does not run for.

However, if it turns out that the concepts of our intuitive taxonomy (the one we currently use) are found scattered somewhere far away from entity (Q35120), it might also turn out that we will need a different method to understand where the items that we're looking for live.
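For readers who want the shape of such a search, here is a minimal Python sketch (not the R function itself). It walks the subclass tree level by level instead of recursively, and it takes a caller-supplied `get_subclasses` function, a hypothetical stand-in for whatever WDQS lookup returns the direct P279 subclasses of a class. The toy graph below is made up for illustration.

```python
from typing import Callable, Iterable, List


def subclasses_by_depth(
    root: str,
    get_subclasses: Callable[[str], Iterable[str]],
    max_depth: int,
) -> List[List[str]]:
    """Return newly discovered classes per depth level, at most max_depth levels below root."""
    seen = {root}
    frontier = [root]
    levels: List[List[str]] = []
    for _ in range(max_depth):
        next_frontier = []
        for cls in frontier:
            for sub in get_subclasses(cls):
                if sub not in seen:  # a class can have several parents; count it once
                    seen.add(sub)
                    next_frontier.append(sub)
        if not next_frontier:
            break  # nothing new below this level
        levels.append(next_frontier)
        frontier = next_frontier
    return levels


# Toy subclass graph; only the root ID is a real Wikidata ID, the rest are made up.
toy = {"Q35120": ["A", "B"], "A": ["C"], "B": ["C", "D"], "C": [], "D": ["E"]}
levels = subclasses_by_depth("Q35120", lambda c: toy.get(c, []), max_depth=2)
for depth, classes in enumerate(levels, start=1):
    print(f"depth: {depth}, classes: {len(classes)}")
```

The `seen` set matters on real data, since the P279 graph is not a strict tree and the same class can be reached through several parents.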

GoranSMilovanovic added a comment. · Edited · Dec 5 2017, 10:06 AM

@Lydia_Pintscher Some things simply make me cry:

  • German Academy of Sciences Leopoldina Q543804 is a P279 of
  • academy of sciences Q414147, which is a P279 of
  • academy Q162633, which is a P279 of
  • higher education institution Q38723, which is a P279 of
  • educational institution Q2385804, which is a P279 of
  • facility Q13226383, which is a P279 of
  • geographical object Q618123,

making Leopoldina a Geographical Object in WDCM.

There are other similar examples.
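The chain above can be replayed with a small Python sketch that follows P279 links upward from a starting class. The mapping contains only the links listed in this comment, one parent per class; real Wikidata classes can have several parents, so this is a simplification, and `superclass_chain` is a hypothetical helper name.

```python
# P279 links exactly as listed above (child -> parent).
SUBCLASS_OF = {
    "Q543804": "Q414147",     # German Academy of Sciences Leopoldina -> academy of sciences
    "Q414147": "Q162633",     # academy of sciences -> academy
    "Q162633": "Q38723",      # academy -> higher education institution
    "Q38723": "Q2385804",     # higher education institution -> educational institution
    "Q2385804": "Q13226383",  # educational institution -> facility
    "Q13226383": "Q618123",   # facility -> geographical object
}


def superclass_chain(start, parent_of):
    """Follow single-parent P279 links upward from start until no parent is known."""
    chain = [start]
    seen = {start}
    while chain[-1] in parent_of:
        parent = parent_of[chain[-1]]
        if parent in seen:  # guard against cycles in the data
            break
        chain.append(parent)
        seen.add(parent)
    return chain


chain = superclass_chain("Q543804", SUBCLASS_OF)
print(" -> ".join(chain))
print("reaches geographical object:", "Q618123" in chain)
```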

Maybe we should simply enlist what we want to have as a Geographical Object in WDCM: (1) a country, (2) a city, (3) a lake, (4) a mountain, (5) a river, (6) an island, etc.
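A whitelist of this kind could look like the following Python sketch. The set contains only the six examples named above (the list is open-ended), class labels are used in place of Q-IDs for readability, and `is_geographical_object` is a hypothetical helper, not part of WDCM.

```python
# Only the classes named above; the real list would be longer and would use Q-IDs.
GEO_WHITELIST = {"country", "city", "lake", "mountain", "river", "island"}


def is_geographical_object(direct_classes):
    """Treat an item as a Geographical Object only if one of its direct classes is whitelisted."""
    return any(cls in GEO_WHITELIST for cls in direct_classes)


print(is_geographical_object(["lake"]))
print(is_geographical_object(["academy of sciences"]))
```

Checking direct classes against an explicit list avoids inheriting membership through long P279 chains, at the cost of maintaining the list by hand.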

@Lydia_Pintscher Please take a look at: http://wdcm.wmflabs.org/WDCM_Structure/ - I still need to document it and write up a concise description, so it is not yet included in the WDCM Dashboards.

Status:

  • We have the WDCM Structure Dashboard now to help us navigate through the classes that we are most interested in;
  • I have left an option for a user to produce a P31|P279 upward paths graph for any desired Wikidata item; I find that handy;
  • Next step: fetching cumulative class item counts. This is, essentially, the most important information for selecting what undergoes analysis in WDCM; the Wikidata Toolkit will most probably be employed to do this, because WDQS could not process some of the operations.
GoranSMilovanovic renamed this task from WDCM Taxonomy: Refine to WDCM Structure Dashboard: Refining the WDCM Taxonomy. · Feb 26 2018, 10:26 AM
GoranSMilovanovic closed this task as Invalid. · Oct 28 2019, 7:18 AM
  • This will be dealt with by ShEx in the near future;
  • closing the task as invalid.