
[Analytics] Find out the size of the Q13442814 (scholarly article) subgraph (including instances of subclasses)
Closed, Resolved · Public

Description

Problem:
As Wikidata PMs we need to understand better how much of Wikidata's graph consists of the scholarly articles subgraph, to make a good decision about splitting the Blazegraph database.

Questions:

  1. Is the ontology clean enough to include all subclasses of Q13442814 (scholarly article) or does that lead to unexpected results?
  2. What is the size of the instances of Q13442814 (scholarly article) including all instances of only the subclasses that AKhatun used? (out of scope, see T342123#9076994)
  3. What is the size of the instances of Q13442814 (scholarly article) including all instances of all direct (wdt:P279) subclasses?
    • # of triples
    • % of triples
    • # of Items (optional)
    • % of Items (optional)

How the data will be used:

What difference will these insights make:

Notes:

  • The most recent numbers that we can get will do.

Assignee Planning

Information below this point is filled out by WMDE Analytics and specifically the assignee of this task.

Sub Tasks

Full breakdown of the steps to complete this task:

  • Derive # of triples
    • Total Wikidata triples in discovery.wikibase_rdf: 15,043,483,216
    • Total direct SA and subclass triples: 7,196,453,128
    • Total SA and subclass triples with vals and refs that are NOT unique to SAs and subclasses: 7,529,369,734
    • Total SA and subclass triples with vals and refs that ARE unique to SAs and subclasses: 7,529,130,686
  • Derive % of triples
    • Percent direct SA and subclass triples: 47.8377%
    • Percent SA and subclass triples with vals and refs that are NOT unique to SAs and subclasses: 50.0507%
    • Percent SA and subclass triples with vals and refs that ARE unique to SAs and subclasses: 50.0491%
  • Derive # of Items
    • Total distinct QIDs: 108,265,975
    • Total SA and subclass QIDs: 40,403,721
  • Derive % of Items
    • Percent SA and subclass QIDs: 37.3189%
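
For reference, a minimal sketch in Python that reproduces the percentages above from the reported counts (the variable names are illustrative, not from the notebook):

# Reproduce the percentages from the raw counts reported above.
total_triples = 15_043_483_216
direct_sa_triples = 7_196_453_128
sa_triples_nonunique_vals_refs = 7_529_369_734
sa_triples_unique_vals_refs = 7_529_130_686

total_items = 108_265_975
sa_items = 40_403_721

for count in (direct_sa_triples, sa_triples_nonunique_vals_refs, sa_triples_unique_vals_refs):
    print(f"{count / total_triples:.4%}")
# 47.8377%
# 50.0507%
# 50.0491%

print(f"{sa_items / total_items:.4%}")
# 37.3189%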

Data to be used

See Analytics/Data_Lake for the breakdown of the data lake databases and tables.

The following tables will be referenced in this task:

  • The discovery.wikibase_rdf table will be used for the aggregate counts and percentages of triples and Items

Notes and Questions

Things that came up during the completion of this task, questions to be answered and follow up tasks:

Event Timeline

@Manuel, based on the query provided in https://w.wiki/77FU (I took out the French comment at the end and regenerated the short link), it looks like the ontology is relatively clean if we keep it to the base subclasses with wdt:P279, but not if we go beyond that to the full graph with wdt:P279*. A summary:

  • In the case of wdt:P279, the only outliers we're getting are QIDs that are themselves scholarly articles, because they've had P279 applied to them rather than P31.
    • Including these should be fine, in that they would have been included anyway?
  • In the case of wdt:P279*, we're getting all kinds of QIDs that do not fit the subgraph we're trying to describe. Examples include:

I'll report back in this issue on how the subgraph was defined in the original analysis and then we can make a decision on this :)
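
For anyone who wants to reproduce the comparison without the short link, here is a minimal sketch of the kind of check involved, querying the public WDQS endpoint with requests. This is not necessarily the exact query behind https://w.wiki/77FU, and the User-Agent string is illustrative:

# Compare the direct subclasses (wdt:P279) of scholarly article with the full
# transitive closure (wdt:P279*) to get a feel for how clean the ontology is.
import requests

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"
HEADERS = {"User-Agent": "scholarly-article-subgraph-check/0.1 (analytics task)"}


def run_select(query):
    """Run a SPARQL SELECT query against WDQS and return the result bindings."""
    resp = requests.get(
        WDQS_ENDPOINT,
        params={"query": query, "format": "json"},
        headers=HEADERS,
    )
    resp.raise_for_status()
    return resp.json()["results"]["bindings"]


# Direct subclasses only.
direct = run_select(
    "SELECT ?class ?classLabel WHERE { "
    "?class wdt:P279 wd:Q13442814 . "
    'SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . } }'
)

# Full transitive closure (much larger and noisier; may also be slow).
transitive = run_select("SELECT ?class WHERE { ?class wdt:P279* wd:Q13442814 . }")

print(len(direct), len(transitive))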

Manuel renamed this task from [Analytics] Find out the size of the Q13442814 (scholarly article) subgraph (including all instances of subclasses) to [Analytics] Find out the size of the Q13442814 (scholarly article) subgraph (including instances of subclasses). Jul 24 2023, 3:54 PM

@Lydia_Pintscher: Instead of using the subclasses that Aisha Khatun used in her research, we could also use suggestions from Scholia. Do we have any input from them yet?

@Manuel, @dcausse: I have the classes from AKhatun and the subclasses of scholarly article listed in the task now. I figured it'd be good to get them all here so we know what we're talking about :)

Looking at this further, it seems that AKhatun focused more on scholarly articles and was just listing subclasses in the report itself as examples. The reference for this is the following part of the report:

Scholarly articles have the largest count (37M) while everything else combined is in the thousands (~130K, excluding those that are included in scholarly articles), therefore the analysis is more focused on scholarly articles than others.

Of the ones that were listed, scientific journal (Q5633421), scholarly conference abstract (Q58632367) and conference paper (Q23927052) are not direct subclasses of scholarly article, so maybe we can focus on the direct subclasses of scholarly article plus those three that are not included? I'd say that scientific journals would be needed for the new graph as well 🤔
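
If it helps for double-checking, a minimal sketch of how the direct-subclass claim can be verified for any single class with an ASK query against WDQS (Q5633421 is used as the example; the User-Agent string is illustrative):

# Quick check of whether a class is a *direct* subclass of scholarly article.
import requests

ask_query = "ASK { wd:Q5633421 wdt:P279 wd:Q13442814 }"
resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": ask_query, "format": "json"},
    headers={"User-Agent": "scholarly-article-subgraph-check/0.1 (analytics task)"},
)
resp.raise_for_status()
print(resp.json()["boolean"])  # False would confirm it is not a direct subclass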

Hi AndrewTavis_WMDE, thank you, this helped a lot!

Comparing the two lists, we can see that AKhatun's classes are not solely subclasses of scholarly articles. This means that we do not need to look at them for this task. I have edited the task accordingly.


@dcausse, do you have an idea why the direct triples for SAs and their subclasses and the direct triples for non-SAs and subclasses don't sum to the total? It was adding up in the last notebook, as you saw. The only major change I've made is that it's now .where(col("object").isin(sa_and_sasc_qids)) rather than the equality check, where sa_and_sasc_qids is the hard-coded list of QIDs from above including scholarly article itself (I was getting some papers back when directly querying subclasses).

The important snippets from the code:

from pyspark.sql.functions import col

# P31_DIRECT_URL and sa_and_sasc_qids (the hard-coded list of QIDs mentioned
# above) are defined earlier in the notebook.

# All Wikidata triples from the 2023-07-17 snapshot.
df_wikidata_rdf = (
    spark.table("discovery.wikibase_rdf")
    .where("wiki='wikidata' AND date = '20230717'")
    .alias("df_wikidata_rdf")
)

# Subjects that are instances (P31) of scholarly article or one of its subclasses.
sa_and_sasc_ids = (
    df_wikidata_rdf.select(col("subject").alias("sa_and_sasc_qids"))
    .where(col("predicate") == P31_DIRECT_URL)
    .where(col("object").isin(sa_and_sasc_qids))
    .alias("sa_and_sasc_ids")
)

# Triples whose context (the entity the triple belongs to) is in the SA/subclass ID set.
sa_and_sasc_direct_triples = (
    df_wikidata_rdf.join(
        other=sa_and_sasc_ids,
        on=(sa_and_sasc_ids["sa_and_sasc_qids"] == df_wikidata_rdf["context"]),
        how="inner"
    )
    .select("df_wikidata_rdf.*")
    .cache()
)

# All remaining triples, i.e. entities not in the SA/subclass ID set.
non_sa_and_sasc_direct_triples = (
    df_wikidata_rdf.join(
        other=sa_and_sasc_ids,
        on=(sa_and_sasc_ids["sa_and_sasc_qids"] == df_wikidata_rdf["context"]),
        how="leftanti"
    )
    .select("df_wikidata_rdf.*")
    .cache()
)

# total_triples and the total_*_direct_triples variables hold counts computed
# earlier in the notebook.
print_num_str_with_commas(total_triples)
# 15,043,483,216

print_num_str_with_commas(sa_and_sasc_direct_triples.count())
# 7,778,494,249

print_num_str_with_commas(non_sa_and_sasc_direct_triples.count())
# 7,847,030,088

print_num_str_with_commas(total_sa_and_sasc_direct_triples + total_non_sa_and_sasc_direct_triples)
# 15,625,524,337

Is there something going on with the relationship between the multiple classes? Do we need to switch the joins up for this one?

At a glance I suspect that now you might get duplicated QIDs in

sa_and_sasc_ids = (
    df_wikidata_rdf.select(col("subject").alias("sa_and_sasc_qids"))
    .where(col("predicate") == P31_DIRECT_URL)
    .where(col("object").isin(sa_and_sasc_qids))
    .alias("sa_and_sasc_ids")
)

This could be explained by entities being tagged with more than one of the classes in sa_and_sasc_qids.
What happens if you apply a distinct here:

sa_and_sasc_ids = (
    df_wikidata_rdf.select(col("subject").alias("sa_and_sasc_qids"))
    .where(col("predicate") == P31_DIRECT_URL)
    .where(col("object").isin(sa_and_sasc_qids))
    .distinct()
    .alias("sa_and_sasc_ids")
)
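
To illustrate the duplication effect with a toy example (hypothetical data, not from the task): joining against an ID set that contains the same QID more than once multiplies the matching triples, while a .distinct() restores one match per triple:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One entity (Q1) that is an instance of two classes in the target list,
# plus its two statement triples grouped under the same context.
triples = spark.createDataFrame(
    [
        ("Q1", "P31", "Q13442814", "Q1"),
        ("Q1", "P31", "Q58632367", "Q1"),
    ],
    ["subject", "predicate", "object", "context"],
)

# The ID set as it comes out of the P31 filter: Q1 appears twice.
ids = spark.createDataFrame([("Q1",), ("Q1",)], ["qid"])

# Without distinct: every Q1 triple matches both Q1 rows in `ids` -> 4 rows.
print(triples.join(ids, triples["context"] == ids["qid"], "inner").count())

# With distinct: each triple matches at most once -> 2 rows.
ids_dedup = ids.distinct()
print(triples.join(ids_dedup, triples["context"] == ids_dedup["qid"], "inner").count())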

That's what we were thinking too, @dcausse :) I'm realizing that where I had placed the .distinct() was incorrect, though. Edit: never mind the prior comment. Not sure why it wasn't working within the parentheses at first...

Thanks for checking in!

Minor question on this, @dcausse: why aren't we caching df_wikidata_rdf and sa_and_sasc_ids above? My assumption is that we should, given that we're using them in multiple later calculations. But when I tried to cache them, a calculation that would normally finish lost resources and stalled with three separate stages running. Did you explicitly choose not to cache them, and if so, why not? :)


I don't remember having such problems, nor thinking too much about what to cache. Generally speaking, caching comes with an extra cost and it's not always obvious that you'll get a net benefit, but here I tend to agree that sa_and_sasc_ids sounds like a good candidate for caching (single column, relatively few rows), and I'm not sure I understand why it could fail... have you tried multiple times? It might be unrelated to caching. If your notebook has had its kernel open for a long time (several days) and the Spark session was still open during that time, I would not be surprised if Hadoop had tried to clean up some things in the meantime, making Spark unhappy... just making random guesses here. If it still does not work after retrying on a fresh Spark session (by killing your kernel), please feel free to upload your code somewhere and I'll give it a try.
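
For what it's worth, a minimal sketch of the pattern being discussed, using the variable names from the snippets above: cache only the small, single-column ID table, materialize it once, and release it when done:

# Only the small ID table is cached, since it is reused by several joins;
# df_wikidata_rdf, P31_DIRECT_URL and sa_and_sasc_qids are as defined above.
sa_and_sasc_ids = (
    df_wikidata_rdf.select(col("subject").alias("sa_and_sasc_qids"))
    .where(col("predicate") == P31_DIRECT_URL)
    .where(col("object").isin(sa_and_sasc_qids))
    .distinct()
    .alias("sa_and_sasc_ids")
    .cache()
)
sa_and_sasc_ids.count()  # materialize once so later joins hit the cached data

# ... run the joins and counts that reuse sa_and_sasc_ids ...

sa_and_sasc_ids.unpersist()  # release executor storage memory when finished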

Thanks a lot for this, @dcausse! The reasoning of single column, relatively few rows for caching makes a lot of sense. I think the problems I faced came from trying to cache df_wikidata_rdf. I just ran things through again with only sa_and_sasc_ids cached, and it did seem to run a bit better. That said, I did end up running the notebook multiple times, saving the outputs to variables as I went along before restarting the kernel.

Will update the task with the final values now!

@Manuel, @dcausse: the metrics increased, but only by a very marginal amount; we're now a bit over 50% rather than a bit below. Let me know if anything else is needed!

Thank you again, @dcausse, for all of your support! :)