
[Analytics] Find out the size of the Q13442814 (scholarly article) subgraph (including instances of subclasses)
Closed, Resolved · Public

Description

Problem:
As Wikidata PMs we need to understand better how much of Wikidata's graph consists of the scholarly articles subgraph, to make a good decision about splitting the Blazegraph database.

Questions:

  1. Is the ontology clean enough to include all subclasses of Q13442814 (scholarly article) or does that lead to unexpected results?
  2. What is the size of the instances of Q13442814 (scholarly article) including all instances of only the subclasses that AKhatun used? (out of scope, see T342123#9076994)
  3. What is the size of the instances of Q13442814 (scholarly article) including all instances of all direct (wdt:P279) subclasses?
    • # of triples
    • % of triples
    • # of Items (optional)
    • % of Items (optional)

How the data will be used:

What difference will these insights make:

Notes:

  • The most recent numbers that we can get will do.

Assignee Planning

Information below this point is filled out by WMDE Analytics and specifically the assignee of this task.

Sub Tasks

Full breakdown of the steps to complete this task:

  • Derive # of triples
    • Total Wikidata triples in discovery.wikibase_rdf: 15,043,483,216
    • Total direct SA and subclass triples: 7,196,453,128
    • Total SA and subclass triples with vals and refs that are NOT unique to SAs and subclasses: 7,529,369,734
    • Total SA and subclass triples with vals and refs that ARE unique to SAs and subclasses: 7,529,130,686
  • Derive % of triples
    • Percent direct SA and subclass triples: 47.8377%
    • Percent SA and subclass triples with vals and refs that are NOT unique to SAs and subclasses: 50.0507%
    • Percent SA and subclass triples with vals and refs that ARE unique to SAs and subclasses: 50.0491%
  • Derive # of Items
    • Total distinct QIDs: 108,265,975
    • Total SA and subclass QIDs: 40,403,721
  • Derive % of Items
    • Percent SA and subclass QIDs: 37.3189%
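
For reference, a minimal sketch in Python that reproduces the percentages above from the reported counts (the variable names are illustrative, not from the notebook):

# Reproduce the percentages from the raw counts reported above.
total_triples = 15_043_483_216
direct_sa_triples = 7_196_453_128
sa_triples_nonunique_vals_refs = 7_529_369_734
sa_triples_unique_vals_refs = 7_529_130_686

total_items = 108_265_975
sa_items = 40_403_721

for count in (direct_sa_triples, sa_triples_nonunique_vals_refs, sa_triples_unique_vals_refs):
    print(f"{count / total_triples:.4%}")
# 47.8377%
# 50.0507%
# 50.0491%

print(f"{sa_items / total_items:.4%}")
# 37.3189%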

Data to be used

See Analytics/Data_Lake for the breakdown of the data lake databases and tables.

The following tables will be referenced in this task:

  • The discovery.wikibase_rdf table will be used for the aggregate counts and percentages of triples and Items

Notes and Questions

Things that came up during the completion of this task, questions to be answered and follow up tasks:

Event Timeline

@Manuel, based on the query provided in https://w.wiki/77FU (I took out the French comment at the end and regenerated the short link), it looks like the ontology is relatively clean if we keep it to the base subclasses with wdt:P279, but not if we go beyond that to the full graph with wdt:P279*. A summary:

  • In the case of wdt:P279, the only outliers we're getting are QIDs that are themselves scholarly articles, because they've had P279 applied to them rather than P31.
    • Including these should be fine, in that they would have been included anyway?
  • In the case of wdt:P279*, we're getting all kinds of QIDs that do not fit the subgraph we're trying to describe. Examples include:

I'll report back in this issue on how the subgraph was defined in the original analysis and then we can make a decision on this :)
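
For anyone who wants to reproduce the comparison without the short link, here is a minimal sketch of the kind of check involved, querying the public WDQS endpoint with requests. This is not necessarily the exact query behind https://w.wiki/77FU, and the User-Agent string is illustrative:

# Compare the direct subclasses (wdt:P279) of scholarly article with the full
# transitive closure (wdt:P279*) to get a feel for how clean the ontology is.
import requests

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"
HEADERS = {"User-Agent": "scholarly-article-subgraph-check/0.1 (analytics task)"}


def run_select(query):
    """Run a SPARQL SELECT query against WDQS and return the result bindings."""
    resp = requests.get(
        WDQS_ENDPOINT,
        params={"query": query, "format": "json"},
        headers=HEADERS,
    )
    resp.raise_for_status()
    return resp.json()["results"]["bindings"]


# Direct subclasses only.
direct = run_select(
    "SELECT ?class ?classLabel WHERE { "
    "?class wdt:P279 wd:Q13442814 . "
    'SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . } }'
)

# Full transitive closure (much larger and noisier; may also be slow).
transitive = run_select("SELECT ?class WHERE { ?class wdt:P279* wd:Q13442814 . }")

print(len(direct), len(transitive))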

Manuel renamed this task from [Analytics] Find out the size of the Q13442814 (scholarly article) subgraph (including all instances of subclasses) to [Analytics] Find out the size of the Q13442814 (scholarly article) subgraph (including instances of subclasses). Jul 24 2023, 3:54 PM

@Lydia_Pintscher: Instead of using the subclasses that Aisha Khatun used in her research, we could also use suggestions from Scholia. Do we have any input from them yet?

@Manuel, @dcausse: I have the classes from AKhatun and the subclasses of scholarly article listed in the task now. I figured it'd be good to get them all here so we know what we're talking about :)

Looking at this further, it seems that AKhatun focused more on scholarly articles and was just listing subclasses in the report itself as examples. The reference for this is the following part of the report:

Scholarly articles have the largest count (37M) while everything else combined is in the thousands (~130K, excluding those that are included in scholarly articles), therefore the analysis is more focused on scholarly articles than others.

Of the ones that were listed, scientific journal (Q5633421), scholarly conference abstract (Q58632367) and conference paper (Q23927052) are not direct subclasses of scholarly article, so maybe we can focus on the direct subclasses of scholarly article plus those three that are not included? I'd say that scientific journals would be needed for the new graph as well 🤔
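
If it helps for double-checking, a minimal sketch of how the direct-subclass claim can be verified for any single class with an ASK query against WDQS (Q5633421 is used as the example; the User-Agent string is illustrative):

# Quick check of whether a class is a *direct* subclass of scholarly article.
import requests

ask_query = "ASK { wd:Q5633421 wdt:P279 wd:Q13442814 }"
resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": ask_query, "format": "json"},
    headers={"User-Agent": "scholarly-article-subgraph-check/0.1 (analytics task)"},
)
resp.raise_for_status()
print(resp.json()["boolean"])  # False would confirm it is not a direct subclass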

Hi AndrewTavis_WMDE, thank you, this helped a lot!

Comparing the two lists, we can see that AKhatun's classes are not solely subclasses of scholarly articles. This means that we do not need to look at them for this task. I have edited the task accordingly.


@dcausse, do you have an idea why the direct triples for SAs and their subclasses and the direct triples for non-SAs and subclasses don't sum to the total? It was adding up in the last notebook, as you saw. The only major change I've made is that it's now .where(col("object").isin(sa_and_sasc_qids)) rather than the equality check, where sa_and_sasc_qids is the hard-coded list of QIDs from above including scholarly article itself (I was getting some papers back when directly querying subclasses).

The important snippets from the code:

from pyspark.sql.functions import col

# P31_DIRECT_URL and sa_and_sasc_qids (the hard-coded list of QIDs mentioned
# above) are defined earlier in the notebook.

# All Wikidata triples from the 2023-07-17 snapshot.
df_wikidata_rdf = (
    spark.table("discovery.wikibase_rdf")
    .where("wiki='wikidata' AND date = '20230717'")
    .alias("df_wikidata_rdf")
)

# Subjects that are instances (P31) of scholarly article or one of its subclasses.
sa_and_sasc_ids = (
    df_wikidata_rdf.select(col("subject").alias("sa_and_sasc_qids"))
    .where(col("predicate") == P31_DIRECT_URL)
    .where(col("object").isin(sa_and_sasc_qids))
    .alias("sa_and_sasc_ids")
)

# Triples whose context (the entity the triple belongs to) is in the SA/subclass ID set.
sa_and_sasc_direct_triples = (
    df_wikidata_rdf.join(
        other=sa_and_sasc_ids,
        on=(sa_and_sasc_ids["sa_and_sasc_qids"] == df_wikidata_rdf["context"]),
        how="inner"
    )
    .select("df_wikidata_rdf.*")
    .cache()
)

# All remaining triples, i.e. entities not in the SA/subclass ID set.
non_sa_and_sasc_direct_triples = (
    df_wikidata_rdf.join(
        other=sa_and_sasc_ids,
        on=(sa_and_sasc_ids["sa_and_sasc_qids"] == df_wikidata_rdf["context"]),
        how="leftanti"
    )
    .select("df_wikidata_rdf.*")
    .cache()
)

# total_triples and the total_*_direct_triples variables hold counts computed
# earlier in the notebook.
print_num_str_with_commas(total_triples)
# 15,043,483,216

print_num_str_with_commas(sa_and_sasc_direct_triples.count())
# 7,778,494,249

print_num_str_with_commas(non_sa_and_sasc_direct_triples.count())
# 7,847,030,088

print_num_str_with_commas(total_sa_and_sasc_direct_triples + total_non_sa_and_sasc_direct_triples)
# 15,625,524,337

Is there something going on with the relationship between the multiple classes? Do we need to switch the joins up for this one?

At a glance I suspect that now you might get duplicated QIDs in

sa_and_sasc_ids = (
    df_wikidata_rdf.select(col("subject").alias("sa_and_sasc_qids"))
    .where(col("predicate") == P31_DIRECT_URL)
    .where(col("object").isin(sa_and_sasc_qids))
    .alias("sa_and_sasc_ids")
)

This could be explained by entities being tagged with more than one of the classes in sa_and_sasc_qids.
What happens if you apply a distinct here:

sa_and_sasc_ids = (
    df_wikidata_rdf.select(col("subject").alias("sa_and_sasc_qids"))
    .where(col("predicate") == P31_DIRECT_URL)
    .where(col("object").isin(sa_and_sasc_qids))
    .distinct()
    .alias("sa_and_sasc_ids")
)
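
To illustrate the duplication effect with a toy example (hypothetical data, not from the task): joining against an ID set that contains the same QID more than once multiplies the matching triples, while a .distinct() restores one match per triple:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One entity (Q1) that is an instance of two classes in the target list,
# plus its two statement triples grouped under the same context.
triples = spark.createDataFrame(
    [
        ("Q1", "P31", "Q13442814", "Q1"),
        ("Q1", "P31", "Q58632367", "Q1"),
    ],
    ["subject", "predicate", "object", "context"],
)

# The ID set as it comes out of the P31 filter: Q1 appears twice.
ids = spark.createDataFrame([("Q1",), ("Q1",)], ["qid"])

# Without distinct: every Q1 triple matches both Q1 rows in `ids` -> 4 rows.
print(triples.join(ids, triples["context"] == ids["qid"], "inner").count())

# With distinct: each triple matches at most once -> 2 rows.
ids_dedup = ids.distinct()
print(triples.join(ids_dedup, triples["context"] == ids_dedup["qid"], "inner").count())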

That's what we were thinking too, @dcausse :) I'm realizing that where I had placed the .distinct() was incorrect, though. Edit: never mind the prior comment. Not sure why it wasn't working within the parentheses at first...

Thanks for checking in!

Minor question on this, @dcausse: why aren't we caching df_wikidata_rdf and sa_and_sasc_ids above? My assumption is that we should, given that we're using them in multiple later calculations. But when I tried to cache them, a calculation that would normally finish lost resources and stalled with three separate stages running. Did you explicitly choose not to cache them, and if so, why not? :)


I don't remember having such problems, nor thinking too much about what to cache. Generally speaking, caching comes with an extra cost and it's not always obvious that you'll get a net benefit, but here I tend to agree that sa_and_sasc_ids sounds like a good candidate for caching (single column, relatively few rows), and I'm not sure I understand why it could fail... have you tried multiple times? It might be unrelated to caching. If your notebook has had its kernel open for a long time (several days) and the Spark session was still open during that time, I would not be surprised if Hadoop had tried to clean up some things in the meantime, making Spark unhappy... just making random guesses here. If it still does not work after retrying on a fresh Spark session (by killing your kernel), please feel free to upload your code somewhere and I'll give it a try.
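
For what it's worth, a minimal sketch of the pattern being discussed, using the variable names from the snippets above: cache only the small, single-column ID table, materialize it once, and release it when done:

# Only the small ID table is cached, since it is reused by several joins;
# df_wikidata_rdf, P31_DIRECT_URL and sa_and_sasc_qids are as defined above.
sa_and_sasc_ids = (
    df_wikidata_rdf.select(col("subject").alias("sa_and_sasc_qids"))
    .where(col("predicate") == P31_DIRECT_URL)
    .where(col("object").isin(sa_and_sasc_qids))
    .distinct()
    .alias("sa_and_sasc_ids")
    .cache()
)
sa_and_sasc_ids.count()  # materialize once so later joins hit the cached data

# ... run the joins and counts that reuse sa_and_sasc_ids ...

sa_and_sasc_ids.unpersist()  # release executor storage memory when finished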

Thanks a lot for this, @dcausse! The reasoning of single column, relatively few rows for caching makes a lot of sense. I think the problems I faced came from trying to cache df_wikidata_rdf. I just ran things through again with only sa_and_sasc_ids cached, and it did seem to run a bit better. That said, I did end up running the notebook multiple times, saving the outputs to variables as I went along before restarting the kernel.

Will update the task with the final values now!

@Manuel, @dcausse: the metrics increased, but only by a very marginal amount; we're now a bit over 50% rather than a bit below. Let me know if anything else is needed!

Thank you again, @dcausse, for all of your support! :)