[Analytics] Find out the size of direct instances of Q13442814 (scholarly article)
Closed, ResolvedPublic

Description

Problem:
As Wikidata PMs we need to better understand how much of Wikidata's graph consists of the scholarly articles subgraph, in order to make a good decision about how to split the Blazegraph database.

Questions:
What is the size of the direct instances of Q13442814 (scholarly article) in Wikidata

  • # of triples
  • % of triples
  • # of Items (optional)
  • % of Items (optional)

How the data will be used:

What difference will these insights make:

Notes:

  • The most recent numbers that we can get will do.

Open questions:

  • The triple table follows a different logic than Wikibase tables. What is the exact definition of triples that we should include in the counts?

Assignee Planning

Information below this point is filled out by WMDE Analytics and specifically the assignee of this task.

Sub Tasks

Full breakdown of the steps to complete this task:

  • Define tables to be used below
  • Derive aggregate and percentage data
  • Date for the below metrics is 20230717 within the discovery.wikibase_rdf table. An HTML of the work for this task can be found here.
  • Derive # of triples
    • Total Wikidata triples in discovery.wikibase_rdf: 15,043,483,216
    • Total direct SA triples: 7,188,746,257
    • Total SA triples with vals and refs that are NOT unique to SAs: 7,521,423,558
    • Total SA triples with vals and refs that ARE unique to SAs: 7,521,225,975
  • Derive % of triples
    • Percent direct SA triples: 47.7864%
    • Percent SA triples with vals and refs that are NOT unique to SAs: 49.9979%
    • Percent SA triples with vals and refs that ARE unique to SAs: 49.9966%
  • Derive # of Items
    • Total distinct QIDs: 108,265,975
    • Total SA QIDs: 40,300,769
  • Derive % of Items
    • Percent SA QIDs: 37.2239%

Data to be used

See Analytics/Data_Lake for the breakdown of the data lake databases and tables.

The following tables will be referenced in this task:

  • The discovery.wikibase_rdf table will be used for the aggregate and percentage of triples
  • wmf.wikidata_entity can then be used for the aggregate and percentage values for items
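
For orientation, a minimal sketch of reading both tables in a notebook session (the discovery.wikibase_rdf partition column name "date" and its value format are assumptions; the wmf.wikidata_entity snapshot is the one used later in this task):

import wmfdata as wmf

# Session setup matching what is used later in this task.
spark_session = wmf.spark.create_session(type="yarn-large", app_name="wdqs-subgraph-analysis")

# Assumption: discovery.wikibase_rdf is partitioned by a `date` column.
rdf = spark_session.table("discovery.wikibase_rdf").where("date = '20230717'")

# wmf.wikidata_entity is partitioned by `snapshot` (see the queries further down).
entities = spark_session.table("wmf.wikidata_entity").where("snapshot = '2023-07-24'")

rdf.printSchema()
entities.printSchema()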

Notes and Questions

Things that came up during the completion of this task, questions to be answered and follow up tasks:

Event Timeline

Manuel renamed this task from Size of direct instances of Q13442814 (scholarly article) to [Analytics] Find out the size of direct instances of Q13442814 (scholarly article). Jul 18 2023, 12:47 PM
Manuel created this task.
Manuel updated the task description.

@Manuel, I'm wondering how restrictive we want the relationship to Q13442814 to be in calculating distinct triples? Are we just looking for all such triples where Q13442814 is referenced in the object? All seem to be P31 either directly or via a statement, but I just wanted to check if you had a specific plan for this. From 100 results the only predicates are:

['<http://www.wikidata.org/prop/statement/P31>',  '<http://www.wikidata.org/prop/direct/P31>']

Do both of the above qualify as being scholarly articles?

Hi Andrew! It's not "distinct triples", it's "direct instances"! :)

So the steps would be:

  1. get all "instances of" (P31) Q13442814 (e.g. Q93641011)
  2. for the resulting Items get all triples for these Items (these should consist of e.g. statements for title, author, etc.)

I think the prop/direct URI vs prop/statement URI is related to claims vs statements (see Glossary). But I would suggest you confirm with the Wikidata team what to use here.

Just to clarify:

  • First find all subjects "items" that have a predicate P31 and an object Q13442814
  • Then find all rows in the table that have these subjects
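
For illustration, a minimal sketch of those two steps against discovery.wikibase_rdf (the date partition value is the one used later in this task; the exact P31 predicate URI is settled further down the thread, and, as also noted later, a subject-only join misses shared values and references):

rdf = spark_session.table("discovery.wikibase_rdf").where("date = '20230717'")

# Step 1: subjects that are direct instances of Q13442814 (scholarly article).
sa_subjects = (
    rdf.where("predicate = '<http://www.wikidata.org/prop/direct/P31>'")
    .where("object = '<http://www.wikidata.org/entity/Q13442814>'")
    .select("subject")
    .distinct()
)

# Step 2: all rows in the table whose subject is one of those Items.
sa_triples = rdf.join(sa_subjects, on="subject", how="inner")
sa_triples.count()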

Steps for now:

  • I need to check to see if there are also references and such in discovery.wikibase_rdf
    • There are
  • Query a simple scholarly article and see what's included
    • Looking at Q1074013 (an academic journal article)
  • Also check AKhatun's work on scholarly articles (it's not in the main notebook, but maybe in another)
    • I have access to her scholarly article research notebooks now as well

Based on the discussion in SWE Data, we should be using <http://www.wikidata.org/prop/direct/P31> :)

Based on your notebook (see T337021#9028845) it will not be as easy after all. Only querying for QIDs in subjects would underestimate the size of the subgraph in question. Let's set up a meeting with a Wikidata engineer who can explain the table to us!

@Manuel, could you give a bit more context to "# of Items" above? Is this all distinct Wikidata entities (QIDs and PIDs), or just QIDs? The wmf.wikidata_entity table for this only has those two entity types in it, so if we're looking for other parts of the graph we'll need to look in other places.

Is this all distinct Wikidata entities (QIDs and PIDs), or just QIDs?

This is distinct Wikidata QIDs only.
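
For reference, a minimal sketch of counting distinct QIDs from wmf.wikidata_entity, mirroring the filters used later in this thread:

total_qids = (
    spark_session.table("wmf.wikidata_entity")
    .where("snapshot = '2023-07-24'")
    .where("id LIKE 'Q%'")  # QIDs only, no PIDs
    .select("id")
    .distinct()
    .count()
)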

@Manuel, looking into cases where Q13442814 (scholarly article) is either the subject or object of a triple, it looks like we can verify that the relationships are only being saved in one way as they should be:

shared_predicates = [l for l in subject_is_sa_predicate_labels if l in object_is_sa_predicate_labels]
shared_predicates
# ['cites work', 'main subject']

I'd assume that 'main subject' is shared because some papers are about the example paper itself, which is Q1895685 (Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid).

shared_entities = [l for l in subject_is_sa_object_labels if l in object_is_sa_subject_labels]
shared_entities
# ['scholarly article']

'scholarly article' being shared was a bit concerning for a moment, but only because I coincidentally picked a scientific paper that's a P5869 (model item) of Q13442814 (scholarly article).

I'll still check the aggregate triples where a paper is the object, but focusing only on cases where they're the subject could also be an option.

Thank you Andrew, for clarifying this!

A question to the WDQS Stabilisation Team (@Lydia_Pintscher, @Gehel, @dcausse): We are unclear which count of triples is most relevant to our question at hand. Should we count triples a) only where the "direct instance of scientific paper" is the subject of the triple, or b) also in cases where it is only the object? My initial assumption was a), but we have seen implementations of b) in previous work. I am now wondering whether b) is a better representation of what we need, or not.

I believe that at first we are interested in knowing the number of triples that would be moved out if all items that satisfy the condition ?s wdt:P31 Q13442814 are moved out along with all the triples belonging to these items.
The triples that belong to an entity (e.g. Q1895685) are the ones visible via https://www.wikidata.org/wiki/Special:EntityData/Q1895685.ttl?flavor=dump, with the additional complexity that shared values and references have to be treated separately because they might be shared by other entities.
There you'll notice that, for instance, the triple s:Q1895685-9a482323-4d57-acf2-b6b7-bc36d578bd57 ps:P478 "171" does not reference the QID of the paper, but this triple must be counted as well.
This makes knowing the triples that belong to ?s a bit tricky, but we could leverage the structure of the wikibase_rdf table for this:

  • first count the number of triples that are not shared with other entities using the context column; this column (data not available in WDQS) can help to group the triples by entity
  • then count the number of triples attached to shared references and values; here we should also count the ones that are shared between S and not-S, because these will have to be duplicated in both graphs

Working with the wikibase_rdf table is not entirely obvious, so I started a short notebook to demonstrate how to use it; please see https://people.wikimedia.org/~dcausse/wikibase_rdf_scholarly_articles_example.html. I'd be happy to jump into a meeting to better explain this if that helps.
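
A minimal sketch of the context-based counting described above, assuming that an Item's own (non-shared) triples carry that Item's concept URI in the context column, while shared values and references carry the two wikiba.se ontology contexts; attributing the shared triples to scholarly articles is the tricky part that the linked notebook handles and is only hinted at here:

rdf = spark_session.table("discovery.wikibase_rdf").where("date = '20230717'")

# Concept URIs of Items that are direct instances of scholarly article.
sa_items = (
    rdf.where("predicate = '<http://www.wikidata.org/prop/direct/P31>'")
    .where("object = '<http://www.wikidata.org/entity/Q13442814>'")
    .select("subject")
    .distinct()
)

# 1) Triples not shared with other entities, grouped via the context column.
#    Assumption: the context of an Item's own triples equals its concept URI.
direct_sa_triple_count = (
    rdf.join(sa_items.withColumnRenamed("subject", "context"), on="context", how="inner")
    .count()
)

# 2) Shared values and references live under these two contexts and still need
#    to be attributed to scholarly articles, including the ones also shared with
#    non-scholarly-article Items, which would be duplicated in both graphs.
shared_value_and_ref_triples = rdf.where(
    "context IN ('<http://wikiba.se/ontology#Value>', '<http://wikiba.se/ontology#Reference>')"
)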

Hi @dcausse, thank you so much, this is very helpful! \o/

I believe that at first we are interested in knowing the number of triples that would be moved out

The "number of triples that would be moved out" seems to be the primary metric of interest for the Blazegraph split. But after your explanation of the table, I now realize that this metric produces is not equal to the number of rows in that table that are required to represent these triples in the table, correct? So could you quickly confirm, that the "number of triples that would be moved out" (distinct triples) is actually the preferable metric for our purposes (and not e.g. the "number of rows that would be moved out")?

I'd be happy to jump in a meeting to better explain this if this helps.

Much appreciated! We'll first dig deeper into your and AKhatun's notebooks, and then come back to you about that meeting for a quick review of our approach.

So could you quickly confirm that the "number of triples that would be moved out" (distinct triples) is actually the preferable metric for our purposes (and not e.g. the "number of rows that would be moved out")?

The table wikibase_rdf does have one row per triple and an additional column named context that we use to annotate the entity the triple was extracted from while reading the dump.
The caveats are shared values and references, which have <http://wikiba.se/ontology#Value> and <http://wikiba.se/ontology#Reference> respectively set as their context column.
It is true that duplicates are in there, and a select count(*) from wikibase_rdf will give a number greater than the number of triples stored in Blazegraph.

To my knowledge such duplicates are:

  • shared values and references; these are very likely to have a high number of duplicates, even if Wikibase tries to deduplicate some on the fly while extracting the RDF dump
  • some metadata regarding sitelinks, such as a triple like <https://be-tarask.wikipedia.org/> wikibase:wikiGroup "wikipedia" ., is likely to be duplicated in the wikibase_rdf table. I forgot to mention this in the notebook.
  • possibly some sitelinks might be duplicated in there, as the Wikibase dump is a slow process and duplicated entries are theoretically possible (hopefully rare)

And to answer your question: you are correct, the number of distinct triples is what matters to us, so to get an accurate number you might have to do a distinct(subject, predicate, object) at some point. Thanks!
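
A minimal sketch of the raw-row versus distinct-triple comparison this suggests (20230717 partition assumed, as elsewhere in this task):

rdf = spark_session.table("discovery.wikibase_rdf").where("date = '20230717'")

total_rows = rdf.count()
total_distinct_triples = rdf.select("subject", "predicate", "object").distinct().count()

percent_repeat_triples = round((1 - (total_distinct_triples / total_rows)) * 100, 4)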

@dcausse, a general point on my end is that when I try to run the code that you sent along via the HTML on people.wikimedia.org, I'm getting the following output repeated over and over again during Spark runs:

23/07/31 13:01:58 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 1 for reason Container killed by YARN for exceeding physical memory limits. 4.4 GB of 4.4 GB physical memory used. Consider boosting spark.executor.memoryOverhead.

This seems to be happening because of your create_custom_session setup, and doesn't happen when I do a normal create_session as seen below:

spark_session = wmf.spark.create_session(type='yarn-large', app_name="wdqs-subgraph-analysis")

Would you be able to let me know if there's something in my permissions or setup that's causing this? I'm assuming that your setup will make queries faster, but we can disregard if my working setup gets me mostly there. I'm running Jupyter on stat1005, and saw that AKhatun was using stat1008, in case that's helpful information :)

Also for everyone's information, the share of duplicate triples in discovery.wikibase_rdf is very small, as seen in the following snippet/output:

percent_repeat_triples = round((1 - (total_distinct_triples / total_rows)) * 100, 4)
# (1 - (15,043,046,814 / 15,043,483,216)) * 100
percent_repeat_triples
# 0.0029

Would you be able to let me know if there's something in my permissions or setup that's causing this? I'm assuming that your setup will make queries faster, but we can disregard if my working setup gets me mostly there.

I was probably lucky; I used these settings from another notebook, but using yarn-large is certainly fine. It uses 8G executors + 1G overhead, doubling my settings (cf. https://github.com/wikimedia/wmfdata-python/blob/main/wmfdata/spark.py#L50).
Depending on what you are doing we could ask @JAllemandou once he's back if it's OK to use even more.

Also for everyone's information, the share of duplicate triples in discovery.wikibase_rdf is very small (≈0.0029%).

Good to know, this is definitely a lot lower than I expected, thanks!

Good to know, this is definitely a lot lower than I expected, thanks!

Welcome!

I guess another question that I have and have been discussing with @Manuel a bit: is there a benefit to writing the Spark queries in the Python dot notation (data.select(col("subject").alias("schol_art_QID")..., etc.) vs. writing a multi-line string and passing it to wmfdata.spark.run? I find a well-formatted multi-line query to be much easier to read and explain to stakeholders, and in working a bit with the dot notation there doesn't seem to be much of a speed increase with it, though it would be hard for me to judge this given that commands are being run at times of different server loads.

More generally, are most people at WMF writing Spark pythonically and not with queries? If there's a need for code review I don't want to force people to read something they're not used to, but if there are folks who are writing queries and passing them to spark.run then I might join that camp 😅 Happy to get used to it though if there's even a general benefit or preference :)
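
For concreteness, the two styles being compared might look like this (the column names come from discovery.wikibase_rdf; the filter itself is just illustrative):

from pyspark.sql.functions import col

# 1) PySpark dot notation
sa_items_df = (
    spark_session.table("discovery.wikibase_rdf")
    .where(col("predicate") == "<http://www.wikidata.org/prop/direct/P31>")
    .where(col("object") == "<http://www.wikidata.org/entity/Q13442814>")
    .select(col("subject").alias("schol_art_QID"))
)

# 2) A multi-line HiveQL string run through Spark
sa_items_query = """
SELECT
    subject AS schol_art_QID
FROM
    discovery.wikibase_rdf
WHERE
    predicate = '<http://www.wikidata.org/prop/direct/P31>'
    AND object = '<http://www.wikidata.org/entity/Q13442814>'
"""
sa_items_df_from_sql = spark_session.sql(sa_items_query)

Both forms are compiled by the same Spark SQL optimizer, so any difference is mostly about readability rather than speed.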

are most people at WMF writing spark pythonically and not with queries?

I guess it depends on who you talk to and what they're doing. All of the data scientists/analysts I work with use the Spark SQL engine and write HiveQL queries, often because hive.run is too slow. Occasionally I see dot notation for advanced PySpark usage (e.g. Morten's survey aggregation data pipeline).

I suspect dot notation-based Spark usage is probably more common among software engineers.

Great to hear, @mpopov! I guess the distinction between HiveQL queries run with wmfdata.spark.run for scientists/analysts vs. dot notation for software engineers makes sense. Nice to hear that I'll be at home writing some Hive :)

Checking another concept with you all:

I've already had the /prop/ vs. /prop/direct/ URL distinction explained to me, where in this case for P31 we'll want to be using the latter for the "truthy" values, as I'm seeing in both @dcausse's and AKhatun's work. We also have /prop/statement/ though, and beyond that there's a distinction between /entity/ and /entity/statement/. I see that in this case we're making use of just /entity/, but it would be nice to understand what the respective /statement/ representations are for properties and entities :)

My assumption is that we use "truthy" P31 only to identify which Items are scholarly articles and which are not. For counting the related "number of triples that would be moved out", we will need to use everything there is (not limited to e.g. truthy statements).

More explanations about the different URL paths would still be helpful and appreciated!

  • <http://www.wikidata.org/entity/> (prefix wd) refers to the concept URI of the entity; this is generally how an entity (whether it's a property, item, or lexeme) is identified, e.g. Q42 is identified as wd:Q42 -> <http://www.wikidata.org/entity/Q42>. This is the form that is used to link items to statements and their other constituents (it can be seen as a subject or an object)
  • <http://www.wikidata.org/entity/statement/> (prefix s in the dumps and Special:EntityData, wds in WDQS) is used to identify a Wikibase statement, e.g. wds:q42-D8404CDA-25E4-4334-AF13-A3290BCD9C0F is the identity of the date of birth statement for Q42 (it can be seen as a subject or an object)
  • <http://www.wikidata.org/prop/statement/> (prefix ps) is what actually links the statement ID defined above to its simple value form, so the actual date of birth of Q42 is stored in the triple wds:q42-D8404CDA-25E4-4334-AF13-A3290BCD9C0F ps:P569 ?dateOfBirth (it can only be seen as a predicate)

As you can see there are a lot of indirections to access the date of birth of an item; this is where /prop/direct/ (prefix wdt) becomes helpful, as it links the entity (wd:Q42) directly to its simple value form using the predicate wdt:P569. Of course there are some caveats and hidden details here (this would be too simple otherwise!): to be promoted as a direct statement, a statement must be either the BestRank (only one) or a NormalRank statement if no BestRank is available (multiple are possible here), for instance the occupations of Q42.

For more details please see: https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format.

I hope you won't have to dig too deeply into this structure, but as you noted, some URIs might share the same beginning of the string, so identifying an entity ID simply by doing startswith("<http://www.wikidata.org/entity/") might yield false positives.
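
A quick illustration of that startswith() caveat, using the two prefixes described above and the example statement ID from earlier in this comment:

entity_prefix = "<http://www.wikidata.org/entity/"

concept_uri = "<http://www.wikidata.org/entity/Q42>"
statement_uri = "<http://www.wikidata.org/entity/statement/q42-D8404CDA-25E4-4334-AF13-A3290BCD9C0F>"

print(concept_uri.startswith(entity_prefix))    # True
print(statement_uri.startswith(entity_prefix))  # True as well: a false positive for an entity ID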

Thank you for the information here, @dcausse! Nice to have all this in one place where I can reference it when I need a recap 😊😊

I'll check the following with @Manuel later today, but here are the metrics I'm getting from the 20230717-dated data in discovery.wikibase_rdf (note that I don't have access to later dates given the permission restrictions documented in T342416):

print_num_str_with_commas(total_triples)
# 15,043,483,216

# -- Metrics with values and references that are NOT unique to scholarly articles --
total_sa_triples = total_sa_direct_triples + total_sa_val_triples + total_sa_ref_triples
# 7,188,746,257 + 200,337 + 332,476,964
# Meaning: context_sa + context_val_related_to_sas + context_ref_related_to_sas
print_num_str_with_commas(total_sa_triples)
# 7,521,423,558

percent_sa_triples = round(total_sa_triples / total_triples * 100, 4)
percent_sa_triples
# 49.9979

# -- Metrics with values and references that ARE unique to scholarly articles --
total_only_sa_triples = total_sa_direct_triples + total_only_sa_val_triples + total_only_sa_ref_triples
# 7,188,746,257 + 13,651 + 332,466,067
# Meaning: context_sa + context_val_only_related_to_sas + context_ref_only_related_to_sas
print_num_str_with_commas(total_only_sa_triples)
# 7,521,225,975

percent_only_sa_triples = round(total_only_sa_triples / total_triples * 100, 4)
percent_only_sa_triples
# 49.9966

I did end up using PySpark so I could follow @dcausse's example as well as I could :) Should I upload the finished notebook to people.wikimedia.org?

Aggregations have been added to the task description :) We'll upload the work for this to GitHub or GitLab once we have our repo set up, and I'd be happy to do a call if someone besides @Manuel wants an explanation :) Also happy to put the notebook on people.wikimedia.org for an interim presentation of the work.

It would also be great if someone on the WMF side could help me a bit with PySpark UDFs, as I've been experimenting with them a bit but am running into problems getting them to execute properly.

I'll write some more details of the problems I'm facing tomorrow 😊

@AndrewTavis_WMDE thanks! This is really exciting; we couldn't hope for better results... it's almost a 50-50 split. And on top of that, if I read your results correctly, we only have 197,583 common triples (7,521,423,558 - 7,521,225,975) that will have to be duplicated in both subgraphs.

@dcausse, glad to help :) Maybe doing a call to check all of this might make sense? If you have availability tomorrow I'm basically free, or if not then next week for say 25 min sometime?

@AndrewTavis_WMDE sure! I'll send you an invite for next Monday; in the meantime could you share your notebook somewhere so that I can take a look before the call?

@dcausse, just finished the people.wikimedia.org upload. An HTML for the notebook can be found at:

https://people.wikimedia.org/~andrewtavis-wmde/T342111_spark_sa_subgraph_metrics.html

I tried to generally follow the process you laid out, but did do some variable renames for myself. I hope it all makes sense, and please let me know if there are any questions/concerns about it all.

As far as the Spark UDF issues are concerned, let me just sketch out the process here as it's in a separate notebook from the main one just linked. The general goal in this is to explore using UDFs to easily derive data via the claims column of wmf.wikidata_entity. We can easily find out how many scholarly articles we have via the discovery.wikibase_rdf table as in the example notebook I linked on people.wikimedia.org, but then the goal was to do something similar via wmf.wikidata_entity.claims so I can have a claims exploration example to work from later :)

I've made major progress on this this morning, but some new questions have come up. The initial problem I was facing was that I thought UDFs could reference local functions, which doesn't seem to be the case. I had a function that was being called recursively, and because of this it needed to be its own function, but calling it in the UDF was then returning null. As soon as I defined the recursive function within the UDF itself it was fine :)

Now onto something that's very much confusing me. Say that we have the following query that's referencing the wmf.wikidata_entity for Q1895685 (Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid):

sa_preview_query = """
SELECT
    *

FROM 
    wmf.wikidata_entity

WHERE 
    snapshot = '2023-07-24'
    AND id = 'Q1895685'
"""

Getting a single row from this table for testing purposes is faster with Presto, but then that ended up causing the next problem in trying to make a UDF... The base version of the function I had written was working on the table I'd gotten from Presto, but then it wasn't working when I used it in Spark. The reason for this is that if I first use Presto with the query above then the claims output is a list of lists, but if I use Spark purely then the output is a dictionary:

df_sa_preview = wmf.presto.run(
    commands=sa_preview_query
)
# claims = [[Q...]...]

df_sa_preview_spark = (
    spark.table("wmf.wikidata_entity")
    .where("snapshot = '2023-07-24'")
    .where("id = 'Q1895685'")
    .alias("df_sa_preview_spark")
)
# claims = [{Q...}...]

This is teaching me to definitely always test UDFs solely within a Spark context, but I'm confused why the column outputs are different. Given that Spark is returning a more traditional JSON/dictionary structure I'm assuming that the change is happening with how Presto outputs data?

The UDF is up and running now, but we may need to discuss my limits as running what I'd assume to be a fairly simple UDF over wmf.wikidata_entity wasn't finishing (@dcausse, @JAllemandou). Even if it does finish, I'm fairly regularly getting:

WARN TaskSetManager: Lost task 624.0 in stage 49.0 (TID 8638) (an-worker1114.eqiad.wmnet executor 1519): ExecutorLostFailure (executor 1519 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding physical memory limits. 8.8 GB of 8.8 GB physical memory used. Consider boosting spark.executor.memoryOverhead.

Maybe there are ways for me to optimize this though, as I'm just learning all this through this task. Here's what I've got for checking claims for whether an entity is a scholarly article (written this way as UDFs apparently return StringType by default, and I didn't want to fool around more to get a boolean):

from pyspark.sql import functions as F
from pyspark.sql.functions import col


def check_if_sa(claims):
    """
    Check to see if an entity is a scholarly article via the SA wikibase-entityid.
    """
    if claims is not None:
        if '{"entity-type":"item","numeric-id":13442814,"id":"Q13442814"}' in f"{claims}":
            return "SA"
        else:
            return "Not SA"

    else:
        return "Not SA"

spark.udf.register("check_if_sa", check_if_sa)
udf_check_if_sa = F.udf(lambda z: check_if_sa(z))

df_wikidata_qid_entities = (
    spark.table("wmf.wikidata_entity")
    .where("snapshot = '2023-07-24'")
    .where("id LIKE 'Q%'")
    .alias("df_wikidata_entity")
)

sa_or_not = (
    df_wikidata_qid_entities.select(udf_check_if_sa(col("claims")).alias("sa_or_not"))
)

# .collect() already returns a list of Rows, so no .show() is needed afterwards.
sa_or_not.limit(100).groupBy("sa_or_not").count().collect()
# [Row(sa_or_not='SA', count=53), Row(sa_or_not='Not SA', count=47)]

The last line has a limit of 100 given that without it it won't finish. Happy to discuss this a bit on Monday as well, @dcausse 😊

I suspect that because the claims field is an array of complex types it can potentially be huge, and generating its string representation using f"{claims}" might cause excessive memory usage and is, I believe, a very slow operation.
I would look into ways to avoid having to serialize it as a string: either iterate over the object representation (I suspect a Row?) to do your filtering, or possibly ask Hive to do a lateral view with the mainSnak (which is, I believe, what you're looking for?):

select id, claims_ex.mainSnak.property, claims_ex.mainSnak.dataValue.value
from wmf.wikidata_entity lateral view explode(claims) claims_explode as claims_ex
where snapshot = '2023-07-24' AND claims_ex.mainSnak.property = 'P31' limit 10;
OK
id	property	value
Q38488724	P31	{"entity-type":"item","numeric-id":13442814,"id":"Q13442814"}
Q37619467	P31	{"entity-type":"item","numeric-id":13442814,"id":"Q13442814"}
Q38738598	P31	{"entity-type":"item","numeric-id":13442814,"id":"Q13442814"}
Q37797268	P31	{"entity-type":"item","numeric-id":13442814,"id":"Q13442814"}
Q38708632	P31	{"entity-type":"item","numeric-id":13442814,"id":"Q13442814"}
Q37781259	P31	{"entity-type":"item","numeric-id":13442814,"id":"Q13442814"}
Q39051969	P31	{"entity-type":"item","numeric-id":13442814,"id":"Q13442814"}
Q37373175	P31	{"entity-type":"item","numeric-id":13442814,"id":"Q13442814"}
Q38327391	P31	{"entity-type":"item","numeric-id":5,"id":"Q5"}
Q37598817	P31	{"entity-type":"item","numeric-id":13442814,"id":"Q13442814"}

Then adding yet another filter on claims_ex.mainSnak.dataValue.value = '{"entity-type":"item","numeric-id":13442814,"id":"Q13442814"}' should work.

If you're looking at all possible places where Q13442814 could be used inside a claim (not only as the mainSnak), then I'm not sure how to do that without unfolding this deep data structure.
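
For the mainSnak-only case, a minimal sketch of the full count with the suggested filter, run through the Spark session used elsewhere in this thread (counting distinct ids is an assumption about the exact metric wanted):

sa_count_query = """
SELECT
    COUNT(DISTINCT id) AS total_sa_qids

FROM
    wmf.wikidata_entity
    LATERAL VIEW EXPLODE(claims) claims_explode AS claims_ex

WHERE
    snapshot = '2023-07-24'
    AND claims_ex.mainSnak.property = 'P31'
    AND claims_ex.mainSnak.dataValue.value = '{"entity-type":"item","numeric-id":13442814,"id":"Q13442814"}'
"""

spark_session.sql(sa_count_query).show()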

Thanks, @dcausse! Really appreciate the detailed explanation :) I totally agree that serializing the full claim would be problematic, and that your method is much better. Need a bit more practice with lateral view explode so that it becomes more natural for me to use. I'll implement the above at the start of the week and see if it works properly 😊

The above LATERAL VIEW EXPLODE method came up with 40,529,640 scholarly articles via the claims, @dcausse. I think that that's close enough to the amount from discovery.wikibase_rdf that we don't need to dig more into expanding the WHERE clause :) Thanks again for your help!

For the call we have later today - just as a reminder to myself to ask you as well - let me know if you have an idea for why the claims column output is different based on Presto or Spark being used.

Notes from the call that @dcausse and I had:

  • Notebook seems alright to him, so I'm moving this into review
  • Our plan of doing the same subclasses as AKhatun as well as the direct subclasses also made sense, so work will progress as planned in T342123
  • With regards to the Presto vs. Spark outputs of nested columns, he suggested that Presto output might also be referenceable in a dictionary-like fashion, but it might just not appear so at first glance in the Pandas output.
    • In testing this a bit I found that Presto does have the UNNEST function that has a similar use case to LATERAL VIEW EXPLODE
    • The following allows us to get sub-entries within the claims column of wmf.wikidata_entity:
SELECT
    a, b, c, d, e, f, g

FROM 
    wmf.wikidata_entity

CROSS JOIN 
    UNNEST(claims) AS claims_explode (a, b, c, d, e, f, g)

WHERE 
    snapshot = '2023-07-24'
    AND id = 'Q1895685'

It doesn't appear to allow for a direct dictionary-like key to value reference system, but does allow for arrays to be unnested and assigned to output columns. Might be of use in the future 😊

Our work is done here, and we found an ideal 50-50 split! \o/

Thank you again, @dcausse, for all of your support! We learned a lot from this!