
🧬 Propagate provenance of scholarly vs main graph queries to processed table
Closed, Resolved (Public)

Description

We should propagate information about which "graph" the queries have been run against to the processed queries table.

This will require altering the QueriesProcessor, similar to T396002.

We believe that all that is required is to propagate the graph_name column from event.wdqs_external_sparql_query to discovery.processed_external_sparql_query. The graph_name column appears to take the values wikidata_main, scholarly_articles, and wikidata_full.
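
As a rough illustration of the propagation described above, the sketch below carries graph_name from a raw event onto a processed row without modifying it. It is plain Python rather than the actual QueriesProcessor code; the function name process_event and the query field are hypothetical, and only the graph_name column and its three observed values come from this task.

```python
from typing import Optional

# Values the task description reports for graph_name.
KNOWN_GRAPH_NAMES = {"wikidata_main", "scholarly_articles", "wikidata_full"}


def process_event(event: dict) -> dict:
    """Hypothetical stand-in for one processing step.

    Copies graph_name from the raw event onto the processed row unchanged.
    Events without a graph_name yield None, which would surface as NULL
    in the processed table.
    """
    graph_name: Optional[str] = event.get("graph_name")
    return {
        "query": event.get("query"),  # hypothetical field name
        "graph_name": graph_name,
    }


def is_known_graph(name: Optional[str]) -> bool:
    """True when the value is one of the graph names listed in the task."""
    return name in KNOWN_GRAPH_NAMES


row = process_event({"query": "SELECT 1", "graph_name": "scholarly_articles"})
print(row["graph_name"], is_known_graph(row["graph_name"]))
```

Passing the value through untouched, instead of mapping or defaulting it, keeps rows from before the change distinguishable because their graph_name stays empty.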

Details

Related Changes in Gerrit:
Related Changes in GitLab:
Title	Reference	Author	Source Branch	Dest Branch
search: Update rdf-spark-tools to 0.3.158	repos/data-engineering/airflow-dags!1564	ebernhardson	work/ebernhardson/rdf-spark-tools-159	main
search: add graph_name column to process_sparql_query table	repos/data-engineering/airflow-dags!1557	andrew-wmde	search-sparql-graph-name-column	main

Event Timeline

Ollie.Shotton_WMDE renamed this task from "propagate provenance of scholarly vs main graph queries to processed table" to "🧬 Propagate provenance of scholarly vs main graph queries to processed table". Jun 26 2025, 2:19 PM

Change #1169148 had a related patch set uploaded (by Andrew-WMDE; author: Andrew-WMDE):

[wikidata/query/rdf@master] rdf-spark-tools: populate graph_name column

https://gerrit.wikimedia.org/r/1169148

Change #1169148 merged by jenkins-bot:

[wikidata/query/rdf@master] rdf-spark-tools: populate graph_name column

https://gerrit.wikimedia.org/r/1169148

Looks to be working as expected: graph_name data started to arrive on July 15, 2025, from hour 9 onward. From that hour on, every row has a graph_name (count(graph_name) equals count(1)):

spark.sql("""
    select hour, count(1), count(graph_name)
    from discovery.processed_external_sparql_query
    where year=2025 and month=7 and day=15
    group by hour
""").toPandas().set_index('hour').sort_index()
hour	count(1)	count(graph_name)
0	484619	0
1	488951	0
2	503897	0
3	491695	0
4	503879	0
5	609667	0
6	900310	0
7	590499	0
8	643976	0
9	640911	640911
10	709146	709146
11	580588	580588
12	469506	469506
13	494958	494958
14	531712	531712
15	547528	547528
16	469200	469200
17	481168	481168
18	504282	504282
19	523301	523301
20	556029	556029
21	505686	505686
22	513512	513512
23	502065	502065
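
The completeness check read off the table above can also be done programmatically. The following is a minimal sketch in plain Python, using a handful of rows copied from that output; the helper names first_populated_hour and fully_populated_from are illustrative and not part of any existing tooling.

```python
# (hour, count(1), count(graph_name)) copied from the output above;
# only a subset of hours around the cutover is included.
rows = [
    (7, 590499, 0),
    (8, 643976, 0),
    (9, 640911, 640911),
    (10, 709146, 709146),
    (23, 502065, 502065),
]


def first_populated_hour(rows):
    """Return the earliest hour with any non-null graph_name, or None."""
    populated = [hour for hour, _total, named in rows if named > 0]
    return min(populated) if populated else None


def fully_populated_from(rows, start_hour):
    """True when every row at or after start_hour has a graph_name."""
    return all(total == named for hour, total, named in rows if hour >= start_hour)


cutover = first_populated_hour(rows)
print(cutover)                              # 9
print(fully_populated_from(rows, cutover))  # True
```

The comparison relies on count(graph_name) counting only non-null values, so equality with count(1) means no row in that hour is missing a graph name.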

Amazing! Thank you @EBernhardson for your help with this.