
Productionize Wikidata subgraph analysis
Closed, Resolved · Public

Description

As a Data Analyst for Wikidata/WDQS, I would like the metrics from the subgraph analysis done in T293628 to be evaluated periodically and stored over time for further analysis, and so that anyone can access the analysis results without having to redo the analysis from scratch.

This ticket covers productionizing:

  • subgraph mapping to items and triples
  • subgraph metrics: subgraph size, number of items, predicate usage, etc.
  • query mapping to subgraph
  • subgraph query metrics: queries per subgraph, UA distribution, query time distribution, item/predicate usage, etc.

List of all possible metrics: metrics-list

Details

Repo | Branch | Lines +/-
wikimedia/discovery/analytics | master | +0 -30
wikimedia/discovery/analytics | master | +3 -3
wikimedia/discovery/analytics | master | +260 -20
wikimedia/discovery/analytics | master | +0 -29
wikimedia/discovery/analytics | master | +2 -2
wikimedia/discovery/analytics | master | +4 -4
wikidata/query/rdf | master | +4 -4
wikimedia/discovery/analytics | master | +5 -0
wikimedia/discovery/analytics | master | +4 -7
wikimedia/discovery/analytics | master | +37 -5
wikimedia/discovery/analytics | master | +129 -72
wikimedia/discovery/analytics | master | +2 -2
wikidata/query/rdf | master | +3 -3
wikimedia/discovery/analytics | master | +10 -4
wikimedia/discovery/analytics | master | +1 K -3
wikidata/query/rdf | master | +37 -26
wikidata/query/rdf | master | +302 -0
wikidata/query/rdf | master | +1 K -11
wikidata/query/rdf | master | +4 K -1
wikidata/query/rdf | master | +3 K -135
wikidata/query/rdf | master | +47 -39
wikidata/query/rdf | master | +58 -56

Event Timeline


Change 771077 had a related patch set uploaded (by AKhatun; author: AKhatun):

[wikidata/query/rdf@master] [WIP] Productionize subgraph analysis metrics

https://gerrit.wikimedia.org/r/771077

Change 780888 had a related patch set uploaded (by AKhatun; author: AKhatun):

[wikidata/query/rdf@master] Reorganize rdf-spark-tools submodule

https://gerrit.wikimedia.org/r/780888

Change 780888 merged by jenkins-bot:

[wikidata/query/rdf@master] Reorganize rdf-spark-tools submodule

https://gerrit.wikimedia.org/r/780888

Change 787064 had a related patch set uploaded (by AKhatun; author: AKhatun):

[wikidata/query/rdf@master] Unit tests for subgraph mapping

https://gerrit.wikimedia.org/r/787064

Change 800599 had a related patch set uploaded (by AKhatun; author: AKhatun):

[wikidata/query/rdf@master] Unit tests for subgraph query mapping

https://gerrit.wikimedia.org/r/800599

Change 802506 had a related patch set uploaded (by AKhatun; author: AKhatun):

[wikidata/query/rdf@master] Add spark-testing-base dependency

https://gerrit.wikimedia.org/r/802506

Change 802506 merged by jenkins-bot:

[wikidata/query/rdf@master] Add spark-testing-base dependency

https://gerrit.wikimedia.org/r/802506

Change 803492 had a related patch set uploaded (by AKhatun; author: AKhatun):

[wikidata/query/rdf@master] Unit tests for subgraph analysis metrics

https://gerrit.wikimedia.org/r/803492

Change 807977 had a related patch set uploaded (by AKhatun; author: AKhatun):

[wikimedia/discovery/analytics@master] Airflow dags to generate subgraph and query mapping and their metrics

https://gerrit.wikimedia.org/r/807977

Change 771077 merged by jenkins-bot:

[wikidata/query/rdf@master] Productionize subgraph analysis metrics

https://gerrit.wikimedia.org/r/771077

Change 808977 had a related patch set uploaded (by AKhatun; author: AKhatun):

[wikidata/query/rdf@master] Update subgraph table partitions

https://gerrit.wikimedia.org/r/808977

Change 787064 merged by jenkins-bot:

[wikidata/query/rdf@master] Unit tests for subgraph mapping

https://gerrit.wikimedia.org/r/787064

Change 800599 merged by jenkins-bot:

[wikidata/query/rdf@master] Unit tests for subgraph query mapping

https://gerrit.wikimedia.org/r/800599

Change 803492 merged by jenkins-bot:

[wikidata/query/rdf@master] Unit tests for subgraph analysis metrics

https://gerrit.wikimedia.org/r/803492

Change 808977 merged by jenkins-bot:

[wikidata/query/rdf@master] Update subgraph table partitions

https://gerrit.wikimedia.org/r/808977

Change 807977 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] Airflow dags to generate subgraph and query mapping and their metrics

https://gerrit.wikimedia.org/r/807977

The airflow patch is deployed but I only turned on the *_init dags and subgraph_mapping_weekly today (ran out of time, will do the rest tomorrow).

subgraph_mapping_weekly failed the first time through. I updated executor memory from 8g to 12g but the second execution is still failing. Something is quite unbalanced about topSubgraphItems: of the 8 shards, inputs vary from 100 MB to 450 MB, giving execution times of ~30 s on the small ones and ~8 min before the final one fails.

Not specifically related to this patch, but I wonder if we could change the SparkUtils.saveTables method to take parameters in the path that specify coalesce vs. repartition and the number of partitions to save with, so that we only have to update the airflow invocation, and not the jar as well, to test variations there.

> The airflow patch is deployed but I only turned on the *_init dags and subgraph_mapping_weekly today (ran out of time, will do the rest tomorrow).
>
> subgraph_mapping_weekly failed the first time through. I updated executor memory from 8g to 12g but the second execution is still failing. Something is quite unbalanced about topSubgraphItems: of the 8 shards, inputs vary from 100 MB to 450 MB, giving execution times of ~30 s on the small ones and ~8 min before the final one fails.
>
> Not specifically related to this patch, but I wonder if we could change the SparkUtils.saveTables method to take parameters in the path that specify coalesce vs. repartition and the number of partitions to save with, so that we only have to update the airflow invocation, and not the jar as well, to test variations there.

Should we have params called coalesce and repartition that default to false, and when true use num_partitions to coalesce or repartition accordingly?

Edit: I realize all arg classes that need to coalesce or repartition will need these params added.

Update:
I tested a few options on the statbox. I am not sure how well this represents the prod env, but here goes:

  • coalesce + 8G driver memory = failed as identified by Erik (SparkOutOfMemoryError at topSubgraphItems, application_1655808530211_109990)
  • coalesce + 16G driver memory = failed (SparkOutOfMemoryError at topSubgraphItems, application_1655808530211_110190)
  • repartition + 8G driver memory = failed (Reason: Executor heartbeat timed out after 176110 ms, application_1655808530211_110236)
  • repartition + 16G driver memory = failed (Reason: Executor heartbeat timed out after 159925 ms, application_1655808530211_110343)
  • repartition + 16G driver memory + 16G executor memory = failed (Reason: Executor heartbeat timed out after 145549 ms, application_1655808530211_110430)

We still need to figure out the exact place that causes the OOM.

EDIT: Before adding coalesce or repartition, 8G-8G succeeded in 30 minutes.

>> The airflow patch is deployed but I only turned on the *_init dags and subgraph_mapping_weekly today (ran out of time, will do the rest tomorrow).
>>
>> subgraph_mapping_weekly failed the first time through. I updated executor memory from 8g to 12g but the second execution is still failing. Something is quite unbalanced about topSubgraphItems: of the 8 shards, inputs vary from 100 MB to 450 MB, giving execution times of ~30 s on the small ones and ~8 min before the final one fails.
>>
>> Not specifically related to this patch, but I wonder if we could change the SparkUtils.saveTables method to take parameters in the path that specify coalesce vs. repartition and the number of partitions to save with, so that we only have to update the airflow invocation, and not the jar as well, to test variations there.
>
> Should we have params called coalesce and repartition that default to false, and when true use num_partitions to coalesce or repartition accordingly?
>
> Edit: I realize all arg classes that need to coalesce or repartition will need these params added.

In this case I was thinking that we could treat the string provided over the command line as a specification for how/where to store things, and include named parameters in it. For example, right now we provide:

--all-subgraphs-table discovery.wikibase_rdf/date=20220620/wiki=wikidata

What if instead we could provide (syntax to be bikeshedded):

--all-subgraphs-table discovery.wikibase_rdf/date=20220620/wiki=wikidata;repartition=42

This would have the downside that read and write would have different syntaxes, and we would have to know which to use where; maybe there are better options. Mostly I'm pondering ideas for making the things we know might need modification easier to change. There are probably other ways to magic parameters into various places in the JVM world; this is just a first guess.
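To make the idea concrete, here is a rough sketch of how such a suffix could be parsed (the names and the exact syntax are hypothetical, to be bikeshedded along with the rest):

case class WriteSpec(tableSpec: String, options: Map[String, String])

// Hypothetical parser for a write spec such as
//   discovery.wikibase_rdf/date=20220620/wiki=wikidata;repartition=42
// Everything after ';' is read as named write options (e.g. repartition=42);
// the part before it is the table/partition spec the code already understands.
def parseWriteSpec(raw: String): WriteSpec =
  raw.split(";", 2) match {
    case Array(table) => WriteSpec(table, Map.empty)
    case Array(table, opts) =>
      val parsed = opts.split(",").map { kv =>
        val Array(k, v) = kv.split("=", 2)
        k -> v
      }.toMap
      WriteSpec(table, parsed)
  }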

I tried a run with the three coalesces in SubgraphMapper converted into repartitions. In this case, instead of having 8 partitions where 7 finish and the 8th takes forever and then fails, it has 200 partitions where 199 finish and the 200th takes forever and then fails. This looks like a case of join skew: the dataset is partitioned on the join condition (rather than randomly), and a specific part of the join has significantly more values to work through than anything else. To get an idea of how significant the skew is, I doubled the RAM again (to 24g) in the hope that it would eventually complete and give some stats. The final stats are as follows, clearly showing significant skew:

Metric | Min | 25th percentile | Median | 75th percentile | Max
Duration | 1 s | 1 s | 2 s | 2 s | 4.1 min
Scheduler Delay | 6 ms | 19 ms | 21 ms | 26 ms | 34 ms
Task Deserialization Time | 37 ms | 61 ms | 77 ms | 0.1 s | 0.2 s
GC Time | 0 ms | 16 ms | 23 ms | 48 ms | 2.6 min
Result Serialization Time | 0 ms | 0 ms | 0 ms | 0 ms | 1 ms
Getting Result Time | 0 ms | 0 ms | 0 ms | 0 ms | 0 ms
Peak Execution Memory | 128.8 MB | 194.3 MB | 196.3 MB | 200.3 MB | 5.6 GB
Shuffle Read Blocked Time | 0 ms | 3 ms | 5 ms | 64 ms | 0.3 s
Shuffle Read Size / Records | 1469.5 KB / 35062 | 2.5 MB / 87982 | 3.1 MB / 133528 | 5.0 MB / 258108 | 406.2 MB / 38467392
Shuffle Remote Reads | 1433.7 KB | 2.5 MB | 3.1 MB | 4.9 MB | 398.5 MB
Shuffle Write Size / Records | 0.0 B / 0 | 184.5 KB / 18106 | 827.2 KB / 72252 | 2.5 MB / 195511 | 404.2 MB / 38411863

Resolving the skew, on the other hand, is a harder problem. Spark 3 added a new skew-join optimization, and I've heard that some other teams have Spark 3 working in our cluster, but I haven't played with it at all yet. Will look into this more and see what solutions can be found. In terms of the exact code causing this, Spark is terrible at telling us exactly where, but trying to infer from the Spark UI output I think it's this join:

def getTopSubgraphItems(topSubgraphs: DataFrame): DataFrame = {
  // Keep only P31 (instance of) triples, then right-join against the top
  // subgraphs; the distribution of items per subgraph is heavily skewed.
  wikidataTriples
    .filter(s"predicate='<$p31>'")
    .selectExpr("object as subgraph", "subject as item")
    .join(topSubgraphs.select("subgraph"), Seq("subgraph"), "right")
}

I'll probably need to recreate some of this in a JupyterLab notebook to look at the actual data in the middle of the computations and see what exactly is on the skewed side of the dataset. Alternatively, it's not the end of the world to let this use more memory and ignore the skew.

And I suppose this is also only the first skewed join in the execution; there may be more later in the computations.
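For reference, a minimal sketch of what a manual skew-join mitigation (key salting) could look like; this illustrates an inner join with hypothetical names, not the exact right join SubgraphMapper performs:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, explode, lit, rand}

// Spread a hot join key across `numSalts` tasks: salt the skewed side with a
// random suffix, replicate the small side across all suffixes, and join on
// (key, salt) so no single task processes the entire hot key.
def saltedJoin(skewed: DataFrame, small: DataFrame, key: String, numSalts: Int): DataFrame = {
  val salted   = skewed.withColumn("salt", (rand() * numSalts).cast("int"))
  val exploded = small.withColumn("salt", explode(array((0 until numSalts).map(lit): _*)))
  salted.join(exploded, Seq(key, "salt")).drop("salt")
}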

Change 812075 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikimedia/discovery/analytics@master] Tune subgraph_mapping_weekly based on first prod run

https://gerrit.wikimedia.org/r/812075

Change 812075 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] Tune subgraph_mapping_weekly based on first prod run

https://gerrit.wikimedia.org/r/812075

Change 812133 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikidata/query/rdf@master] Switch SubgraphMapper from coalesce to repartition

https://gerrit.wikimedia.org/r/812133

Change 812133 merged by jenkins-bot:

[wikidata/query/rdf@master] Switch SubgraphMapper from coalesce to repartition

https://gerrit.wikimedia.org/r/812133

Change 812143 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikimedia/discovery/analytics@master] Update rdf-spark-tools to 0.3.112

https://gerrit.wikimedia.org/r/812143

Change 812143 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] Update rdf-spark-tools to 0.3.112

https://gerrit.wikimedia.org/r/812143

Stats on the final join building topSubgraphTriples. This is using 4096 partitions and repartition(). It works for now, so it's probably not worth dealing with the skew, but these stats might be useful to compare against in the future if it starts failing:

Metric | Min | 25th percentile | Median | 75th percentile | Max
Duration | 15 s | 46 s | 54 s | 1.0 min | 9.2 min
Scheduler Delay | 2 ms | 3 ms | 3 ms | 4 ms | 0.4 s
Task Deserialization Time | 1 ms | 2 ms | 2 ms | 3 ms | 0.7 s
GC Time | 27 ms | 0.1 s | 0.2 s | 0.3 s | 41 s
Result Serialization Time | 0 ms | 0 ms | 0 ms | 0 ms | 1 ms
Getting Result Time | 0 ms | 0 ms | 0 ms | 0 ms | 0 ms
Peak Execution Memory | 2.1 GB | 2.1 GB | 2.1 GB | 2.1 GB | 13.6 GB
Shuffle Read Blocked Time | 0 ms | 23 s | 32 s | 38 s | 2.1 min
Shuffle Read Size / Records | 263.2 MB / 3156075 | 269.9 MB / 3235843 | 271.6 MB / 3256300 | 273.4 MB / 3277774 | 30.5 GB / 414401248
Shuffle Remote Reads | 255.2 MB | 264.1 MB | 266.1 MB | 268.0 MB | 29.7 GB
Shuffle Write Size / Records | 340.9 MB / 3184514 | 351.8 MB / 3281889 | 354.4 MB / 3305742 | 357.0 MB / 3330833 | 367.5 MB / 3438583
Shuffle spill (memory) | 0.0 B | 0.0 B | 0.0 B | 0.0 B | 98.1 GB
Shuffle spill (disk) | 0.0 B | 0.0 B | 0.0 B | 0.0 B | 28.2 GB

Summary of what was done so far to deploy:

  • Tuned subgraph_mapping_weekly: set spark parallelism to 4096, increased memory to 24G (= 6g per task), and reduced the total executor count to keep total memory usage around 1TB. Changed coalesce() into repartition() in SubgraphMapper. It now completes without any failed tasks. This might be a bit wasteful of memory, but it's probably not worth tuning unless there are complaints, and we can hope a later upgrade to Spark 3 with its skew-join optimization will improve things (see the sketch after this list). We could manually implement the same skew-join optimization on a per-use-case basis (as sketched earlier), but it's extra work that might not be necessary.
  • Enabled subgraph_metrics_weekly. Ran without issue.
  • This patch added a number of new sensors. We've been intending to switch sensors from mode=poke to mode=reschedule, and adding these new sensors reminded me why we needed that change (all airflow executors were tied up waiting for data to arrive). Deployed a patch to switch everything over.
  • Enabled subgraph_query_mapping_daily. This started waiting for snapshot=20220613 (the previous Monday) with an execution_date of 20220620 (also a Monday). I suspect we should adjust this to target snapshot=20220620, but I'm waiting for confirmation. Turned it back off so it doesn't time out and complain.
  • Enabled subgraph_query_metrics_daily. This is waiting for event.wdqs_external_sparql_query/datacenter=eqiad/year=2022/month=6/day=20 (and the same for codfw), but it needs to wait on the individual hourly partitions. I hadn't thought this fully through when reviewing the patch; we will need to adjust the sensor to use HivePartitionRangeSensor, which can generate all the intermediate hourly named partitions. Turned it back off, as it's also waiting for outputs of subgraph_query_mapping_daily (IIUC), which is currently turned off.
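As a note on the Spark 3 option mentioned in the first bullet: the skew-join handling there is part of adaptive query execution and is configuration only. A sketch, assuming a Spark 3 cluster (which we do not have today):

import org.apache.spark.sql.SparkSession

// Spark 3's adaptive query execution detects skewed join partitions and
// splits them into smaller tasks, which would cover the topSubgraphItems join.
val spark = SparkSession.builder()
  .appName("subgraph-mapping")
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.adaptive.skewJoin.enabled", "true")
  .getOrCreate()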

> In terms of the exact code causing this, Spark is terrible at telling us exactly where, but trying to infer from the Spark UI output I think it's this join:
>
> def getTopSubgraphItems(topSubgraphs: DataFrame): DataFrame = {
>   wikidataTriples
>     .filter(s"predicate='<$p31>'")
>     .selectExpr("object as subgraph", "subject as item")
>     .join(topSubgraphs.select("subgraph"), Seq("subgraph"), "right")
> }

This is exactly the code that finds the top subgraphs. And yes, the data is definitely heavily skewed; that is the nature of Wikidata, and anything we do on Wikidata by subgraph is going to run into similar issues. For reference, half of Wikidata sits under one single subgraph, while the other half spreads across hundreds of subgraphs. We might need to start considering Spark 3.

> And I suppose this is also only the first skewed join in the execution; there may be more later in the computations.

Unfortunately, yes. subgraph_query_mapping is going to be another big feat, I believe; it has similar joins and writes data daily. But we will see.

> • Enabled subgraph_query_mapping_daily. This started waiting for snapshot=20220613 (the previous Monday) with an execution_date of 20220620 (also a Monday). I suspect we should adjust this to target snapshot=20220620, but I'm waiting for confirmation. Turned it back off so it doesn't time out and complain.

It is correct to look for data from the previous Monday, because the data for 20220620 actually gets populated the following Friday. If the job ran against the current date, it wouldn't find Monday's data on that same day. All of this maneuvering is because the input data is both weekly and daily, so every day the job looks for data from the previous Monday.

This makes me wonder whether the same should be done for subgraph_mapping_weekly, since it looks for 20220620 on that same day even though the data will only be populated the following Friday. That job runs weekly, the same cadence as its input data.

EDIT: I just realized this issue will continue to occur for daily runs between Mondays and Fridays, when the sensor looks for the previous Monday's data but cannot find it until the following Friday. Let's talk about these; I don't have a solution right now.
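To make the date arithmetic concrete, a small sketch of the "previous Monday" lookup the daily job effectively performs (a hypothetical helper, not the actual sensor code):

import java.time.{DayOfWeek, LocalDate}
import java.time.temporal.TemporalAdjusters

// Resolve the weekly snapshot a daily execution waits on: the Monday strictly
// before the execution date. An execution on Monday 2022-06-20 therefore
// waits on snapshot=20220613, matching the behavior described above.
def snapshotFor(executionDate: LocalDate): LocalDate =
  executionDate.`with`(TemporalAdjusters.previous(DayOfWeek.MONDAY))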

> • Enabled subgraph_query_metrics_daily. This is waiting for event.wdqs_external_sparql_query/datacenter=eqiad/year=2022/month=6/day=20 (and the same for codfw), but it needs to wait on the individual hourly partitions. I hadn't thought this fully through when reviewing the patch; we will need to adjust the sensor to use HivePartitionRangeSensor, which can generate all the intermediate hourly named partitions. Turned it back off, as it's also waiting for outputs of subgraph_query_mapping_daily (IIUC), which is currently turned off.

Attempting this.
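For illustration, the hourly partitions the range sensor has to cover for a single day look like this (a sketch of the partition naming only; the real list is generated by HivePartitionRangeSensor):

// One partition spec per datacenter and hour for 2022-06-20; the old sensor
// waited on a single day-level partition that does not exist in this table.
val hourlyPartitions = for {
  dc   <- Seq("eqiad", "codfw")
  hour <- 0 until 24
} yield s"event.wdqs_external_sparql_query/datacenter=$dc/year=2022/month=6/day=20/hour=$hour"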

Change 812304 had a related patch set uploaded (by AKhatun; author: AKhatun):

[wikimedia/discovery/analytics@master] Reconsider sensor data dates and use hive range sensor

https://gerrit.wikimedia.org/r/812304

Change 812304 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] Reconsider sensor data dates and use hive range sensor

https://gerrit.wikimedia.org/r/812304

Change 812927 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikimedia/discovery/analytics@master] subgraph: Use HivePartitionRangeSensor to wait for sparql queries

https://gerrit.wikimedia.org/r/812927

Change 812927 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] subgraph: Use HivePartitionRangeSensor to wait for sparql queries

https://gerrit.wikimedia.org/r/812927

Change 812936 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikimedia/discovery/analytics@master] Remove external queries from wait_for_data

https://gerrit.wikimedia.org/r/812936

Change 812936 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] Remove external queries from wait_for_data

https://gerrit.wikimedia.org/r/812936

Change 812942 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikimedia/discovery/analytics@master] subgraph_query_mapping_daily: Increase partitioning to 2048

https://gerrit.wikimedia.org/r/812942

Change 812942 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] subgraph_query_mapping_daily: Increase partitioning to 2048

https://gerrit.wikimedia.org/r/812942

Change 812970 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikidata/query/rdf@master] Switch SubgraphQueryMapper from coalesce to repartition

https://gerrit.wikimedia.org/r/812970

Change 812970 merged by jenkins-bot:

[wikidata/query/rdf@master] Switch SubgraphQueryMapper from coalesce to repartition

https://gerrit.wikimedia.org/r/812970

Change 813190 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikimedia/discovery/analytics@master] subgraph_and_query_mapping: Increase memory to 12g, use repartition

https://gerrit.wikimedia.org/r/813190

Change 813190 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] subgraph_and_query_mapping: Increase memory to 12g, use repartition

https://gerrit.wikimedia.org/r/813190

Change 813334 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikimedia/discovery/analytics@master] subgraph_and_query_metrics: Drop wiki from sparql event partition spec

https://gerrit.wikimedia.org/r/813334

Change 813334 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] subgraph_and_query_metrics: Drop wiki from sparql event partition spec

https://gerrit.wikimedia.org/r/813334

All dags are now enabled, and each has completed at least one full execution.

  • Increased the partition count on map_subgraph_queries to 2048; the largest shuffle is ~600GB, and this brings the per-partition work down into the desired 256-512MB range (~600GB / 2048 ≈ 300MB).
  • Increased executor memory on map_subgraph_queries from 8g to 12g. Many executors were red with >10% of time spent in GC, which often leads to intermittent failures that increase as data sizes grow; 12g appears to keep most executors out of the red state.
  • Seeing intermittent failures in map_subgraph_queries; usually Spark's internal retries work through them, but we have seen failures roll up to the airflow retry level. We might want to increase the timeout waiting on the shuffle server if this persists (see the sketch after this list). Spark potentially addressed this issue in 3.0 with https://issues.apache.org/jira/browse/SPARK-24355
  • Mentioned to the analytics team that we have a few new high-resource jobs running. These jobs are all in the sequential pool, so they shouldn't cause any downstream issues, but it seemed appropriate to let them know.
  • Switched SubgraphQueryMapper from coalesce to repartition. Same reasoning as in the weekly dag: the final jobs were giving OOMs, and letting them compute with the full partition count allows completion, at the expense of an additional shuffle.
  • Removed wiki=wikidata from the sparql event partition specification in subgraph_and_query_metrics. There is no wiki column in this table; rather, it is limited to wdqs (TODO: is that true? Can wcqs end up in here?), which is implicitly limited to wikidata.
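On the shuffle timeout mentioned in the third bullet, a sketch of the knob we might raise if the intermittent fetch failures persist (an assumption about future tuning, not something applied today):

import org.apache.spark.sql.SparkSession

// spark.network.timeout (default 120s) bounds all network interactions,
// including fetches from the shuffle service; raising it gives overloaded
// shuffle servers more time to respond before a task is failed.
val spark = SparkSession.builder()
  .appName("subgraph-query-mapping")
  .config("spark.network.timeout", "600s")
  .getOrCreate()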

Thanks a lot @EBernhardson for the help on finishing this!

There is actually one piece remaining: we typically use refinery-drop-older-than to prune our tables. That worked when we used date=... as the partitioning scheme, but it doesn't support snapshot=.... It takes minimal work (I already have a working POC) to make it interpret snapshot the same as date, but I suspect the partitioning was named snapshot=... with the intent of not only using dates for partitioning? If so, analytics does have a refinery-drop-mediawiki-snapshots script, but it's fairly specialized to their use case. I suspect we would need to make a work-alike script that uses the same refinery library methods but provides our own configuration, or the script could be modified to import its configuration from somewhere user-defined instead of having it embedded in the script itself.

Lots of options, but we have to figure out which is the appropriate way forward.
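Whichever route we take, the core of the job is small. A minimal sketch of the pruning logic, assuming a SparkSession named spark and a hypothetical table name and retention window (note the real refinery scripts also delete the underlying HDFS data, which this ignores):

import java.time.LocalDate
import java.time.format.DateTimeFormatter

val table  = "discovery.subgraph_mapping"  // hypothetical table name
val cutoff = LocalDate.now().minusDays(90) // assumed 90-day retention
  .format(DateTimeFormatter.ofPattern("yyyyMMdd"))

spark.sql(s"SHOW PARTITIONS $table")
  .collect()
  .map(_.getString(0))  // e.g. "snapshot=20220620/wiki=wikidata"
  .filter(_.stripPrefix("snapshot=").take(8) < cutoff)
  .foreach { p =>
    // Rebuild the full partition spec, e.g. snapshot='20220620', wiki='wikidata'
    val spec = p.split("/").map { kv =>
      val Array(k, v) = kv.split("=", 2)
      s"$k='$v'"
    }.mkString(", ")
    spark.sql(s"ALTER TABLE $table DROP IF EXISTS PARTITION ($spec)")
  }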

Double-checked all linked patches; no patches remain for review.

The work still to be done is deciding how to handle pruning data from the snapshot= partitioned tables.

Change 823185 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikimedia/discovery/analytics@master] Remove subgraph/query mapping from drop_old_data

https://gerrit.wikimedia.org/r/823185

Change 823185 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] Remove subgraph/query mapping from drop_old_data

https://gerrit.wikimedia.org/r/823185

@JAllemandou The one remaining piece of this ticket is cleaning up the historical data, per T303831#8081172. Any suggestions on how we should manage dropping old data from tables partitioned by a snapshot column?

> @JAllemandou The one remaining piece of this ticket is cleaning up the historical data, per T303831#8081172. Any suggestions on how we should manage dropping old data from tables partitioned by a snapshot column?

The way we currently do this is with this script: https://github.com/wikimedia/analytics-refinery/blob/master/bin/refinery-drop-mediawiki-snapshots
It works differently from the generic refinery-drop-older-than script, in that it lists all the datasets to clean and then applies the deletion.
It's possible to add the datasets you need to delete there; it shouldn't be complicated.

Discussed this with Joseph, as we believe that having to configure the cleanup job in another repo is not ideal.
It seems the long-term approach might be built around using the data catalog (https://datahub.wikimedia.org/) to store retention metadata and having generic jobs rely on it to do the cleanups.
One short-term option could be to copy refinery-drop-mediawiki-snapshots into the search airflow code base and use it for our needs.
It's not ideal, but might be acceptable for some time? @EBernhardson would that work for you?

Change 831635 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikimedia/discovery/analytics@master] Automatically drop historical partitions of subgraph analysis

https://gerrit.wikimedia.org/r/831635

Change 831635 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] Automatically drop historical partitions of subgraph analysis

https://gerrit.wikimedia.org/r/831635

Change 832303 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikimedia/discovery/analytics@master] drop-snapshots: Tables are partitioned by wiki

https://gerrit.wikimedia.org/r/832303

Change 832303 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] drop-snapshots: Tables are partitioned by wiki

https://gerrit.wikimedia.org/r/832303

Change 832331 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikimedia/discovery/analytics@master] drop-snapshots: Remove directory handling

https://gerrit.wikimedia.org/r/832331

Change 832331 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] drop-snapshots: Remove directory handling

https://gerrit.wikimedia.org/r/832331

Data cleanup looks to have now run successfully.

> Data cleanup looks to have now run successfully.

Thanks a lot @EBernhardson for finalizing this :)