
Productionize Wikidata subgraph analysis
Closed, Resolved · Public

Description

As a Data Analyst for Wikidata/WDQS, I would like the metrics from the subgraph analysis done in T293628 to be evaluated periodically and stored over time for further analysis, and so that anyone can access the analysis results without having to redo the analysis from scratch.

This ticket covers productionizing:

  • subgraph mapping to items and triples
  • subgraph metrics: subgraph size, number of items, predicate usage, etc.
  • query mapping to subgraph
  • subgraph query metrics: queries per subgraph, UA distribution, query time distribution, item/predicate usage, etc.

List of all possible metrics: metrics-list

Details

Repo | Branch | Lines +/-
wikimedia/discovery/analytics | master | +0 -30
wikimedia/discovery/analytics | master | +3 -3
wikimedia/discovery/analytics | master | +260 -20
wikimedia/discovery/analytics | master | +0 -29
wikimedia/discovery/analytics | master | +2 -2
wikimedia/discovery/analytics | master | +4 -4
wikidata/query/rdf | master | +4 -4
wikimedia/discovery/analytics | master | +5 -0
wikimedia/discovery/analytics | master | +4 -7
wikimedia/discovery/analytics | master | +37 -5
wikimedia/discovery/analytics | master | +129 -72
wikimedia/discovery/analytics | master | +2 -2
wikidata/query/rdf | master | +3 -3
wikimedia/discovery/analytics | master | +10 -4
wikimedia/discovery/analytics | master | +1 K -3
wikidata/query/rdf | master | +37 -26
wikidata/query/rdf | master | +302 -0
wikidata/query/rdf | master | +1 K -11
wikidata/query/rdf | master | +4 K -1
wikidata/query/rdf | master | +3 K -135
wikidata/query/rdf | master | +47 -39
wikidata/query/rdf | master | +58 -56

Event Timeline


Change 771077 had a related patch set uploaded (by AKhatun; author: AKhatun):

[wikidata/query/rdf@master] [WIP] Productionize subgraph analysis metrics

https://gerrit.wikimedia.org/r/771077

Change 780888 had a related patch set uploaded (by AKhatun; author: AKhatun):

[wikidata/query/rdf@master] Reorganize rdf-spark-tools submodule

https://gerrit.wikimedia.org/r/780888

Change 780888 merged by jenkins-bot:

[wikidata/query/rdf@master] Reorganize rdf-spark-tools submodule

https://gerrit.wikimedia.org/r/780888

Change 787064 had a related patch set uploaded (by AKhatun; author: AKhatun):

[wikidata/query/rdf@master] Unit tests for subgraph mapping

https://gerrit.wikimedia.org/r/787064

Change 800599 had a related patch set uploaded (by AKhatun; author: AKhatun):

[wikidata/query/rdf@master] Unit tests for subgraph query mapping

https://gerrit.wikimedia.org/r/800599

Change 802506 had a related patch set uploaded (by AKhatun; author: AKhatun):

[wikidata/query/rdf@master] Add spark-testing-base dependency

https://gerrit.wikimedia.org/r/802506

Change 802506 merged by jenkins-bot:

[wikidata/query/rdf@master] Add spark-testing-base dependency

https://gerrit.wikimedia.org/r/802506

Change 803492 had a related patch set uploaded (by AKhatun; author: AKhatun):

[wikidata/query/rdf@master] Unit tests for subgraph analysis metrics

https://gerrit.wikimedia.org/r/803492

Change 807977 had a related patch set uploaded (by AKhatun; author: AKhatun):

[wikimedia/discovery/analytics@master] Airflow dags to generate subgraph and query mapping and their metrics

https://gerrit.wikimedia.org/r/807977

Change 771077 merged by jenkins-bot:

[wikidata/query/rdf@master] Productionize subgraph analysis metrics

https://gerrit.wikimedia.org/r/771077

Change 808977 had a related patch set uploaded (by AKhatun; author: AKhatun):

[wikidata/query/rdf@master] Update subgraph table partitions

https://gerrit.wikimedia.org/r/808977

Change 787064 merged by jenkins-bot:

[wikidata/query/rdf@master] Unit tests for subgraph mapping

https://gerrit.wikimedia.org/r/787064

Change 800599 merged by jenkins-bot:

[wikidata/query/rdf@master] Unit tests for subgraph query mapping

https://gerrit.wikimedia.org/r/800599

Change 803492 merged by jenkins-bot:

[wikidata/query/rdf@master] Unit tests for subgraph analysis metrics

https://gerrit.wikimedia.org/r/803492

Change 808977 merged by jenkins-bot:

[wikidata/query/rdf@master] Update subgraph table partitions

https://gerrit.wikimedia.org/r/808977

Change 807977 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] Airflow dags to generate subgraph and query mapping and their metrics

https://gerrit.wikimedia.org/r/807977

The airflow patch is deployed but I only turned on the *_init dags and subgraph_mapping_weekly today (ran out of time, will do the rest tomorrow).

subgraph_mapping_weekly failed the first time through. I updated executor memory from 8g to 12g but the second execution is still failing. Something is quite unbalanced about topSubgraphItems: of the 8 shards, inputs vary from 100 MB to 450 MB, giving execution times of ~30 s on the small ones and ~8 min before the final one fails.

Not specifically related to this patch, but I wonder if we could change the SparkUtils.saveTables method to take parameters in the path that specify coalesce vs. repartition and the number of partitions to save with, so that we only have to update the airflow invocation, and not the jar as well, to test variations there.

> The airflow patch is deployed but I only turned on the *_init dags and subgraph_mapping_weekly today (ran out of time, will do the rest tomorrow).
>
> subgraph_mapping_weekly failed the first time through. I updated executor memory from 8g to 12g but the second execution is still failing. Something is quite unbalanced about topSubgraphItems: of the 8 shards, inputs vary from 100 MB to 450 MB, giving execution times of ~30 s on the small ones and ~8 min before the final one fails.
>
> Not specifically related to this patch, but I wonder if we could change the SparkUtils.saveTables method to take parameters in the path that specify coalesce vs. repartition and the number of partitions to save with, so that we only have to update the airflow invocation, and not the jar as well, to test variations there.

Should we have params called coalesce and repartition that default to false, and when true use num_partitions to coalesce or repartition accordingly?

Edit: I realize all arg classes that need to coalesce or repartition will need these params added.

Update:
I tested a few options on the statbox. I am not sure how well this represents the prod env, but here goes:

  • coalesce + 8G driver memory = failed as identified by Erik (SparkOutOfMemoryError at topSubgraphItems, application_1655808530211_109990)
  • coalesce + 16G driver memory = failed (SparkOutOfMemoryError at topSubgraphItems, application_1655808530211_110190)
  • repartition + 8G driver memory = failed (Reason: Executor heartbeat timed out after 176110 ms, application_1655808530211_110236)
  • repartition + 16G driver memory = failed (Reason: Executor heartbeat timed out after 159925 ms, application_1655808530211_110343)
  • repartition + 16G driver memory + 16G executor memory = failed (Reason: Executor heartbeat timed out after 145549 ms, application_1655808530211_110430)

We still need to figure out the exact place that causes the OOM.

EDIT: Before adding coalesce or repartition, 8G-8G succeeded in 30 minutes.

>> The airflow patch is deployed but I only turned on the *_init dags and subgraph_mapping_weekly today (ran out of time, will do the rest tomorrow).
>>
>> subgraph_mapping_weekly failed the first time through. I updated executor memory from 8g to 12g but the second execution is still failing. Something is quite unbalanced about topSubgraphItems: of the 8 shards, inputs vary from 100 MB to 450 MB, giving execution times of ~30 s on the small ones and ~8 min before the final one fails.
>>
>> Not specifically related to this patch, but I wonder if we could change the SparkUtils.saveTables method to take parameters in the path that specify coalesce vs. repartition and the number of partitions to save with, so that we only have to update the airflow invocation, and not the jar as well, to test variations there.
>
> Should we have params called coalesce and repartition that default to false, and when true use num_partitions to coalesce or repartition accordingly?
>
> Edit: I realize all arg classes that need to coalesce or repartition will need these params added.

In this case I was thinking that we could treat the string provided over the command line as a specification for how/where to store things, and include named parameters in it. For example, right now we provide:

--all-subgraphs-table discovery.wikibase_rdf/date=20220620/wiki=wikidata

What if instead we could provide (syntax to be bikeshedded):

--all-subgraphs-table discovery.wikibase_rdf/date=20220620/wiki=wikidata;repartition=42

This would have the downside that read and write would have different syntaxes, and we would have to know which to use where; maybe there are better options. Mostly I'm pondering ideas for making the things we know might need modification easier to change. There are probably other ways to magic parameters into various places in the JVM world; this is just a first guess.
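To make the idea concrete, here is a rough sketch of how such a suffix could be parsed (the names and the exact syntax are hypothetical, to be bikeshedded along with the rest):

case class WriteSpec(tableSpec: String, options: Map[String, String])

// Hypothetical parser for a write spec such as
//   discovery.wikibase_rdf/date=20220620/wiki=wikidata;repartition=42
// Everything after ';' is read as named write options (e.g. repartition=42);
// the part before it is the table/partition spec the code already understands.
def parseWriteSpec(raw: String): WriteSpec =
  raw.split(";", 2) match {
    case Array(table) => WriteSpec(table, Map.empty)
    case Array(table, opts) =>
      val parsed = opts.split(",").map { kv =>
        val Array(k, v) = kv.split("=", 2)
        k -> v
      }.toMap
      WriteSpec(table, parsed)
  }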

I tried a run with the three coalesces in SubgraphMapper converted into repartitions. In this case, instead of having 8 partitions where 7 finish and the 8th takes forever and then fails, it has 200 partitions where 199 finish and the 200th takes forever and then fails. This looks like a case of join skew: the dataset is partitioned on the join condition (rather than randomly), and a specific part of the join has significantly more values to work through than anything else. To get an idea of how significant the skew is, I doubled the RAM again (to 24g) in the hope that it would eventually complete and give some stats. The final stats are as follows, clearly showing significant skew:

Metric | Min | 25th percentile | Median | 75th percentile | Max
Duration | 1 s | 1 s | 2 s | 2 s | 4.1 min
Scheduler Delay | 6 ms | 19 ms | 21 ms | 26 ms | 34 ms
Task Deserialization Time | 37 ms | 61 ms | 77 ms | 0.1 s | 0.2 s
GC Time | 0 ms | 16 ms | 23 ms | 48 ms | 2.6 min
Result Serialization Time | 0 ms | 0 ms | 0 ms | 0 ms | 1 ms
Getting Result Time | 0 ms | 0 ms | 0 ms | 0 ms | 0 ms
Peak Execution Memory | 128.8 MB | 194.3 MB | 196.3 MB | 200.3 MB | 5.6 GB
Shuffle Read Blocked Time | 0 ms | 3 ms | 5 ms | 64 ms | 0.3 s
Shuffle Read Size / Records | 1469.5 KB / 35062 | 2.5 MB / 87982 | 3.1 MB / 133528 | 5.0 MB / 258108 | 406.2 MB / 38467392
Shuffle Remote Reads | 1433.7 KB | 2.5 MB | 3.1 MB | 4.9 MB | 398.5 MB
Shuffle Write Size / Records | 0.0 B / 0 | 184.5 KB / 18106 | 827.2 KB / 72252 | 2.5 MB / 195511 | 404.2 MB / 38411863

Resolving the skew, on the other hand, is a harder problem. Spark 3 added a new skew-join optimization, and I've heard that some other teams have Spark 3 working in our cluster, but I haven't played with it at all yet. Will look into this more and see what solutions can be found. In terms of the exact code causing this, Spark is terrible at telling us exactly where, but trying to infer from the Spark UI output I think it's this join:

def getTopSubgraphItems(topSubgraphs: DataFrame): DataFrame = {
  // Keep only P31 (instance of) triples, then right-join against the top
  // subgraphs; the distribution of items per subgraph is heavily skewed.
  wikidataTriples
    .filter(s"predicate='<$p31>'")
    .selectExpr("object as subgraph", "subject as item")
    .join(topSubgraphs.select("subgraph"), Seq("subgraph"), "right")
}

I'll probably need to recreate some of this in a JupyterLab notebook to look at the actual data in the middle of the computations and see what exactly is on the skewed side of the dataset. Alternatively, it's not the end of the world to let this use more memory and ignore the skew.

And I suppose this is also only the first skewed join in the execution; there may be more later in the computations.
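For reference, a minimal sketch of what a manual skew-join mitigation (key salting) could look like; this illustrates an inner join with hypothetical names, not the exact right join SubgraphMapper performs:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, explode, lit, rand}

// Spread a hot join key across `numSalts` tasks: salt the skewed side with a
// random suffix, replicate the small side across all suffixes, and join on
// (key, salt) so no single task processes the entire hot key.
def saltedJoin(skewed: DataFrame, small: DataFrame, key: String, numSalts: Int): DataFrame = {
  val salted   = skewed.withColumn("salt", (rand() * numSalts).cast("int"))
  val exploded = small.withColumn("salt", explode(array((0 until numSalts).map(lit): _*)))
  salted.join(exploded, Seq(key, "salt")).drop("salt")
}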

Change 812075 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikimedia/discovery/analytics@master] Tune subgraph_mapping_weekly based on first prod run

https://gerrit.wikimedia.org/r/812075

Change 812075 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] Tune subgraph_mapping_weekly based on first prod run

https://gerrit.wikimedia.org/r/812075

Change 812133 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikidata/query/rdf@master] Switch SubgraphMapper from coalesce to repartition

https://gerrit.wikimedia.org/r/812133

Change 812133 merged by jenkins-bot:

[wikidata/query/rdf@master] Switch SubgraphMapper from coalesce to repartition

https://gerrit.wikimedia.org/r/812133

Change 812143 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikimedia/discovery/analytics@master] Update rdf-spark-tools to 0.3.112

https://gerrit.wikimedia.org/r/812143

Change 812143 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] Update rdf-spark-tools to 0.3.112

https://gerrit.wikimedia.org/r/812143

Stats on the final join building topSubgraphTriples. This is using 4096 partitions and repartition(). It works for now, so it's probably not worth dealing with the skew, but these stats might be useful to compare against in the future if it starts failing:

Metric | Min | 25th percentile | Median | 75th percentile | Max
Duration | 15 s | 46 s | 54 s | 1.0 min | 9.2 min
Scheduler Delay | 2 ms | 3 ms | 3 ms | 4 ms | 0.4 s
Task Deserialization Time | 1 ms | 2 ms | 2 ms | 3 ms | 0.7 s
GC Time | 27 ms | 0.1 s | 0.2 s | 0.3 s | 41 s
Result Serialization Time | 0 ms | 0 ms | 0 ms | 0 ms | 1 ms
Getting Result Time | 0 ms | 0 ms | 0 ms | 0 ms | 0 ms
Peak Execution Memory | 2.1 GB | 2.1 GB | 2.1 GB | 2.1 GB | 13.6 GB
Shuffle Read Blocked Time | 0 ms | 23 s | 32 s | 38 s | 2.1 min
Shuffle Read Size / Records | 263.2 MB / 3156075 | 269.9 MB / 3235843 | 271.6 MB / 3256300 | 273.4 MB / 3277774 | 30.5 GB / 414401248
Shuffle Remote Reads | 255.2 MB | 264.1 MB | 266.1 MB | 268.0 MB | 29.7 GB
Shuffle Write Size / Records | 340.9 MB / 3184514 | 351.8 MB / 3281889 | 354.4 MB / 3305742 | 357.0 MB / 3330833 | 367.5 MB / 3438583
Shuffle spill (memory) | 0.0 B | 0.0 B | 0.0 B | 0.0 B | 98.1 GB
Shuffle spill (disk) | 0.0 B | 0.0 B | 0.0 B | 0.0 B | 28.2 GB

Summary of what was done so far to deploy:

  • Tuned subgraph_mapping_weekly: set spark parallelism to 4096, increased memory to 24G (= 6g per task), and reduced the total executor count to keep total memory usage around 1TB. Changed coalesce() into repartition() in SubgraphMapper. It now completes without any failed tasks. This might be a bit wasteful of memory, but it's probably not worth tuning unless there are complaints, and we can hope a later upgrade to Spark 3 with its skew-join optimization will improve things (see the sketch after this list). We could manually implement the same skew-join optimization on a per-use-case basis (as sketched earlier), but it's extra work that might not be necessary.
  • Enabled subgraph_metrics_weekly. Ran without issue.
  • This patch added a number of new sensors. We've been intending to switch sensors from mode=poke to mode=reschedule, and adding these new sensors reminded me why we needed that change (all airflow executors were tied up waiting for data to arrive). Deployed a patch to switch everything over.
  • Enabled subgraph_query_mapping_daily. This started waiting for snapshot=20220613 (the previous Monday) with an execution_date of 20220620 (also a Monday). I suspect we should adjust this to target snapshot=20220620, but I'm waiting for confirmation. Turned it back off so it doesn't time out and complain.
  • Enabled subgraph_query_metrics_daily. This is waiting for event.wdqs_external_sparql_query/datacenter=eqiad/year=2022/month=6/day=20 (and the same for codfw), but it needs to wait on the individual hourly partitions. I hadn't thought this fully through when reviewing the patch; we will need to adjust the sensor to use HivePartitionRangeSensor, which can generate all the intermediate hourly named partitions. Turned it back off, as it's also waiting for outputs of subgraph_query_mapping_daily (IIUC), which is currently turned off.
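As a note on the Spark 3 option mentioned in the first bullet: the skew-join handling there is part of adaptive query execution and is configuration only. A sketch, assuming a Spark 3 cluster (which we do not have today):

import org.apache.spark.sql.SparkSession

// Spark 3's adaptive query execution detects skewed join partitions and
// splits them into smaller tasks, which would cover the topSubgraphItems join.
val spark = SparkSession.builder()
  .appName("subgraph-mapping")
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.adaptive.skewJoin.enabled", "true")
  .getOrCreate()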

> In terms of the exact code causing this, Spark is terrible at telling us exactly where, but trying to infer from the Spark UI output I think it's this join:
>
> def getTopSubgraphItems(topSubgraphs: DataFrame): DataFrame = {
>   wikidataTriples
>     .filter(s"predicate='<$p31>'")
>     .selectExpr("object as subgraph", "subject as item")
>     .join(topSubgraphs.select("subgraph"), Seq("subgraph"), "right")
> }

This is exactly the code that finds the top subgraphs. And yes, the data is definitely heavily skewed; that is the nature of Wikidata, and anything we do on Wikidata by subgraph is going to run into similar issues. For reference, half of Wikidata sits under one single subgraph, while the other half spreads across hundreds of subgraphs. We might need to start considering Spark 3.

> And I suppose this is also only the first skewed join in the execution; there may be more later in the computations.

Unfortunately, yes. subgraph_query_mapping is going to be another big feat, I believe; it has similar joins and writes data daily. But we will see.

> • Enabled subgraph_query_mapping_daily. This started waiting for snapshot=20220613 (the previous Monday) with an execution_date of 20220620 (also a Monday). I suspect we should adjust this to target snapshot=20220620, but I'm waiting for confirmation. Turned it back off so it doesn't time out and complain.

It is correct to look for data from the previous Monday, because the data for 20220620 actually gets populated the following Friday. If the job ran against the current date, it wouldn't find Monday's data on that same day. All of this maneuvering is because the input data is both weekly and daily, so every day the job looks for data from the previous Monday.

This makes me wonder whether the same should be done for subgraph_mapping_weekly, since it looks for 20220620 on that same day even though the data will only be populated the following Friday. That job runs weekly, the same cadence as its input data.

EDIT: I just realized this issue will continue to occur for daily runs between Mondays and Fridays, when the sensor looks for the previous Monday's data but cannot find it until the following Friday. Let's talk about these; I don't have a solution right now.
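To make the date arithmetic concrete, a small sketch of the "previous Monday" lookup the daily job effectively performs (a hypothetical helper, not the actual sensor code):

import java.time.{DayOfWeek, LocalDate}
import java.time.temporal.TemporalAdjusters

// Resolve the weekly snapshot a daily execution waits on: the Monday strictly
// before the execution date. An execution on Monday 2022-06-20 therefore
// waits on snapshot=20220613, matching the behavior described above.
def snapshotFor(executionDate: LocalDate): LocalDate =
  executionDate.`with`(TemporalAdjusters.previous(DayOfWeek.MONDAY))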

> • Enabled subgraph_query_metrics_daily. This is waiting for event.wdqs_external_sparql_query/datacenter=eqiad/year=2022/month=6/day=20 (and the same for codfw), but it needs to wait on the individual hourly partitions. I hadn't thought this fully through when reviewing the patch; we will need to adjust the sensor to use HivePartitionRangeSensor, which can generate all the intermediate hourly named partitions. Turned it back off, as it's also waiting for outputs of subgraph_query_mapping_daily (IIUC), which is currently turned off.

Attempting this.
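For illustration, the hourly partitions the range sensor has to cover for a single day look like this (a sketch of the partition naming only; the real list is generated by HivePartitionRangeSensor):

// One partition spec per datacenter and hour for 2022-06-20; the old sensor
// waited on a single day-level partition that does not exist in this table.
val hourlyPartitions = for {
  dc   <- Seq("eqiad", "codfw")
  hour <- 0 until 24
} yield s"event.wdqs_external_sparql_query/datacenter=$dc/year=2022/month=6/day=20/hour=$hour"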

Change 812304 had a related patch set uploaded (by AKhatun; author: AKhatun):

[wikimedia/discovery/analytics@master] Reconsider sensor data dates and use hive range sensor

https://gerrit.wikimedia.org/r/812304

Change 812304 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] Reconsider sensor data dates and use hive range sensor

https://gerrit.wikimedia.org/r/812304

Change 812927 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikimedia/discovery/analytics@master] subgraph: Use HivePartitionRangeSensor to wait for sparql queries

https://gerrit.wikimedia.org/r/812927

Change 812927 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] subgraph: Use HivePartitionRangeSensor to wait for sparql queries

https://gerrit.wikimedia.org/r/812927

Change 812936 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikimedia/discovery/analytics@master] Remove external queries from wait_for_data

https://gerrit.wikimedia.org/r/812936

Change 812936 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] Remove external queries from wait_for_data

https://gerrit.wikimedia.org/r/812936

Change 812942 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikimedia/discovery/analytics@master] subgraph_query_mapping_daily: Increase partitioning to 2048

https://gerrit.wikimedia.org/r/812942

Change 812942 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] subgraph_query_mapping_daily: Increase partitioning to 2048

https://gerrit.wikimedia.org/r/812942

Change 812970 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikidata/query/rdf@master] Switch SubgraphQueryMapper from coalesce to repartition

https://gerrit.wikimedia.org/r/812970

Change 812970 merged by jenkins-bot:

[wikidata/query/rdf@master] Switch SubgraphQueryMapper from coalesce to repartition

https://gerrit.wikimedia.org/r/812970

Change 813190 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikimedia/discovery/analytics@master] subgraph_and_query_mapping: Increase memory to 12g, use repartition

https://gerrit.wikimedia.org/r/813190

Change 813190 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] subgraph_and_query_mapping: Increase memory to 12g, use repartition

https://gerrit.wikimedia.org/r/813190

Change 813334 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikimedia/discovery/analytics@master] subgraph_and_query_metrics: Drop wiki from sparql event partition spec

https://gerrit.wikimedia.org/r/813334

Change 813334 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] subgraph_and_query_metrics: Drop wiki from sparql event partition spec

https://gerrit.wikimedia.org/r/813334

All dags are now enabled, and each has completed at least one full execution.

  • Increased the partition count on map_subgraph_queries to 2048; the largest shuffle is ~600GB, and this brings the per-partition work down into the desired 256-512MB range (~600GB / 2048 ≈ 300MB).
  • Increased executor memory on map_subgraph_queries from 8g to 12g. Many executors were red with >10% of time spent in GC, which often leads to intermittent failures that increase as data sizes grow; 12g appears to keep most executors out of the red state.
  • Seeing intermittent failures in map_subgraph_queries; usually Spark's internal retries work through them, but we have seen failures roll up to the airflow retry level. We might want to increase the timeout waiting on the shuffle server if this persists (see the sketch after this list). Spark potentially addressed this issue in 3.0 with https://issues.apache.org/jira/browse/SPARK-24355
  • Mentioned to the analytics team that we have a few new high-resource jobs running. These jobs are all in the sequential pool, so they shouldn't cause any downstream issues, but it seemed appropriate to let them know.
  • Switched SubgraphQueryMapper from coalesce to repartition. Same reasoning as in the weekly dag: the final jobs were giving OOMs, and letting them compute with the full partition count allows completion, at the expense of an additional shuffle.
  • Removed wiki=wikidata from the sparql event partition specification in subgraph_and_query_metrics. There is no wiki column in this table; rather, it is limited to wdqs (TODO: is that true? Can wcqs end up in here?), which is implicitly limited to wikidata.
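On the shuffle timeout mentioned in the third bullet, a sketch of the knob we might raise if the intermittent fetch failures persist (an assumption about future tuning, not something applied today):

import org.apache.spark.sql.SparkSession

// spark.network.timeout (default 120s) bounds all network interactions,
// including fetches from the shuffle service; raising it gives overloaded
// shuffle servers more time to respond before a task is failed.
val spark = SparkSession.builder()
  .appName("subgraph-query-mapping")
  .config("spark.network.timeout", "600s")
  .getOrCreate()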

Thanks a lot @EBernhardson for the help on finishing this!

There is actually one piece remaining: we typically use refinery-drop-older-than to prune our tables. That worked when we used date=... as the partitioning scheme, but it doesn't support snapshot=.... It takes minimal work (I already have a working POC) to make it interpret snapshot the same as date, but I suspect the partitioning was named snapshot=... with the intent of not only using dates for partitioning? If so, analytics does have a refinery-drop-mediawiki-snapshots script, but it's fairly specialized to their use case. I suspect we would need to make a work-alike script that uses the same refinery library methods but provides our own configuration, or the script could be modified to import its configuration from somewhere user-defined instead of having it embedded in the script itself.

Lots of options, but we have to figure out which is the appropriate way forward.
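Whichever route we take, the core of the job is small. A minimal sketch of the pruning logic, assuming a SparkSession named spark and a hypothetical table name and retention window (note the real refinery scripts also delete the underlying HDFS data, which this ignores):

import java.time.LocalDate
import java.time.format.DateTimeFormatter

val table  = "discovery.subgraph_mapping"  // hypothetical table name
val cutoff = LocalDate.now().minusDays(90) // assumed 90-day retention
  .format(DateTimeFormatter.ofPattern("yyyyMMdd"))

spark.sql(s"SHOW PARTITIONS $table")
  .collect()
  .map(_.getString(0))  // e.g. "snapshot=20220620/wiki=wikidata"
  .filter(_.stripPrefix("snapshot=").take(8) < cutoff)
  .foreach { p =>
    // Rebuild the full partition spec, e.g. snapshot='20220620', wiki='wikidata'
    val spec = p.split("/").map { kv =>
      val Array(k, v) = kv.split("=", 2)
      s"$k='$v'"
    }.mkString(", ")
    spark.sql(s"ALTER TABLE $table DROP IF EXISTS PARTITION ($spec)")
  }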

Double-checked all linked patches; no patches remain for review.

The work still to be done is deciding how to handle pruning data from the snapshot= partitioned tables.

Change 823185 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikimedia/discovery/analytics@master] Remove subgraph/query mapping from drop_old_data

https://gerrit.wikimedia.org/r/823185

Change 823185 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] Remove subgraph/query mapping from drop_old_data

https://gerrit.wikimedia.org/r/823185

@JAllemandou The one remaining piece of this ticket is cleaning up the historical data, per T303831#8081172. Any suggestions on how we should manage dropping old data from tables partitioned by a snapshot column?

> @JAllemandou The one remaining piece of this ticket is cleaning up the historical data, per T303831#8081172. Any suggestions on how we should manage dropping old data from tables partitioned by a snapshot column?

The way we currently do this is with this script: https://github.com/wikimedia/analytics-refinery/blob/master/bin/refinery-drop-mediawiki-snapshots
It works differently from the generic refinery-drop-older-than script, in that it lists all the datasets to clean and then applies the deletion.
It's possible to add the datasets you need to delete there; it shouldn't be complicated.

Discussed this with Joseph, as we believe that having to configure the cleanup job in another repo is not ideal.
It seems the long-term approach might be built around using the data catalog (https://datahub.wikimedia.org/) to store retention metadata and having generic jobs rely on it to do the cleanups.
One short-term option could be to copy refinery-drop-mediawiki-snapshots into the search airflow code base and use it for our needs.
It's not ideal, but might be acceptable for some time? @EBernhardson would that work for you?

Change 831635 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikimedia/discovery/analytics@master] Automatically drop historical partitions of subgraph analysis

https://gerrit.wikimedia.org/r/831635

Change 831635 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] Automatically drop historical partitions of subgraph analysis

https://gerrit.wikimedia.org/r/831635

Change 832303 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikimedia/discovery/analytics@master] drop-snapshots: Tables are partitioned by wiki

https://gerrit.wikimedia.org/r/832303

Change 832303 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] drop-snapshots: Tables are partitioned by wiki

https://gerrit.wikimedia.org/r/832303

Change 832331 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikimedia/discovery/analytics@master] drop-snapshots: Remove directory handling

https://gerrit.wikimedia.org/r/832331

Change 832331 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] drop-snapshots: Remove directory handling

https://gerrit.wikimedia.org/r/832331

Data cleanup looks to have now run successfully.

> Data cleanup looks to have now run successfully.

Thanks a lot @EBernhardson for finalizing this :)