F36457008
Test rdf-spark-tools with spark3
Authored By
dcausse
Jan 24 2023, 11:06 AM
2023-01-24 11:06:51 (UTC+0)
Size
1 KB
#!/bin/bash
PATH_TO_RDF_SPARK_TOOL_JAR=/path/to/rdf-spark-tools.jar
SNAPSHOT=20230116
OUTPUT=hdfs:///user/pfischer/test_rdf_spark_tools/$SNAPSHOT/rev_map.csv
SPARK3_SUBMIT=spark3-submit
# Production OUTPUT is at hdfs://analytics-hadoop/wmf/data/discovery/wdqs/entity_revision_map/$SNAPSHOT/rev_map.csv
# so it can be tested to see if the data is similar
# E.g.: hdfs dfs -du -s -h hdfs://analytics-hadoop/wmf/data/discovery/wdqs/entity_revision_map/20230116/rev_map.csv
# Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
# 873.0 M hdfs://analytics-hadoop/wmf/data/discovery/wdqs/entity_revision_map/20230116/rev_map.csv
# hdfs dfs -ls hdfs://analytics-hadoop/wmf/data/discovery/wdqs/entity_revision_map/20230116/rev_map.csv | wc -l
# Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
# 102
$SPARK3_SUBMIT \
    --master yarn \
    --conf spark.dynamicAllocation.maxExecutors=25 \
    --conf spark.yarn.maxAppAttempts=1 \
    --executor-cores 8 \
    --executor-memory 16g \
    --driver-memory 2g \
    --name "SPARK3 TEST: [Search Airflow Job] Import Wikidata Ttl: Gen Rev Map" \
    --class org.wikidata.query.rdf.spark.transform.structureddata.dumps.EntityRevisionMapGenerator \
    --queue default \
    --deploy-mode cluster \
    $PATH_TO_RDF_SPARK_TOOL_JAR --input-table discovery.wikibase_rdf/date=$SNAPSHOT/wiki=wikidata \
    --output-path $OUTPUT
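The comments in the script suggest checking the test output against production by comparing `hdfs dfs -ls ... | wc -l` line counts. A minimal sketch of that check follows. The helper names `count_ls_lines` and `compare_rev_maps` are hypothetical and not part of rdf-spark-tools; the two base paths are taken from the script above. It assumes the `hdfs` client is on the PATH and only compares listing line counts, which is a rough similarity signal rather than proof that the data matches.

```shell
#!/bin/bash
# Sketch only: helper names are hypothetical, paths come from the script above.

PROD_BASE=hdfs://analytics-hadoop/wmf/data/discovery/wdqs/entity_revision_map
TEST_BASE=hdfs:///user/pfischer/test_rdf_spark_tools

# Print the number of lines that `hdfs dfs -ls` writes to stdout for a path.
count_ls_lines() {
    hdfs dfs -ls "$1" | wc -l | tr -d ' '
}

# Compare listing line counts for the production and test rev_map.csv
# directories of one snapshot. Returns 0 when the counts match, 1 otherwise.
compare_rev_maps() {
    local snapshot=$1
    local prod_path="$PROD_BASE/$snapshot/rev_map.csv"
    local test_path="$TEST_BASE/$snapshot/rev_map.csv"
    local prod_count test_count
    prod_count=$(count_ls_lines "$prod_path")
    test_count=$(count_ls_lines "$test_path")
    echo "prod: $prod_count lines, test: $test_count lines"
    [ "$prod_count" -eq "$test_count" ]
}
```

After sourcing these functions on a host with access to the cluster, `compare_rev_maps 20230116` would print both counts and signal a mismatch through its exit status. Comparing `hdfs dfs -du -s -h` sizes, as in the script's comments, remains a useful manual complement.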
Attached To: P43307 Test rdf-spark-tools with spark3