
Sqooping test following schema update

Authored By
Antoine_Quhen
May 24 2023, 1:52 PM

# Sqooping test after schema update
# Prepare some directories with loose permissions
hdfs dfs -mkdir /tmp/test-aqu-20230522
hdfs dfs -chmod -R 777 /tmp/test-aqu-20230522
hdfs dfs -ls /tmp/ | grep test-aqu-20230522
mkdir -p /tmp/test-aqu-20230522
cd /tmp/test-aqu-20230522
git clone https://github.com/wikimedia/analytics-refinery.git
chmod 777 -R /tmp/test-aqu-20230522
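# Optional check, a sketch that was not part of the original run: the job below
# runs as the analytics user and writes its log and temporary jars here, so the
# local scratch directory presumably has to stay writable for that user.
# SCRATCH_DIR is a hypothetical variable name; stat -c is the GNU coreutils form.
SCRATCH_DIR=/tmp/test-aqu-20230522
if [ -d "$SCRATCH_DIR" ]; then
  stat -c '%a %U %n' "$SCRATCH_DIR"
else
  echo "missing: $SCRATCH_DIR"
fi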
# Edit sqoop script
vim analytics-refinery/python/refinery/sqoop.py
# Launch the Sqoop process
sudo -u analytics \
PYTHONPATH=/tmp/test-aqu-20230522/analytics-refinery/python \
/usr/bin/python3 \
/tmp/test-aqu-20230522/analytics-refinery/bin/sqoop-mediawiki-tables \
--job-name aqu-sqoop-mediawiki-monthly-2023-04 \
--clouddb \
--output-dir /tmp/test-aqu-20230522 \
--wiki-file /mnt/hdfs/wmf/refinery/current/static_data/mediawiki/grouped_wikis/grouped_wikis_test.csv \
--tables externallinks \
--user s53272 \
--password-file /user/analytics/mysql-analytics-labsdb-client-pw.txt \
--partition-name snapshot \
--partition-value 2023-04 \
--mappers 64 \
--processors 10 \
--yarn-queue production \
--output-format avrodata \
--log-file /tmp/test-aqu-20230522/sqoop-mediawiki.log \
--local-tmp-path /tmp/test-aqu-20230522/sqoop-jars \
--hdfs-tmp-path /tmp/test-aqu-20230522
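# Optional sketch, not part of the original run: list the Avro part files the
# job wrote, to choose one to read in the next step. OUT_DIR is a hypothetical
# variable name; the guard only skips the listing on hosts without the hdfs client.
OUT_DIR=/tmp/test-aqu-20230522/externallinks
if command -v hdfs >/dev/null 2>&1; then
  sudo -u analytics hdfs dfs -ls -R "$OUT_DIR" | grep '\.avro$' \
    || echo "no Avro files found under $OUT_DIR"
else
  echo "hdfs client not found, skipping listing of $OUT_DIR"
fi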
# Read a generated avro file
sudo -u analytics hdfs dfs -chmod +r /tmp/test-aqu-20230522/externallinks/snapshot=2023-04/wiki_db=dkwikimedia/part-m-00006.avro
sudo -u analytics hdfs dfs -chmod -R +x /tmp/test-aqu-20230522/externallinks
spark3-shell --packages org.apache.spark:spark-avro_2.12:3.1.2
// Alternative when the file has been copied locally: val path = "file:///tmp/test-aqu-20230522/part-m-00006.avro"
val path = "hdfs:///tmp/test-aqu-20230522/externallinks/snapshot=2023-04/wiki_db=dkwikimedia/part-m-00006.avro"
val df = spark.read.format("avro").load(path)
df.show(5, false)
+-----+-------+-------------------------+---------------------------------------------------------------------------------------+
|el_id|el_from|el_to_domain_index       |el_to_path                                                                             |
+-----+-------+-------------------------+---------------------------------------------------------------------------------------+
|1222 |467    |http://dk.politiken.     |/kultur/medier/ECE1992455/wikipedia-gaar-til-kamp-mod-redigeringskrig/                 |
|1223 |214    |https://org.wikipedia.da.|/w/index.php?title=Wikipedia:Tr%C3%A6f&oldid=7165729                                   |
|1224 |182    |https://org.wikipedia.da.|/wiki/Wikipedia:GLAM/Edit-a-thon_1864                                                  |
|1225 |214    |http://dk.politiken.     |/kultur/medier/ECE1992455/wikipedia-gaar-til-kamp-mod-redigeringskrig/                 |
|1226 |214 |https://org.wikipedia.da.|/w/index.php?title=Wikipedia:Landsbybr%C3%B8nden/Kommunevalgs-edit-a-thon&oldid=7295725|
+-----+-------+-------------------------+---------------------------------------------------------------------------------------+
# Cleanup
sudo -u hdfs kerberos-run-command hdfs hdfs dfs -rm -R /tmp/test-aqu-20230522
sudo -u analytics rm -Rf /tmp/test-aqu-20230522/*
rm -Rf /tmp/test-aqu-20230522
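# Optional sketch, not part of the original run: confirm that both copies of
# the scratch directory are gone. hdfs dfs -test -e exits 0 when the path
# exists. LEFTOVER is a hypothetical variable name.
LEFTOVER=/tmp/test-aqu-20230522
if [ -e "$LEFTOVER" ]; then
  echo "local directory still present: $LEFTOVER"
else
  echo "local directory removed"
fi
if command -v hdfs >/dev/null 2>&1; then
  if hdfs dfs -test -e "$LEFTOVER"; then
    echo "HDFS directory still present: $LEFTOVER"
  else
    echo "HDFS directory removed"
  fi
fi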
