Sqooping test following schema update
Authored By
Antoine_Quhen
May 24 2023, 1:52 PM
2023-05-24 13:52:02 (UTC+0)
# Sqooping test after schema update
# Prepare some directories with loose permissions
hdfs dfs -mkdir /tmp/test-aqu-20230522
hdfs dfs -chmod -R 777 /tmp/test-aqu-20230522
hdfs dfs -ls /tmp/ | grep test-aqu-20230522
mkdir -p /tmp/test-aqu-20230522
cd /tmp/test-aqu-20230522
git clone https://github.com/wikimedia/analytics-refinery.git
chmod 777 -R /tmp/test-aqu-20230522
# Edit sqoop script
vim analytics-refinery/python/refinery/sqoop.py
# Launch the sqooping process
sudo -u analytics \
PYTHONPATH=/tmp/test-aqu-20230522/analytics-refinery/python \
/usr/bin/python3 \
/tmp/test-aqu-20230522/analytics-refinery/bin/sqoop-mediawiki-tables \
--job-name aqu-sqoop-mediawiki-monthly-2023-04 \
--clouddb \
--output-dir /tmp/test-aqu-20230522 \
--wiki-file /mnt/hdfs/wmf/refinery/current/static_data/mediawiki/grouped_wikis/grouped_wikis_test.csv \
--tables externallinks \
--user s53272 \
--password-file /user/analytics/mysql-analytics-labsdb-client-pw.txt \
--partition-name snapshot \
--partition-value 2023-04 \
--mappers 64 \
--processors 10 \
--yarn-queue production \
--output-format avrodata \
--log-file /tmp/test-aqu-20230522/sqoop-mediawiki.log \
--local-tmp-path /tmp/test-aqu-20230522/sqoop-jars \
--hdfs-tmp-path /tmp/test-aqu-20230522
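# Sketch: check what the run above produced. The directory layout
# (<output-dir>/<table>/snapshot=<value>/wiki_db=<wiki>) is taken from the
# Avro path used further down in these notes. The helper name partition_path
# is hypothetical and is not part of analytics-refinery.

```shell
# Hypothetical helper: build the partition directory expected for one wiki.
partition_path() {
    # $1 = output dir, $2 = table, $3 = snapshot value, $4 = wiki db
    printf '%s/%s/snapshot=%s/wiki_db=%s\n' "$1" "$2" "$3" "$4"
}

target="$(partition_path /tmp/test-aqu-20230522 externallinks 2023-04 dkwikimedia)"
echo "$target"

# List the generated part files only when an hdfs client exists on this host.
if command -v hdfs >/dev/null 2>&1; then
    hdfs dfs -ls "$target"
fi
```

# If the listing is empty, the log file passed via --log-file is the first place to look.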
# Read a generated avro file
sudo -u analytics hdfs dfs -chmod +r /tmp/test-aqu-20230522/externallinks/snapshot=2023-04/wiki_db=dkwikimedia/part-m-00006.avro
sudo -u analytics hdfs dfs -chmod -R +x /tmp/test-aqu-20230522/externallinks
spark3-shell --packages org.apache.spark:spark-avro_2.12:3.1.2
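// Use one of the two paths. In the REPL the later definition replaces the earlier one.
// Path on the local filesystem:
val path = "file:///tmp/test-aqu-20230522/part-m-00006.avro"
// Path on HDFS:
val path = "hdfs:///tmp/test-aqu-20230522/externallinks/snapshot=2023-04/wiki_db=dkwikimedia/part-m-00006.avro"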
val df = spark.read.format("avro").load(path)
df.show(5, false)
+-----+-------+-------------------------+---------------------------------------------------------------------------------------+
|el_id|el_from|el_to_domain_index       |el_to_path                                                                             |
+-----+-------+-------------------------+---------------------------------------------------------------------------------------+
|1222 |467    |http://dk.politiken.     |/kultur/medier/ECE1992455/wikipedia-gaar-til-kamp-mod-redigeringskrig/                 |
|1223 |214    |https://org.wikipedia.da.|/w/index.php?title=Wikipedia:Tr%C3%A6f&oldid=7165729                                   |
|1224 |182    |https://org.wikipedia.da.|/wiki/Wikipedia:GLAM/Edit-a-thon_1864                                                  |
|1225 |214    |http://dk.politiken.     |/kultur/medier/ECE1992455/wikipedia-gaar-til-kamp-mod-redigeringskrig/                 |
|1226 |214    |https://org.wikipedia.da.|/w/index.php?title=Wikipedia:Landsbybr%C3%B8nden/Kommunevalgs-edit-a-thon&oldid=7295725|
+-----+-------+-------------------------+---------------------------------------------------------------------------------------+
# Cleanup
sudo -u hdfs kerberos-run-command hdfs hdfs dfs -rm -R /tmp/test-aqu-20230522
sudo -u analytics rm -Rf /tmp/test-aqu-20230522/*
rm -Rf /tmp/test-aqu-20230522
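# Sketch: a guard that could run before the recursive deletes above, so that a
# mistyped path is rejected. The function name is_scratch_dir and the
# /tmp/test-<something> convention are assumptions, not part of the original
# notes. The block only prints its decision and deletes nothing.

```shell
# Hypothetical guard: accept only paths under /tmp/test-<something>
# that contain no ".." component.
is_scratch_dir() {
    case "$1" in
        *..*) return 1 ;;
        /tmp/test-?*) return 0 ;;
        *) return 1 ;;
    esac
}

scratch=/tmp/test-aqu-20230522
if is_scratch_dir "$scratch"; then
    echo "ok to remove: $scratch"
else
    echo "refusing to remove: $scratch" >&2
fi
```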