Basic access / overview:
- ssh access to stat machines
- ssh config set up and access confirmed on stat1008, which is where my code for this project lives: ssh mhoutti@stat1008.eqiad.wmnet
- set up the HTTP proxy by adding the following to ~/.profile:
  export http_proxy=http://webproxy:8080
  export https_proxy=http://webproxy:8080
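- quick sanity check (my own sketch, not from the session; assumes the requests package is available on the stat host and that the proxy env vars are picked up from ~/.profile):
  import os
  import requests

  # requests honors the http_proxy / https_proxy variables set above
  print(os.environ.get("https_proxy"))
  resp = requests.get("https://en.wikipedia.org/api/rest_v1/page/summary/Earth", timeout=10)
  print(resp.status_code)  # 200 means outbound traffic is going through the webproxy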
- HDFS + kinit-ing
- able to access Hive ($ kinit, then $ hive, then queries like show databases; at the Hive prompt)
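- the same can be done from Python with wmfdata (a minimal sketch; assumes wmfdata's hive module is available and that you have already kinit-ed):
  import wmfdata

  # runs a Hive query and returns the result as a pandas DataFrame
  dbs = wmfdata.hive.run("SHOW DATABASES")
  print(dbs.head())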
- PySpark + Jupyter notebooks
- alias set up for easy access to Jupyter notebooks ($ NEWPYTER 8, then navigate to localhost:8880/ in your web browser)
- wmfdata demonstrated in Jupyter notebook:
  import wmfdata

  # initiate session to access the cluster and indicate how many resources you need (the `type` param)
  spark = wmfdata.spark.get_session(
      app_name='pyspark regular; example application',
      type='yarn-regular',  # local, yarn-regular, yarn-large
  )

  # run code etc. against the cluster
  spark.sql('show databases;').show(100, False)
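- follow-up usage pattern (my own sketch, reusing the spark session above; assumes the wmf.pageview_hourly table and that the chosen day's partition exists):
  # pull a small aggregate into a local pandas DataFrame with .toPandas()
  views = spark.sql("""
      SELECT project, SUM(view_count) AS views
      FROM wmf.pageview_hourly
      WHERE year = 2021 AND month = 1 AND day = 1
      GROUP BY project
      ORDER BY views DESC
      LIMIT 10
  """).toPandas()
  print(views)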
- Superset
- demonstrated SQL Lab and saved queries
- local access to dumps / mariadb replicas
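- quick sketch of poking at the locally mounted dumps (my own example; the mount path is what I recall from the wikitech docs, so double-check it on the host):
  import os

  # XML dumps are mounted read-only on the stat machines
  DUMPS_DIR = "/mnt/data/xmldatadumps/public/enwiki"
  # list the most recent dump date directories
  print(sorted(os.listdir(DUMPS_DIR))[-5:])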
Detailed documentation / examples:
- Data on the cluster: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake
- wmfdata: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Jupyter#PySpark_and_wmfdata
- Superset: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Superset
- Superset example of top-viewed articles on a given day/wiki: https://superset.wikimedia.org/superset/sqllab?savedQueryId=355 (a rough PySpark equivalent is sketched after this list)
- Suggestbot extraction: https://github.com/geohci/wiki-prioritization/blob/master/recommendation_evaluation/suggestbot/SuggestBotExtractor.ipynb
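- rough PySpark equivalent of the top-viewed-articles saved query (my own sketch, not the saved query itself; assumes the wmf.pageview_hourly table with its project/agent_type fields, and that the chosen day's partition exists):
  import wmfdata

  spark = wmfdata.spark.get_session(app_name='top viewed articles example', type='yarn-regular')
  spark.sql("""
      SELECT page_title, SUM(view_count) AS views
      FROM wmf.pageview_hourly
      WHERE year = 2021 AND month = 1 AND day = 1
        AND project = 'en.wikipedia'
        AND agent_type = 'user'
      GROUP BY page_title
      ORDER BY views DESC
      LIMIT 25
  """).show(25, False)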