Basic access / overview:
- Set up your ssh config (any config related to Gerrit can be removed) and confirm access to stat1008: ssh appledora@stat1008.eqiad.wmnet
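A minimal ~/.ssh/config sketch for the step above, assuming the standard bastion/jump-host setup; the bastion hostname is a placeholder, so check the production shell access docs on Wikitech for the one assigned to you:

```
# Sketch only -- replace the ProxyJump host with your assigned bastion.
Host *.eqiad.wmnet
    User appledora
    ProxyJump <your-bastion-host>
```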
- configure the HTTP proxy by adding the following to ~/.profile on stat1008:
export http_proxy=http://webproxy:8080
export https_proxy=http://webproxy:8080
- HDFS + kinit-ing (Kerberos authentication via kinit is required before using HDFS or Hive)
- confirm you can access Hive (on stat1008: run kinit, then hive, then show databases; at the Hive prompt)
- PySpark + Jupyter notebooks
- Verify you can access the Jupyter cluster from your local computer: ssh -N stat1008.eqiad.wmnet -L 8880:127.0.0.1:8880, then navigate to http://localhost:8880/ in your web browser and verify you can log in (shell username + Wikitech password). NOTE: requires LDAP access (T322222)
- Create a shell function for the above ssh command so you don't have to remember all that -- e.g., add the following to your .bash_profile or .bash_aliases file so that typing JUPYTER in your terminal connects you:
function JUPYTER() { ssh -N "stat1008.eqiad.wmnet" -L 8880:127.0.0.1:8880; }
Working with PySpark:
- Walk through an example notebook together -- wmfdata, Spark SQL queries, etc.
- Data on the cluster: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake
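The wmfdata + Spark SQL walkthrough above can be sketched as follows. This is a hedged sketch, not the example notebook itself: the table and column names (wmf.pageview_hourly, view_count, agent_type) are assumptions based on the Data Lake docs, and wmfdata.spark.run is assumed to be the query entry point -- verify both against the notebook before relying on them.

```python
# Sketch of querying the Data Lake from a Jupyter notebook via wmfdata.
# Table/column names are assumptions from the Data Lake docs; check with
# `show databases;` and `describe wmf.pageview_hourly;` first.
from datetime import date

def top_viewed_query(project: str, day: date, limit: int = 10) -> str:
    """Build a Spark SQL query for the top-viewed articles on one wiki/day."""
    return f"""
        SELECT page_title, SUM(view_count) AS views
        FROM wmf.pageview_hourly
        WHERE project = '{project}'
          AND year = {day.year} AND month = {day.month} AND day = {day.day}
          AND agent_type = 'user'
        GROUP BY page_title
        ORDER BY views DESC
        LIMIT {limit}
    """

query = top_viewed_query("en.wikipedia", date(2023, 1, 1))

# On a stat host (after kinit), wmfdata would run this and return a pandas
# DataFrame -- uncomment there:
# import wmfdata
# df = wmfdata.spark.run(query)
```

Building the query as a plain string first means you can inspect it (and paste it into the Hive CLI or Superset SQL Lab) before submitting it to the cluster.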
Backlog:
- Superset: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Superset
- example of top-viewed articles on a given day/wiki: https://superset.wikimedia.org/superset/sqllab?savedQueryId=355
- stat1008 access to dumps / mariadb replicas