
Onboard Nazia onto analytics infrastructure
Closed, Resolved (Public)

Description

Basic access / overview:

  • Set up your ssh config (the config related to gerrit can be removed) and confirm access to stat1008: ssh appledora@stat1008.eqiad.wmnet
  • Configure the HTTP proxy by adding the following to ~/.profile on stat1008:
export http_proxy=http://webproxy:8080
export https_proxy=http://webproxy:8080
  • HDFS + kinit-ing
  • Confirm you can access hive (from stat1008: run $ kinit, then $ hive, then show databases; at the hive prompt)
  • PySpark + Jupyter notebooks
  • Verify you can access Jupyter on stat1008 from your local computer: run ssh -N stat1008.eqiad.wmnet -L 8880:127.0.0.1:8880, then navigate to http://localhost:8880/ in your web browser and verify you can log in (shell username + Wikitech password). NOTE: requires ldap-access T322222
  • Create a bash function for the above ssh command so you don't have to remember it, e.g., add the following to your .bash_profile or .bash_aliases file so that typing JUPYTER in your terminal connects you:
function JUPYTER() {
  ssh -N "stat1008.eqiad.wmnet" -L 8880:127.0.0.1:8880;
}
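
The proxy step in the list above can also be done with a small helper, so that re-running it does not add duplicate lines to ~/.profile. This is only a sketch: add_proxy_exports is a hypothetical name, not an existing tool, and it assumes the profile file (if it already exists) ends with a newline. The proxy URL is the one given in the task.

```shell
# Hypothetical helper: append the two proxy exports to a profile file
# unless the exact lines are already present.
add_proxy_exports() {
  profile="${1:-$HOME/.profile}"   # default target is ~/.profile
  proxy="http://webproxy:8080"     # proxy URL from the task description
  touch "$profile"                 # make sure the file exists
  for var in http_proxy https_proxy; do
    line="export ${var}=${proxy}"
    # grep -x matches whole lines only, -F treats the pattern as a fixed string
    grep -qxF "$line" "$profile" || printf '%s\n' "$line" >> "$profile"
  done
}
```

On stat1008 this would be called as add_proxy_exports with no arguments, followed by a fresh login (or source ~/.profile) so the variables take effect.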

Working with PySpark:

Backlog:

Event Timeline

Once you have server access and want to test out a notebook, the Jupyter Hub interface has an upload button that you can use to add a notebook that you've downloaded to your laptop (similar to PAWS), or you can ssh into the stat1008 machine and download the notebook directly from the command line, e.g., $ wget -O "$HOME/pages-wiktionary_abbreviations.ipynb" "https://raw.githubusercontent.com/martingerlach/Wiki-examples/master/pages-wiktionary_abbreviations.ipynb"
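
For the command-line download, note that the shell does not expand a tilde inside double quotes, so the destination path is safer built from $HOME. A minimal sketch of that, where notebook_dest is a hypothetical helper name introduced only for illustration:

```shell
# Hypothetical helper: print the path in the home directory where a
# notebook URL would be saved, using the last path segment as the file name.
notebook_dest() {
  url="$1"
  # ${url##*/} strips everything up to and including the last slash
  printf '%s/%s\n' "$HOME" "${url##*/}"
}

# Intended use on stat1008 (needs network access via the proxy):
#   url="https://raw.githubusercontent.com/martingerlach/Wiki-examples/master/pages-wiktionary_abbreviations.ipynb"
#   wget -O "$(notebook_dest "$url")" "$url"
```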

MGerlach claimed this task.