
Onboard Mo to analytics infrastructure
Closed, Resolved · Public


Basic access / overview:

  • ssh access to stat machines
  • ssh config setup and access confirmed with stat1008, which is where my code for this project is located: ssh mhoutti@stat1008.eqiad.wmnet
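A minimal ~/.ssh/config sketch matching this setup (the bastion hostname here is a placeholder assumption; use the bastion assigned to your account):

```
# ~/.ssh/config (sketch; bastion hostname is an assumption)
Host bast
    HostName <your-assigned-bastion>   # e.g. a *.wikimedia.org bastion host
    User mhoutti

Host stat1008.eqiad.wmnet
    User mhoutti
    ProxyJump bast
```

With this in place, `ssh stat1008.eqiad.wmnet` connects through the bastion automatically.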
  • configure the HTTP proxy by adding to ~/.profile:
export http_proxy=http://webproxy:8080
export https_proxy=http://webproxy:8080
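These variables matter because the stat hosts reach the internet only through the web proxy; anything that makes outbound HTTP requests (pip, requests, etc.) reads them from the environment. A quick illustrative sanity check from Python (the values are set inline here purely for demonstration; on a stat host they come from ~/.profile):

```python
import os

# set inline for illustration; on a stat host ~/.profile provides these
os.environ["http_proxy"] = "http://webproxy:8080"
os.environ["https_proxy"] = "http://webproxy:8080"

# libraries such as requests and tools like pip honor these variables
print(os.environ["https_proxy"])
```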
  • HDFS + kinit-ing
  • able to access Hive (run `kinit`, then `hive`, then e.g. `show databases;` at the hive> prompt)
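The sequence on a stat host looks roughly like this (the Kerberos prompt and the database list will differ):

```
$ kinit                 # authenticate with Kerberos (prompts for password)
$ hive                  # start the Hive CLI
hive> show databases;   # HiveQL runs at the hive> prompt, not the shell
```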
  • PySpark + Jupyter notebooks
  • alias set up to easily access jupyter notebooks ($ NEWPYTER 8 and then navigate to localhost:8880/ in your web browser)
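The alias definition itself isn't reproduced in this task. A purely hypothetical shell function consistent with the described behavior (start/forward a remote Jupyter instance so it appears at localhost:8880) might look like the following; the real NEWPYTER helper may differ substantially:

```
# hypothetical ~/.bashrc helper; the actual NEWPYTER definition may differ.
# "NEWPYTER 8" would forward local port 8880 to the same port on stat1008.
NEWPYTER() {
    ssh -N -L "88${1}0:127.0.0.1:88${1}0" mhoutti@stat1008.eqiad.wmnet
}
```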
  • wmfdata demonstrated in Jupyter notebook:
import wmfdata

# initiate session to access the cluster and indicate how many resources you need (the `type` param)
spark = wmfdata.spark.get_session(
    app_name='pyspark regular; example application',
    type='yarn-regular',  # options: 'local', 'yarn-regular', 'yarn-large'
)

# run code etc. against the cluster
spark.sql('SHOW DATABASES').show(100, False)
  • Superset
  • showed SQL Lab and saved queries
  • local access to dumps / mariadb replicas
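On the stat hosts, the public dumps are mounted locally and the MariaDB replicas are reachable via a helper CLI. Paths and tool names below are as commonly documented for the analytics clients, but verify them on Wikitech for the current setup:

```
# public XML dumps mounted read-only on the stat hosts (path may vary)
$ ls /mnt/data/xmldatadumps/public/enwiki/

# connect to a wiki's MariaDB replica
$ analytics-mysql enwiki
```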

Detailed documentation / examples:

Event Timeline

Isaac added a subscriber: RoccoMo.

@RoccoMo we can continue with this next week (we can schedule a time during our Wednesday call, but the same time works for me). We'll go through the SuggestBot extraction notebook in detail. If you're curious, feel free to take a look at some of the examples/documentation ahead of time, but there's no expectation that you will have done so before our next session.

Closing this task -- we have not gone through every aspect (e.g., Superset hasn't been relevant for our analyses), but it's fair to say onboarding is complete and the analytical support needed from me is quite minimal at this point.

Isaac claimed this task.