A few people are in the process of requesting access to the analytics cluster to query data through Hive. We thought it would be a good idea to run a hands-on session for this group.
Prerequisites:
1. Request access for stat1002 (https://wikitech.wikimedia.org/wiki/Requesting_shell_access)
Most of you have already done this and are in the pipeline to get access. Feel free to reach out if you run into any trouble with this process.
2. Set up your ssh config (https://wikitech.wikimedia.org/wiki/SSH_access)
Add or update your ~/.ssh/config file. It should look something like this: http://pastebin.com/Mb0vCkd1. Set the User value to your labs or prod username, as appropriate for each host.
3. Add your keys to ssh-agent. In a terminal, run something like:
ssh-add ~/.ssh/id_rsa
ssh-add ~/.ssh/id_rsa_prod
4. If your access has been granted and your ssh config is correct, you should be able to log in to stat1002 from the terminal like this:
ssh stat1002.eqiad.wmnet
The first time you connect, ssh will prompt you to confirm the server's RSA fingerprint; once you answer yes, it will log you in to the server.
You can quit the session by typing exit.
5. Be comfortable with SQL basics.
Ping me or anyone on #wikimedia-analytics if you run into trouble with any of these steps.
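If you want to sanity-check steps 2 to 4 before asking for help, something like the sketch below may be useful. The function names (check_ssh_config, check_agent_keys, check_login) are made up for this example and are not part of any existing tooling; the sketch only assumes a standard OpenSSH client and does not modify your setup.

```shell
#!/bin/sh
# Hypothetical helper functions for checking prerequisite steps 2-4.
# Nothing here changes your ssh configuration or keys.

# Step 2: report whether an ssh config file exists at the given path
# (defaults to ~/.ssh/config). Returns non-zero if it is missing.
check_ssh_config() {
    cfg="${1:-$HOME/.ssh/config}"
    if [ -f "$cfg" ]; then
        echo "found: $cfg"
    else
        echo "missing: $cfg"
        return 1
    fi
}

# Step 3: list the keys currently loaded in ssh-agent.
# `ssh-add -l` exits non-zero when no keys are loaded or no agent is running.
check_agent_keys() {
    ssh-add -l
}

# Step 4: attempt a non-interactive login and run `hostname` on the server.
# BatchMode=yes makes ssh fail instead of prompting, so this is only expected
# to succeed after you have logged in once by hand and accepted the fingerprint.
# ConnectTimeout=10 gives up after ten seconds.
check_login() {
    ssh -o BatchMode=yes -o ConnectTimeout=10 "${1:-stat1002.eqiad.wmnet}" hostname
}
```

Source the file and call the three functions in order. The first one that fails tells you which step to revisit, and its output is handy to paste when asking for help on #wikimedia-analytics.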
Once this is done, you are all set to query data. I would like to host this session next week and cover:
- How Hive works
- How to query pageview data and anything else you may be interested in
- Privacy concerns around the data
- How to monitor your queries' progress, and troubleshoot common errors