A few people are in the process of requesting access to the analytics cluster to query data through Hive. We thought it would be a good idea to run a hands-on session for this group. Please complete the following steps before the session.
- Request access to stat1002 (https://wikitech.wikimedia.org/wiki/Requesting_shell_access)
Most of you have already done this, and are in the pipeline to get access. Feel free to reach out if there's any trouble in this process.
- Set up your ssh config (https://wikitech.wikimedia.org/wiki/SSH_access)
Add or update your ~/.ssh/config file. It should look something like this: http://pastebin.com/Mb0vCkd1. Set each User value to your labs or prod username, as appropriate for that host.
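As a rough illustration of the shape such a config can take, here is a minimal sketch. It assumes production hosts are reached through a bastion host. Every hostname, username, and key path in it is a placeholder and not the contents of the pastebin, so treat the linked pages as authoritative. The snippet writes the sample to a scratch file for review and does not touch ~/.ssh/config.

```shell
# Write a sample config to a scratch file; ~/.ssh/config is left alone.
# All values below are placeholders (assumptions), not the pastebin contents.
EXAMPLE="${TMPDIR:-/tmp}/ssh_config_example"
cat > "$EXAMPLE" <<'EOF'
# Hypothetical bastion host that production connections are proxied through.
Host bastion
    HostName bastion.example.org
    User your-prod-username
    IdentityFile ~/.ssh/id_rsa

# Alias for stat1002; the full hostname here is a placeholder.
Host stat1002
    HostName stat1002.example.internal
    User your-prod-username
    IdentityFile ~/.ssh/id_rsa
    ProxyCommand ssh -W %h:%p bastion
EOF
cat "$EXAMPLE"
```

Once the values are filled in from the real documentation, the Host blocks can be copied into ~/.ssh/config by hand.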
- Add your keys to the ssh-agent from the terminal.
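A minimal sketch of this step, assuming your private key lives at ~/.ssh/id_rsa (an assumed path; use whichever key matches the public key you submitted with your access request):

```shell
# Assumed key path; override by setting KEY before running.
KEY="${KEY:-$HOME/.ssh/id_rsa}"

if [ -f "$KEY" ] && command -v ssh-add >/dev/null 2>&1; then
    # Start an agent for this shell if one is not already running.
    [ -n "${SSH_AUTH_SOCK:-}" ] || eval "$(ssh-agent -s)"
    ssh-add "$KEY"   # prompts for the key's passphrase, if it has one
    ssh-add -l       # lists the keys the agent now holds
else
    echo "No key at $KEY (or ssh-add is not installed); adjust KEY and retry."
fi
```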
- If your access has been granted and your ssh config is correct, you should be able to ssh into stat1002 from the terminal.
The first time you connect, ssh will prompt you to confirm the host's RSA fingerprint; once you answer yes, it logs you in to the server.
You can quit the session by typing exit.
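The login step can be sketched as follows, assuming stat1002 is defined as a Host alias in your ~/.ssh/config (if it is not, use the full hostname given in the access documentation). The STAT_HOST variable and the timeout value are illustrative choices, not part of the original instructions.

```shell
# Assumed Host alias from ~/.ssh/config; override by setting STAT_HOST.
STAT_HOST="${STAT_HOST:-stat1002}"

# Open an interactive session; type "exit" on the server to end it.
status=0
ssh -o ConnectTimeout=10 "$STAT_HOST" || status=$?

# ssh itself exits with 255 when the connection fails;
# any other code comes from the remote session.
if [ "$status" -eq 255 ]; then
    echo "Connection to $STAT_HOST failed; re-check your access and ~/.ssh/config."
fi
```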
- SQL basics.
Ping me or anyone else on #wikimedia-analytics if you run into trouble with any of these steps.
Once this is done, you are all set to query data. I would like to host this session next week and cover:
- How Hive works
- How to query pageview data and anything else you may be interested in
- Privacy concerns around the data
- How to monitor your queries' progress, and troubleshoot common errors