
Request access to analytics cluster for Alaa Sarhan
Closed, Resolved, Public

Description

I am from the Wikidata team and would like access to the analytics cluster in order to do the following:

  • access Hadoop/Logstash to perform data analysis in support of product and technical decisions

My LDAP username: "alaasarhan"

Event Timeline

@alaa_wmde this requires explicit approval here on this task from your manager at WMDE and from @Nuria (added in copy) from the Analytics team at WMF.

Volans triaged this task as Medium priority. May 20 2019, 2:49 PM

Hi! I am @alaa_wmde's manager at WMDE and I hereby approve that Alaa gets the access needed.

Logstash and Hadoop are two different systems. Do you need access to both?

Hi @Nuria, yes, access to both would be ideal. Thank you.

@alaa_wmde Can you be a bit more specific as to what data you need access to? Just so you know, Logstash and Hadoop do not share any data, so an example or use case of what you are trying to do would help.

Hi @Nuria

Yep, I actually have access to Logstash already. I must have gotten confused and thought there was another Logstash for analytics or something.

This means this request is only about Hadoop for analytics purposes. One example from recent work where I need such access is querying how many requests we get to certain endpoints, with some filtering on the parameters used in those calls.
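
The kind of query described above can be sketched in a few lines of Python. This is only an illustration: it assumes the request URLs are already available as plain strings, and the `count_requests` helper and the sample URLs are hypothetical, not taken from any Wikimedia dataset or schema.

```python
from urllib.parse import urlsplit, parse_qs


def count_requests(urls, path, required_params):
    """Count URLs that hit `path` and carry every key/value in `required_params`."""
    total = 0
    for url in urls:
        parts = urlsplit(url)
        if parts.path != path:
            continue  # different endpoint
        query = parse_qs(parts.query)  # maps each key to a list of values
        if all(value in query.get(key, []) for key, value in required_params.items()):
            total += 1
    return total


# Hypothetical sample data, for illustration only.
sample_urls = [
    "https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q42",
    "https://www.wikidata.org/w/api.php?action=wbsearchentities&search=x",
    "https://www.wikidata.org/wiki/Q42",
    "https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q1",
]

print(count_requests(sample_urls, "/w/api.php", {"action": "wbgetentities"}))
```

On real data the same filter would be expressed in whatever query layer sits on top of the request logs; the sketch only shows the shape of the question (endpoint match plus parameter match).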

hashar subscribed.

Removing Release-Engineering-Team since there is already deployment/logstash access :]

@alaa_wmde you might already have access to http://pivot.wikimedia.org/ which is a nice GUI frontend and lets one extract some useful information about queries. The data set I keep using:

webrequest_sampled_128

All web requests received for all projects sampled to 1/128.

It is sampled and only kept for a week, so it is not as powerful as Hadoop :]
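
Because the data set is sampled to 1/128, a count read from it has to be scaled up to approximate the real number of requests. A minimal sketch, assuming uniform sampling; the `estimate_total` helper and the example count are hypothetical:

```python
SAMPLING_RATE = 128  # from the data set description: sampled to 1/128


def estimate_total(sampled_count, rate=SAMPLING_RATE):
    """Scale a count observed in the sample up to an estimate of the full count."""
    return sampled_count * rate


# Hypothetical example: 250 matching rows seen in the sampled data.
print(estimate_total(250))  # 32000
```

The result is an estimate only, and it gets noisier the smaller the sampled count is.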

@alaa_wmde please check if you have access to Turnilo (formerly known as Pivot). As @hashar mentioned, this is probably a good tool to find answers to your questions.

Please see https://turnilo.wikimedia.org/#webrequest_sampled_128 and https://wikitech.wikimedia.org/wiki/Analytics/Systems/Turnilo-Pivot#Access if you do not have access to turnilo.

In order to gain access to Hadoop you need: https://wikitech.wikimedia.org/wiki/Production_shell_access (though, for your use case, I think Turnilo is a much better way to get the data you need, which should be pretty small, so there is no need to comb through terabytes of data)

@alaa_wmde If turnilo is enough for analysis, should we mark this as resolved?

Hello, apologies for not replying sooner, and thanks @hashar for pinging me directly about it :)

I will follow the links. Indeed, I should have access to Turnilo, as I have already signed the NDA and have Logstash access. I will update this task once I have either gotten access or run into a problem getting it.

In order to gain access to Hadoop you need: https://wikitech.wikimedia.org/wiki/Production_shell_access (though, for your use case, I think Turnilo is a much better way to get the data you need, which should be pretty small, so there is no need to comb through terabytes of data)

I have a separate request for production shell access: T223698: Request access to deployment cluster for Alaa Sarhan. It is definitely good to know about Turnilo and to activate it, though, as in some cases it might be enough, or faster than Hadoop, for running some inquiries.

@alaa_wmde If turnilo is enough for analysis, should we mark this as resolved?

I will probably still need Hadoop access at some point, as some inquiries might not be answerable in Turnilo given its sampling and one-week data retention limits.
That said, I'm definitely fine if you prefer to resolve this task for now and open another one later when Hadoop access is next needed, or to put it on hold. Whatever suits your process is fine by me ;)

I will probably still need Hadoop access at some point, as some inquiries might not be answerable in Turnilo given its sampling and one-week data retention limits.

I think you need to look at the data in Turnilo a bit. Most data is not sampled, and all datasets (but one) are retained for longer than one week. Most are retained forever, as they are aggregated data.

Hi there, so apparently I am not in the NDA group yet (as seen here: https://tools.wmflabs.org/wmde-access/) for some reason. I signed the NDA on Feb 1 2019, 2:36 PM (as seen when I visit https://phabricator.wikimedia.org/L37). Or are we talking about another NDA to be signed here?

@alaa_wmde NDA permissions in general for WMDE staff are being discussed on a different ticket: https://phabricator.wikimedia.org/T225004. However, as @RStallman-legalteam has confirmed your NDA status in that ticket, I have now added you and you should be able to access Turnilo.

alaa_wmde claimed this task.

Yep, confirmed. I can access Turnilo now. Resolving this task for now; a new one will be created if Hadoop access is needed again.

Thanks everyone for being patient with me and for your help getting there!