
Request creation of Wikidata Concepts labs project
Closed, Resolved (Public)

Description

Project Name: Wikidata Concepts
Purpose: I am working as a Data Analyst for WMDE. We are building a project that will track and provide advanced analytics on the usage of custom, pre-specified selections of concepts and categories from Wikidata across the projects.
Wikitech Username of requestor: GoranSMilovanovic

Brief description: I need a Labs instance where I can install everything I need for analytics in R, and most probably the Anaconda ecosystem in the future. The instance will run a local PostgreSQL server to support application development there. The instance will access the MySQL replicas to track the usage of Wikidata across >900 projects. I will need to install R, RStudio Server, Shiny Server, PostgreSQL, and many Linux packages there. For the Shiny dashboards developed there to be available to the team/community, the instance will probably need a public IP address, with at least ports 8080 (default RStudio Server) and 3838 (default Shiny Server) opened. In the near future the instance will also have to be able to connect to Spark from RStudio using {sparklyr} on our Hadoop cluster (production).
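A provisioning sketch for the stack described above. This is an illustrative assumption based on a typical Debian/Ubuntu setup, not a tested recipe; package names and the install steps for the vendor .deb packages are assumptions:

```shell
# Hypothetical provisioning sketch for the requested instance (Debian/Ubuntu).
# Package names below are illustrative assumptions, not a verified recipe.

# Base R and PostgreSQL from the distribution repositories.
sudo apt-get update
sudo apt-get install -y r-base r-base-dev postgresql

# RStudio Server and Shiny Server ship as vendor .deb packages;
# fetch the current builds from the RStudio download pages and install, e.g.:
#   sudo gdebi rstudio-server-<version>-amd64.deb
#   sudo gdebi shiny-server-<version>-amd64.deb

# The two service ports mentioned in the request:
#   8080 -> RStudio Server, 3838 -> Shiny Server
# On Labs these are opened via the project's security groups,
# not a local firewall on the instance.
```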

How soon you are hoping this can be fulfilled: one week or so.

Event Timeline

GoranSMilovanovic renamed this task from Request creation of <Replace Me> labs project to Request creation of Wikidata Concepts labs project.Apr 24 2017, 10:17 AM

Unless there's a public interface for (e.g.) Hadoop, you won't be able to route there from a Labs VM. Having a public IP doesn't help with that.

Are you sure you want to start from scratch for this rather than use the existing Analytics infrastructure? Have you discussed this plan with anyone over there?

(No real objection to granting this resource request, if you're sure it's actually going to be useful to you)

In the near future the instance will also have to be able to connect to Spark from RStudio using {sparklyr} on our Hadoop cluster (production).

Connecting from labs to production hadoop will not really be possible.

@Andrew @Addshore

(1) As for Hadoop, I do not know whether there is a public interface, but I can ask. Let me check and get back to you.

(2) As for the following: "Are you sure you want to start from scratch for this rather than use the existing Analytics infrastructure? Have you discussed this plan with anyone over there?" - No, but I am eager to learn who would be the right person to approach and ask.

(3) As for the general question of whether I am sure that I need the Labs instance for this project, please advise with respect to the following constraints:

  • The idea is to have an RStudio Shiny dashboard running from a Labs instance, accessible to other people (WMDE, WMF, interested parties); the dashboard will have to be connected to a database back-end, ideally PostgreSQL;
  • That PostgreSQL back-end will initially be fed from the MariaDB replicas (fetch MariaDB -> pre-process -> PostgreSQL -> Dashboard); however, in the future, and I mean the near future, I would like the following workflow implemented: Hadoop/Spark (Cluster) -> PostgreSQL (Instance, close to the dashboard) -> Dashboard (Instance).
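The fetch-from-replicas step of that workflow can be sketched roughly as below. The surrounding `mysql`/`psql` commands are shown only as comments because hostnames, database names, and the `page_usage` table are placeholders I've invented for illustration; the runnable part demonstrates only the pre-processing stage on inline sample data:

```shell
# Illustrative pre-processing step of the fetch -> pre-process -> load pipeline.
# In the real workflow the input would come from a replica, e.g.:
#   mysql -h <replica-host> -N -B -e "SELECT page_id, page_title FROM ..." > raw.tsv
# and the output would be loaded into the dashboard's PostgreSQL back-end, e.g.:
#   psql -d <dashboard_db> -c "\copy page_usage(page_id, page_title) FROM 'clean.tsv'"
# Here we substitute inline sample data so the filter itself can be run.
printf '1\tBerlin\n2\t\n3\tQ64\n' > raw.tsv

# Keep only well-formed rows: exactly two fields, non-empty title.
awk -F'\t' 'NF == 2 && $2 != ""' raw.tsv > clean.tsv

cat clean.tsv
```

The row with an empty title is dropped, leaving two rows in `clean.tsv`.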

From your questions and suggestions, I can see that the bottleneck here seems to be the Hadoop/Spark (Cluster) -> PostgreSQL (Instance, supporting the dashboard) step. @Addshore says Labs cannot connect to Hadoop in production. I need help understanding how to bridge that gap: maybe a Labs instance is not the solution; is there an alternative? For example, can I have RStudio and Shiny Server + PostgreSQL installed somewhere close to Hadoop/Spark in production (I guess not)?

Thanks,
Goran

(2) As for the following: "Are you sure you want to start from scratch for this rather than use the existing Analytics infrastructure? Have you discussed this plan with anyone over there?" - No, but I am eager to learn who would be the right person to approach and ask.

The Analytics team owns the #analytics-cluster and would be the best people to ask. @Ottomata, perhaps.

@Addshore says Labs cannot connect to Hadoop in production. I need help understanding how to bridge that gap: maybe a Labs instance is not the solution; is there an alternative?

It should be possible to do the processing of the data from Hadoop within the analytics cluster and then transfer data with no PII out for use on the Shiny Server.

@Addshore Ok, thanks. Let me talk to Analytics then, and we will decide on the best option once I learn about exporting data from the Analytics Cluster.

@Andrew Thanks for the support. Let's not rush into a Labs instance until I learn more from Analytics; please leave the ticket open.

@GoranSMilovanovic It might also be worth talking to the Analysts on Discovery (Mikhail and Chelsy), who use R extensively. One thing we do have on our Hadoop cluster is Jupyter Notebooks that talk to Hadoop and the internal MySQL data store, and these are the helpful instructions the Discovery folks wrote for using it with R: https://meta.wikimedia.org/wiki/Discovery/Analytics#PAWS_Internal

chasemp changed the task status from Open to Stalled.May 1 2017, 12:39 PM
chasemp triaged this task as Low priority.
Addshore changed the task status from Stalled to Open.May 3 2017, 1:00 PM

So, could we move forward with the creation of this project?

We have discussed the use case and believe moving forward with a VM is the right approach for now.

It will likely be running a PostgreSQL server, RStudio, and Shiny.

Approved, I will set this up shortly.

Ah, sorry, one other thing -- project names need to be a single lowercase word, no spaces or camel case. 'wikidataconcepts' maybe?

That will do just fine: 'wikidataconcepts'. Thank you.

Regards,
Goran

ok! I have (finally) created the project. You can add other users or projectadmins as you see fit.