
Request creation of data-engineering VPS project
Closed, Resolved (Public)

Description

Project Name: data-engineering

Wikitech Usernames of requestors: Btullis, Elukey

Purpose: This will be used with Pontoon as a testbed for changes to the data engineering team's infrastructure - T292388: Move the Analytics/DE testing infrastructure to Pontoon

Brief description: We require a new cloud VPS project in which we can bootstrap a new Pontoon environment, then work iteratively to incorporate the following subsystems:

  • User management with Kerberos
  • Hadoop masters and workers
  • Hive, Presto, Spark, Alluxio, Jupyter

How soon you are hoping this can be fulfilled: As soon as possible please, ideally to start work during the DSE hackathon this week. However, we're unlikely to need more than 3-4 VMs this week in order to start addressing the first sub-task: T292389: Automate kerberos credential creation and management to ease the creation of testing infrastructure

Event Timeline

https://wikitech.wikimedia.org/wiki/Help:Cloud_VPS_project#Reviews_of_Cloud_VPS_Project_requests

"Umbrella" projects with a broad scope, such as all the work to be done by an engineering team or a large problem space.

"Umbrella" projects with broad scopes are difficult to track over time because of organizational changes and lack of continuity in ownership.

I know that the https://openstack-browser.toolforge.org/project/analytics project exists. This sounds like a second version of that project, is that correct?

Yes, in a way it's a second version of that project. I have recently been added as an admin to the analytics project and have started using it.
However, I thought it best to request a new project for several reasons:

  • As this is to be based on Pontoon, it is quite a departure from anything that is already in the analytics project. Therefore I thought it better to bootstrap a fresh environment: https://wikitech.wikimedia.org/wiki/Puppet/Pontoon#Server_bootstrap
  • There are already several systems with several different admins using them in the analytics project, but these currently focus on other parts of the stack, such as Kafka, Airflow, and AQS. At the moment the requirement I have is more back-end focused, such as Kubernetes integration with Hadoop, Spark, etc.
  • I would hope to be able to deprecate the analytics project over time (with the consent of the existing admins) and expand the scope of the data-engineering project.
  • This deprecation of the old system and replacement with the new system also reflects the change in name of the Analytics team to the Data Engineering team.

However, I'll understand if you would prefer us to stick with a single project, or even rename it (if that's possible).

BTullis renamed this task from Request creation of data-engineering-testing VPS project to Request creation of data-engineering VPS project. Oct 7 2021, 8:27 AM
BTullis updated the task description.

I have changed the requested project name from data-engineering-testing to data-engineering.

If this is intended as a simple project rename then that's fine -- in that case please suggest a potential deadline for when we can close out the old project.

If, on the other hand, there are multiple projects intended here that can be separated into different tenants then I'd encourage you to request multiple projects, e.g. data-engineering-kafka, data-engineering-hadoop, etc.

Thanks for your reply. I haven't yet had a chance to talk to my team about a deprecation schedule for the analytics project, but I would think it likely that we will be able to progress in that direction and I'll make a note to raise it at our next weekly meeting.

In the meantime, I'd be grateful if we could set this up as a parallel project please, even with a very limited cap on the resources (instances, vCPU, RAM etc) available.
The reason for requesting it is that the existing analytics project already has a fairly extensive set of project-wide hiera definitions and a couple of standalone puppetmasters. Since I'm going to be starting out with a new approach of using [[https://wikitech.wikimedia.org/wiki/Puppet/Pontoon#Pontoon|Pontoon]], I would really rather start out with a fresh project-wide puppet configuration. This would also mean that other team members' existing workflows for testing components in the analytics project won't be affected by my work on this new stack.

How about if you were to create a data-engineering umbrella project with a limit of, say, 6 instances, 12 vCPUs and 18 GB of RAM to begin with? Would that be possible?
I should be able to get this new Pontoon-based stack working to the point where we can then begin migrating other systems into it, deprecating those instances in the analytics project, and transferring the existing quotas accordingly.
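
To illustrate how that starting quota might be budgeted, here is a minimal sketch in Python. Only the quota figures (6 instances, 12 vCPUs, 18 GB of RAM) come from this request. The VM names and sizes in `plan` are hypothetical placeholders, not actual WMCS flavors or an agreed layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quota:
    instances: int
    vcpus: int
    ram_gb: int


@dataclass(frozen=True)
class VM:
    name: str
    vcpus: int
    ram_gb: int


def remaining(quota: Quota, vms: list[VM]) -> Quota:
    """Return the headroom left in each quota dimension after placing vms."""
    return Quota(
        instances=quota.instances - len(vms),
        vcpus=quota.vcpus - sum(vm.vcpus for vm in vms),
        ram_gb=quota.ram_gb - sum(vm.ram_gb for vm in vms),
    )


def fits(quota: Quota, vms: list[VM]) -> bool:
    """True if no quota dimension is exceeded by the planned VMs."""
    left = remaining(quota, vms)
    return min(left.instances, left.vcpus, left.ram_gb) >= 0


# Quota figures taken from the request above.
REQUESTED = Quota(instances=6, vcpus=12, ram_gb=18)

# Hypothetical first-week plan (names and sizes are assumptions).
plan = [
    VM("puppetserver", vcpus=2, ram_gb=4),
    VM("kerberos-kdc", vcpus=2, ram_gb=2),
    VM("hadoop-master", vcpus=2, ram_gb=4),
    VM("hadoop-worker", vcpus=2, ram_gb=4),
]

if __name__ == "__main__":
    print("fits:", fits(REQUESTED, plan))
    print("headroom:", remaining(REQUESTED, plan))
```

Under these assumed sizes, the 3-4 VMs mentioned in the description fit with some headroom in every dimension. Real flavor sizes would need to be substituted before drawing any conclusion.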

I do appreciate your suggestion of creating separate projects for different elements (Kafka, Hadoop, etc.). The issue is that we need a framework for testing the integration of many different elements together, including Kerberos authentication. Therefore, splitting them across separate projects would make the interaction between them quite cumbersome. We currently use physical servers for this testing setup, but we are hoping to use Pontoon on WMCS to make this testbed significantly more flexible, and the creation of this new WMCS project is the first step in this test phase.

Mentioned in SAL (#wikimedia-cloud) [2021-10-21T11:35:45Z] <arturo> create project with btullis & elukey as projectadmins, quota 6 instances 12 cores and 18G ram (T292563)

aborrero claimed this task.