
DSE kubernetes namespace for llm-inference
Closed, ResolvedPublic

Description

Research wants to deploy workloads on the DSE cluster

  • Examples include: a Postgres DB for state that requires random access, a vector database for a research use case, PoC endpoints (APIs not a fit for Lift Wing, internal data exploration UIs, using Kafka events), and LLM inference on GPU
  • Initially the namespace will be used internally for testing/development; hopefully/presumably there will eventually be dedicated namespaces for shared infrastructure (e.g. database-as-a-service, ML inference/training on GPU)

Event Timeline

BTullis added subscribers: Gehel, odimitrijevic, BTullis.

I'm tagging some people and projects for visibility and approval.

I think this is a great idea in terms of self-service, but maybe we should try to split your examples out into separate requests at an early stage.
Having just one research namespace could become unmanageable if you are experimenting with different technologies in it concurrently.

We could perhaps start with an llm-inference namespace?
If you can expand on the postgres use-case, that might help too.

We currently deploy one PostgreSQL cluster per application, and these are at present limited to Airflow instances.
It would be great to understand how much data you would be intending to use in postgresql and what the expected usage pattern would be.

XiaoXiao-WMF changed the task status from Open to In Progress.Nov 20 2024, 3:10 PM
XiaoXiao-WMF assigned this task to fkaelin.
XiaoXiao-WMF set Due Date to Fri, Dec 20, 5:00 AM.

Picking this back up. Thanks for the background Ben.

In this case I agree that starting with an llm-inference namespace makes the most sense, especially as this is also our most important active use case. As part of SDS 1.2.1 (Test existing AI models for internal use-cases) we are running tests on the GPUs installed on the new ml-labs instances. We are facing limitations in what we can install/run due to the lack of Docker, and it would be helpful and informative to run these LLM inference workloads on the "untapped" MI210 GPUs on the DSE cluster.

We can publish a Docker image for the workload to the WMF registry, but it would be great to get some hands-on help to set up the Helm charts and the review/deployment steps, as this is Research's first namespace and it also requires the provisioning of a GPU (for which I hope/expect we can lean on existing charts from the ml-serve cluster). What are the next steps for this?
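For illustration, the GPU part of such a workload might look something like the fragment below in a pod spec or chart values. This is only a sketch: the field layout is hypothetical, and it assumes the cluster runs the AMD device plugin, which exposes MI210s under the `amd.com/gpu` extended resource name.

```yaml
# Hypothetical resource request for one MI210 GPU.
# Assumes the AMD k8s device plugin is enabled on the DSE cluster;
# field names here are illustrative, not the actual chart schema.
resources:
  requests:
    amd.com/gpu: 1
    memory: 32Gi
  limits:
    amd.com/gpu: 1      # extended resources must have requests == limits
    memory: 32Gi
```

Note that Kubernetes requires extended resources such as `amd.com/gpu` to be specified with equal requests and limits; they cannot be overcommitted.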

Assigning to myself to pick this up for now.

BTullis renamed this task from DSE kubernetes namespace for Research to DSE kubernetes namespace for llm-inference.Tue, Dec 10, 4:42 PM

Change #1102284 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] dse-k8s: Add a namespace for llm-inference work by the ML team

https://gerrit.wikimedia.org/r/1102284

Change #1102287 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] dse-k8s: Add token for the llm-inference namespace

https://gerrit.wikimedia.org/r/1102287

Change #1102287 merged by Btullis:

[operations/puppet@production] dse-k8s: Add tokens for the llm-inference namespace

https://gerrit.wikimedia.org/r/1102287

Change #1102284 merged by jenkins-bot:

[operations/deployment-charts@master] dse-k8s: Add a namespace for llm-inference work by the ML team

https://gerrit.wikimedia.org/r/1102284

BTullis added subscribers: MunizaA, gmodena.

This is now ready for use.
Currently, three users are permitted to access this namespace, by virtue of being members of the research-deployers group.

btullis@deploy2002:~$ getent group research-deployers
research-deployers:x:835:fab,gmodena,mnz

That's @fkaelin, @gmodena, and @MunizaA. If you would like to modify this access list, let me know or create a ticket.

The next step is probably to start to create a helm chart for the new llm-inference work that you would like to do.

If you let us know what sort of processes, inputs, and outputs you expect from the work, then we can likely help you make a start here. This will be much more manageable than simply using kubectl to deploy resources into the namespace.
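As a rough sketch of how those processes, inputs, and outputs could map onto chart values, something like the fragment below might emerge. All field names and the image path are assumptions for illustration; the real chart should follow the conventions and shared templates in operations/deployment-charts.

```yaml
# Illustrative values.yaml fragment only; not the deployment-charts schema.
main_app:
  image: docker-registry.wikimedia.org/llm-inference   # hypothetical image name
  version: 0.0.1
  port: 8080           # e.g. an HTTP inference endpoint
kafka:
  topics:
    - example.input.topic   # placeholder for a Kafka event input
```

Starting from a values file like this makes the workload's inputs and outputs explicit and reviewable, which is the main advantage over ad-hoc kubectl deployments.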
In the meantime, I'll close this ticket, if that's OK. Feel free to tag us on any follow-ups and reach out if you would like assistance getting started.