
[Discuss] Changes to ores codebase for migrating to kubernetes
Closed, ResolvedPublic

Description

This is a task to discuss and spec out how ores should be changed to work with blubber/helm/kubernetes. In order to achieve this, we need to archive the ores deploy repo. That means:

  • All model files need to be either pulled when building the images (which would make the images gigantic) or pulled down by the service when it starts (make the configs accept URLs as model paths)
  • Configs need to be overhauled into one big yaml file that will be checked into the helm charts
  • Model files won't work without the code (features and feature lists). This means the repos need to be turned into libraries and installed when building the image, using a file like requirements-production.txt
  • Assets that ores uses also need to be pulled down when the service starts
  • Scap configs need to be turned into helm configs
  • The wheels repo needs to be archived and completely ditched
  • The helm chart for ores needs a redis subchart for tests, meaning redis should also have a chart in releases.wikimedia.org/charts
  • Speaking of tests, it's better to wrap all of the python tests into one tox.ini file so the entrypoint for the "test" stage of the docker images can be easily defined
  • The blubber files need to be added to the ores repo, but this is blocked on T210267: Execution of the deployment pipeline should be configurable via .pipeline/config.yaml
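The first point above (configs accepting URLs as model paths) could be sketched roughly like this; the function name, cache directory, and overall shape are assumptions for illustration, not actual ORES code:

```python
import os
import urllib.parse
import urllib.request


def resolve_model_path(path, cache_dir="/tmp/ores-models"):
    """Return a local filesystem path for a model.

    If `path` is an http(s) URL, download it once into `cache_dir` and
    return the cached copy; otherwise return the path unchanged.
    (Sketch only: real code would verify checksums and handle retries.)
    """
    parsed = urllib.parse.urlparse(path)
    if parsed.scheme not in ("http", "https"):
        return path  # already a local path, nothing to fetch
    os.makedirs(cache_dir, exist_ok=True)
    local = os.path.join(cache_dir, os.path.basename(parsed.path))
    if not os.path.exists(local):
        urllib.request.urlretrieve(path, local)
    return local
```

With something like this, the same config key can hold either a local path (current behavior) or a URL (kubernetes behavior), so the change stays backward compatible.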

(More?)
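For the tox.ini point above, the consolidation could look roughly like this; the env name, file paths, and commands are assumptions, not the actual ORES layout:

```ini
; Hypothetical tox.ini wrapping all the python test suites, so the
; docker image's "test" stage entrypoint can simply be `tox`.
[tox]
envlist = py3

[testenv]
deps = -rtest-requirements.txt
commands =
    flake8 ores/ tests/
    pytest tests/
```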

Event Timeline

@Ladsgroup I feel like this should be a meeting?

Harej triaged this task as High priority. Apr 2 2019, 9:16 PM
Harej moved this task from Unorganized to Ready to go on the Machine-Learning-Team board.

> @Ladsgroup I feel like this should be a meeting?

yup

Just some notes; I agree in principle with most of the above.

> Configs need to be overhauled into one big yaml file that will be checked into the helm charts

That's not necessary. We can ship more than one config file via helm charts. We can ship config directories as well.
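For reference, a chart can ship a whole config directory via a ConfigMap template along these lines; the ConfigMap name and `config/` path are placeholders, not the actual ores chart layout:

```yaml
# Sketch: a ConfigMap that bundles every file under the chart's
# config/ directory, so configs don't have to be merged into one yaml.
apiVersion: v1
kind: ConfigMap
metadata:
  name: ores-config
data:
{{ (.Files.Glob "config/*").AsConfig | indent 2 }}
```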

> Assets that ores uses also need to be pulled down when the service starts

I am not so sure about this one. Very often assets are tightly coupled with the deployed code version; they are usually relatively small in size and fine to ship with the application.

> Scap configs need to be turned into helm configs

Which parts, and in which sense? The two systems are very different in both features and shortcomings; it probably makes more sense to just create the helm chart.

> The helm chart for ores needs a redis subchart for tests, meaning redis should also have a chart in releases.wikimedia.org/charts

Indeed. I'll take care of that.
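Declaring the subchart would look roughly like this in a Helm 2-era requirements.yaml; the version is illustrative, not a published chart version:

```yaml
# Sketch: pull redis as a subchart from the WMF chart repository.
# The version number is a placeholder.
dependencies:
  - name: redis
    version: 0.1.0
    repository: https://releases.wikimedia.org/charts/
    condition: redis.enabled
```

The `condition` lets test environments enable the bundled redis while production points at the external cluster instead.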

>> Configs need to be overhauled into one big yaml file that will be checked into the helm charts

> That's not necessary. We can ship more than one config file via helm charts. We can ship config directories as well.

Nice, thank you for letting me know!

>> Assets that ores uses also need to be pulled down when the service starts

> I am not so sure about this one. Very often assets are tightly coupled with the deployed code version; they are usually relatively small in size and fine to ship with the application.

We have an asset for ores that is 1.3 GB (the word2vec dictionary).

>> Scap configs need to be turned into helm configs

> Which parts, and in which sense? The two systems are very different in both features and shortcomings; it probably makes more sense to just create the helm chart.

Good question. What I meant was that scap configs (things in the scap folder) and puppet configs will all be tossed, meaning we should not assume anything from there, e.g. "the scap config says 3 parallel deploys will happen at the same time".

>>> Configs need to be overhauled into one big yaml file that will be checked into the helm charts

>> That's not necessary. We can ship more than one config file via helm charts. We can ship config directories as well.

> Nice, thank you for letting me know!

>>> Assets that ores uses also need to be pulled down when the service starts

>> I am not so sure about this one. Very often assets are tightly coupled with the deployed code version; they are usually relatively small in size and fine to ship with the application.

> We have an asset for ores that is 1.3 GB (the word2vec dictionary).

Ah, good point. We'll probably need to figure that out indeed.
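One common pattern for assets that large is an initContainer that fetches them into a shared volume before the service container starts; the image, URL, and paths below are placeholders, not a proposal for the actual chart:

```yaml
# Sketch: fetch a large asset (e.g. the word2vec dictionary) into an
# emptyDir before the ores container starts. URL and paths are placeholders.
spec:
  volumes:
    - name: assets
      emptyDir: {}
  initContainers:
    - name: fetch-assets
      image: curlimages/curl:latest  # placeholder image
      command: ["curl", "-sSLo", "/assets/word2vec.bin", "https://example.org/word2vec.bin"]
      volumeMounts:
        - name: assets
          mountPath: /assets
  containers:
    - name: ores
      volumeMounts:
        - name: assets
          mountPath: /srv/assets
```

This keeps the image small while still decoupling asset delivery from the application code, at the cost of a slower pod startup.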

>>> Scap configs need to be turned into helm configs

>> Which parts, and in which sense? The two systems are very different in both features and shortcomings; it probably makes more sense to just create the helm chart.

> Good question. What I meant was that scap configs (things in the scap folder) and puppet configs will all be tossed, meaning we should not assume anything from there, e.g. "the scap config says 3 parallel deploys will happen at the same time".

Ah ok, understood. This probably should be reworded to "Investigate which behaviors of scap configs should be kept (and how) while migrating to helm charts".
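As one example of that mapping, scap's "N parallel deploys" roughly corresponds to a Deployment's rolling-update knobs; the values below are illustrative, not taken from the actual scap config:

```yaml
# Sketch: replace at most 3 pods at a time during a deploy, loosely
# analogous to scap's parallel-deploy setting. Numbers are illustrative.
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 3
    maxSurge: 1
```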

Halfak lowered the priority of this task from High to Medium. Sep 11 2019, 9:17 PM

Some notes from the deployment pipeline meeting:

  • ORES K8s and CoW - https://phabricator.wikimedia.org/T182331
    • Takes advantage of CoW (copy-on-write memory) on a single machine
    • Containerization might mean that we significantly increase memory usage
    • This might add complexity and reduce capacity for ORES; i.e., sudden bursts of requests for uncommonly used wikis
    • Alex: losing excess capacity is not a big loss (more efficiency); adding capacity in k8s is easier
      • ORES relying on CoW will cause migration and pain issues in the future; i.e., the kernel does not break user space, but memory allocations may change in the future
      • Tricks like the ones apertium uses: loading and dropping models on demand
      • Actual RSS is 210GB; with CoW it is 56GB. We have a lot of uwsgi and celery processes running. Currently we have tuned the software to the hardware. If we need more capacity now, we need to add equipment (weeks to months). With K8s, there is room for growth.
      • Other benefits: standardized deployments, easier rollbacks (no mess with scap handling virtualenvs), automatic garbage collection of images. Also a lot more control over CI -- less work inside of puppet. More self-serve.
    • Aaron: we could say that we want to move *new* models to k8s -- we want to ensure that we're using time efficiently
      • We have a backpressure system to determine whether or not we're running out of capacity (via the queue and redis)
      • We need, though, to be able to handle random bots crawling uncommonly requested wikis -- how long does that scaling take?
        • Alex: seconds.
        • Aaron: if dynamic scaling is fast then that could give us more capacity that changes a lot -- how much is already built and how much would we have to build ourselves for this scaling?
        • Alex: we can help the scaler by feeding it RpS (requests per second) to help it work better, but most components are there. We're not there yet; we're working with a contractor; it's on our roadmap. We should have that before the end of September (hedging a bit).
  • Aaron: we're worried about being trailblazers. Multi-container deployment in particular.
    • Tyler: PipelineLib exists. It's not being used by other repos yet, but there's no reason it shouldn't work. Re. deployment, there is not a lot of magic there that could fall apart (Helm and K8s -- not trailblazing there).
    • Alex: the only trailblazing is the creation of two different container images from the same repo.
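The apertium-style trick mentioned above (loading and dropping models on demand) could be sketched with a small LRU cache; the class, loader callable, and capacity are hypothetical, not existing ORES code:

```python
from collections import OrderedDict


class ModelCache:
    """Keep at most `capacity` models in memory, dropping the least
    recently used one when a new model is loaded (sketch only)."""

    def __init__(self, loader, capacity=3):
        self._loader = loader      # callable: model_name -> model object
        self._capacity = capacity
        self._models = OrderedDict()

    def get(self, name):
        if name in self._models:
            self._models.move_to_end(name)        # mark as recently used
        else:
            if len(self._models) >= self._capacity:
                self._models.popitem(last=False)  # drop the LRU model
            self._models[name] = self._loader(name)
        return self._models[name]
```

This trades latency on a cache miss (reloading a model from disk) for a hard cap on resident memory, which is the relevant knob once each pod can no longer rely on CoW sharing.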

A couple more benefits of k8s I forgot to mention yesterday:

  • Ability to run more than one deployment. This might be beneficial from a product perspective, e.g. creating an ORES deployment for internal mediawiki usage and another for external researchers' usage. That would mean that RC could not be negatively impacted by a spike in research queries.
  • Canaries (actually a byproduct of the above plus some statistical request routing)
  • Decoupling from the OS of the hardware node the service runs on. That is, you can decide to adopt a newer Debian version faster and without much intervention from SRE. Alternatively, SRE can proceed with upgrading node OSes without impacting the service at all.
  • Decoupling from puppet and the need to have +2s from an SRE.
ACraze claimed this task.

Going to mark this as resolved since we have a good idea of what is needed to migrate and will not be moving ORES in its current form to k8s.