
[Discuss] Changes to ores codebase for migrating to kubernetes
Closed, ResolvedPublic

Description

This is a task to discuss and spec out how ores should be changed to work with blubber/helm/kubernetes. In order to achieve this, we need to archive the ores deploy repo. That means:

  • All model files need to be either pulled when building the images (which would make the images gigantic) or pulled down by the service when it starts (make the configs accept URLs as model paths)
  • Configs need to be overhauled into one big yaml file that will be checked into the helm charts
  • Model files won't work without the code (features and feature lists). This means the repos need to be turned into libraries and installed when building the image, using a file like requirements-production.txt
  • Assets that ores uses also need to be pulled down when the service starts
  • Scap configs need to be turned into helm configs
  • The wheels repo needs to be archived and completely ditched
  • The helm chart for ores needs a redis subchart for tests, meaning redis should also have a chart in releases.wikimedia.org/charts
  • Speaking of tests, it's better to wrap all of the python tests into one tox.ini file so the entrypoint for the "test" stage of the docker images can be easily defined
  • The blubber files need to be added to the ores repo, but this is blocked on T210267: Execution of the deployment pipeline should be configurable via .pipeline/config.yaml
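The first point above (configs accepting URLs as model paths) could be sketched roughly like this; the function name, cache directory, and overall shape are assumptions for illustration, not actual ORES code:

```python
import os
import urllib.parse
import urllib.request


def resolve_model_path(path, cache_dir="/tmp/ores-models"):
    """Return a local filesystem path for a model.

    If `path` is an http(s) URL, download it once into `cache_dir` and
    return the cached copy; otherwise return the path unchanged.
    (Sketch only: real code would verify checksums and handle retries.)
    """
    parsed = urllib.parse.urlparse(path)
    if parsed.scheme not in ("http", "https"):
        return path  # already a local path, nothing to fetch
    os.makedirs(cache_dir, exist_ok=True)
    local = os.path.join(cache_dir, os.path.basename(parsed.path))
    if not os.path.exists(local):
        urllib.request.urlretrieve(path, local)
    return local
```

With something like this, the same config key can hold either a local path (current behavior) or a URL (kubernetes behavior), so the change stays backward compatible.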

(More?)
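For the tox.ini point above, the consolidation could look roughly like this; the env name, file paths, and commands are assumptions, not the actual ORES layout:

```ini
; Hypothetical tox.ini wrapping all the python test suites, so the
; docker image's "test" stage entrypoint can simply be `tox`.
[tox]
envlist = py3

[testenv]
deps = -rtest-requirements.txt
commands =
    flake8 ores/ tests/
    pytest tests/
```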

Event Timeline

@Ladsgroup I feel like this should be a meeting?

Harej triaged this task as High priority. Apr 2 2019, 9:16 PM
Harej moved this task from Unorganized to Ready to go on the Machine-Learning-Team board.

> @Ladsgroup I feel like this should be a meeting?

yup

Just some notes; I agree in principle with most of the above.

> Configs need to be overhauled into one big yaml file that will be checked into the helm charts

That's not necessary. We can ship more than one config file via helm charts. We can ship config directories as well.
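For reference, a chart can ship a whole config directory via a ConfigMap template along these lines; the ConfigMap name and `config/` path are placeholders, not the actual ores chart layout:

```yaml
# Sketch: a ConfigMap that bundles every file under the chart's
# config/ directory, so configs don't have to be merged into one yaml.
apiVersion: v1
kind: ConfigMap
metadata:
  name: ores-config
data:
{{ (.Files.Glob "config/*").AsConfig | indent 2 }}
```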

> Assets that ores uses also need to be pulled down when the service starts

I am not so sure about this one. Very often assets are tightly coupled with the deployed code version; they are usually relatively small in size and fine to ship with the application.

> Scap configs need to be turned into helm configs

Which parts, and in which sense? The two systems are very different in both features and shortcomings; it probably makes more sense to just create the helm chart.

> The helm chart for ores needs a redis subchart for tests, meaning redis should also have a chart in releases.wikimedia.org/charts

Indeed. I'll take care of that.
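Declaring the subchart would look roughly like this in a Helm 2-era requirements.yaml; the version is illustrative, not a published chart version:

```yaml
# Sketch: pull redis as a subchart from the WMF chart repository.
# The version number is a placeholder.
dependencies:
  - name: redis
    version: 0.1.0
    repository: https://releases.wikimedia.org/charts/
    condition: redis.enabled
```

The `condition` lets test environments enable the bundled redis while production points at the external cluster instead.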

>> Configs need to be overhauled into one big yaml file that will be checked into the helm charts

> That's not necessary. We can ship more than one config file via helm charts. We can ship config directories as well.

Nice, thank you for letting me know!

>> Assets that ores uses also need to be pulled down when the service starts

> I am not so sure about this one. Very often assets are tightly coupled with the deployed code version; they are usually relatively small in size and fine to ship with the application.

We have an asset for ores that is 1.3 GB (the word2vec dictionary).

>> Scap configs need to be turned into helm configs

> Which parts, and in which sense? The two systems are very different in both features and shortcomings; it probably makes more sense to just create the helm chart.

Good question. What I meant was that scap configs (things in the scap folder) and puppet configs will all be tossed, meaning we should not assume anything from there, e.g. "the scap config says 3 parallel deploys will happen at the same time".

>>> Configs need to be overhauled into one big yaml file that will be checked into the helm charts

>> That's not necessary. We can ship more than one config file via helm charts. We can ship config directories as well.

> Nice, thank you for letting me know!

>>> Assets that ores uses also need to be pulled down when the service starts

>> I am not so sure about this one. Very often assets are tightly coupled with the deployed code version; they are usually relatively small in size and fine to ship with the application.

> We have an asset for ores that is 1.3 GB (the word2vec dictionary).

Ah, good point. We'll probably need to figure that out indeed.
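One common pattern for assets that large is an initContainer that fetches them into a shared volume before the service container starts; the image, URL, and paths below are placeholders, not a proposal for the actual chart:

```yaml
# Sketch: fetch a large asset (e.g. the word2vec dictionary) into an
# emptyDir before the ores container starts. URL and paths are placeholders.
spec:
  volumes:
    - name: assets
      emptyDir: {}
  initContainers:
    - name: fetch-assets
      image: curlimages/curl:latest  # placeholder image
      command: ["curl", "-sSLo", "/assets/word2vec.bin", "https://example.org/word2vec.bin"]
      volumeMounts:
        - name: assets
          mountPath: /assets
  containers:
    - name: ores
      volumeMounts:
        - name: assets
          mountPath: /srv/assets
```

This keeps the image small while still decoupling asset delivery from the application code, at the cost of a slower pod startup.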

>>> Scap configs need to be turned into helm configs

>> Which parts, and in which sense? The two systems are very different in both features and shortcomings; it probably makes more sense to just create the helm chart.

> Good question. What I meant was that scap configs (things in the scap folder) and puppet configs will all be tossed, meaning we should not assume anything from there, e.g. "the scap config says 3 parallel deploys will happen at the same time".

Ah ok, understood. This probably should be reworded to "Investigate which behaviors of scap configs should be kept (and how) while migrating to helm charts".
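As one example of that mapping, scap's "N parallel deploys" roughly corresponds to a Deployment's rolling-update knobs; the values below are illustrative, not taken from the actual scap config:

```yaml
# Sketch: replace at most 3 pods at a time during a deploy, loosely
# analogous to scap's parallel-deploy setting. Numbers are illustrative.
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 3
    maxSurge: 1
```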

Halfak lowered the priority of this task from High to Medium. Sep 11 2019, 9:17 PM

Some notes from the deployment pipeline meeting:

  • ORES K8s and CoW - https://phabricator.wikimedia.org/T182331
    • Takes advantage of CoW (copy-on-write memory) on a single machine
    • Containerization might mean that we significantly increase memory usage
    • This might add complexity and reduce capacity for ORES; i.e., sudden bursts of requests for uncommonly used wikis
    • Alex: losing excess capacity is not a big loss (more efficiency); adding capacity in k8s is easier
      • ORES relying on CoW will cause migration and pain issues in the future; i.e., the kernel does not break user space, but memory allocations may change in the future
      • Tricks like the ones apertium uses: loading and dropping models on demand
      • Actual RSS is 210GB; with CoW it is 56GB. We have a lot of uwsgi and celery processes running. Currently we have tuned the software to the hardware. If we need more capacity now, we need to add equipment (weeks to months). With K8s, there is room for growth.
      • Other benefits: standardized deployments, easier rollbacks (no mess with scap handling virtualenvs), automatic garbage collection of images. Also a lot more control over CI -- less work inside of puppet. More self-serve.
    • Aaron: we could say that we want to move *new* models to k8s -- we want to ensure that we're using time efficiently
      • We have a backpressure system to determine whether or not we're running out of capacity (via the queue and redis)
      • We need, though, to be able to handle random bots crawling uncommonly requested wikis -- how long does that scaling take?
        • Alex: seconds.
        • Aaron: if dynamic scaling is fast then that could give us more capacity that changes a lot -- how much is already built and how much would we have to build ourselves for this scaling?
        • Alex: we can help the scaler by feeding it RpS (requests per second) to help it work better, but most components are there. We're not there yet; we're working with a contractor; it's on our roadmap. We should have that before the end of September (hedging a bit).
  • Aaron: we're worried about being trailblazers. Multi-container deployment in particular.
    • Tyler: PipelineLib exists. It's not being used by other repos yet, but there's no reason it shouldn't work. Re. deployment, there is not a lot of magic there that could fall apart (Helm and K8s -- not trailblazing there).
    • Alex: the only trailblazing is the creation of two different container images from the same repo.
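The apertium-style trick mentioned above (loading and dropping models on demand) could be sketched with a small LRU cache; the class, loader callable, and capacity are hypothetical, not existing ORES code:

```python
from collections import OrderedDict


class ModelCache:
    """Keep at most `capacity` models in memory, dropping the least
    recently used one when a new model is loaded (sketch only)."""

    def __init__(self, loader, capacity=3):
        self._loader = loader      # callable: model_name -> model object
        self._capacity = capacity
        self._models = OrderedDict()

    def get(self, name):
        if name in self._models:
            self._models.move_to_end(name)        # mark as recently used
        else:
            if len(self._models) >= self._capacity:
                self._models.popitem(last=False)  # drop the LRU model
            self._models[name] = self._loader(name)
        return self._models[name]
```

This trades latency on a cache miss (reloading a model from disk) for a hard cap on resident memory, which is the relevant knob once each pod can no longer rely on CoW sharing.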

A couple more benefits of k8s I forgot to mention yesterday:

  • Ability to run more than one deployment. This might be beneficial from a product perspective, e.g. creating an ORES deployment for internal mediawiki usage and another for external researchers' usage. That would mean that RC could not be negatively impacted by a spike in research queries.
  • Canaries (actually a byproduct of the above plus some statistical request routing)
  • Decoupling from the OS of the hardware node the service runs on. That is, you can decide to adopt a newer Debian version faster and without much intervention from SRE. Alternatively, SRE can proceed with upgrading node OSes without impacting the service at all.
  • Decoupling from puppet and the need to have +2s from an SRE.
ACraze claimed this task.

Going to mark this as resolved since we have a good idea of what is needed to migrate and will not be moving ORES in its current form to k8s.