
[Epic] Deploy ORES in kubernetes cluster
Open, Medium, Public

Description

This is unlikely to happen until Q4 of FY18. Keep this task as a means of gathering together work that we think we'll need to do in order to get this done.

Event Timeline

Halfak created this task. Dec 7 2017, 4:04 PM
Restricted Application added a subscriber: Aklapper. Dec 7 2017, 4:04 PM
awight added a subscriber: awight. Dec 7 2017, 4:05 PM
This comment was removed by awight.
Ottomata triaged this task as Low priority. Jan 16 2018, 7:34 PM
Halfak raised the priority of this task from Low to High. Feb 19 2019, 10:35 PM
awight removed a subscriber: awight. Mar 21 2019, 4:02 PM
Halfak lowered the priority of this task from High to Medium. Sep 11 2019, 9:18 PM
ACraze added a subscriber: ACraze. May 27 2020, 12:45 AM

I'm wondering about pod size limits and what that means for our current architecture. I believe I've heard there is a 2GB limit; is that correct?

Currently, I'm able to package up the deployment repo with all assets/models/etc. into a single container at ~1.79 GB; however, that will grow fairly quickly as we continue to host new models. We will need to spin up a number of these containers as uwsgi services and others as celery workers, too.

Also wondering if there are any sort of RAM limits per pod. Right now our production servers need ~56GB (and growing) to run all the models, although some models are much more memory-intensive than others. We should eventually split the different model types out into separate containers so we can manage memory better, but for now everything is hosted together.

akosiaris added a comment.

> I'm wondering about pod size limits and what that means for our current architecture. I believe I've heard there is a 2GB limit; is that correct?

I am assuming you mean disk. I am not aware of any such limit. That being said, if we manage to produce a container that fills up the nodes' disks, that's going to be an issue, so don't go overboard.

> Currently, I'm able to package up the deployment repo with all assets/models/etc. into a single container at ~1.79 GB; however, that will grow fairly quickly as we continue to host new models. We will need to spin up a number of these containers as uwsgi services and others as celery workers, too.

Such a big container image would make deployments pretty slow, I have to say. It's not a blocker, but at some point, and in line with the comment below, it would make sense to invest in the capability to load only a subset of models in each celery worker.
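
For illustration only, here is a minimal sketch of what "a subset of models per celery worker" could look like, selected via a hypothetical ORES_MODEL_GROUPS environment variable; the group names, file paths and the load_model() stub are assumptions for the sketch, not part of the current ORES deployment:

```
import os

# Illustrative only: a hypothetical mapping of model groups to model files.
MODEL_GROUPS = {
    "editquality": [
        "models/enwiki.damaging.gradient_boosting.model",
        "models/enwiki.goodfaith.gradient_boosting.model",
    ],
    "articlequality": ["models/enwiki.articlequality.gradient_boosting.model"],
}


def load_model(path):
    # Stand-in for the real loader (e.g. revscoring's Model.load);
    # it only records the path so this sketch runs on its own.
    return {"path": path}


def load_models_for_worker():
    """Load only the model groups this worker pod is configured for."""
    wanted = os.environ.get("ORES_MODEL_GROUPS", "editquality").split(",")
    return {
        path: load_model(path)
        for group in wanted
        for path in MODEL_GROUPS.get(group.strip(), [])
    }


if __name__ == "__main__":
    # e.g. ORES_MODEL_GROUPS=editquality on one pod, articlequality on another
    print(sorted(load_models_for_worker()))
```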

> Also wondering if there are any sort of RAM limits per pod. Right now our production servers need ~56GB (and growing) to run all the models, although some models are much more memory-intensive than others. We should eventually split the different model types out into separate containers so we can manage memory better, but for now everything is hosted together.

Yes, there are. They are configurable on a per-service basis, but up to now pods have been capped at 4GB of RAM. We could definitely bump that up, but with a total of 56GB per pod/container, only one pod would currently fit on each node, which defeats a big part of the premise of moving the service into Kubernetes. Even with 128GB RAM nodes (instead of the current 64GB RAM ones) the point remains. So it might make sense to figure out how to load subsets of models on each pod. Note that this will have a pretty interesting repercussion: the copy-on-write optimizations of the Linux kernel (which are what keep memory usage at 56GB RAM instead of the ~210GB RAM that RSS reports) will no longer apply, at least not at such a scale.
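
For reference, a minimal sketch of how such a per-pod memory cap is expressed, using the official kubernetes Python client; the container name, image and the 2Gi/4Gi figures are placeholders echoing the numbers above, not the actual chart values:

```
from kubernetes import client

# Sketch of a container spec with the kind of per-pod memory limit described
# above. Name, image and sizes are placeholders, not the real ORES chart values.
container = client.V1Container(
    name="ores-celery-worker",
    image="docker-registry.example.org/ores:latest",
    resources=client.V1ResourceRequirements(
        requests={"cpu": "1", "memory": "2Gi"},
        limits={"cpu": "2", "memory": "4Gi"},  # exceeding this gets the container OOM-killed
    ),
)

pod_spec = client.V1PodSpec(containers=[container])
print(pod_spec.containers[0].resources.limits)
```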

mark added a subscriber: mark.May 29 2020, 1:41 PM

akosiaris added a comment.

@ACraze, there is something that has been bugging me about those numbers, and I may have misunderstood something. It seems that with a deployment size of ~1.79GB, we somehow have a requirement of 56GB of RAM. That sounds off from what I would expect by at least an order of magnitude. Looking at it more closely, I am guessing the 56GB RAM is the utilized memory on each hardware node, right? I am not sure that is the best way to estimate the memory requirements we will have for a Kubernetes installation. For starters, as I pointed out above, actual RSS seems to be closer to 210GB RAM and it's copy-on-write that keeps it at ~56GB; most importantly, each machine runs multiple instances of uwsgi and celery. In a Kubernetes uwsgi ORES pod, however, you can expect that we will be running 1, or perhaps 2 or 3, uwsgi processes, and conversely in a Kubernetes ORES celery worker pod 1 to 2 celery workers (we would add capacity on either side by just running more pods). That would probably yield completely different numbers, at the cost, however, of lowering the gains of copy-on-write (which can be extremely beneficial, but makes migrations like this one problematic).
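
To make the RSS-vs-copy-on-write point above concrete, here is a small Linux-only sketch (kernel >= 4.14) that reads a process's Rss and Pss from /proc/<pid>/smaps_rollup; Pss splits CoW-shared pages among the processes sharing them, so summing Pss across all uwsgi/celery workers lands near the ~56GB figure, while summing plain Rss lands near the ~210GB one. This is only an illustration of the accounting, not a measurement of ORES:

```
def rss_and_pss_kb(pid="self"):
    """Return (Rss, Pss) in kB for a process, read from /proc/<pid>/smaps_rollup."""
    rss = pss = 0
    with open(f"/proc/{pid}/smaps_rollup") as f:
        for line in f:
            if line.startswith("Rss:"):
                rss = int(line.split()[1])
            elif line.startswith("Pss:"):
                pss = int(line.split()[1])
    return rss, pss


if __name__ == "__main__":
    rss, pss = rss_and_pss_kb()
    # Rss counts every CoW-shared model page once per process; Pss divides
    # shared pages among their users, which is closer to the real memory bill.
    print(f"Rss: {rss} kB, Pss: {pss} kB")
```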

ACraze added a comment (edited). May 29 2020, 6:52 PM

@akosiaris ah yeah, I see what you're saying: the 56GB RAM is utilized memory due to CoW, but the RSS is much higher, which is not great for a containerized solution. It's definitely starting to seem like ORES is going to be difficult to migrate to k8s in its current form. I've been experimenting with some different model configurations, but nothing concrete yet. There's really only a handful of memory-hungry and/or high-traffic models, so it might make sense to split those out or possibly treat individual models as microservices at some point.
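
One possible shape for that split, sketched with Celery's standard task routing: one queue per model group, so memory-hungry or high-traffic groups get their own (differently sized) worker pods. The task names, queue names and broker URL are made up for illustration and are not the current ORES setup:

```
from celery import Celery

app = Celery("ores", broker="redis://localhost:6379/0")

# Hypothetical routing: each model group gets its own queue, so a worker pod
# only has to load (and hold in RAM) the models for the queues it consumes.
app.conf.task_routes = {
    "ores.tasks.score_editquality": {"queue": "editquality"},
    "ores.tasks.score_articlequality": {"queue": "articlequality"},
    "ores.tasks.score_drafttopic": {"queue": "drafttopic"},
}

# A worker pod for one group would then be started with, for example:
#   celery -A ores worker -Q editquality --concurrency=2
```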

Some notes from the deployment pipeline meeting:

  • ORES K8s and COW - https://phabricator.wikimedia.org/T182331
    • Takes advantage of CoW on a single machine
    • Containerization might mean that we significantly increase memory usage
    • This might add complexity and reduce ORES capacity, e.g. for handling sudden bursts of requests for uncommonly used wikis
    • Alex: losing excess capacity is not a big loss (more efficiency); adding capacity in k8s is easier
      • ORES relying on CoW will cause migration pain and issues in the future; i.e., the kernel does not break user space, but memory allocations may change in the future
      • Tricks like the ones apertium is using: loading and dropping models on demand
      • Actual RSS is 210GB; with CoW it is 56GB. We have a lot of uwsgi and celery processes running. Currently, we have tuned the software to the hardware. If we need more capacity now, we need to add equipment (weeks to months). With K8s, there is room for growth.
      • Other benefits: standardized deployments, easier rollbacks (no mess with scap handling virtualenvs), automatic garbage collecting of images. Also a lot more control of CI -- less work inside of puppet. More self-serve.
    • Aaron: we could say that we want to move *new* models to k8s -- want to ensure that we're using time efficiently
      • We have a backpressure system to determine whether or not we're running out of capacity (via queue and redis)
      • We need, though, to be able to handle random bots crawling uncommonly requested wikis -- how long does that scaling take?
        • Alex: seconds.
        • Aaron: if dynamic scaling is fast, then that could give us more capacity that changes a lot -- how much is already built and how much would we have to build ourselves for this scaling?
        • Alex: we can help the scaler by feeding it RpS to help it work better, but most components are there. We're not there yet; we're working with a contractor; it's on our roadmap. We should have that before the end of September (hedging a bit). (A sketch of an RpS-driven autoscaler follows these notes.)
  • Aaron: we're worried about being trailblazers. Multi-container deployment in particular.
    • Tyler: PipelineLib exists. Not being used by other repos. No reason it shouldn't work. Re. deployment, there is not a lot of magic there that could fall apart (Helm and K8s -- not a trailblazer there).
    • Alex: The only trailblazing is the creation of two different container images from the same repo.
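
As a concrete illustration of the "feed the scaler RpS" point above: a sketch of a HorizontalPodAutoscaler driven by a per-pod requests-per-second metric, written here as a plain Python dict. The deployment name, replica bounds, metric name and the 50-RPS target are placeholders, and exposing the metric to the autoscaler (e.g. through a Prometheus adapter) is exactly the part noted above as not finished yet:

```
import json

# Sketch of an HPA (autoscaling/v2; v2beta2 at the time of this discussion)
# scaling a hypothetical ORES uwsgi Deployment on requests-per-second.
# All names and numbers are placeholders, not real chart values.
hpa = {
    "apiVersion": "autoscaling/v2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "ores-uwsgi"},
    "spec": {
        "scaleTargetRef": {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": "ores-uwsgi",
        },
        "minReplicas": 4,
        "maxReplicas": 40,
        "metrics": [
            {
                "type": "Pods",
                "pods": {
                    "metric": {"name": "requests_per_second"},
                    "target": {"type": "AverageValue", "averageValue": "50"},
                },
            }
        ],
    },
}

if __name__ == "__main__":
    print(json.dumps(hpa, indent=2))
```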

A couple more benefits of k8s that I forgot to mention yesterday:

  • Ability to run >1 deployments. This might be beneficial from a product perspective, e.g. creating an ORES deployment for internal mediawiki usage and another for external researchers' usage. That would mean that RC could not be negatively impacted by a spike in research queries.
  • Canaries (actually a byproduct of the above and some statistical request routing)
  • Decoupling from the OS of the hardware node the service runs on. That is, you can decide to adopt a newer Debian version faster and without much intervention from SRE. Alternatively, SRE can proceed with upgrading node OSes without impacting the service at all.
  • Decoupling from puppet and the need to have +2s from an SRE.