How can we enhance our production infrastructure using a cluster coordination tool like Kubernetes?
Closed, Resolved (Public)

Description

We have partially streamlined releasing a new service to production through a set of abstractions for puppet, monitoring, and coding infrastructure, but we still rely on a fairly static configuration of our production infrastructure.

While this is mostly acceptable for the MediaWiki application layer, it's starting to show its limitations for services.

Ideally, given most microservices don't use a lot of resources, we need to ensure that:

  • they run constantly with a given number of working instances per service
  • they're reasonably resilient to hardware failures
  • hardware usage is efficient enough
  • single services are properly isolated from each other
  • it's easy to deploy a new service, and doing so requires minimal ops intervention once the service is set up
  • it's easy for developers to test their service reliably, with a guarantee that the production environment is extremely similar to what they can reproduce both locally and in labs/beta
  • there is a clear, defined way to refer to other services from your own service in this environment

We think a potentially interesting way of achieving this is to use Kubernetes, a cluster coordination solution developed by Google that uses containers and dynamic configuration. Kubernetes is currently being introduced in toollabs as a modern replacement for the aging gridengine, and we're quite happy with it.
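To make the first two points of the list above more concrete, here is a minimal sketch of the desired-state idea that cluster coordinators of this kind are built around: you declare how many instances of each service should run, and a loop keeps correcting the cluster until reality matches. This is an illustration only, not how Kubernetes is implemented; the `Node` class, the `reconcile` function, the least-loaded placement rule, and the service names are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical machine in the cluster."""
    name: str
    healthy: bool = True
    # Names of the service instances currently running on this node.
    instances: list[str] = field(default_factory=list)


def running_count(nodes: list[Node], service: str) -> int:
    """Count instances of a service on healthy nodes only."""
    return sum(n.instances.count(service) for n in nodes if n.healthy)


def reconcile(nodes: list[Node], desired: dict[str, int]) -> None:
    """Bring the number of running instances in line with the desired counts."""
    # Anything that was running on a failed node is considered lost.
    for node in nodes:
        if not node.healthy:
            node.instances.clear()

    healthy = [n for n in nodes if n.healthy]
    if not healthy:
        return  # nowhere to schedule anything

    for service, want in desired.items():
        have = running_count(nodes, service)
        # Too few: start the missing instances on the least-loaded healthy node.
        for _ in range(want - have):
            target = min(healthy, key=lambda n: len(n.instances))
            target.instances.append(service)
        # Too many: stop the surplus, taking from the most-loaded node first.
        for _ in range(have - want):
            hosts = [n for n in healthy if service in n.instances]
            target = max(hosts, key=lambda n: len(n.instances))
            target.instances.remove(service)


# Hypothetical cluster and service names, for illustration only.
nodes = [Node("node1"), Node("node2"), Node("node3")]
desired = {"service-a": 3, "service-b": 2}

reconcile(nodes, desired)   # initial placement
nodes[0].healthy = False    # simulate a hardware failure
reconcile(nodes, desired)   # lost instances are restarted elsewhere
```

After the second call, the instances that were on the failed node are running on the two remaining nodes, without anyone editing a static configuration. A real coordinator would also have to deal with isolation, resource limits, and service discovery, which this sketch ignores.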

There are a lot of things we have to figure out before we can think of deploying this to production, from managing container security to monitoring/alerting to permissions. This session is meant to be an open discussion about the experience ops is having with Kubernetes in toollabs, what else would be needed to use it in production, and how we plan to extend its usage in both the short and the long term.

Event Timeline

Joe raised the priority of this task from to Needs Triage.
Joe updated the task description.
Joe subscribed.
yuvipanda set Security to None.

Wikimedia Developer Summit 2016 ended two weeks ago. This task is still open. If the session in this task took place, please make sure 1) that the session Etherpad notes are linked from this task, 2) that followup tasks for any actions identified have been created and linked from this task, 3) to change the status of this task to "resolved". If this session did not take place, change the task status to "declined". If this task itself has become a well-defined action which is not finished yet, drag and drop this task into the "Work continues after Summit" column on the project workboard. Thank you for your help!

Etherpad notes from the meeting are here:

https://etherpad.wikimedia.org/p/WikiDev16-T122822

I will not close the ticket until I have the time to sort those notes in a more ordered way.

Hence assigning to @Joe.

And maybe this should be closed now ;-)