How can we enhance our production infrastructure using a cluster coordination tool like Kubernetes?
Closed, Resolved (Public)

Authored By

	Joe
	Jan 4 2016, 6:42 PM

Description

We have partially streamlined releasing a new service to production through a set of abstractions for Puppet, monitoring and coding infrastructure, but we still rely on a fairly static configuration of our production infrastructure.

While this is mostly acceptable for the MediaWiki application layer, it's starting to show its limitations for services.

Ideally, given most microservices don't use a lot of resources, we need to ensure that:

they run constantly with a given number of working instances per service
they're reasonably resilient to hardware failures
hardware usage is efficient enough
single services are properly isolated from each other
it's easy to deploy a new service, and doing so requires minimal ops intervention once the service is set up
it's easy for developers to test their service reliably, with a guarantee that the production environment is extremely similar to what they can reproduce both locally and in labs/beta
there is a clear, defined way to refer to other services from your own service in this environment

We think a potentially interesting way of achieving this is to use Kubernetes, a cluster coordination solution developed by Google that uses containers and dynamic configuration. Kubernetes is currently being introduced in toollabs as a modern replacement for the rusty gridengine, and we're quite happy with it.
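As a rough illustration of the first two requirements in the list above (a given number of working instances per service, resilience to hardware failures), a coordinator of this kind repeatedly compares the desired state with what is actually running and corrects the difference. The sketch below is a toy model of that idea, not Kubernetes code: `Instance`, `reconcile`, the service names and the node names are all hypothetical, and the "least loaded node" placement rule is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """One running copy of a service on a given node (hypothetical model)."""
    service: str
    node: str


def reconcile(desired, running, healthy_nodes):
    """Return (to_start, to_stop) so each service reaches its desired count.

    desired:       dict mapping service name -> number of instances wanted
    running:       list of Instance the coordinator currently knows about
    healthy_nodes: list of node names that can accept work
    """
    if not healthy_nodes:
        raise ValueError("no healthy nodes available")

    healthy = set(healthy_nodes)
    # Instances on failed nodes are treated as lost.
    alive = [inst for inst in running if inst.node in healthy]

    # Track per-node load so new instances land on the least loaded node.
    load = {node: 0 for node in healthy_nodes}
    for inst in alive:
        load[inst.node] += 1

    to_start, to_stop = [], []
    for service, want in sorted(desired.items()):
        have = [inst for inst in alive if inst.service == service]
        # Too many instances: stop the surplus.
        for inst in have[want:]:
            to_stop.append(inst)
            load[inst.node] -= 1
        # Too few instances: schedule replacements.
        for _ in range(want - len(have)):
            node = min(healthy_nodes, key=lambda n: load[n])
            load[node] += 1
            to_start.append(Instance(service, node))
    return to_start, to_stop


if __name__ == "__main__":
    # Hypothetical scenario: node2 has failed and took two instances with it.
    desired = {"svc-a": 2, "svc-b": 1}
    running = [
        Instance("svc-a", "node1"),
        Instance("svc-a", "node2"),
        Instance("svc-b", "node2"),
    ]
    start, stop = reconcile(desired, running, ["node1", "node3"])
    print("start:", start)
    print("stop:", stop)
```

The point of the sketch is that operators only declare the desired counts; replacing instances lost to a hardware failure falls out of the same comparison that handles scaling up or down.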

There are a ton of things we have to figure out before we can think of deploying this to production, from managing container security to monitoring/alerting to permissions. This session is meant to be an open discussion about the experience ops is having with Kubernetes in toollabs, what else would be needed to use it in production, and how we plan to extend its usage in both the short and the long term.

Event Timeline

Joe created this task. Jan 4 2016, 6:42 PM

Joe raised the priority of this task from to Needs Triage.

Joe updated the task description.

Joe added a project: Wikimedia-Developer-Summit-2016.

Joe subscribed.

Restricted Application added subscribers: StudiesWorld, Aklapper. Jan 4 2016, 6:42 PM

yuvipanda added subscribers: yuvipanda, ori, Krinkle. Jan 4 2016, 6:46 PM

yuvipanda updated the task description. Jan 4 2016, 6:49 PM

yuvipanda set Security to None.

Joe added subscribers: mark, akosiaris, fgiunchedi and 3 others. Jan 4 2016, 7:05 PM

ArielGlenn subscribed. Jan 4 2016, 11:11 PM

jmadler subscribed. Jan 5 2016, 12:22 AM

Wikimedia Developer Summit 2016 ended two weeks ago. This task is still open. If the session in this task took place, please make sure 1) that the session Etherpad notes are linked from this task, 2) that followup tasks for any actions identified have been created and linked from this task, 3) to change the status of this task to "resolved". If this session did not take place, change the task status to "declined". If this task itself has become a well-defined action which is not finished yet, drag and drop this task into the "Work continues after Summit" column on the project workboard. Thank you for your help!

Hardikj subscribed. Jan 30 2016, 4:53 PM

• MZMcBride subscribed. Jan 30 2016, 5:12 PM

Etherpad notes from the meeting are here:

https://etherpad.wikimedia.org/p/WikiDev16-T122822

I will not close the ticket until I have had time to organize those notes properly.

Nemo_bis subscribed. Jan 31 2016, 6:18 PM

yuvipanda unsubscribed. Jun 30 2016, 2:01 PM