
Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes
Open, Medium, Public

Description

What?
Memcached is a crucial component of our MediaWiki installation. Right now MediaWiki interacts (via mcrouter) with two memcached tiers:

  • Onhost: an instance running within the same application server, where specific keys are stored with a max TTL of 10s
  • Main memcached cluster: a set of 18 servers per DC, across which mcrouter shards all keys.
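
The two tiers above map naturally onto mcrouter's pool/route configuration. A minimal sketch of such a setup (server IPs, ports, and pool names are illustrative, not our actual production config):

```json
{
  "pools": {
    "onhost": { "servers": ["127.0.0.1:11210"] },
    "main": {
      "servers": ["10.0.0.1:11211", "10.0.0.2:11211"]
    }
  },
  "route": {
    "type": "OperationSelectorRoute",
    "default_policy": "PoolRoute|main",
    "operation_policies": {
      "get": {
        "type": "WarmUpRoute",
        "cold": "PoolRoute|onhost",
        "warm": "PoolRoute|main",
        "exptime": 10
      }
    }
  }
}
```

Here WarmUpRoute populates the onhost tier from the main cluster with a 10s expiry, matching the short TTL described above.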

Nutcracker is a component we would like to relieve of its duties. Nutcracker shards keys to a Redis cluster that resides within the memcached servers. Whatever solution we choose for mcrouter makes sense to use for nutcracker too, if nutcracker is still around. T277183

Onhost memcached
It makes sense to have one onhost memcached instance running in each node where MediaWiki pods will run:

  • either outside k8s, as a service running on a node
  • as a daemonset
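
As a sketch, the daemonset variant could look roughly like this (names, image, and resource numbers are placeholders, not our actual chart):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: memcached-onhost
spec:
  selector:
    matchLabels:
      app: memcached-onhost
  template:
    metadata:
      labels:
        app: memcached-onhost
    spec:
      containers:
      - name: memcached
        image: docker-registry.example/memcached:1.6   # placeholder image
        args: ["-m", "512"]            # cache size in MB, illustrative
        ports:
        - containerPort: 11211
          hostPort: 11211              # reachable from pods via the node IP
```

The hostPort makes the instance reachable from every pod on the node at the node's IP, which is what raises the IP-injection question discussed further down.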

Where should mcrouter reside?

1) Run mcrouter as part of the mediawiki pod (via a TCP port or a UNIX socket)

  • Pros: rollout will be similar to MediaWiki
  • Cons: harder to test changes in production (e.g. push a change to a single node and monitor); the number of connections towards the memcached main cluster will multiply
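
For illustration, option 1 amounts to an extra container in the pod template, with MediaWiki talking to it over localhost or a shared socket (a fragment sketch; container names, images, and the port are made up):

```yaml
# Fragment of a pod template spec; all names here are illustrative.
containers:
- name: mediawiki
  image: docker-registry.example/mediawiki:latest
  env:
  - name: MCROUTER_SERVER
    value: "127.0.0.1:11213"   # or a UNIX socket on a shared emptyDir
- name: mcrouter
  image: docker-registry.example/mcrouter:latest
  args: ["-p", "11213", "-f", "/etc/mcrouter/config.json"]
```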

2) Run mcrouter as daemonset

  • Pros: easy rollout
  • Cons: some unavailability might happen during rollout; if the daemonset fails, all pods on the node are left without access to the memcached main cluster

3) Run mcrouter outside of kubernetes, on the node

  • Pros: reduces complexity, we already have everything set up in puppet, easy rollout, easy to test changes in production, easy to control changes (e.g. by using feature flags in puppet)
  • Cons: if the service fails, all pods on the node are left without access to the memcached main cluster (almost unusable)

Event Timeline

jijiki triaged this task as Medium priority.Mar 17 2021, 9:59 PM
jijiki added projects: SRE, serviceops.

As far as mcrouter goes, the only non-brittle solution is to run it inside the pod, so solution 1. The reason is simple: restarting mcrouter, or mcrouter crashing, on the node or in a daemonset would make it unavailable to all the pods on the node, without MediaWiki *or* kubernetes noticing.

If we keep it in the pod, we have a few advantages:

  1. We reduce the amount of requests a single mcrouter instance has to deal with. We've seen that during peak traffic mcrouter's latencies can increase, as we're stretching its ability to scale as a single instance
  2. Zero-downtime config changes (they become k8s deployments)
  3. Ensuring failures are as constrained as possible.

> we reduce the amount of requests a single mcrouter instance has to deal with. We've seen that during peak traffic latencies of mcrouter can increase as we're stretching its ability to scale as a single instance

I would like to revisit this: given we are running 0.41 now, I could have a quick look at whether this still stands, because that would indeed be a problem.

> Zero-downtime config changes (they become k8s deployments)

Mcrouter already has zero-downtime config changes, as it watches its config file and reloads it. I might be wrong here, but specifically for config changes, it might take longer in k8s.

> Ensuring failures are as constrained as possible.

In general, mcrouter is one of the most stable pieces in our stack; I do not recall mcrouter failing us, apart from a bad config or our beloved TKOs (which are a feature). Unless we run into some odd bug, I have faith it will not fail us. Maybe we can include, in the php-fpm container's readiness probe, a check that a connection to mcrouter is established, so that in case of an mcrouter failure we limit the consequences. I don't know if that would be an anti-pattern, but maybe we should consider it in general.

Additionally, by running multiple mcrouter instances within a single node, we multiply the number of connections towards the local memcached cluster and the remote one (assuming T271967 is successful). Mcrouter supports connection pooling, so running on the host would reduce the number of connections. This is not necessarily problematic, but worth pointing out.

If we choose to run it on the node, we have a more controllable way to test config or version changes (e.g. roll it out to 1-2 nodes and see how it goes, via puppet feature flags), which is my main concern if we go with options 1 or 2.

I don't really like option 3, just because it moves parts of the software stack to the node itself, and I would personally like nodes to be as dumb as possible, ideally just running kubernetes components and docker. This might be a bit opinionated, but IMHO it makes dealing with the nodes easier, as one can be sure that all the actual workload on the node is "visible" via the kubernetes API and there is nothing "hidden" that one might need to take care of when dealing with nodes.

> I don't really like option 3 just because it moves parts of the software stack to the node itself and I would personally like them to be as dumb as possible […]

This is understood, but we need to come up with a reasonable way to gradually roll out changes to mcrouter when needed, be that a newer version or a configuration change.

>> I don't really like option 3 just because it moves parts of the software stack to the node itself […]
>
> This is understood, but we need to come up with a reasonable way to gradually roll out changes to mcrouter when needed […]

That is already done in the MediaWiki chart.

>>> I don't really like option 3 just because it moves parts of the software stack to the node itself […]
>>
>> This is understood, but we need to come up with a reasonable way to gradually roll out changes to mcrouter when needed […]
>
> That is already done in the MediaWiki chart.

Looking at https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/670220, I am not sure I understand (it could be due to my poor kubernetes knowledge); can you give me an example of how we will test an mcrouter version/config change on a limited number of pods, to ensure it works as expected?

> That is already done in the MediaWiki chart.

But that does now deploy mcrouter as a sidecar in each MW pod. AIUI this might come with additional cons, like more connections to memcached, less use of pooled connections, etc. Or did I get that wrong? Should we evaluate that?

>> That is already done in the MediaWiki chart.
>
> But that does now deploy mcrouter as a sidecar in each MW pod. AIUI this might come with additional cons like more connections to memcached, less use of pooled connections etc. […]

More connections to the memcached hosts is true, but we can fine-tune that when we need to. Less use of pooled connections is, OTOH, not true, and not an issue.

> Looking at https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/670220, I am not sure I can understand […] can you give me an example of how we will test a mcrouter version/config change on a limited number of pods to ensure it works as expected?

By changing the values.yaml file for a canary release, and/or changing the docker image version.

Given it has created some doubts, let me clarify: I've created a first version of the charts that implements solution 1 (and not a complete version of it, either).

I did so not to ignore the ongoing discussion here; I just needed a functional version of the chart that we can start working from. The work I did there (a couple of days' worth) will not be thrown away unless we pick option 3.

Trying to break down my current thoughts:

Onhost memcached

In terms of functionality, I don't see a difference between running as a DaemonSet and running on the host itself. I would opt for running it inside kubernetes, for the reasons outlined by @JMeybohm previously. It's still an open question how we will inject the node IP into the mcrouter configuration: we'd need to pass the host IP as an env variable to the mcrouter container and somehow inject it into the configuration. The downside is that we wouldn't be able to make use of mcrouter's ability to reload its configuration at runtime.
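
One way to pass the node IP into the pod, assuming a hostPort daemonset for onhost memcached, is the Kubernetes downward API (a sketch; the variable name and the entrypoint script are assumptions, not existing code):

```yaml
# In the mcrouter container spec: expose the node's IP as an env variable.
env:
- name: HOST_IP
  valueFrom:
    fieldRef:
      fieldPath: status.hostIP
# A hypothetical entrypoint script would then substitute HOST_IP into the
# mcrouter config file before starting mcrouter; since the file is rendered
# once at startup, this is what breaks runtime config reloads.
```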

I don't really see alternatives right now, as adding the onhost memcached to the pod would require even more memory on top of what we already need (see T278220) and make smaller pods (which we like) impractical.

Mcrouter

I would like to focus on the main question, which is whether mcrouter should be inside the pod or not. I see the following reasons for answering "yes":

  • it will be part of the pod, meaning that it will be part of the functional unit of work
  • it will be serving a smaller number of requests, making its performance an afterthought

And the reasons for the "no" are:

  • We don't need between 5 and 7 mcrouters running on each node.
  • We could do without the overhead of 0.5 cores and 300 MB of RAM, multiplied by the number of pods per node.

I also agree with @jijiki that mcrouter hardly ever crashes. All that considered, I think running it as a daemonset with a hostPort is probably the best solution for us right now. We might also consider changing the architecture a bit and running mcrouter on the memcached hosts, as Facebook does, as far as I understand.

Nutcracker

Nutcracker will be eliminated one way or the other, but its resource consumption is so small that I don't think keeping it in the pod creates a lot of issues. I would consider it a non-issue.

onhost memcached

> It's still an open question how we will inject the node IP into the mcrouter configuration. it would mean we'd need to pass the host IP as an env variable to the mcrouter container and somehow inject it into the configuration. The downside is we wouldn't be able to make use of mcrouter's ability to reload its configuration at runtime.

That's a bummer. As you know, I have quite limited experience with mcrouter and its config. Is there maybe a way to interpolate env variables, or maybe to leverage the downward API to write a (unfortunately not JSON) file to import into the mcrouter config?
In the past I've worked around a similar limitation by having an "inotify sidecar" that loads the updated config, interpolates some variables in it, and then writes a valid-for-the-service version of the config into a shared emptyDir. That could maybe work here as well.
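
The inotify-sidecar workaround could be wired up roughly like this (a pod-spec fragment; container names, images, and paths are made up for illustration):

```yaml
# Both containers share an emptyDir; the sidecar watches the raw config,
# interpolates variables, and writes the rendered config for mcrouter.
volumes:
- name: mcrouter-config
  emptyDir: {}
containers:
- name: config-renderer          # hypothetical inotify sidecar
  image: docker-registry.example/config-renderer:latest
  volumeMounts:
  - name: mcrouter-config
    mountPath: /run/mcrouter
- name: mcrouter
  image: docker-registry.example/mcrouter:latest
  args: ["-f", "/run/mcrouter/config.json"]
  volumeMounts:
  - name: mcrouter-config
    mountPath: /run/mcrouter
```

Since mcrouter watches its config file, rewriting the rendered file in the shared volume would preserve its runtime-reload behaviour.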

mcrouter

With mcrouter in the pod, it is very clear how different mcrouter versions or configs would be tested, as that will just be another mediawiki release. For mcrouter as a daemonset, it is not very clear to me how that would be done.

(sorry for not quoting)

As far as connectivity goes, we can run both mcrouter and onhost memcached on a UNIX socket, if that is of any help. Generally speaking, we have to decide whether we want to solve this soon (next 1-3 quarters), or choose a temporary solution that will take us to at least the first release of MW on k8s. Since part of our plan is to run both clusters (mw-on-k8s + our current infra) for a period of time:

  • Option 1: maybe it would make sense to have both mcrouter and memcached running as normal services, since this will help us keep control of this moving part in both installations in the same way.
  • Option 2: if we will need time to solve the challenges presented by running them as a daemonset, we can consider keeping mcrouter and memcached in the pod, and take the resource-wasting hit.

When I talk about "changes", apart from upgrading to a new version, what we will most likely need is to add/remove hosts from the pool, provided that 1) a couple of ongoing memcached/mcrouter-related projects have been completed and 2) we have refreshed both memcached clusters.

We can consider either solution as temporary, with the goal to eventually run them both as daemonsets.