
Define the size of a pod for mediawiki in terms of resource usage
Open, Medium, Public

Description

We need to define the ideal resource requests and limits for a single pod running mediawiki. Specifically, we need to define limits for the following containers:

| container         | memory | cores |
|-------------------|--------|-------|
| httpd             |        |       |
| php               |        |       |
| mcrouter          |        |       |
| nutcracker        |        |       |
| mcrouter (dset)   |        |       |
| nutcracker (dset) |        |       |

I added two separate lines for both mcrouter and nutcracker to cover the two cases: running them as part of the pod, and running them as daemonsets (dset).

I have some basic numbers for the php image. Most of these are a function of the number of php workers we're going to run in the pod.

  • opcache doesn't depend on the size of the pod. We need to reserve ~400 MB of memory for opcache (and keep an eye on it)
  • APCu space. Currently an appserver uses ~1.5 GB of APCu and an api server uses ~400 MB of it. We might expect this to be a bit smaller for a smaller installation, but not as much as we'd like.
  • Each worker will need ~500 MB of memory available (more for parsoid servers)
  • d_f * 0.5 CPU per worker, where d_f is a damping factor that I would empirically set at 0.8
  • We always need to add 2 workers to serve the liveness (/status) probes

So we have a relatively simple pair of equations to play with:

CPU(n_workers) = d_f * (n_workers - 2) / 2
MEM(n_workers) = opcache + apcu + mem_limit * n_workers
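To make this concrete, here is a minimal sketch in Python of those two equations, using the illustrative numbers from the bullets above (400 MB opcache, ~1.5 GB APCu for an appserver, 500 MB per worker, d_f = 0.8); nothing here is a final value:

```python
# Rough pod sizing, following the two equations above. All numbers are
# the illustrative estimates quoted in this task, not final values.
OPCACHE_GB = 0.4       # ~400 MB reserved for opcache
APCU_GB = 1.5          # appserver APCu estimate (an api server: ~0.4)
MEM_LIMIT_GB = 0.5     # ~500 MB per worker (more for parsoid)
D_F = 0.8              # empirical damping factor
PROBE_WORKERS = 2      # extra workers kept free for the /status liveness probes


def pod_cpu(n_workers: int) -> float:
    """CPU(n_workers) = d_f * (n_workers - 2) / 2."""
    return D_F * (n_workers - PROBE_WORKERS) / 2


def pod_mem(n_workers: int) -> float:
    """MEM(n_workers) = opcache + apcu + mem_limit * n_workers."""
    return OPCACHE_GB + APCU_GB + MEM_LIMIT_GB * n_workers


for n in (5, 10, 15, 20):
    print(f"{n:2d} workers -> {pod_cpu(n):4.1f} CPUs, {pod_mem(n):4.1f} GB")
```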

The goal is to pack 4 or even 5 pods in a single modern node.

Event Timeline

Some data from one appserver:

  • httpd uses less than 1 GB of memory and 1 CPU. If we assume we'll reduce the number of workers, it should be safe to assume e.g. 600 MB and 0.6 CPUs are ok
  • mcrouter uses around 300 MB of memory. Again, this would be reduced if it runs inside the pod; ~200 MB should be safe. 1 CPU is enough for a whole-host mcrouter, so we can assume 0.5 CPUs should be enough
  • nutcracker currently uses 200 MB of memory + 0.1 CPUs
Joe updated the task description.

> The goal is to pack 4 or even 5 pods in a single modern node.

I recently created T277876, where I propose we reserve some of each node's resources for the system. So the allocatable resources of each node will drop slightly in the near future.

A typical appserver has 96 GB of memory and 48 cores. Let's assume we can use up to 85% of those with pods, which looks a bit conservative, but it's ok for our current calculations.

Assuming the numbers above are somewhat correct, we would need 1.2 CPUs and 1 GB of memory for the supporting services listed above (httpd, mcrouter, nutcracker). If we include the prometheus exporters, this goes up to 1.5 CPUs and 1.5 GB of memory.

So let's first make a couple of calculations for the appserver workload:

Mem = 1.5 + 0.66 * workers + 2 GB (apcu + opcache)  # see wmgMemoryLimit
CPU = 0.4 * workers + 1.5
| n_workers | cpu | mem (GB) | n_pods | tot_workers |
|-----------|-----|----------|--------|-------------|
| 5         | 3.5 | 6.8      | 11     | 55          |
| 10        | 5.5 | 10.1     | 7      | 70          |
| 15        | 7.5 | 13.5     | 5      | 75          |
| 20        | 9.5 | 16.7     | 4      | 80          |
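For reference, here is a small sketch that reproduces the table above (modulo rounding) by packing pods onto a node until either CPU or memory runs out, assuming the 48-core / 96 GB node and the 85% usable fraction mentioned earlier:

```python
# Reproduce the table above: per-pod cost and how many pods fit on a node.
# Node size (48 cores, 96 GB) and the 85% usable fraction are the
# assumptions stated earlier in this comment.
NODE_CPUS, NODE_MEM_GB, USABLE = 48, 96, 0.85


def pod_cpu(workers: int) -> float:
    # 0.4 CPU per worker + ~1.5 CPUs for httpd/mcrouter/nutcracker/exporters
    return 0.4 * workers + 1.5


def pod_mem(workers: int) -> float:
    # 1.5 GB for support containers + 0.66 GB per worker + 2 GB apcu/opcache
    return 1.5 + 0.66 * workers + 2


for w in (5, 10, 15, 20):
    pods = int(min(NODE_CPUS * USABLE / pod_cpu(w),
                   NODE_MEM_GB * USABLE / pod_mem(w)))
    print(f"{w:2d} workers/pod: {pod_cpu(w):4.1f} CPUs, {pod_mem(w):5.1f} GB"
          f" -> {pods} pods, {pods * w} workers per node")
```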

Numbers for an api workload are similar and don't need a separate discussion, as the only resource we save on is memory, and we're not constrained by it.

Please note: this means we would have fewer workers per node than we used to; on the other hand, the numbers can be tuned a lot just by moving some levers. Also, at smaller concurrencies php-fpm performs significantly better. We will need to fine-tune these numbers once we run real workloads.

Now for a parsoid workload:

Mem = 1.5 + 1.4 * workers + 2 GB (apcu + opcache)  # see wmgMemoryLimitParsoid
CPU = 0.4 * workers + 1.5

This would mean having 4 pods with 15 workers each per node.

At 15 workers per pod, we get 5 pods per node (6 if we only reserve 5% of RAM and CPU). That's more or less the maximum concurrency at which the sweet spot holds for php-fpm. It gets us either 75 or 90 workers per node, and I think it would be a net win. I will update the task once I have more realistic numbers.

Regarding reserving RAM for the node: after we complete T264604, we will have an estimate of how much memory we will need for onhost memcached. Right now we only use it for parsercache, so the current numbers are not useful. Lastly, given that the TTL is 10s, I do not expect any unreasonable requirements.

Change 674634 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] mediawiki: Include profile::prometheus::cadvisor_exporter

https://gerrit.wikimedia.org/r/674634

I've just uploaded the above change for review. The idea is to use our current setup to gauge more accurately, over a period of time, what our current usage patterns are.

This does not invalidate or alter the above assumptions and discussions, but I am hoping it will inform them better.

A quick output from one of the cache boxes where this has been live for quite some time now:

container_cpu_user_seconds_total{id="/system.slice/traffic-pool.service"} 0 1616600553810
container_cpu_user_seconds_total{id="/system.slice/trafficserver-tls.service"} 4.3010419e+06 1616600579566
container_cpu_user_seconds_total{id="/system.slice/trafficserver.service"} 2.60761874e+06 1616600578912
container_cpu_user_seconds_total{id="/system.slice/varnish-frontend-fetcherr.service"} 153922 1616600579876
container_cpu_user_seconds_total{id="/system.slice/varnish-frontend-hospital.service"} 66365.76 1616600578573
container_cpu_user_seconds_total{id="/system.slice/varnish-frontend-slowlog.service"} 386079.24 1616600578880
container_cpu_user_seconds_total{id="/system.slice/varnish-frontend.service"} 1.060081682e+07 1616600579064
container_cpu_user_seconds_total{id="/system.slice/varnishkafka-eventlogging.service"} 179009.44 1616600579172
container_cpu_user_seconds_total{id="/system.slice/varnishkafka-statsv.service"} 154894.36 1616600579567
container_cpu_user_seconds_total{id="/system.slice/varnishkafka-webrequest.service"} 863048.87 1616600579143
container_cpu_user_seconds_total{id="/system.slice/varnishmtail.service"} 423485.74 1616600579723

I still can't find the relevant dashboard.

Change 674634 merged by Alexandros Kosiaris:
[operations/puppet@production] mediawiki: Include profile::prometheus::cadvisor_exporter

https://gerrit.wikimedia.org/r/674634

Change 675113 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):
[operations/puppet@production] prometheus::cadvisor_exporter: Support only buster and later

https://gerrit.wikimedia.org/r/675113

Change 675113 merged by Alexandros Kosiaris:
[operations/puppet@production] prometheus::cadvisor_exporter: Support only buster and later

https://gerrit.wikimedia.org/r/675113

I've created

https://grafana.wikimedia.org/d/0VjCCwwGk/mediawiki-server-clusters-utilization?orgId=1

I don't much like how I named the dashboard, but I'd rather work on cache invalidation and off-by-one errors instead.

For memory consumption, I think that in a couple of weeks we will have a pretty good idea per component. I am still working a bit on the CPU part; it requires a few puppet changes to enable CPUAccounting=yes for those components, but it looks relatively easy and promising in some tests. Hopefully this data will allow us to make more informed decisions about the placement and sizing of components.

Change 675237 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):
[operations/puppet@production] mediawiki: Enable CPUAccounting for various components

https://gerrit.wikimedia.org/r/675237

I have applied the patch manually on mw2305 and wtp1032. I haven't seen any difference in the various dashboards for those hosts.

@jijiki will try it out on mwdebug1001 as well, and I'll enable it on mw1412 and mw1413 for about 24h.

The general rollout plan, probably for Thursday, is:

  • Disable puppet for the entire set of hosts affected (O:mediawiki::common)
  • Merge the change
  • Using cumin, one server at a time, do the following:
    • Depool
    • Enable puppet
    • Run puppet
    • Restart components
    • Pool

This essentially translates to the following:

sudo cumin 'O:mediawiki::common' 'disable-puppet "Slow rollout of T278220"'

sudo cumin -b 1 'O:mediawiki::common' 'depool ; enable-puppet "Slow rollout of T278220" ; puppet-run ; systemctl restart memcached.service php7.2-fpm envoyproxy.service mcrouter.service nutcracker.service apache2.service ; pool'

mw1412 and mw1413 have puppet disabled and the changes live, as of a few minutes ago.

At first glance it looks like mw1410 (api, same h/w as mw1412) performs slightly better at p50, but that was already happening before the change, so basically I don't see any difference compared to before. I think we can merge and have a go.

mw1410 vs mw1412


I think so too. I've worked up a new version of the patch that no longer needlessly touches all hosts that have mcrouter and memcached installed. PCC is at https://puppet-compiler.wmflabs.org/compiler1002/28843/; I think we are good to go.

Mentioned in SAL (#wikimedia-operations) [2021-03-31T13:34:05Z] <akosiaris> disabling puppet on role::mediawiki::appserver, role::mediawiki::appserver::api, role::mediawiki::maintenance, role::mediawiki::jobrunner, role::parsoid, role::parsoid::testing T278220

Mentioned in SAL (#wikimedia-operations) [2021-03-31T13:39:34Z] <akosiaris> revert mw1412, mw1413, wtp1032, mw2305 to the previous state for T278220

Change 675237 merged by Alexandros Kosiaris:

[operations/puppet@production] mediawiki: Enable CPUAccounting for various components

https://gerrit.wikimedia.org/r/675237

akosiaris triaged this task as Medium priority. Wed, Mar 31, 3:05 PM

Change merged and shepherded into production. https://grafana-rw.wikimedia.org/d/0VjCCwwGk/mediawiki-server-clusters-utilization?orgId=1 now also has CPU data. Something to note is that the dashboard shows mean and max values across instances, with the specifics of each host's hardware playing some role in this; e.g. for php-fpm we will have to divide by the number of workers to arrive at a sane number. I've added a "Suggestion row" where we can more easily do those calculations. But for now I think we need to let prometheus gather some data first.
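As an illustration of the per-worker normalisation mentioned above, a small sketch; the input numbers below are placeholders, not measurements from the dashboard:

```python
# Normalise cluster-wide php-fpm CPU usage to a per-worker figure, then
# scale it back up to a per-pod suggestion. The inputs are placeholders
# for whatever the dashboard reports, not actual measurements.
cluster_phpfpm_cores = 500.0   # placeholder: php-fpm CPU usage across the cluster
cluster_workers = 2500         # placeholder: total php-fpm workers across the cluster
workers_per_pod = 15           # pod size being evaluated

cores_per_worker = cluster_phpfpm_cores / cluster_workers
suggested_pod_cpu = cores_per_worker * workers_per_pod

print(f"{cores_per_worker:.2f} cores per worker -> "
      f"~{suggested_pod_cpu:.1f} CPUs for a {workers_per_pod}-worker pod")
```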

After some more changes down the road (thanks @jijiki for the hint about the php-fpm workers-per-node prometheus metric), the dashboard is ready to provide us with insights. Some quick observations (I am going to stick mostly to the appserver cluster):

  • nutcracker consumes almost nothing in all clusters except the appserver cluster, where it consumes 9.5 GB across eqiad. All that for sharding 2.5 GB of redis data. Still, a cheap component right now, but one where we can have some quick and easy savings if we get rid of it.
  • envoy is REALLY performant. It consumes some 16-17 cores across the eqiad appserver cluster, which amounts to a max of 0.32 cores per node. That's pretty nice.
  • mcrouter is the second most CPU-hungry component we have: some 40-41 cores for the eqiad appserver cluster. At the same time it's barely consuming 17 GB of RAM, so it's pretty nice.
  • apache seems to have a huge memory footprint (up to 13 GB per node) despite minimal CPU usage (something like 5 cores across the entire fleet). However, upon drilling in more, we discover that: 1) the RSS is never more than 100 MB, 2) the working set is on average only ~2% (with a max of 28%) of the total memory usage, 3) the pagecache is up to 95% of the total memory usage. That's probably explained by the fact that apache serves assets, which don't need anything more than being read and sent directly to the requesting client. My take is to treat that with a huge grain of salt and probably just settle on the maximum working set. That should avoid OOM as well as memory starvation, and allow enough pagecache to serve most needs.
  • The suggested numbers for resources.requests and resources.limits seem sane without needing too much tweaking (most are actually small enough to not matter), with the exception of php-fpm which required a dedicated adjusted per worker row to come up with numbers. Those specific numbers are preliminary and it's probable that they are quite dependent on the number of workers we 'll have per pod.