
Determine safe concurrent puppet run batches via cumin
Open, LowPublic

Description

There's currently FUD around what's a safe batch size to use with cumin when issuing puppet runs on many hosts. A fast puppet run is especially important in emergency situations, for example when distributing new policies/ACLs/etc. Said number must strike a balance between being high enough (i.e. fast propagation of puppet changes) and not making the puppet run fail (e.g. by overloading the puppet servers).

For context, in the past we've run into problems where too many concurrent puppet runs would overload the puppet servers; we've limited mod_passenger's concurrent workers to mitigate this problem. The exact effects of many concurrent puppet runs still need to be determined though (e.g. do agents queue waiting for their turn? does Apache/Passenger make the puppet run fail? etc.)

Event Timeline

Adding @jbond, who might have some insights about it.
My 2 cents: historically we've used 15 as a safe batch size that should not cause issues (see for example https://wikitech.wikimedia.org/wiki/Cumin#Run_Puppet_only_if_last_run_failed ). But that was a while ago, and I hope we can now revisit that number and maybe bump it a bit.
That said, some puppet catalogs are bigger than others (like the Icinga server's), so we should use a number conservative enough not to overload the puppet servers even when used on a cluster that is more demanding from the catalog-compilation point of view.

Another factor to keep in mind is that we now have three puppet masters each in eqiad/codfw, so requests are better balanced; pm1003, e.g., was only added during the Buster upgrade.

How about we simply do a test run at a time when there's no ongoing incident/maintenance? We could force a puppet run with 20 servers per batch, keep an eye on the load on the puppet masters and their logs, and then raise the number by 10 incrementally.
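A minimal sketch of that incremental test: generate the cumin invocation for each candidate batch size (start at 20, raise by 10), to be run by hand during a quiet window while watching puppetmaster load. The upper bound of 50 here is illustrative, not an agreed value; the host pattern and command match the fleet-wide runs discussed below.

```shell
# Build the list of cumin commands for each batch size to try, one per round.
# Printing rather than executing: each round should be run manually, with a
# pause in between to let puppetmaster load settle before comparing graphs.
cmds=""
for batch in 20 30 40 50; do
  cmds="${cmds}sudo cumin -b ${batch} '*' 'run-puppet-agent -q'
"
done
printf '%s' "$cmds"
```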

Good point re: more expensive catalogs! Also +1 on empirically testing batch sizes. The most common use case for wanting fast rollouts, in my mind, is the edge caches; so starting the testing with the cp* catalogs would be a good first test IMHO.

FYI, I ran puppet fleet-wide today using a batch size of 40 and there was no issue. Puppet master load rose from ~1.5 to ~4.0; you can see a small peak in Grafana. From this we should be able to go a bit higher.

akosiaris added a subscriber: akosiaris.

FYI, I ran puppet fleet-wide today using a batch size of 40 and there was no issue. Puppet master load rose from ~1.5 to ~4.0; you can see a small peak in Grafana. From this we should be able to go a bit higher.

Oh, that's nice. Couple of points/questions (some perhaps rhetorical):

  • How long did the run take? My reading of the graphs says ~40m (from 12:00 to 12:38), is that correct?
  • There are 3 different CPU/load/network peaks (all correlating with each other); I wonder why. Any ideas?
  • I see load1 actually reaching maximums of ~10, not 4.0.
  • puppetmaster1002 peaked somewhat shy of 45% for a couple of mins, the other 2 nodes even lower. That's nice, it means we got some more room. Perhaps we should also do some rebalancing of the weights for the backends?
  • puppetdb1002 interestingly also rose to 50% CPU during this. Pretty interesting, I did not expect that much.

@akosiaris thanks for digging into this a bit further, and apologies for not leaving more than a drive-by comment:

How long did the run take? My reading of the graphs says ~40m (from 12:00 to 12:38), is that correct?

This sounds about right to me, but unfortunately I don't have anything more precise still in my terminal.

There are 3 different CPU/load/network peaks (all correlating with each other); I wonder why. Any ideas?

Nothing very concrete, but: I did a run yesterday using -b 120 (which is definitely too high). The run started at about 10:52:16 and I cancelled it at 11:33:37. In the graphs here you can see similar peaks towards the beginning; after the second peak things look like they calm down, however at this point all puppet runs had in fact started to fail. Looking in the logs you see the occasional 400, and Icinga also alerted a couple of times on Apache being unavailable.

My working theory is that the servers get a big burst of requests, start compiling, and then serve files to the first batch; at some point the stress gets too much and Apache/Passenger starts to ask clients to back off. It seems that the agents at this point implement some type of backoff/retry algorithm and keep trying to fetch the files/catalogue, submit facts, etc. This ties up the connections but not necessarily CPU, as they are not compiling catalogues. In Icinga the errors look like either 0 resources received or unable to fetch File[/foo]. On the agent side one will notice agent runs starting to take much longer, or seemingly hanging indefinitely.
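The occasional 400s described above can be spotted in the puppetmaster Apache access logs. A toy illustration, using fabricated sample lines (the field position assumes the standard combined log format; on a real puppetmaster you'd point this at the actual access log, whose path may differ):

```shell
# Fabricated access-log excerpt for the demo; real logs live on the masters.
cat > /tmp/sample_access.log <<'EOF'
10.0.0.1 - - [20/Apr/2021:11:10:01 +0000] "POST /puppet/v3/catalog/cp1001 HTTP/1.1" 200 51234
10.0.0.2 - - [20/Apr/2021:11:10:03 +0000] "POST /puppet/v3/catalog/cp1002 HTTP/1.1" 400 312
10.0.0.3 - - [20/Apr/2021:11:10:04 +0000] "GET /puppet/v3/file_content/foo HTTP/1.1" 400 298
EOF
# Field 9 is the HTTP status code in the combined log format
bad=$(awk '$9 == 400' /tmp/sample_access.log | wc -l | tr -d ' ')
echo "${bad} requests returned 400"   # → 2 requests returned 400
```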

I see load1 actually reaching maximums of ~10 , not 4.0.

Yes, sorry, I was just monitoring by eye with top.

puppetmaster1002 peaked somewhat shy of 45% for a couple of mins, the other 2 nodes even lower. That's nice, it means we got some more room. Perhaps we should also do some rebalancing of the weights for the backends?

+1. I have also wondered if we should switch to using SRV records and remove the frontend/backend concept, having simply a CA function and compilers. I know we've had issues with this in the past, but it may be worth exploring to see what the current state is.

puppetdb1002 interestingly also rose to 50% CPU during this. Pretty interesting, I did not expect that much.

With a fairly constant 3-4% iowait. I'm not a Ganeti expert, but I have wondered if we would benefit from moving these to bare metal, or if there are some tuning parameters we should set for PostgreSQL guests.

@akosiaris thanks for digging into this a bit further, and apologies for not leaving more than a drive-by comment:

How long did the run take? My reading of the graphs says ~40m (from 12:00 to 12:38), is that correct?

This sounds about right to me, but unfortunately I don't have anything more precise still in my terminal.

There are 3 different CPU/load/network peaks (all correlating with each other); I wonder why. Any ideas?

Nothing very concrete, but: I did a run yesterday using -b 120 (which is definitely too high). The run started at about 10:52:16 and I cancelled it at 11:33:37. In the graphs here you can see similar peaks towards the beginning; after the second peak things look like they calm down, however at this point all puppet runs had in fact started to fail. Looking in the logs you see the occasional 400, and Icinga also alerted a couple of times on Apache being unavailable.

My working theory is that the servers get a big burst of requests, start compiling, and then serve files to the first batch; at some point the stress gets too much and Apache/Passenger starts to ask clients to back off. It seems that the agents at this point implement some type of backoff/retry algorithm and keep trying to fetch the files/catalogue, submit facts, etc. This ties up the connections but not necessarily CPU, as they are not compiling catalogues. In Icinga the errors look like either 0 resources received or unable to fetch File[/foo]. On the agent side one will notice agent runs starting to take much longer, or seemingly hanging indefinitely.

Hm, interesting. Not sure it's worth putting effort into figuring it out, I just noticed it and wondered.

I see load1 actually reaching maximums of ~10, not 4.0.

Yes, sorry, I was just monitoring by eye with top.

puppetmaster1002 peaked somewhat shy of 45% for a couple of mins, the other 2 nodes even lower. That's nice, it means we got some more room. Perhaps we should also do some rebalancing of the weights for the backends?

+1. I have also wondered if we should switch to using SRV records and remove the frontend/backend concept, having simply a CA function and compilers. I know we've had issues with this in the past, but it may be worth exploring to see what the current state is.

For the record, we did try that back in 2016. Granted, it's been quite a long time, but there was one big issue that made us (@Joe and me, that is) backtrack at full speed. The issue at hand was that for every file resource (so easily 1000+ for a node) the agent tried to resolve the DNS hostname of the resulting URL, without however caching the answer. The result was 1000+ DNS requests per agent run. While that was not very perceptible in the main DCs, it would end up exploding the catalog run time for agents in the PoPs (just multiply 1000 by, say, 40 ms as an average cross-DC latency to get an idea), as there was no caching resolver there and DNS requests had to be sent to the main DCs. With today's DNS infrastructure this should not be a problem anymore (also, the agent might be better implemented now and actually cache DNS responses for a bit).
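The back-of-envelope arithmetic above, using the figures from the comment (1000+ file resources, ~40 ms average cross-DC latency):

```shell
# Estimated DNS overhead per agent run with no response caching:
# one lookup per file resource, each paying the cross-DC round trip.
resources=1000
rtt_ms=40
total_ms=$(( resources * rtt_ms ))
echo "${total_ms} ms (~$(( total_ms / 1000 )) s) of DNS wait added per agent run"
# → 40000 ms (~40 s) of DNS wait added per agent run
```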

puppetdb1002 interestingly also rose to 50% CPU during this. Pretty interesting, I did not expect that much.

With a fairly constant 3-4% iowait. I'm not a Ganeti expert, but I have wondered if we would benefit from moving these to bare metal, or if there are some tuning parameters we should set for PostgreSQL guests.

It's currently on ganeti1012, and on SSDs. Disk-speed-wise, on the physical layer it's more or less as good as it gets. That being said, on Ganeti the disks are backed by DRBD, and in fact they use DRBD protocol C [1], which makes replication fully synchronous between nodes and does add latency. We could go bare metal, or switch to a non-DRBD setup (at the cost of losing the ability to recover from a catastrophic hardware-node failure), and see if it changes anything. I doubt it though. At 3-4% it isn't increasing CPU usage much; if anything it just adds latency. Tuning the database might be a more fruitful path.

[1] https://stackoverflow.com/questions/45998076/an-explanation-of-drbd-protocol-c

The result was 1000+ DNS requests per agent run.

Yes, I also hit the same issue in $JOB~1; I will try to find the relevant bugs and check on progress. As mentioned, the PoP caches will help here. Further, I have also drafted a change to start using systemd-resolved, which could also help (although it was a Friday-afternoon draft and needs much more testing and thought).

It's currently on ganeti1012, and on SSDs. Disk-speed-wise, on the physical layer it's more or less as good as it gets. That being said, on Ganeti the disks are backed by DRBD, and in fact they use DRBD protocol C [1], which makes replication fully synchronous between nodes and does add latency. We could go bare metal, or switch to a non-DRBD setup (at the cost of losing the ability to recover from a catastrophic hardware-node failure), and see if it changes anything. I doubt it though. At 3-4% it isn't increasing CPU usage much; if anything it just adds latency. Tuning the database might be a more fruitful path.

Ack, agreed; worth exploring other tuning options before going to bare metal.

How long did the run take? My reading of the graphs says ~40m (from 12:00 to 12:38), is that correct?

This sounds about right to me, but unfortunately I don't have anything more precise still in my terminal.

From cumin logs:

2021-04-20 12:01:50,741 [INFO 14083 cumin.cli.main] Cumin called by user 'jbond' with args: Namespace(backend=None, batch_size={'value': 40, 'ratio': None}, batch_sleep=None, commands=['run-puppet-agent -q'], config='/etc/cumin/config.yaml', debug=False, dry_run=False, force=False, global_timeout=None, hosts='*', ignore_exit_codes=False, interactive=False, mode='sync', output=None, success_percentage=95, timeout=None, trace=False, transport=None)
[...SNIP...]
2021-04-20 12:32:57,570 [WARNING 14083 cumin.transports.clustershell.SyncEventHandler._success_nodes_report] 99.6% (1700/1706) success ratio (>= 95.0% threshold) [...SNIP...]

So it took ~31m.
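For precision, the duration can be recomputed from the two timestamps in the log excerpt (a quick sketch assuming GNU date for the -d flag):

```shell
# Difference between the cumin start and success-report timestamps above
start='2021-04-20 12:01:50'
end='2021-04-20 12:32:57'
secs=$(( $(date -d "$end" +%s) - $(date -d "$start" +%s) ))
printf '%dm%02ds\n' $(( secs / 60 )) $(( secs % 60 ))   # → 31m07s
```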