
Homer: add parallelization support
Open, High, Public

Description

It would be useful to be able to run commands in parallel on network devices.
At least for diffs and for show (in T250413).

And maybe later down the road for commits.

Event Timeline

ayounsi created this task.

I'm wondering if we could look at prioritizing this work. With new network devices arriving in codfw, we're reaching the limit of configuring network devices serially, one after the other.

FYI, this limitation is becoming more and more problematic for deploying a change to the whole infra.

ayounsi raised the priority of this task from Low to High. Apr 2 2024, 1:07 PM

I had a quick look to understand our options in terms of parallelization, keeping in mind the usual 3 possible approaches: multi-process, multi-thread, async.

async

The ncclient library seems to support "async" RPC calls in a very simple and non-pythonic way, basically just returning immediately and letting the client implement a polling logic to check when the answer is there.

But the py-junos-eznc library doesn't seem to support it. In their 1.0 release they clearly stated:

* Command execution is synchronous and blocking.  The underlying NETCONF transport library is the ncclient module.  If your application requires asynchronous or nonblocking execution logic, you should investigate other libraries to wrap around the PyEZ framework such as Twisted or Python Threads.

They are now at v2.7.2 but there has been no mention of async since then, so it's unlikely that it is supported. Even if there were a way, we can't just inject the async_mode parameter into the underlying ncclient, because the Junos library would need to handle the replies differently.

It's probably worth another look, just to be sure.

Of course we could make some async stuff in homer to wrap the juniper calls and "make them look like they are async".
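As a rough illustration of "making them look like they are async", here is a minimal sketch that wraps a blocking call with asyncio.to_thread(). The blocking_diff() function and the hostnames are hypothetical stand-ins, not homer or PyEZ code, and the concurrency limit is an arbitrary assumption.

```python
import asyncio
import time


def blocking_diff(hostname: str) -> str:
    """Hypothetical stand-in for a synchronous, blocking device call."""
    time.sleep(0.1)  # simulate network latency
    return f"diff for {hostname}"


async def diff_all(hostnames: list[str], limit: int = 10) -> dict[str, str]:
    """Run the blocking call for every host, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def one(hostname: str) -> tuple[str, str]:
        async with semaphore:
            # to_thread() runs the blocking function in a worker thread
            # and lets the event loop await its result.
            return hostname, await asyncio.to_thread(blocking_diff, hostname)

    # gather() returns results in the same order as the inputs.
    return dict(await asyncio.gather(*(one(h) for h in hostnames)))


results = asyncio.run(diff_all(["cr1-example", "cr2-example", "asw1-example"]))
```

Note that this is still threads underneath (asyncio.to_thread requires Python 3.9+), so any thread-safety concern about the underlying libraries applies here too.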

multi-thread

We should check whether the various underlying libraries are advertised as thread-safe and see how feasible it would be. If chosen, we could use the higher-level concurrent.futures framework [1] to ease the work.
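A minimal sketch of what the thread-based approach could look like with concurrent.futures, assuming the per-device work turns out to be thread-safe. The run_on_device() function, the hostnames and the worker count are hypothetical placeholders; the point is only to show per-device error collection so that one unreachable device does not abort the whole run.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_on_device(hostname: str) -> str:
    """Hypothetical per-device action; raises to simulate an unreachable device."""
    if hostname.startswith("down"):
        raise ConnectionError(f"cannot reach {hostname}")
    return f"ok: {hostname}"


def run_parallel(hostnames, max_workers=5):
    """Run the action on all devices and collect successes and failures."""
    successes, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(run_on_device, h): h for h in hostnames}
        # as_completed() yields each future as soon as it finishes.
        for future in as_completed(futures):
            hostname = futures[future]
            try:
                successes[hostname] = future.result()
            except Exception as exc:  # keep going, report the failure later
                failures[hostname] = exc
    return successes, failures


successes, failures = run_parallel(["cr1-example", "down-example", "cr2-example"])
```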

multi-process

At first I would discard this option for the overhead of the communication between the processes and the multiple python
If chosen, we could use the higher-level concurrent.futures framework [1] to ease the work.

[1] https://docs.python.org/3/library/concurrent.futures.html

One observation is that the config generation could be parallelized separately from the router transport.

i.e. once the globbing on hostnames is done, spawn separate threads to build all the conf files, then push them out to the devices as we do currently.

Obviously not the end-game state but would probably improve things significantly.
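A sketch of that two-phase idea, under the assumption that generation and push can be cleanly separated. generate_config() and push_config() are hypothetical placeholders, not homer's actual functions.

```python
from concurrent.futures import ThreadPoolExecutor


def generate_config(hostname: str) -> str:
    """Hypothetical config builder (templates, data lookups, etc.)."""
    return f"set system host-name {hostname}\n"


def push_config(hostname: str, config: str, pushed: list) -> None:
    """Hypothetical transport step; here it only records the call order."""
    pushed.append(hostname)


def run(hostnames):
    # Phase 1: build every config in parallel. map() preserves input order.
    with ThreadPoolExecutor(max_workers=8) as executor:
        configs = dict(zip(hostnames, executor.map(generate_config, hostnames)))

    # Phase 2: push to the devices one at a time, as is done currently.
    pushed = []
    for hostname in hostnames:
        push_config(hostname, configs[hostname], pushed)
    return configs, pushed


configs, pushed = run(["cr1-example", "cr2-example", "asw1-example"])
```

If the generation step turns out to be CPU-bound rather than I/O-bound, threads may give little speedup because of the GIL; ProcessPoolExecutor exposes the same interface, so the choice could be revisited without restructuring the code.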

Sorry for not mentioning it: the parallelization of the configuration generation was implicit to me, and it is also the easier part. Ideally we should parallelize both, and hence find a common parallelization approach for both if possible, to avoid having N different ways of parallelizing things in the same tool ;)

I had a chat with Riccardo about a possible first change (a sort of version-0 of the final solution) that could help with one of the use cases mentioned, would (hopefully) be simple to implement, and would give some relief for tedious tasks. IIUC, at the moment if an admin needs to add a config to multiple devices (if not all, like adding a new user/ssh-key), they need to input "Y" 90-ish times, once for each device, even if the diff is the same. One idea could be to ask for Y the first time and "cache" the (diff/Yes) combination to avoid re-asking if the diff is the same on the next device. Ideally we could collect all the diffs at the beginning, asking the admin to input Y only a few times in a row and then forget about it until homer finishes, but that is something more complicated that we could do for version 1 later on.

Possible issue:

  • Admin A starts homer for a long task to run, say add a new user to all Devices
  • Admin B does something on Device X, that is in 50th position of Admin A's list (maybe a quick fix for an issue, a test, etc.)
  • By the time that Admin A's homer run reaches Device X, its config could be different and we'd override it.

To avoid this consistency problem we could still compute the diff on every device, and apply the "cached" approval only if that device's diff matches an approved one, i.e. it doesn't report anything unexpected.

Lemme know if you like the version-0 idea, if so I can start working on a patch :)

It's necessary to do the diff on all target devices anyway, so that behavior is fine.

For example, if we run homer "*ulsfo*" commit "foo" to change an SSH key:

It will iterate over all the devices. The first device will display a diff, the SRE types "yes" to commit, and that diff is then saved. If the diff is the same for any other device, it will be committed automatically. If it's different, homer will ask (yes/no) whether it should be committed and cached for any other device with the same diff.
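A minimal sketch of that "remember approved diffs" loop, to make the flow concrete. It assumes the diff for each device has already been computed at the time the device is reached; commit_all(), diff_key() and the ask callback are hypothetical names, and hashing the diff text is just one possible way to compare diffs.

```python
import hashlib


def diff_key(diff: str) -> str:
    """Stable key for a diff, so equal diffs map to the same cache entry."""
    return hashlib.sha256(diff.encode("utf-8")).hexdigest()


def commit_all(device_diffs, ask):
    """Commit each device, prompting only for diffs not yet approved.

    device_diffs: mapping of hostname -> diff text for that device.
    ask: callable(hostname, diff) -> bool, the interactive yes/no prompt.
    """
    approved = set()
    committed, skipped, prompts = [], [], 0
    for hostname, diff in device_diffs.items():
        key = diff_key(diff)
        if key not in approved:
            prompts += 1
            if not ask(hostname, diff):
                skipped.append(hostname)  # a "no" is not cached
                continue
            approved.add(key)
        committed.append(hostname)  # the real commit would happen here
    return committed, skipped, prompts


diffs = {
    "cr1-example": "+ ssh-key new",
    "cr2-example": "+ ssh-key new",
    "cr3-example": "+ ssh-key new\n- local change",
}
committed, skipped, prompts = commit_all(diffs, ask=lambda host, diff: True)
```

In this example the operator is prompted twice for three devices: once for the shared diff and once for the device with a local change.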

In the implementation, I can see 3 options:

  • Same UI as currently (regular commit yes/no question), main downside is that there is no granularity: it's all or nothing for a given diff, with the risk of pushing a change to a device we don't want
  • Additional --batch command line parameter, downside is an additional parameter, upside is the possibility to run homer in a device-by-device mode by default
  • yes/no/batch question for the commit, upside: best of both worlds

We can also decide that batch means to silently skip any device that has a different diff, to not risk blocking the run in the middle of it if a device has local changes.

I like the last proposal but I was thinking that there is an additional case:

  1. apply to this device and ask for the next one unless already cached and approved
  2. skip this device and ask for the next one unless already cached and approved
  3. apply to this device and to all with the same diff, ask if the diff is different. If this answer is picked multiple times it caches multiple diffs that are "approved"
  4. apply to this device and to all with the same diff, skip if the diff is different, this can be offered and selected only if (3) has never been picked

While 1 and 2 are clearly yes and no, I'm not sure about the naming of the remaining two, batch-skip and batch-ask don't seem good choices ;)
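To make the interaction between the four answers easier to reason about, here is a small sketch of the state they imply. The names ALL_ASK and ALL_SKIP are placeholders for (3) and (4) since the naming is undecided, and ApprovalState and the ask callback are hypothetical, not existing homer code.

```python
from enum import Enum


class Answer(Enum):
    YES = "yes"            # (1) apply to this device only
    NO = "no"              # (2) skip this device only
    ALL_ASK = "all-ask"    # (3) apply to all with this diff, ask on others
    ALL_SKIP = "all-skip"  # (4) apply to all with this diff, skip others


class ApprovalState:
    def __init__(self):
        self.approved = set()        # diffs approved via (3) or (4)
        self.all_ask_picked = False  # True once (3) has been picked
        self.skip_unknown = False    # True once (4) has been picked

    def choices(self):
        """Answers that may be offered; (4) only if (3) was never picked."""
        options = [Answer.YES, Answer.NO, Answer.ALL_ASK]
        if not self.all_ask_picked:
            options.append(Answer.ALL_SKIP)
        return options

    def decide(self, diff, ask):
        """Return True if the device with this diff should be committed."""
        if diff in self.approved:
            return True   # already cached and approved: no prompt
        if self.skip_unknown:
            return False  # (4) was picked: silently skip different diffs
        options = self.choices()
        answer = ask(diff, options)
        if answer not in options:
            raise ValueError(f"answer {answer} is not allowed here")
        if answer in (Answer.ALL_ASK, Answer.ALL_SKIP):
            self.approved.add(diff)
            self.all_ask_picked = self.all_ask_picked or answer is Answer.ALL_ASK
            self.skip_unknown = answer is Answer.ALL_SKIP
        return answer is not Answer.NO


# Scripted answers stand in for the interactive prompt.
scripted = iter([Answer.ALL_ASK, Answer.NO, Answer.ALL_ASK])
state = ApprovalState()
results = [
    state.decide(diff, lambda d, options: next(scripted))
    for diff in ["A", "A", "B", "A", "C"]
]
```

Here only three prompts are shown for five devices: diff "A" is approved once and reused, "B" is declined for that device only, and "C" is approved as a second cached diff.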

Yeah, I think that's what I tried to say with:

We can also decide that batch means to silently skip any device that has a different diff, to not risk blocking the run in the middle of it if a device has local changes.

Basically decide if the batch behavior is (3) or (4) and then stick to it. 4 options seem a bit too many.
I tend to prefer (3), and would be ok to not support (4), especially as in a good state there should be no local changes.

For my part I like “3” as set out by Volans above.

@ayounsi is your proposal that “batch” would be a valid answer (in addition to yes/no) when presented with a diff, indicating “yes, and yes to any others the same”? Interesting idea, could work well.