
Cumin: add --limit to randomly select N hosts
Closed, DeclinedPublic

Description

From Cumin's TODO:
CLI: add --limit to randomly select N hosts within a broader selection.

The scenario this aims to cover is that you have a large number of hosts from different groups and want to test a command on, or gather some information from, a subset of them, while keeping that subset as representative as possible. A sort of automatic canarying.

The proposed solution, given the generic nature of Cumin and that the only information it has is the hostnames, is to use ClusterShell's NodeSet compact syntax to split them into "groups".
Then two different cases can occur:

  • the limit is greater than the number of distinct groups: in this case a random subset of hosts is selected from each group, with its size based on the relative weight of len(group) compared to the total number of hosts, while guaranteeing that at least 1 host is picked from each group.
  • the limit is smaller than the number of distinct groups: in this case one random host is selected from each of the N largest groups, with N == limit.

There is also some logic to adjust for rounding offsets; I can add more details on that if needed.
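The two cases above can be sketched roughly as follows. This is not the code from the patch: `group_hosts` and `select_limit` are hypothetical names, the grouping masks digit runs with a regex as a stdlib stand-in for NodeSet folding, and the rounding adjustment (take from the largest quota, give to the group with the most unselected hosts) is an assumption, since the actual logic is not described here.

```python
import random
import re
from collections import defaultdict


def group_hosts(hosts):
    """Group hostnames by their name with every digit run masked.

    Stand-in for grouping via ClusterShell's NodeSet folding (assumption).
    Groups are returned largest first.
    """
    groups = defaultdict(list)
    for host in hosts:
        groups[re.sub(r"\d+", "#", host)].append(host)
    return sorted(groups.values(), key=len, reverse=True)


def select_limit(hosts, limit, rng=random):
    """Pick `limit` hosts spread across the hostname groups."""
    hosts = sorted(set(hosts))
    if limit <= 0:
        return []
    if limit >= len(hosts):
        return hosts

    groups = group_hosts(hosts)
    if limit <= len(groups):
        # Not enough slots for every group: one random host from each of
        # the `limit` largest groups.
        return sorted(rng.choice(group) for group in groups[:limit])

    # Proportional share per group, with at least 1 host from each group.
    quotas = [max(1, len(group) * limit // len(hosts)) for group in groups]

    # Rounding adjustment (assumed strategy, see the lead-in).
    while sum(quotas) > limit:
        quotas[quotas.index(max(quotas))] -= 1
    while sum(quotas) < limit:
        spare = [len(group) - quota for group, quota in zip(groups, quotas)]
        quotas[spare.index(max(spare))] += 1

    selected = []
    for group, quota in zip(groups, quotas):
        selected.extend(rng.sample(group, quota))
    return sorted(selected)
```

For example, `select_limit(hosts, 6)` on a selection made of ten mw hosts, four db hosts, two cp hosts and einsteinium would return six hosts with every group represented, while `select_limit(hosts, 2)` would return one mw host and one db host.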

I know that in our specific use case there are two main issues:

  • we have hosts that are part of the same hostname group but have different roles (mw for api/app/videoscaler)
  • we have hosts that are part of the same group but have different hostnames (einsteinium/tegmen for icinga)

In those cases Cumin cannot help, and if a deterministic subset is needed, it can be selected programmatically via the library.
Another approach that was evaluated but discarded was to add something to the global query grammar to specify that a query has a limit, like P{O:foo}#3 or A:bar#3 (or any other syntax). I found this approach inelegant for the query syntax, and even uglier when combining queries, where it would become something like (A:foo and A:bar)#3.

Event Timeline

Change 409980 had a related patch set uploaded (by Volans; owner: Volans):
[operations/software/cumin@master] Backend: allow to extract random subset of hosts

https://gerrit.wikimedia.org/r/409980

Volans triaged this task as Medium priority. (Feb 14 2018, 11:52 AM)
Volans updated the task description.
Volans added a subscriber: Joe.

The proposed solution, given the generic nature of Cumin and that the only information it has is the hostnames, is to use ClusterShell's NodeSet compact syntax to split them into "groups".

Given cumin's nature as software for large operations, where we can assume servers are pets and not cattle, I would assume that if we want to try to magically guess how to group hosts (which is problematic), we could do something like measuring the Levenshtein distance between hostnames to group them.
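As a rough illustration of what grouping by Levenshtein distance could look like, here is a minimal sketch. Nothing like this exists in Cumin: `levenshtein` and `group_by_distance` are hypothetical helpers, the greedy strategy compares each host only against the first member of each existing group, and the `max_distance=1` threshold is an arbitrary assumption.

```python
def levenshtein(a, b):
    """Edit distance between two strings (classic dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        cur = [i]
        for j, char_b in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                        # deletion
                cur[j - 1] + 1,                     # insertion
                prev[j - 1] + (char_a != char_b),   # substitution
            ))
        prev = cur
    return prev[-1]


def group_by_distance(hosts, max_distance=1):
    """Greedily cluster hostnames close to the first member of a group."""
    groups = []
    for host in sorted(hosts):
        for group in groups:
            if levenshtein(host, group[0]) <= max_distance:
                group.append(host)
                break
        else:
            # No existing group is close enough: start a new one.
            groups.append([host])
    return groups
```

The threshold is fragile with short prefixes: mw1001 and db1001 are only 2 edits apart, and so are mw1001 and mw1010, so a threshold of 2 would merge unrelated hosts while a threshold of 1 splits related ones. Hosts with unrelated names but the same role, such as einsteinium and tegmen, would still end up in separate groups.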

I still find this very suboptimal as a feature since, as a user, I'd expect to be able to do the following:

  • pick some value on my query backend to use as a grouping variable (if the backend allows it)
  • fall back to a sensible algorithm that groups hosts by similar hostname if I do not specify any grouping via the backend
  • save the list of hosts where I ran the command as a canary and feed it to the next run of the cumin command to exclude those hosts explicitly
  • specify the percentage of hosts in each group on which to run the commands, rather than a total number
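The last two points could look something like the sketch below. It is purely illustrative: `select_percentage` is a hypothetical helper, not a Cumin API; it assumes the groups are already computed, that the percentage applies to the hosts left in each group after exclusion, and that it is rounded up so every non-empty group contributes at least one host.

```python
import math
import random


def select_percentage(groups, percent, exclude=(), rng=random):
    """Pick `percent`% of each group, skipping previously used hosts."""
    excluded = set(exclude)
    selected = []
    for group in groups:
        candidates = [host for host in group if host not in excluded]
        if not candidates:
            continue  # every host of this group was already used
        # Round up, keep at least 1 and never more than what is available.
        count = math.ceil(len(candidates) * percent / 100)
        count = min(len(candidates), max(1, count))
        selected.extend(rng.sample(candidates, count))
    return sorted(selected)
```

A first canary run would call `select_percentage(groups, 20)` and save the result; the next run would pass that saved list as `exclude` so that the same hosts are not picked again.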

So while I like the feature idea in general, I think we would need to put more work into it to make it usable. What we have built so far is not going to be very useful outside of random testing, IMHO.

Two and a half years later, is there some more consensus in this task (as asked for in the Gerrit patch)?

The topic has not been discussed any further since then, nor have I heard any further requests for a feature like this one for cumin.
Since then we've introduced Spicerack and Cookbooks, which allow adding more logic and orchestration to the operations to perform, probably covering some of the use cases listed above, in conjunction with a wider usage of cumin aliases for canary hosts.
At this point this is more of a wish list item than anything else, so I'm resolving it and abandoning the change.

Feel free to reopen if anyone feels strongly about it and thinks that this should be prioritized.

Change 409980 abandoned by Volans:
[operations/software/cumin@master] Query: allow to extract random subset of hosts

Reason:
No activity nor consensus on the related task

https://gerrit.wikimedia.org/r/409980