spicerack: introduce GridEngine controller
Closed, Declined · Public

Description

problem statement

As of this writing, WMCS Toolforge is partially based on the GridEngine software. Our plan is to retire GridEngine and move everything to Kubernetes as soon as possible, but the reality is that we will need to support the grid for a while. Maintaining the grid is painful, poorly documented, and error-prone. That's why we're working on automating the most relevant operations.

Note that the code is already in use with the spicerack framework via the dedicated wmcs branch in the cookbook repo (link). So this task is mostly about relocating the code into spicerack.

New data structures

We will define a few custom datatypes, exceptions and classes to abstract away grid state and be able to interact with it from spicerack/cookbooks.

See detailed list below.

Third party dependencies

defusedxml: the latest version on PyPI is 0.7.1; buster has 0.5.0 and bullseye has 0.6.0.
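
For illustration only, a minimal sketch of how this dependency is typically used: `defusedxml.ElementTree.fromstring` is a hardened drop-in for the stdlib parser. The XML sample and the `host_names` helper are invented for the example and do not reflect actual grid output; the stdlib fallback is there only so the sketch runs where defusedxml is not installed.

```python
from typing import List

try:
    # defusedxml mirrors the stdlib ElementTree API but refuses dangerous
    # constructs such as entity expansion.
    from defusedxml.ElementTree import fromstring
except ImportError:
    # Fallback so this sketch runs without the dependency.
    # Not suitable for untrusted input.
    from xml.etree.ElementTree import fromstring

# Invented sample document; real grid XML will look different.
SAMPLE_XML = "<hosts><host name='node-01'/><host name='node-02'/></hosts>"


def host_names(xml_text: str) -> List[str]:
    """Hypothetical helper: return the name attribute of every <host> element."""
    root = fromstring(xml_text)
    return [host.get("name") for host in root.iter("host")]
```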

Additional configuration

Nothing special.

possible future improvements

Most Toolforge grid-related cookbooks share the same parser option, --grid-master-fqdn. At some point we may want to create an abstraction that provides this common argparse configuration to all related cookbooks. But this is out of scope for this initial iteration.
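
One possible shape for that abstraction, sketched with argparse's `parents` mechanism. The function names, the `--node-fqdn` option, the `required=True` setting and the hostnames are all hypothetical; only `--grid-master-fqdn` comes from the existing cookbooks.

```python
import argparse


def grid_parent_parser() -> argparse.ArgumentParser:
    """Hypothetical helper: a parser holding the options shared by grid cookbooks."""
    # add_help=False avoids a clash with the -h/--help of the child parser.
    parent = argparse.ArgumentParser(add_help=False)
    parent.add_argument(
        "--grid-master-fqdn",
        required=True,  # assumption: the real cookbooks may use a default instead
        help="FQDN of the grid master node.",
    )
    return parent


def example_cookbook_parser() -> argparse.ArgumentParser:
    """Hypothetical cookbook parser that inherits the shared option."""
    parser = argparse.ArgumentParser(
        description="Example grid cookbook.",
        parents=[grid_parent_parser()],
    )
    parser.add_argument("--node-fqdn", required=True, help="FQDN of the node to act on.")
    return parser


# Placeholder hostnames, following the <host>.<project>.<deployment>.wikimedia.cloud pattern.
args = example_cookbook_parser().parse_args(
    [
        "--grid-master-fqdn", "grid-master.myproject.mydeployment.wikimedia.cloud",
        "--node-fqdn", "whatever-vm.myproject.mydeployment.wikimedia.cloud",
    ]
)
```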

Also, as a mostly cosmetic matter, most of the code does not need to handle full node FQDNs, only short hostnames. The FQDN can be inferred from the WMCS project name + deployment (i.e., whatever-vm.<project>.<deployment>.wikimedia.cloud). So this is something we can improve before the code introduction or shortly after. Ideally after, so we can figure out how to tackle this cosmetic problem on a global scale (the same happens in our openstack-specific cookbooks).

This is not a big deal anyway, and interface compatibility is just one .split(".")[0] away.
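
As a small sketch of both directions of that conversion (the helper names and the project/deployment values are placeholders, not existing spicerack functions):

```python
def short_hostname(fqdn: str) -> str:
    """Return the first DNS label of an FQDN (also works on a short hostname)."""
    return fqdn.split(".")[0]


def build_fqdn(hostname: str, project: str, deployment: str) -> str:
    """Hypothetical helper: infer the FQDN from hostname, project and deployment."""
    return f"{hostname}.{project}.{deployment}.wikimedia.cloud"


# Placeholder project and deployment names.
fqdn = build_fqdn("whatever-vm", "myproject", "mydeployment")
```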

Projected definitions

This is a projection of what we would define as the new interface for this module.

from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional
from xml.etree.ElementTree import ElementTree

from spicerack.remote import Remote

# Interface stubs only: "..." marks bodies that are not defined here.

class GridError(Exception):
    """Base parent class for all grid related exceptions."""

class GridNodeNotFound(GridError): ...

class GridUnableToJoin(GridError): ...

class GridQueueType(Enum): ...

@dataclass(frozen=True)
class GridQueueTypesSet:
    @classmethod
    def from_types_string(cls, types_string: Optional[str]) -> "GridQueueTypesSet": ...

class GridQueueState(Enum): ...

@dataclass(frozen=True)
class GridQueueStatesSet:
    @classmethod
    def from_state_string(cls, state_string: Optional[str]) -> "GridQueueStatesSet": ...
    def is_ok(self): ...

@dataclass(frozen=True)
class GridQueueInfo:
    @classmethod
    def from_xml(cls, xml_obj: ElementTree) -> "GridQueueInfo": ...
    def is_ok(self): ...

@dataclass(frozen=True)
class GridNodeInfo:
    @classmethod
    def from_xml(cls, xml_obj: ElementTree) -> "GridNodeInfo": ...
    def is_ok(self) -> bool: ...

class GridController:
    """Grid cluster controller class."""
    def __init__(self, remote: Remote, master_node_fqdn: str): ...
    def reconfigure(self, is_tools_project: bool) -> None: ...
    def add_node(self, host_fqdn: str, is_tools_project: bool, force: bool = False) -> None: ...
    def get_nodes_info(self) -> Dict[str, GridNodeInfo]: ...
    def get_node_info(self, host_fqdn: str) -> GridNodeInfo: ...
    def depool_node(self, host_fqdn: str) -> None: ...
    def pool_node(self, hostname: str) -> None: ...
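
To give an idea of how one of these types might behave, here is a rough sketch of GridQueueStatesSet. The `states` field, the enum members and their single-letter codes, and the rule that "no state flags means OK" are all assumptions made for illustration; the actual code in the wmcs branch may differ.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional


class GridQueueState(Enum):
    """Assumed subset of single-letter queue state codes."""

    ALARM = "a"
    DISABLED = "d"
    ERROR = "E"
    UNKNOWN = "u"


@dataclass(frozen=True)
class GridQueueStatesSet:
    # Assumption: the set is stored as a frozenset so the dataclass stays hashable.
    states: FrozenSet[GridQueueState]

    @classmethod
    def from_state_string(cls, state_string: Optional[str]) -> "GridQueueStatesSet":
        """Parse a string such as "dE", one state per character.

        None or an empty string yields an empty set. A letter that is not in
        the enum raises ValueError.
        """
        return cls(states=frozenset(GridQueueState(char) for char in (state_string or "")))

    def is_ok(self) -> bool:
        # Assumption: a queue with no state flags at all is considered healthy.
        return not self.states
```

Because the dataclass is frozen and holds a frozenset, two instances parsed from "dE" and "Ed" compare equal, which is convenient when checking for state changes.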

Event Timeline

Restricted Application added a subscriber: Aklapper.
Volans triaged this task as Medium priority. Jan 27 2022, 9:24 AM
Volans updated the task description.

@aborrero thanks for opening this task!

I had a chat with @jbond on what improvements we could make on our side to simplify the integration of wmcs-specific code into Spicerack.
Among the various ideas, some things that might be relevant here:

  1. Sudo support. As part of the efforts for rootless cumin (T284302), we're investigating ways to add better sudo support everywhere in Spicerack. This should simplify the integration of code that heavily depends on that.
  2. Investigate a nice way to have an extra set of modules that live in the same Spicerack repository and extend Spicerack's accessors but that are packaged as separate PyPI and Debian packages. This would allow us to split the dependencies and simplify the deployment in the different realms.
  3. Add support for multiple cookbook directories so that we could explore the possibility of having a separate repository for cookbooks that are WMCS-specific and can't be run from the production hosts. This should simplify the local setup for WMCS SRE and also avoid cluttering the production list of cookbooks with cookbooks that can't be run.

We should evaluate if it would be easier to wait for some of the above to be ready in Spicerack to simplify the integration process.

Here are some preliminary general comments on the proposed API definitions (in no particular order):

  • What would be the filename of the module? (grid, grid_engine, etc...)
  • GridError should inherit from SpicerackError
  • In the GridController class all methods refer to node while all parameters refer to host. For naming purposes I think that using fqdn instead of host_fqdn is cleaner and still totally unambiguous.
  • GridController.__init__ is requiring a Remote instance. In all the other modules we usually pass a RemoteHosts instance instead. I'm not saying we can't, but it's an unusual use case. I see that you do that because then in add_node you need to get a RemoteHosts instance of the new node. One option could be to:
    • require just a RemoteHosts instance of the master host in the __init__ signature (that will be generated by the Spicerack accessor, see below)
    • modify add_node to require a RemoteHosts instance instead of the FQDN
  • It is missing how the functionality will be exposed to the cookbooks via the Spicerack class with one or more accessors. I guess you might want a Spicerack.grid_engine_controller accessor or similar.
  • The code will also need to be test-covered.
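
A rough sketch of the shape these suggestions point to, under stated assumptions: `SpicerackError` and `RemoteHostsStandIn` below are local stand-ins so the snippet runs without Spicerack installed (the real classes live in `spicerack.exceptions` and `spicerack.remote`), and the method bodies and the `added_nodes` attribute are placeholders, not the proposed implementation.

```python
from typing import List


class SpicerackError(Exception):
    """Stand-in for spicerack.exceptions.SpicerackError."""


class GridError(SpicerackError):
    """Base parent class for all grid related exceptions."""


class RemoteHostsStandIn:
    """Stand-in for spicerack.remote.RemoteHosts, reduced to a single FQDN."""

    def __init__(self, fqdn: str) -> None:
        self.fqdn = fqdn


class GridController:
    """Grid cluster controller class."""

    def __init__(self, master_host: RemoteHostsStandIn) -> None:
        # The master host arrives ready-made, e.g. from a Spicerack accessor.
        self._master_host = master_host
        self.added_nodes: List[str] = []  # placeholder state for this sketch only

    def add_node(self, node: RemoteHostsStandIn, is_tools_project: bool, force: bool = False) -> None:
        # Placeholder body: the real method would run commands on the node and the master.
        self.added_nodes.append(node.fqdn)


# Placeholder hostnames.
controller = GridController(RemoteHostsStandIn("grid-master.myproject.mydeployment.wikimedia.cloud"))
controller.add_node(RemoteHostsStandIn("whatever-vm.myproject.mydeployment.wikimedia.cloud"), is_tools_project=False)
```

With this shape the controller never needs the Remote factory itself; whoever calls add_node is responsible for building the instance for the new node.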

P.S. I've edited the task description to add the dependency on defusedxml.

aborrero changed the task status from Open to Stalled. Feb 7 2022, 12:14 PM

All of the above sounds right to me. I'm fine waiting, so I'll mark this task as stalled.

Thanks!

taavi added a project: Toolforge.