Problem statement
As of this writing, WMCS Toolforge is partially based on the GridEngine software. Our plan is to stop using GridEngine and move everything to Kubernetes as soon as possible, but realistically we will need to support the grid for a while. Maintaining the grid is painful, poorly documented, and error-prone. That's why we're working on automating the most relevant operations.
Note that the code is already in use with the spicerack framework through the dedicated wmcs branch in the cookbook repo (link), so this task is mostly about relocating the code into spicerack.
New data structures
We will define a few custom datatypes, exceptions and classes to abstract away grid state and be able to interact with it from spicerack/cookbooks.
See detailed list below.
Third party dependencies
defusedxml: the latest version on PyPI is 0.7.1; buster has 0.5.0 and bullseye has 0.6.0.
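As a rough illustration of why this dependency is needed, the sketch below parses a grid-like XML document with defusedxml, which mirrors the standard library ElementTree parsing API. The XML sample, its element names, and the host_states helper are hypothetical and only stand in for the real grid output, whose schema is not described here. The fallback import exists only so the sketch runs where defusedxml is not installed.

```python
try:
    # defusedxml exposes the same fromstring() entry point as the stdlib,
    # while rejecting unsafe XML constructs such as entity expansion.
    from defusedxml.ElementTree import fromstring
except ImportError:
    # Fallback for environments without the dependency (illustration only).
    from xml.etree.ElementTree import fromstring

# Hypothetical, simplified XML. The real grid output uses its own schema.
SAMPLE_XML = """
<hosts>
  <host name="node-01.example.wikimedia.cloud"><state>ok</state></host>
  <host name="node-02.example.wikimedia.cloud"><state>error</state></host>
</hosts>
"""


def host_states(xml_text: str) -> dict:
    """Map each host name to the text of its <state> child element."""
    root = fromstring(xml_text.strip())
    return {host.get("name"): host.findtext("state") for host in root.iter("host")}
```

Since only the basic fromstring() call is used, nothing in this sketch should depend on features newer than the 0.5.0 version shipped in buster.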
Additional configuration
Nothing special.
Possible future improvements
Most Toolforge grid-related cookbooks share the same parser option, --grid-master-fqdn. At some point we may want to create an abstraction that provides this common argparse configuration to all related cookbooks, but this is out of scope for this initial iteration.
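One possible shape for such an abstraction is an argparse parent parser, sketched below. The function names, the --node-hostname option, and the choice to make --grid-master-fqdn required are assumptions for illustration, not part of the existing cookbooks.

```python
import argparse


def grid_common_parser() -> argparse.ArgumentParser:
    """Return a parent parser holding the options shared by grid cookbooks (hypothetical helper)."""
    # add_help=False avoids a duplicate -h/--help when used as a parent.
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument(
        "--grid-master-fqdn",
        required=True,  # assumption: the real cookbooks may provide a default instead
        help="FQDN of the grid master node.",
    )
    return parser


def depool_argument_parser() -> argparse.ArgumentParser:
    """Example cookbook parser that inherits the shared grid options."""
    parser = argparse.ArgumentParser(
        description="Depool a grid node (illustrative cookbook).",
        parents=[grid_common_parser()],
    )
    # Hypothetical cookbook-specific option.
    parser.add_argument("--node-hostname", required=True, help="Short hostname of the node.")
    return parser
```

Each cookbook would then only declare its own options and pass the shared parser through parents.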
Also, as a mostly cosmetic matter, most of the code does not need to handle full node FQDNs, only short hostnames. The FQDN can be inferred from the WMCS project name plus the deployment (i.e., whatever-vm.<project>.<deployment>.wikimedia.cloud). This is something we can improve before the code introduction or shortly after. Ideally after, so we can figure out how to tackle this cosmetic problem on a global scale (the same happens in our openstack-specific cookbooks).
This is not a big deal anyway, and interface compatibility is just one .split(".")[0] away.
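A minimal sketch of the two conversions described above. The helper names are hypothetical, and the project and deployment values in the usage comment are only illustrative.

```python
def short_hostname(host: str) -> str:
    """Return the short hostname; works for both FQDNs and short names."""
    return host.split(".")[0]


def infer_fqdn(host: str, project: str, deployment: str) -> str:
    """Build <hostname>.<project>.<deployment>.wikimedia.cloud from its parts."""
    return f"{short_hostname(host)}.{project}.{deployment}.wikimedia.cloud"


# Illustrative usage with made-up argument values:
# infer_fqdn("whatever-vm", "tools", "eqiad1")
```

Because short_hostname() accepts either form, callers could keep passing FQDNs while the interface migrates to short hostnames.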
Projected definitions
This is a projection of what we would define as the new interface for this module.
class GridError(Exception):
    """Base parent class for all grid related exceptions."""

class GridNodeNotFound(GridError):

class GridUnableToJoin(GridError):

class GridQueueType(Enum):

@dataclass(frozen=True)
class GridQueueTypesSet:
    @classmethod
    def from_types_string(cls, types_string: Optional[str]) -> "GridQueueTypesSet":

class GridQueueState(Enum):

@dataclass(frozen=True)
class GridQueueStatesSet:
    @classmethod
    def from_state_string(cls, state_string: Optional[str]) -> "GridQueueStatesSet":
    def is_ok(self):

@dataclass(frozen=True)
class GridQueueInfo:
    @classmethod
    def from_xml(cls, xml_obj: ElementTree) -> "GridQueueInfo":
    def is_ok(self):

@dataclass(frozen=True)
class GridNodeInfo:
    @classmethod
    def from_xml(cls, xml_obj: ElementTree) -> "GridNodeInfo":
    def is_ok(self) -> bool:

class GridController:
    """Grid cluster controller class."""
    def __init__(self, remote: Remote, master_node_fqdn: str):
    def reconfigure(self, is_tools_project: bool) -> None:
    def add_node(self, host_fqdn: str, is_tools_project: bool, force: bool = False) -> None:
    def get_nodes_info(self) -> Dict[str, GridNodeInfo]:
    def get_node_info(self, host_fqdn: str) -> GridNodeInfo:
    def depool_node(self, host_fqdn: str) -> None:
    def pool_node(self, hostname: str) -> None:
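To make the projection more concrete, here is one way GridQueueState and GridQueueStatesSet could be fleshed out. Everything beyond the names and signatures listed above is an assumption: the enum members, their single-letter values, the states field, and the rule that an empty set means the queue is OK are illustrative guesses rather than the actual implementation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional


class GridQueueState(Enum):
    """Assumed subset of single-letter queue state codes (illustrative only)."""

    UNKNOWN = "u"
    ALARM = "a"
    SUSPENDED = "s"
    DISABLED = "d"
    ERROR = "E"


@dataclass(frozen=True)
class GridQueueStatesSet:
    """Immutable set of queue states parsed from a compact state string."""

    states: FrozenSet[GridQueueState]  # hypothetical field name

    @classmethod
    def from_state_string(cls, state_string: Optional[str]) -> "GridQueueStatesSet":
        # A missing or empty string yields an empty set.
        # An unrecognized letter raises ValueError from the Enum lookup.
        return cls(states=frozenset(GridQueueState(char) for char in (state_string or "")))

    def is_ok(self) -> bool:
        # Assumption: a queue with no state flags at all is healthy.
        return not self.states
```

With frozen dataclasses and frozensets, the parsed state cannot be mutated after creation, which fits the idea of a snapshot of grid state taken at query time.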