= Problem statement =
As of this writing, WMCS Toolforge is partially based on the GridEngine software. Our plan is to stop using GridEngine and move everything to Kubernetes as soon as possible, but in practice we will need to support the grid for a while. Maintaining the grid is painful and error-prone, and the procedures are not very well documented. That's why we're working on automating the most relevant operations.
Note that the code is already in use with the spicerack framework, by means of the dedicated `wmcs` branch in the cookbook repo, so this task is mostly about relocating that code into spicerack.
== New data structures ==
We will define a few custom datatypes, exceptions and classes to abstract away grid state and to interact with it from spicerack and cookbooks.
See the detailed list under "Projected definitions" below.
=== Third party dependencies ===
Nothing special.
=== Additional configuration ===
Nothing special.
== Possible future improvements ==
Most of the Toolforge grid-related cookbooks share the same parser option, `--grid-master-fqdn`. At some point we may want to create an abstraction that adds this common argparse configuration to all related cookbooks, but this is out of scope for this initial iteration.
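A minimal sketch of what such a shared helper might look like, assuming plain `argparse`. The function name `add_grid_master_argument`, the help text and the example hostname are hypothetical; only the `--grid-master-fqdn` option itself comes from the existing cookbooks.

```lang=python
import argparse


def add_grid_master_argument(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    """Add the common --grid-master-fqdn option to a cookbook parser (hypothetical helper)."""
    parser.add_argument(
        "--grid-master-fqdn",
        help="FQDN of the grid master node to operate on.",
    )
    return parser


# Each grid cookbook would call the helper instead of redefining the option.
parser = add_grid_master_argument(argparse.ArgumentParser(description="example grid cookbook"))
# The hostname below is a made-up placeholder.
args = parser.parse_args(["--grid-master-fqdn", "grid-master.example.wikimedia.cloud"])
print(args.grid_master_fqdn)
```

Whether the option should be required or carry a per-project default is left open here; that would be part of the design discussion for the abstraction.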
Also, as a mostly cosmetic matter, most of the code does not need full node FQDNs; short hostnames would do. The FQDN can be inferred from the WMCS project name and deployment (i.e., `whatever-vm.<project>.<deployment>.wikimedia.cloud`). We can improve this either before introducing the code or shortly after. Ideally after, so we can figure out how to tackle this cosmetic problem on a global scale (the same happens in our openstack-specific cookbooks).
This is not a big deal anyway, and interface compatibility is just one `.split(".")[0]` away.
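As a rough illustration of both directions of that conversion, the following sketch uses two hypothetical helpers, `short_hostname` and `node_fqdn`, which are not part of the proposed interface. The project and deployment values are placeholders.

```lang=python
def short_hostname(host: str) -> str:
    """Return the short hostname, whether given a short name or a full FQDN."""
    return host.split(".")[0]


def node_fqdn(hostname: str, project: str, deployment: str) -> str:
    """Build the FQDN following <hostname>.<project>.<deployment>.wikimedia.cloud."""
    return f"{short_hostname(hostname)}.{project}.{deployment}.wikimedia.cloud"


# Example values below are placeholders, not real hosts.
fqdn = node_fqdn("whatever-vm", "tools", "eqiad1")
print(fqdn)                  # whatever-vm.tools.eqiad1.wikimedia.cloud
print(short_hostname(fqdn))  # whatever-vm
```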
=== Projected definitions ===
This is a projection of what we would define as the new interface for this module.
```lang=python
# Bodies are elided (`...`); only the interface is shown.
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional
# Assumption: the standard-library ElementTree and spicerack's Remote class.
from xml.etree.ElementTree import ElementTree

from spicerack.remote import Remote


class GridError(Exception):
    """Base parent class for all grid related exceptions."""

class GridNodeNotFound(GridError): ...

class GridUnableToJoin(GridError): ...

class GridQueueType(Enum): ...

@dataclass(frozen=True)
class GridQueueTypesSet:
    @classmethod
    def from_types_string(cls, types_string: Optional[str]) -> "GridQueueTypesSet": ...

class GridQueueState(Enum): ...

@dataclass(frozen=True)
class GridQueueStatesSet:
    @classmethod
    def from_state_string(cls, state_string: Optional[str]) -> "GridQueueStatesSet": ...

    def is_ok(self) -> bool: ...

@dataclass(frozen=True)
class GridQueueInfo:
    @classmethod
    def from_xml(cls, xml_obj: ElementTree) -> "GridQueueInfo": ...

    def is_ok(self) -> bool: ...

@dataclass(frozen=True)
class GridNodeInfo:
    @classmethod
    def from_xml(cls, xml_obj: ElementTree) -> "GridNodeInfo": ...

    def is_ok(self) -> bool: ...

class GridController:
    """Grid cluster controller class."""

    def __init__(self, remote: Remote, master_node_fqdn: str): ...
    def reconfigure(self, is_tools_project: bool) -> None: ...
    def add_node(self, host_fqdn: str, is_tools_project: bool, force: bool = False) -> None: ...
    def get_nodes_info(self) -> Dict[str, GridNodeInfo]: ...
    def get_node_info(self, host_fqdn: str) -> GridNodeInfo: ...
    def depool_node(self, host_fqdn: str) -> None: ...
    def pool_node(self, hostname: str) -> None: ...
```
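To give an idea of how the `from_state_string` constructor and `is_ok` in the projection above could fit together, here is a small self-contained sketch. The enum members, their one-letter values, the `states` field and the "no flags means healthy" rule are all assumptions made for illustration; the actual definitions may differ.

```lang=python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional


class GridQueueState(Enum):
    """Illustrative subset of queue state flags; names and letters are assumptions."""

    ALARM = "a"
    UNKNOWN = "u"
    DISABLED = "d"
    ERROR = "E"


@dataclass(frozen=True)
class GridQueueStatesSet:
    # Hypothetical field: the set of flags parsed from the state string.
    states: FrozenSet[GridQueueState]

    @classmethod
    def from_state_string(cls, state_string: Optional[str]) -> "GridQueueStatesSet":
        # An empty or missing string is treated as "no flags set".
        # A letter that is not a known flag raises ValueError.
        return cls(states=frozenset(GridQueueState(char) for char in (state_string or "")))

    def is_ok(self) -> bool:
        # Assumption for this sketch: a queue is healthy only when no flag is set.
        return not self.states


print(GridQueueStatesSet.from_state_string(None).is_ok())  # True
print(GridQueueStatesSet.from_state_string("dE").is_ok())  # False
```

Keeping the dataclass frozen means a parsed state set can be shared or used as a dictionary key without the risk of it changing after it was read from the grid.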