This epic tracks the GPU work for the ML infrastructure.
The main work items are:
- Update AMD drivers to the latest release, and verify with Research that they work as expected.
- Import the AMD k8s device plugin into our repos/configs (this will need a follow-up with ServiceOps on the security aspects).
- Decide which GPU model should be deployed on Lift Wing, and order some.
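
Once the device plugin is imported, a quick smoke test could be a pod that requests a GPU through the plugin's extended resource. The sketch below is only an illustration: it assumes the plugin advertises GPUs as `amd.com/gpu` (the upstream default), and the pod name, image, and helper function are hypothetical placeholders, not anything already present in our repos.

```python
import json

# Assumption: the AMD device plugin exposes GPUs under this extended resource name.
GPU_RESOURCE = "amd.com/gpu"


def gpu_pod_manifest(name, image, gpus=1):
    """Build a minimal Pod manifest (as a dict) that requests `gpus` AMD GPUs.

    Hypothetical helper for illustration only.
    """
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Extended resources are requested via limits, in whole units.
                    "resources": {"limits": {GPU_RESOURCE: gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Placeholder image name; replace with an image that ships the ROCm tooling.
    manifest = gpu_pod_manifest("gpu-smoke-test", "example.invalid/rocm-test:latest")
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be fed to `kubectl apply -f -` on a test cluster; the pod should stay Pending if no node advertises the resource, which makes it a cheap check that the plugin is registered.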