Summary
pytorch-rocm on ml-lab1002.eqiad.wmnet is outdated. Update it to a current ROCm-compatible release so that MI210 inference experiments have access to recent model support and bug fixes. The environment to upgrade lives under /srv/pytorch-rocm/. While doing this, we can also explore installing vLLM from one of the pre-built wheels mentioned in the documentation (https://docs.vllm.ai/en/latest/getting_started/installation/gpu/#create-a-new-python-environment). From a quick pass through these docs, both gfx90a (MI210) and gfx942 (MI300X) should be supported.
Acceptance criteria
- pytorch-rocm on ml-lab1002 is upgraded to a current ROCm-compatible release and passes a smoke-test inference on the MI210 GPUs.
- Install path and upgrade steps are added to https://wikitech.wikimedia.org/wiki/Machine_Learning/ML-Lab.
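
For the smoke-test criterion, something along these lines might be enough. It is only a sketch, not an agreed test: the function name `smoke_test`, the report keys, and the tiny `Linear` model are made up for illustration, and a real check would probably load an actual model. It assumes that ROCm builds of PyTorch expose the GPUs through the `torch.cuda` namespace and set `torch.version.hip`; the `gcnArchName` device property is read defensively because it may not be present on every build.

```python
import importlib.util


def smoke_test():
    """Report the PyTorch/ROCm setup and run a tiny forward pass on every visible GPU."""
    report = {
        "torch_installed": False,
        "torch_version": None,
        "hip_version": None,
        "devices": [],
        "inference_ok": False,
    }
    if importlib.util.find_spec("torch") is None:
        return report  # torch is not installed in this environment

    import torch

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    # On ROCm builds this is a version string; on CUDA or CPU-only builds it is None.
    report["hip_version"] = getattr(torch.version, "hip", None)

    # ROCm devices are addressed through the torch.cuda API.
    if not torch.cuda.is_available():
        return report

    results = []
    for index in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(index)
        report["devices"].append(
            {
                "index": index,
                "name": torch.cuda.get_device_name(index),
                # Assumption: expected to look like "gfx90a..." on an MI210.
                "arch": getattr(props, "gcnArchName", None),
            }
        )
        device = torch.device("cuda", index)
        # Hypothetical stand-in model: one linear layer, 16 inputs -> 4 outputs.
        model = torch.nn.Linear(16, 4).to(device).eval()
        inputs = torch.randn(8, 16, device=device)
        with torch.inference_mode():
            outputs = model(inputs)
        # Pass if the output has the expected shape and contains no NaN/inf values.
        results.append(
            tuple(outputs.shape) == (8, 4) and bool(torch.isfinite(outputs).all())
        )

    report["inference_ok"] = len(results) > 0 and all(results)
    return report


if __name__ == "__main__":
    print(smoke_test())
```

Running this with the Python interpreter of the upgraded environment and pasting the printed report into the task would show the installed version, the HIP version, and whether each GPU completed a forward pass.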