We would like to conduct a PoC on using cloud GPUs for model training.
Currently, training models that require GPU compute is nontrivial, with no standard pipeline. This creates difficulties for research scientists doing model development, as well as overhead for model integration, retraining, and maintenance across teams.
This PoC is to test the training workflow; Research will not build any infrastructure. The project will lay the groundwork for Research's model-training code base for when our WMF infra is ready.
To summarize, this PoC explores GPU capacity on low-cost commercial infra and establishes a basic, standardized code base that Research can integrate with the ML platform (sandbox, trainWing, etc.) when the infrastructure is ready.
Proposed workflow:
- Dataset pipeline in WMF -> publish dataset publicly
- Provision cloud GPU (e.g. lambdalabs)
- Training code (PyTorch), self-contained and reusable
- Fine-tune model -> pull the trained model back into WMF infrastructure
- Batch evaluation in WMF
- Inference on LiftWing
Candidate models:
- text summarization
- revert risk
- reference model (depending on model readiness)