
PoC - general model training support (Cloud GPU)
Closed, Resolved (Public)

Description

We would like to conduct a PoC on utilizing cloud GPUs for model training.

Currently, training models that require GPU compute is nontrivial and there is no standard pipeline. This creates difficulties for research scientists doing model development work, as well as overhead for model integration, retraining, and maintenance (regardless of team).

This PoC is to test the training workflow; Research will not build any infrastructure. The project will lay the groundwork for a Research model training code base for when our WMF infra is ready in the future.

To summarize, this PoC is to explore possibilities for GPU capacity on low-cost commercial infra and to establish a basic, standardized code base that Research can apply for future integration with the ML platform (sandbox, trainWing, etc.) when the infrastructure is ready.

Proposed workflow:

  1. Dataset pipeline in WMF -> dataset public
  2. Provision a cloud GPU (e.g. lambdalabs)
  3. Training code (PyTorch), self-contained and re-usable (see the sketch after this list)
  4. Fine-tune model -> pull the model back into wmf infrastructure
  5. Batch evaluation in WMF
  6. Inference in LiftWing
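
A minimal sketch of what the self-contained training code in step 3 could look like, assuming a Hugging Face T5 checkpoint and a JSONL export of the public dataset from step 1; the file name, column names, and hyperparameters are illustrative placeholders rather than the actual PoC values:

```python
# Minimal fine-tuning sketch (step 3): self-contained PyTorch/Hugging Face code
# intended to run unchanged on a provisioned cloud GPU. Paths, column names and
# hyperparameters are illustrative placeholders, not the actual PoC values.
import torch
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

MODEL_NAME = "google/flan-t5-large"   # candidate model size for the PoC
DATA_FILE = "train.jsonl"             # public dataset export from step 1 (hypothetical name)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

dataset = load_dataset("json", data_files={"train": DATA_FILE})["train"]

def preprocess(batch):
    # assumes the export has "source" / "target" text columns (hypothetical)
    inputs = tokenizer(batch["source"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["target"], max_length=256, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="checkpoints",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    learning_rate=1e-4,
    bf16=torch.cuda.is_available(),   # A100/H100 support bfloat16
    logging_steps=100,
    save_strategy="epoch",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)

trainer.train()
# step 4: persist the fine-tuned weights so they can be copied back into WMF infra
trainer.save_model("finetuned-model")
tokenizer.save_pretrained("finetuned-model")
```

The intent is that the same script runs unchanged on a rented cloud GPU and, later, on WMF infra; the saved weights directory is what gets pulled back in step 4.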

Candidate models

  • text summarization
  • revert risk
  • reference model (depending on model readiness)

Event Timeline

Weekly updates

  • Sync with Martin: text simplification fine-tuning is a good candidate to scale existing work to a larger model
  • Sync with Aiko regarding improving support for the existing GPU on the Hadoop cluster (T353814). The infra work has a lot of overlap, especially with regard to end-to-end workflows using Airflow; we discussed dedicating a sprint to collaborate more closely.

Weekly updates

  • Initial experiments with lambda labs, using text simplification as the use case (T354653)
  • Tested with an A100 (40GB) and an H100 (80GB) to validate the approach and get an estimate of the cost of fine-tuning runs.
  • For a model size that can be trained on WMF infra (T5 large, 700M params), one epoch takes ~24h on WMF infra. On lambda labs one epoch costs ~$6; the time depends on hardware (~4h on an A100, ~2h on an H100). A back-of-the-envelope breakdown follows below.
  • Next up: use a larger model (3B params) that can't currently be fine-tuned on WMF infra but can be served on it.
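
As a quick sanity check on those numbers, a back-of-the-envelope cost estimate; the hourly rates below are implied by the ~$6-per-epoch observation above, not quoted prices:

```python
# Back-of-the-envelope cost estimate for rented-GPU fine-tuning runs.
# Hourly rates are implied by the ~$6/epoch figure above, not quoted prices.
def epoch_cost(hours_per_epoch: float, usd_per_hour: float) -> float:
    return hours_per_epoch * usd_per_hour

# T5 large (~700M params), one epoch:
print(epoch_cost(4.0, 1.5))   # A100 40GB: ~4h -> ~$6
print(epoch_cost(2.0, 3.0))   # H100 80GB: ~2h -> ~$6
```

Exploratory runs at this model size stay in the single-digit dollar range per epoch, which matches the figures above.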

Weekly updates

  • Fine-tuned the simplification model using a 3-billion-parameter base model (flan-t5-xl) on a single H100 (80GB). Results look promising.
  • Training for 2 epochs (~10h), running inference on the test datasets (~6h), and downloading the model weights: total cost ~$50.
  • The fine-tuned model weights are on stat1008. Validated that inference works on the currently available GPU in the WMF infra (it is slow).
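
For reference, a hedged sketch of what the inference validation looks like once the weights have been copied over to WMF infra; the checkpoint path and the prompt are hypothetical, and the generation settings are illustrative:

```python
# Sketch: run the fine-tuned flan-t5-xl weights (copied to e.g. stat1008)
# on whatever GPU is locally available, falling back to CPU.
# CHECKPOINT is a hypothetical local path, not the actual location.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

CHECKPOINT = "/path/to/finetuned-flan-t5-xl"

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSeq2SeqLM.from_pretrained(CHECKPOINT).to(device)
model.eval()

text = "simplify: The committee deliberated extensively before reaching a consensus."
inputs = tokenizer(text, return_tensors="pt").to(device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```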

Weekly updates

  • Interesting development with the ML team: there is a conversation with a European HPC infra provider about getting compute resources, and research projects are good candidates. Naturally this is relevant to this cloud GPU initiative, and Research is very interested.

@fkaelin what's the latest (we are almost done with this as far as I understand), can we close it?

I am closing this as done - a summary:

  • Successful fine-tuning experiment for text simplification using FlanT5-XL on a cloud GPU (from lambdalabs); the trained model weights were copied to WMF data infra.
  • Investigation into using the fine-tuned model (3B params) inside WMF infrastructure.
    • It is possible to serve the model using the existing GPUs.
    • Promising results for CPU-based inference on a quantized version of the model (benchmarks); a minimal sketch of this approach follows at the end of this comment. We will pursue this work independently of the cloud GPU question.
  • Downsides
    • Impractical for production workloads: fine-tuning a model while jumping between infrastructures is neither trivial nor future-proof.
    • Using cloud GPUs is very convenient for iterative research / development. However, the costs would quickly spiral out of control.
  • Recommendation
    • This experiment validated the value of, and need for, more GPU compute for fine-tuning larger models, in particular for research and development.
    • If there are opportunities to apply for GPU compute credits from HPC providers, we should pursue them.
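
A minimal sketch of the quantized CPU inference direction mentioned above, using stock PyTorch int8 dynamic quantization of the Linear layers; the benchmarks referenced in the summary may have used a different toolchain, and the checkpoint path is hypothetical:

```python
# Sketch: int8 dynamic quantization of the fine-tuned model for CPU inference.
# Only the Linear layers are quantized; activations stay in float, so this runs
# on CPU without any GPU. Path and prompt are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

CHECKPOINT = "/path/to/finetuned-flan-t5-xl"

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSeq2SeqLM.from_pretrained(CHECKPOINT)
model.eval()

quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

inputs = tokenizer("simplify: ...", return_tensors="pt")
with torch.no_grad():
    output_ids = quantized.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```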