
PoC - general model training support (Cloud GPU)
Closed, Resolved (Public)

Description

We would like to conduct a PoC on utilizing cloud GPUs for model training.

Currently, training models that require GPU compute is nontrivial and there is no standard pipeline. This creates difficulties for research scientists doing model development work, as well as overhead for model integration, retraining, and maintenance (regardless of team).

This PoC is to test the training workflow; Research will not build any infrastructure. The project will lay the groundwork for a Research model training code base for when our WMF infra is ready in the future.

To summarize, this PoC is to explore possibilities for GPU capacity on low-cost commercial infra and to establish a basic, standardized code base that Research can apply for future integration with the ML platform (sandbox, trainWing, etc.) when the infrastructure is ready.

Proposed workflow:

  1. Dataset pipeline in WMF -> dataset public
  2. Provision a cloud GPU (e.g. lambdalabs)
  3. Training code (PyTorch), self-contained and re-usable (see the sketch after this list)
  4. Fine-tune model -> pull the model back into wmf infrastructure
  5. Batch evaluation in WMF
  6. Inference in LiftWing
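
A minimal sketch of what the self-contained training code in step 3 could look like, assuming a Hugging Face T5 checkpoint and a JSONL export of the public dataset from step 1; the file name, column names, and hyperparameters are illustrative placeholders rather than the actual PoC values:

```python
# Minimal fine-tuning sketch (step 3): self-contained PyTorch/Hugging Face code
# intended to run unchanged on a provisioned cloud GPU. Paths, column names and
# hyperparameters are illustrative placeholders, not the actual PoC values.
import torch
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

MODEL_NAME = "google/flan-t5-large"   # candidate model size for the PoC
DATA_FILE = "train.jsonl"             # public dataset export from step 1 (hypothetical name)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

dataset = load_dataset("json", data_files={"train": DATA_FILE})["train"]

def preprocess(batch):
    # assumes the export has "source" / "target" text columns (hypothetical)
    inputs = tokenizer(batch["source"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["target"], max_length=256, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="checkpoints",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    learning_rate=1e-4,
    bf16=torch.cuda.is_available(),   # A100/H100 support bfloat16
    logging_steps=100,
    save_strategy="epoch",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)

trainer.train()
# step 4: persist the fine-tuned weights so they can be copied back into WMF infra
trainer.save_model("finetuned-model")
tokenizer.save_pretrained("finetuned-model")
```

The intent is that the same script runs unchanged on a rented cloud GPU and, later, on WMF infra; the saved weights directory is what gets pulled back in step 4.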

Candidate models

  • text summarization
  • revert risk
  • reference model (depending on model readiness)

Event Timeline

Weekly updates

  • Sync with Martin: text simplification fine-tuning is a good candidate to scale existing work to a larger model
  • Sync with Aiko regarding improving support for the existing GPU on the Hadoop cluster (T353814). The infra work has a lot of overlap, especially with regard to end-to-end workflows using Airflow; we discussed dedicating a sprint to collaborate more closely.

Weekly updates

  • Initial experiments with lambda labs, using text simplification as the use case (T354653)
  • Tested with an A100 (40GB) and an H100 (80GB) to validate the approach and get an estimate of the cost of fine-tuning runs.
  • For a model size that can be trained on WMF infra (T5 large, 700M params), one epoch takes ~24h on WMF infra. On lambda labs one epoch costs ~$6; the time depends on hardware (~4h on an A100, ~2h on an H100). A back-of-the-envelope breakdown follows below.
  • Next up: use a larger model (3B params) that can't currently be fine-tuned on WMF infra but can be served on it.
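
As a quick sanity check on those numbers, a back-of-the-envelope cost estimate; the hourly rates below are implied by the ~$6-per-epoch observation above, not quoted prices:

```python
# Back-of-the-envelope cost estimate for rented-GPU fine-tuning runs.
# Hourly rates are implied by the ~$6/epoch figure above, not quoted prices.
def epoch_cost(hours_per_epoch: float, usd_per_hour: float) -> float:
    return hours_per_epoch * usd_per_hour

# T5 large (~700M params), one epoch:
print(epoch_cost(4.0, 1.5))   # A100 40GB: ~4h -> ~$6
print(epoch_cost(2.0, 3.0))   # H100 80GB: ~2h -> ~$6
```

Exploratory runs at this model size stay in the single-digit dollar range per epoch, which matches the figures above.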

Weekly updates

  • Fine-tuned the simplification model using a 3-billion-parameter base model (flan-t5-xl) on a single H100 (80GB). Results look promising.
  • Training for 2 epochs (~10h), running inference on the test datasets (~6h), and downloading the model weights: total cost ~$50.
  • The fine-tuned model weights are on stat1008. Validated that inference works on the currently available GPU in the WMF infra (it is slow).
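
For reference, a hedged sketch of what the inference validation looks like once the weights have been copied over to WMF infra; the checkpoint path and the prompt are hypothetical, and the generation settings are illustrative:

```python
# Sketch: run the fine-tuned flan-t5-xl weights (copied to e.g. stat1008)
# on whatever GPU is locally available, falling back to CPU.
# CHECKPOINT is a hypothetical local path, not the actual location.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

CHECKPOINT = "/path/to/finetuned-flan-t5-xl"

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSeq2SeqLM.from_pretrained(CHECKPOINT).to(device)
model.eval()

text = "simplify: The committee deliberated extensively before reaching a consensus."
inputs = tokenizer(text, return_tensors="pt").to(device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```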

Weekly updates

  • Interesting development with the ML team: there is a conversation with a European HPC infra provider about getting compute resources, and research projects are good candidates. Naturally this is relevant to this cloud GPU initiative, and Research is very interested.

@fkaelin what's the latest (we are almost done with this as far as I understand), can we close it?

I am closing this as done - a summary:

  • Successful fine-tuning experiment for text simplification using FlanT5-XL on a cloud GPU (from lambdalabs); the trained model weights were copied to WMF data infra.
  • Investigation into using the fine-tuned model (3B params) inside WMF infrastructure.
    • It is possible to serve the model using the existing GPUs.
    • Promising results for CPU-based inference on a quantized version of the model (benchmarks); a minimal sketch of this approach follows at the end of this comment. We will pursue this work independently of the cloud GPU question.
  • Downsides
    • Impractical for production workloads: fine-tuning a model while jumping between infrastructures is neither trivial nor future-proof.
    • Using cloud GPUs is very convenient for iterative research / development. However, the costs would quickly spiral out of control.
  • Recommendation
    • This experiment validated the value of, and need for, more GPU compute for fine-tuning larger models, in particular for research and development.
    • If there are opportunities to apply for GPU compute credits from HPC providers, we should pursue them.
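
A minimal sketch of the quantized CPU inference direction mentioned above, using stock PyTorch int8 dynamic quantization of the Linear layers; the benchmarks referenced in the summary may have used a different toolchain, and the checkpoint path is hypothetical:

```python
# Sketch: int8 dynamic quantization of the fine-tuned model for CPU inference.
# Only the Linear layers are quantized; activations stay in float, so this runs
# on CPU without any GPU. Path and prompt are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

CHECKPOINT = "/path/to/finetuned-flan-t5-xl"

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSeq2SeqLM.from_pretrained(CHECKPOINT)
model.eval()

quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

inputs = tokenizer("simplify: ...", return_tensors="pt")
with torch.no_grad():
    output_ids = quantized.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```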