Q3 2024 Goal: A plan for a training infrastructure
Open, Needs Triage, Public

Description

One part: Test GPU on Hadoop

Event Timeline

calbon renamed this task from Goal: A document describing a plan for a training infrastructure to Goal: A plan for a training infrastructure. Dec 20 2023, 3:18 PM
calbon renamed this task from Goal: A plan for a training infrastructure to Goal: A plan for a training infrastructure . Dec 20 2023, 3:21 PM
calbon updated the task description.

Regarding testing the GPU on Hadoop, I spoke with @fkaelin yesterday. He suggested a potentially suitable project: an end-to-end Airflow pipeline. It would consist of a Spark task to create the training dataset, a GPU task to train the model, and a Spark task to run batch evaluation.
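To make the ordering of the three stages concrete, here is a minimal sketch of the dependency chain. It is plain Python using the standard library's `graphlib`, not Airflow code, and the task names are hypothetical placeholders rather than names from an existing DAG.

```python
from graphlib import TopologicalSorter

# Hypothetical task names (assumptions for illustration only).
# Each task maps to the set of tasks that must finish before it starts.
PIPELINE = {
    "spark_create_training_dataset": set(),
    "gpu_train_model": {"spark_create_training_dataset"},
    "spark_batch_evaluate": {"gpu_train_model"},
}


def execution_order(pipeline):
    """Return the task names in an order that respects the dependencies."""
    return list(TopologicalSorter(pipeline).static_order())


if __name__ == "__main__":
    for step, task in enumerate(execution_order(PIPELINE), start=1):
        print(f"{step}. {task}")
```

In a real Airflow DAG the same chain would be expressed as task dependencies between operators; which operators to use for the Spark and GPU steps is one of the things the spike would need to determine.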

The training step should work with a Hadoop GPU and, with minor code changes, with a cloud GPU as well, which the research team plans to test. In the future, we could use the same end-to-end Airflow pipeline with a GPU on ml-train to train an LLM, so it is worth experimenting with.

The revertrisk-multilingual model could be a good first target for this pipeline. The model is trained using the same GPU on statbox, and Muniza has already created an Airflow DAG that generates the training datasets.

  • Training servers ordered.
  • GCP credits likely.

Aiko to work on a spike on the GPU-on-Hadoop workflow and an end-to-end Airflow pipeline (data prep pipeline, training pipeline, model evaluation).

calbon renamed this task from Goal: A plan for a training infrastructure to Q3 2024 Goal: A plan for a training infrastructure . Apr 16 2024, 2:51 PM