User Story
As a Wikimedia machine learning engineer or researcher, I want to develop or train a machine learning model on the Data Science and Engineering Kubernetes Cluster using Kubeflow, so that I can iterate rapidly on model development.
Acceptance Criteria
- The engineer or researcher should be able to develop or train a machine learning model on the Data Science and Engineering Kubernetes Cluster using Kubeflow and Ceph.
- The engineer or researcher should be able to iterate rapidly in model development.
- Kubeflow should provide the tools and functionality needed to make model development easy and efficient.
Outstanding Questions
- Which components of Kubeflow are required and which are optional for us? For example:
  - Jupyter Notebooks
  - Pipelines
  - Distributed training (PyTorch, TFJob)
  - Central UI Dashboard
- Can we run a Jupyter notebook that reads data from HDFS?