The Web team currently generates article summaries using Hugging Face Transformers on ml-lab1001 (see runbook) or the Cohere API.
In T395019#10847999, the ML team tested a vLLM backend, which offers faster inference than the Hugging Face backend.
In this task, we'll create an end-to-end pipeline that uses the new vLLM image (docker-registry.wikimedia.org/amd-vllm085), currently accessible on ml-lab1002. The pipeline will handle:
- fetching article data
- generating summaries
- performing quality evaluations
- returning an output in the desired format
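The four stages above could be wired together roughly as follows. This is a minimal sketch, not the actual implementation: the vLLM server URL and port, the model id, the quality heuristic, and the JSON output shape are all assumptions. It presumes the image is run with vLLM's OpenAI-compatible server exposed, and fetches article text via the Wikimedia REST API page-summary endpoint.

```python
# Hypothetical pipeline sketch; endpoint, model id, and quality check are
# assumptions, not the team's actual configuration.
import json
import urllib.request

WIKI_API = "https://en.wikipedia.org/api/rest_v1/page/summary/{title}"
# Assumed: vLLM OpenAI-compatible server running inside the amd-vllm085 image.
VLLM_URL = "http://ml-lab1002:8000/v1/chat/completions"


def fetch_article(title: str) -> str:
    """Fetch the lead extract of an article from the Wikimedia REST API."""
    with urllib.request.urlopen(WIKI_API.format(title=title)) as resp:
        return json.load(resp)["extract"]


def build_request(text: str, model: str = "placeholder-model") -> dict:
    """Build an OpenAI-compatible chat request for the vLLM server."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "Summarize the article in two sentences."},
            {"role": "user", "content": text},
        ],
        "max_tokens": 128,
    }


def evaluate(summary: str, max_words: int = 60) -> dict:
    """Toy quality evaluation: summary is non-empty and within a length bound."""
    words = summary.split()
    return {"word_count": len(words), "passes": 0 < len(words) <= max_words}


def to_output(title: str, summary: str, evaluation: dict) -> str:
    """Serialize the result in one possible output format (assumed: JSON)."""
    return json.dumps({"title": title, "summary": summary, "evaluation": evaluation})
```

In a real run, `build_request` would be POSTed to `VLLM_URL` and the summary read from the response's `choices[0].message.content`; the pure helpers here only illustrate the data flow between stages.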
This work will build upon the existing simple-summaries project developed by the Research team.