
Article Summary Generation and Evaluation Pipeline using vLLM image
Closed, Resolved · Public

Description

The Web team currently generates article summaries using huggingface transformers on ml-lab1001 (see runbook) or the Cohere API.

In T395019#10847999, the ML team tested a vLLM backend, which offers faster processing than a huggingface backend.

In this task, we'll create an end-to-end pipeline that uses this new vLLM image, docker-registry.wikimedia.org/amd-vllm085, which is currently accessible on ml-lab1002. The pipeline will handle:

  • fetching article data
  • generating summaries
  • performing quality evaluations
  • returning an output in the desired format

This work will build upon the existing simple-summaries project developed by the Research team.

Event Timeline

Building on top of the Research team's work that runs on ml-lab1001 and stat1008, I have worked on an initial pipeline for generating and evaluating article summaries using the vLLM image on ml-lab1002. When you run this pipeline using the instructions in this README.md doc, it executes the following steps (a rough code sketch of the flow follows the list below):

1. Input Processing: Parses the input JSON (list of articles with title and language).
2. Article Data Fetching: For each article, retrieves the lead section text using the Wikipedia API.
3. Summary Generation:

  • Initializes the vLLM engine with the aya-expanse-32b model.
  • Formats prompts based on the article text and language.
  • Generates a simple summary using the vLLM backend.

4. Evaluation: Calculates various quality metrics for the generated summary against the original text, including:

  • Simplicity (e.g., Flesch-Kincaid Grade Level)
  • Fluency (grammatical correctness via LanguageTool)
  • Meaning Preservation (semantic similarity via SummaCZS)
  • Language Preservation (correct language detection)
  • Tone Check (detection of peacock language)

5. Output Generation:

  • Combines the input data, fetched text, generated summary, and all evaluation scores.
  • Prints the results to the console in JSON format.
  • Saves the results to a specified JSON output file.
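
At a high level, the flow corresponds roughly to the minimal sketch below. The helper names, prompt wording, and sampling settings are illustrative assumptions rather than the actual pipeline code, and only one of the evaluation metrics is shown:

import json

import requests
import textstat
from vllm import LLM, SamplingParams


def fetch_lead_section(title, lang):
    """Fetch the plain-text lead section of an article via the Wikipedia API."""
    resp = requests.get(
        f"https://{lang}.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "prop": "extracts",
            "exintro": 1,
            "explaintext": 1,
            "titles": title,
            "format": "json",
        },
        timeout=30,
    )
    pages = resp.json()["query"]["pages"]
    return next(iter(pages.values())).get("extract", "")


def evaluate_summary(original, summary):
    # Only the surface-level FKGL metric is sketched here; the real pipeline
    # also runs LanguageTool, SummaCZS, language identification, and the
    # peacock (tone) model.
    fkgl_summary = textstat.flesch_kincaid_grade(summary)
    fkgl_original = textstat.flesch_kincaid_grade(original)
    return {
        "simplicity_fkgl_model": round(fkgl_summary, 2),
        "simplicity_fkgl_diff": round(fkgl_summary - fkgl_original, 2),
    }


def run_pipeline(articles, output_path="output/pipeline_results.json"):
    llm = LLM(model="/srv/app/models/aya-expanse-32b")
    params = SamplingParams(max_tokens=256, temperature=0.3)
    results = []
    for article in articles:
        text = fetch_lead_section(article["title"], article["lang"])
        # Illustrative prompt; the real pipeline selects a template by prompt_id.
        prompt = f"Write a simple {article['lang']} summary of the following text:\n\n{text}"
        summary = llm.generate([prompt], params)[0].outputs[0].text.strip()
        results.append(
            {
                "title": article["title"],
                "text": text,
                "generated_summary": summary,
                "evaluation": evaluate_summary(text, summary),
            }
        )
    with open(output_path, "w") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)
    return results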

Test Run Example:

Below is a test run that generated a summary and evaluation scores for the Wikipedia article Polissoir:

$ python3 article_summary_pipeline.py --articles_input='[{"title": "Polissoir", "lang": "en"}]'
INFO 05-27 05:34:50 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 05-27 05:34:51 [__init__.py:239] Automatically detected platform rocm.
Created output directory: output
Initializing vLLM with model: /srv/app/models/aya-expanse-32b...
INFO 05-27 05:35:10 [config.py:716] This model supports multiple tasks: {'generate', 'embed', 'score', 'classify', 'reward'}. Defaulting to 'generate'.
INFO 05-27 05:35:19 [arg_utils.py:1699] rocm is experimental on VLLM_USE_V1=1. Falling back to V0 Engine.
INFO 05-27 05:35:19 [config.py:1770] Disabled the custom all-reduce kernel because it is not supported on current platform.
INFO 05-27 05:35:19 [llm_engine.py:242] Initializing a V0 LLM engine (v0.8.5.dev243+gc53e0730c) with config: model='/srv/app/models/aya-expanse-32b', speculative_config=None, tokenizer='/srv/app/models/aya-expanse-32b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=5296, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/srv/app/models/aya-expanse-32b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
INFO 05-27 05:35:19 [rocm.py:186] None is not supported in AMD GPUs.
INFO 05-27 05:35:19 [rocm.py:187] Using ROCmFlashAttention backend.
[W527 05:35:20.120227347 ProcessGroupNCCL.cpp:1028] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
INFO 05-27 05:35:20 [parallel_state.py:946] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 05-27 05:35:20 [model_runner.py:1120] Starting to load model /srv/app/models/aya-expanse-32b...
Loading safetensors checkpoint shards: 0% Completed | 0/14 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 7% Completed | 1/14 [00:02<00:37, 2.86s/it]
Loading safetensors checkpoint shards: 14% Completed | 2/14 [00:06<00:36, 3.04s/it]
Loading safetensors checkpoint shards: 21% Completed | 3/14 [00:09<00:33, 3.09s/it]
Loading safetensors checkpoint shards: 29% Completed | 4/14 [00:12<00:31, 3.12s/it]
Loading safetensors checkpoint shards: 36% Completed | 5/14 [00:15<00:27, 3.11s/it]
Loading safetensors checkpoint shards: 43% Completed | 6/14 [00:18<00:24, 3.09s/it]
Loading safetensors checkpoint shards: 50% Completed | 7/14 [00:19<00:16, 2.34s/it]
Loading safetensors checkpoint shards: 57% Completed | 8/14 [00:22<00:15, 2.52s/it]
Loading safetensors checkpoint shards: 64% Completed | 9/14 [00:25<00:13, 2.71s/it]
Loading safetensors checkpoint shards: 71% Completed | 10/14 [00:28<00:11, 2.85s/it]
Loading safetensors checkpoint shards: 79% Completed | 11/14 [00:31<00:08, 2.92s/it]
Loading safetensors checkpoint shards: 86% Completed | 12/14 [00:34<00:05, 2.98s/it]
Loading safetensors checkpoint shards: 93% Completed | 13/14 [00:37<00:03, 3.03s/it]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:40<00:00, 3.06s/it]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:40<00:00, 2.92s/it]

INFO 05-27 05:36:01 [loader.py:458] Loading weights took 41.25 seconds
INFO 05-27 05:36:02 [model_runner.py:1152] Model loading took 60.3496 GiB and 41.728975 seconds
INFO 05-27 05:36:07 [worker.py:287] Memory profiling takes 4.97 seconds
INFO 05-27 05:36:07 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (1.00) = 63.98GiB
INFO 05-27 05:36:07 [worker.py:287] model weights take 60.35GiB; non_torch_memory takes 0.29GiB; PyTorch activation peak memory takes 2.40GiB; the rest of the memory reserved for KV Cache is 0.95GiB.
INFO 05-27 05:36:07 [executor_base.py:112] # rocm blocks: 388, # CPU blocks: 1638
INFO 05-27 05:36:07 [executor_base.py:117] Maximum concurrency for 5296 tokens per request: 1.17x
INFO 05-27 05:36:08 [model_runner.py:1462] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|████████████████████████████████████████████████████████| 35/35 [00:25<00:00, 1.39it/s]
INFO 05-27 05:36:33 [model_runner.py:1604] Graph capturing finished in 25 secs, took 0.28 GiB
INFO 05-27 05:36:33 [llm_engine.py:443] init engine (profile, create kv cache, warmup model) took 31.35 seconds
vLLM initialized.
Loaded articles from JSON string input.
Fetching data for 1 articles...
Processing article: Polissoir (en)
Fetching 'Polissoir' from en.wikipedia.org...
Generating 1 summaries with vLLM (prompt_id: 01)...
Processed prompts: 100%|████████████████████████████| 1/1 [00:04<00:00, 4.13s/it, est. speed input: 35.37 toks/s, output: 15.99 toks/s]
INFO: Starting evaluations...
INFO: Calculating simplicity scores...
INFO: ReadabilityModel (for simplicity) will attempt to run on cuda:1.
Some weights of XLMRobertaModel were not initialized from the model checkpoint at /srv/app/models/xlm-roberta-longformer-base-4096 and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
pytorch_model.bin: 100%|████████████████████████████████████████████████████████| 1.12G/1.12G [00:03<00:00, 288MB/s]
tokenizer_config.json: 100%|████████████████████████████████████████████████████████| 670/670 [00:00<00:00, 5.67MB/s]
sentencepiece.bpe.model: 100%|████████████████████████████████████████████████████████| 5.07M/5.07M [00:00<00:00, 113MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████| 17.1M/17.1M [00:00<00:00, 371MB/s]
special_tokens_map.json: 100%|████████████████████████████████████████████████████████| 167/167 [00:00<00:00, 1.37MB/s]
/srv/venv/lib/python3.11/site-packages/sklearn/base.py:380: InconsistentVersionWarning: Trying to unpickle estimator LinearRegression from version 1.3.2 when using version 1.6.1. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
 warnings.warn(
INFO: ReadabilityModel (for simplicity) will attempt to run on cuda:1.
Some weights of XLMRobertaModel were not initialized from the model checkpoint at /srv/app/models/xlm-roberta-longformer-base-4096 and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO: Simplicity scores calculated.
INFO: Calculating fluency scores...
INFO: Fluency scores calculated.
INFO: Calculating meaning preservation scores...
INFO: SummaCZS (for meaning preservation) is configured to run on cpu.
tokenizer_config.json: 100%|████████████████████████████████████████████████████████| 217/217 [00:00<00:00, 1.85MB/s]
config.json: 100%|████████████████████████████████████████████████████████| 1.09k/1.09k [00:00<00:00, 9.14MB/s]
spiece.model: 100%|████████████████████████████████████████████████████████| 760k/760k [00:00<00:00, 149MB/s]
special_tokens_map.json: 100%|████████████████████████████████████████████████████████| 156/156 [00:00<00:00, 1.23MB/s]
model.safetensors: 100%|████████████████████████████████████████████████████████| 235M/235M [00:00<00:00, 393MB/s]
INFO: Meaning preservation scores calculated.
INFO: Calculating language preservation scores...
INFO: Language preservation scores calculated.
INFO: Calculating tone scores (peacock)...
INFO: Peacock model (for tone evaluation) is configured to run on cpu.
Device set to use cpu
/srv/venv/lib/python3.11/site-packages/transformers/pipelines/text_classification.py:106: UserWarning: `return_all_scores` is now deprecated, if want a similar functionality use `top_k=None` instead of `return_all_scores=True` or `top_k=1` instead of `return_all_scores=False`.
 warnings.warn(
INFO: Tone scores (peacock) calculated.
INFO: All evaluations finished.

--- Pipeline Results (JSON) ---
[
  {
    "title": "Polissoir",
    "text": "A polissoir (French for \"polisher\") or polishing stone is a Neolithic stone tool used for polishing and sharpening stone objects, particularly axes. Polissoirs contrast with grindstones, which are stones used to grind or sharpen ferrous objects. These artifacts, dating to approximately 5,000 years ago, provide insight into the technological advancements and craftsmanship of Neolithic societies.",
    "language_name": "En",
    "language_db": "en",
    "generated_summary": "A polissoir, or polishing stone, is a tool from the Neolithic era, around 5,000 years ago. People used it to polish and sharpen stone axes. Unlike grindstones, which sharpen metal tools, polissoirs were for stone objects. They show how skilled and technologically advanced Neolithic people were.",
    "evaluation": {
      "simplicity_fkgl_model": 8.13,
      "simplicity_fkgl_diff": -5.74,
      "fluency_nerrors_lt": 0.0,
      "meaning_preservation_summac": 0.96,
      "language_preservation_langid": 1.0,
      "tone_peacock": 0.35823455452919006
    }
  }
]
--- End of Pipeline Results ---

Results saved to output/pipeline_results.json
Article summary pipeline finished.

Currently the pipeline tightly couples model loading and inference in both summary generation and evaluation, just like the Research team's prototype. The next step will be to decouple these functionalities, as we do in the ML isvcs.

I have decoupled the monolithic pipeline we ran in T395246#10858281 into two KServe custom model-servers:

1. Summary Generation Server

A KServe custom model-server for generating simple article summaries using the vLLM backend. It loads the aya-expanse-32b model using vLLM and handles input preprocessing, prompt formatting, and summary generation. Below are results of a test run, followed by a rough sketch of the server's overall shape:

1.1. Start this model-server to serve the aya-expanse-32b model with vLLM:

$ python3 summary_generation_server.py
INFO 05-29 05:07:16 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 05-29 05:07:17 [__init__.py:239] Automatically detected platform rocm.
Initializing vLLM engine with model: /srv/app/models/aya-expanse-32b...
INFO 05-29 05:07:34 [config.py:716] This model supports multiple tasks: {'reward', 'generate', 'embed', 'classify', 'score'}. Defaulting to 'generate'.
INFO 05-29 05:07:42 [arg_utils.py:1699] rocm is experimental on VLLM_USE_V1=1. Falling back to V0 Engine.
INFO 05-29 05:07:42 [config.py:1770] Disabled the custom all-reduce kernel because it is not supported on current platform.
INFO 05-29 05:07:42 [llm_engine.py:242] Initializing a V0 LLM engine (v0.8.5.dev243+gc53e0730c) with config: model='/srv/app/models/aya-expanse-32b', speculative_config=None, tokenizer='/srv/app/models/aya-expanse-32b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=5296, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/srv/app/models/aya-expanse-32b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
INFO 05-29 05:07:42 [rocm.py:186] None is not supported in AMD GPUs.
INFO 05-29 05:07:42 [rocm.py:187] Using ROCmFlashAttention backend.
[W529 05:07:42.894186313 ProcessGroupNCCL.cpp:1028] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
INFO 05-29 05:07:42 [parallel_state.py:946] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 05-29 05:07:42 [model_runner.py:1120] Starting to load model /srv/app/models/aya-expanse-32b...
Loading safetensors checkpoint shards: 0% Completed | 0/14 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 7% Completed | 1/14 [00:02<00:36, 2.79s/it]
Loading safetensors checkpoint shards: 14% Completed | 2/14 [00:05<00:35, 2.97s/it]
Loading safetensors checkpoint shards: 21% Completed | 3/14 [00:08<00:33, 3.03s/it]
Loading safetensors checkpoint shards: 29% Completed | 4/14 [00:12<00:30, 3.06s/it]
Loading safetensors checkpoint shards: 36% Completed | 5/14 [00:15<00:27, 3.05s/it]
Loading safetensors checkpoint shards: 43% Completed | 6/14 [00:18<00:24, 3.04s/it]
Loading safetensors checkpoint shards: 50% Completed | 7/14 [00:18<00:16, 2.30s/it]
Loading safetensors checkpoint shards: 57% Completed | 8/14 [00:21<00:14, 2.46s/it]
Loading safetensors checkpoint shards: 64% Completed | 9/14 [00:24<00:13, 2.66s/it]
Loading safetensors checkpoint shards: 71% Completed | 10/14 [00:27<00:11, 2.79s/it]
Loading safetensors checkpoint shards: 79% Completed | 11/14 [00:30<00:08, 2.87s/it]
Loading safetensors checkpoint shards: 86% Completed | 12/14 [00:34<00:05, 2.93s/it]
Loading safetensors checkpoint shards: 93% Completed | 13/14 [00:37<00:02, 2.98s/it]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:40<00:00, 3.00s/it]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:40<00:00, 2.87s/it]

INFO 05-29 05:08:24 [loader.py:458] Loading weights took 40.47 seconds
INFO 05-29 05:08:24 [model_runner.py:1152] Model loading took 60.3496 GiB and 41.101542 seconds
INFO 05-29 05:08:29 [worker.py:287] Memory profiling takes 5.09 seconds
INFO 05-29 05:08:29 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (1.00) = 63.98GiB
INFO 05-29 05:08:29 [worker.py:287] model weights take 60.35GiB; non_torch_memory takes 0.29GiB; PyTorch activation peak memory takes 2.40GiB; the rest of the memory reserved for KV Cache is 0.95GiB.
INFO 05-29 05:08:29 [executor_base.py:112] # rocm blocks: 388, # CPU blocks: 1638
INFO 05-29 05:08:29 [executor_base.py:117] Maximum concurrency for 5296 tokens per request: 1.17x
INFO 05-29 05:08:30 [model_runner.py:1462] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|████████████████████████████████████████████████████████| 35/35 [00:25<00:00, 1.38it/s]
INFO 05-29 05:08:55 [model_runner.py:1604] Graph capturing finished in 25 secs, took 0.28 GiB
INFO 05-29 05:08:55 [llm_engine.py:443] init engine (profile, create kv cache, warmup model) took 31.66 seconds
vLLM engine initialized.
Loading prompt dictionary...
Prompt dictionary loaded.
Model summary-generation loaded and ready.
2025-05-29 05:08:55.901 584 kserve INFO [model_server.py:register_model():398] Registering model: summary-generation
2025-05-29 05:08:55.902 584 kserve INFO [model_server.py:setup_event_loop():278] Setting max asyncio worker threads as 32
2025-05-29 05:08:55.920 584 kserve INFO [server.py:_register_endpoints():108] OpenAI endpoints registered
2025-05-29 05:08:55.920 584 kserve INFO [server.py:start():161] Starting uvicorn with 1 workers
2025-05-29 05:08:55.931 584 uvicorn.error INFO: Started server process [584]
2025-05-29 05:08:55.931 584 uvicorn.error INFO: Waiting for application startup.
2025-05-29 05:08:55.934 584 kserve INFO [server.py:start():70] Starting gRPC server with 4 workers
2025-05-29 05:08:55.934 584 kserve INFO [server.py:start():71] Starting gRPC server on [::]:8081
2025-05-29 05:08:55.934 584 uvicorn.error INFO: Application startup complete.
2025-05-29 05:08:55.934 584 uvicorn.error INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)

1.2. Query the isvc with a test request:

$ curl -s localhost:8080/v1/models/summary-generation:predict -X POST -d '{ "instances": [ { "title": "Polissoir", "lang": "en", "prompt_id": "01", "max_new_tokens": 256 } ] }' -i -H "Content-type: application/json"
HTTP/1.1 200 OK
date: Thu, 29 May 2025 05:09:47 GMT
server: uvicorn
content-length: 340
content-type: application/json

{
  "predictions": [
    {
      "generated_summary": "A polissoir, or polishing stone, is a tool from the Neolithic period, around 5,000 years ago. People used it to polish and sharpen stone axes. Unlike grindstones, which sharpen metal tools, polissoirs were for stone objects. They show how skilled and technologically advanced Neolithic people were."
    }
  ]
}
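For orientation, a KServe custom predictor of this kind typically has the shape below. This is a minimal sketch under assumptions, not the server's actual code; in particular, the prompt handling is a placeholder, whereas the real server fetches the article text and applies the template selected by prompt_id:

from typing import Dict

from kserve import Model, ModelServer
from vllm import LLM, SamplingParams


class SummaryGenerationModel(Model):
    def __init__(self, name: str):
        super().__init__(name)
        self.llm = None
        self.load()

    def load(self):
        # Load the aya-expanse-32b weights once at startup.
        self.llm = LLM(model="/srv/app/models/aya-expanse-32b")
        self.ready = True

    def build_prompt(self, instance: Dict) -> str:
        # Placeholder prompt; the real server formats the fetched article text
        # with the prompt template identified by prompt_id.
        return (
            f"Write a simple summary of the Wikipedia article "
            f"'{instance['title']}' ({instance['lang']})."
        )

    def predict(self, payload: Dict, headers: Dict[str, str] = None) -> Dict:
        predictions = []
        for instance in payload["instances"]:
            params = SamplingParams(max_tokens=instance.get("max_new_tokens", 256))
            output = self.llm.generate([self.build_prompt(instance)], params)[0]
            predictions.append({"generated_summary": output.outputs[0].text.strip()})
        return {"predictions": predictions}


if __name__ == "__main__":
    ModelServer().start([SummaryGenerationModel("summary-generation")])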

2. Summary Evaluation Server

A KServe custom model-server for evaluating simple article summaries using various metrics. It loads all necessary models (using huggingface transformers) and resources (Readability, SummaCZS for meaning preservation, Peacock for tone, LanguageTool for fluency, NLTK for text processing, and the LiftWing API for language detection) and calculates a suite of quality metrics based on the original text, generated summary, and language. Below are results of a test run:

2.1. Start this model-server to serve multiple models with huggingface transformers:

$ python3 summary_evaluation_server.py
SummaryEvaluationModel configured to use device: cpu
Loading evaluation models...
Loading ReadabilityModel (TRank) from trokhymovych/TRank_readability with base /srv/app/models/xlm-roberta-longformer-base-4096...
pytorch_model.bin: 100%|████████████████████████████████████████████████████████| 1.12G/1.12G [00:02<00:00, 483MB/s]
tokenizer_config.json: 100%|████████████████████████████████████████████████████████| 670/670 [00:00<00:00, 6.34MB/s]
sentencepiece.bpe.model: 100%|████████████████████████████████████████████████████████| 5.07M/5.07M [00:00<00:00, 176MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████| 17.1M/17.1M [00:00<00:00, 156MB/s]
special_tokens_map.json: 100%|████████████████████████████████████████████████████████| 167/167 [00:00<00:00, 1.34MB/s]
/srv/venv/lib/python3.11/site-packages/sklearn/base.py:380: InconsistentVersionWarning: Trying to unpickle estimator LinearRegression from version 1.3.2 when using version 1.6.1. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
 warnings.warn(
Readability components loaded.
Loading SummaCZS model (vitc) on cpu...
SummaCZS model loaded.
Loading Peacock model from /srv/app/models/edit-check/peacock/ on cpu...
Device set to use cpu
Peacock model pipeline loaded.
Model summary-evaluation loaded and ready.
2025-05-29 05:01:27.223 36 kserve INFO [model_server.py:register_model():398] Registering model: summary-evaluation
2025-05-29 05:01:27.224 36 kserve INFO [model_server.py:setup_event_loop():278] Setting max asyncio worker threads as 32
INFO 05-29 05:01:27 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 05-29 05:01:27 [__init__.py:239] Automatically detected platform rocm.
2025-05-29 05:01:28.742 36 kserve INFO [server.py:_register_endpoints():108] OpenAI endpoints registered
2025-05-29 05:01:28.742 36 kserve INFO [server.py:start():161] Starting uvicorn with 1 workers
2025-05-29 05:01:28.758 36 uvicorn.error INFO: Started server process [36]
2025-05-29 05:01:28.758 36 uvicorn.error INFO: Waiting for application startup.
2025-05-29 05:01:28.764 36 kserve INFO [server.py:start():70] Starting gRPC server with 4 workers
2025-05-29 05:01:28.764 36 kserve INFO [server.py:start():71] Starting gRPC server on [::]:8081
2025-05-29 05:01:28.764 36 uvicorn.error INFO: Application startup complete.
2025-05-29 05:01:28.764 36 uvicorn.error INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)

2.2. Query the isvc with a test request:

$ curl -s localhost:8080/v1/models/summary-evaluation:predict -X POST -d '{ "instances": [ { "original_text": "A polissoir (French for polisher) or polishing stone is a Neolithic stone tool used for polishing and sharpening stone objects, particularly axes. Polissoirs contrast with grindstones, which are stones used to grind or sharpen ferrous objects. These artifacts, dating to approximately 5,000 years ago, provide insight into the technological advancements and craftsmanship of Neolithic societies.", "generated_summary": "A polissoir, or polishing stone, is a tool from the Neolithic era, around 5,000 years ago. People used it to polish and sharpen stone axes. Unlike grindstones, which sharpen metal tools, polissoirs were for stone objects. They show how skilled and technologically advanced Neolithic people were.", "language_db": "en" } ] }' -i -H "Content-type: application/json"
HTTP/1.1 200 OK
date: Thu, 29 May 2025 05:05:45 GMT
server: uvicorn
content-length: 335
content-type: application/json

{
  "predictions": [
    {
      "simplicity_fkgl_model": 8.13,
      "simplicity_fkgl_original": 13.87,
      "simplicity_fkgl_diff": -5.74,
      "fluency_nerrors_lt": 0.0,
      "meaning_preservation_summac": 0.99,
      "language_preservation_detected_code": "en",
      "language_preservation_detection_score": 0.9853431582450867,
      "language_preservation_correct_lang": 1.0,
      "tone_peacock": 0.3582
    }
  ]
}

The next step is to work on a KServe custom transformer that will orchestrate the two model-servers. We usually connect one transformer to one predictor, as shown in the outlink_topic_model model-server (OutlinkTransformer, OutlinksTopicModel).

In this case, we want to connect one transformer to two predictors (SummaryGenerationModel and SummaryEvaluationModel). I am looking at the KServe multi-model transformer docs to understand how best to implement this orchestration.
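One possible shape for that orchestration is sketched below, with the generation predictor reached through the usual predictor_host mechanism and the evaluation service called explicitly from postprocess. Whether this is the right pattern is exactly what I am still investigating; the evaluation URL and the hard-coded language are assumptions, and the original text would still need to be threaded through to the evaluation call:

from typing import Dict

import requests
from kserve import Model


class SummaryTransformer(Model):
    def __init__(self, name: str, predictor_host: str, evaluation_url: str):
        super().__init__(name)
        self.predictor_host = predictor_host  # summary-generation predictor
        self.evaluation_url = evaluation_url  # summary-evaluation model-server
        self.ready = True

    def preprocess(self, payload: Dict, headers: Dict[str, str] = None) -> Dict:
        # Pass the generation request through to the predictor unchanged.
        return payload

    def postprocess(self, outputs: Dict, headers: Dict[str, str] = None) -> Dict:
        # Enrich each generated summary with metrics from the evaluation service.
        # (Simplification: the evaluation server also needs the original text,
        # which would have to be carried over from preprocess.)
        instances = [
            {"generated_summary": p["generated_summary"], "language_db": "en"}
            for p in outputs["predictions"]
        ]
        scores = requests.post(
            self.evaluation_url, json={"instances": instances}, timeout=120
        ).json()["predictions"]
        for prediction, score in zip(outputs["predictions"], scores):
            prediction["evaluation"] = score
        return outputs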

First of all, the above is great work, Kevin! However, since this is not yet a production-level service (nor has it been requested as such), I’d suggest we keep things simpler for now. That would also help limit the amount of work required for this task.
The initial request can just be tackled within a notebook, but we want to use the vLLM image so that we can a) generate the summaries faster, and b) easily move this to production in the future.
So I think we just need a notebook that, given a list of article titles:

  1. generates the summaries and stores them in JSON files,
  2. enriches this JSON with the aforementioned evaluation metrics.

The first is required and the second is nice to have for additional evaluation.

If we still want to go down the route of having this as a KServe model server, which would make it really easy to move to production, I would suggest we use a single predictor with pre- and post-process functions.
In both cases (notebook or model server), the deliverable would be the JSON files from steps 1 and 2.


Thank you for providing clarification on the current requirements for this project. Following the meeting we had, I created a Jupyter notebook that reads files from the ArticleSummaries/resources/summaries/enwiki/ directory, sends requests to the SummaryGenerationModel service that is currently running on ml-lab1002, and saves the generated summaries back to .json files in a new output directory (e.g., enwiki-20250529).

The notebook has been added to our gitlab repository: https://gitlab.wikimedia.org/repos/machine-learning/exploratory-notebook/-/tree/main/simple-summaries
and instructions on how to run it in the README.md file: https://gitlab.wikimedia.org/repos/machine-learning/exploratory-notebook/-/blob/main/simple-summaries/README.md
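
At its core, the notebook's generation loop looks roughly like the following; the directory names, output schema, and service URL here are illustrative placeholders, and the README has the actual usage instructions:

import json
from pathlib import Path

import requests

INPUT_DIR = Path("ArticleSummaries/resources/summaries/enwiki")
OUTPUT_DIR = Path("summaries/enwiki-20250529")
SERVICE_URL = "http://localhost:8080/v1/models/summary-generation:predict"

OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
for input_file in sorted(INPUT_DIR.glob("*.json")):
    article = json.loads(input_file.read_text())
    request = {"instances": [{"title": article["title"], "lang": "en", "prompt_id": "04"}]}
    response = requests.post(SERVICE_URL, json=request, timeout=300).json()
    summary = response["predictions"][0]["generated_summary"]
    output = {"title": article["title"], "summary": summary}
    (OUTPUT_DIR / input_file.name).write_text(
        json.dumps(output, ensure_ascii=False, indent=2)
    )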

Using prompt ID 04, as you suggested, I generated summaries for the 5341 articles in the enwiki directory. The results are available at: /home/kevinbazira/simple-summaries/summaries/enwiki-20250529 on ml-lab1002.

The next step will be to add a summary evaluation step to the notebook, which will rely on the SummaryEvaluationModel service on ml-lab1002.

Awesome work Kevin!
@kevinbazira Could you rerun these next week using prompt 05 as well? Looking at T389845, I noticed that it is prompt number 05 that matches the description in that task.
I apologize for the extra work created here. I should have cross-checked before you started generating them.

Could you rerun these next week using prompt 05 as well? Looking at T389845, I noticed that it is prompt number 05 that matches the description in that task.

Sure sure, no problem! Using prompt ID 05, I generated summaries for the 5341 articles in the enwiki directory.
The results are available at: /home/kevinbazira/exploratory-notebook/simple-summaries/summaries/enwiki-20250601 on ml-lab1002.

Thank you! I see some of the filenames are enclosed in either single or double quotes, while the original ones are not. Can you make sure the filenames match the original ones for future generations of the same summaries?
How many hours does it take approximately to generate the ~5000 summaries?

I think one is missing from the directory. If I'm not mistaken it is ...And Justice for All (album).json

I see some of the filenames are enclosed in either single or double quotes, while the original ones are not. Can you make sure the filenames match the original ones for future generations of the same summaries?

The notebook used to generate summaries does not modify the filenames. These changed in an extra step that handled summaries that could not be generated because the titles provided don't exist on Wikipedia (see details in P76726#308568). The filenames in the output directory now match the original ones, but the titles have been updated to those that exist on Wikipedia.

How many hours does it take approximately to generate the ~5000 summaries?

At the moment, it takes about 8 hours to generate summaries for ~5000 articles using the SummaryGenerationModel service on ml-lab1002. This performance could be improved (especially if we are to add it to prod); key optimizations could include adding async calls, batch inference, etc. Following your previous suggestion (T395246#10866609) about prioritizing a simpler implementation for now, these optimizations have not yet been integrated.
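
For reference, the async-call idea could look roughly like this; it is a sketch only, not part of the notebook, and the concurrency limit, service URL, and prompt ID are untested placeholders:

import asyncio

import aiohttp

SERVICE_URL = "http://localhost:8080/v1/models/summary-generation:predict"


async def generate_one(session, semaphore, title):
    # Limit in-flight requests so the single-GPU backend isn't overwhelmed.
    async with semaphore:
        payload = {"instances": [{"title": title, "lang": "en", "prompt_id": "05"}]}
        async with session.post(SERVICE_URL, json=payload) as resp:
            body = await resp.json()
            return title, body["predictions"][0]["generated_summary"]


async def generate_all(titles, max_concurrent=4):
    semaphore = asyncio.Semaphore(max_concurrent)
    async with aiohttp.ClientSession() as session:
        tasks = [generate_one(session, semaphore, t) for t in titles]
        return dict(await asyncio.gather(*tasks))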

I have added an optional summary evaluation step to the notebook, which uses the SummaryEvaluationModel service running on ml-lab1002. This step collects the original text and generated summaries during the summary generation process, then batches and sends them to the evaluation service. The evaluation results are saved to a .json file for further analysis (e.g., summaries/enwiki-20250529/evaluation_results.json).

I've also added documentation to make the notebook easier to use for both summary generation and evaluation tasks.

Thank you for generating the summaries! Can we use this pipeline/notebook to do the following?
The first step would be to read the titles, generate the summaries, and save one JSON file per article to a directory:

{
  "title": "1",
  "summary": "The number 1 is the first and smallest positive integer, serving as the foundation for counting and measurement. It signifies the leading or top element in a group and has various uses across fields like science and sports. In mathematics, 1 is the multiplicative identity, leaving any number unchanged when multiplied. It is not considered a prime number. Digitally, 1 represents the \"on\" state in binary code. Philosophically, 1 often symbolizes the ultimate reality or source of existence. Historically, its representation evolved from ancient symbols to the modern Arabic numeral."
}

During the second step (evaluation), we again produce one enriched JSON per file and save it to a different directory:

{
  "title": "1",
  "summary": "The number 1 is the first and smallest positive integer, serving as the foundation for counting and measurement. It signifies the leading or top element in a group and has various uses across fields like science and sports. In mathematics, 1 is the multiplicative identity, leaving any number unchanged when multiplied. It is not considered a prime number. Digitally, 1 represents the \"on\" state in binary code. Philosophically, 1 often symbolizes the ultimate reality or source of existence. Historically, its representation evolved from ancient symbols to the modern Arabic numeral.",
  "simplicity_fkgl_model": 10.56,
  "simplicity_fkgl_original": 8.38,
  "simplicity_fkgl_diff": 2.18,
  "fluency_nerrors_lt": 0.0,
  "meaning_preservation_summac": 0.05,
  "language_preservation_detected_code": "en",
  "language_preservation_detection_score": 0.9638887643814087,
  "language_preservation_correct_lang": 1.0,
  "tone_peacock": 0.6206
}
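
For illustration, this enrichment step could be roughly the following loop; the directory names, the service URL, and the assumption that the original text is available alongside each summary are all placeholders:

import json
from pathlib import Path

import requests

SUMMARY_DIR = Path("summaries/enwiki-20250601")
EVAL_DIR = Path("summaries/enwiki-20250601-evaluated")
EVAL_URL = "http://localhost:8080/v1/models/summary-evaluation:predict"

EVAL_DIR.mkdir(parents=True, exist_ok=True)
for summary_file in sorted(SUMMARY_DIR.glob("*.json")):
    record = json.loads(summary_file.read_text())
    payload = {
        "instances": [
            {
                # The evaluation service compares against the original text,
                # so that text has to be carried along or re-fetched.
                "original_text": record.get("original_text", ""),
                "generated_summary": record["summary"],
                "language_db": "en",
            }
        ]
    }
    metrics = requests.post(EVAL_URL, json=payload, timeout=300).json()["predictions"][0]
    (EVAL_DIR / summary_file.name).write_text(
        json.dumps({**record, **metrics}, ensure_ascii=False, indent=2)
    )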

The deliverable for this work is the generated summaries and the evaluation metrics (two different directories). I am going to share the extracted summaries with the Web team and will also share the evaluation metrics when we have them.

Looking at the summaries available in /home/kevinbazira/simple-summaries/summaries/enwiki-2025060, I've noticed the following inconsistencies with the original filenames:

  • Missing summary: I think this summary is still missing.

I think one is missing from the directory. If I'm not mistaken it is ...And Justice for All (album).json

  • I have found the following differences in filenames. There are more which follow the same patterns (underscore instead of whitespace, /_ instead of whitespace, etc.). The article titles in the JSON seem to be OK.
    • "Hello,_World!"_program.json instead of Hello, World! program.json
    • Furiosa/_A_Mad_Max_Saga.json instead of Furiosa A Mad Max Saga.json
    • Demon_Slayer/_Kimetsu_no_Yaiba.json instead of Demon Slayer Kimetsu no Yaiba.json
    • El_Camino/_A_Breaking_Bad_Movie.json instead of El Camino A Breaking Bad Movie.json
    • Hanseatic_League.json instead of Hanseatic League.json
    • Horizon_Zero_Dawn.json instead of Horizon Zero Dawn.json
    • Jude_the_Apostle.json instead of Jude the Apostle.json
    • ...

Can we generate the summary for the missing article?
Regarding the filename differences, I'd suggest not doing anything for the time being and waiting to see if the Web team has any issue with this.


@isarantopoulos you are referencing an old directory. As mentioned in T395246#10875933 and on IRC earlier today, the current data is in: /home/kevinbazira/exploratory-notebook/simple-summaries/summaries/enwiki-20250601

Ack! sorry for the confusion. Also there is no missing summary, as it was listed as a hidden file :) We can totally disregard my previous comment then!
Thank you Kevin!

Question about the SummaryEvaluationModel service:

2. Summary Evaluation Server

A KServe custom model-server for evaluating simple article summaries using various metrics. It loads all necessary models (using huggingface transformers) and resources (Readability, SummaCZS for meaning preservation, Peacock for tone, LanguageTool for fluency, NLTK for text processing, and the LiftWing API for language detection) and calculates a suite of quality metrics based on the original text, generated summary, and language.

I think this is not using the Research team's multilingual readability model, which is on LiftWing (more info). Maybe I am misunderstanding? Did the team decide that the multilingual model wasn't suitable for this task? I do notice that the readability score object that the multilingual model emits does provide fewer quality metrics than the evaluation in the notebook does. Or is there maybe some other reason for choosing a different approach?

I just started digging into understanding this project today (because of a conversation thread on Bluesky) so I'm likely missing something. :-)

mfossati subscribed.

Thank you ML folks for your work!
We're decommissioning the ArticleSummaries MediaWiki extension in T411558: ArticleSummaries: Decommission the extension (code changes) and T411560: ArticleSummaries: Decommission the extension (documentation updates), so I'm closing this task. Feel free to re-open if you plan to work on it again.

isarantopoulos changed the task status from Invalid to Resolved. Mon, Apr 20, 5:33 AM