
Identify, select, and implement relevant quality metrics for evaluation
Closed, ResolvedPublic

Description

When evaluating the effect of, e.g., quantization of the simple summaries model, we also need to make sure that the quality of the model output is still reasonable. In previous work, we identified 3 potential guardrail metrics for the simple summaries model (readability, fluency, meaning preservation).

In this task, we want to systematize and expand the set of metrics.
Relevant aspects to consider:

  • Should we consider additional aspects such as NPOV? Are there existing metrics to operationalize this aspect in a guardrail metric?
  • Can we implement the metrics for languages beyond English? For example, for readability we now consider the Flesch-Kincaid grade level which works well for English; however, we can use the multilingual readability model to score readability for other languages.

Details

Due Date
Mar 31 2025, 4:00 AM

Event Timeline

We initially considered 3 guardrail metrics from simplification research:

  • Simplicity
  • Fluency
  • Meaning preservation

I am now considering 2 additional guardrail metrics that are relevant to ensure quality of the model output.

Language confusion: This measures whether the output of the model is in the correct language.

Tone: This measures whether the simple summary is written in an encyclopedic tone.

  • Background: In the first round of experiments, we discovered that sometimes the tone of the generated simple summaries is not very encyclopedic. We would thus like to automatically detect such cases.
  • Idea: @diego developed a peacock detection model to detect policy violations supporting Edit Check (T368274). This is based on the peacock template indicating an article "contains wording that promotes the subject in a subjective manner without imparting real information". We can use the model to detect similar issues in the simple summaries generated by the model. Specifically, we use a multilingual version of the model all_bert-base-multilingual-cased_peacock_512/checkpoint-9662 (see notebook with example implementation).
  • Implementation: For each simple summary, we obtain a score between 0 (tone is good) and 1 (tone is not encyclopedic).
  • Example: From the first round of experiments, the highest-scoring simple summary (indicating that the tone is not encyclopedic) is (bold added by me):

São Paulo is Brazil's biggest city and an important place in the world for business, art, and fun. It's named after Paul the Apostle and has people from many countries living there. The city started with Jesuit priests in 1554 and grew strong during the coffee trade. Now, it's a huge economic center with lots of big companies and a cool cultural scene. São Paulo hosts big events like the World Cup and has awesome museums, parks, and festivals. It's also home to super tall buildings!

Great ideas both of them and I'm liking this general concept of using models that we develop to support editors to also help with evaluating proposed generative models! Tone can be at least a starting proxy for NPOV. "Meaning preservation" I think captures the No Original Research ideas -- essentially checking for hallucinations? And then I guess in this case of simple summaries, Verifiability is not something we're expecting the model to do (providing citations) so it's not worthwhile to have a metric aimed at that (example).

"Meaning preservation" I think captures the No Original Research ideas -- essentially checking for hallucinations?

Exactly. The idea is to check whether there is anything in the simple summary that is not supported by statements in the original article.

Implemented 5 quality metrics:

  • Simplicity: Change in readability score, i.e. the difference in the Flesch-Kincaid grade level between simple summary and original as FKGL_summary - FKGL_original. Thus, a negative score indicates that the simple summary is easier to read. In contrast, a neutral or positive score will indicate that the simple summary is not easier to read than the original.
  • Fluency: Number of grammatical errors in the simple summary. The fewer, the better.
  • Meaning preservation: Confidence score that content in the simple summary is supported by the text of the original. Low scores might indicate hallucinations.
  • Language confusion: Confidence score, from a language detection model, that the simple summary is written in the expected language. A low score indicates that the simple summary is not in the correct language.
  • Tone: Confidence score that the tone is not encyclopedic using the peacock detection model. A high score indicates that the tone is not encyclopedic.
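As an illustration of the Simplicity metric above, here is a minimal sketch of the FKGL difference for English text. It uses the standard Flesch-Kincaid grade level formula; the vowel-group syllable counter and the function names (`count_syllables`, `fkgl`, `simplicity_score`) are assumptions made for this example and are not taken from utils_eval.py, which may rely on a dedicated readability library instead.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: number of consecutive-vowel groups (heuristic, assumption)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def fkgl(text: str) -> float:
    """Flesch-Kincaid grade level of an English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


def simplicity_score(original: str, summary: str) -> float:
    """FKGL_summary - FKGL_original; negative means the summary is easier to read."""
    return fkgl(summary) - fkgl(original)


# Made-up example texts, for illustration only.
original = (
    "The municipality constitutes a significant international hub for "
    "commerce, culture, and entertainment, exerting considerable influence."
)
summary = "The city is big. It is known for trade and art."

print(f"simplicity score: {simplicity_score(original, summary):.2f}")
```

A negative value printed here corresponds to the "easier to read" case described in the Simplicity bullet; the heuristic syllable count makes the absolute grade levels approximate.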

Documentation:

I consider these to give a good overview of the quality of different aspects of the model and to capture the main problems identified in the first round of experiments. Therefore, I will not focus on expanding the set of metrics further. Instead, the main challenge is to extend and verify the metrics for languages beyond English.

Thus, my main priority next week will be to implement the multilingual readability model and to identify a multilingual alternative for meaning preservation.

I adapted the quality metrics to evaluate simple summaries in a multilingual setting (see utils_eval.py https://gitlab.wikimedia.org/repos/research/simple-summaries/-/blob/main/utils_eval.py).
The evaluation can now also be run on the multilingual benchmark (see simple-summary_experiment-02_benchmark_evaluate.ipynb https://gitlab.wikimedia.org/repos/research/simple-summaries/-/blob/main/simple-summary_experiment-02_benchmark_evaluate.ipynb)

Which languages are supported for the evaluation?

  • Fluency is implemented using LanguageTool, which supports 31 languages: ar ast be br ca crh da de el en eo es fa fr ga gl it ja km nl pl pt ro ru sk sl sv ta tl uk zh. If additional languages are required, one relatively straightforward option would be to use spellcheckers for that specific language using libraries such as pyenchant. Open spellcheckers can be found via, e.g., LibreOffice Language Support (bn, cs, etc.)
  • Language confusion is implemented using a Language identification model which supports 188 languages (that are explicitly matched with a Wikipedia project). So this should capture most relevant cases for now.
  • Simplicity, Meaning preservation, and Tone are implemented using different fine-tuned smaller multilingual language models. They do not have a well-defined language coverage. Their backbone models support ~100 languages. They have been explicitly validated for 10-20 languages in zero-shot settings with good results. Therefore, they likely also generalize well for other languages but it often depends on how well that language is captured by the model. A good compromise here is to work with the top-20 languages (ar,cs,de,en,es,fa,fr,he,id,it,ja,nl,no,ro,ru,pl,pt,tr,uk,zh)
    • Simplicity. This uses the readability model from Trokhymovich et al. 2024. It has been explicitly validated to work in a zero-shot scenario (i.e. without being fine-tuned on those languages explicitly) for: ca, de, el, es, eu, fr, hy, it, nl, oc, pt, ru, scn (it was trained on English).
    • Meaning preservation. This uses the summaC model from Laban et al. 2022. In the original paper it has been only validated on English data. A recent paper by Kang et al. 2024 evaluated summaC (among other methods) in a multilingual setting to detect hallucinations in text generation. They conclude that i) “[summaC] effectively detect sentence-level hallucinations in high-resource languages when compared to human evaluations” and ii) “[summaC] outperform supervised approaches at detecting hallucinations that can be verified or refuted by the reference text”. They mention “NLI metrics” but use summaC for the implementation (“we adopt the NLI-based zero-shot sentence-level SUMMAC (SummaCzs) scoring system (Laban et al., 2021) to evaluate hallucinations.”). The high-resource languages considered are: en, es, fr, id, vi, zh. In addition, I qualitatively checked the examples in German from multi-core-01 benchmark (generated with prompt_id=01) and found that high scores (>0.5) constituted simple summaries with preserved meaning, while low scores (<0.5) showed some form of hallucinations.
    • Tone. This uses the peacock detection model from @diego developed in T368274: Detecting Peacock behavior with LLMs. The multilingual version of the model has been validated for 10 languages: ar, de, en, es, fr, ja, nl, pt, ru, zh. It is suspected that the model works well for other languages as well (especially those among the top-20 or so language versions in Wikipedia), but evaluation is currently ongoing in T387925: Determine language support for Peacock Check (v1).
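To make the coverage described above easier to query, the following sketch collects the explicitly listed languages per metric and looks up which metrics have been validated for a given language. The language codes are copied from the lists above; the names `VALIDATED`, `validated_metrics`, and `fully_validated` are hypothetical and not part of the repository. Language confusion is left out because the 188 supported languages are not enumerated here, and the qualitative German check for meaning preservation is not counted as validation.

```python
# Language codes copied from the coverage lists above.
VALIDATED = {
    # LanguageTool languages (31)
    "fluency": set(
        "ar ast be br ca crh da de el en eo es fa fr ga gl it ja km "
        "nl pl pt ro ru sk sl sv ta tl uk zh".split()
    ),
    # Readability model: trained on English, zero-shot validated on the rest
    "simplicity": set("en ca de el es eu fr hy it nl oc pt ru scn".split()),
    # summaC: high-resource languages evaluated by Kang et al. 2024
    "meaning_preservation": set("en es fr id vi zh".split()),
    # Multilingual peacock detection model
    "tone": set("ar de en es fr ja nl pt ru zh".split()),
}


def validated_metrics(lang: str) -> list[str]:
    """Return the metrics (alphabetically) that list `lang` as validated."""
    return sorted(name for name, langs in VALIDATED.items() if lang in langs)


def fully_validated() -> set[str]:
    """Return the languages listed as validated for all four metrics."""
    return set.intersection(*VALIDATED.values())


print(validated_metrics("de"))
print(sorted(fully_validated()))
```

Languages outside these sets are not necessarily unsupported; as noted above, the backbone models cover ~100 languages, so a missing entry only means that no explicit validation is listed.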
Isaac triaged this task as High priority. Mar 17 2025, 9:39 PM
Isaac set Due Date to Mar 31 2025, 4:00 AM.
Isaac moved this task from Backlog to In Progress on the Research board.

After implementing the multilingual quality metrics, I used them for 2 test cases.

Test case 1: Prompt optimization
I tested the quality metrics for improving the prompt. My main goal was to improve upon prompt id 04: while that one reduced the issues we observed with tone and language confusion, the simple summaries were not simpler to read in terms of their readability scores (Flesch-Kincaid grade level). Results are in this spreadsheet (internal only).

I tested 8 different prompts on one of the English benchmark datasets calculating all 5 quality metrics for evaluation. From this experiment, candidate prompt 05f seems to yield the best results.
In comparison to prompt id 04, it creates simple summaries that are substantially easier to read: the FKGL score decreases by ~3.5 levels (in comparison to 0.35 for id 04). At the same time, the tone score (checking for peacock language) for id 05f (0.35) is almost the same as for id 04 (0.31) and still substantially better than for the initial prompt id 01 (0.44), where we detected these issues. The other metrics are also similar (meaning preservation) or even better (fluency, language confusion). So prompt 05f seems like a good compromise between our first prompt (01) with good simplification and prompt 04 with good tone, without giving up much on the other metrics.
Interestingly, looking through the results for the other prompts, it seemed hard to improve both simplicity and tone: when one improved, the other would typically get worse. My rationale for the new prompt 05f was to keep the prompt more concise and to provide explicit guidelines for each of the quality dimensions we use to evaluate the simple summaries.

Test case 2: Multilingual performance
I ran the evaluation framework on a set of 22 languages that are supported by the Aya-expanse model used for generating the simple summaries. This provides quantitative insight into how well the simple summaries model performs for languages beyond English. The results will inform the choice of the most suitable languages for potential pilot experiments.
The results are in this spreadsheet (internal only)

With this, the goal of this task is completed: we have a set of multilingual quality metrics for the simple summaries model. Thus, closing this task as completed for now.