When evaluating the effect of a change such as quantization of the simple summaries model, we also need to make sure that the quality of the model output remains reasonable. In previous work, we identified three potential guardrail metrics for the simple summaries model: readability, fluency, and meaning preservation.
In this task, we want to systematize and expand the set of metrics.
Relevant aspects to consider:
- Should we consider additional aspects such as neutral point of view (NPOV)? Are there existing metrics that could operationalize this aspect as a guardrail metric?
- Can we implement the metrics for languages beyond English? For example, for readability we currently use the Flesch-Kincaid grade level, which works well for English; for other languages, we could use the multilingual readability model to score readability instead.
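
To make the readability guardrail concrete, here is a minimal sketch of the Flesch-Kincaid grade level for English text, together with a simple before/after comparison. The syllable counter is a rough vowel-group heuristic and the names `count_syllables`, `flesch_kincaid_grade`, `within_guardrail`, and the `max_increase` threshold are assumptions for illustration, not part of the existing evaluation code.

```python
import re


def count_syllables(word: str) -> int:
    """Approximate syllable count (assumed heuristic: count vowel groups)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing 'e' as silent, except in '-le' endings such as 'table'.
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


def within_guardrail(baseline: float, candidate: float, max_increase: float = 1.0) -> bool:
    """Hypothetical check: the candidate model's output may be at most
    `max_increase` grade levels harder to read than the baseline's."""
    return candidate - baseline <= max_increase
```

A possible use would be to score the summaries produced by the original and the quantized model on the same inputs and pass the two mean grade levels to `within_guardrail`. The threshold of one grade level is a placeholder and would need to be chosen from data. For non-English text, `flesch_kincaid_grade` would be swapped for a score from the multilingual readability model.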