
Set up benchmark dataset for evaluation
Closed, Resolved · Public

Description

Define one or more representative benchmark datasets that we can use for evaluating the different models that generate simple summaries.

Relevant aspects to consider

  • size: the dataset should be large enough to yield good statistics on the metrics, but not so large that running time becomes unreasonable (probably between 100 and 10,000 articles)
  • representativeness: should the dataset be a random subsample, or should it contain known edge cases (e.g. in terms of readability)?
  • languages: so far, we have only run the model on articles from English Wikipedia. Ideally, we would like to assemble the same articles from other languages for multilingual evaluation

Event Timeline

These criteria make a lot of sense to me. Something to consider around representativeness: for the article-country model (T369120#10106230), I put together a few different datasets. In that case, it was a small random dataset to get truly representative numbers, a dataset of outliers, and then a dataset for English Wikipedia specifically because I knew this would feed into Content Translation where English is a very common source language. Sounds like you're proposing something similar here where the aim is to have a dataset to give a sense of overall performance, a dataset to help identify potential issues, and a dataset to better capture how end-users might experience the outputs. For this particular case, I wonder if the second part (edge cases) might benefit from having things like a sample of protected pages (assuming that they're higher sensitivity)? Might be other ways to target it including manually as I think you're suggesting.

First iteration of benchmark datasets

  • Starting point is the filtered set of 8,148 articles from the first experiment. The reason is that this constitutes a sample of relevant articles which are considered, in principle, suitable for simple summaries (for example, this excludes Biographies of Living Persons).
  • Each benchmark dataset contains 100 articles. This is considered sufficient to obtain accurate statistics about average performance while limiting the time it takes to run the model on all articles. If needed, this can be adapted at a later point.
  • The following benchmarks are considered relevant:
  1. English random: random sample of articles from English Wikipedia
  2. English edge cases: sample of edge-case articles from English Wikipedia. Edge cases are identified as the worst-scoring articles on each of the guardrail metrics from the first round of experiments. Specifically, with 5 guardrail metrics, we choose the N/5 worst-scoring articles for each metric (see T386448)
  3. Multilingual core: random sample of articles from Wikipedia language versions that are explicitly supported by the Aya model (23 languages). Specifically, we select the same set of articles as “English random”, but for each article we randomly select one of the language versions in which it is available.
  4. Multilingual extended: random sample of articles from Wikipedia language versions that are not explicitly supported by the Aya model. Specifically, we select the same set of articles as “English random”, but for each article we randomly select one of the language versions in which it is available.
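The edge-case selection described in (2) can be sketched as follows. The metric names and score structure here are illustrative placeholders, not the actual guardrail metrics from T386448; lower scores are assumed to be worse.

```python
# Hypothetical guardrail metrics; illustrative names only, not the real set.
GUARDRAIL_METRICS = ["faithfulness", "tone", "simplification", "coverage", "neutrality"]

def select_edge_cases(scores, n=100, metrics=GUARDRAIL_METRICS):
    """Pick the n/len(metrics) worst-scoring articles per guardrail metric.

    scores: dict mapping article -> dict of metric -> score (lower = worse).
    Articles already selected for one metric are skipped for later metrics,
    so the result contains n distinct articles.
    """
    per_metric = n // len(metrics)
    selected, seen = [], set()
    for metric in metrics:
        # Sort ascending so the worst (lowest) scores come first.
        ranked = sorted(scores, key=lambda article: scores[article][metric])
        picked = 0
        for article in ranked:
            if article not in seen:
                selected.append(article)
                seen.add(article)
                picked += 1
            if picked == per_metric:
                break
    return selected
```

Deduplicating across metrics means an article that scores worst on several metrics is counted only once, and the next-worst article fills its slot.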

Generally this makes sense to me. Another thought that just occurred to me though -- T383090 documents some ideas around taking a step towards figuring out how much a model's knowledge cut-off matters for benchmarking. We haven't prioritized it, but it might be nice for this experiment to include a set of newer articles as a step in that direction. I don't think we know the exact cut-off for the Aya models, but choosing a date soon after they were released should be a reasonable choice and still hopefully allow for the 100-article threshold. I wouldn't put in the effort to really robustly match that dataset with the other ones you're considering (so it won't really be a strong apples-to-apples comparison), but even just a random subset of new articles would be a start. You should be able to use mediawiki-history to get this list of newer page IDs without too much trouble (event_entity = "page" AND event_type = "create"). I'll let you decide whether to include this new-pages sample in this initial round or leave it to a follow-up, but I think it would be good to include at some point.
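A minimal sketch of the suggested mediawiki-history filter, written against a generic iterable of event records. The field names (`event_entity`, `event_type`, `event_timestamp`, `page_id`) follow the mediawiki_history schema as I understand it, but should be verified against the actual dataset; the cut-off date is an assumption to be replaced with a date soon after the model's release.

```python
import random

def sample_new_pages(events, cutoff, n=100, seed=42):
    """Sample page IDs for pages created after an assumed knowledge cut-off.

    events: iterable of dicts with mediawiki-history style fields (assumed).
    cutoff: ISO-format timestamp string; string comparison works because
    ISO timestamps sort lexicographically.
    """
    created = [
        e["page_id"]
        for e in events
        if e["event_entity"] == "page"
        and e["event_type"] == "create"
        and e["event_timestamp"] > cutoff
    ]
    # Fixed seed so the benchmark sample is reproducible.
    rng = random.Random(seed)
    return rng.sample(created, min(n, len(created)))
```

In practice this filter would run as a query over the mediawiki_history table rather than in-memory Python, but the selection logic is the same.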

T383090 documents some ideas around taking a step towards figuring out how much a model's knowledge cut-off matters for benchmarking.

Good idea. I will include this as an additional benchmark dataset.

Generated 5 benchmark datasets (each consisting of 100 articles, lead text only):

  • en-random: 100 randomly selected articles from the first round of experiments in English Wikipedia
  • en-edgecases: the 100 articles with the lowest scores from the first round of experiments in English Wikipedia (with 5 metrics, choose the 20 lowest-scoring cases from each dimension)
  • en-cutoff: 100 random articles from English Wikipedia that were created after the model was released
  • multi-core: random sample of articles from Wikipedia language versions that are explicitly supported by the Aya model (23 languages). Specifically, we select the same set of articles as “en-random”, but for each article we randomly select one of the language versions in which it is available.
  • multi-ext: random sample of articles from Wikipedia language versions that are not explicitly supported by the Aya model. Specifically, we select the same set of articles as “en-random”, but for each article we randomly select one of the language versions in which it is available.
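The multi-core and multi-ext selection can be sketched with the same helper, toggled by whether the target wiki is Aya-supported. The wiki IDs below are an illustrative subset, not the full 23-language list, and the sitelinks mapping (article to the set of wikis where it exists) is an assumed input, e.g. derived from Wikidata sitelinks.

```python
import random

# Illustrative subset of Aya-supported wikis; NOT the full 23-language list.
AYA_WIKIS = {"arwiki", "dewiki", "eswiki", "frwiki", "hiwiki", "jawiki", "zhwiki"}

def pick_language_versions(sitelinks, supported=True, seed=42):
    """For each en-random article, randomly pick one non-English language
    version in which it is available.

    sitelinks: dict mapping article -> set of wiki IDs where it exists.
    supported=True builds multi-core; supported=False builds multi-ext.
    Articles with no eligible language version are dropped.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible benchmark
    selection = {}
    for article, wikis in sitelinks.items():
        pool = {w for w in wikis if w != "enwiki"}
        pool = pool & AYA_WIKIS if supported else pool - AYA_WIKIS
        if pool:
            # Sort before choosing so the result is deterministic for a seed.
            selection[article] = rng.choice(sorted(pool))
    return selection
```

Using the same seed and the same en-random article list for both calls keeps multi-core and multi-ext aligned with the English sample, as described above.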

Documentation:

As a test, I ran the simple summaries model on the benchmarks using different prompts. Results: https://docs.google.com/spreadsheets/d/1USF7IQhpi7Z8AcSp_44LUv7iN-vFnWyJidGXYp--jgA/edit?gid=0#gid=0
This allows us to systematically investigate trade-offs between different prompts. For example, some prompts lead to improvements in tone (more encyclopedic) but decrease the level of simplification.

Task is completed. If benchmark datasets need to be revised, this will be captured in follow-up tasks.