Define one or more representative benchmark datasets that we can use to evaluate the different models for generating simple summaries.
Relevant aspects to consider:
- size: the dataset should be large enough to yield reliable statistics on the metrics, but small enough that running time stays reasonable (probably between 100 and 10,000 articles)
- representativeness: should the dataset be a random subsample, or should it also contain known edge cases (e.g. in terms of readability)?
- languages: so far we have only run the model on articles from English Wikipedia. Ideally, we would compile the same articles from other language Wikipedias for a multilingual evaluation
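The size and representativeness criteria above could be combined in a simple sampling procedure: reserve a small number of readability edge cases and fill the rest of the benchmark with a random subsample. The sketch below is one possible way to do this, not a fixed decision; the article fields (`title`, `readability`) and the split sizes are assumptions for illustration.

```python
import random

def build_benchmark(articles, n_random=900, n_edge=100, seed=0):
    """Assemble a benchmark: readability edge cases plus a random subsample.

    `articles` is a list of dicts with hypothetical keys "title" and
    "readability" (e.g. a readability grade); both names are assumptions.
    """
    rng = random.Random(seed)
    by_readability = sorted(articles, key=lambda a: a["readability"])
    # Keep the most extreme articles at both ends as known edge cases.
    edges = by_readability[: n_edge // 2] + by_readability[-(n_edge - n_edge // 2):]
    edge_titles = {a["title"] for a in edges}
    # Sample the remainder uniformly from the rest of the pool.
    pool = [a for a in articles if a["title"] not in edge_titles]
    sample = rng.sample(pool, min(n_random, len(pool)))
    return edges + sample

# Toy pool of 2,000 synthetic articles with random readability scores.
pool = [{"title": f"Article {i}", "readability": random.Random(i).uniform(2, 18)}
        for i in range(2000)]
benchmark = build_benchmark(pool)
print(len(benchmark))  # 1000, within the 100-10,000 range discussed above
```

The same skeleton would extend to the multilingual case by building the pool per language before sampling.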