This task captures work in Q1 on hypothesis WE.3.1.3:
If we develop models for remixing content, such as content simplification or summarization, that can be hosted and served via our infrastructure (e.g. LiftWing), we will establish the technical direction for work focused on increasing reader retention through new content discovery features.
We have previously run exploratory experiments on simplification, which helped us understand how we could use existing data to train and evaluate such models. Over the past year, the LLM space has evolved rapidly, with many new models becoming available. At the same time, our internal infrastructure is improving, with new GPUs likely to become available in the short term. These developments allow us to improve upon our exploratory analysis.
- Get a better understanding of the different requirements:
  - Infrastructure: which models can we train and serve on our infrastructure (LiftWing)? This includes, among other things, model size and the time it takes to run inference for an individual request.
  - Language: how many and which languages do we want to support? Many models support only English, though some are multilingual.
  - Quality: what metrics should we use to decide whether the model works well?
  - Context: what is the intended use of the simplification? Should it simplify individual sentences, paragraphs, or full articles?
- Review latest available models that are, in principle, compatible with these requirements.
- (stretch) Implement at least one of the models in our internal infrastructure.
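As a rough illustration of how the infrastructure and quality questions above could be probed for a candidate model, here is a minimal sketch. It assumes the model can be wrapped as a plain text-to-text function. `placeholder_simplify`, `profile`, and the two proxy measures (word-count compression ratio and mean sentence length) are hypothetical names and choices made for this example only; they are not metrics or tooling chosen for this task, and the placeholder does not call LiftWing or any real model.

```python
import re
import statistics
import time
from typing import Callable

# Naive sentence boundary: whitespace that follows ., ! or ?
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")


def placeholder_simplify(text: str) -> str:
    """Hypothetical stand-in for a model call: keeps only the first sentence."""
    return SENTENCE_SPLIT.split(text.strip())[0]


def mean_sentence_length(text: str) -> float:
    """Average number of whitespace-separated words per sentence."""
    sentences = [s for s in SENTENCE_SPLIT.split(text.strip()) if s]
    return statistics.mean(len(s.split()) for s in sentences)


def profile(simplify: Callable[[str], str], texts: list[str], repeats: int = 5) -> dict:
    """Time each request and compute a crude output/input length ratio.

    Assumes every input text is non-empty.
    """
    latencies = []
    ratios = []
    for text in texts:
        output = ""
        for _ in range(repeats):
            start = time.perf_counter()
            output = simplify(text)
            latencies.append(time.perf_counter() - start)
        # Ratio of output words to input words (below 1 means shorter output).
        ratios.append(len(output.split()) / len(text.split()))
    return {
        "requests": len(latencies),
        "median_latency_s": statistics.median(latencies),
        "max_latency_s": max(latencies),
        "mean_compression_ratio": statistics.mean(ratios),
    }


if __name__ == "__main__":
    sample = ["One two three. Four five."]
    print(profile(placeholder_simplify, sample, repeats=3))
    print(mean_sentence_length(sample[0]))
```

Swapping `placeholder_simplify` for a function that calls a real model would give per-request latency figures for the infrastructure question. The length-based proxies are only a starting point and say nothing about whether meaning is preserved, which the quality question would still need to address.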
Documentation on meta: https://meta.wikimedia.org/wiki/Research:Develop_a_model_for_text_simplification_to_improve_readability_of_Wikipedia_articles/FY24-25_WE.3.1.3_content_simplification