Page MenuHomePhabricator

[WE.3.1.3] Building a model for content simplification (Q2)
Closed, ResolvedPublic

Description

This task captures work in Q1 on hypothesis WE.3.1.3:

If we develop models for remixing content such as a content simplification or summarization that can be hosted and served via our infrastructure (e.g. LiftWing), we will establish the technical direction for work focused on increasing reader retention through new content discovery features.

We have previously done some exploratory experiments on simplification which helped us understand how we could use existing data to train and evaluate such models. Over the past year, the space around LLMs has been evolving very fast with many new available models. At the same time, our internal infrastructure is improving with new GPUs likely being available in the short term. These developments allow us to improve upon our exploratory analysis.

  • Get a better understanding of different requirements
  • Infrastructure: What models can we train and serve in our infrastructure (LiftWing). This includes, among others, model size or time it takes to run inference for individual requests
  • Language: how many and which languages do we want to aim to support? Many models only support English; though some models are multilingual.
  • Quality: what metrics should we use to decide whether the model works well?
  • Context: what is the intended use of the simplification? Simplify individual sentences or paragraphs or full article?
  • Review latest available models that are, in principle, compatible with these requirements.
  • (stretch) Implement at least one of the models in our internal infrastructure.

Documentation on meta: https://meta.wikimedia.org/wiki/Research:Develop_a_model_for_text_simplification_to_improve_readability_of_Wikipedia_articles/FY24-25_WE.3.1.3_content_simplification

Event Timeline

MGerlach renamed this task from [WE.3.1.3] Building a model for content simplification to [WE.3.1.3] Building a model for content simplification (Q1).Jul 4 2024, 1:48 PM

Weekly update:

  • Systematizing open questions for successful model development
    • Infrastructure requirements: size, performance (inference time)
    • Product requirements: Languages, Quality, Use-case
    • Model requirements: Openness (can we host), Effectiveness (does the model have a chance to perform well), training (supervised, in-context learning, zero-shot as is), evaluation
  • Gathering input from product team (Web) on intended use to tailor model specifications (languages, input/output format, quality boundaries). These are ongoing discussions but should become more clear in the next week or so
  • Learning about updates in ML infrastructure which expands the potential set of candidate models. The ML team announced that we will have new servers will GPUs available for model training and hosting. This will allow us to use larger (and potentially better) models. Similarly, I am following the test-deployment of the Gemma2-model on LiftWing T369055. If successful, this could constitute a promising candidate model for the simplification task (or models with similar size/architecture).
  • Reviewing scientific literature to compile list of candidate models for the task. I have identified more than 10 recent papers (2023/2024) on using LLMs for text simplification approach. This is very insightful to understand which models are most promising for the specific task. For example, a promising model is Aya-23, an open model with dedicatedly multilingual support and, in principle, compatible with out infrastructure constraints (see above)..

weekly summary:

  • Clarified the criteria and constraints for the model
    • Multilingual: support (some) languages other than English
    • Openness: it needs to be open.
    • Resources: We need to be able to host the model in our infrastructure in LiftWing with reasonable inference time.
    • Use-case: We need to define the use-case; for example, should simplification be on the level of sentence, paragraph, section, or the full article?
    • Quality: Ensure the output is useful in practice according to some metric
  • Did a deep-dive on recent works on text simplification with language models (reviewing 26 papers from the past 2-3 years see below [1]). This helped me to understand the most common and promising strategies to approach the task and to identify challenges.
  • Some of the main learnings came from a paper (Paper Plain) which aims to improve access to medical papers. Based on interviews with readers about barriers for interacting with content and usability testing, they identify section gists as a valuable and most frequently-used feature by non-expert readers.
    • Operationalize simplification as a plain-language summary of a section
    • For example, use a prompt: “Summarize content you are provided with for a fifth-grade student.”
    • This approach seems highly effective as most LLMs are explicitly trained on the task of summarization across languages (e.g. XLSum)
  • I also reviewed some of the recent large language models that could be good candidates. I identified the Aya-23 model as a promising candidate model
    • Multilingual: It is multilingual supporting 23 languages (these languages cover approximately half the world’s population)
    • It is an open-weight model and available in Hugging Face
    • Given the successful test-deployment of the similarly-sized Gemma2-27B-it on LiftWing (T369055), it seems that this model could in principle be also hosted.
    • The model can be prompted to generate plain-language summaries of sections without fine-tuning
    • Some initial tests with the API-endpoint looked promising across several languages.

Next steps:

  • Reach out to ML Team about resource-constraints for considered candidate models (such as Aya-23)
  • Reach out to Web Team to get feedback about use-case as section-gists (plain-language summaries)
  • Set up an experiment with a small dataset implementing simplification as section-gists with candidate models (such as Aya-23)

[1] List of reviewed papers:

weekly update:

  • Ran small-scale experiments to automatically generate section-gists (i.e. plain-language summaries of a section) for Wikipedia articles using different models.
    • Test-dataset with original and simplified text from only the lead section of 10 articles in English, German, Portuguese (see spreadsheet)
    • Sample-dataset of several sections from the same article without the reference-simplification (see spreadsheet)
  • We are able to automatically generate section-gists (plain-language summaries of a section) of Wikipedia articles across languages using the LLM Aya-23. Based on qualitative evaluation of small sample (<100), the results are promising that the model can provide a concise and easy-to-understand overview of a long piece of text. I successfully tested English, German, and Portuguese, but the model supports 23 languages supposedly covering nearly half the world’s population in terms of speakers (which is way more than any other comparable LLM that I am aware of).
  • Evaluation of text simplification is challenging. Automated metrics (such as SARI) dont seem to be very useful to rigorously assess model performance for specific use cases. Simple baselines such as returning a truncated version of the input-text can yield surprisingly good results. As a result, we should probably not rely on these metrics and, instead, resort to manually judging (samples of) the model output which is, however, much more resource-intensive.

weekly update:

  • Built two working model prototypes for content simplification/summarization
    • 1) Simplification: Generate a simplified version of a section/paragraph using simpler language.
    • 2) Section-gist: Generate a plain language summary of a section (i.e. combining simplification and summarization)
  • Put together detailed documentation about the two models with examples and tutorial notebooks on how the models can be run (doc). I will add these updates to the project-meta page too in the next week or so.
  • Exploring alternatives for automatic evaluation of models to make it easier to iterate through different model variants without the need for manual evaluation.
  • Experimenting with the test-deployment in the staging instance of LiftWing (available thanks to the ML-Team) of the Aya-23 model (Aya-23-8B) which is used for the section gist model. The model works and returns requests in a reasonable time. Though the model quality seems substantially lower than the larger model I used in the experiments using the external API. ML Team is planning to test-deploy the larger model version (Aya-23-35B) in the next weeks.

weekly update:

  • Updated the project page on meta with current status: https://meta.wikimedia.org/wiki/Research:Develop_a_model_for_text_simplification_to_improve_readability_of_Wikipedia_articles/FY24-25_WE.3.1.3_content_simplification
  • Identified informative/interpretable metrics to evaluate performance of simplification models. For the manual evaluation, the most common approach is to judge 3 dimensions: simplicity (is the text simpler), fluency (is the text grammatical), and adequacy (does the text preserve the meaning). Looking into the literature, we can define automated metrics which approximate judgements along these dimensions.
    • Simplicity: Measure the change in readability score using our readability scoring model for Wikipedia articles.
    • Fluency: Count the number of grammatical errors using LanguageTool
    • Adequacy: Measure the factual consistency between the original and simplified text to detect, e.g., “model hallucination”. This is a very active field of research and several methods have been proposed recently such as FactCC (based on a trained classification model), SummaC (based on textual entailment), or QuestEval (based on question generation and answering).
  • The advantage of these metrics is that they are more interpretable and that they dont require reference samples from a ground truth dataset. This will hopefully make it easier to obtain confidence about whether models are “good enough” for potential use in production.
  • Started to implement metrics so we can automatically measure model performance.
  • Next step: Finalize implementation of metrics for evaluating model performance and run on representative sample with existing model.

weekly update

  • no updates this week as I was attending ACL 2024 conference.

as a heads-up, I will be out on sabbatical starting next week until mid-October, so updates here will likely slow down in the coming weeks.

How does this relate to Simple Wiki – could it be used to create article drafts the user can then improve (maybe even dynamically generated simple versions that are in sync with changes to the sources article)? I think at this point one of English Wikipedia's greatest challenges is that articles and sections are just sooo long – most people don't read all of it but removing things is usually also not due. I think adding section summaries to the top of each section is one useful approach, could the use of AI summarization be another? For example, maybe there could be a [button] to show a summary of a given section. Re "what metrics should we use to decide whether the model works well" I think editors should be able to adjust generated texts similar to what's described here and the adjustments and corrections would be used as a key metric to show how well a model works (e.g. less adjustments of checked generated texts marked as 'correction' = better model). WP articles should stay as indepth / complete as they are but at the same time people want short texts on demand and best with the option to easily see the more complete version as well if it missed some subject or is interesting.

Miriam renamed this task from [WE.3.1.3] Building a model for content simplification (Q1) to [WE.3.1.3] Building a model for content simplification (Q2).Oct 8 2024, 1:42 PM

weekly update:

  • I have been mostly busy this week with catching up on what has happened over the past 2 months while I was out.
  • I have been coordinating the work with ML team to deploy the larger 35B version of the Aya-23 model in a test-instance of LiftWing (the smaller 8B version was already tested successfully and thus constitutes a valid backup solution in case the former does not work)

@Prototyperspective thanks for reaching out.

How does this relate to Simple Wiki – could it be used to create article drafts the user can then improve (maybe even dynamically generated simple versions that are in sync with changes to the sources article)? I think at this point one of English Wikipedia's greatest challenges is that articles and sections are just sooo long – most people don't read all of it but removing things is usually also not due.

In principle, yes. One model generates simplified versions of articles. It is trained on pairs of the same article existing in English Wikipedia and Simple English Wikipedia (dataset). One could use the model to then generate simplified versions of articles that do not yet exist in Simple English Wikipedia but are already in English Wikipedia (of which there are many). However, it currently considers only the plain text of the article and will thus not include any links or references (which would be crucial for a draft of an actual article). Please note, that this is exploratory research to assess the feasibility of such a model.

I think adding section summaries to the top of each section is one useful approach, could the use of AI summarization be another? For example, maybe there could be a [button] to show a summary of a given section. Re "what metrics should we use to decide whether the model works well" I think editors should be able to adjust generated texts similar to what's described here and the adjustments and corrections would be used as a key metric to show how well a model works (e.g. less adjustments of checked generated texts marked as 'correction' = better model). WP articles should stay as indepth / complete as they are but at the same time people want short texts on demand and best with the option to easily see the more complete version as well if it missed some subject or is interesting.

Indeed. One other model I have been working on is to generate simple summaries of sections. The Web Team is currently running an experiment to test how useful these would be for readers of Wikipedia.

I hope this answers your questions. Feel free to also follow up on the talk page of the project page on meta: https://meta.wikimedia.org/wiki/Research:Develop_a_model_for_text_simplification_to_improve_readability_of_Wikipedia_articles

weekly update:

  • Ongoing work by ML Team to test-deploy the Aya-23 35B version of the model in LiftWing. Due to the size of the model, this requires some workarounds with datatypes which affects package dependencies.
  • Selected and implemented three interpretable metrics to automatically check the quality of the automatically generated simple summaries. This will serve as a tool to make informed decisions about whether the simple summaries meet some minimum quality requirements before considering use in practice or whether they should be discarded . These metrics capture: i) simplicity (is the model output simpler to read than the original?); ii) fluency (is the model output grammatically correct?); iii) meaning preservation (is the model output factually consistent with the original text?).
  • Example notebook: https://gitlab.wikimedia.org/repos/research/text-simplification/-/blob/main/evaluate_simple-summaries.ipynb

weekly update:

  • Putting together update for public documentation of the model
  • Set up meeting with ML Team next week to discuss test-deployment of aya23-35b model (used in the summaries experiment by Web Team) or potential alternative candidates

weekly update:

  • Started work with the ML Team on a dedicated subtask to test-deploy Aya models on LiftWing T379052
  • While we have successfully tested the smaller Aya23-8b model, we have not been able to run the larger Aya23-35b model as it requires too much memory than is available in the available GPUs. We are thus testing the next generation of the Aya models (Aya-expanse) for test-deployment because they have a smaller memory footprint and thus might be easier to run in our infrastructure and, at the same time, are reportedly strictly better than the previous Aya-23 (so we would probably switch to the newer version in future experiments) while also supporting the same set of 23 languages.
  • We ran first successful experiments with the larger Aya-expanse-32b model in the ML-Lab machines, where we were able to load and run inference.

weekly update:

  • no updates because I was attending the team offsite during this week
Miriam triaged this task as High priority.Nov 20 2024, 1:52 PM

weekly update:

  • We switched the model from Aya-23 to the next version Aya-expanse. https://phabricator.wikimedia.org/T379052#10314444
  • I implemented the Aya-expanse models in our internal infrastructure using the new GPUs on the ml-lab servers. We run the model to generate simple summaries using the Aya-expanse-8b or Aya-expanse-32b model.
    • I implemented some quantization techniques so that the models would fit into memory (the larger version does not work out of the box) https://www.deeplearning.ai/short-courses/quantization-fundamentals-with-hugging-face/
    • Specifically, we can use the model with different datatypes. For example, using float16 instead of the default float32 reduces the memory footprint of the model in half. In turn, this also reduces time needed for inference. At the same time, it is generally believed that this comes with little to no decrease in model performance.
    • For example, the aya-expanse-32b model could not be loaded into memory of the GPU with the default datatype. Instead, using float16, the model’s memory footprint is 60.16GB and thus fits into memory of the GPU. Similarly, for the smaller Aya-expanse-8b the memory footprint decreases from 29.91GB to 14.95GB requiring only half the time to run a single example query (8s vs 3s).
    • We can implement additional quantization techniques to further improve the memory footprint and inference time using, for example, the quanto library https://huggingface.co/blog/quanto-introduction This will require more thorough experiments to not only understand different options and potential dependencies that need to be resolved in LiftWing, but also make sure that model quality is preserved. I believe that this is beyond the scope of the current task and should be scoped as a separate task/hypothesis, if we have enough evidence that the model is useful in practice.

Learnings

  • We managed to run the Aya-model (aya-expanse-32b) for generating simple summaries in our infrastructure using the GPUs in the new ml-lab servers for offline experiments. This makes it very likely that we will be able to set up the Aya-model also in LiftWing as an API endpoint (the smaller Aya-expanse-8b is already deployed in a test-instance). However, substantial work would be required to optimize performance of the model (memory, inference time) using quantization.

Next steps:

  • Wrapping up documentation for project page on meta
  • Capturing learnings about quantization for potential dedicated follow-up tasks as it is relevant to any other LLM we would want to use in our infrastructure.

weekly update:

Closing this task as work is completed.

  • The hypothesis was confirmed.
    • We implemented an LLM in our infrastructure to generate simple summaries of sections of Wikipedia articles. The new model is hosted internally and currently tested using the browser extension, thus establishing a technical direction.
  • Main deliverables:
  • Major lessons
    • Our new ML-Lab servers support state-of-the-art multilingual models for text generation. However, additional work on optimizing these models to reduce latency and memory footprint.
    • Evaluation of the model output is challenging. The lack of simple metrics to judge the quality of the simples summaries (and any generated text) makes it difficult to iteratively improve the model via offline experiments (without asking human raters).
  • Next steps
    • The crucial next step is to optimize the model latency (memory footprint and inference time) via quantization or other suitable approaches.