Overview
When seeking to understand what changes were made by a given Wikipedia edit, there are three general sources of information: the actual diff of an edit (example), the edit summary, and edit tags. Edit tags are generally pretty simple and constrained to tool information or basic regexes -- e.g., whether an edit occurred via mobile, or easy-to-define content changes such as blanking a section or creating a new, short article. The edit diff is much richer -- showing the specific changes made in context -- but takes up too much visual space for tasks such as quickly browsing an edit history or patrolling for suspicious edits. The edit summary is intended to be a good middle ground -- a succinct but flexible, editor-provided description of what the edit did (and perhaps why).
Edit summaries are very valuable but have a number of drawbacks. Vandals may provide misleading summaries, and many editors leave the summary blank or use a canned edit summary that may or may not accurately reflect the impact of the edit. Despite these drawbacks, edit summaries are invaluable for tasks such as understanding work on Wikipedia, building datasets of edits for study, and contextualizing individual edits.
Task
This research will focus on developing a machine-learning model to auto-generate edit summaries for Wikipedia. It will primarily be applied machine-learning research. A potential set of steps includes:
- Familiarize yourself with the data -- i.e. common patterns in edit summaries (similar to the proposed research in T287992)
- Build a supervised training dataset:
- Data: you'll likely want to use the mediawiki history dumps (example on PAWS) as the primary source as they include both the revision text that you can use to generate diffs (features) and edit summaries (labels).
- Features: in a perfect world, a model could merely be provided the previous and current revisions of a Wikipedia article as inputs and output an edit summary using, e.g., a transformer model. This approach has a few major challenges, though:
- Length: wikitext generally is going to be quite large -- a featured article on Wikipedia can easily exceed 6000 words -- and the changes will usually be to just a small part of the article.
- Meaning: wikitext has very specific syntax such as {{templates}}, [[links]], or <ref>references</ref>, and the model would have to effectively learn to become a wikitext parser to properly ascribe meaning to any changes.
- Architecture: while transformer models can handle a wide range of inputs and associated tasks, I'm not sure if there is a format that supports two passages as input (previous+current wikitext) with arbitrarily-generated text as output (summary). BART may be the closest analogue but likely the previous+current wikitext has to be pre-processed in such a way that it comprises a single input.
- A few options to make the problem more tractable come to mind then:
- Basic aligned wikitext: concatenate only the changed wikitext paragraphs and hope the model can both learn to compare them and learn the meaning / spatial interpretation of wikitext syntax -- e.g., that any changed content between {{ and }} is part of a template.
- Reducing the space: focus the model just on instances where article text is changed. Inputs can then be, e.g., the changed sentences or paragraphs with syntax stripped out. This model would ignore changes such as categories, template edits, etc., but those can probably be described through simpler approaches like merely reporting how many elements of each type were changed.
- Structured inputs: use the mwedittypes library to extract potential input features -- e.g., changed sentences in plaintext, non-text elements that were edited (templates, references, etc.). The different non-text edit types could be given special tokens to help the model incorporate this info.
- Labels: build a dataset of edit summaries for English Wikipedia (or other languages). Any dataset should filter out edits that were vandalism or patrolling-related (reverts) -- e.g., via mwreverts with the history dumps, or by building a list of revert(ed) edits via the denormalized edit history dumps. Further filtering might then be beneficial, as many edit summaries are likely provided via dropdowns from tools (and thus might be overly general or simply skew the data due to their prevalence). Different heuristics might be useful here -- e.g., semi-unique summaries provided by veteran editors -- but ultimately this will probably require some trial-and-error/iteration plus intuition about what yields a large, high-quality, and maximally diverse dataset.
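As an illustration of the "basic aligned wikitext" option above, the changed paragraphs of two revisions can be extracted by aligning them with Python's standard difflib. This is a minimal sketch that naively splits paragraphs on blank lines; a real pipeline would likely want wikitext-aware segmentation (e.g., via a wikitext parser) instead:

```python
import difflib

def changed_paragraphs(prev_wikitext: str, curr_wikitext: str):
    """Return (previous, current) pairs of paragraphs that differ between revisions.

    Sketch only: paragraphs are naively split on blank lines and aligned
    with difflib.SequenceMatcher; inserts/deletes leave one side empty.
    """
    prev_paras = [p.strip() for p in prev_wikitext.split("\n\n") if p.strip()]
    curr_paras = [p.strip() for p in curr_wikitext.split("\n\n") if p.strip()]
    matcher = difflib.SequenceMatcher(a=prev_paras, b=curr_paras, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged paragraphs are dropped to keep inputs short
        pairs.append(("\n\n".join(prev_paras[i1:i2]),
                      "\n\n".join(curr_paras[j1:j2])))
    return pairs

prev = "Intro paragraph.\n\nThe cat sat.\n\n[[Category:Cats]]"
curr = "Intro paragraph.\n\nThe cat sat on the mat.\n\n[[Category:Cats]]"
print(changed_paragraphs(prev, curr))
# → [('The cat sat.', 'The cat sat on the mat.')]
```

Only the changed middle paragraph survives, which keeps model inputs far shorter than full revision wikitext.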
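The architecture concern above (two passages in, one summary out) is commonly handled by packing both revisions into a single sequence with separator and special tokens. A minimal sketch -- the token names here are assumptions and would need to be registered with whatever tokenizer is actually used:

```python
def build_model_input(prev_chunk: str, curr_chunk: str,
                      edit_types=None,
                      sep_token: str = "</s>") -> str:
    """Pack previous+current text (and optional edit-type markers) into one
    string for a seq2seq model such as BART.

    Sketch only: the "<type:action>" markers and the separator token are
    illustrative, not an established convention.
    """
    type_prefix = ""
    if edit_types:
        # e.g., ["template:change", "reference:insert"] derived from
        # structured diff output such as mwedittypes'
        type_prefix = " ".join(f"<{t}>" for t in edit_types) + " "
    return f"{type_prefix}{prev_chunk} {sep_token} {curr_chunk}"

print(build_model_input("The cat sat.", "The cat sat on the mat.",
                        edit_types=["sentence:change"]))
# → <sentence:change> The cat sat. </s> The cat sat on the mat.
```

Prefixing edit-type markers is one way to realize the "structured inputs" option: the model sees both the plaintext change and a coarse signal about non-text elements that were edited.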
This task is considered [long] and it presupposes familiarity with ML modeling -- in particular, transformer models similar to those described in Descartes: Generating Short Descriptions of Wikipedia Articles. In general, it is expected that the task will take several months of consistent work and is a good fit for someone with some research experience or an interest in being involved in research. The actual time needed, however, will depend greatly on your level of experience.
Rationale
In order to make edit summaries more useful for editors, they likely need to be more complete and trustworthy. Edit summary recommendations would have several benefits:
- They could be used to help new editors learn norms around what to put in edit summaries -- e.g., similar to T265163
- As a more "objective" summary than self-provided summaries, they could help patrollers filter which edits to check in more depth
Recommended Skills
- This task primarily requires some experience with machine learning, particularly with fine-tuning pre-trained transformer models as a likely solution.
Acceptance Criteria
- The output of this task will be a Meta report describing the research and findings (example). Depending on researcher and mentor interest, this could be expanded into a more formal publication.
Process
- If you are interested in this task and it is not assigned to anyone, you may begin work on it. Please leave a comment on the task and tag @Isaac so that he is aware.
- If you have made some progress on the task (an initial dataset of features and labels) and would like to continue, share a link to your current draft and let @Isaac know so that he can assign the task to you and help you to plot out the next steps.
- Generally, @Isaac will be able to answer any questions about the task and try to respond quickly when clarification is necessary but response times may be slow if help is needed for more general debugging etc.