
Research support for Copyediting as a structured task (Q2)
Closed, Resolved · Public

Description

I conducted background research on possible approaches for copyediting (T288240). Summary and recommendations can be read here: https://meta.wikimedia.org/wiki/Research:Copyediting_as_a_structured_task#Literature_Review

Before continuing, I need to coordinate with the Growth Team on the next steps based on the recommendations.

  • Discuss and iterate with Growth Team to better scope the task
  • Initiate steps identified in discussions with Growth Team (to be specified)

Event Timeline

Update week 2021-10-18:

  • no updates -- I have not received any feedback from the Growth Team yet

Update week 2022-02-28:

  • started exploratory analysis using LanguageTool to detect spelling/grammar errors in Wikipedia articles, working from the HTML version of an article: https://gitlab.wikimedia.org/mgerlach/copyedit/-/blob/main/languagetool_exploratory-01.ipynb
  • the main challenge in applying this or similar tools to Wikipedia articles is that they are very sensitive and flag many false positives; for example, names of entities (such as the article title or link targets) are very commonly flagged as misspellings. The HTML version of an article makes it much easier than the wikitext to identify the relevant text that should be checked for copyedits, which substantially reduces false positives and increases precision. A sketch of this approach follows this list.
  • the approach is very promising, as we could generate copyedit recommendations at scale by taking advantage of the newly available HTML dumps
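To make the approach concrete, here is a minimal sketch in the spirit of the notebook above (which may differ in detail): fetch an article's HTML, keep only plain paragraph text, run LanguageTool via the `language_tool_python` package, and drop matches that overlap annotated spans such as links, bold, or italics. The REST endpoint and the span-overlap filter are illustrative simplifications; the actual pipeline works from the HTML dumps.

```python
import requests
from bs4 import BeautifulSoup
import language_tool_python

def get_paragraphs(lang, title):
    """Fetch an article's Parsoid HTML and keep only plain paragraph text,
    recording the spans of annotated text (links, bold, italics)."""
    url = f"https://{lang}.wikipedia.org/api/rest_v1/page/html/{title}"
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    paragraphs = []
    for p in soup.find_all("p"):
        text = p.get_text()
        spans, pos = [], 0
        for node in p.find_all(["a", "b", "i"]):
            t = node.get_text()
            start = text.find(t, pos)
            if start >= 0:
                spans.append((start, start + len(t)))
                pos = start + len(t)
        paragraphs.append((text, spans))
    return paragraphs

tool = language_tool_python.LanguageTool("en-US")
for text, annotated in get_paragraphs("en", "Coffee"):
    for m in tool.check(text):
        lo, hi = m.offset, m.offset + m.errorLength
        # Drop errors overlapping annotated spans: entity names in links
        # are a frequent source of false-positive misspellings.
        if not any(lo < end and start < hi for start, end in annotated):
            print(m.ruleId, repr(text[lo:hi]), "->", m.replacements[:3])
```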

Update week 2022-03-07:

  • started building an experimental API on Toolforge that uses LanguageTool to detect copyedits for an article in 30 different languages (https://gitlab.wikimedia.org/mgerlach/copyedit-api). Currently blocked: I couldn't figure out how to run a local instance of LanguageTool on Toolforge, given that it is a Java tool.
  • after discussing with Djellel, we are planning to evaluate LanguageTool for detecting copyedits: i) use an existing corpus of copyedits to measure precision (sketched below); ii) manually annotate LanguageTool's results on, say, 100 random Wikipedia articles to get a sense of the rate of false positives and test different filters to reduce it.
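A rough sketch of evaluation step (i), assuming the copyedit corpus can be loaded as (text, gold error spans) pairs -- the corpus format and the exact-span matching used here are placeholders, not the final methodology:

```python
import language_tool_python

def evaluate(corpus, lang="en-US"):
    """Compare LanguageTool's flagged spans against gold error spans."""
    tool = language_tool_python.LanguageTool(lang)
    tp = fp = fn = 0
    for text, gold_spans in corpus:
        pred = {(m.offset, m.offset + m.errorLength) for m in tool.check(text)}
        gold = set(gold_spans)
        # Exact span match; a fuller evaluation would likely allow
        # partial overlaps and compare the suggested corrections too.
        tp += len(pred & gold)
        fp += len(pred - gold)
        fn += len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy example: one sentence with a known error span ("exampel").
corpus = [("This is an exampel sentence.", [(11, 18)])]
print(evaluate(corpus))
```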

@MGerlach thanks for the update. You're welcome to reach out to Baha to ask for help with running LanguageTool on Toolforge. (@Miriam fyi)

  • solved the issues from last week and deployed an experimental API for detecting copyedits in Wikipedia articles using LanguageTool
  • Call the API by providing a language code (e.g. “en”) and a page title; an illustrative call is sketched after this list.
  • The API applies aggressive filtering to reduce the number of false positives. Using the HTML version of the article makes it much easier to: i) identify only plain text (avoiding tables, infoboxes, and content transcluded from templates); ii) remove errors that overlap with annotated text such as links, bold, and italics, which are often spurious.
  • Code (and documentation): https://gitlab.wikimedia.org/repos/research/copyedit-api
  • The solution was to set up an endpoint on Cloud VPS running our own instance of LanguageTool (thanks @Isaac)
  • Started evaluating the precision and recall of LanguageTool on a ground-truth dataset in English (non-Wikipedia); the plan is to perform a similar evaluation on Wikipedia articles.
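For illustration, a hypothetical example call -- the host, route, and response fields below are assumptions; see the repository documentation above for the actual interface:

```python
import requests

# Placeholder host; the real deployment lives on Cloud VPS / Toolforge.
BASE = "https://copyedit-api.example.org"

resp = requests.get(f"{BASE}/copyedit", params={"lang": "en", "title": "Coffee"})
resp.raise_for_status()
for error in resp.json().get("errors", []):
    # Assumed fields: flagged text, LanguageTool rule id, suggested fixes.
    print(error)
```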

Based on the above observations of the API surfacing copyedits from LanguageTool, and on @Miriam's feedback after their discussion with Growth, I will focus on getting a quantitative estimate of the precision of LanguageTool's copyedits when applied to Wikipedia articles. This will be captured in a separate task; a sketch of how such an estimate could be set up follows.
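One way the precision estimate might be set up, reusing the hypothetical endpoint from the previous sketch: sample random articles via the MediaWiki API and dump the surfaced copyedits to a CSV for manual annotation. The CSV layout is likewise an assumption.

```python
import csv
import requests

BASE = "https://copyedit-api.example.org"  # hypothetical endpoint, as above

def random_titles(lang, n):
    """Sample n random main-namespace article titles via the MediaWiki API."""
    r = requests.get(
        f"https://{lang}.wikipedia.org/w/api.php",
        params={"action": "query", "list": "random", "rnnamespace": 0,
                "rnlimit": n, "format": "json"},
    )
    return [p["title"] for p in r.json()["query"]["random"]]

with open("to_annotate.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "flagged_error"])  # annotators add a verdict column
    for title in random_titles("en", 100):
        resp = requests.get(f"{BASE}/copyedit",
                            params={"lang": "en", "title": title})
        for error in resp.json().get("errors", []):
            writer.writerow([title, error])
```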