Copyedit structured task API
Open, Needs Triage, Public

Description

Request Status: New Request
Request Type: project support request
Related OKRs:

Request Title: Copyedit structured task API

  • Request Description: Structured tasks are meant to break editing tasks down into step-by-step workflows that make sense for newcomers and on mobile devices. The Growth team believes that introducing these new kinds of editing workflows will allow more new people to begin participating on Wikipedia, some of whom will learn to do more substantial edits and get involved with their communities. We have built two: link suggestions and image suggestions. Even as we built those, communities have repeatedly explained that the task they desire most is copyediting: something related to spelling, grammar, punctuation, tone, etc. The Research team has surveyed the state of the art in open source resources and APIs that we could use to identify simple spelling suggestions in many languages. The end product for the Growth team would be a service that suggests spelling changes for Wikipedia articles across dozens of languages and incorporates Wikipedia-specific logic (e.g., suggestions are only given for appropriate parts of the article's content). A more general product might also make such suggestions available in the Visual Editor while the user is working on an article, so that they can incorporate them before saving their edit. We will have support from the Research team in identifying, validating, and adapting the spelling solution.
  • Indicate Priority Level: Medium
  • Main Requestors: Growth and Editing teams
  • Ideal Delivery Date: November 2022
  • Stakeholders: Marshall Miller and Peter Pelberg

Request Documentation

Document Type                            | Required? | Document/Link
Related PHAB Tickets                     | Yes       | <add link here>
Product One Pager                        | Yes       | <add link here>
Product Requirements Document (PRD)      | Yes       | <add link here>
Product Roadmap                          | No        | <add link here>
Product Planning/Business Case           | No        | <add link here>
Product Brief                            | No        | <add link here>
Other Links                              | No        | project page

Event Timeline

Related: There is a spellchecker API at https://spell.toolforge.org/ supporting 90 languages

One aspect to consider is whether the communities will be able to adjust the rules used for "spelling, grammar, punctuation, tone, etc." When the Language team explored this space, one interesting aspect of projects such as LanguageTool was the possibility for communities to define rules and adjust them to their policies (e.g., supporting multiple variants of a language, discouraging certain non-neutral expressions, or suggesting certain words for consistency).
Integrating this kind of system is very relevant for the Language team, since it is a common need surfacing in recent research ("Provide more tools related to translation, such as dictionaries and spell check."). In this case, the need is to surface issues to the user before publishing, which seems related to the editor guidance described in T265163: Create a system to encode best practices into editing experiences.

Also, these types of tools tend to rely on a custom dictionary (when you use a word the tool doesn't recognize, you can add it to the dictionary so you don't get warned about it every single time). Would we need something like that? Would it be per-user? Per-page? Where would it be stored, and how would it be exposed to moderation / anti-abuse workflows?

Relatedly, if someone rejects a spellcheck issue, ideally that would be stored somewhere - when the page is edited, and someone runs another spellcheck, we don't want them to go through all the issues that have already been resolved previously (although I guess whether this matters depends on the rate of false positives). Again, how would that be stored? (MCR?) How would it survive edits?
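
One possible way to make a rejection survive edits, sketched below, is to key it on the flagged word plus a little surrounding text instead of on its character offset. This is only an illustration of the question above, not a proposed design: the function names, the `rule_id` values, and the 20-character context window are all assumptions, and where the fingerprints would actually be stored (MCR or elsewhere) is left open.

```python
import hashlib


def suggestion_fingerprint(text, offset, length, rule_id, context_chars=20):
    """Hash the flagged word with its neighbouring text, ignoring absolute position.

    Hypothetical helper: the fingerprint stays the same when text is added or
    removed elsewhere on the page, and changes when the nearby text changes.
    """
    end = offset + length
    before = text[max(0, offset - context_chars):offset]
    flagged = text[offset:end]
    after = text[end:end + context_chars]
    key = "\x1f".join([rule_id, before, flagged, after])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def drop_rejected(text, suggestions, rejected):
    """Return only the suggestions whose fingerprint was not rejected before."""
    return [
        s for s in suggestions
        if suggestion_fingerprint(text, s["offset"], s["length"], s["rule_id"])
        not in rejected
    ]


original = "The committee met on Tuesday and recieved the report from staff."
offset = original.index("recieved")
# A reviewer rejects the suggestion on the original revision.
rejected = {suggestion_fingerprint(original, offset, 8, "SPELLING")}

# A later edit adds text elsewhere on the page, shifting every offset.
edited = "A new opening paragraph was added. " + original
suggestions = [{"offset": edited.index("recieved"), "length": 8, "rule_id": "SPELLING"}]
print(drop_rejected(edited, suggestions, rejected))  # prints []
```

With this approach, an edit that touches the words right around the flagged one changes the fingerprint, so the issue would be shown again. Whether that is acceptable depends on the false positive rate mentioned above.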

It seems like we would need some kind of community configuration for spellcheck per language wiki. We could do some research to see what size limitations we'd have by placing it all in a single, per-wiki MediaWiki namespace JSON page; that would have the advantage of giving us a lot of things for free in terms of moderation and oversight.
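
To make the single-JSON-page idea a bit more concrete, here is a minimal sketch of what reading such a page and applying it could look like. The key names (`ignoredWords`, `disabledRules`), the rule ID, and the shape of a suggestion are hypothetical; nothing here reflects an existing configuration format. The byte count is included only because page size is the open question.

```python
import json

# Hypothetical content of a per-wiki configuration page.
EXAMPLE_PAGE = """
{
  "ignoredWords": ["MediaWiki", "Phabricator"],
  "disabledRules": ["EXAMPLE_RULE_ID"]
}
"""


def load_config(raw_json):
    """Parse the page text and record how many bytes it takes up."""
    data = json.loads(raw_json)
    return {
        # casefold() so that "mediawiki" and "MediaWiki" are treated alike
        "ignored_words": frozenset(w.casefold() for w in data.get("ignoredWords", [])),
        "disabled_rules": frozenset(data.get("disabledRules", [])),
        "size_bytes": len(raw_json.encode("utf-8")),
    }


def apply_config(suggestions, config):
    """Drop suggestions for community-approved words or disabled rules."""
    return [
        s for s in suggestions
        if s["rule_id"] not in config["disabled_rules"]
        and s["word"].casefold() not in config["ignored_words"]
    ]


config = load_config(EXAMPLE_PAGE)
suggestions = [
    {"word": "mediawiki", "rule_id": "SPELLING"},
    {"word": "recieved", "rule_id": "SPELLING"},
    {"word": "tone", "rule_id": "EXAMPLE_RULE_ID"},
]
print(apply_config(suggestions, config))
```

Measuring `size_bytes` on a realistic word list for a large wiki would be one way to start the size research.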

DAbad moved this task from Backlog to Investigate on the Foundational Technology Requests board.
DAbad added a subscriber: lbowmaker.

October 26, 2022 Tech SC Meeting Notes

  • From Growth team: "The high level summary is: we want to provide suggestions to users about "copy edit" fixes they can make to articles. The suggestions will come from a LanguageTool application instance, and those suggestions will be further filtered by a Python application that is responsible for proxying requests to/from LanguageTool (some updates / more reading here https://phabricator.wikimedia.org/T315086). Given that architecture, we would have two services (LanguageTool, and Copyedit) in Wikimedia's Kubernetes cluster."
  • After initial review with the Technology steering committee: this request does not include an ML/AI model and is more closely in line with what has been implemented for SD image suggestions
  • Given that LanguageTool and Copyedit would be developed similarly to image suggestions
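
The filtering step in the architecture quoted above could look roughly like the sketch below. It assumes the Copyedit service receives a JSON body shaped like LanguageTool's check response (a `matches` list whose entries carry `offset`, `length`, `replacements`, and `rule.category.id`), and that the eligible character ranges of the article have already been computed by some other step. The function name, the `eligible` parameter, and the default category set are hypothetical, not part of the Growth team's design.

```python
from typing import Any

# (start, end) character ranges where suggestions are allowed, end exclusive.
Span = tuple[int, int]


def filter_matches(
    response: dict[str, Any],
    eligible: list[Span],
    allowed_categories: frozenset[str] = frozenset({"TYPOS"}),
) -> list[dict[str, Any]]:
    """Keep only matches that are in an allowed category, fall fully inside
    an eligible span, and offer at least one replacement."""
    kept = []
    for match in response.get("matches", []):
        start = match["offset"]
        end = start + match["length"]
        category = match.get("rule", {}).get("category", {}).get("id")
        if category not in allowed_categories:
            continue
        if not any(s <= start and end <= e for s, e in eligible):
            continue
        if not match.get("replacements"):
            continue
        kept.append(match)
    return kept


# Hand-written sample in the assumed response shape.
sample = {
    "matches": [
        {"offset": 4, "length": 8, "replacements": [{"value": "received"}],
         "rule": {"id": "A", "category": {"id": "TYPOS"}}},
        {"offset": 50, "length": 3, "replacements": [{"value": "the"}],
         "rule": {"id": "A", "category": {"id": "TYPOS"}}},
        {"offset": 14, "length": 2, "replacements": [{"value": "is"}],
         "rule": {"id": "B", "category": {"id": "STYLE"}}},
    ]
}
kept = filter_matches(sample, eligible=[(0, 30)])
print([m["offset"] for m in kept])  # prints [4]
```

Keeping the filter as a pure function over the response body means it can be tested without a running LanguageTool instance.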

Next Steps:

  • @lbowmaker will run point on this request to dig deeper with the team on potential architecture support needs and draft for recommendation on path forward

The Language team would also be very interested in using the LanguageTool service. I'm sharing some use cases that are relevant for us, in case they help the service support a broader set of needs (or be architected in a way that makes them easier to support in the future):

Content/Section translation could use the LanguageTool service to surface issues in the translations at different times: (a) while users edit them (e.g., showing issues after completing a paragraph or sentence), (b) in the process of publishing (e.g., prompting users to review if major issues are detected), and/or (c) after publishing (i.e., recommending to improve the article by correcting the detected issues).

The types of issues could range from (a) spellchecking mistakes the user made, (b) grammar issues resulting in unnatural sentence formation based on light modifications of machine translation, or (c) word choice preferences by the community (i.e., to encourage consistency in the way certain terms are translated).

@lbowmaker quick heads up, I'm working on some diagrams and documentation recap in T321020: Research Spike: Copy Edit, I'll share it and ping you back asap.