Community-defined translation collections: Technical Exploration
Closed, ResolvedPublic

Description

Background

Hypothesis 2.1.4:

If we developed a proof of concept that adds translation tasks sourced from WikiProjects and other list-building initiatives, and present them as suggestions within the CX mobile workflow, then more editors would discover and translate articles focused on topical gaps. By introducing an option that allows editors to select translation suggestions based on topical lists, we would test whether this approach increases the content coverage in our projects.

By making topical/campaign lists visible in existing editorial workflows, in this case, the mobile translation workflow, where editors come to translate articles:
a) We hope to increase the visibility/discovery of topical gaps already being curated by organizers and communities -> If this works, we increase the chances that more editors become aware of these worklists.
b) We hope to increase quality content contributions through translations -> If this approach works, topical/campaign worklists can be a great, alternative source of translation suggestions for editors.

Description
| Considerations | Notes | PoC scope | Future Opportunities |
| --- | --- | --- | --- |
| 1. List Management | Lists are knowledge gap areas that are curated collaboratively by organizers/communities using different tools e.g. petscan, and wiki-data queries. | a) Event/Wikiproject pages with translation worklists can be read, parsed, and fed into the content translation tool | a) Future requirements like sharing, re-using, and tracking of worklists are possible. b) There's also the aspect of private/personal lists vs public/collaborative lists on wiki. c) Worklists are presented differently on meta/campaign pages. |
| 2. Tasks/Contributions | Types of contributions/actions | Translation tasks with attributes such as source article title, source language, target language, QID. | Other use cases: creating new articles, adding images, adding info-boxes, adding references and citations, etc. |
| 3. Tools | Entry-points to surface & action the lists | Mobile translation tool has a more updated dashboard + tools. | Other tools: desktop translation, newcomer tools, suggested edits (apps), event page |

This ticket aims to put together a PoC that can do the following:
-> Have a way to parse the topical/campaign lists to the mobile translation tool to expose the lists beyond the current campaign pages/wiki projects.
-> Have a way to allow editors to opt in or opt out of this service; so organizers can choose when to include their lists on the CX mobile flow.
-> Have a way to feed the worklists into the recommendation API: T369484; so that editors can receive these as suggestions.
-> Have a way to incorporate/allow changes made to the worklists which also reflect on the CX mobile flow; translators should have the most up-to-date translation suggestions presented.
-> Have a way to update already translated tasks; organizers should be able to tell the progress made against their lists.

Acceptance Criteria

Recommended approach
-> defined below in the comments.

Follow-up technical tasks:
-> T371515

Event Timeline

PWaigi-WMF renamed this task from Technical Exploration: How we can have a List Pipeline/Infrastructure for translation tasks to Technical Exploration: List Pipeline/Infrastructure for Translation tasks. Jun 28 2024, 11:33 AM
PWaigi-WMF renamed this task from Technical Exploration: List Pipeline/Infrastructure for Translation tasks to Technical Exploration: List Pipeline/Infrastructure that supports External Lists. Jun 28 2024, 11:38 AM
PWaigi-WMF renamed this task from Technical Exploration: List Pipeline/Infrastructure that supports External Lists to Technical Exploration: List Infrastructure/Pipeline/Manager that supports External Lists. Jun 28 2024, 12:13 PM
PWaigi-WMF updated the task description.
PWaigi-WMF updated the task description.
PWaigi-WMF renamed this task from Technical Exploration: List Infrastructure/Pipeline/Manager that supports External Lists to Community-defined translation lists: Technical Exploration. Jul 9 2024, 9:16 PM
PWaigi-WMF updated the task description.

This ticket aims to put together a PoC that can do the following:
-> Have a service that parses the topical/campaign lists to the mobile translation tool; so that there's a way of exposing these lists beyond the current campaign pages/wiki projects.

The CX/SX dashboards get article suggestions from the recommendation system. The recommendation system would extract article titles from campaigns that have opted in to this feature.

-> Allow editors to opt in or opt out of this service; so that organizers can choose when to allow their lists to surface on the mobile translation tool.

Topical collaboration organizers/admins could tag their campaign with special markup that allows the recommendation system to find it and get some information about it.

For example, it could be a parser function with parameters but no visual output. Similar to how Extension:PageAssessments does it.

{{#translation_campaign: Wiki For Human Rights 
| location=meta:WikiForHumanRights
| desc=Wikipedia articles that directly describe the UN Declaration of Human Rights, its history, and core UN bodies.
| list=meta:WikiForHumanRights/List
}}

A simpler short-term solution could be to use a template in much the same way, especially on meta, since a MediaWiki extension is needed to define a parser function and ContentTranslation doesn't appear to be installed there.
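As a rough illustration of how a program could read such a marker, here is a minimal sketch that pulls the campaign name and the key=value parameters out of page wikitext. It assumes the hypothetical `{{#translation_campaign: ...}}` syntax from the example above, and `parse_campaign_marker` is a made-up helper name. It does not handle nested templates or pipes inside values, so it is not a substitute for a real wikitext parser.

```python
import re

# Matches the hypothetical {{#translation_campaign: ...}} call from the example above.
MARKER_RE = re.compile(r"\{\{#translation_campaign:(.*?)\}\}", re.DOTALL)


def parse_campaign_marker(wikitext):
    """Return the marker's name and key=value parameters, or None if absent."""
    match = MARKER_RE.search(wikitext)
    if match is None:
        return None
    # The first pipe-separated segment is the campaign name; the rest are parameters.
    name, *raw_params = match.group(1).split("|")
    params = {}
    for raw in raw_params:
        key, sep, value = raw.partition("=")
        if sep:  # ignore segments that are not key=value pairs
            params[key.strip()] = value.strip()
    return {"name": name.strip(), **params}
```

The same function would work for a template-based marker if the regular expression were pointed at the template name instead of the parser function name.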

-> Any changes made to the worklists on the campaign page/wiki projects also reflect on the mobile translation tool; so that editors can view the most up-to-date translation suggestions.

Since the recommendation system would be reading the articles list from the campaigns pages in real time with little to no caching, suggestions in the dashboard would be relatively up to date.

-> Have a way to feed the worklists into the recommendation API: T369484; so that editors can receive these as suggestions.

The recommendation system would be reading from the worklists, as described above.

-> Sync already translated tasks back to the campaign page/wiki project; so that organizers can view & track progress.

By the magic of the wiki, red links (for articles that don't exist) will turn into blue links as articles are created, and the recommendation system would stop providing those recommendations. There is no specific syncing back in this PoC, but I believe this point would be addressed.
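To make the red-link idea concrete, here is a small sketch of how the recommendation system could drop titles that already exist. It assumes the worklist links point at the target wiki and that the caller has already fetched a MediaWiki `action=query&titles=...&formatversion=2` response, where non-existent pages carry `"missing": true`. The function name `still_missing` is hypothetical.

```python
def still_missing(query_response):
    """Return the titles that the target wiki reports as missing (red links)."""
    pages = query_response.get("query", {}).get("pages", [])
    # With formatversion=2, "pages" is a list and missing pages have "missing": True.
    return [page["title"] for page in pages if page.get("missing")]
```

Only the titles returned here would remain as translation candidates; everything else has turned blue and can be skipped.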

CampaignEvents is installed though - if we want to get an API-based output like api.php?action=query&prop=translationcampaign&titles=WikiForHumanRights returning the List page, we can enhance that extension. For an MVP, some marker in the page is enough so that these pages can be retrieved using the search API, probably using incontent or hastemplate as outlined in https://www.mediawiki.org/wiki/Help:CirrusSearch.
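For the MVP route, the search request could look roughly like the sketch below. It only builds the URL for a `list=search` query with a `hastemplate:` filter and does not send it. The template name "Translation campaign" is an assumption made for illustration, not an existing template, and `campaign_search_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

META_API = "https://meta.wikimedia.org/w/api.php"


def campaign_search_url(template="Translation campaign", limit=50):
    """Build a search API URL that finds pages transcluding the marker template."""
    params = {
        "action": "query",
        "list": "search",
        # hastemplate: is a CirrusSearch keyword; the template name is hypothetical.
        "srsearch": f'hastemplate:"{template}"',
        "srlimit": limit,
        "format": "json",
        "formatversion": 2,
    }
    return f"{META_API}?{urlencode(params)}"
```

Note that `list=search` searches the main namespace by default, so an `srnamespace` parameter would be needed if campaign pages live elsewhere.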

I've been thinking about this challenge too: a translation list is about creating articles on a specific target Wikipedia but redlinks aren't tracked as entities in our APIs so the worklist must be accessible via our APIs on the potential source articles. For an MVP, I guess you go with whatever works but eventually we'll need to resolve this tension in a way that:

  • Makes it easy for organizers to build and update translation lists -- i.e. ideally they can create a list either via a list of Wikidata items associated with the articles they want created or via a list of specific source articles that could be translated over. I don't think it's reasonable to expect them to edit other language editions to generate their list. Maybe it's reasonable to ask them to edit Wikidata?
  • Does not create a bunch of bloat / noise on the source wikis -- i.e. I think we want to avoid solutions that require adding some sort of tag to source articles or their talk pages. That realm should belong to that language edition and patrollers on that language edition won't want to receive watchlist notifications etc. whenever a source article is tagged as a good translation candidate. One exception is that again I think it might be reasonable to consider having this tagging happen on Wikidata, though that community would have to weigh in on whether something like "on focus list of Wikimedia project" (P5008) is an approach that they would like to build on.
  • Does not greatly complicate the existing recommendation API -- i.e. we currently rely on a small number of quick API calls that take as inputs the desired source and target language (as well as various topic filters to apply to the target language). To Santhosh's point, I'm pretty sure the tagging of an article as part of a translation campaign has to also be done via a filter that can be applied directly to a Search query. This could be via things like incontent or hastemplate but I think long-term we'll probably have to create another WeightedTag like we do for articletopic. The only other alternative that has occurred to me is fully switching to Wikidata as the backend for this -- i.e. using P5008 that I mentioned above to tag articles, WDQS for gathering candidate lists and sitelinks for filtering, and then finding some way to recreate the morelike and articletopic functionality in Wikidata. But that would be a big lift and latency would be a major challenge.
  • Allows the translation lists to be filtered by our existing filters like articletopic or morelike -- here I'm thinking of the example where you have a worklist of all women scientists that is created via SPARQL and an editor comes in and they want to create an article in Spanish Wikipedia with English Wikipedia as the source and to further filter down to women who are connected to their home country of Ecuador (this could either be done soon via the country filter I'm working on or just by adding morelike:Ecuador as a filter). If the translation list is accessible as a filter on the Search index of English Wikipedia, this is easy. If it's not, it's much much harder/slower.

Possible solutions:

  • For the MVP, I suspect you'll want to work with the Search team to create a WeightedTag (this is what was done for the GLAM pilot where we added a few custom country-specific filters to the Search index for those events). And then collect lists of Wikidata items associated with the articles to be translated and map them to all the available source articles and manually upload them to the respective search indexes so they can be included as a filter when searching for candidates there. The recommendation API can then be easily modified to add that filter without increasing latency/complexity and participants can easily combine the campaign filter with the existing Content Translation functionality.
  • Long-term, I think we have to choose between three options:
    • Back-end magic (like the parser call mentioned by Stephane) to take a list of links or Wikidata items and automatically propagate those items to be weighted tags in the Search index on all or some subset of target languages. This is convenient for end-users and doesn't cause bloat on the wikis but does raise challenges about how to keep the search indexes updated so they don't end up bloated with a massive number of these tags and can keep up with changes to the worklists.
    • Shifting all of our translation recommendations to Wikidata as the backend instead of the Search index. This feels like a reasonable place for tracking worklists though would require all events/campaigns/wikiprojects to have a Wikidata item. Wikidata natively has access to sitelinks for filtering candidates but the challenge will be adding the other topical filters to Wikidata. This is probably solvable though would be a big change because Wikidata currently is not built for machine-generated annotations like predicted topics and WDQS isn't as reliable/fast as MediaWiki search queries.
    • Create a whole new system where worklists are created on metawiki and the recommendation API doesn't rely on Wikipedia Search or Wikidata. We productionize the list-building tool which would replicate the morelike functionality; we add the topics and langlinks as filters to the list-building tool as well (new streaming pipelines but doable). This is definitely long-term as list-building relies on nearest-neighbor search which would be a new production functionality but there are good technologies out there for it (including potentially as an extension to OpenSearch). The main benefit here is that we're building from a blank slate as opposed to trying to shoehorn even more functionality into Wikidata or the Wikipedia Search indexes.
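For the MVP step of mapping Wikidata items to the available source articles, a sketch along these lines might help. It assumes the caller has already fetched a `wbgetentities` response with `props=sitelinks`, and it groups the article titles by wiki database name (for example `enwiki`). The function name `source_articles_by_wiki` is hypothetical, and uploading the result to the search indexes is out of scope here.

```python
def source_articles_by_wiki(entities_response, source_wikis):
    """Group (QID, title) pairs by wiki for every sitelink present on each item."""
    by_wiki = {wiki: [] for wiki in source_wikis}
    for qid, entity in entities_response.get("entities", {}).items():
        sitelinks = entity.get("sitelinks", {})
        for wiki in source_wikis:
            # An item without a sitelink for this wiki has no source article there.
            if wiki in sitelinks:
                by_wiki[wiki].append((qid, sitelinks[wiki]["title"]))
    return by_wiki
```

Each per-wiki list is then the set of pages that would need the campaign tag in that wiki's search index.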

Change #1056589 had a related patch set uploaded (by Sbisson; author: Sbisson):

[research/recommendation-api@master] [PoC] Translation recommendations based on article lists

https://gerrit.wikimedia.org/r/1056589

Based on this week's discussions, regarding these 2 approaches (simply defined):

  1. try to parse work lists as they are from campaign pages
  2. define a simple list storage and management workflow that campaigns & organizers can adapt -> We will attempt this for the MVP; the next steps will be to:

- define where worklists should be stored and what they should look like.
- define what organizers would need to provide, plus a simple workflow to match the experience.

As per the discussions regarding early technical iterations towards this goal, we decided the following:

List definition and storage

  • We are not going to retroactively support any of the existing campaign pages. We will define a simple mechanism to mark a campaign page as part of a "translation campaign".
  • These are normal wiki pages and can be edited as usual. The only difference is that the page will have a 'marker' that identifies it as a translation campaign source.
  • The 'marker' should provide a way to define translation metadata in a structured form that programs can read. Example metadata: campaign-id, campaign-name, campaign-list-source-language, campaign-list-target-languages.
  • For the initial iteration, these lists are going to be defined at meta.wikimedia.org
  • Document this process.
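One possible shape for the structured metadata, sketched as a small Python record. The field names mirror the example metadata keys listed above. The comma-separated format for target languages and the `CampaignMetadata` class itself are assumptions made for illustration, since the actual marker format is still to be defined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CampaignMetadata:
    campaign_id: str
    campaign_name: str
    source_language: str
    target_languages: tuple = ()

    @classmethod
    def from_params(cls, params):
        """Build the record from a dict of marker parameters."""
        # Assumption: target languages arrive as a comma-separated string.
        targets = params.get("campaign-list-target-languages", "")
        return cls(
            campaign_id=params["campaign-id"],
            campaign_name=params["campaign-name"],
            source_language=params["campaign-list-source-language"],
            target_languages=tuple(t.strip() for t in targets.split(",") if t.strip()),
        )
```

A missing required key raises `KeyError`, which makes an incomplete marker fail loudly rather than produce a half-filled campaign.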

Candidate recommendation

  • If the API request includes the featured-lists param, the recommendation API will look for all pages with the above-defined 'marker' using an appropriate CirrusSearch query (see the comments above by Isaac).
  • For these pages, find all the links in them. That article list, together with the source language information, forms the featured-list candidates. Note that these articles still need to go through the filtering stage, to find all translatable articles, and then ranking.
  • Cache the pages we find with that 'marker' with a reasonable cache eviction policy. Cache the article lists too. I recommend diskcache for this purpose. I also recommend filling this cache on API server startup.
  • The API response can contain additional information about where the candidates were sourced from. This may not be useful for the UI at this point, but I think this info will help the development and debugging workflow.
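The link-extraction and caching steps above can be sketched as follows. This is a dependency-free illustration, not the proposed implementation: it uses a plain in-memory dictionary with a time-to-live instead of diskcache, extracts links with a simple regular expression over raw wikitext, and does not filter out file or category links. `extract_link_titles` and `TTLCache` are hypothetical names.

```python
import re
import time

# Captures the target of [[Target]], [[Target|label]] and [[Target#Section]].
LINK_RE = re.compile(r"\[\[([^\]|#]+)")


def extract_link_titles(wikitext):
    """Return unique link targets in page order, dropping any leading colon."""
    titles = []
    for target in LINK_RE.findall(wikitext):
        title = target.strip().lstrip(":").strip()
        if title and title not in titles:
            titles.append(title)
    return titles


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}

    def get_or_load(self, key, loader):
        """Return the cached value for key, calling loader() if absent or expired."""
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]
        value = loader()
        self._entries[key] = (now, value)
        return value
```

Warming the cache at API server startup would then amount to calling `get_or_load` once for every campaign page found by the marker search.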

Presentation and translation workflow initiation

  • CX Dashboard provides an option to select featured-lists in the UI - along with topics. If featured-lists is chosen, that info will be part of recommendation api request.

Closing this task as we now have a follow-up implementation task -> T371515

PWaigi-WMF updated the task description.

Change #1056589 abandoned by Sbisson:

[research/recommendation-api@master] [PoC] Translation recommendations based on article lists

https://gerrit.wikimedia.org/r/1056589

PWaigi-WMF renamed this task from Community-defined translation lists: Technical Exploration to Community-defined translation collections: Technical Exploration. Nov 5 2024, 11:46 AM