Custom translation suggestions: Basic topic selection
Open, High, Public

Description

As part of the work to support Custom translation suggestions (T113257), the initial iteration is about exposing the option for users to customize the list of suggestions.

This involves the following elements:

  • Filter status. Shows, in the suggestions list, which filter is active, along with options to change it.
  • Adjust suggestions view. A view listing topic areas in different categories for the user to pick.

Basic topic selection.png (1×2 px, 293 KB)

More details about each element are given below. You can also inspect the designs in Figma.

Filter status

Basic topic selection - Filter status.png (609×1 px, 38 KB)

The filter status is a short list of filter options where three elements are shown:

  • The default filter. This is the option selected initially in the absence of URL parameters or previous changes by the user. It will be "For you" if the user has previous edits that can be used as seed articles, or "Popular topics" if not.
  • The quick alternative. This is another filter option the user can switch to directly by tapping on it. It represents the next option that would be available if the user accessed the full list of filters, but it is provided directly. If the default filter is "For you", the quick alternative will be "Popular topics".
  • Access to all filters. A chip labeled 'More', using the Ellipsis icon from Codex, provides access to the "Adjust suggestions" view (see section below for more details).
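
The selection rules in the list above can be sketched as follows. This is only an illustration under stated assumptions: the filter identifiers, the `resolve_filter_status` helper, and the precedence between a URL parameter and a previously saved choice are hypothetical and are not specified in this task. The caller is assumed to pass only the filters that are actually available to the user.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Hypothetical identifiers; the task only names the labels
# "For you" and "Popular topics".
FOR_YOU = "for-you"
POPULAR = "popular"


@dataclass(frozen=True)
class FilterStatus:
    active: str             # filter currently applied
    quick_alternative: str  # one-tap alternative shown next to it


def resolve_filter_status(
    all_filters: Sequence[str],
    has_seed_edits: bool,
    url_filter: Optional[str] = None,
    saved_filter: Optional[str] = None,
) -> FilterStatus:
    # Default described in the task: "For you" when seed articles exist.
    default = FOR_YOU if has_seed_edits else POPULAR
    # Assumed precedence: URL parameter, then saved choice, then default.
    active = next(
        (f for f in (url_filter, saved_filter) if f in all_filters), default
    )
    # Quick alternative: the first filter in the full list that is not active.
    quick = next((f for f in all_filters if f != active), active)
    return FilterStatus(active, quick)
```

With `all_filters = ["for-you", "popular", "architecture"]` and a user who has seed edits, this yields "for-you" as the active filter and "popular" as the quick alternative, matching the case described above.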

The filters will be supported using a FilterChip component. This type of component was discussed for the incorporation into Codex (T324223), but unlike the InfoChip, it has not been implemented yet.

Adjust suggestions view

Basic topic selection - Adjust suggestions.png (704×1 px, 40 KB)

The "Adjust suggestions" view provides access to all the filters so the user can select the one to activate. The view is composed of the following parts:

  • Header. A 'Done' button to confirm the topic selection, a close action using the Close icon to discard the topic selection, and an "Adjust suggestions" title that provides context to users.
  • Filter group title. Filters are organized in different groups. The initial group, "Automatic", is a special group since it does not represent a specific topic area. The rest of the groups are based on the ORES taxonomy (more on this below).
  • Filter tags. FilterChip components.

Responsive adjustments
The "Adjust suggestions" view will be shown as a dialog on wider screens and will take up the whole viewport on narrow screens.

Future iterations

The "Adjust suggestions" view is designed to give users the flexibility to define the knowledge gap they are interested in as narrowly or as broadly as they want. However, this ticket covers only the basics, where only one topic area can be selected. Tickets for future iterations will expand the support with multi-selection, searching and additional filtering options.

Filter taxonomy

The specific filters ("Architecture", "Arts", etc.) and the way they are grouped ("Culture", "Geography", etc.) are based on the Growth Newcomer Task project.

Event Timeline

Pginer-WMF raised the priority of this task from Medium to High. Aug 2 2024, 10:13 AM

Change #1065191 had a related patch set uploaded (by Nik Gkountas; author: Nik Gkountas):

[mediawiki/extensions/ContentTranslation@master] SX: Add api method for fetching most popular recommendations

https://gerrit.wikimedia.org/r/1065191

Change #1065192 had a related patch set uploaded (by Nik Gkountas; author: Nik Gkountas):

[mediawiki/extensions/ContentTranslation@master] Unified Dashboard: Add "most popular" suggestion option

https://gerrit.wikimedia.org/r/1065192

One challenge of this feature is getting the ORES topics and their localized labels.

Looking at how it's done in the Growth Experiments user homepage, they read the topics from MediaWiki:NewcomerTopicsOres.json and the labels are provided in each wiki on the MediaWiki namespace (example).

While we could technically re-use these artifacts here, it doesn't seem clean to do so. Would it make sense to have the ORES extension make available its official and up-to-date topics hierarchy as well as their localized labels? These could be translated in translatewiki.net as usual.

@Isaac Regarding @SBisson's question above, what is the plan about localized topic names? If I remember correctly, the topic list is also going to be more granular soon.

Change #1068060 had a related patch set uploaded (by Sbisson; author: Sbisson):

[mediawiki/extensions/ORES@master] PoC: Make available official list of topics and localized labels for reuse

https://gerrit.wikimedia.org/r/1068060

One challenge of this feature is getting the ORES topics and their localized labels.

@santhosh @SBisson good questions. I don't have a strong opinion about the technical solution for getting localized names for the topics beyond that it be shared across Newcomer Tasks (@KStoller-WMF) and Content Translation so there isn't duplicate work to translate the labels. Ideally future recommender systems would be able to use the functionality as well -- e.g., if Android wanted to expand their Suggested Edits module.

If I remember correctly, the topic list is also going to be more granular soon.

Yes, as for the stability / size of the topic list:

  • I'm working on adding countries as an option (T366273). This would add 250 more labels but I'm hopeful there are ways to add most of the localized names by default so this isn't a huge ask to folks doing the translation? For instance, each country is associated with its QID so perhaps we could seed the translation or fall back to the label on Wikidata (example for the US) if the community has not set one via MediaWiki namespace pages or translatewiki?
  • The plan is that the original topic taxonomy that Newcomer Topics are based on will also undergo some minor modifications (Hypothesis WE 1.1.3 starting in Q2 hopefully with implementation work likely happening in Q3/Q4). This will likely be adding a few new topics (related to human rights / sustainability) and merging/dropping a few as well (the less-used ones like radio) but we'll try to keep the majority of the topic list intact. I don't think this should be a blocker to your work and we'll manage the roll-out such that the old tags don't stop working, but I call it out to indicate that we'll likely have to go through the translation/update process again in several months when the next generation of topic model taxonomy is rolled out and Content Translation is ready to adopt it.

@Isaac is the article country model a distinct model with its own search index tag and query syntax (i.e. "articlecountry:Ghana") or is it a bunch of new values for the article topic tag (i.e. "articletopic:country-ghana")?

About countries localized names, we also have them from CLDR so it's mostly a matter of using the same codes.

is the article country model a distinct model with its own search index tag and query syntax (i.e. "articlecountry:Ghana") or is it a bunch of new values for the article topic tag (i.e. "articletopic:country-ghana")?

@SBisson technically the country model is a separate model from the other topics but right now the plan is to merge these two outputs under the existing articletopic: tag on the Search index. So probably something more like articletopic:country-ghana. Reasoning being that from the standpoint of our end-users, I don't think there's a major difference between the "standard" topics and countries. You can see a bit of the discussion at T301671#10052411 and then Eric's reply.

About countries localized names, we also have them from CLDR so it's mostly a matter of using the same codes.

Oh I'd be plenty happy with this solution! I would still be curious if this would solve the issue for Growth too but at least it doesn't sound like any extra work for translators. You can see the details of the full country list for the model in the README. I did a quick check against the CLDR list and it looks like all of the iso_code values in my list map up to what you'd expect, at least for the English labels: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/cldr/+/refs/heads/master/CldrNames/CldrNamesEn.php#984

If we go this route, I assume the easiest thing would actually be to have something like articletopic:country-gh (for Ghana) as the tag in Search? If so, I can coordinate with them and ML Platform to make sure the model outputs the ISO codes.
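
To illustrate the proposal discussed above, here is a minimal sketch that builds the country tag from an ISO code and resolves a localized label through a fallback chain. Everything here is an assumption for illustration: the `articletopic:country-<code>` format is only a proposal in this thread, the function names are hypothetical, the fallback order (community override, then CLDR, then Wikidata) is one possible reading of the comments, and the dictionaries stand in for the real data sources.

```python
from typing import Dict, Mapping

# language code -> (ISO country code -> label); stand-in for real sources.
LabelSource = Mapping[str, Mapping[str, str]]


def country_topic_tag(iso_code: str) -> str:
    """Build the proposed search tag, e.g. 'articletopic:country-gh'."""
    return f"articletopic:country-{iso_code.lower()}"


def country_label(
    iso_code: str,
    language: str,
    community_labels: LabelSource,
    cldr_names: LabelSource,
    wikidata_labels: LabelSource,
) -> str:
    """Return the first label found; fall back to the bare ISO code."""
    key = iso_code.upper()
    # Assumed order: community override, then CLDR, then Wikidata.
    for source in (community_labels, cldr_names, wikidata_labels):
        label = source.get(language, {}).get(key)
        if label:
            return label
    return key


# Illustrative data only.
cldr: Dict[str, Dict[str, str]] = {"en": {"GH": "Ghana"}}
print(country_topic_tag("GH"), country_label("GH", "en", {}, cldr, {}))
```

Keying every source by the same ISO code is what makes the CLDR route cheap: no extra translation work is needed where CLDR already has the name.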

Making the taxonomy of topics available to any product consistently seems really great. Having a machine-processable localizable single source of truth seems useful to make sure the same latest version of topic areas are available to our users across our products.

One aspect I was curious about is whether the grouping has been included. Looking at the current taxonomy, Literature is shown under the "Culture" group. Are those groups encoded in the system in a way that different tools can use consistently?

For the specific case of countries and user interaction, the idea is to show only some of the options initially (e.g., those relevant to the user based on location, previous choices or similar), with an option to access the rest (in addition to being exposed to the search option T369595). The parent ticket will link more details about this at point "5. Location", but the idea could be along the following lines:

Mobile.png (99×318 px, 4 KB)

@Isaac to make this taxonomy localizable, we need to decide a single source of truth. Do you think ORES is a good candidate? @SBisson prepared a POC patch with ORES.

[...]
One aspect I was curious about is whether the grouping has been included. Looking at the current taxonomy, Literature is shown under the "Culture" group. Are those groups encoded in the system in a way that different tools can use consistently?
[...]

I hope to have the topics and groups represented as formally as possible so all products can use them the same way. One particularity of the grouping is that some group headers are actual filters for everything in the group while some aren't. For instance articletopic:stem will return everything in stem (biology, chemistry, etc) while articletopic:culture or articletopic:americas won't return anything. Maybe something to think about for the upcoming taxonomy update.
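
As a sketch of how a product could hide this particularity, the helper below turns a group into a search query: it uses the header directly when the header is itself a filter, and otherwise falls back to the group's members. The group contents are an illustrative subset, the names `GROUPS`, `FILTERABLE_HEADERS` and `group_query` are hypothetical, and the code assumes that `|` expresses OR between values of the articletopic: keyword.

```python
from typing import Dict, List, Set

# Illustrative subset of the grouping; not the full taxonomy.
GROUPS: Dict[str, List[str]] = {
    "stem": ["biology", "chemistry"],
    "culture": ["architecture", "literature"],
}

# Group headers that are themselves usable articletopic values.
FILTERABLE_HEADERS: Set[str] = {"stem"}


def group_query(group: str) -> str:
    """Return a search query matching every article in the group."""
    if group in FILTERABLE_HEADERS:
        # The header already covers the whole group.
        return f"articletopic:{group}"
    # Otherwise match any of the members (assumed OR syntax).
    return "articletopic:" + "|".join(GROUPS[group])


print(group_query("stem"))     # articletopic:stem
print(group_query("culture"))  # articletopic:architecture|literature
```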

If you look at the taxonomy used for the newcomers tasks, which is very similar to what we have in the screenshots in this task, you'll notice that they made some recategorisation, exclusion and grouping to the original. I'm wondering if the official version we create should be closer to the original with all the levels and granularity or a version we believe is better suited to be presented to the users in various products, just like what Growth did.

To the general question of the mismatch between the ORES taxonomy and the Newcomer Tasks taxonomy, I'll try to explain but I admit it's a bit messy. Feel free to skip if not interested :)

The ORES taxonomy is not just conceptual but also reflects a bit how the model was trained. From the model standpoint, it has 64 independent topics that can be assigned to any given article. We were following the English WikiProject Directory with some manual cleaning to make it a bit more coherent/feasible for our purposes. But while some topics were well-defined, others were a bit hazier. For instance, there's several topics listed under the group Media in the ORES taxonomy but there were WikiProjects that were clearly "Media" but didn't really fall into the specific sub-topics. So we'd create a special topic called Media* that, during model training, held these outliers as well as everything else in that Media group. For example, if an article is tagged with a WikiProject that is grouped under Radio, then it also gets automatically tagged with Media* while training. This is also helpful at inference time too (where we don't do this automatic assigning) because we might encounter an article that kinda looks like a Radio article to the model but is low confidence and the hope is that it might still be picked up by that catch-all Media* topic.

These 64 possible topics from the LiftWing model (e.g., Culture.Media.Media*) are then converted into simple keywords for Search like media that are stored under the articletopic: tag. So these are unfiltered outputs from the model and e.g., @SBisson your example about articletopic:stem actually has nothing to do with the Newcomer Tasks grouping but is instead capturing the catch-all STEM* topic from the model. Now when it got to Newcomer Tasks and how they would use these tags, they had two additional constraints (at least from my standpoint):
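
The two steps described above (adding the catch-all topic while training, and converting model topics into Search keywords) can be sketched roughly as follows. This is not the actual training or indexing code: the set of catch-all topics is an illustrative subset, the function names are hypothetical, and the keyword rule (last segment, lowercased, spaces to hyphens) is an assumption that happens to reproduce the `media` and `stem` examples from this thread.

```python
from typing import Set

# Illustrative subset of the catch-all topics mentioned in the thread.
CATCH_ALL_TOPICS: Set[str] = {"Culture.Media.Media*", "STEM.STEM*"}


def expand_training_labels(labels: Set[str]) -> Set[str]:
    """Add the parent group's catch-all topic, where one exists."""
    expanded = set(labels)
    for label in labels:
        parts = label.split(".")
        if len(parts) >= 2:
            # 'Culture.Media.Radio' -> 'Culture.Media.Media*'
            candidate = ".".join(parts[:-1] + [parts[-2] + "*"])
            if candidate in CATCH_ALL_TOPICS:
                expanded.add(candidate)
    return expanded


def to_search_keyword(topic: str) -> str:
    """Assumed conversion: 'Culture.Media.Media*' -> 'media'."""
    last = topic.split(".")[-1].rstrip("*")
    return last.lower().replace(" ", "-")


print(sorted(expand_training_labels({"Culture.Media.Radio"})))
print(to_search_keyword("STEM.STEM*"))
```

Groups without a catch-all (Culture in this subset) are left untouched, which is why a query on the group name alone returns nothing for them.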

  • They didn't want to provide topics that would have really low coverage. For example, there really aren't a ton of radio articles and so these are just grouped with entertainment rather than being their own topic.
  • I presume there's also a simplicity of UI / ease-of-translation aspect that might have come into play too.

Thankfully, I assume their constraints are generally shared by all of the other potential recommender systems so we can hopefully come to a reasonable and global solution moving forward.

For the next iteration of the model, I do plan to drop at least some of these low-coverage topics that are just getting grouped in with other, higher-coverage ones. Hopefully that'll simplify a bit this grouping that Newcomer Tasks has put together. That said, WMF recommender systems aren't our only stakeholder so I do have to take that into account: we also use topics for research/analyses and in theory anyone could use them in Search or build their own tools around them that do e.g., keep radio as its own topic.

to make this taxonomy localizable, we need to decide a single source of truth. Do you think ORES is a good candidate?

@santhosh my recommendation is probably to actually stick to Growth's choices of grouping unless you have a strong reason not to. That should ease the sharing of translated topic labels and their choices were pretty reasonable and likely apply to your use-case too. It also helps with making sure that when we do finally update the model, updating the UIs/backends is hopefully simpler because there will be one shared approach to adjust as opposed to different logic for each system.

Regarding the POC patch, I would assume that the ORES extension is actually not the right place (despite the name). It's really about surfacing LiftWing predictions to RecentChanges, so it is only deployed in a small subset of wikis (extloc) and also serves a very different purpose from the recommender systems. If you do want to go that route or learn more, I'd reach out to @isarantopoulos, who is likely the best person for questions.

For the specific case of countries and user interaction, the idea is to show only some of the options initially (e.g., those relevant to the user based on location, previous choices or similar), with an option to access the rest

@Pginer-WMF that makes a lot of sense! If you ever decide you want a more hierarchical approach, all of the countries have an associated subcontinent and continent region (data) so e.g., you could also expose Europe, which on click would show Northern Europe; Southern Europe; etc. and then clicking on Northern Europe would show Denmark; Estonia; etc. Even if you don't go that route, you could always expose these choices via the search option if you think they'd be useful to end users. And then on the backend, they presumably would just be converted to the full list of countries.
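
A minimal sketch of the backend conversion mentioned above, assuming a nested continent, subcontinent, country mapping. The `REGIONS` data is a tiny illustrative sample rather than the model's actual region data, and `expand_to_countries` is a hypothetical helper.

```python
from typing import Dict, List

# continent -> subcontinent -> countries (illustrative sample only).
REGIONS: Dict[str, Dict[str, List[str]]] = {
    "Europe": {
        "Northern Europe": ["Denmark", "Estonia"],
        "Southern Europe": ["Greece", "Italy"],
    },
}


def expand_to_countries(selection: str) -> List[str]:
    """Convert a continent, subcontinent or country into a country list."""
    if selection in REGIONS:
        # Continent: flatten every subcontinent.
        return [c for countries in REGIONS[selection].values() for c in countries]
    for subregions in REGIONS.values():
        if selection in subregions:
            # Subcontinent: return its countries.
            return list(subregions[selection])
    # Anything else is treated as a single country.
    return [selection]


print(expand_to_countries("Europe"))
print(expand_to_countries("Northern Europe"))
```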

If you look at the taxonomy used for the newcomers tasks, which is very similar to what we have in the screenshots in this task, you'll notice that they made some recategorisation, exclusion and grouping to the original. I'm wondering if the official version we create should be closer to the original with all the levels and granularity or a version we believe is better suited to be presented to the users in various products, just like what Growth did.

Based on the input from @Isaac, I think it makes sense to use the Growth Newcomer groupings and to think of these as the official user-facing groupings. For these to actually become the official user-facing groupings, it would be great to:

  • Make this resource more global. That is, the code/schemas should not be tied to, or require, a particular Growth extension (I don't know whether this is currently the case). Any tool should be able to make use of this resource to present topic areas to users.
  • Ensure it is always in sync with the underlying technical ORES model. If new categories are added, removed, or they change names, the user facing groupings should be updated too. Otherwise, this will lead to exposing the wrong topic areas.
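
One way to check the second point automatically is a simple set comparison between the topics the model emits and the topics the user-facing groupings expose. This is only a sketch: `grouping_drift` is a hypothetical helper, and the sample data (including the `retired-topic` value) is made up for illustration.

```python
from typing import Dict, List, Set, Tuple


def grouping_drift(
    model_topics: Set[str], groupings: Dict[str, List[str]]
) -> Tuple[Set[str], Set[str]]:
    """Return (stale, unexposed) topics.

    stale: shown to users but no longer produced by the model.
    unexposed: produced by the model but not shown in any group.
    """
    exposed = {topic for topics in groupings.values() for topic in topics}
    return exposed - model_topics, model_topics - exposed


# Made-up sample data.
model = {"architecture", "literature", "entertainment", "radio"}
groups = {"culture": ["architecture", "literature", "entertainment", "retired-topic"]}
stale, unexposed = grouping_drift(model, groups)
print(stale, unexposed)
```

Stale entries are the ones that would expose wrong topic areas. Unexposed entries are not necessarily errors, since a grouping may deliberately fold a low-coverage topic such as radio into another one.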

Having said that, for the purpose of validating our hypothesis, I'm totally happy to hear about the pros and cons of other options. It seems OK to start with a less polished solution, be it the direct reuse of the Growth groupings or the use of ORES.

The ORES taxonomy is not just conceptual but also reflects a bit how the model was trained. From the model standpoint, it has 64 independent topics that can be assigned to any given article. We were following the English WikiProject Directory with some manual cleaning to make it a bit more coherent/feasible for our purposes. But while some topics were well-defined, others were a bit hazier. For instance, there's several topics listed under the group Media in the ORES taxonomy but there were WikiProjects that were clearly "Media" but didn't really fall into the specific sub-topics. So we'd create a special topic called Media* that, during model training, held these outliers as well as everything else in that Media group. For example, if an article is tagged with a WikiProject that is grouped under Radio, then it also gets automatically tagged with Media* while training. This is also helpful at inference time too (where we don't do this automatic assigning) because we might encounter an article that kinda looks like a Radio article to the model but is low confidence and the hope is that it might still be picked up by that catch-all Media* topic.

Thanks for providing the context on this @Isaac. So it seems that special topics such as Media* are used when the sub-topics do not cover all articles, which would otherwise leave some of them unclassified. That makes perfect sense. However, I think it may still be valid to consider whether adding special topics to those fully covered by sub-topics can be useful. For example, having Culture* to represent articles on any of the Culture sub-topics may be useful.

For our particular case, where we want to support multiple selection of topics to define intersections (logical AND), it may be useful to select "All Culture" and "Africa" for those interested in translating about all kinds of culture-related articles connected with Africa. Maybe this can be internally supported by our tool even if the underlying model does not have a specific Culture* tag, but could benefit from having it available already.

Just chiming in to say this all sounds great to me @Pginer-WMF !

However, I think it may still be valid to consider whether adding special topics to those fully covered by sub-topics can be useful. For example, having Culture* to represent articles on any of the Culture sub-topics may be useful.

Yeah, that should be easy to define on back-end and makes sense to me. You would get more control then too so could even e.g., decide to exclude biographies explicitly from that even though the original taxonomy has them grouped under Culture.
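
A rough sketch of how such a tool-side "All Culture" option could be combined with another selection. The names `VIRTUAL_TOPICS` and `build_query` and the member list are hypothetical. The code assumes that `|` means OR within one articletopic: clause and that separate clauses in the same query are combined with AND. Excluded topics are simply left out of the expansion, which is one reading of the suggestion above.

```python
from typing import Dict, Iterable, List

# Tool-defined virtual topics (illustrative members only).
VIRTUAL_TOPICS: Dict[str, List[str]] = {
    "all-culture": ["architecture", "literature", "biography"],
}


def build_query(selected: Iterable[str], excluded: Iterable[str] = ()) -> str:
    """Build one articletopic: clause per selection (AND between clauses)."""
    skip = set(excluded)
    clauses = []
    for topic in selected:
        # A virtual topic expands to its members; a real one stays as is.
        members = [m for m in VIRTUAL_TOPICS.get(topic, [topic]) if m not in skip]
        if members:
            clauses.append("articletopic:" + "|".join(members))
    return " ".join(clauses)


print(build_query(["all-culture", "africa"], excluded=["biography"]))
```

For the "All Culture" plus "Africa" example, this produces `articletopic:architecture|literature articletopic:africa`.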

Updated designs post usability tests. In these designs we have replaced the back arrow with a close icon and introduced a "Done" button to confirm topic selection. I'm going to update the task description after discussing it with other folks.

Basic topic selection.png (1×2 px, 293 KB)
Basic topic selection - Filter status.png (609×1 px, 38 KB)
Basic topic selection - Adjust suggestions.png (704×1 px, 40 KB)

Change #1071021 had a related patch set uploaded (by Sbisson; author: Sbisson):

[mediawiki/extensions/ContentTranslation@master] SX: translation recommendations based on topics

https://gerrit.wikimedia.org/r/1071021

Change #1068060 abandoned by Sbisson:

[mediawiki/extensions/ORES@master] PoC: Make available official list of topics and localized labels for reuse

Reason:

Need to find a better place for this. Main issue is this extension is not running everywhere.

https://gerrit.wikimedia.org/r/1068060

Change #1065191 merged by jenkins-bot:

[mediawiki/extensions/ContentTranslation@master] SX: Add api method for fetching most popular recommendations

https://gerrit.wikimedia.org/r/1065191

Change #1065192 merged by jenkins-bot:

[mediawiki/extensions/ContentTranslation@master] Unified Dashboard: Add "most popular" suggestion option

https://gerrit.wikimedia.org/r/1065192

Change #1071021 merged by jenkins-bot:

[mediawiki/extensions/ContentTranslation@master] SX: translation recommendations based on topics

https://gerrit.wikimedia.org/r/1071021

Change #1075030 had a related patch set uploaded (by Nik Gkountas; author: Nik Gkountas):

[mediawiki/extensions/ContentTranslation@master] CX3 Build 0.2.0+20240923

https://gerrit.wikimedia.org/r/1075030

Change #1075231 had a related patch set uploaded (by Nik Gkountas; author: Nik Gkountas):

[mediawiki/extensions/ContentTranslation@master] CX3 Build 0.2.0+20240923

https://gerrit.wikimedia.org/r/1075231

Change #1075030 abandoned by Nik Gkountas:

[mediawiki/extensions/ContentTranslation@master] CX3 Build 0.2.0+20240923

Reason:

Abandoned in favor of Iac0466a3d2bd906a33c1c6052a92b3be98f5b028

https://gerrit.wikimedia.org/r/1075030

Change #1075231 merged by jenkins-bot:

[mediawiki/extensions/ContentTranslation@master] CX3 Build 0.2.0+20240925

https://gerrit.wikimedia.org/r/1075231

Change #1075567 had a related patch set uploaded (by Sbisson; author: Nik Gkountas):

[mediawiki/extensions/ContentTranslation@wmf/1.43.0-wmf.24] CX3 Build 0.2.0+20240925

https://gerrit.wikimedia.org/r/1075567

Change #1075567 merged by jenkins-bot:

[mediawiki/extensions/ContentTranslation@wmf/1.43.0-wmf.24] CX3 Build 0.2.0+20240925

https://gerrit.wikimedia.org/r/1075567

Mentioned in SAL (#wikimedia-operations) [2024-09-25T14:22:59Z] <kartik@deploy1003> Finished scap sync-world: Backport for [[gerrit:1075567|CX3 Build 0.2.0+20240925 (T374387 T370746 T368422 T374567 T355780 T374559 T374886 T375410)]] (duration: 14m 06s)

Translator's note: label "More" is apparently hardcoded and cannot be translated.

Change #1077017 had a related patch set uploaded (by Sbisson; author: Sbisson):

[mediawiki/extensions/ContentTranslation@master] Fix: Localize 'More" filters button

https://gerrit.wikimedia.org/r/1077017

Translator's note: label "More" is apparently hardcoded and cannot be translated.

Good catch! I made a patch for it. Should be resolved soon.

Change #1077017 merged by jenkins-bot:

[mediawiki/extensions/ContentTranslation@master] Fix: Localize 'More" filters button

https://gerrit.wikimedia.org/r/1077017

Change #1079550 had a related patch set uploaded (by Nik Gkountas; author: Nik Gkountas):

[mediawiki/extensions/ContentTranslation@master] CX3 Build 0.2.0+20241011

https://gerrit.wikimedia.org/r/1079550

Change #1079550 merged by jenkins-bot:

[mediawiki/extensions/ContentTranslation@master] CX3 Build 0.2.0+20241018

https://gerrit.wikimedia.org/r/1079550

This has been reviewed and tested ad nauseam by engineers so I'm moving it to "Design signoff" for @Pginer-WMF and @SGautam_WMF to take a look and play with it before it goes to "Product signoff".