Explore ways to restrict suggestions to a given knowledge area
Closed, Resolved (Public)

Description

Content translation provides users with suggestions of articles to translate. Based on user feedback, it would be useful to provide some general control about the knowledge area these suggestions are about (T113257).

Currently, the suggestion system provides either general suggestions or suggestions based on a given "seed article" used as an example to get similar suggestions.

As an initial step, this ticket tries to explore possible answers to the following questions:

  • How to represent the wide knowledge areas (e.g., science, philosophy...)? Options include the vital article structure, categories, the ORES topic model, Wikidata, etc.
  • How to use the current recommendation system or expand it to get suggestions on a selected area? (e.g., pick a random article from category and use as seed)

This ticket is focused on the technical exploration.
The mockups below are just illustrations of how this could be supported in Content translation:

Structured set of categories for user to pick: cx-dash-suggest-config-arts.png (963×1 px, 226 KB)
Search for specific categories (with a predefined initial set): cx-dash-selected-suggestion.png (972×1 px, 178 KB)

Result

The recommended approach is to use a Wikidata query based on the instance of property (P31):

  • Topic areas will be compiled manually. A set of Wikidata elements will be manually selected to represent the topic areas (example). The selection may be based on main topic classifications or the list of (vital articles level 2). From there, items representing the topic areas such as Arts or Food will be extracted.
  • Query Wikidata for related articles. A Wikidata query (example) will find articles related to the selected topic area through the P31 property that can be translated (i.e., exist in source language but are missing in the target one).
  • Wikidata parameters for determining relevance. For each item in the results, additional Wikidata parameters, such as the number of languages in which the topic is available and the number of properties, are used as indicators of the relevance of the topic. More relevant items are shown higher in the list of results.
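
As a rough illustration of the three steps above, the sketch below assembles such a query for one topic area in Python. The function name build_topic_query and its parameters are hypothetical and not part of the proof-of-concept. It assumes direct P31 membership, uses schema:isPartOf to test both the source and the target wiki (the queries in the discussion use schema:inLanguage for the target), and uses the sitelink count as the only relevance signal. It only builds the query text; nothing is sent to the query service.

```python
def build_topic_query(topic_qid, source_lang, target_lang, limit=50):
    """Return SPARQL text for translation candidates in one topic area.

    Hypothetical helper: the name, parameters and validation rules are
    assumptions made for this sketch.
    """
    if not (topic_qid.startswith("Q") and topic_qid[1:].isdigit()):
        raise ValueError(f"unexpected Wikidata item id: {topic_qid!r}")
    for code in (source_lang, target_lang):
        # Language codes end up inside a URL, so keep them conservative.
        if not code.replace("-", "").isalnum():
            raise ValueError(f"unexpected language code: {code!r}")
    return f"""
SELECT DISTINCT ?item ?itemLabel ?itemDescription ?image ?linkcount WHERE {{
  ?item wdt:P31 wd:{topic_qid} .           # direct "instance of" only
  ?item wikibase:sitelinks ?linkcount .    # sitelink count as relevance signal
  ?article schema:about ?item ;
           schema:isPartOf <https://{source_lang}.wikipedia.org/> .
  OPTIONAL {{ ?item wdt:P18 ?image }}
  FILTER NOT EXISTS {{
    ?missing schema:about ?item ;
             schema:isPartOf <https://{target_lang}.wikipedia.org/> .
  }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{source_lang}" . }}
}}
ORDER BY DESC(?linkcount)
LIMIT {int(limit)}
""".strip()


if __name__ == "__main__":
    # Q2095 is the topic item used in the queries in this task.
    print(build_topic_query("Q2095", "en", "ml"))
```

Keeping a single topic item per query matches the "only one topic area at a time" constraint described under the design implications.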

A proof-of-concept was developed to try the approach. By default it shows articles related to a topic area, but the URL can be adjusted to check articles missing in a given language (example for Spanish)

image.png (892×994 px, 625 KB)

Design implications
When exploring UI options for adjusting the suggestions (T113257), the following considerations need to be taken into account:

  • Between 40 and 100 topic areas to support. The main topic classification provides 39 topic areas, while the vital articles provide 99 topic areas organized in 9 groups. The layout needs to account for this volume, making the groups more or less explicit (e.g., separate views for each group, groups in the same view, ordering in a flat list, etc.)
  • Only one topic area at a time. Wikidata query performance is heavily affected when there is more than one topic area selected. Thus, the selection will be presented as exclusive: only one topic area at a time.
  • Design for the wait. After selecting a topic area, users may have to wait a few seconds. Thus, we may need to think about how to provide a waiting indicator, and potentially an option to cancel or avoid blocking the UI.
  • Design for empty results. The query may lead to empty results for some languages. We need to explore how to better communicate that there were no results, try to prevent it or provide alternatives.
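
The "wait" and "empty results" cases above can be told apart in code before the UI decides what to show. The sketch below is only an assumption about how that could look: load_suggestions and the fetch callable are hypothetical names, and the 10-second default is arbitrary. It runs the query off the calling thread and reports one of three states.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def load_suggestions(fetch, topic, timeout_seconds=10.0):
    """Run fetch(topic) in a worker thread and classify the outcome.

    Returns a dict with status "ok", "empty" or "timeout". `fetch` is a
    hypothetical callable that returns a list of suggestions for a topic.
    Exceptions raised by `fetch` are propagated to the caller unchanged.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fetch, topic)
    try:
        results = future.result(timeout=timeout_seconds)
    except FutureTimeout:
        # The worker cannot be interrupted; we simply stop waiting for it.
        future.cancel()
        return {"status": "timeout", "items": []}
    finally:
        pool.shutdown(wait=False)
    if not results:
        return {"status": "empty", "items": []}
    return {"status": "ok", "items": list(results)}
```

A caller could map "timeout" to a retry or cancel option and "empty" to a message that suggests picking another topic area.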

Related and very relevant exploration and experiments from the Growth team: T231506#5487829

Event Timeline

Pginer-WMF updated the task description.

About this:

How to represent the wide knowledge areas (e.g., science, philosophy...)?

I've been pushing to start a project on a cross-lingual topic model that would allow us to make topic comparisons across languages. Unfortunately, this has not been prioritized for the current fiscal year, although we are receiving this requirement from multiple teams within the WMF.

However, I'll be happy to share some ideas on how to tackle this problem, although I won't have enough time to implement and test these solutions.

Another idea worth considering is for users to search/browse for their topic area using Categories. Then the system can pick 5 articles from that category and use them as seeds to get recommendations, remove duplicates and show in the suggestions.
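
A minimal sketch of that category-as-seed idea, under stated assumptions: list_members and recommend are hypothetical callables standing in for a category listing and for the existing seed-based recommender, and neither is a real API name.

```python
import random


def suggestions_from_category(category, list_members, recommend,
                              seed_count=5, rng=None):
    """Pick up to `seed_count` articles from a category, use each as a
    seed, and merge the recommendations without duplicates.

    `list_members(category)` and `recommend(seed)` are hypothetical
    callables that return lists of article titles.
    """
    if rng is None:
        rng = random.Random()
    members = list(list_members(category))
    seeds = rng.sample(members, min(seed_count, len(members)))
    seen = set()
    merged = []
    for seed in seeds:
        for title in recommend(seed):
            if title not in seen:  # drop duplicates across seeds
                seen.add(title)
                merged.append(title)
    return merged
```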

MMiller_WMF subscribed.

The Growth team is investigating ways to approach this very challenge in T231506.

Updates:

I was exploring Wikidata queries based on Property:P31 (instance of) https://www.wikidata.org/wiki/Property:P31 and subclasses (P279). Here is a sample query https://w.wiki/9Cb - it tries to find all articles in a given wiki whose Wikidata item has a given P31 value, or a subclass of that value.

For this ticket, we also need to find the articles that exist in one language and a category, and do not exist in another given language.

I am still trying to fine-tune my query. The topic classification kind of works,

SELECT DISTINCT ?categoryLabel ?item ?itemLabel ?itemDescription ?image
    WHERE
    {
      VALUES ?category {
        wd:Q2095
      }
      ?item p:P31/ps:P31/wdt:P279* ?category. # Item is of type or subtype an item in the above list of categories
      ?article schema:about ?item ; schema:isPartOf <https://en.wikipedia.org/>. # Exist in English
      OPTIONAL {?item wdt:P18 ?image }
      SERVICE wikibase:label {bd:serviceParam wikibase:language "en" .}
    }
    LIMIT 50

but filtering on whether the article exists currently times out if I use ?item p:P31/ps:P31/wdt:P279* ?category, so I used just the direct membership, as below:

SELECT DISTINCT ?categoryLabel ?item ?itemLabel ?itemDescription ?image
    WHERE
    {
      VALUES ?category {
        wd:Q2095
      }
      ?item wdt:P31 ?category. # Item is of type an item in the above list of categories
      ?article schema:about ?item ; schema:isPartOf <https://en.wikipedia.org/>. # Exist in English
      OPTIONAL {?item wdt:P18 ?image }
      FILTER NOT EXISTS { ?wen schema:about ?item ; schema:inLanguage "ml" } # Does not exist in ml
      SERVICE wikibase:label {bd:serviceParam wikibase:language "en" .}
    }
    LIMIT 50

I created a demo of the system, if you are interested in trying it out:

image.png (892×994 px, 653 KB)
image.png (892×994 px, 659 KB)

image.png (892×994 px, 625 KB)

I once made a personal effort to find the articles available in the most languages that are missing in Serbian, regardless of topic. I tried Wikidata first, but without any additional filtering the query would time out every time. Then I started querying with instanceof and similar properties, which yielded some results. Some topics with a large number of items, like movies or actors, would still time out. Many others, like entrepreneur or bank, completed successfully.
I tried modifying your query slightly, most notably adding ORDER BY DESC(?linkcount), and running it for all the categories you defined in topics.json. Some topics, like Q1071 (Geography), returned suggestions of low quality (and quantity). Others worked nicely.
I see that this approach could be useful. It depends strongly on Wikidata inter-item connections. We will need to make additional efforts to make some topics usable, maybe by combining multiple categories.
Also, there is the question of how generic or specific we want our topics to be, and how many of them we want. Do we want them to be configurable? Those are questions for @Pginer-WMF.

I solved the problem of my Wikidata timeouts with an SQL query run on Quarry. We could also have cxserver run the longer queries itself (with some cron job) and store the results to be served to users. Other solutions served by cxserver could be explored as well. Such a solution would most likely bring additional maintenance cost, which would be good to avoid.
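
To make the precomputation idea concrete, here is a minimal sketch of a file-based cache that a scheduled job could refresh. Everything in it is an assumption for illustration: cached_suggestions, the compute callable, the JSON file layout and the one-day default lifetime are hypothetical and do not describe cxserver.

```python
import json
import time
from pathlib import Path


def cached_suggestions(cache_dir, topic, target_lang, compute,
                       max_age_seconds=86400, now=None):
    """Return stored suggestions for (topic, target_lang), recomputing
    them with `compute(topic, target_lang)` when the entry is missing or
    older than `max_age_seconds`. `compute` stands in for a slow query.
    """
    now = time.time() if now is None else now
    path = Path(cache_dir) / f"{topic}-{target_lang}.json"
    if path.exists():
        entry = json.loads(path.read_text(encoding="utf-8"))
        if now - entry["saved_at"] < max_age_seconds:
            return entry["items"]
    items = list(compute(topic, target_lang))
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"saved_at": now, "items": items}),
                    encoding="utf-8")
    return items
```

The stored files are the maintenance cost mentioned above: something has to refresh them and clean them up.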

The best solution would be to find a good set of categories (or other properties) that produce satisfactory results across all desired topics, and have the query complete in a reasonable amount of time.

About the query timeouts, one thing I observed is that the time taken is a multiple of the number of P31 values we use at a time. If we use only one topic at a time, the time taken stays within reasonable limits. This can be handled by allowing only one topic selection at a time, as I did in my demo application.

You are right about choosing the correct Q value to cover a topic area. I found this tool https://www.npmjs.com/package/wikidata-taxonomy quite useful there. If you have a specific article, you can find the taxonomy from low level to high level as shown below:

$ nodejs wdtaxonomy.js -r Q1568
Hindi (Q1568) •186
└──Hindustani (Q11051) •93
   └──western Hindi languages (Q12600937) •2
      └──Hindi languages (Q2397881) •5
         └──Central Indo-Aryan languages (Q10979187) •5
            └──Indo-Aryan languages (Q33577) •88
               └──Indo-Iranian languages (Q33514) •90
                  └──Indo-European languages (Q19860) •168
                     ├──Nostratic languages (Q276314) •31 ×1
                     │  └──human language (Q20162172) •2 ×4 ↑
                     │     ├──language (Q315) •269 ×40
                     │     │  ├──communication medium (Q340169) •50 ×79
                     │     │  │  ├──information (Q11028) •134 ×48

I edited the description to capture a summary of the outcome and the design implications. Please, @santhosh and @Petar.petkovic, check and suggest any corrections.

I have some further questions about the implications of the approach:

  • Is it possible to support the option to discard suggestions? Current suggestions have an option ("x") to discard them, preventing them from being shown again. I was wondering whether such a system would work out-of-the-box with this new kind of suggestion, or whether it requires further consideration (e.g., hiding such option for these suggestions, or supporting it differently).
  • Is it feasible to get images for the topic areas? Topic areas are represented by Wikidata items which have an image associated, so I guess that image can be used if needed for the designs, but I wanted to check if this has bigger performance implications.

  • Is it possible to support the option to discard suggestions? Current suggestions have an option ("x") to discard them, preventing them from being shown again. I was wondering whether such a system would work out-of-the-box with this new kind of suggestion, or whether it requires further consideration (e.g., hiding such option for these suggestions, or supporting it differently).

I would recommend abandoning the discard option altogether - for this suggestion and the suggestions from the recommend tool. Discarding a suggestion, irrespective of the source of the suggestion, requires permanent storage of unique identifiers for each title in the MediaWiki database. Since this storage is outside the suggestion sources, and the suggestion sources don't take these discarded items as input to their queries, preparing suggestions by mixing discarded and proposed ones is an overhead. I don't think the feature has enough impact to justify the extra code, database storage and time taken for the mixing. Instead, we should provide a way to refresh suggestions easily.

  • Is it feasible to get images for the topic areas? Topic areas are represented by Wikidata items which have an image associated, so I guess that image can be used if needed for the designs, but I wanted to check if this has bigger performance implications.

It is possible to get images from the Wikidata query. Descriptions too. We will need a fallback image though, since not all items have images. Fetching image or description has no performance issues from my experiments.

  • About the 100 categories, I would suggest reducing them to a dozen or so. The use case is not exploration by topic, but exploration by topic with missing articles. From my experiments, a dozen topics is itself a good start. The narrower the selections, the more cognitive load there is in classifying your interest. The easier it is to choose the topic, the better. I doubt whether I would like to explore through topics and subtopics. The narrower the topic, the harder it becomes to understand why a result is in this category or not.

One of the reasons I did the demo system is to really see the filtering in action, and the above recommendations are based on that.

I would recommend abandoning the discard option altogether - for this suggestion and the suggestions from the recommend tool.

I think it makes sense. Currently, we are supporting different ways of operating the suggestions, but it makes sense to simplify and provide a clearer intended use: start translating, pick for later and refresh the list seem to be enough options for the different cases. I'll create a ticket to capture it.

Fetching image or description has no performance issues from my experiments.

This is good to know.

  • About the 100 categories, I would suggest reducing them to a dozen or so. The use case is not exploration by topic, but exploration by topic with missing articles. From my experiments, a dozen topics is itself a good start. The narrower the selections, the more cognitive load there is in classifying your interest. The easier it is to choose the topic, the better. I doubt whether I would like to explore through topics and subtopics. The narrower the topic, the harder it becomes to understand why a result is in this category or not.

I think you are bringing two separate considerations here:

  • One concern seems to be about too-narrow topic areas not leading to enough relevant results. I think the topics in the list of vital articles still seem general enough to have several articles available for translation. "Fire" is the only exception I found that does not look like a general category of topics. Maybe you can provide some examples based on those to illustrate the limitation.
  • The other seems to be about the right size for solving the user needs. I agree we need to find a balance between supporting users to focus on their specific interests (i.e., a rich catalogue) and facilitating the choice (i.e., quick selection). For design explorations I think it makes sense to design a system that could work for a list of 40 to 100 items. Compared with the thousands of Wikipedia categories, I think that the ~100 topics from vital articles are not that many if organized well, but that is something worth checking with users. It is also a relevant question which seems quicker for users to navigate: 9 groups of ~10 items each (from vital articles), or a flat list of 39 items (from main categories)?

One of the reasons I did the demo system is to really see the filtering in action, and the above recommendations are based on that.

Ok. thanks for sharing the main topic classification. I'll look into the category alternatives in more detail.

With the discard feature, we have the following workflow: 1) Load CX, 2) Open suggestions, 3) Discard unwanted, 4) Open CX next time, 5) If nothing changed for seed articles, we get the same suggestions, minus the discarded ones.
Without the discard feature, when we get the same set of articles, we need to reload all of them to get rid of the ones we don't want to see. Maybe the behavior where we see a lot of the same suggestions is the problem here, but it annoyed me as a user.
What I propose is to try to have discard for free or with minimal effort inside the new suggestion system. If it requires more effort to modify than to remove (removing also takes some effort, usually minimal), abandon the feature.

When it comes to topics, I prefer that we have more categories (those ~100 proposed) grouped by main category sets, rather than only 8-10 big categories.
I haven't queried Wikidata a lot, but being more specific about the topic usually yields better results. We also saw that using something like "Geography" with "instance of" produces terrible results. I remember wanting to search for the articles about rivers with the most interwiki links that are missing in Serbian, and that query times out. Being more specific (than "Geography") in that case is not a good option. Every category and subcategory needs to be carefully and manually selected, queries need to be crafted, and exploration is needed in general.

I think we need to recognize two extremes of category-based suggestions. One is to find articles from a narrow category such as "rivers in Serbia". Another one is "I am interested in Sports". The second type of user is more common because it is an abstract way of expressing an interest. The first one is about a specific search. If we try to serve both of these using a single interface, it is going to get complicated. We should also see how this selection is done in non-Wikipedia applications. I have used news reader applications, applications like Magzter etc., where they provide a single flat list just long enough to fill a mobile screen. For the first type of selection (narrow selection), a search box is given. If we enhance our source selector so that it can search in "Wikipedia article categories", we can still support such use cases. The nature of a second-level or narrower topic is that you can easily find one that is not in that list.

In short, I still recommend starting with a small, flat list of topics. Pau's draft design of category selection in this ticket is too complicated for usability and technical implementation (considering the current state of our dashboard code and front-end libraries).

With the discard feature, we have the following workflow: 1) Load CX, 2) Open suggestions, 3) Discard unwanted, 4) Open CX next time, 5) If nothing changed for seed articles, we get the same suggestions, minus the discarded ones.
Without the discard feature, when we get the same set of articles, we need to reload all of them to get rid of the ones we don't want to see. Maybe the behavior where we see a lot of the same suggestions is the problem here, but it annoyed me as a user.

That's a very useful observation. I think that in this case the underlying issue is that the list of suggestions needs a bit more randomness, to make sure that when a user accesses the list of suggestions at two different moments the list is different. It does not encourage going to the suggestions if you always expect the same static list. I think that fixing this may make the discard option no longer needed. I'll keep these considerations in mind.
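
One simple way to get that randomness, sketched under the assumption that the system can fetch a larger pool of candidates than it displays. The name pick_suggestions and the pool and count values in the usage lines are hypothetical.

```python
import random


def pick_suggestions(pool, count, rng=None):
    """Return up to `count` distinct suggestions drawn at random from a
    larger candidate pool, so repeat visits tend to show a different list.
    """
    if rng is None:
        rng = random.Random()
    candidates = list(pool)
    return rng.sample(candidates, min(count, len(candidates)))


if __name__ == "__main__":
    # Hypothetical pool of 50 candidates, of which 5 are shown per visit.
    pool = [f"Article {i}" for i in range(50)]
    print(pick_suggestions(pool, 5))
```

Uniform sampling ignores relevance; weighting the draw by a relevance signal would be a possible refinement.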

What I propose is to try to have discard for free or with minimal effort inside the new suggestion system. If it requires more effort to modify than to remove (removing also takes some effort, usually minimal), abandon the feature.

Well, if it comes for free we would remove a blocker. We could support the new suggestion system and then decide on removing the "X" (and associated changes as the one outlined above) independently. I'm ok with going step by step in iterations.

I think we need to recognize two extremes of category-based suggestions. One is to find articles from a narrow category such as "rivers in Serbia". Another one is "I am interested in Sports". The second type of user is more common because it is an abstract way of expressing an interest. The first one is about a specific search. If we try to serve both of these using a single interface, it is going to get complicated.

I don't think that anyone proposed to support the "rivers in Serbia" case or getting close to that extreme. I agree we want to provide general topics, but there is still a spectrum on how generic. Let's imagine a user who is interested in "Geography". Looking at the vital articles Level 2, there are the following topics in the "Geography" group:

  • Geography
  • City
  • Country
  • Land
  • Sea
  • Africa
  • Asia
  • Europe
  • North America
  • Oceania
  • South America

Would it be more useful for this user to have only one general "Geography" option (based on Wikipedia main topics) or to have the above list? I think that it is worth designing for both possibilities and learning from users. I really don't think that we can discard the approach based on level 2 of the vital articles by claiming that it is way too specific, since it is still quite far from the "rivers in Serbia" case.

We should also see how this selection is done in non-Wikipedia applications. I have used news reader applications, applications like Magzter etc., where they provide a single flat list just long enough to fill a mobile screen. For the first type of selection (narrow selection), a search box is given. If we enhance our source selector so that it can search in "Wikipedia article categories", we can still support such use cases. The nature of a second-level or narrower topic is that you can easily find one that is not in that list.

Considering a certain number of topic areas, it does not mean we need to show all of them at the same time. It would be possible to use the level 2 vital articles list, show only the main options and provide access to the rest through search. In any case, that's part of the design exploration for T235186: Design a way to control the criteria for suggestions
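
As a small illustration of "show only the main options and provide access to the rest through search", the sketch below filters a topic list. The function name visible_topics and the choice of featured topics are assumptions; the topic names are the Geography group listed earlier in this discussion.

```python
def visible_topics(topics, featured, search=""):
    """With no search text, show only the featured topics; otherwise show
    every topic whose name contains the search text (case-insensitive).
    """
    needle = search.strip().casefold()
    if not needle:
        return [t for t in topics if t in featured]
    return [t for t in topics if needle in t.casefold()]


if __name__ == "__main__":
    geography = ["Geography", "City", "Country", "Land", "Sea", "Africa",
                 "Asia", "Europe", "North America", "Oceania", "South America"]
    featured = {"Geography", "Europe"}  # hypothetical selection
    print(visible_topics(geography, featured))
    print(visible_topics(geography, featured, "america"))
```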