
Newcomer tasks: prototype topic matching
Closed, Resolved · Public

Description

At the core of the newcomer tasks feature is the list of recommended tasks. We want the tasks to be relevant to the newcomer's skills and interests. Via investigations in T230246, we have decided that the first version of this feature will use maintenance templates to find articles that need to be improved in specific ways. Via investigations in T230248, we have discussed several different approaches to matching those articles to a newcomer's interests.

We recognize that we won't know what approach makes the most sense until we try some out. So in this task, the idea is to produce some lists of recommendations in our target language to evaluate whether they look like good recommendations for a newcomer with a given set of topic interests.

Task types

We've decided to draw tasks from maintenance templates. Via T229430, our ambassadors have gathered the lists of maintenance templates that our target wikis use into this workbook. The tabs called "cswiki MM", "arwiki MM", and "kowiki MM" indicate via the "Task type" column which specific templates map to which task type. Some of the task types only exist in one wiki -- that is okay for the purposes of this prototyping. Templates that have no value in the "Task type" column are not part of this.

Topic matching

Our conversations have identified several possible approaches to narrow the recommended tasks to a user's topics of interest. The list below may not contain everything we've discussed, or the explanations may be wrong, so this prototyping can pursue whatever options look promising.

  • User selects high level topics from a list (e.g. "Art", "Music", "History"). Each of those topics has a hard-coded list of archetypical articles associated with them (e.g. "Music" may have "Orchestra", "Rock and roll", "Hip hop", "Beethoven", "Drum"), and a "More Like" algorithm is used to find similar articles. This method can draw on the topics currently being used in Question 3 of the welcome survey in each of our target wikis. The hard-coded seed lists could come from "vital articles lists" and use Wikidata to translate that set across languages.
  • User enters some topics of interest into a free text field (e.g. "Skateboarding superstars") that brings up results from article search (e.g. "Skateboarding"). The user selects some resulting article, and then we use the "More Like" algorithm to find similar articles.
  • User can type in a free text field or select from a list to choose amongst the categories available on the articles that have maintenance templates. To use categories effectively, this approach might need to crawl up or down the category tree. The category actually on an article is often much more specific than something a user would type in (e.g. "16th century Dutch painters"), while a category higher up in the tree is one they would type (e.g. "Painting").

Outputs

These are some of the desired outputs from this task (other useful outputs are welcome):

  • Some lists of topic inputs and resulting recommendations for each target wiki. In other words, what articles do we get if the user selects these topic options, or enters this free text?
  • Those same lists when narrowed to different task type groupings, along with counts of how many results there are.

We'll then get help from our ambassadors to determine if the outputs look like useful and relevant recommendations for newcomers.

Related Objects

Event Timeline

I think it's likely that the engineer working on this task will need help to define some of the inputs that a user would give. Please let me know what help you think you'll need!

I'm putting this straight into Ready for Development.

Sorry if I missed anything, but I'd like to add "topics are connected with high level categories, and subcategories are used to find articles" as an option. Communities put some work into making categories, so it makes sense to use them in this case IMO. That would need a way to handle missing categories though, but high level categories aren't likely to change, and relying on ambassadors for fixes if that's needed is acceptable IMO.

Looking at our topics and attempting to map them to the list of vital articles I see:

"welcomesurvey-question-topics-option-arts", // 1.4, 42 articles
"welcomesurvey-question-topics-option-science", // 1.9, 198 articles
"welcomesurvey-question-topics-option-geography", // 1.3, 96 articles
"welcomesurvey-question-topics-option-history", // 1.2, 78 articles
"welcomesurvey-question-topics-option-music", // 1.4.5, 8 articles
"welcomesurvey-question-topics-option-sports", // 1.6.5 (recreation and entertainment)
"welcomesurvey-question-topics-option-literature", // 1.4.4, 8 articles
"welcomesurvey-question-topics-option-religion", // 1.5.2, 15 articles
"welcomesurvey-question-topics-option-popular-culture", // 1.7.1 (general society and social sciences), 12 articles
"welcomesurvey-question-topics-option-entertainment", // 1.6.5 (recreation and entertainment), 15 articles
"welcomesurvey-question-topics-option-food-drink", // 1.6.4, 23 articles
"welcomesurvey-question-topics-option-biography", // 1.1 People, 129 articles
"welcomesurvey-question-topics-option-military", // XXX Doesn't map cleanly
"welcomesurvey-question-topics-option-economics", // 1.7.5 (25 articles)
"welcomesurvey-question-topics-option-technology", // 1.10 (101 articles)
"welcomesurvey-question-topics-option-film", // 1.4.7 (Visual arts? 10 articles)
"welcomesurvey-question-topics-option-philosophy", // 1.5.1 (20 articles)
"welcomesurvey-question-topics-option-business", // 1.7.5 (25 articles)
"welcomesurvey-question-topics-option-politics", // 1.7.4 (Politics and government, 25 articles)
"welcomesurvey-question-topics-option-government", // 1.7.4 (politics and government, 25 articles)
"welcomesurvey-question-topics-option-engineering", // 1.10 (Technology, 101 articles)
"welcomesurvey-question-topics-option-crafts-hobbies", // 1.4 (Arts? 42 articles)
"welcomesurvey-question-topics-option-games", // 1.6.5 (Recreation and entertainment [15 articles] )
"welcomesurvey-question-topics-option-health", // 1.8 (40 articles)
"welcomesurvey-question-topics-option-social-science", // 1.7 (society and social sciences)
"welcomesurvey-question-topics-option-transportation", // 1.10.10 (Transportation, 6 articles)
"welcomesurvey-question-topics-option-education" // 1.7.6 social issues, 33 articles

One idea would be to use the headings in the list of vital articles (with a few exceptions) as the new list of topics for the welcome survey. A morelike search using the articles in each "topic" (where "topic" equals a heading in the list of vital articles) would probably be a reasonable proxy for getting a set of articles a user might be interested in.

Also, I guess there is a character limit on what you can pass to morelike: the query morelike:Communism|Politics|Fascism|Political party|Political sciences|Colonialism|Imperialism|Government|Democracy|Dictatorship|Monarchy|Theocracy|Ideology|Anarchism|Conservatism|Liberalism|Nationalism|Socialism|State|Diplomacy|Military|European Union is 251 characters and I can't add more. This yields 508,869 articles on enwiki. I think we'd want to choose a random set of 10 articles from within a section for the morelike query.
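A minimal sketch of that idea: pick a random sample of seed titles from a vital-articles section and keep the morelike query under a character budget. The 250-character budget and the helper name are assumptions for illustration, not measured limits.

```python
# Sketch only: sample seed articles for a morelike query while respecting an
# assumed character budget for the search string.
import random

MAX_QUERY_LEN = 250  # assumption, based on the limit observed above

def build_morelike_query(seed_titles, sample_size=10):
    """Pick up to sample_size random seeds and join them into a morelike query."""
    picks = random.sample(seed_titles, min(sample_size, len(seed_titles)))
    query = "morelike:" + "|".join(picks)
    # Drop seeds from the end until the query fits the budget.
    while len(query) > MAX_QUERY_LEN and len(picks) > 1:
        picks.pop()
        query = "morelike:" + "|".join(picks)
    return query

# Example: seeds drawn from the "Politics and government" section of the vital articles list
politics_seeds = ["Communism", "Politics", "Fascism", "Political party", "Colonialism",
                  "Imperialism", "Government", "Democracy", "Dictatorship", "Monarchy",
                  "Theocracy", "Ideology", "Anarchism", "Conservatism", "Liberalism"]
print(build_morelike_query(politics_seeds))
```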

T159321 is the task that we will need resolved in order to test out these queries in production.

@kostajh -- I talked with @Pginer-WMF today about the Language Team's work on recommendations. They are currently successfully using the Research team's API to recommend articles (the same API that backs this). My understanding of how it works is:

  • If a user has any edits, it takes the most recent article they edited and finds similar articles that don't exist in the target language. When they click refresh to get more recommendations, it does the same thing, but with the second most recently edited article. So, for instance, the user might have most recently edited "Egg tart", and so they get a bunch of pastry recommendations. Then, second most recently, they might have edited "Chair", so when they click refresh, they get a bunch of furniture recommendations. (@Pginer-WMF -- when I looked at the recommendations I'm getting, it actually seems like my recommendations are all intermixed, so I'm not sure if I have recorded the rules correctly).
  • If a user has no edits, it just recommends articles that are missing in the target language, without respect to any topics.

Users like getting recommendations, but they are asking to be able to filter them by topic, like "Art" or "Music". So @Pginer-WMF created T229242 to figure out how to further filter the recommendations by topic. The two methods he hypothesizes in that task look like two that we're exploring, so they'll also be keeping an eye on our work here to advise how they want to proceed.

  • If a user has any edits, it takes the most recent article they edited and finds similar articles that don't exist in the target language. When they click refresh to get more recommendations, it does the same thing, but with the second most recently edited article. So, for instance, the user might have most recently edited "Egg tart", and so they get a bunch of pastry recommendations. Then, second most recently, they might have edited "Chair", so when they click refresh, they get a bunch of furniture recommendations. (@Pginer-WMF -- when I looked at the recommendations I'm getting, it actually seems like my recommendations are all intermixed, so I'm not sure if I have recorded the rules correctly).

My understanding of the current behavior is as described, but maybe I'm missing something (@santhosh may be able to clarify).

Inspecting the page I see that there is only one request from Content Translation to the recommendation API using "Sushi" as seed article, which is expected.

Screenshot 2019-09-06 at 13.44.19.png (794×1 px, 187 KB)

However the results shown in Content Translation and GapFinder are different:

  • 6 food-related results are present in both Content Translation and GapFinder (Rendang, Malaysian Indian cuisine, etc.)
  • Content Translation also includes 6 actresses/actors that seem unrelated to "sushi" (Andrea Martin, Tammin Sursok, Kristy McNichol, etc.).
  • GapFinder includes 6 other food-related results only shown there (Street food in South Korea, Bulgarian cuisine, etc.)
Recommendations on Content Translation | Recommendations on GapFinder
screencapture-en-wikipedia-org-wiki-Special-ContentTranslation-2019-09-06-13_41_26.png (1×1 px, 376 KB)
screencapture-recommend-wmflabs-org-2019-09-06-13_42_12.png (992×1 px, 786 KB)

The surprising part is that those movie star results are included as part of the response we get from the recommendation API:

[{"wikidata_id": "Q1520293", "title": "Rendang", "pageviews": 5550, "rank": 499.0},
 {"wikidata_id": "Q6741970", "title": "Malaysian_Indian_cuisine", "pageviews": 932, "rank": 497.0},
 {"wikidata_id": "Q442309", "title": "Andrea_Martin", "pageviews": 12418, "rank": 497.0},
 {"wikidata_id": "Q439895", "title": "Tammin_Sursok", "pageviews": 6610, "rank": 496.0},
 {"wikidata_id": "Q5439557", "title": "Feast_of_the_Seven_Fishes", "pageviews": 2840, "rank": 494.0},
 {"wikidata_id": "Q441452", "title": "Kristy_McNichol", "pageviews": 19313, "rank": 493.0},
 {"wikidata_id": "Q240658", "title": "Brenda_Vaccaro", "pageviews": 19083, "rank": 492.0},
 {"wikidata_id": "Q6054477", "title": "International_availability_of_McDonald's_products", "pageviews": 4850, "rank": 491.0},
 {"wikidata_id": "Q433059", "title": "Stephanie_March", "pageviews": 32661, "rank": 491.0},
 {"wikidata_id": "Q5407615", "title": "Guatemalan_cuisine", "pageviews": 1190, "rank": 490.0},
 {"wikidata_id": "Q1154469", "title": "Kikuko_Inoue", "pageviews": 2086, "rank": 488.0},
 {"wikidata_id": "Q716245", "title": "Banchan", "pageviews": 3454, "rank": 485.0}]

So Content translation is just rendering the suggestions provided for "sushi" from the API.

It is not clear to me why the Recommendation API is omitting six relevant results about food and returning six unrelated ones, nor why that is not the case for the GapFinder UI. There were some changes in the API versions and maybe we are not pointing to the correct one, or some regression happened. Maybe @leila can help to clarify.

In any case, thanks for pointing to this @MMiller_WMF.

Overdue summary of work-in-progress on this:

There is a prototype (source) for experimenting with these ideas. The configuration for the prototype is managed on wiki, via newcomertasks/topics/{lang}.json and newcomertasks/templates/{lang}.json, for example https://www.mediawiki.org/w/index.php?title=User:KHarlan_(WMF)/newcomertasks/topics/cs.json https://www.mediawiki.org/w/index.php?title=User:KHarlan_(WMF)/newcomertasks/templates/cs.json . The templates file is easier to describe, as it contains a mapping of templates we've identified in a Google sheet. A particular task type, e.g. "Copy editing", can be associated with multiple templates.

Topics are trickier. I have experimented with two approaches: one tries to approximate topics using "morelike" on ElasticSearch, and the other uses a category tree. Both share the idea that, when a user selects a topic in the UI like "Philosophy", we cast a wide net to find articles that we think the user would agree belong to that topic. Both also use the [hastemplate keyword](https://www.mediawiki.org/wiki/Help:CirrusSearch#Hastemplate) to pare the results down to articles which have "tasks" associated with them.

The morelike approach uses a mapping where, for example, we say that "Filosofie" in Czech wiki should use these articles as values to populate a morelikethis query:

"titles": [
           "Filosofie",
           "Poznatek",
           "Etika",
           "Logika",
           "Východní_filosofie",
           "Estetika",
           "Gnozeologie"
       ],

This generally works OK, but there are some articles returned in the results which will make no sense to the end user, although there is an internal logic to why they appear. For example on cswiki, we have the topic "Engineering" (Inženýrství) set to use morelikethis search for three articles in Czech wiki: Inženýrství (Engineering), Stavebnictví (Construction), Strojírenství (Mechanical engineering). If you go to the prototype and select "cs", "engineering" and "Links", you'll get a single result for Gruyères (https://cs.wikipedia.org/wiki/Gruy%C3%A8res). The search query is morelikethis:"Inženýrství|Stavebnictví|Strojírenství" hastemplate:"Wikifikovat", and in the text of the article on Gruyères there is a paragraph with the word "Stavebnictví" in it:

...In 2008, the total number of full-time jobs was 601. The number of jobs in the primary sector was 44, of which 39 in agriculture and 5 in forestry and timber production. The number of jobs in the secondary sector was 215, of which 120 (55.8%) in manufacturing and 95 (44.2%) in construction...

So, morelike did the right thing in finding this article, but for the end user it doesn't make sense to see this.

This led me to try a different approach using a category tree (as @Urbanecm suggested here: T231506#5453257). The overview: map each "topic" shown to the user in the UI to a top (or close to top) level category on the wiki; for example, I have mapped "Filosofie" to "Kategorie:Filosofie".

Then, execute two searches, one that does incategory:Filosofie hastemplate:{template} and a second which does deepcat:Filosofie hastemplate:{template}. The first query will pick up higher level pages which are directly under Philosophy and are very likely to be relevant. The second deepcat search will look through the category tree to find articles. However it will error out if the category has too many levels, so in that case the code gets all subcategories (Kategorie:Filosofie has 15 subcategories), and then it does the searches again (incategory + hastemplate / deepcat + hastemplate). And so on, until the category tree is exhausted.

Putting aside the silly number of API requests this involves [0], the deeper you go down the category tree, the less relevant the results become. After some experimentation on a single wiki (cswiki), it looks to me like crawling down 3 levels, or at most 4, is optimum for getting the most results while avoiding irrelevant ones. The strategy I have been working on is:

  • User selects "Filosofie" and "Kopírovat úpravy" for copy editing
  • For each template in "Kopírovat úpravy"
    • Perform an incategory:Filosofie hastemplate:{currentTemplate}, store results
    • Perform a deepcat:Filosofie hastemplate:{currentTemplate} search, store results.
      • If the deepcat search fails because we are too high up the tree, get the subcategories of the current category we're looking at (Filosofie), then execute the previous steps (incategory search). If the "max depth" to crawl has not yet been reached, also do a "deepcat" search. If that fails because we are too high up the tree, get the subcategories of the subcategory, and do the steps again, and so on (see the sketch after this list).
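A rough sketch of this strategy against the standard MediaWiki Action API (not the prototype's actual code). How an over-large deepcat query reports failure is an assumption here, as is the MAX_DEPTH value.

```python
# Rough sketch of the incategory/deepcat strategy above, using the MediaWiki
# Action API. The way a too-deep deepcat query fails (and therefore how
# search() detects it) is assumed for illustration.
import requests

API = "https://cs.wikipedia.org/w/api.php"
MAX_DEPTH = 3  # beyond 3-4 levels the results stop being relevant

def search(query):
    """Run a CirrusSearch query; return titles, or None if the search errors out."""
    params = {"action": "query", "list": "search", "srsearch": query,
              "srlimit": 50, "format": "json"}
    data = requests.get(API, params=params).json()
    if "error" in data:
        return None
    return [hit["title"] for hit in data["query"]["search"]]

def subcategories(category):
    """List the direct subcategories of a category."""
    params = {"action": "query", "list": "categorymembers",
              "cmtitle": f"Category:{category}", "cmtype": "subcat",
              "cmlimit": "max", "format": "json"}
    members = requests.get(API, params=params).json()["query"]["categorymembers"]
    return [m["title"].split(":", 1)[1] for m in members]

def tasks_in_topic(category, template, depth=0):
    """Collect articles carrying `template` in or below `category`."""
    results = set(search(f'incategory:"{category}" hastemplate:"{template}"') or [])
    deep = search(f'deepcat:"{category}" hastemplate:"{template}"')
    if deep is not None:
        results |= set(deep)
    elif depth < MAX_DEPTH:
        # deepcat failed (tree too large): repeat the process one level down
        for sub in subcategories(category):
            results |= tasks_in_topic(sub, template, depth + 1)
    return results

print(len(tasks_in_topic("Filosofie", "Upravit")))
```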

Again, this is a really silly number of API requests (thousands) so it's infeasible to do client side, but we could do something like this on the server side periodically and write the data to a table.

I'll clean up my implementation of the category search and push that tomorrow so @MMiller_WMF and others can experiment with it. In the meantime the current prototype offers the morelike search, so you can experiment with that for now.

My current recommendation is to pursue the category tree approach rather than crossing our fingers and hoping that morelike is good enough. If we find that some articles seem to be misplaced, we (or whoever) can edit their categories; but if a morelike search for a user looking for Engineering returns odd results because "construction" appears in an article about a town that's famous for its cheese, we don't have many good options.

[0] It does not seem possible to use a logical OR for template queries, so while it would be lovely to do deepcat:Filosofie hastemplate:Upravit|Kdo?|Kdy?|etc, CirrusSearch does not interpret multiple values to hastemplate as an OR, so instead each template + category is a unique request, like incategory:Filosofie hastemplate:Upravit, incategory:Filosofie hastemplate:Kdo?, etc etc.

@kostajh -- thank you for making the prototype, for experimenting different ways, and for your notes on this Phab task and in chat. I spent some time with the prototype, and I have some initial reactions and questions.

General questions

  • Have you considered using the API that @Pginer-WMF said they are using for recommendations with the Content Translation tool?
  • I have not been able to play with the seed lists (I think I need help to do that right), but I noticed that some of the seed articles can be pretty short, like this Czech one on "Knowledge". Maybe longer seed articles are better -- though I know we probably would need to pick a single seed list that would be used for all languages.
  • Having worked with the categories, what is your perception of how comparable and complete the category hierarchies are across the different languages? Does it seem like we can rely on categories in an arbitrary wiki?
  • I don't totally understand what kind of information we receive with which to rank or cutoff results. Are we able to get lots of results and just present them in ranked order? Or do we need to choose a cutoff?

Notes on the prototype itself

  • The morelike prototype isn't letting me scroll past 9 results, even when it says that hundreds of results are found.
  • The category prototype doesn't give results for Arabic or Korean -- probably you haven't added that yet.
  • You likely intend to do this, but a toggle for morelike vs. category inside one prototype would be great.
  • If the morelike search returns some kind of match score, could that be listed with the results? Perhaps then we could start getting a feel for a cutoff, if we need to have a cutoff. Maybe Gruyères has a really low score.
  • Is it easy to add more topics if we produce the list of seed articles? And find the right places on the category hierarchy?

Early results

I produced some results from Czech and Korean Wikipedias using the two prototypes and then looked up the articles to see how good a match they seem to be. The two tables below show the first 9 results I get when using both the morelike and category approaches with the topic set to Arts, along with a "no topic" approach as a control group. The task type I used is copy editing. The tables are just three lists next to each other -- there is no relationship between values in the same row. I crossed a value out when I thought the result was not related to the topic, and the header says how many of the 9 results I thought looked like good results for the topic. The control group ("no topic") shows how many articles might be "Arts" topics as a baseline, to see how much better these approaches do than random. I wasn't able to get results from the category prototype for Korean, so that column is blank.

It looks to me like both the morelike approach and the category approach have potential -- the category approach may be making matches that cleave closer to the topic area.

Czech

morelike (6/9) | category (7/9) | no topic (2/9)
A philosophy book | An arts fellowship program | "Chinese guardian lions"
"Art museum" | "Improvisation" | "Caffeine"
"Sundial" | "Fine art" | A river
"Church of St Peter and Paul" | A theater festival | "Security token"
"Media literacy" | "Mural art" | A German WWII soldier
An artist | A theater network | A Cambodian singer
A poet | An arts award | "Inflation"
A writer | A communist martyr | A politician
"Information retrieval" | "Psychological manipulation" | An ancient king

Korean

morelike (4/9) | category | no topic (0/9)
Balzac | | A soccer tactic
"Cult of personality" | | "Data Privacy Day"
"Wavelet" | | An electrical safety standard
"Biblical manuscript" | | A specific nuclear bunker
Renoir | | A political concept
"Introduction to quantum mechanics" | | "First Indochina War"
"Click-through rate" | | "Leader-member exchange theory"
Serge Gainsbourg | | "Terrestrial animal"
Ascetic Christian movement | | "Shortwave radio receiver"

Taking all this together, @kostajh, I think a good place to go is to shore up the prototype so that our ambassadors can spend time with it. I would also spend more time with it to do a longer version of the mini-analysis I did above. What we'll try to decide is: is one of these approaches good enough for a first version? If so, we'll ask the ambassadors to help us with seed lists or categories.

@kostajh -- thank you for making the prototype, for experimenting different ways, and for your notes on this Phab task and in chat. I spent some time with the prototype, and I have some initial reactions and questions.

Of course; thanks for reviewing and for your comments below.

General questions

  • Have you considered using the API that @Pginer-WMF said they are using for recommendations with the Content Translation tool?

Yes, I looked at it early on. I don't see how we could really utilize it for our project. It's focused on providing suggestions for content missing in the target language when looking at the source language. And on the backend it's using morelike search anyway, but without the ability to simultaneously include filter restrictions with hastemplate, which we need to efficiently pare down the results to articles which have tasks associated with them.

While looking back through the history of this task, I also looked briefly at the ORES drafttopic model. While it's not available on any of our target wikis, if it was available, it could be interesting to do something like:

Is a labeling campaign to get drafttopic enabled for our target wikis totally out of the question?

  • I have not been able to play with the seed lists (I think I need help to do that right), but I noticed that some of the seed articles can be pretty short, like this Czech one on "Knowledge". Maybe longer seed articles are better -- though I know we probably would need to pick a single seed list that would be used for all languages.

Try editing https://www.mediawiki.org/w/index.php?title=User:KHarlan_(WMF)/newcomertasks/topics/cs.json and adding article titles to a specific topic (or add a new topic), and experiment with using a single longer article or multiple shorter ones. Also in regular Special:Search you can do things like morelikethis:SomeArticle|SomeOtherArticle hastemplate:Upravit if that's easier for experimentation.

  • Having worked with the categories, what is your perception of how comparable and complete the category hierarchies across in the different languages? Does it seem like we can rely on categories in an arbitrary wiki?

I have not compared the different languages. I don't think the hierarchies are necessarily neat and tidy. [citation needed] From experimentation (as you can see with the category tree prototype), once you go one level down the tree and start querying subcategories, you can get some results that you wouldn't really expect to match the top level category, and if you go another step down then you really start to see results that make no sense when considering the top level category (but do make sense if you follow the subcategories of the subcategories of the parent category).

  • I don't totally understand what kind of information we receive with which to rank or cutoff results. Are we able to get lots of results and just present them in ranked order? Or do we need to choose a cutoff?

In theory we are getting back the most relevant results from morelike. In practice it doesn't really seem that way all the time, but it also depends on what you feed into the morelike search.

For category search, I haven't implemented it in the prototype, but we could prioritize results found with incategory at the top and second levels, and only include articles below that if we don't have enough unique articles to show. There's also the question of preventing collisions among editors.

Notes on the prototype itself

  • The morelike prototype isn't letting me scroll past 9 results, even when it says that hundreds of results are found.

Yes, I set the max display to 10, but I can adjust that.

  • The category prototype doesn't give results for Arabic or Korean -- probably you haven't added that yet.

Right. You can edit https://www.mediawiki.org/w/index.php?title=User:KHarlan_(WMF)/newcomertasks/topics/ar.json&action=edit so it looks like cs.json, which basically means adding a new line for category: "some category". However, I also just remembered there is some hardcoded logic for Czech in that prototype, so I need to update it before it would actually work. Still, if you want to populate the categories, that could be helpful.

  • You likely intend to do this, but a toggle for morelike vs. category inside one prototype would be great.

I hadn't, because it's more work and I'm trying to keep these things pretty nimble, but if we envision prototyping continuing throughout much of next week then I could do that, sure.

  • If the morelike search returns some kind of match score, could that be listed with the results? Perhaps then we could start getting a feel for a cutoff, if we need to have a cutoff. Maybe Gruyères has a really low score.

I need to investigate that.

  • Is it easy to add more topics if we produce the list of seed articles? And find the right places on the category hierarchy?

Yes; as for category hierarchy, that will be per-wiki and I have not investigated closely if our welcome survey topics align well with category trees on all the target wikis.

Taking all this together, @kostajh, I think a good place to go is to shore up the prototype so that our ambassadors can spend time with it. I would also spend more time with it to do a longer version of the mini-analysis I did above. What we'll try to decide is: is one of these approaches good enough for a first version? If so, we'll ask the ambassadors to help us with seed lists or categories.

For me so far the most promising prototype is the search-input strategy, which is up at https://deploy-preview-2--newcomertasks-prototype.netlify.com . It provides a search input field alongside the task type filters and performs searches with hastemplate appended. The idea is to mimic the search UI from MobileFrontend and provide dynamic feedback to the user until they find tasks they are interested in.

We could consider using topics to kick off some of the searches, e.g. selecting "Filosofie" from the MenuTagMultiselectWidget would prefill the search input with "Filosofie" and execute a search. I'll keep working on this one, and also polish the other prototypes a bit too.

One last thought about ORES drafttopic -- it might be worth coming up with a set of "task templates" for enwiki so we could experiment with what a ORES backed prototype looks like, because if it's dramatically better than the other approaches, then maybe the labeling campaign for the target wikis is worth doing.

For me so far the most promising prototype is the search-input strategy, which is up at https://deploy-preview-2--newcomertasks-prototype.netlify.com . It provides a search input field alongside the task type filters and performs searches with hastemplate appended. The idea is to mimic the search UI from MobileFrontend and provide dynamic feedback to the user until they find tasks they are interested in.

We could consider using topics to kick off some of the searches, e.g. selecting "Filosofie" from the MenuTagMultiselectWidget would prefill the search input with "Filosofie" and execute a search. I'll keep working on this one, and also polish the other prototypes a bit too.

Hi @kostajh - thanks for sharing all the prototypes! It's been interesting to play with them. I agree with @MMiller_WMF about sharing with ambassadors for their comments too (maybe there's a logical reason for the association of Gruyères with Building in Czech...)

My main thought is that it is important to show some 'starter' broad categories for people to easily select in the UI and get a sense of the search results, so breaking out the tags in this first prototype https://newcomertasks-prototype.netlify.com would be preferable, regardless of how the calculation is done in the background.

Secondly, is it just for ease of the prototype that the search terms apply an exclusive AND filter, rather than OR filtering?
My expectation is that it should be OR for the topic types, but for example when I search for cswiki Copyediting tasks in 'Art' there are 683 results, in 'Philosophy' there are 234 results, but in Art and Philosophy there are only 529 results.
A similar thing happens when using the search input only prototype when trying "art, design" or "art|design".

Third and final comment/question: in lieu of or instead of a match score, is it possible to sort by pageviews in the last 30 days instead? Especially when there are no topics selected, IMHO it makes sense for suggestions to be shown based on 'popularity' or relevance in terms of readership of the articles.

JSON user subpages of other users cannot be edited without admin rights. It might be worth making the username configurable, or moving the pages into the Project namespace.

@RHo thanks for your comments. I’ll reply in more detail later but for now wanted to leave a separate, brief comment.

Based on my investigation I recommend that the first version we put in front of users has no topic filter. This is because we can easily use ElasticSearch with task type filters (no need for us to create and manage a database table), it buys us more time to figure out an optimal way to provide a topic filter, and lastly, in theory, it allows us to show a percentage increase in engagement when we enable topic/relevancy filtering in a subsequent release.

Secondly, is it just for ease of the prototype that the search terms apply an AND exclusive filter, rather than an OR filtering?

No, just a buggy prototype. I'm exploring a different approach today but if that doesn't work well I'll circle back to this and update the prototype so that works properly.

Third and final comment/question, in lieu of or instead of a match score, is it possible to sort by pageviews in the last 30d instead? Especially for when the are no topics selected, IMHO it makes sense for suggestions to be shown based on 'popularity' or relevance in terms of readership of the articles.

I think we could do something like that; depending on what the final implementation looks like, it may be more or less expensive to do, which then makes it more or less practical. Specifically: if the task list is generated entirely on the client side, then we want to minimize the number of API calls and the time it takes to generate the task list, so there's less additional data we can pull in (like pageviews); but if we end up doing most of it server side, and we also store those results in some permanent storage (like a database table), then it's less of a problem.

For your last point, the idea is that if we have a pool of ~20,000 tasks in Czech (if I recall, this is the total of the articles tagged with various templates), then a user would be presented with a randomized list of 20, and then that specific set is ordered by pageviews?
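For what it's worth, a rough sketch of that flow (random 20 from the pool, then order that subset by views over the last 30 days), assuming the public Wikimedia Pageviews REST API; the helper names are made up for illustration:

```python
# Sketch: order a random sample of task articles by pageviews in the last 30
# days via the Wikimedia Pageviews REST API (per-article endpoint).
import random
from datetime import date, timedelta

import requests

PAGEVIEWS = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
             "{project}/all-access/user/{title}/daily/{start}/{end}")

def views_last_30_days(project, title):
    end = date.today()
    start = end - timedelta(days=30)
    url = PAGEVIEWS.format(project=project, title=title.replace(" ", "_"),
                           start=start.strftime("%Y%m%d"),
                           end=end.strftime("%Y%m%d"))
    resp = requests.get(url)
    if resp.status_code != 200:
        return 0  # no pageview data recorded for this article
    return sum(item["views"] for item in resp.json().get("items", []))

def sample_tasks_by_popularity(task_titles, project="cs.wikipedia.org", n=20):
    """Pick a random subset of the task pool and sort it by recent readership."""
    picks = random.sample(task_titles, min(n, len(task_titles)))
    return sorted(picks, key=lambda t: views_last_30_days(project, t), reverse=True)
```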

There's a pageview score in CirrusSearch documents, so while there is no pageview-based sort out of the box, it's probably easy to implement sorting the full resultset by views. (Is that useful though? It would give identical results to all users, which is something we wanted to avoid.)

There's a pageview score in CirrusSearch documents, so while there is no pageview-based sort out of the box, it's probably easy to implement sorting the full resultset by views. (Is that useful though? It would give identical results to all users, which is something we wanted to avoid.)

@Tgr that's what I was getting at with my comment above:

For your last point, the idea is that if we have a pool of ~20,000 tasks in Czech (if I recall, this is the total of the articles tagged with various templates), then a user would be presented with a randomized list of 20, and then that specific set is ordered by pageviews?

So I think we would not provide identical results to all users in this case.

I'm just saying, if we did want to sort all 20,000 tasks by pageviews, that seems technically feasible (which wasn't immediately obvious to me but apparently the CirrusSearch indexes already incorporate view information).

I was thinking about another idea today. It would be to populate the topic selection options in the task recommendations widget with the lists from each wiki's version of https://en.wikipedia.org/wiki/Category:Main_topic_classifications. The number of main topic classifications vary by language:

  • English: 39
  • Czech: 22
  • Arabic: 39
  • Korean: 22
  • Vietnamese: 24

Then, each article that has a task template associated with it would be analyzed by attempting to walk up the category tree to get to a top classification. While a human can, with good judgment and trial and error, walk their way up the category tree to get to a sensible top level topic classification, there are some problems with this due to how categories are added.

For example, here's a newly created page about Nik Kershaw, and it's tagged with the template Upravit so it needs some copy editing. The first category listed on the page is "English singers". It's possible to navigate your way up the tree to end up at the top level classification of "Art" (putting aside that Music is probably more appropriate; however, that's not a top level classification on the list on cswiki).

However it's equally possible to follow the categories of "Born March 1", "Born 1958", or "Born in Bristol" and end up at either a very generic "History" top level classification or "Geography". If our end user is filtering the suggested tasks module and they select "Geography", it really would not make sense for them to see an article about this musician.

Another modification of the above approach is to walk up the tree by considering only the first category, in the hope that whoever edited the categories intended the first one to be the "primary" one. Looking at the Kershaw example again, you'd follow:

  • English singers
  • British singers
  • Singers by country
  • Singers
  • Musicians
  • Artists
  • Art

OK, that seems fine.

With another random example (Bůh = God):

  • God
  • Deities
  • Mythical creatures and races
  • Myths and rumors
  • Religion
  • Study of religions
  • Humanities
  • Humanities and social sciences
  • Science

I guess that one is a bit more problematic, but a workaround would be to list "Religion" as a top level classification in our software if it's important to us that it appear as a topic.

Another example (Lezení_na_obtížnost, "lead climbing"):

  • Sport climbing (this category has "Sports", a top level topic also tagged)
  • Camp
  • Mountaineering
  • Hiking
  • Tourism
  • Travel
  • Transport
  • Services
  • Everyday life
  • Society

That one isn't really intelligible to the end user, but one idea would be to short circuit walking up the category tree if we see any categories on a particular branch are in the set of top level topic classifications. In that case, we could stop at the first category (Sport climbing) and declare that the article belongs in the "Sports" topic.

So, maybe we could try this approach, and the rules would be:

  • Look at the categories in the article. If any of the categories are top level topics, assign the article to that topic and we're done.
  • If not, look at categories that belong to the first category listed in the article. If any are top level, assign the article to the topic and we're done.
  • If not, look at the first category in the current category, etc., all the way up the tree (a rough sketch follows this list).
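A rough sketch of those rules against the MediaWiki Action API (not an actual implementation). Whether prop=categories returns categories in on-page order is an assumption here, and a step limit is added to avoid walking forever:

```python
# Sketch of the walk-up-the-tree rules: stop as soon as any category on the
# current page is a top-level topic classification, otherwise continue from
# the first listed category.
import requests

API = "https://cs.wikipedia.org/w/api.php"

def categories_of(title):
    """Non-hidden categories of a page. Ordering is whatever the API returns,
    which this sketch assumes approximates the order on the page."""
    params = {"action": "query", "prop": "categories", "titles": title,
              "clshow": "!hidden", "cllimit": "max", "format": "json"}
    page = next(iter(requests.get(API, params=params).json()["query"]["pages"].values()))
    return [c["title"] for c in page.get("categories", [])]

def topic_for(title, top_level_topics, max_steps=10):
    """Walk up the category tree until a top-level topic classification is hit."""
    current = title
    for _ in range(max_steps):
        cats = categories_of(current)
        if not cats:
            return None
        for cat in cats:
            if cat in top_level_topics:
                return cat  # rule 1: a top-level topic appears on this page/category
        current = cats[0]  # rules 2-3: continue from the first listed category
    return None  # give up rather than navigate endlessly
```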

Doing this is not going to be feasible on the client-side because of the number of queries we'd have to make. Instead, we'll want to have either a database table where we store information about the articles, the tasks (templates) associated with them, and the high level topics. Or, we could probably get away with storing the high level topic as a page_prop or as another bit of metadata in ElasticSearch. The advantage of doing the latter is that we'll be able to do hastemplate:{pipe_delimited_list_of_templates} {some_page_prop_query}:{topic} on the client-side.

Assuming the above requires the creation of a database table and scripts / hooks for populating and keeping this table updated, as an interim measure we may want to map top level topic names to articles, so clicking a topic would, on the backend, execute morelikethis:{pipe_delimited_list_of_titles} hastemplate:{pipe_delimited_list_of_templates}. While this approach has serious flaws (see above about Gruyères appearing when searching for engineering), it's arguably better than no topic matching whatsoever, and it's pretty easy to implement.

That one isn't really intelligible to the end user, but one idea would be to short circuit walking up the category tree if we see any categories on a particular branch are in the set of top level topic classifications. In that case, we could stop at the first category (Sport climbing) and declare that the article belongs in the "Sports" topic.

Another tweak would be to consider portals listed on an article, and inspect the first portal before the first category; e.g. for https://cs.wikipedia.org/wiki/Mohu%C4%8D, Portal:Geography leads to the final category of Geography much faster (2 levels) than navigating the category at the bottom of the article (~10 levels), which also leads to Geography in the end.

Having to maintain our own DB-based search index seems like a bad place to be. Using ElasticSearch would both be a lot more flexible and more in line with what ES vs. databases are commonly used for. Especially since in a future version we might want to weight multiple factors (relevance to the topic filter, the user's past contribution history, maybe some amount of randomness), and ES makes that kind of mixing easy while a manual DB-based approach doesn't.

Alternatively, we have a graph database for categories (powering search features like deepcat:) which I imagine might be accessed by application code directly. I don't know much about graph DBs, but I imagine it would be able to handle the described tree walk efficiently in a single query.

Having to maintain our own DB-based search index seems like a bad place to be. Using ElasticSearch would both be a lot more flexible and more in line with what ES vs. databases are commonly used for

Right, that's what I was getting at with my comment about storing the topic as a page prop or custom metadata field. The main point I'm trying to express is it seems likely that we will need to calculate the topic per article and store it.

On another note, @Halfak suggested an idea for using ORES drafttopic in the short term, which is to find the language link for the article we are looking at, then query the drafttopic model on enwiki, and use that to set the topic.

While there are many articles that won't have an equivalent in enwiki, we could fallback to the walk-up-the-category-tree approach proposed in T231506#5495917 for those.

@MMiller_WMF do you have thoughts on this? You can use this query to get a sense of results with the Upravit template on cswiki and view their equivalent pages on enwiki, then you can go to view history to find the latest rev ID, then you can plug that into https://ores.wikimedia.org/v3/scores/enwiki/{revId}/drafttopic to see the predictions.

For example, cs.wikipedia.org/wiki/Tyrol is tagged with {{Upravit}}, and the equivalent page on enwiki is en.wikipedia.org/wiki/Tyrol_(state), looking at its history page https://en.wikipedia.org/w/index.php?title=Tyrol_(state)&oldid=910351642o get the latest revision ID and plugging that into ORES (https://ores.wikimedia.org/v3/scores/enwiki/910351642/drafttopic) we get Geography.Europe as the topic.

@kostajh -- I've just read over everything and thought about how to proceed forward. You've explored and prototyped several different approaches, and you and others have posted notes on their advantages, disadvantages, and risks. I think these are the approaches you've mentioned, but I may be missing others:

  • Seed list for morelike (prototyped)
  • Topic map to category hierarchy (prototyped)
  • Free-text search (prototyped)
  • Crawl from article to category hierarchy (discussed)
  • ORES walkover from English (discussed)

I can tell you that I'm not as optimistic about the latter two "discussed" ones as I am about the three "prototyped" ones. I'm worried about crawling the category hierarchy because we know little about it, and I'm concerned that it is like a maze with unpredictable dead-ends across languages. I think that such an approach would have tons of edge cases that would be hard to notice and to troubleshoot. Regarding the ORES walkover, I know that whatever method we go with first is just a proof-of-concept, but I think it is risky to rely on another team to build models for our target wikis. Also, I think that the ORES topic models are made from WikiProjects, and our target wikis have weak WikiProjects, so I'm not sure how they would get built there? Maybe a thing to ask about.

So anyway, what I would like to be able to do next is to try out several different approaches side-by-side in three target languages, and three you've already prototyped would be good. In other words, I don't think we need to develop more approaches -- just set up the ones we have so we can try them out. So, then, some things I imagine would need to be done:

  • It would be wonderful for them all to be in one app together, but I understand if that's annoying to do. So if we don't do that, we'll just need to know which URL goes with which approach.
  • We'll need to have the sets of seed articles and categories configured for each wiki. I think you could tell us what we'll need to fill in where, and lay those pointers all out for us in one place. I'm not currently able to edit the configs, so we'll need help with that.
  • If there are any idiosyncrasies we should know about the prototypes, that would help. For instance, it looks to me like the free-text prototype only works right if you put in the topic before selecting a task type.
  • We'll want the prototypes to output more than 10 results. Maybe 100 would be good.
  • I'm very interested in knowing if there is any type of match score that can be displayed with the results, so that we can think about cutoffs (if needed).

Thanks for your feedback @MMiller_WMF.

I can tell you that I'm not as optimistic about the latter two "discussed" ones as I am about the three "prototyped" ones. I'm worried about crawling the category hierarchy because we know little about it, and I'm concerned that it is like a maze with unpredictable dead-ends across languages.

Right. But in practice it seems possible to navigate to the top level category classifications within 5-10 steps up the tree. We would want to enforce a maximum number of steps to prevent endless navigation.

I think that such an approach would have tons of edge cases that would be hard to notice and to troubleshoot

I don't know, I think if we see a problem with an article being assigned to an incorrect category then it's easy enough to troubleshoot to see how it was assigned.

Regarding the ORES walkover, I know that whatever method we go with first is just a proof-of-concept, but I think it is risky to rely on another team to build models for our target wikis. Also, I think that the ORES topic models are made from WikiProjects, and our target wikis have weak WikiProjects, so I'm not sure how they would get built there? Maybe a thing to ask about.

This is a bit different than what @Halfak proposed, at least as I understood it. Longer term, yes, having models directly for our target wikis, where we collaborate on what the topic classifications should be, would be ideal. But this is different; you'd find the equivalent article on enwiki, get its topic classifier from ORES drafttopic, then assign that topic to the local wiki article.

In other words, I don't think we need to develop more approaches -- just set up the ones we have so we can try them out.

I had already started on scripting out an analysis of using ORES drafttopic before this comment came in, so this morning I finished up working on it. The data is here https://docs.google.com/spreadsheets/d/10KoicUdToW_cWNG2DSXc5eLvy0CfEwxkIMq2vHxweus/edit?usp=sharing. The code to generate the data works like this:

  1. Get a list of predefined templates from https://www.mediawiki.org/wiki/User:KHarlan_(WMF)/newcomertasks/templates/{lang}.json, e.g. https://www.mediawiki.org/wiki/User:KHarlan_(WMF)/newcomertasks/templates/cs.json
  2. For each template:
    1. Do a hastemplate:{templateName} search on the target wiki for up to 100 items, using a random flag for the sorting option.
    2. For each search result
      1. See if there is a language link to a corresponding enwiki article. If not:
        1. See if there is a wikidata ID.
          1. If not, write a mostly empty record to the database (source language article title, template name) and go on to the next search result
          2. If there is a wikidata ID, make an API request to wikidata to get information about that entity.
            1. If there is no enwiki article associated with the Wikidata item, write a mostly empty record to the database and go to the next search result. Otherwise, set our "enwiki title" value to whatever we got back from Wikidata.
      2. Now that we have an enwiki title that corresponds to our local language title, make an API request to enwiki to get its latest revision ID
        1. Sometimes we don't have a revision ID; this happens when the Wikidata response includes an English label for the item but does not reference a specific enwiki article. In that case, write a mostly empty record to the DB and carry on to the next search result.
        2. If we do have a revision ID, now make a request to ORES drafttopic with that revision ID.
      3. ORES drafttopic sometimes does not have a prediction, so in that case leave topic blank. Then write the record to the database.

Then I exported each language from the MySQL database to CSV and uploaded that into Google Sheets.
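A condensed sketch of the core of those steps (interlanguage-link lookup, latest enwiki revision, ORES drafttopic); the Wikidata-sitelink fallback and the database writes are omitted, and the helper names are made up for illustration:

```python
# Sketch: for a target-wiki article, find its enwiki equivalent via langlinks,
# then ask ORES drafttopic for a prediction on the latest enwiki revision.
import requests

def enwiki_title(lang, title):
    """Follow the interlanguage link to enwiki, if there is one."""
    api = f"https://{lang}.wikipedia.org/w/api.php"
    params = {"action": "query", "prop": "langlinks", "titles": title,
              "lllang": "en", "format": "json"}
    page = next(iter(requests.get(api, params=params).json()["query"]["pages"].values()))
    links = page.get("langlinks", [])
    return links[0]["*"] if links else None

def latest_enwiki_revid(title):
    """Latest revision ID of an enwiki article."""
    params = {"action": "query", "prop": "revisions", "titles": title,
              "rvprop": "ids", "rvlimit": 1, "format": "json"}
    page = next(iter(requests.get("https://en.wikipedia.org/w/api.php",
                                  params=params).json()["query"]["pages"].values()))
    revs = page.get("revisions", [])
    return revs[0]["revid"] if revs else None

def drafttopic_prediction(revid):
    """Topic classes predicted by the ORES drafttopic model for a revision."""
    url = f"https://ores.wikimedia.org/v3/scores/enwiki/{revid}/drafttopic"
    score = requests.get(url).json()["enwiki"]["scores"][str(revid)]["drafttopic"]["score"]
    return score["prediction"]  # may be an empty list if ORES has no prediction

# Example from the comment above: the cswiki article "Tyrol"
en = enwiki_title("cs", "Tyrol")
if en and (revid := latest_enwiki_revid(en)):
    print(drafttopic_prediction(revid))
```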


Some observations:

language | articles with ORES topic | articles without ORES topic | total | percentage with topics | articles with enwiki equivalents but no ORES prediction
cs | 578 | 388 | 966 | 59% | 77
ko | 555 | 423 | 978 | 56% | 141
ar | 783 | 482 | 1265 | 61% | 86

There are a decent chunk of articles that have enwiki equivalents but no ORES prediction (for example, enwiki article for "Visual pollution").

Then there are quite a few more where there doesn't appear to be an enwiki equivalent; from a superficial analysis it seems like these were more niche articles, like an article about a Czech musician who may not be considered notable on enwiki. I don't know if it's better, or worse, or makes no difference whatsoever for a newcomer to edit articles that tend to only exist on one or two language wikis, but that's something to consider.

For the articles without ORES predicted topics in enwiki, it might be worth considering the "walk up the category tree" approach to fill in the blanks. Or, given that we have a return of about 56-61% on getting predictions, perhaps that is enough to have tasks associated with topics, and the remaining uncategorized tasks could be part of the pool that displays to the user when no topic filter is set.

As for the quality of predictions, it seems mostly pretty good, although there is stuff that is a stretch, and then some things that are wrong. My subjective impression is that it's better than the morelike attempt to associate a "topic" with a set of articles, but how much better, I'm not really sure.


Re the prototypes, I'm kind of thinking the easiest way to assess the morelike and free-text strategies is by using Special:Search directly. Once MW-1.34-notes (1.34.0-wmf.23; 2019-09-17) is in production on Thursday, you can do something like [morelikethis:Filosofie|Etika|Logika hastemplate:Upravit|Kdy\?|Kdo\?|Pravopis|Sloh|Transkripce|Reklama|NPOV|Kým\?|Jaký\?|Který](https://cs.wikipedia.org/w/index.php?sort=relevance&search=morelikethis%3AFilosofie%7CEtika%7CLogika+hastemplate%3AUpravit%7CKdy%5C%3F%7CKdo%5C%3F%7CPravopis%7CSloh%7CTranskripce%7CReklama%7CNPOV%7CK%C3%BDm%5C%3F%7CJak%C3%BD%5C%3F%7CKter%C3%BD&title=Speci%C3%A1ln%C3%AD%3AHled%C3%A1n%C3%AD&profile=advanced&fulltext=1&advancedSearch-current=%7B%7D&ns0=1); currently you can only use one hastemplate query at a time: [morelikethis:Filosofie|Etika|Logika hastemplate:Upravit](https://cs.wikipedia.org/w/index.php?sort=relevance&search=morelikethis%3AFilosofie%7CEtika%7CLogika+hastemplate%3AUpravit&title=Speci%C3%A1ln%C3%AD%3AHled%C3%A1n%C3%AD&profile=advanced&fulltext=1&advancedSearch-current=%7B%7D&ns0=1). IMO that's a good starting point because you can easily experiment with using different seed articles (or a single article) as the basis of the morelike search, and the hastemplate part doesn't matter so much; you basically want to know whether the list of article titles passed to morelikethis provides search results that we could say belong to a particular topic.

For the topic map to category hierarchy approach (walking down the tree from a top level category to find articles with hastemplate for the templates we're interested in), I don't think this is really a great approach. It misses a ton of stuff, is costly in terms of the number of queries you need to make, and its accuracy/relevancy decreases significantly once you expand the depth beyond 2 levels.

I'm very interested in knowing if there is any type of match score that can be displayed with the results, so that we can think about cutoffs (if needed).

There are a bunch of settings you can play with to modify the morelike search. We might need to meet with @dcausse who's been helping me so far in fine tuning the morelike query. @dcausse recommended we start with classic_noboostlinks which is what RelatedArticles uses to show you a related article on mobile (and is what the morelike prototype uses currently).

Re. using the ORES predictions, you probably don't want to use the "prediction" field directly. I would suggest instead pulling in any predicted class that is above 0.05 probability. This seems to work really well.
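In code terms (a hypothetical helper, using the ORES v3 response structure), that suggestion would look roughly like:

```python
# Sketch: keep every drafttopic class whose probability clears the threshold,
# rather than relying on the "prediction" field.
def topics_above_threshold(drafttopic_score, threshold=0.05):
    """drafttopic_score is the "score" object from an ORES v3 drafttopic response."""
    return [cls for cls, prob in drafttopic_score["probability"].items()
            if prob >= threshold]
```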

it is risky to rely on another team to build models for our target wikis.

This is something I hope we would coordinate and develop as a shared goal. We're either going to do it with the intention that y'all will make use of it or we're going to prioritize something else to work on instead.

Also, I think that the ORES topic models are made from WikiProjects, and our target wikis have weak WikiProjects, so I'm not sure how they would get built there?

We'll be using sitelinks to transfer training data from one wiki to another. E.g. https://en.wikipedia.org/wiki/Henry_III_of_England is tagged by WikiProject Military history so it gets labeled as the mid-level category "History_And_Society.Military and warfare". There's an equivalent article in cswiki: https://cs.wikipedia.org/wiki/Jind%C5%99ich_III._Plantagenet. We would use the site link to label that article as "History_And_Society.Military and warfare" and use that as training data for the topic model. Assuming we have enough cross-wiki overlap in our training data, we'll be able to build an effective topic model that can then be applied to any article in cswiki -- whether or not there is a sitelinked entity.

The biggest downside to this approach is that it would be using the taxonomy from enwiki in all other wikis. The upside is that we can probably get a few new topic models out in a quarter with this kind of approach, and you would be able to apply it to any article, including new article drafts.

@kostajh and I discussed this today, and here's how we decided to proceed:

  • We want to take seriously these approaches:
    • Seed list for morelike
    • Free-text search
    • ORES walkover from English
  • Basically, that excludes the approaches that rely on categories, because of the difficulty and computational annoyance of using them.
  • @kostajh is going to get webapps ready for those three approaches.
  • To back the "seed list for morelike" approach, I'll be making a spreadsheet for the ambassadors (@Dyolf77_WMF @Urbanecm @revi) to fill in to produce seed articles for a set of about 28 topics. Phabricator task to come on that -- and hoping to have lists by the end of the next week.

Taking all that together, I'd like us to be able to send the prototypes to ambassadors during the week of Sep 30 for them to start trying out. Then we'll have a sense of which method we want to pursue to include topic matching in suggested edits by the end of Q2.

To back the "seed list for morelike" approach, I'll be making a spreadsheet for the ambassadors (@Dyolf77_WMF @Urbanecm @revi) to fill in to produce seed articles for a set of about 28 topics. Phabricator task to come on that -- and hoping to have lists by the end of the next week.

I'm wondering if we could also ask for a representative category (from the local wiki's version of https://en.wikipedia.org/wiki/Category:Main_topic_classifications), if not for each one of these 28 topics, then for say 10 of them, because...

Basically, that excludes the approaches that rely on categories, because of the difficulty and computational annoyance of using them.

I think we should still keep "walk-up-the-category-tree" (T231506#5495917) in our back pocket, in case "morelike", free text, or ORES-from-English don't provide satisfactory results. Prototyping it is a little more complicated because really what we want to do is take the full set of results returned for the hastemplate queries, post-process all of the individual articles to assign topics to them via the category tree, and then store that so it's only calculated once. I can do that (I've done something similar with the ORES topic matching), but I would rather have the definitive list of categories from the ambassadors before putting any time into it.

@kostajh -- let's revisit the category approach if we don't like any of the other three approaches that you're setting up now. We can ask the ambassadors to add categories later if we want to try it out.

Here's the updated prototype for morelike and freetext search: https://newcomertasks-prototype.netlify.com/ (source: https://github.com/kostajh/newcomertasks-prototype/pull/3). Within this prototype you can explore:

  • morelike using a single article (first article) from the topic list
  • morelike with a logical OR filter for articles (morelike:Sports OR morelike:Football, rather than morelike:Football|Sports, which works as an AND)
  • Adjust the qi profile for the morelike search
  • Use free text search to override all of the topic filters and perform a regular keyword search combined with the hastemplate query from the task type selections (see the sketch after this list)
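A hedged sketch of what one of those searches might look like as a raw MediaWiki Action API request, assuming the prototype combines a free-text keyword, the hastemplate filter from the selected task type, and an adjustable srqiprofile ranking profile (the free-text term, template, and profile values are illustrative):

```
import requests

CSWIKI_API = "https://cs.wikipedia.org/w/api.php"

params = {
    "action": "query",
    "list": "search",
    # Free text ("fotbal") plus the hastemplate filter from the task type selection.
    "srsearch": 'hastemplate:"Wikifikovat" fotbal',
    # The "qi profile" knob mentioned above; this value is just an example.
    "srqiprofile": "classic_noboostlinks",
    "srlimit": 50,
    "format": "json",
    "formatversion": 2,
}
hits = requests.get(CSWIKI_API, params=params).json()["query"]["search"]
for hit in hits:
    print(hit["title"])
```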

I copied over all the topics from Czech and Arabic in the spreadsheet, and I moved the configuration pages from my user space to Growth/Personalized_first_day/Newcomer_tasks/Prototype/templates and Growth/Personalized_first_day/Newcomer_tasks/Prototype/topics, so if someone updates ko.json and vi.json with the correct template and topic values, those will show up in the prototype as well.

Thanks, @kostajh. I have been playing with the prototype today, and I'd like you to check out my notes below. It's not that I want us to perfect this prototype -- it's that I think I'm identifying things that could cause us to inaccurately evaluate the different topic matching methods.

I think there is potentially something wrong with the way the tool is caching things between searches. In a couple of gifs below, you can see me switching between topics ("Food and drink" and "Geography"). It looks to me like after switching, I still mostly get articles from the previous topic. In the first gif, I have a lot of geography articles while looking at "Food and drink" (because I was previously looking at "Geography"), and then after refreshing the page and doing it again, it's mostly food articles. And also after switching a couple times, the limit of articles that comes back is 50 (see second gif). I think there might be something wrong there. If we just need to refresh between each search, that's okay.

test.gif (818×1 px, 1 MB)

test2.gif (818×1 px, 3 MB)

I selected the "copy edit" option in Czech, and waited a while as the results continued to load on the page. According to the spreadsheet with templates, I was expecting about 3,300 results. But it got to over 5,000 before I stopped it. Do we definitely have the right templates mapped in? Or are there somehow duplicates? See screenshot below for the numbers.

image.png (765×839 px, 74 KB)

I was also playing with the "Use logical OR with topic titles" option. Do I understand it correctly: with the logical AND that comes by default, if we have two articles for the "Food and drink" topic, which are "Food" and "Drink", morelike will mash the contents of those two articles together as if they are one big article, and then morelike that. But with the logical OR, you do them each separately, and then union their results in the output of the prototype? If that's right -- then how are you doing the union? Does one of them come first and then the other? As I've played with it for multiple topics, it sometimes looks like all the results from one topic come first, and then all the results for the other come afterward.

I think there is potentially something wrong with the way the tool is caching things between searches. In a couple of gifs below, you can see me switching between topics ("Food and drink" and "Geography"). It looks to me like after switching, I still mostly get articles from the previous topic. In the first gif, I have a lot of geography articles while looking at "Food and drink" (because I was previously looking at "Geography"), and then after refreshing the page and doing it again, it's mostly food articles. And also after switching a couple times, the limit of articles that comes back is 50 (see second gif). I think there might be something wrong there. If we just need to refresh between each search, that's okay.

I'll look into this.

I selected the "copy edit" option in Czech, and waited a while as the results continued to load on the page. According to the spreadsheet with templates, I was expecting about 3,300 results. But it got to over 5,000 before I stopped it. Do we definitely have the right templates mapped in? Or are there somehow duplicates? See screenshot below for the numbers.

Ah, I had Upravit in the list of copy edit templates, which adds a few thousand results. I thought it was supposed to be included, but either I messed that up or the spreadsheet was updated since I set up the configuration. See this diff.

The same query should now yield about 2365 results. The total across all the copy edit templates is around 3252; the reason the app outputs 2365 is that we search for multiple templates at one time. What we call "copyedit" is a collection of about half a dozen templates, and the search can return the same article more than once if it has several of those templates on it. It didn't make sense to me to render those repeats in the list, since they would appear to be duplicates. However, if an article has a template in both the "Copyedit" group and the "Links" group, it will show up twice in the UI, and when you click on the article title you should see which template is associated with that particular result.
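A hedged sketch of that de-duplication rule (the grouping below is illustrative; the prototype's actual code lives in the linked repository):

```
def group_results(hits_by_template, template_groups):
    """
    hits_by_template: {template name: [article titles returned for that template]}
    template_groups:  {group name (e.g. "Copyedit"): [template names in that group]}
    Returns {group name: titles de-duplicated within the group}. An article that
    appears under several templates of one group is listed once; an article that
    appears in two different groups still shows up under both.
    """
    grouped = {}
    for group, templates in template_groups.items():
        seen, titles = set(), []
        for template in templates:
            for title in hits_by_template.get(template, []):
                if title not in seen:  # drop duplicates inside this group only
                    seen.add(title)
                    titles.append(title)
        grouped[group] = titles
    return grouped
```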

I was also playing with the "Use logical OR with topic titles" option. Do I understand it correctly: with the logical AND that comes by default, if we have two articles for the "Food and drink" topic, which are "Food" and "Drink", morelike will mash the contents of those two articles together as if they are one big article, and then morelike that.

Yes, that is my naïve understanding of what's happening.

But with the logical OR, you do them each separately, and then union their results in the output of the prototype? If that's right -- then how are you doing the union? Does one of them come first and then the other? As I've played with it for multiple topics, it sometimes looks like all the results from one topic come first, and then all the results for the other come afterward.

Looking at the network inspector will be the best guide for this, because you can see exactly which queries are getting executed. (Let me know if I should just put this in the UI for the app somewhere if that's easier.) But basically, if you select say "Sport" in Czech which has two article titles associated with it ("Sport" and "Sport v Česku"), and a single checkbox for the task type (let's say Pahýl část), then the code does:

  • srsearch: hastemplate:"Pahýl část" morelikethis:"Sport"
  • srsearch: hastemplate:"Pahýl část" morelikethis:"Sport v Česku"

If you select two task types, then the code iterates over each task type (group of templates) and executes a query for each individual topic article, like so:

  • srsearch: hastemplate:"Pahýl část" morelikethis:"Sport"
  • srsearch: hastemplate:"Pahýl část" morelikethis:"Sport v Česku"
  • srsearch: hastemplate:"Wikifikovat" morelikethis:"Sport"
  • srsearch: hastemplate:"Wikifikovat" morelikethis:"Sport v Česku"

The resulting output is grouped by template rather than topic. To make it clearer which query is responsible for a particular result, I added a "Query" section that contains the search query used to obtain the result, e.g.

image.png (348×1 px, 64 KB)
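A hedged sketch of the fan-out described above: one search request per (template, seed article) pair, with the generating query stored alongside each hit so it can be displayed in the "Query" section. The template and topic values mirror the Czech examples in this comment; this is an illustration, not the prototype's actual code.

```
import requests

CSWIKI_API = "https://cs.wikipedia.org/w/api.php"

task_type_templates = ["Pahýl část", "Wikifikovat"]  # templates for the selected task types
topic_articles = ["Sport", "Sport v Česku"]          # seed articles for the selected topic

results_by_template = {}
for template in task_type_templates:
    for seed in topic_articles:
        query = f'hastemplate:"{template}" morelikethis:"{seed}"'
        hits = requests.get(CSWIKI_API, params={
            "action": "query",
            "list": "search",
            "srsearch": query,
            "srlimit": 50,
            "format": "json",
            "formatversion": 2,
        }).json()["query"]["search"]
        # Group by template (task type) rather than by topic, and keep the query
        # that produced each hit so the UI can display it.
        results_by_template.setdefault(template, []).extend(
            {"title": hit["title"], "query": query} for hit in hits
        )

for template, hits in results_by_template.items():
    print(template, len(hits))
```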

I think there is potentially something wrong with the way the tool is caching things between searches. In a couple of gifs below, you can see me switching between topics ("Food and drink" and "Geography"). It looks to me like after switching, I still mostly get articles from the previous topic. In the first gif, I have a lot of geography articles while looking at "Food and drink" (because I was previously looking at "Geography"), and then after refreshing the page and doing it again, it's mostly food articles. And also after switching a couple times, the limit of articles that comes back is 50 (see second gif). I think there might be something wrong there. If we just need to refresh between each search, that's okay.

This should be fixed now, along with a few other minor issues. Please let me know if you see anything else that's off.

The prototype for ORES drafttopic is ready, although I'll probably try regenerating the seed data later today after double-checking the template listings for each wiki.

The code in this repository includes the logic described in T231506#5502989. One alteration we could try, if you're interested, is to return the top 3 predictions for each individual article, so instead of assigning an article to a single topic in our dataset, each article would have up to 3 topics assigned to it. Please let me know if I should do that.
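A hedged sketch of that variation, assuming we query the ORES drafttopic model for an enwiki revision and keep the three highest-probability topics instead of only the single predicted label (the revision id below is a placeholder):

```
import requests

ORES_URL = "https://ores.wikimedia.org/v3/scores/enwiki/{rev}/drafttopic"

def top_topics(rev_id, n=3):
    """Return up to n drafttopic labels for an enwiki revision, highest probability first."""
    resp = requests.get(ORES_URL.format(rev=rev_id)).json()
    score = resp["enwiki"]["scores"][str(rev_id)]["drafttopic"]["score"]
    probabilities = score["probability"]  # {mid-level topic: probability}
    return sorted(probabilities, key=probabilities.get, reverse=True)[:n]

# Placeholder revision id; in the real script this would be the latest revision
# of the sitelinked enwiki article for each search result.
print(top_topics(123456789))
```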

The code over here contains a simplified web app that interacts with data generated by the script above. The URL for that prototype is here: https://deploy-preview-4--newcomertasks-prototype.netlify.com/

For this prototype, I recommend checking all of the task type boxes and then clicking on individual results to see if the topic assigned to it makes sense.

As noted in T231506#5502989, the data generation script is currently throwing away a big chunk of search results (~40%) where we don't have a corresponding article in enwiki to draw from. As far as I know, this still leaves us with plenty of tasks for our project, but if not, we could get creative for the 40% without a corresponding enwiki article, for example by using the walk-up-the-category-tree idea suggested earlier in this thread.

As for the topics shown in the UI, I didn't attempt to shoehorn the ORES drafttopic topics into the topics we used in the welcome survey. But if it helps to make a 1:1 comparison with the morelike prototype, then I could add some code which would do this (so in the UI you'd see Umeni rather than Culture.Arts).

Another variation would be to play with splitting up some of the topics in the UI, so any time you see "STEM.{someTopic}" we could split that out such that you'd have "STEM" as a parent topic, and sub-topics of "Biology", "Mathematics" etc.

Another thing we could do is a bit of post-processing like @dr0ptp4kt is working on, where we'd look at the wikitext of an article to figure out whether the topic "Geography.Countries" is appropriate when the article is really about a politician who is/was prominent in that country, or whether we'd want to parse the wikitext to assign a more granular topic like "Geography.Countries.Greece" rather than just "Countries".

In short, there's a lot of room for experimentation to improve these results; just let me know what you're interested in pursuing further.

I'm about to start the work to derive country from Infobox settlement-bearing subjects (effectively replacing the Geography.* mid-level category assignment for such subjects), since that's something we're exploring as a starting point for counting pageviews by topic in a more fine-grained way. In the meantime, here's what the heuristic output looks like:

https://dr0ptp4kt.github.io/topics-3.html
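A hedged sketch of the kind of wikitext post-processing mentioned above: pull the country out of an article's Infobox settlement to refine a generic Geography topic. This is only an illustration of the idea, not @dr0ptp4kt's actual heuristic, and Infobox settlement parameter conventions vary from article to article.

```
import requests
import mwparserfromhell

ENWIKI_API = "https://en.wikipedia.org/w/api.php"

def infobox_country(title):
    """Fetch an article's wikitext and return the country recorded in its Infobox settlement, if any."""
    resp = requests.get(ENWIKI_API, params={
        "action": "query",
        "titles": title,
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "format": "json",
        "formatversion": 2,
    }).json()
    wikitext = resp["query"]["pages"][0]["revisions"][0]["slots"]["main"]["content"]
    for template in mwparserfromhell.parse(wikitext).filter_templates():
        if not str(template.name).strip().lower().startswith("infobox settlement"):
            continue
        # Infobox settlement usually records the country as a subdivision pair,
        # e.g. |subdivision_type = Country / |subdivision_name = Greece.
        for suffix in ("", "1", "2"):
            type_param, name_param = "subdivision_type" + suffix, "subdivision_name" + suffix
            if (template.has(type_param)
                    and "country" in str(template.get(type_param).value).lower()
                    and template.has(name_param)):
                return template.get(name_param).value.strip_code().strip()
    return None

# Hypothetical usage; may print a country name or None depending on the article's infobox.
print(infobox_country("Springfield, Illinois"))
```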

@Halfak has noted that it would be sensible to apply a scaling factor to the drafttopic scores based on the metadata from the drafttopic model, so that's on my potential to-do list for this heuristic.

Thanks, @kostajh. It looks like the morelike prototype is almost ready to go to the ambassadors. @revi and @Urbanecm will finish up their lists in the spreadsheet this week, and then we can load the final lists in.

Here are my notes and questions:

  • I sent an email to @Urbanecm to check about the "Upravit". It's weird that I didn't include it in the list, and I wonder if there was a reason or typo or something.
  • I have a question about the logical OR. I feel like the prototype may be running the queries right, but displaying something wrong. If I run it with wiki = 'cswiki' and topic = 'Ekonomie' with the logical AND default, and then I click on each of the results, I see that they say morelikethis:"Ekonomie|Ekonomika". But if I toggle to the logical OR and run it, when the list is finished populating, no matter where I click in the list, all the results say morelikethis:"Ekonomika". I would have expected some to have morelikethis:"Ekonomika" and some to have morelikethis:"Ekonomie". However, I think it is actually doing that, because when I also toggle to use only the first article in the topic list, I get a different number of results. Could you check on this please?
  • I have a question about the ORES prototype. I went through cswiki and displayed the results for all templates for each topic. I put the counts of the results for each into this spreadsheet. You can see that adding them all up gets to only 427 articles, with most topics being in single digits. But with the way you described it, I expected to see a total of about 60% of all the articles with templates, which should be a few thousand for cswiki. Is this because many articles are not able to be confidently assigned to any topic in the ORES model?

I have a question about the logical OR. I feel like the prototype may be running the queries right, but displaying something wrong. If I run it with wiki = 'cswiki' and topic = 'Ekonomie' with the logical AND default, and then I click on each of the results, I see that they say morelikethis:"Ekonomie|Ekonomika". But if I toggle to the logical OR and run it, when the list is finished populating, no matter where I click in the list, all the results say morelikethis:"Ekonomika". I would have expected some to have morelikethis:"Ekonomika" and some to have morelikethis:"Ekonomie". However, I think it is actually doing that, because when I also toggle to use only the first article in the topic list, I get a different number of results. Could you check on this please?

Good catch. There were two bugs contributing to this, and I fixed them. You can compare the same query now by looking at "Ekonomie" with the "Pahýl část" template; in the results you'll see two entries for "Talcott Parsons", one of them showing up with "Ekonomie" and the other with "Ekonomika".

I have a question about the ORES prototype. I went through cswiki and displayed the results for all templates for each topic. I put the counts of the results for each into this spreadsheet. You can see that adding them all up gets to only 427 articles, with most topics being in single digits. But with the way you described it, I expected to see a total of about 60% of all the articles with templates, which should be a few thousand for cswiki. Is this because many articles are not able to be confidently assigned to any topic in the ORES model?

No, it's because I only got up to 100 results per template (the process I used is buried a bit in this comment T231506#5502989), so I only started out with ~900 tasks and then discarded a bit more than half. Today I will re-run the script to attempt to grab all possible tasks, while also storing secondary and tertiary topic predictions, so we can experiment a bit more with this approach. I'll let you know when the dataset is updated.

@kostajh -- okay, thanks. Then I think we are good on the morelike/search prototype -- just waiting for @Urbanecm to finish his list of articles and then we can load in all three languages (Arabic and Korean are done).

Let me know when the ORES prototype is ready.

I plan to send these to the ambassadors on Monday so they can have a couple weeks to play with them, while you move on to other newcomer tasks work.

@kostajh -- the ambassadors are finished listing articles, so I think we should be good to populate the prototype with what's currently in there. Please let us know when that's set up, and when the ORES prototype is ready, and I'll send them to the ambassadors to try out.

MMiller_WMF renamed this task from Newcomer tasks: prototype task selection to Newcomer tasks: prototype topic matching. Sep 30 2019, 9:00 PM

Moving this task until more specific guidance emerges from T234272. The prototypes are being evaluated now.

We have completed prototyping and will soon be moving on to building the first version. As described in T234272: Newcomer tasks: evaluate topic matching prototypes, we will be working with the ORES drafttopic model.