Newcomer tasks: evaluate topic matching prototypes
Closed, Resolved, Public

Description

In T231506, we explored several methods with which to surface articles to newcomers based on the topical interests of those newcomers. This is difficult because newcomers have no editing history with which to make recommendations.

This task is about evaluating three methods that we've put into interactive prototypes. Below, we describe each of the prototypes and how ambassadors can evaluate them.

1. Morelike

  • Prototype
  • How it works: Newcomer selects from a list of 27 broad topics. Each of those 27 topics has a corresponding list of articles that are pre-set by ambassadors in T233465 (the "seed" articles). For each of the topics that the newcomer selects, the prototype takes the seed articles and searches for more articles that share many words with the seed articles. It narrows the results to those that have a maintenance template and displays the results.
  • How to evaluate:
    • Select your language.
    • For each of the 27 topics...
      • Select the topic.
      • Select all the task type checkboxes.
      • Leave all other settings alone.
      • Look at the first ten articles that get returned and count how many of the ten are good results for that topic. For instance, the article on "Elevator" would be a good result for the topic "Engineering". But the article on "Shoes" would not.
      • Write down that score in this sheet.
      • If any topic has fewer than 10 results, indicate that by making a note in the cell.
  • Notes:
    • You can click each result to see details about its templates, categories, and the search that was run.
    • The prototype contains some additional algorithm settings. You are welcome to play with those and record some of your notes about what you notice, but we're evaluating them based on the default settings.
    • Although the prototype allows you to select only certain maintenance templates, we think you should select all of them for this exercise, because we're really only evaluating the topic matching abilities here. We can separately count how many results show up for each maintenance template.
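
The search step described above can be sketched as a set of queries against the standard MediaWiki search API (`action=query&list=search`), one per maintenance template. This is only an illustration, not the prototype's code: the function names, seed titles, and template names are hypothetical, and the exact `morelikethis:` and `hastemplate:` keyword syntax is an assumption that should be checked against the CirrusSearch documentation.

```python
from urllib.parse import urlencode


def build_morelike_params(seed_titles, template, limit=10):
    """Return search API parameters for articles similar to the seed
    articles that also carry the given maintenance template."""
    # Assumed CirrusSearch syntax: seed titles are pipe-separated.
    seeds = "|".join(seed_titles)
    query = f'hastemplate:"{template}" morelikethis:"{seeds}"'
    return {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
        "format": "json",
    }


def build_search_urls(api_url, seed_titles, templates):
    """Build one request URL per maintenance template."""
    return [
        api_url + "?" + urlencode(build_morelike_params(seed_titles, t))
        for t in templates
    ]


# Hypothetical seeds for the "Engineering" topic and hypothetical templates.
urls = build_search_urls(
    "https://en.wikipedia.org/w/api.php",
    ["Elevator", "Bridge"],
    ["Copy edit", "Unreferenced"],
)
for url in urls:
    print(url)
```

Issuing one request per template mirrors the per-template result groups mentioned later in the discussion; whether the prototype combines them differently is not stated here.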

2. Free text

  • Prototype (same as morelike)
  • How it works: Newcomer types in some text in the search field, and it runs a normal search, just like the search bar in Wikipedia, but narrowed to articles with maintenance templates. This allows the user to search for more specific topics.
  • How to evaluate:
    • Select your language.
    • Select all the task type checkboxes.
    • Leave all the other settings alone.
    • Type one topic at a time into the free text field. Please try 15 different topics of your choice, which can be more or less specific. A more general one might be "Swimming" and a more specific one might be "Pokemon".
    • Look at the first ten results for that search term, and count how many look like good results.
    • Put the search term and the score in this sheet.
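
The score recorded for each search term (and for each topic in the other two protocols) is a count of good results among the first ten, which is effectively precision at 10. A minimal sketch of that bookkeeping follows. The function name is hypothetical and the judgments are made up for illustration. It also produces the "fewer than 10 results" note that the topic-based protocols ask for.

```python
def score_top_ten(judgments):
    """Score one search.

    judgments: list of booleans in rank order, True if the evaluator
    considers that result a good match.
    Returns (number of good results among the first ten, note).
    """
    top = judgments[:10]
    good = sum(1 for is_good in top if is_good)
    note = f"only {len(top)} results" if len(top) < 10 else ""
    return good, note


# Made-up judgments, for illustration only.
eleven_results = [True, True, False, True, True, True, False, True, True, True, True]
three_results = [True, False, True]

print(score_top_ten(eleven_results))  # the eleventh result is ignored
print(score_top_ten(three_results))   # short result list gets a note
```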

3. ORES

  • Prototype
  • How it works: There is a machine learning model in English Wikipedia that classifies any English article into a topic. The topics are made through the English WikiProject hierarchy and are not the same as the ones from the "morelike" list (but we could align them later if we like this approach). The method takes all the articles with maintenance templates in the target wiki, then finds the ones that also exist in English Wikipedia, then gets their ORES topic score from English and applies it to the target language's version. That means that the only articles that come up are the ones that exist in English, too. That's not optimal because it would mean that we don't recommend any local-language articles for editing, but we still want to try this method out to see how good it is.
  • How to evaluate:
    • Select your language.
    • For each of the 42 topics...
      • Select the topic.
      • Select all the task type checkboxes.
      • Look at the first ten articles that get returned and count how many of the ten are good results for that topic.
      • Write down that score in this sheet.
      • If any topic has fewer than 10 results, indicate that by making a note in the cell.
  • Notes:
    • We only have the English names of the topics, but if we like this method, we would figure out how to translate them to local languages.
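
The cross-wiki step described under "How it works" can be sketched with plain dictionaries. This is a simplified illustration under stated assumptions, not the prototype's implementation: the function name, the sample titles, and the topic labels are hypothetical, and the real data would come from the wikis and from ORES rather than from literals.

```python
def topics_for_local_articles(local_tasks, local_to_english, english_topics):
    """Apply English topic labels to local-language articles.

    local_tasks: local titles that carry a maintenance template
    local_to_english: local title -> title of the English version
    english_topics: English title -> list of topic labels
    Returns (dict of local title -> topics, list of titles without topics).
    """
    with_topics = {}
    without_topics = []
    for title in local_tasks:
        english_title = local_to_english.get(title)
        topics = english_topics.get(english_title, []) if english_title else []
        if topics:
            with_topics[title] = topics
        else:
            # No English version (or no score for it): no topic available.
            without_topics.append(title)
    return with_topics, without_topics


# Hypothetical sample data.
local_tasks = ["Výtah", "Vltava", "Místní článek"]
local_to_english = {"Výtah": "Elevator", "Vltava": "Vltava"}
english_topics = {
    "Elevator": ["STEM.Engineering"],
    "Vltava": ["Geography.Bodies of water"],
}

matched, unmatched = topics_for_local_articles(
    local_tasks, local_to_english, english_topics
)
print(matched)
print(unmatched)
```

Articles that end up in the second list are the local-only articles that this method cannot recommend.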

Details

Due Date
Oct 16 2019, 12:00 PM

Event Timeline

@kostajh @Trizek-WMF -- this is the evaluation protocol I made for testing the prototypes. Once @kostajh says the prototypes are ready, I'll make a separate task for each ambassador to work on this. I think this will take a lot of time, so maybe the ambassadors will need a couple weeks with it.

In the meantime, please comment or change things if you think it could be better.

I've copied over all of the topic titles into the respective configuration files on MediaWiki.org. Because reading Korean and Arabic is difficult for me, it's hard to know if I copy/pasted everything correctly. Ambassadors, please feel free to glance over my lists. Please make any updates directly on MediaWiki.org at https://www.mediawiki.org/wiki/Growth/Personalized_first_day/Newcomer_tasks/Prototype/topics/{langCode}.json. Those files are read directly by the prototype, so any change you make on mediawiki.org will show up in the UI of the morelike/keyword search prototype.

kostajh updated the task description. Oct 1 2019, 2:24 PM
Trizek-WMF triaged this task as High priority. Oct 1 2019, 3:58 PM
Trizek-WMF set Due Date to Oct 15 2019, 1:00 PM.
Mholloway added a subscriber: Mholloway.

@MMiller_WMF the ORES prototype has been updated with datasets for Arabic, Czech and Korean.

Language   Tasks with topics   Tasks without topics
ko         15,511              6,496
cs         19,610              7,440
ar         23,626              10,816

It's also worth noting that (with the exception of Arabic, which would have tens of thousands more potential tasks due to my temporary removal of the two most populated templates), these are the actual counts of potential tasks to show in suggested edits if we go with this approach. (cc @nettrom_WMF and @RHo )
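
For a rough sense of coverage, the share of potential tasks that receive a topic can be computed from the counts above. This is plain arithmetic on the reported numbers; the function name is hypothetical, and the Arabic figure reflects the temporary removal of two templates mentioned above.

```python
# (tasks with topics, tasks without topics), as reported above.
counts = {
    "ko": (15511, 6496),
    "cs": (19610, 7440),
    "ar": (23626, 10816),
}


def topic_coverage(with_topics, without_topics):
    """Fraction of potential tasks that have at least one topic."""
    return with_topics / (with_topics + without_topics)


for lang, (with_t, without_t) in counts.items():
    total = with_t + without_t
    print(f"{lang}: {total} tasks, {topic_coverage(with_t, without_t):.1%} with topics")
```

In all three languages the share with topics falls between roughly 69% and 73%.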

I am going to move this task to the Epic column on the work board. It appears we are now having a discussion on it, but there is nothing actionable here because the actionable next steps are for the ambassadors in the subtasks. If that is incorrect, let me know!

Trizek-WMF changed Due Date from Oct 15 2019, 1:00 PM to Oct 15 2019, 12:00 PM. Oct 2 2019, 2:48 PM
Trizek-WMF changed Due Date from Oct 15 2019, 12:00 PM to Oct 16 2019, 12:00 PM.

Hello, I've noticed that for cs/Arts at least, it gives different results each time I submit a query. See the screencast at https://martin.urbanec.cz/files/screencasts/newcomer_tasks_prototypes_cs_arts_01.webm. @MMiller_WMF said in the chat that that's a bug, and asked me to put more information in this task.

Thanks, @Urbanecm. That's not the behavior I was expecting. @kostajh is out until Monday, and he'll be able to take a look then. Maybe it is randomizing results or something.

I can reproduce similar behavior for freetext. OTRS seems to work fine.

To clarify, do you mean ORES?

The requests to the search API are done asynchronously, so sometimes one finishes earlier than another. If the exact order is important (e.g. always show "Kopírovat úpravy" results then "Reference" then "Info" etc) I could change the prototype to do that. Or I could change it to completely randomize the result order, if that is preferable.
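
The first option (a fixed order despite concurrent requests) can be illustrated as follows. This is a sketch in Python rather than the prototype's own code, which is not shown in this task; `fake_search` is a hypothetical stand-in for one search API request, with a random delay to imitate requests finishing out of order.

```python
import asyncio
import random


async def fake_search(template):
    """Hypothetical stand-in for one search request for one template."""
    await asyncio.sleep(random.uniform(0, 0.05))  # simulated network delay
    return [f"{template} result {i}" for i in range(1, 3)]


async def search_in_fixed_order(templates):
    """Run all requests concurrently, but keep the template order."""
    # gather() returns results in the order the awaitables were passed in,
    # regardless of which request finishes first.
    per_template = await asyncio.gather(*(fake_search(t) for t in templates))
    return [item for results in per_template for item in results]


templates = ["Kopírovat úpravy", "Reference", "Info"]
print(asyncio.run(search_in_fixed_order(templates)))
```

Appending results as each request completes would instead produce an order that depends on timing, which matches the behavior reported above.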

I can reproduce similar behavior for freetext. OTRS seems to work fine.

To clarify, do you mean ORES?

Ah, yes.

The requests to the search API are done asynchronously, so sometimes one finishes earlier than another. If the exact order is important (e.g. always show "Kopírovat úpravy" results then "Reference" then "Info" etc) I could change the prototype to do that. Or I could change it to completely randomize the result order, if that is preferable.

Leaving for @MMiller_WMF :).

RHo added a comment. Oct 6 2019, 8:07 PM

My 2c is that having the exact order is important for someone who is searching for this reason, so that they do not think that there is a bug in the search query.

I can see that. OTOH, the instructions for evaluating are to look at the first ten articles returned, and it's likely that the first 10 will belong to a single template only. For example, if results 1 through 31 are for "Kdo?" and results 32-40 are for "Kdy?", an evaluator who looks only at the first 10 sees results for "Kdo?" alone. So maybe randomizing the result list would make for a fairer comparison, since we're asking the evaluators to look at the results a single time and not multiple times.

I'm happy to do whatever makes sense to @RHo and @MMiller_WMF, just let me know.
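
The trade-off described above can be made concrete with a small sketch. The function name and article titles are hypothetical; the group sizes follow the "Kdo?" and "Kdy?" example, and the optional seed is only there to make a shuffled run repeatable.

```python
import random


def first_ten(results_by_template, shuffle=False, seed=None):
    """Flatten per-template result lists and return the first ten."""
    merged = [r for results in results_by_template.values() for r in results]
    if shuffle:
        # A seeded Random instance gives the same shuffle on every call.
        random.Random(seed).shuffle(merged)
    return merged[:10]


results = {
    "Kdo?": [f"Kdo? article {i}" for i in range(1, 32)],  # results 1 through 31
    "Kdy?": [f"Kdy? article {i}" for i in range(1, 10)],  # results 32-40
}

print(first_ten(results))                        # all ten come from "Kdo?"
print(first_ten(results, shuffle=True, seed=1))  # drawn from the merged list
```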

Hi @kostajh, the list of topics in the spreadsheet does not match the 42 topics in the ORES tool. Can you have a look at it, please?

@Urbanecm -- thanks for noticing that the results change on successive searches. I think that for the purposes of this evaluation, we shouldn't worry about it. I think you should just do one search and record the results from that search. When it comes time for implementation, we'll make sure to randomize correctly.

@Dyolf77_WMF what does not match?

leila removed a subscriber: leila. Oct 8 2019, 7:33 PM

In the ORES sheet, the topic Geography.Bodies of water (which is in the tool) is missing, and the first 4 topics in the same spreadsheet do not match what exists in the tool.

We've done a review today. Martin is going to finish it ASAP.

@Urbanecm and @revi, what is the status of your sub-tasks?

Sorry @Trizek-WMF, I thought I updated the subtask. Done from my side, moved to an appropriate column.

We are currently working on deciding exactly how to proceed based on these evaluations, which I will post on this task.

Trizek-WMF closed this task as Resolved. Oct 31 2019, 9:19 AM

All subtasks done.

MMiller_WMF reopened this task as Open. Oct 31 2019, 10:06 PM

I will resolve this task once I post on it what our decision is from these evaluations.

Assigning to you then.

MMiller_WMF closed this task as Resolved. Nov 6 2019, 11:04 PM

Following these evaluations, we have decided to proceed with the ORES drafttopic model. That is because:

  • The ORES model performed the best in terms of accuracy. In other words, it returned the highest number of results that looked correct for each topic.
  • There is a dedicated team (the Scoring team) for supporting and improving ORES.
  • It is a system already in production that we can scale to more wikis in the future.