
Add a link: evaluate link recommendation (Mar 30 2021)
Closed, Resolved (Public)

Description

Now that the link recommendation algorithm has been productized into an API, we want to re-evaluate its results to ensure that it is still performing at a high level and that no regressions occurred as it was productized. We are expecting around 75% of link recommendations to be accurate.

Here's the procedure:

  1. Make sure that you have this browser extension installed in Chrome, which will allow you to easily read the results from the API.
  2. @Urbanecm_WMF will list 60 random articles from fr, en, vi, ar, cs, and bn in this spreadsheet. They will be articles that have at least one link recommendation.
  3. For each article, open the article and its API link to get its recommendations. Record how many recommendations there are in the "links suggested" column (a minimal fetch-and-count sketch follows this list). @Urbanecm_WMF will prepopulate the API links in the spreadsheet, but this is the link to the API for your reference.
  4. Look at each recommendation and decide whether you would choose "Yes" or "No". This judgment should be based both on whether that word or phrase should be a link and whether the link target is the right article to which it should be linked.
  5. In the spreadsheet, enter how many recommendations were given for the article and how many you would choose "Yes".
  6. Also enter whether the article is "short", "medium", or "long", according to your judgment.
  7. Add any notes or issues that you saw. Examples of things to note are: clear reasons why the algorithm was wrong (e.g. "Suggests a link in the middle of a longer song title"), the algorithm recommending a link after that word's first occurrence in the article, or a suggestion linking to a disambiguation page.
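(For reference, here is a minimal sketch of how the suggestion count could be read programmatically. The endpoint URL, wiki, and title below are placeholders rather than the real service path, and it assumes the response lists the suggestions under a "links" key; treat the API links prepopulated in the spreadsheet as the source of truth.)

import requests

# Placeholder URL -- use the API link from the spreadsheet instead.
API_URL = "https://example.org/linkrecommendations/{wiki}/{title}"

def count_suggestions(wiki, title):
    # Fetch the recommendations for one article and count them,
    # i.e. the number to enter in the "links suggested" column.
    response = requests.get(API_URL.format(wiki=wiki, title=title))
    response.raise_for_status()
    return len(response.json().get("links", []))

# Example: one row of the spreadsheet.
print(count_suggestions("cswiki", "Praha"))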

Other notes:

  • If the article in the list is a disambiguation page, just cross it out and leave a note. Do not evaluate these, because we won't be recommending disambiguation pages to users.
  • Take a brief look at the "context before" and "context after" fields from the API. These are meant to be the characters occurring immediately before and after the link text, which help the feature highlight the text in production. Check that these strings are not broken (a quick consistency check is sketched after this list). Breakage seems especially possible in non-Latin scripts, like Arabic and Bengali.
  • If you see any general patterns, e.g. the "context before" and "context after" fields are always broken, please explain that with examples in a comment on this task.
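(One way to check this programmatically, assuming each suggestion exposes context_before, link_text, and context_after as in the API output quoted later in this task; plain_text stands in for the article's extracted plain text:)

def context_is_consistent(plain_text, suggestion):
    # context_before + link_text + context_after should occur verbatim in
    # the article text; if it does not, the context strings are likely broken.
    span = (suggestion["context_before"]
            + suggestion["link_text"]
            + suggestion["context_after"])
    return span in plain_text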

Details

Due Date
Apr 5 2021, 1:00 PM

Event Timeline

MMiller_WMF changed Due Date from Apr 2 2021, 1:00 PM to Apr 5 2021, 1:00 PM.
MMiller_WMF added a subscriber: Urbanecm.

@Urbanecm_WMF @Dyolf77_WMF @PPham @ANKAN -- this is the task that we're discussing in our ambassador meetings for this week. I expect it to take about 1.5 hours or less. If you end up spending more time than that, you can stop wherever you are.

Once @Urbanecm_WMF creates the spreadsheet, we can get started. Please be finished by Monday, April 5.

@MMiller_WMF we might want to wait for T278719: load-datasets.py: Lock wait timeout exceeded; try restarting transaction to be resolved before starting the evaluation. That task (T278719) is blocking the import of new datasets, which remove disambiguation pages, calendar years, list pages, etc. (complete list here) from the link recommendations.

@kostajh -- when do you expect the task to be resolved and the datasets imported such that the changes are reflected in the API?

Hopefully by the end of this week.

I just added 30 articles per wiki to the spreadsheet.

@kostajh -- thanks for the heads up. I think we're going to go ahead and do this evaluation as-is. We're more concerned with identifying any major issues with the algorithm, and we'll know that there are a few outstanding filters not applied yet. We can reevaluate in a couple weeks if needed.

I just added 30 articles per wiki to the spreadsheet.

FTR, P15093 is the script I used to do that.

MMiller_WMF added a subscriber: Ankan_WMF.

@Dyolf77_WMF @Urbanecm_WMF -- after discussion today with @Trizek-WMF, @Ankan_WMF, and @PPham we added a few more things to think about while evaluating. Please check out the "Other notes" section in the task description for these updates.

FYI @Urbanecm_WMF, I blanked the "links suggested" column because this can change over time. Rather than asking evaluators to update the existing column, I'm asking everyone just to fill it in with the number of suggestions available at the time they are evaluating.

The context_after and context_before parts do not fetch the whole word(s) properly and come out broken.

For example, in the তুতানখামেন article (API), "context_before" fetches "ারি কোবরা", but it would be better to fetch either "কানারি কোবরা" (the last two words before the target) or "কোবরা" (the last word before the target). Similarly, for the কবির আহম্মদ article (API), "context_after" fetches ") ক্যাম্বে" and "context_before" fetches "ানের (তখন ", but to get whole words they should be ") ক্যাম্বেলপুরে" and "পাকিস্তানের (তখন " respectively.

In some cases, though, I think it is okay. In the first example, context_after fetches "ের কামড়ে ", and that is fine because the target word is "সাপ" while the word used in the article is "সাপের": to separate the target from the context, it has to be split as "সাপ" + "ের". Although it looks broken in the API output, it is actually right.
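To make that concrete (the variable names are just for this example):

link_text = "সাপ"             # the suggested anchor
context_after = "ের কামড়ে "    # looks truncated because it starts mid-word
# Concatenating them reproduces the inflected word actually used in the
# article, so the output is correct even though the context starts mid-word.
print(link_text + context_after)  # সাপের কামড়ে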

Done. Only 71/142 recommendations are correct, which is exactly 50%.

  • Because Vietnamese has 6 different tones (sang, sáng, sảng, sàng, sạng, sãng), the algorithm seems to have a hard time matching words with different tones correctly.

I find that the rate of incorrect recommendations is highest in cases like these:

"link_target": "Ứng Dũng"
"link_text": "ứng dụng"

"link_target": "Đông Thới"
"link_text": "đồng thời"

As you can see, the text is just a normal word (uncapitalized, and often with a generic meaning), but the target is capitalized, which means it is a proper noun. I don't know why the algorithm keeps matching common words with proper nouns and totally ignores the tones in these cases, while recommendations between two common words still handle the tones fine. (A small sketch at the end of this comment shows how stripping the tones would make these pairs collide.)

  • Sometimes it breaks a set of words down and just links a part of that set:

"context_before": "cộng đồng ",
"link_index": 3,
"link_target": "Công giáo tại Việt Nam",
"link_text": "Công giáo tại Việt Nam",

The phrase in the article is actually "cộng đồng Công giáo tại Việt Nam" (Christian community in Vietnam) -- see the "context_before" field -- but the algorithm only picks "Công giáo tại Việt Nam" (Christianity in Vietnam) and ignores "cộng đồng" (community), which makes the recommended link wrong in context. I think this happens because each word in English has its own meaning, whereas in Vietnamese you often compound several one-syllable words into a phrase to get something that makes sense (e.g. Christianity, one word in English, is "Công giáo", two one-syllable words compounded).

  • I don't link dates, but I see that many people in my community do, and I'm checking on it again to see if we have any actual rule about it.
  • There is no disambiguation page in my case.

Conclusion: I'm honestly a bit worried about the result, as some articles have ~10 recommendations and none of them is correct.
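To illustrate the tone problem mentioned above -- this is only a guess at what the matching might be doing, not a description of the actual service -- if tone and diacritic marks are stripped and the text is lower-cased, both pairs above collapse to the same string:

import unicodedata

def strip_tones(text):
    # Decompose accented characters, drop the combining marks (tones and
    # other diacritics), then lower-case.
    decomposed = unicodedata.normalize("NFD", text)
    bare = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return bare.casefold()

print(strip_tones("ứng dụng") == strip_tones("Ứng Dũng"))    # True
print(strip_tones("đồng thời") == strip_tones("Đông Thới"))  # True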

This will hopefully be improved when T279037: Character encoding issues in MySQL anchor dictionaries for viwiki is in production, probably some time next week.

@kostajh -- thank you! Please let us know when we should test in Vietnamese again. @PPham -- thank you for noticing and reporting this issue.

Done for the Czech language. A lot of dates and centuries are suggested, but overall it looks good.

@Urbanecm_WMF -- did you count dates and centuries as good links or bad? Should we consider removing them from the suggestions?

@MGerlach -- we've finished doing this evaluation, and here are the accuracies for each wiki:

image.png (188×417 px, 16 KB) [table of per-wiki accuracies]

The wikis range between 70% and 90%, except for Vietnamese, which has issues we identified above (T279037). We'll reevaluate that language once the issue is fixed. Do those accuracy levels match what you would expect based on your evaluations?

In terms of specific items for improvement, you can look in the "notes" column of the spreadsheet to see the details. Here are some issues that were common:

  • Linking to just a portion of a larger phrase, in which the larger phrase would not be a link. Examples include:
    • Awards, e.g. "The Jane Smith Award for Excellence" might link just to "Jane Smith".
    • Song titles, e.g. "Un Beso Para Mi" might link just to "Un Beso".
    • Schools, e.g. "Rockville High School" might link just to "Rockville".
  • Linking to text in sections that usually don't have links, e.g. the "Sources" section, in which links were suggested inside of citations.
  • Possessive suffix: for anchor text "Brazilian Navy's", the suggestion would be to link just the "Brazilian Navy" portion to the target, whereas we would want to include the "'s" in the link (a possible post-processing tweak is sketched after this list).
  • Links to dates and centuries may be too frequent.
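(To make the possessive case concrete, here is a hypothetical post-processing step; the field names follow the API output quoted earlier in this task, and this is a sketch of the desired behaviour rather than the planned fix.)

def absorb_possessive(suggestion):
    # If the text right after the anchor is a possessive suffix, fold it into
    # the link text so that the whole "Brazilian Navy's" is highlighted.
    after = suggestion["context_after"]
    for suffix in ("'s", "\u2019s"):  # straight and curly apostrophes
        if after.startswith(suffix):
            suggestion["link_text"] += suffix
            suggestion["context_after"] = after[len(suffix):]
            break
    return suggestion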

@MGerlach -- none of the issues above are blockers for Growth's work -- that's good! But could you please create Phabricator tasks for improvements that you would like to investigate at some point (not necessarily now) based on this list?

@MMiller_WMF thanks for the summary of the evaluation.

The wikis range between 70% and 90%, except for Vietnamese, which has issues we identified above (T279037). We'll reevaluate that language once the issue is fixed. Do those accuracy levels match what you would expect based on your evaluations?

Yes, this roughly matches the precision values I obtained on the backtesting data using a threshold of 0.5 (link to analysis). I expect the numbers to improve substantially for viwiki once the encoding issues are fixed.

In terms of specific items for improvement, you can look in the "notes" column of the spreadsheet to see the details. Here are some issues that were common:

  • Linking to just a portion of a larger phrase, in which the larger phrase would not be a link. Examples include:
    • Awards, e.g. "The Jane Smith Award for Excellence" might link just to "Jane Smith".
    • Song titles, e.g. "Un Beso Para Mi" might link just to "Un Beso".
    • Schools, e.g. "Rockville High School" might link just to "Rockville".
  • Linking to text in sections that usually don't have links, e.g. the "Sources" section, in which links were suggested inside of citations.
  • Possessive suffix: for anchor text "Brazilian Navy's", the suggestion would be to link just the "Brazilian Navy" portion to the target, whereas we would want to include the "'s" in the link.
  • Links to dates and centuries may be too frequent.

Dates and years should appear less frequently among the suggested links, as I am filtering out links to articles that are instances of certain entities on Wikidata (the current list includes disambiguation pages, list pages, years, and calendar years). From what I understood, there was some delay in loading the newly generated tables, so the model used in this evaluation might not have included that update.

@MMiller_WMF Can we enumerate all the entity types we want to filter (with their corresponding Wikidata IDs)? For example, that way we could easily include things like centuries (Q578). Given the list of entities, I could then update the filter list when training the models. Currently I lack a clear understanding of what should and what shouldn't be filtered.
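For illustration, the filter could be driven by a small table of "instance of" (P31) values. Q578 (century) is the one mentioned above; the other IDs are written from memory and should be verified against Wikidata before use:

# Candidate "instance of" (P31) values whose articles should not be
# suggested as link targets. Verify the IDs against Wikidata before use.
FILTERED_P31 = {
    "Q4167410",   # Wikimedia disambiguation page (verify)
    "Q13406463",  # Wikimedia list article (verify)
    "Q577",       # year (verify)
    "Q3186692",   # calendar year (verify)
    "Q578",       # century
}

def is_filtered(p31_values):
    # p31_values: the Q-IDs that the candidate target article is an instance of.
    return bool(FILTERED_P31 & set(p31_values))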

@MGerlach -- none of the issues above are blockers for Growth's work -- that's good! But could you please create Phabricator tasks for improvements that you would like to investigate at some point (not necessarily now) based on this list?

Yes, I can create Phabricator tasks for possible improvements. Regarding the wrong links, do we have a notion of i) how often each specific type of wrong link occurs, and ii) the priority for fixing each error type (I guess this will be highly correlated with how often it occurs)?

@MGerlach -- I think you should check out the spreadsheet comments to get a sense of the frequency of each type of error. Maybe it would be useful to add columns to code the errors. My sense is that the most common type of error is this one, and we see it in each language:

  • Linking to just a portion of a larger phrase, in which the larger phrase would not be a link. Examples include:
    • Awards, e.g. "The Jane Smith Award for Excellence" might link just to "Jane Smith".
    • Song titles, e.g. "Un Beso Para Mi" might link just to "Un Beso".
    • Schools, e.g. "Rockville High School" might link just to "Rockville".

@kostajh -- are we ready to re-test Vietnamese at this point?

@MMiller_WMF nearly; the datasets for all wikis are being imported in production and viwiki is the one that's getting updated now. I can comment here when it's done.

@MMiller_WMF viwiki finished importing, it can be checked now.

I just finished round 2 of the evaluation, and the result is 131/180 correct (~72.8%).

I'd say it's better than last time; at least the tones are matched now. Of course, in some cases even words with the same tones can have different meanings, and the AI isn't smart enough to catch that, but I guess 72% is fine.

Thank you, Phuong. The target was between 70% and 90% accuracy, so Vietnamese is now within the target!

Thanks, @PPham. With that, this task is resolved.