Add a link: unnecessary articles on units are often suggested
Closed, Resolved · Public

Description

In bnwiki, articles on widely known units are often suggested as link targets, like ফুট (en: Foot (unit)), মিটার (en: Metre), সেলসিয়াস (en: Celsius). Users are accepting these suggestions, which increases the number of unnecessary links being added.

These unit names, obviously, appear alongside numbers.

Event Timeline

Restricted Application added a subscriber: Aklapper.
kostajh added a subscriber: kostajh.

@MMiller_WMF @Trizek-WMF I assume you'd agree with this, so moving into current sprint to go along with other updates to the mwaddlink filtering logic. I guess we'd want the wikidata items for units of measurement?

Unit of measurement is only indirectly related to the specific units in the wikidata-tree, for example for Degree-celsius:

For our filtering logic, we remove all link candidates that are instance_of a given item (such as Unit of temperature).

In order to filter all links related to a unit of measurement, we have to add all items that are a subclass_of Unit of measurement. A SPARQL query yields 30 results; perhaps the most relevant entity types are:

In order to remove link-candidates related to units of measurement, my recommendation would be to add the above entities to the filter.
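
The filtering idea described above can be sketched roughly as follows. This is a minimal illustration, not the actual mwaddlink code: the function names, the in-memory dictionaries, and the plain-text labels standing in for Wikidata item IDs are all assumptions made for the example. Walking subclass_of transitively is also an assumption; the comment above only mentions items that are a subclass_of Unit of measurement.

```python
from collections import deque


def expand_subclasses(root, subclass_of):
    """Return root plus every item that is a (transitive) subclass of it.

    subclass_of maps an item to the set of its direct parent classes.
    """
    # Invert the child -> parents mapping so we can walk downwards from root.
    children = {}
    for child, parents in subclass_of.items():
        for parent in parents:
            children.setdefault(parent, set()).add(child)

    seen = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in children.get(current, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def filter_candidates(candidates, instance_of, blocked_classes):
    """Keep only candidates that are not an instance of a blocked class."""
    return [
        title for title in candidates
        if not (instance_of.get(title, set()) & blocked_classes)
    ]


# Hypothetical toy data; real item IDs and statements would come from Wikidata.
subclass_of = {
    "unit of temperature": {"unit of measurement"},
    "unit of length": {"unit of measurement"},
}
instance_of = {
    "Celsius": {"unit of temperature"},
    "Metre": {"unit of length"},
    "Dhaka": {"city"},
}

blocked = expand_subclasses("unit of measurement", subclass_of)
kept = filter_candidates(["Celsius", "Metre", "Dhaka"], instance_of, blocked)
print(kept)
```

With this toy data only the non-unit candidate survives the filter. Candidates without any instance_of entry are kept, which is a design choice of the sketch rather than documented behaviour.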

Thanks, @MGerlach. Yes, @kostajh, I agree with this being in Ready for Development. Thank you!

Change 699395 had a related patch set uploaded (by Kosta Harlan; author: MGerlach):

[research/mwaddlink@main] Update the filter for removing candidate links

https://gerrit.wikimedia.org/r/699395

@Ankan_WMF Do you have one or two example articles where you observed the suggestions you mention above? We have made some changes to the model based on your feedback (T279434) and would like to check manually, before deploying, whether this resolves the issue or creates new problems. Thanks.

@MGerlach, I have checked some articles just now; examples of unit-related articles are:

To add some more suggestions for date/day-related articles, which I got just now:

Change 699395 merged by jenkins-bot:

[research/mwaddlink@main] Update the filter for removing candidate links

https://gerrit.wikimedia.org/r/699395

The datasets are being imported in production, and new suggestions can be checked in a day or two on https://api.wikimedia.org/service/linkrecommendation/apidocs/

An example from today. It is still suggesting dates like "৬ আগস্ট" (6 August).

@Aftabuzzaman -- thanks for checking in about this. I believe that the change we've made will only affect the evaluation of future articles, i.e. as users proceed through the existing suggestions and complete them, the system generates new ones. Those will not contain the terms that we want to exclude. Therefore, you'll still see these terms during the time it takes the old suggestions to cycle out and new ones to cycle in.

Checking on wmf.15 cswiki - @Urbanecm_WMF: you mentioned that the issue (dates displayed as suggested links) is still present in cswiki. I see a discrepancy between the number of links displayed to a user and the number of links fetched by the API. I'll do more testing on other wikis.

Two recommended links displayed for Alexandrijská knihovna
Only one link fetched by the API - link recommendation API for "Alexandrijská knihovna"

{
  "links": [
    {
      "context_after": " v roce 64",
      "context_before": "ytí města ",
      "link_index": 0,
      "link_target": "Arabové",
      "link_text": "Araby",
      "match_index": 0,
      "score": 0.512970507144928,
      "wikitext_offset": 1917
    }
  ],
  "links_count": 1,
  "meta": {
    "application_version": "5b4709d",
    "dataset_checksums": {
      "anchors": "e3799896b97229ae992f007cbd5c837960b66c0dbd65bfe42ae27b6d7b7b3be5",
      "model": "18190aba440a9b123c0f897f9cf31ec5473d3d1102dba16caa74583067baec5e",
      "pageids": "c0413b60ae9ce6533b19f1ef9f90a9758fe28242fcfa2cc994aa00baaa1a9d8b",
      "redirects": "9071e4cb7f33323c3ca0d6d78e7c41016867e4f7ff7da619008c7ce71d251e1f",
      "w2vfiltered": "689a4fa99a3795a5bba0053280916c48515c5e38d31eab5918d06fcc83f7e641"
    },
    "format_version": 1
  },
  "page_title": "Alexandrijská knihovna",
  "pageid": 62136,
  "revid": 19487788
}
Five recommended links displayed for Expedice 38
Only two links fetched by the API - link recommendation API for "Expedice 38"
{
  "links": [
    {
      "context_after": " Kóiči Wak",
      "context_before": " japonský ",
      "link_index": 0,
      "link_target": "Kosmonaut",
      "link_text": "kosmonaut",
      "match_index": 0,
      "score": 0.6288405060768127,
      "wikitext_offset": 3408
    },
    {
      "context_after": ".",
      "context_before": "y, voda a ",
      "link_index": 1,
      "link_target": "Kyslík",
      "link_text": "kyslík",
      "match_index": 0,
      "score": 0.6587978005409241,
      "wikitext_offset": 4125
    }
  ],
  "links_count": 2,
  "meta": {
    "application_version": "5b4709d",
    "dataset_checksums": {
      "anchors": "e3799896b97229ae992f007cbd5c837960b66c0dbd65bfe42ae27b6d7b7b3be5",
      "model": "18190aba440a9b123c0f897f9cf31ec5473d3d1102dba16caa74583067baec5e",
      "pageids": "c0413b60ae9ce6533b19f1ef9f90a9758fe28242fcfa2cc994aa00baaa1a9d8b",
      "redirects": "9071e4cb7f33323c3ca0d6d78e7c41016867e4f7ff7da619008c7ce71d251e1f",
      "w2vfiltered": "689a4fa99a3795a5bba0053280916c48515c5e38d31eab5918d06fcc83f7e641"
    },
    "format_version": 1
  },
  "page_title": "Expedice 38",
  "pageid": 652172,
  "revid": 17878036
}
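
A check like the one above can be scripted. The sketch below assumes a response has already been fetched and parsed into a dictionary with the same shape as the JSON output above; the function name, the trimmed sample, and the date title used in the lookup are illustrative assumptions only.

```python
def find_suggestions(response, unwanted_texts):
    """Return links in an API response whose link_text or link_target is unwanted."""
    unwanted = set(unwanted_texts)
    return [
        link for link in response.get("links", [])
        if link.get("link_text") in unwanted or link.get("link_target") in unwanted
    ]


# Trimmed copy of the "Expedice 38" response shown above.
sample = {
    "links": [
        {"link_target": "Kosmonaut", "link_text": "kosmonaut", "score": 0.6288405060768127},
        {"link_target": "Kyslík", "link_text": "kyslík", "score": 0.6587978005409241},
    ],
    "links_count": 2,
    "page_title": "Expedice 38",
}

# The reported count should agree with the number of links actually returned.
assert sample["links_count"] == len(sample["links"])

# "6. srpen" is a hypothetical date title used only to illustrate the lookup.
hits = find_suggestions(sample, ["6. srpen"])
print(len(hits))
```

An empty result means the API response itself does not contain the unwanted suggestion, so any such link still visible in the UI must come from somewhere other than this response.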

Hello @Etonkovidova, thanks for your comment. The output from the API looks good to me -- it seems that the issue with cswiki dates is that "old" recommendations are still remembered by the caching SQL table (should likely happen in other wikis, too).

I'm not sure how long it takes to regenerate the task pool – maybe @Tgr would know how to answer that question.

Hope this helps. Let me know if I can help in any other way.

Manually regenerating it would take something like a day per wiki, but that would require T284551: Maintenance script for updating recommendations to newer dataset.

By itself, on a wiki like bnwiki which has ~20 add link edits per week, it would take basically forever.

An example from today. It is still suggesting dates like "৬ আগস্ট" (6 August).

Checking the example from the above comment on bnwiki wmf.16 for the article আই,_রোবট_(চলচ্চিত্র): the API doesn't list "৬ আগস্ট" as a suggestion.

I checked the target wikis (five random articles); there are some discrepancies between the results that the UI shows and what the API fetches:

Wikis checked: viwiki, arwiki, bnwiki, cswiki

CubeSat: UI links count 3; API links count 4
Eucratides_I: UI links count 3; API links count 1
Trương_Giác_(nhà_Kim): UI links count 8; API links count 7
استكشاف_وادي_الملوك: UI links count 5; API links count 3
ভূ-বেড়া: UI links count 2; API links count 1
১৩_এপ্রিল: UI links count 9; API links count 8
Pola_(odrůda_révy_vinné): UI links count 10; API links count 3
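
The counts listed above can be compared mechanically. The following is only a sketch: the tuple layout and the variable names are assumptions, while the numbers are copied from the counts reported above.

```python
# (article, UI links count, API links count), as reported above.
counts = [
    ("CubeSat", 3, 4),
    ("Eucratides_I", 3, 1),
    ("Trương_Giác_(nhà_Kim)", 8, 7),
    ("استكشاف_وادي_الملوك", 5, 3),
    ("ভূ-বেড়া", 2, 1),
    ("১৩_এপ্রিল", 9, 8),
    ("Pola_(odrůda_révy_vinné)", 10, 3),
]

# Positive difference: the UI shows more links than the API currently returns.
differences = {title: ui - api for title, ui, api in counts}
ui_has_more = [title for title, diff in differences.items() if diff > 0]

print(len(ui_has_more), "of", len(counts), "articles show more links in the UI")
```

A positive difference would be consistent with the earlier explanation that older cached recommendations are still being shown, though the numbers alone do not prove it.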

@Tgr - any further actions on this task? Or can it be marked as Resolved?

If the API results seem correct, it can be resolved.

Etonkovidova claimed this task.

Thanks! Since I did not notice any unnecessary suggestions in API results, I'm resolving the task.