Description
Details
Subject | Repo | Branch | Lines +/- | |
---|---|---|---|---|
Update the filter for removing candidate links | research/mwaddlink | main | +26 -6 |
Status | Subtype | Assigned | Task | ||
---|---|---|---|---|---|
Resolved | MMiller_WMF | T252822 [EPIC] Growth: "add a link" structured task 1.0 | |||
Resolved | MMiller_WMF | T253278 Add a link: link recommendation algorithm | |||
Resolved | MGerlach | T279434 Add a link: algorithm improvements: Define filter for not linking specific articles types | |||
Resolved | Etonkovidova | T284666 Add a link: unnecessary articles on units are often suggested |
Event Timeline
@MMiller_WMF @Trizek-WMF I assume you'd agree with this, so moving into current sprint to go along with other updates to the mwaddlink filtering logic. I guess we'd want the wikidata items for units of measurement?
Unit of measurement is only indirectly related to the specific units in the Wikidata tree. For example, for Degree Celsius:
- Degree Celsius instance_of Unit of temperature
- Unit of temperature subclass_of Unit of measurement
For our filtering logic, we remove all link candidates that are instance_of a given item (such as Unit of temperature).
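A minimal sketch of this kind of instance_of filter, using a toy in-memory mapping rather than the actual mwaddlink data structures. The names `INSTANCE_OF` and `filter_candidates` are hypothetical, and plain labels stand in for Wikidata item IDs:

```python
# Toy data: candidate link target -> set of items it is an instance_of.
# Labels stand in for Wikidata item IDs; all names here are illustrative.
INSTANCE_OF = {
    "Degree Celsius": {"Unit of temperature"},
    "Metre": {"Unit of length"},
    "Arabs": {"Ethnic group"},
}


def filter_candidates(candidates, blocked_types, instance_of=INSTANCE_OF):
    """Keep only candidates that are not an instance_of any blocked type."""
    return [
        c for c in candidates
        if not (instance_of.get(c, set()) & blocked_types)
    ]


candidates = ["Degree Celsius", "Metre", "Arabs"]

# Blocking only the parent class removes nothing, because no candidate is a
# direct instance_of "Unit of measurement".
only_parent = filter_candidates(candidates, {"Unit of measurement"})

# Blocking the specific unit classes removes the unit articles.
specific = filter_candidates(
    candidates, {"Unit of temperature", "Unit of length"}
)
```

Here `only_parent` still contains all three candidates, while `specific` keeps only the non-unit article. That is why the specific unit classes have to be listed in the filter.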
To filter all links related to a unit of measurement, we have to add all items that are a subclass_of Unit of measurement. A SPARQL query yields 30 results; perhaps the most relevant entity types are:
- Unit of length, example: meter
- Unit of time, example: day
- Unit of volume, example: litre
- Unit of mass, example: kilogram
- Unit of energy, example: joule
- Unit of information, example: byte
- Currency, example: US Dollar
- Parts-per notation, example: percent
In order to remove link-candidates related to units of measurement, my recommendation would be to add the above entities to the filter.
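As a rough illustration of how the list of subclasses could be obtained, the sketch below builds a SPARQL query for direct subclasses and applies the same expansion to a toy graph. The exact query used in this task is not shown in the thread, so this is an assumed form; `direct_subclasses_query`, `direct_subclasses`, and `SUBCLASS_OF` are hypothetical names, and the item ID `Q47574` is assumed to be "unit of measurement" and should be verified before use.

```python
def direct_subclasses_query(root_qid):
    """Build a SPARQL query for items that are a direct subclass_of root_qid.

    P279 is the Wikidata property "subclass of". The query shape is an
    assumption, not the query that was actually run for this task.
    """
    return (
        "SELECT ?item ?itemLabel WHERE {\n"
        f"  ?item wdt:P279 wd:{root_qid} .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}"
    )


# Assumed item ID for "unit of measurement"; verify on Wikidata before use.
query = direct_subclasses_query("Q47574")

# Toy subclass_of graph (child -> parent), using labels instead of item IDs.
SUBCLASS_OF = {
    "Unit of temperature": "Unit of measurement",
    "Unit of length": "Unit of measurement",
    "Currency": "Unit of measurement",
    "Unrelated class": "Unrelated parent",
}


def direct_subclasses(root, subclass_of):
    """Return the sorted labels whose direct parent is root."""
    return sorted(child for child, parent in subclass_of.items() if parent == root)


blocked_types = direct_subclasses("Unit of measurement", SUBCLASS_OF)
```

The resulting `blocked_types` list is what would be added to the filter. Only direct subclasses are collected here, matching the description above; deeper levels of the tree would need a transitive query.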
Change 699395 had a related patch set uploaded (by Kosta Harlan; author: MGerlach):
[research/mwaddlink@main] Update the filter for removing candidate links
@Ankan_WMF Do you have one or two example articles where you observed the suggestions you mention above? We have made some changes to the model based on your feedback (T279434) and would like to check manually, before deploying, whether this resolves the issue or creates new problems. Thanks.
@MGerlach, I have checked some articles just now, examples of unit-related articles are:
- উপহ্রদ ('Metre' is being suggested)
- বাথিন্ডা বিমানবন্দর (Metre, Foot – both are being suggested)
Just to add some more date/day-related articles' suggestions, which I got now:
- ডেভিড লেটারম্যান article, 'এপ্রিল ১২' (April 12), 'মে ২০' (May 20)
- লিউপিন article, '২৫শে জুন' (June 25)
- ইউএসএ টুডে স্পোর্টস উইকলি article, '৪ সেপ্টেম্বর' (September 4), 'বুধবার' (Wednesday)
Change 699395 merged by jenkins-bot:
[research/mwaddlink@main] Update the filter for removing candidate links
The datasets are being imported in production, and new suggestions can be checked in a day or two on https://api.wikimedia.org/service/linkrecommendation/apidocs/
@Aftabuzzaman -- thanks for checking in about this. I believe that the change we've made will only affect the evaluation of future articles: as users proceed through the existing suggestions and complete them, the system generates new ones, and those will not contain the terms we want to exclude. Therefore, you'll still see these terms during the time it takes the old suggestions to cycle out and new ones to cycle in.
Checking on wmf.15 cswiki - @Urbanecm_WMF: you mentioned that the issue (the dates displayed as suggested links) is still present in cswiki. I see a discrepancy between the number of links displayed to a user and the links fetched by the API. I'll do more testing on other wikis.
Two recommended links displayed for Alexandrijská knihovna | Only one link fetched by API - link recommendation API for "Alexandrijská knihovna" { "links": [ { "context_after": " v roce 64", "context_before": "ytí města ", "link_index": 0, "link_target": "Arabové", "link_text": "Araby", "match_index": 0, "score": 0.512970507144928, "wikitext_offset": 1917 } ], "links_count": 1, "meta": { "application_version": "5b4709d", "dataset_checksums": { "anchors": "e3799896b97229ae992f007cbd5c837960b66c0dbd65bfe42ae27b6d7b7b3be5", "model": "18190aba440a9b123c0f897f9cf31ec5473d3d1102dba16caa74583067baec5e", "pageids": "c0413b60ae9ce6533b19f1ef9f90a9758fe28242fcfa2cc994aa00baaa1a9d8b", "redirects": "9071e4cb7f33323c3ca0d6d78e7c41016867e4f7ff7da619008c7ce71d251e1f", "w2vfiltered": "689a4fa99a3795a5bba0053280916c48515c5e38d31eab5918d06fcc83f7e641" }, "format_version": 1 }, "page_title": "Alexandrijská knihovna", "pageid": 62136, "revid": 19487788 } |
Five recommended links displayed for Expedice 38 | Only two links fetched by API - link recommendation API for "Expedice 38" { "links": [ { "context_after": " Kóiči Wak", "context_before": " japonský ", "link_index": 0, "link_target": "Kosmonaut", "link_text": "kosmonaut", "match_index": 0, "score": 0.6288405060768127, "wikitext_offset": 3408 }, { "context_after": ".", "context_before": "y, voda a ", "link_index": 1, "link_target": "Kyslík", "link_text": "kyslík", "match_index": 0, "score": 0.6587978005409241, "wikitext_offset": 4125 } ], "links_count": 2, "meta": { "application_version": "5b4709d", "dataset_checksums": { "anchors": "e3799896b97229ae992f007cbd5c837960b66c0dbd65bfe42ae27b6d7b7b3be5", "model": "18190aba440a9b123c0f897f9cf31ec5473d3d1102dba16caa74583067baec5e", "pageids": "c0413b60ae9ce6533b19f1ef9f90a9758fe28242fcfa2cc994aa00baaa1a9d8b", "redirects": "9071e4cb7f33323c3ca0d6d78e7c41016867e4f7ff7da619008c7ce71d251e1f", "w2vfiltered": "689a4fa99a3795a5bba0053280916c48515c5e38d31eab5918d06fcc83f7e641" }, "format_version": 1 }, "page_title": "Expedice 38", "pageid": 652172, "revid": 17878036 } |
Hello @Etonkovidova, thanks for your comment. The output from the API looks good to me -- it seems that the issue with cswiki dates is that "old" recommendations are still remembered by the caching SQL table (this likely happens on other wikis, too).
I'm not sure how long it takes to regenerate the task pool – maybe @Tgr would know how to answer that question.
Hope this helps. Let me know if I can help in any other way.
Manually regenerating it would take something like a day per wiki, but that would require T284551: Maintenance script for updating recommendations to newer dataset.
Left to cycle out on its own, on a wiki like bnwiki, which has ~20 add link edits per week, it would take basically forever.
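A back-of-the-envelope sketch of why this takes so long. It assumes each add link edit retires exactly one stale cached task, and the pool size below is a purely hypothetical figure, not a number from this task:

```python
def weeks_to_cycle_out(stale_tasks, edits_per_week):
    """Estimate the weeks needed for stale cached tasks to be used up.

    Assumes one stale task is retired per add link edit.
    """
    if edits_per_week <= 0:
        raise ValueError("edits_per_week must be positive")
    return stale_tasks / edits_per_week


# Hypothetical pool size, for illustration only.
HYPOTHETICAL_POOL_SIZE = 10_000

# ~20 add link edits per week is the bnwiki figure quoted above.
weeks = weeks_to_cycle_out(HYPOTHETICAL_POOL_SIZE, 20)
years = weeks / 52
```

Under these assumptions the stale pool would take hundreds of weeks to drain, which is why a maintenance script is needed instead.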
Checking the example from the above comment on bnwiki wmf.16 for the article আই,_রোবট_(চলচ্চিত্র): the API doesn't list "৬ আগস্ট" (August 6) as a suggestion.
I checked the target wikis (five random articles); there are some discrepancies between the link counts shown in the UI and those returned by the API:
Wiki | Article | UI links count | API links count |
---|---|---|---|
viwiki | CubeSat | 3 | 4 |
viwiki | Eucratides_I | 3 | 1 |
viwiki | Trương_Giác_(nhà_Kim) | 8 | 7 |
arwiki | استكشاف_وادي_الملوك | 5 | 3 |
bnwiki | ভূ-বেড়া | 2 | 1 |
bnwiki | ১৩_এপ্রিল | 9 | 8 |
cswiki | Pola_(odrůda_révy_vinné) | 10 | 3 |
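A comparison like this could be scripted roughly as follows. The sketch takes an already-parsed API response in the shape shown earlier in this thread (`links`, `link_text`, `links_count`) and a list of link texts read off the UI. The function name `compare_ui_to_api` and the UI list are hypothetical; the thread reports only how many links were displayed, not which ones.

```python
def compare_ui_to_api(ui_link_texts, api_response):
    """Report count and content differences between UI and API suggestions."""
    api_texts = {link["link_text"] for link in api_response.get("links", [])}
    ui_texts = set(ui_link_texts)
    return {
        "ui_count": len(ui_link_texts),
        "api_count": api_response.get("links_count", len(api_texts)),
        "only_in_ui": sorted(ui_texts - api_texts),
        "only_in_api": sorted(api_texts - ui_texts),
    }


# Trimmed API response for "Alexandrijská knihovna", as quoted in this thread.
api_response = {
    "links": [
        {"link_target": "Arabové", "link_text": "Araby", "score": 0.512970507144928}
    ],
    "links_count": 1,
    "page_title": "Alexandrijská knihovna",
}

# Hypothetical UI list: two links were displayed, but the thread does not
# say which ones, so the second entry is a placeholder.
ui_link_texts = ["Araby", "<stale suggestion>"]

report = compare_ui_to_api(ui_link_texts, api_response)
```

Entries under `only_in_ui` would be candidates for stale cached suggestions, while entries under `only_in_api` would be new suggestions that the UI has not picked up yet.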
@Tgr - any further actions on this task, or can it be marked as Resolved?
Thanks! Since I did not notice any unnecessary suggestions in API results, I'm resolving the task.