In T336927, 18 rounds of add-a-link models were trained, and for the pipelines that succeeded, the models were published here:
https://analytics.wikimedia.org/published/datasets/one-off/research-mwaddlink/
Below is a list of wikis whose models were not published:
| Wikis | Reason |
| --- | --- |
| jawiki | T304548#7937512 |
| bowiki | T304549#8060880 |
| dzwiki | T304551#8412493 |
| diqwiki, dvwiki | T304551#8417373 |
| hywwiki | T308134#8548734 |
| lrcwiki | T308136#8648765 |
| mnwwiki, mywiki | T308137#8690680 |
| piwiki | T308138#8708597 |
| zhwiki | T308139#8720236 |
| wuuwiki, zh_classicalwiki, zh_yuewiki | T308139#8728522 |
| shnwiki | T308141#8778455 |
| snwiki, szywiki | T308142#8804657 |
| tiwiki, urwiki | T308143#8827377 |
The goal is to improve the link-recommendation algorithm so that it supports the languages listed above.
==== Notes ====
In T304548#7937512, jawiki did not pass the backtesting evaluation. The suggested next step is to manually inspect the model with users who have experience with the language (or with the help of Google Translate), iteratively improving the link-recommendation algorithm until the model passes the backtesting evaluation.
>>! In T304548#7952043, @MGerlach wrote:
> Some thoughts on potential starting points for how to manually inspect the model. In order to figure out where the text processing is failing for these languages, we could start by checking the following steps:
> - generation of candidate anchors ([[ https://github.com/wikimedia/research-mwaddlink/blob/main/src/scripts/utils.py#L348 | code ]]). This uses simple tokenization to split the text into substrings which then serve as potential anchor-words which can be linked. If we can't identify suitable words, e.g. due to the absence of whitespace, we won't be able to generate good link recommendations.
> - generation of link candidates ([[ https://github.com/wikimedia/research-mwaddlink/blob/main/src/scripts/utils.py#L363 | code ]]). For each candidate anchor, we look up link candidates in the anchor dictionary. The anchor dictionary contains all the already existing links (anchor + title of the linked page) and is created in [[ https://github.com/wikimedia/research-mwaddlink/blob/main/src/scripts/generate_anchor_dictionary_spark.py | this script ]]. We should make sure the anchor dictionary is populated with a sufficient number of links, otherwise we won't be able to generate link candidates for any anchor.
> - disambiguation of link candidates ([[ https://github.com/wikimedia/research-mwaddlink/blob/main/src/scripts/utils.py#L386 | code ]]). Once we have one or more link candidates for a candidate anchor, the model selects the most probable link from the candidates. We can inspect the probabilities assigned to each link candidate to understand where a potential error might come from.
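The three steps above can be sketched in simplified Python. This is only an illustration of the shape of the pipeline, not the repository's actual API: the toy `anchors` dictionary, the whitespace n-gram tokenizer, and the frequency-based disambiguation are all assumptions standing in for the real anchor dictionary, mention detection, and trained model.

```python
# Toy anchor dictionary: anchor text -> {candidate page: existing link count}.
# In the real pipeline this is built from all existing wiki links.
anchors = {
    "machine learning": {"Machine_learning": 120},
    "learning": {"Learning": 30, "Machine_learning": 5},
}


def candidate_anchors(text, max_ngram=3):
    """Step 1: split text into word n-grams that may serve as anchors.
    Whitespace tokenization breaks for languages without word
    boundaries (e.g. ja, zh), which is one suspected failure mode."""
    words = text.split()
    for n in range(max_ngram, 0, -1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i : i + n])


def link_candidates(anchor):
    """Step 2: look up link candidates for an anchor in the dictionary.
    An underpopulated dictionary yields no candidates at all."""
    return anchors.get(anchor.lower(), {})


def disambiguate(candidates):
    """Step 3: pick the most probable link. Here: raw link-count
    frequency; the real model scores candidates with trained features."""
    if not candidates:
        return None
    total = sum(candidates.values())
    page = max(candidates, key=candidates.get)
    return page, candidates[page] / total


text = "Research on machine learning continues"
for anchor in candidate_anchors(text):
    cands = link_candidates(anchor)
    if cands:
        page, prob = disambiguate(cands)
        print(f"{anchor!r} -> {page} (p={prob:.2f})")
```

Inspecting intermediate output at each of the three stages (no anchors found, no candidates found, or implausible probabilities) narrows down which stage fails for a given language.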
Models to be inspected and their datasets can be found on the `stat1008` machine:
```
WIKI_ID=jawiki
cd /home/kevinbazira/mwaddlink/data/$WIKI_ID
```
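As a first inspection step, listing the dataset files and their sizes can already reveal an empty or truncated anchor dictionary before any deeper debugging (paths follow the snippet above; the exact file names vary per pipeline run, so none are assumed here):

```shell
# Illustrative only: inspect one wiki's dataset directory on stat1008.
# An unusually small dataset is a red flag for the candidate-lookup step.
WIKI_ID=jawiki
DATA_DIR=/home/kevinbazira/mwaddlink/data/$WIKI_ID
ls -lh "$DATA_DIR"   # per-file sizes
du -sh "$DATA_DIR"   # total size for quick cross-wiki comparison
```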