In T304548#7937512, jawiki and aswiki did not pass the backtesting evaluation. The suggested next step is to manually inspect the models, either together with users who have experience with these languages or with the help of Google Translate, while the link-recommendation algorithm is iteratively improved until the models pass the backtesting evaluation.
>>! In T304548#7952043, @MGerlach wrote:
> Some thoughts on potential starting points for manually inspecting the model. In order to figure out where the text processing is failing for these languages, we could start by checking the following steps:
> - generation of candidate anchors ([[ https://github.com/wikimedia/research-mwaddlink/blob/main/src/scripts/utils.py#L348 | code ]]). this uses simple tokenization to split the text into substrings, which then serve as potential anchor words that can be linked. if we can't identify suitable words, e.g. due to the absence of whitespace, we won't be able to generate good link recommendations.
> - generation of link candidates ([[ https://github.com/wikimedia/research-mwaddlink/blob/main/src/scripts/utils.py#L363 | code ]]). for each candidate anchor, we look up link candidates in the anchor dictionary. the anchor dictionary contains all existing links (anchor + title of the linked page) and is created in [[ https://github.com/wikimedia/research-mwaddlink/blob/main/src/scripts/generate_anchor_dictionary_spark.py | this script ]]. we should make sure the anchor dictionary is populated with a sufficient number of links, otherwise we won't be able to generate link candidates for any anchor.
> - disambiguation of link candidates ([[ https://github.com/wikimedia/research-mwaddlink/blob/main/src/scripts/utils.py#L386 | code ]]). once we have one or more link candidates for a candidate anchor, the model selects the most probable link from the candidates. we can inspect the probabilities assigned to each link candidate to understand where a potential error might come from.
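
The three steps quoted above can be illustrated with a small, self-contained sketch. This is not the mwaddlink implementation: the function names, the toy anchor dictionaries, and the use of raw link counts as a stand-in for the model's probabilities are all assumptions made for illustration. It mainly shows why whitespace-based tokenization yields no usable anchors for text written without spaces.

```python
# Illustrative sketch only; NOT the code in mwaddlink's utils.py.
# All names and data below are hypothetical.

def candidate_anchors(text, max_words=3):
    """Step 1: split on whitespace and return n-grams, longest first."""
    tokens = text.split()
    anchors = []
    for n in range(max_words, 0, -1):
        for i in range(len(tokens) - n + 1):
            anchors.append(" ".join(tokens[i:i + n]))
    return anchors


def link_candidates(anchor, anchor_dict):
    """Step 2: look up an anchor in a dict {anchor: {page title: link count}}."""
    return anchor_dict.get(anchor.lower(), {})


def disambiguate(candidates):
    """Step 3 (stand-in): rank titles by relative link count.

    The real model scores candidates with a trained classifier; relative
    frequency is used here only to have something to inspect.
    """
    total = sum(candidates.values())
    ranked = [(title, count / total) for title, count in candidates.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


def inspect(text, anchor_dict, max_words=3):
    """Run all three steps and return [(anchor, ranked candidates)]."""
    results = []
    for anchor in candidate_anchors(text, max_words):
        candidates = link_candidates(anchor, anchor_dict)
        if candidates:
            results.append((anchor, disambiguate(candidates)))
    return results


# Toy anchor dictionaries (hypothetical data).
EN_DICT = {"tokyo": {"Tokyo": 9, "Tokyo (band)": 1}, "mount fuji": {"Mount Fuji": 5}}
JA_DICT = {"東京": {"東京都": 10}, "富士山": {"富士山": 5}}

if __name__ == "__main__":
    print(inspect("Mount Fuji is near Tokyo", EN_DICT))
    # The Japanese sentence has no spaces, so step 1 produces a single
    # "token" (the whole sentence) and step 2 finds nothing.
    print(inspect("富士山は東京の近くにあります", JA_DICT))
```

Printing the intermediate output of each step separately (anchors, then candidates, then ranked scores) is one way to see at which step a given wiki starts to fail.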
Models to be inspected and their datasets can be found on the `stat1008` machine:
```
WIKI_ID=jawiki
cd /home/kevinbazira/mwaddlink/data/$WIKI_ID
```
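
As a first sanity check before inspecting a model, it may help to confirm that the data directory actually contains non-empty files. The helper below is a minimal sketch; `dataset_summary` is a hypothetical name, and it makes no assumption about which files the directory holds.

```python
from pathlib import Path


def dataset_summary(data_dir):
    """Return a sorted list of (relative path, size in bytes) for all files under data_dir."""
    root = Path(data_dir)
    return sorted(
        (str(path.relative_to(root)), path.stat().st_size)
        for path in root.rglob("*")
        if path.is_file()
    )


def empty_files(data_dir):
    """Return the relative paths of zero-byte files, which likely indicate a failed step."""
    return [name for name, size in dataset_summary(data_dir) if size == 0]


# Example usage on stat1008 (path taken from the task description):
# for name, size in dataset_summary("/home/kevinbazira/mwaddlink/data/jawiki"):
#     print(f"{size:>12}  {name}")
```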
==== List of wikis to check
* jawiki
* aswiki
* bowiki
* bugwiki
* bpywiki
* dzwiki (see T304551#8412493)