Which types of article should not be recommended as links by the link-recommendation algorithm?
For example, we only recommend links to articles that are in the main namespace of the same Wikipedia and that are not redirect pages. In several places we have specified further types of articles we should not link to, such as:
- disambiguation pages T261408#6840709
- dates (e.g. years) T253279
- dates and centuries T278864#6970551
A substantial fraction of these can be filtered by looking up statements of the corresponding Wikidata-item, most notably the value for the [[ https://www.wikidata.org/wiki/Property:P31 | instance-of property ]]. For example, we can identify the article [[ https://en.wikipedia.org/wiki/Statistics_(disambiguation) | Statistics (disambiguation) ]] as a disambiguation page (without parsing the title) from the fact that the corresponding Wikidata-item ([[ https://www.wikidata.org/wiki/Q2333935 | Q2333935 ]]) specifies that it is an "instance of" the Wikidata-item "Wikimedia disambiguation page" ([[ https://www.wikidata.org/wiki/Q4167410 | Q4167410 ]]).
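A minimal sketch of this lookup, assuming the item is available in the standard Wikidata entity JSON layout (`claims` → `P31` → `mainsnak` → `datavalue` → `value` → `id`). The function name `instance_of_qids` and the abbreviated example entity are illustrative and hand-written; they are not taken from the repository.

```python
from typing import Any, Dict, Set


def instance_of_qids(entity: Dict[str, Any]) -> Set[str]:
    """Return the QIDs given as instance-of (P31) values of a Wikidata entity."""
    qids: Set[str] = set()
    for statement in entity.get("claims", {}).get("P31", []):
        mainsnak = statement.get("mainsnak", {})
        # Statements with snaktype "novalue" or "somevalue" carry no datavalue.
        if mainsnak.get("snaktype") != "value":
            continue
        qids.add(mainsnak["datavalue"]["value"]["id"])
    return qids


# Hand-written, heavily abbreviated entity for illustration only.
example_entity = {
    "id": "Q2333935",
    "claims": {
        "P31": [
            {
                "mainsnak": {
                    "snaktype": "value",
                    "property": "P31",
                    "datavalue": {
                        "type": "wikibase-entityid",
                        "value": {"entity-type": "item", "id": "Q4167410"},
                    },
                }
            }
        ]
    },
}

is_disambiguation = "Q4167410" in instance_of_qids(example_entity)
```

With this, the disambiguation check needs no title parsing: it only tests whether Q4167410 is among the item's instance-of values.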
Currently, we remove an article from the set of candidate links if it is an instance of any of the following items:
- Wikimedia disambiguation page ([[ https://www.wikidata.org/wiki/Q4167410 | Q4167410 ]])
- Wikimedia list article ([[ https://www.wikidata.org/wiki/Q13406463 | Q13406463 ]])
- Year ([[ https://www.wikidata.org/wiki/Q577 | Q577 ]])
- Calendar year ([[ https://www.wikidata.org/wiki/Q3186692 | Q3186692 ]])
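The filter over the four items above can be sketched as follows. This is an illustration under assumptions, not the code in the repository: `FILTERED_INSTANCE_OF`, `keep_candidate` and `filter_candidates` are hypothetical names, and the candidate data is made up for the example.

```python
from typing import Dict, Iterable, List, Set

# QIDs from the list above; an article whose item is an instance of any of them is dropped.
FILTERED_INSTANCE_OF: Dict[str, str] = {
    "Q4167410": "Wikimedia disambiguation page",
    "Q13406463": "Wikimedia list article",
    "Q577": "year",
    "Q3186692": "calendar year",
}


def keep_candidate(instance_of: Iterable[str]) -> bool:
    """True if none of the article's instance-of values is in the filter list."""
    return FILTERED_INSTANCE_OF.keys().isdisjoint(instance_of)


def filter_candidates(candidates: Dict[str, Set[str]]) -> List[str]:
    """Keep the titles whose instance-of values pass the filter, in input order."""
    return [title for title, qids in candidates.items() if keep_candidate(qids)]


# Illustrative input: article title -> instance-of (P31) values of its Wikidata item.
candidates = {
    "Statistics (disambiguation)": {"Q4167410"},
    "Ada Lovelace": {"Q5"},  # Q5 = human
    "2019": {"Q3186692"},
}

kept = filter_candidates(candidates)
```

Note that this sketch matches instance-of values exactly and does not follow subclass-of (P279) relations, so each type to be filtered has to be listed explicitly, as with "year" and "calendar year" above.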
Other candidates are:
- century ([[ https://www.wikidata.org/wiki/Q578 | Q578 ]])
- calendar date ([[ https://www.wikidata.org/wiki/Q205892 | Q205892 ]])
- point in time with respect to recurrent timeframe ([[ https://www.wikidata.org/wiki/Q14795564 | Q14795564 ]])
- ...
Aim:
[ ] Identify other types of articles we do not want to link to, based on linking conventions or feedback
[ ] Identify the corresponding Wikidata items
[ ] Add the list of Wikidata items to the filter in the repository and retrain the model ([[ https://github.com/wikimedia/research-mwaddlink/blob/main/src/scripts/filter_dict_anchor.py#L12 | code ]])