We recently finished [[ https://arxiv.org/abs/2306.03940 | a paper ]] describing the problem of orphan articles as the dark matter in Wikipedia. In that work, we sketched a potential solution to support editors in de-orphanization using link translation ([[ https://linkrec.toolforge.org/ | Wiki-Visibility tool ]]).
Once we have identified a suitable new link to add (i.e., a pair of a source article and a target article), we still face the task of inserting the link somewhere into the text of the source article. This can be non-trivial if no anchor word matching the page title of the target article is available and/or if the source article contains a lot of text.
Therefore, in this task we develop a multilingual model that supports link insertion by identifying the most suitable text span in the source article for a specific new link.
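To make the span-selection problem concrete, here is a minimal baseline sketch in Python. It is not the multilingual model this task aims for, only an illustration of the two cases described above: if the target's page title occurs verbatim in the source text, that occurrence is returned as the anchor; otherwise it falls back to the sentence with the highest token overlap with the target. The function names, the optional `target_summary` argument, the naive sentence splitter, and the Jaccard scoring are all assumptions made for illustration.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokens(text):
    """Lowercased set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def find_insertion_span(source_text, target_title, target_summary=""):
    """Return (start, end, kind) as character offsets into source_text.

    kind is "exact" when the target title occurs verbatim (case-insensitive),
    "sentence" when we fall back to the best-overlapping sentence,
    and None (with None offsets) when nothing overlaps at all.
    """
    # Case 1: an anchor matching the page title is available.
    pattern = r"(?<!\w)" + re.escape(target_title) + r"(?!\w)"
    match = re.search(pattern, source_text, flags=re.IGNORECASE)
    if match:
        return match.start(), match.end(), "exact"

    # Case 2 (hypothetical fallback): pick the sentence whose tokens
    # overlap most with the target title and summary (Jaccard similarity).
    query = tokens(target_title) | tokens(target_summary)
    best, best_score = (None, None, None), 0.0
    pos = 0
    for sentence in split_sentences(source_text):
        start = source_text.find(sentence, pos)
        pos = start + len(sentence)
        sent_tokens = tokens(sentence)
        if not sent_tokens or not query:
            continue
        score = len(query & sent_tokens) / len(query | sent_tokens)
        if score > best_score:
            best, best_score = (start, pos, "sentence"), score
    return best
```

A learned model would replace the fallback scoring, for example by comparing embeddings of candidate spans with an embedding of the target article; the exact-match case remains a useful first check either way.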
Previous focus: ~~In this task, the aim is to quantitatively evaluate how good these recommendations are. The idea is to propose a better model than SOTA benchmarks, but also to describe how currently available models do not perform well in the regime that is most relevant to support editors (e.g. orphans).~~
Specifically, we will consider the following items:
[ ] Compare link translation with existing SOTA link-recommendation algorithms in the context of orphan articles
[ ] Provide an improved model using additional information from embeddings
[ ] Generate a new benchmark dataset for a more challenging link recommendation problem
[ ] (stretch) Obtain manual ratings of de-orphanization examples
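
As a rough sketch of how the comparison in the first item could be scored, the snippet below computes two standard ranking metrics, mean reciprocal rank (MRR) and hits@k, over ranked link recommendations. The choice of metrics and the data layout (a ranked list of candidate source articles per orphan, plus a held-out set of true sources) are assumptions for illustration, not a decision made in this task.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is found."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def hits_at_k(ranked, relevant, k):
    """1.0 if any of the top-k items is relevant, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def evaluate(predictions, gold, k=5):
    """Average MRR and hits@k over all orphans with at least one gold link.

    predictions: dict mapping orphan title -> ranked list of source titles
    gold:        dict mapping orphan title -> set of true source titles
    """
    targets = [t for t, sources in gold.items() if sources]
    if not targets:
        return {"mrr": 0.0, "hits_at_k": 0.0}
    mrr = sum(reciprocal_rank(predictions.get(t, []), gold[t]) for t in targets)
    hits = sum(hits_at_k(predictions.get(t, []), gold[t], k) for t in targets)
    return {"mrr": mrr / len(targets), "hits_at_k": hits / len(targets)}
```

Running `evaluate` once per method on the same held-out links, and separately on the orphan subset, would show whether a model that does well overall still does well in the orphan regime.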