The following blue links lead to null topic QIDs:
- categories, typically attached to the last section, i.e., any link starting with Category:
- media, i.e., any link starting with File:
- article sections, e.g., Article#Section
Sanitize them to enable proper topic QID lookups.
Tasks
- In theory, there are QIDs for categories, e.g., this one. In practice, non-English Wikipedias have a localized category prefix, so build a list of all category prefixes
- if a blue link starts with one such prefix, convert it to Category:
- build a list of all media prefixes
- if a blue link starts with one of them, filter it
- consider keeping media links in a separate dataset, since they may be useful for Section-Level-Image-Suggestions
- understand whether splitting on # is a reliable way to remove anchors from links or whether it also impacts other standalone titles
- remove the anchor from any link that has one
- comply with wikilinks documentation to handle multiple topics for one link due to bad string matching, e.g., Q821067 and Q2386274 for Virtual Console
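A minimal sketch of the prefix handling described in the tasks above, assuming Python. The prefix lists (`CATEGORY_PREFIXES`, `MEDIA_PREFIXES`) are hand-written and deliberately incomplete, and `classify_link` is a hypothetical helper; none of these names come from the actual pipeline.

```python
from typing import Tuple

# Hypothetical, incomplete prefix lists for illustration only.
# A real list would have to cover every Wikipedia language edition.
CATEGORY_PREFIXES = {"category", "categoria", "catégorie", "kategorie"}
MEDIA_PREFIXES = {"file", "image", "media", "fichier", "datei"}


def classify_link(link: str) -> Tuple[str, str]:
    """Return (kind, link), where kind is 'category', 'media', or 'article'.

    Category links get their localized prefix rewritten to 'Category:'.
    Media links are returned unchanged so they can go to a separate dataset.
    """
    prefix, sep, rest = link.partition(":")
    if sep and rest:
        key = prefix.strip().lower()
        if key in CATEGORY_PREFIXES:
            return "category", "Category:" + rest.strip()
        if key in MEDIA_PREFIXES:
            return "media", link
    # No recognized namespace prefix: treat as a regular article link.
    return "article", link


if __name__ == "__main__":
    for raw in ["Categoria:Anfíbios", "File:Example.jpg", "Virtual Console"]:
        print(raw, "->", classify_link(raw))
```

Titles that merely contain a colon, such as a subtitle, fall through to the article branch because their prefix is not in either list.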
Conclusion
This ticket turned out to be trickier than expected. Here's a report of the actual work done.
- Category links required multiple iterations:
- T325369: Use Wikipedia page properties instead of Wikidata page links was not a solution
- checking against a list of all category prefixes & converting to Category: was not a solution. Links ended up in null topic QIDs anyway due to language-specific suffixes, e.g., Categoria:Anfíbios > Category:Anfíbios > null QID
- loading category pages separately from article pages was the solution, see this commit
- the gain was roughly +12 M links
- updated the topic QIDs lookup strategy: lowercase the first character of link target titles to comply with the expected markup
- found empty-string links and filtered them
- built a list of media prefixes and separated roughly 32 M media links from the main dataset
- won't handle article section links:
- links with internal anchors only account for roughly 0.1% of the overall dataset
- removing the anchor actually makes the link more generic, and may have a negative impact on precision/relevance
- splitting on # also impacts other standalone titles
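The lookup and filtering steps reported above could look roughly like the following sketch, assuming Python and plain lists of link strings. `normalize_title`, `clean_links`, and `anchor_share` are hypothetical names for illustration, not functions from the pipeline.

```python
from typing import Iterable, List


def normalize_title(title: str) -> str:
    """Lowercase only the first character; the rest of the title is untouched."""
    return title[:1].lower() + title[1:]


def clean_links(links: Iterable[str]) -> List[str]:
    """Drop empty-string links, then normalize the remaining titles."""
    return [normalize_title(link) for link in links if link != ""]


def anchor_share(links: Iterable[str]) -> float:
    """Fraction of links containing '#', i.e. candidates for an internal anchor."""
    items = list(links)
    if not items:
        return 0.0
    return sum("#" in link for link in items) / len(items)


if __name__ == "__main__":
    sample = ["Virtual Console", "", "Article#Section"]
    print(clean_links(sample))
    print(anchor_share(sample))
```

A check like `anchor_share` is one way to estimate how many links carry an anchor before deciding whether handling them is worth the effort.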
Dataset sizes
| snapshot | total rows | null topic QIDs | null links | category links | media links |
| 2022-12-19 | 1.36 B (1,363,360,235) | 184 M (183,794,789) | 0 | 12 M (11,554,204) | 32 M (31,776,192) |
NOTE: The above table comes from the pipeline executed without the relevance score.
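For a sense of scale, the share of rows with a null topic QID can be derived from the table. This is a small sketch that assumes the null topic QID count is a subset of the total rows; the constant and function names are illustrative.

```python
# Figures copied from the 2022-12-19 snapshot row in the table above.
TOTAL_ROWS = 1_363_360_235
NULL_TOPIC_QIDS = 183_794_789


def share(part: int, total: int) -> float:
    """Return part as a percentage of total."""
    return 100.0 * part / total


null_share = share(NULL_TOPIC_QIDS, TOTAL_ROWS)

if __name__ == "__main__":
    print(f"null topic QIDs: {null_share:.2f}% of total rows")
```

Under that assumption, roughly 13.5% of rows still have a null topic QID in this snapshot.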