Links that are redirects result in null topic QIDs.
For instance, on frwiki:
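+-----------+---------+---------------------------+----------+-------------+---------+-------------------------------------+
|revision_id| page_qid|                 page_title|section_id|section_title|topic_qid|                          topic_title|
+-----------+---------+---------------------------+----------+-------------+---------+-------------------------------------+
|  191906273|  Q220408|    Thomas François Burgers|         0| ### zero ###|     null|République sud-africaine du Transvaal|
+-----------+---------+---------------------------+----------+-------------+---------+-------------------------------------+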
We should resolve the redirect to its target, République_sud-africaine_(Transvaal), so that the topic QID lookup succeeds.
See also https://gitlab.wikimedia.org/repos/structured-data/section-topics/-/merge_requests/1#note_10669
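A minimal sketch of the idea, not the pipeline's actual code: resolve a link title through a redirect map before looking up its QID. The names `REDIRECTS`, `TITLE_TO_QID`, `normalize` and `lookup_topic_qid` are hypothetical, the QID value is a placeholder rather than the real Wikidata ID, and following only one redirect hop is an assumption.

```python
from typing import Optional

# Hypothetical lookup tables; in a real pipeline these would be built from
# the wiki's redirect data and from Wikidata sitelinks.
REDIRECTS = {
    "République sud-africaine du Transvaal": "République sud-africaine (Transvaal)",
}
TITLE_TO_QID = {
    "République sud-africaine (Transvaal)": "Q_PLACEHOLDER",  # not the real QID
}


def normalize(title: str) -> str:
    """Treat underscores and spaces as equivalent, as in wiki page titles."""
    return title.replace("_", " ").strip()


def lookup_topic_qid(link_title: str) -> Optional[str]:
    """Follow at most one redirect hop (an assumption), then look up the QID."""
    title = normalize(link_title)
    target = REDIRECTS.get(title, title)
    return TITLE_TO_QID.get(target)
```

Without the `REDIRECTS.get` step, the frwiki link above would miss `TITLE_TO_QID` and yield a null topic QID.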
Outcome
The initial assessment was for the August 2022 snapshot, with no relevance score, no excluded sections, and no redirect handling:
- total rows = 1.83 B
- null topic titles = 42 M (~2.3%)
- null topic QIDs = 631 M
- ratio > 1/3
Here we compare the impact of handling redirects for the 2022-11-07 snapshot, no relevance score, sections excluded as per ticked items in T318092: [M] Exclude certain sections from having topics in the section topics pipeline:
- no redirects
  - total rows = 1.74 B (1,743,028,345)
  - null topic QIDs = 538 M (537,750,071)
  - ratio = 0.31
- handled redirects
  - total rows = 1.74 B (1,743,227,622)
  - null topic QIDs = 384 M (383,664,157)
  - ratio = 0.22
  - gain = about 9 percentage points fewer null topic QIDs
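The ratios above can be reproduced from the raw counts. This is only an arithmetic check; the helper name `null_ratio` is made up for illustration.

```python
def null_ratio(null_qids: int, total_rows: int) -> float:
    """Share of rows whose topic QID is null."""
    return null_qids / total_rows


# Counts for the 2022-11-07 snapshot, copied from the list above.
no_redirects = null_ratio(537_750_071, 1_743_028_345)
handled = null_ratio(383_664_157, 1_743_227_622)

# Difference between the two ratios, in percentage points.
gain_pp = (no_redirects - handled) * 100

print(f"{no_redirects:.2f} {handled:.2f} {gain_pp:.1f}")
# prints: 0.31 0.22 8.8
```

The gain is a difference between two ratios, so it is about 9 percentage points, not a 9% relative change.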
The full pipeline for 2022-11-07, i.e., with relevance score, sections excluded, and rows with null topic QIDs dropped, yielded the following counts:
- no redirects = 1.2 B rows
- handled redirects = 1.36 B rows
- gain = ~160 M rows (1.36 B - 1.2 B)
Update
Thanks to @MunizaA's code review, we increased the total number of rows:
- no redirects = 1.2 B rows
- handled redirects = 1.43 B rows
- gain = ~230 M rows (1.43 B - 1.2 B)