[M] Handle redirects
Closed, Resolved (Public)

Description

Links that are redirects result in null topic QIDs.
For instance, on frwiki:

|revision_id| page_qid|                 page_title|section_id|section_title|topic_qid|                          topic_title|
+-----------+---------+---------------------------+----------+-------------+---------+-------------------------------------+
|  191906273|  Q220408|    Thomas François Burgers|         0| ### zero ###|     null|République sud-africaine du Transvaal|

We should resolve the redirect to its target, République_sud-africaine_(Transvaal), so that the topic QID lookup succeeds.
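The resolution step can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the pipeline's actual code: the function names, the `redirects` and `title_to_qid` mappings, and the placeholder QID are all hypothetical, and the real pipeline may perform this as a join over much larger tables.

```python
from typing import Dict, Optional


def normalize_title(title: str) -> str:
    """Use spaces instead of underscores so both title forms compare equal."""
    return title.replace("_", " ").strip()


def resolve_redirect(title: str, redirects: Dict[str, str], max_hops: int = 5) -> str:
    """Follow redirect entries until a non-redirect title is reached.

    Stops after `max_hops` hops or when a cycle is detected, so a malformed
    redirect table cannot cause an infinite loop.
    """
    seen = set()
    current = normalize_title(title)
    for _ in range(max_hops):
        if current in seen or current not in redirects:
            break
        seen.add(current)
        current = normalize_title(redirects[current])
    return current


def lookup_topic_qid(
    title: str, redirects: Dict[str, str], title_to_qid: Dict[str, str]
) -> Optional[str]:
    """Resolve the redirect first, then look up the QID; None if still unknown."""
    return title_to_qid.get(resolve_redirect(title, redirects))


# Hypothetical lookup tables built around the frwiki example above.
redirects = {
    "République sud-africaine du Transvaal": "République_sud-africaine_(Transvaal)",
}
title_to_qid = {
    # Placeholder value, NOT the real Wikidata QID of the target page.
    "République sud-africaine (Transvaal)": "Q_PLACEHOLDER",
}

qid = lookup_topic_qid("République sud-africaine du Transvaal", redirects, title_to_qid)
print(qid)
```

Without the resolution step, the lookup on the redirect title would return `None`, which corresponds to the null `topic_qid` in the row above.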

See also https://gitlab.wikimedia.org/repos/structured-data/section-topics/-/merge_requests/1#note_10669

Outcome

The initial assessment used the August 2022 snapshot, with no relevance score, no excluded sections, and no redirect handling:

  • total rows = 1.83 B
  • null topic titles = 42 M (~2.3%)
  • null topic QIDs = 631 M
  • ratio of null topic QIDs to total rows > 1/3

Here we compare the impact of handling redirects on the 2022-11-07 snapshot, with no relevance score and with sections excluded per the ticked items in T318092 ([M] Exclude certain sections from having topics in the section topics pipeline):

  • no redirects
    • total rows = 1.74 B (1,743,028,345)
    • null topic QIDs = 538 M (537,750,071)
    • ratio = 0.31
  • handled redirects
    • total rows = 1.74 B (1,743,227,622)
    • null topic QIDs = 384 M (383,664,157)
    • ratio = 0.22
    • gain = 9 percentage points fewer null topic QIDs (0.31 → 0.22)
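The ratios above can be reproduced from the exact row counts. The snippet below is a small arithmetic check rather than pipeline code; the helper name `null_ratio` is made up for illustration.

```python
def null_ratio(null_rows: int, total_rows: int) -> float:
    """Share of rows whose topic QID is null."""
    return null_rows / total_rows


# Exact counts reported for the 2022-11-07 snapshot.
no_redirects = null_ratio(537_750_071, 1_743_028_345)
handled = null_ratio(383_664_157, 1_743_227_622)

# Difference between the two ratios, expressed in percentage points.
gain_pp = (no_redirects - handled) * 100

print(f"no redirects:      {no_redirects:.2f}")
print(f"handled redirects: {handled:.2f}")
print(f"gain:              {gain_pp:.1f} percentage points")
```

The gain is a difference between two ratios, so it is best read as percentage points rather than as a relative percentage change.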

The full pipeline for 2022-11-07 (with relevance score, sections excluded, and rows with null topic QIDs dropped) yielded the following counts:

  • no redirects = 1.2 B rows
  • handled redirects = 1.36 B rows
  • +160 M rows gain (1.36 B − 1.2 B)

Update

Thanks to @MunizaA's code review, we increased the total number of rows:

  • no redirects = 1.2 B rows
  • handled redirects = 1.43 B rows
  • +230 M rows gain (1.43 B − 1.2 B)

Event Timeline

CBogen renamed this task from Handle redirects to [M] Handle redirects. (Oct 5 2022, 4:41 PM)
mfossati changed the task status from Open to In Progress. (Thu, Nov 17, 9:54 AM)
mfossati claimed this task.

Integrated review & merged code, closing.