Goal
To enable ranking of section topics at the section level.
Definition
Section-level topic relevance is a score that measures to what extent a given section topic helps summarize and understand the content of a given section text in a Wikipedia article.
Proposal 1
Build on top of the article-level score, which is computed in a cross-wiki fashion as proposed in T314863: [SPIKE] Section topics article-level relevance score and implemented as per T314863#8208535.
This requires a large adaptation of the current logic: we need machine-learned section alignments to compute the topic frequency across wikis.
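As a rough illustration of what that adaptation could look like, the sketch below assumes section alignments are available as a mapping from each wiki to the title of its aligned section, and computes the fraction of wikis whose aligned section links a given topic. The function name, the data shapes, the toy data, and this particular definition of cross-wiki frequency are all hypothetical; the actual article-level logic is the one described in T314863.

```python
def cross_wiki_topic_frequency(topic, aligned_sections, section_topics):
    """Fraction of aligned sections (one per wiki) that link `topic`.

    aligned_sections: dict of wiki -> title of the section aligned to the target
    section_topics: dict of (wiki, section title) -> set of linked topic IDs
    Both structures are assumptions made for this sketch.
    """
    if not aligned_sections:
        return 0.0
    hits = sum(
        1
        for wiki, section in aligned_sections.items()
        if topic in section_topics.get((wiki, section), set())
    )
    return hits / len(aligned_sections)


# Hypothetical toy data: one target section aligned across three wikis
aligned = {"enwiki": "History", "itwiki": "Storia", "frwiki": "Histoire"}
topics = {
    ("enwiki", "History"): {"Q1", "Q2"},
    ("itwiki", "Storia"): {"Q1"},
    ("frwiki", "Histoire"): {"Q3"},
}
freq = cross_wiki_topic_frequency("Q1", aligned, topics)
```

Wikis without an aligned section would simply be missing from `aligned`, which shrinks the denominator and makes the frequency less reliable.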
Caveats
The following points were raised by @matthiasmullie :
- Machine-learned section alignment might still not be good enough. Some of the most relevant topics for a page are likely to be linked already in the lead paragraph; they will keep being discussed at length in later sections, where we can no longer pick them up. This probably holds for many other wikis as well. So even if we manage to align sections accurately, we may still not find the topic in any of the wikis, because in all or most of them it was linked only once, earlier in the page.
- It’s also quite plausible that aligning sections only gets us so far. In wikis with less coverage or detail, we may not be able to find a matching section, because that content isn’t there or is just a minor snippet within a larger section.
- If the above doesn’t work out, we’d have to start to figure out actual (re)occurrences of a topic within a full article. String matching likely won’t cut it, so some form of language processing may be needed - which is essentially what we intended to avoid in the first place by going with links-based topic identification.
Proposal 2
Compute the topic frequency within the same wiki, pretty much as we do with the IDF component.
This requires a small adaptation of the current logic.
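A minimal sketch of a within-wiki TF-IDF over section links follows. It assumes each section is represented as a list of linked topic IDs and uses a smoothed IDF to avoid division by zero; the function name, the data shape, and the smoothing choice are assumptions for illustration, not the pipeline's actual implementation.

```python
import math
from collections import Counter


def section_tf_idf(section_links, all_sections):
    """Score each topic linked in one section with a plain TF-IDF.

    section_links: list of topic IDs linked in the section (repeats allowed)
    all_sections: list of such lists, one per section of the same wiki
    """
    counts = Counter(section_links)
    total_links = len(section_links)
    n_sections = len(all_sections)
    scores = {}
    for topic, count in counts.items():
        tf = count / total_links
        # Number of sections in this wiki that link the topic at least once
        df = sum(1 for links in all_sections if topic in links)
        # Smoothed IDF (an assumption of this sketch)
        idf = math.log((1 + n_sections) / (1 + df))
        scores[topic] = tf * idf
    return scores


# Hypothetical toy wiki with three sections
wiki_sections = [["Q1", "Q2", "Q1"], ["Q2", "Q3"], ["Q2"]]
scores = section_tf_idf(wiki_sections[0], wiki_sections)
```

In this toy data, `Q2` is linked in every section, so its IDF is zero and it ranks below `Q1`.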
Caveats
The score is potentially biased: topics are extracted from links, and a link to a given topic typically occurs only once per page.
See the following discussion with @matthiasmullie:
A “topic” is extracted based on links, which (per wiki guidelines) only appear once. As such, I wonder how reliable the frequency (never more than 1, for any topic mentioned) is.
Currently a given topic may appear more than once in a given page, due to links occurring in templates, typically infoboxes. This is something that we plan to filter as per T318092: [M] Exclude certain sections from having topics in the section topics pipeline, but we may want to revert this decision.
Are we sure the guideline is actually enforced on all Wikipedias? From my manual checks on en, fr, it, pt, and es it looks so, but is that something that we can reliably check?
I doubt whether it’s enforced, but it’s certainly recommended to minimize duplicate links, so we should expect far fewer links (an expected maximum of 1) than the number of times the topic is actually present.
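To make the concern concrete, here is a short derivation, assuming a standard length-normalized TF (an assumption; the exact TF variant used in the pipeline is not stated here). Let c(t, s) be the number of links to topic t in section s, and n_s the number of distinct topics linked in s. If every topic is linked at most once, then:

```latex
\mathrm{tf}(t, s) = \frac{c(t, s)}{\sum_{t'} c(t', s)} = \frac{1}{n_s},
\qquad
\mathrm{score}(t, s) = \mathrm{tf}(t, s) \cdot \mathrm{idf}(t) = \frac{\mathrm{idf}(t)}{n_s}
```

Under that assumption, the ranking of topics within a single section is driven by the IDF component alone, and the TF term only rescales the scores by section size.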
Acceptance Criteria
- Evaluate the two solutions: document their viability and level of effort in this ticket
- Choose which solution to move forward with and document that in T318324
- Already implemented as part of this spike: figure out how to slice a meaningful sample to be used for multiple evaluations
Update
We agreed with Research to start with proposal 2 and evaluate on the sample.
We will split the check among team members to have more eyes on the data.
Outcome
We opted for proposal 2.
The main reason is that TF-IDF is a baseline score, known to work effectively for relatively long texts. It won't be magic when sections are short anyway.
Proposal 1 builds on top of proposal 2 and requires high effort for a likely small improvement.