
[SPIKE] Section-level topic relevance score
Closed, Resolved · Public

Description

Goal

To enable ranking of section topics at the section level.

Definition

Section-level topic relevance is a score that measures to what extent a given section topic helps summarize and understand the content of a given section text in a Wikipedia article.

Proposal 1

Build on top of the article-level score, computed in a cross-wiki fashion as per T314863: [SPIKE] Section topics article-level relevance score's proposal and implemented as per T314863#8208535.
This requires a large adaptation of the current logic: we need machine-learned section alignments to compute the topic frequency across wikis.
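
As a rough illustration only (the data model and names below are assumptions, not the pipeline's actual API): given machine-learned alignments mapping a section to its counterparts on other wikis, the cross-wiki term frequency of a topic could be the number of wikis whose aligned section links it.

```python
from collections import Counter, defaultdict

def cross_wiki_topic_frequency(alignments):
    """Hypothetical cross-wiki term frequency for proposal 1.

    alignments maps a section to the topics linked from its
    machine-aligned counterparts on other wikis:
        {section_id: {wiki: [topics linked in the aligned section]}}
    Returns {section_id: Counter(topic -> number of wikis whose
    aligned section links the topic)}.
    """
    frequency = defaultdict(Counter)
    for section_id, per_wiki_topics in alignments.items():
        for topics in per_wiki_topics.values():
            # Count a topic at most once per wiki: the link-once
            # guideline makes repeated links rare anyway.
            for topic in set(topics):
                frequency[section_id][topic] += 1
    return frequency
```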

Caveats

The following points were raised by @matthiasmullie:

  • Machine-learned section alignment might still not be good enough: it is quite realistic that some of the most relevant topics for a page already appear in the lead paragraph (and continue to be discussed intensively in other sections, where we will no longer be able to pick up on them), and that this holds for many other wikis as well. So even if we manage to accurately align sections, we may still not find the topic in any of the wikis, because in all or most of them it was linked only once, earlier in the page.
  • It's also quite plausible that aligning sections only gets us so far. In wikis with less coverage or detail, we may not be able to find a matching section, because that content isn't there or is just a minor snippet within a larger section.
  • If the above doesn't work out, we'd have to start figuring out actual (re)occurrences of a topic within a full article. String matching likely won't cut it, so some form of language processing may be needed, which is essentially what we intended to avoid in the first place by going with link-based topic identification.

Proposal 2

Compute the topic frequency within the same wiki, pretty much as we do with the IDF component.
This requires a small adaptation of the current logic.
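
A minimal sketch of what this amounts to, assuming sections are represented as lists of linked topics (function and parameter names are illustrative, not the pipeline's actual code):

```python
import math
from collections import Counter

def section_topic_scores(sections, doc_freq, n_articles):
    """TF-IDF ranking of section topics within a single wiki.

    sections: {section title: [topics linked from that section]}
    doc_freq: {topic: number of articles on this wiki linking it}
              (the existing IDF component)
    n_articles: total number of articles on this wiki
    """
    scores = {}
    for title, topics in sections.items():
        counts = Counter(topics)
        total = sum(counts.values()) or 1
        scores[title] = {
            # Term frequency within the section times a smoothed
            # inverse document frequency.
            topic: (count / total)
            * math.log(n_articles / (1 + doc_freq.get(topic, 0)))
            for topic, count in counts.items()
        }
    return scores
```

Note that, because of the link-once guideline discussed in the caveats below, the per-section counts are typically 0 or 1.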

Caveats

The score is potentially biased.
See the following discussion with @matthiasmullie:

a “topic” is extracted based on links, which (per wiki guidelines) only appear once. As such, I wonder how reliable the frequency is (never more than 1 for any topic mentioned).

Currently, a given topic may appear more than once in a given page, due to links occurring in templates, typically infoboxes. This is something we plan to filter out as per T318092: [M] Exclude certain sections from having topics in the section topics pipeline, but we may want to revert that decision.
Are we sure the guideline is actually enforced on all Wikipedias? From my manual checks on en, fr, it, pt, and es it seems so, but is that something we can reliably check?

I doubt whether it’s enforced, but it’s certainly recommended to minimize duplicate links, so we should expect far fewer links (an expected maximum of 1) than the number of times the topic is actually present
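
To make that bias concrete, a toy example (values are made up): link-based extraction caps a topic's term frequency at the number of links, not the number of actual occurrences.

```python
from collections import Counter

# Hypothetical section: the topic is linked once (per the guideline)
# but occurs five more times as plain, unlinked text.
links_in_section = ["Jupiter"]   # what link-based extraction sees
plain_text_mentions = 6          # how often the topic actually occurs

tf = Counter(links_in_section)["Jupiter"]
print(tf)  # 1 -- the same TF as a topic mentioned only in passing
```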

Acceptance Criteria

  • Evaluate the two solutions - document their viability and level of effort in this ticket
  • Choose which solution to move forward with and document that in T318324 - already implemented as part of this spike
  • Figure out how to slice a meaningful sample to be used for multiple evaluations

Update

We agreed with Research to start with proposal 2 and evaluate on the sample.
We will split the check among team members to have more eyes on the data.

Outcome

We opted for proposal 2.
The main reason is that TF-IDF is a baseline score, known to work effectively on relatively long texts; it won't be magic when sections are short anyway.
Proposal 1 builds on top of proposal 2 and requires high effort for a likely small improvement.

Event Timeline

mfossati changed the task status from Open to In Progress. Oct 24 2022, 1:15 PM
mfossati claimed this task.

Proposal 2 implementation merged.
Moving to blocked, pending data sample evaluation.

Sample sent out via the cross-team process.
@AUgolnikova-WMF and I discussed the benefits and level of effort for proposal 1: the main conclusion is that we expect relatively high effort for a likely small improvement.

I suggest waiting for proposal 2's evaluation and additional feedback from Research, then proceeding with proposal 2 if there are no objections.

@mfossati do you have an update on the status of the evaluation for proposal 2? thanks!

@mfossati and @AUgolnikova-WMF:

In this Slack thread, Alexandra says we will

keep the current approach and don't over-engineer trying to get somewhat "perfect" scores (as already discussed with Marco)

Does that mean we can close this ticket?

@mfossati do you have an update on the status of the evaluation for proposal 2? thanks!

I sent the data check request as per our formal cross-team process, solicited Research during the last meeting, and personally pinged @diego and @MunizaA. I haven't seen or received any update so far; I'll follow up again at the next meeting.

In this Slack thread, Alexandra says we will

keep the current approach and don't over-engineer trying to get somewhat "perfect" scores (as already discussed with Marco)

Does that mean we can close this ticket?

The implementation can be considered done, but it'd be best to have additional eyes on the data. That's what I was waiting for before closing. I think we shouldn't wait more than one week; see also my comment above.

As mentioned before in our meetings, the main problem we have is the confusing usage of "Blue Links" as a synonym for "topics". In NLP, topics are either categories or clusters of documents. The second important problem we have is the lack of an evaluation task or guidelines. If we are using links as tags, and we want to evaluate the importance/relevance of such tags, we need a task, because relevance depends on the context.

Therefore, without a clear use case or evaluation criteria, my suggestion is to go with the solution that is easier to implement, and my impression is that this is proposal 2.

@mfossati do you have an update on the status of the evaluation for proposal 2? thanks!

I sent the data check request as per our formal cross-team process, solicited Research during the last meeting, and personally pinged @diego and @MunizaA. I haven't seen or received any update so far; I'll follow up again at the next meeting.

Sorry, I was on vacation. I have filled in the eswiki tab here. I don't see an Urdu tab to be evaluated by Muniza.

mfossati updated the task description.

As mentioned before in our meetings, the main problem we have is the confusing usage of "Blue Links" as a synonym for "topics". In NLP, topics are either categories or clusters of documents.

Fully agree. Blue links are minted by the community for several reasons. I think we're trying to encode them as keywords in NLP jargon.

The second important problem we have is the lack of an evaluation task or guidelines. If we are using links as tags, and we want to evaluate the importance/relevance of such tags, we need a task, because relevance depends on the context.

From my perspective, the core task is keyword extraction. That's also why we agreed on TF-IDF as the baseline ranking score. I tried to provide a quite informal evaluation task and guidelines in the data check call, but I'd love to hear more structured thoughts!

Sorry, I was on vacation. I have filled in the eswiki tab here. I don't see an Urdu tab to be evaluated by Muniza.

No worries and thanks for doing such a tedious task! I'll make sure Urdu is there and ping @MunizaA.

I think we can now close this spike.