
[SPIKE] Section-level topic relevance score
Closed, Resolved · Public

Description

Goal

To enable ranking of section topics at the section level.

Definition

Section-level topic relevance is a score that measures to what extent a given section topic helps summarize and understand the content of a given section text in a Wikipedia article.

Proposal 1

Build on top of the article-level score, computed in a cross-wiki fashion as per T314863: [SPIKE] Section topics article-level relevance score's proposal and implemented as per T314863#8208535.
This requires a large adaptation of the current logic: we need machine-learned section alignments to compute the topic frequency across wikis.
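
As a rough illustration only (the data model and names below are assumptions, not the pipeline's actual API): given machine-learned alignments mapping a section to its counterparts on other wikis, the cross-wiki term frequency of a topic could be the number of wikis whose aligned section links it.

```python
from collections import Counter, defaultdict

def cross_wiki_topic_frequency(alignments):
    """Hypothetical cross-wiki term frequency for proposal 1.

    alignments maps a section to the topics linked from its
    machine-aligned counterparts on other wikis:
        {section_id: {wiki: [topics linked in the aligned section]}}
    Returns {section_id: Counter(topic -> number of wikis whose
    aligned section links the topic)}.
    """
    frequency = defaultdict(Counter)
    for section_id, per_wiki_topics in alignments.items():
        for topics in per_wiki_topics.values():
            # Count a topic at most once per wiki: the link-once
            # guideline makes repeated links rare anyway.
            for topic in set(topics):
                frequency[section_id][topic] += 1
    return frequency
```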

Caveats

The following points were raised by @matthiasmullie:

  • Machine-learned section alignment might still not be good enough: it is quite realistic that some of the most relevant topics for a page already appear in the lead paragraph (and continue to be discussed intensively in other sections, where we will no longer be able to pick up on them), and that this holds for many other wikis as well. So even if we manage to accurately align sections, we may still not find the topic in any of the wikis, because in all or most of them it was linked only once, earlier in the page.
  • It's also quite plausible that aligning sections only gets us so far. In wikis with less coverage or detail, we may not be able to find a matching section, because that content isn't there or is just a minor snippet within a larger section.
  • If the above doesn't work out, we'd have to start figuring out actual (re)occurrences of a topic within a full article. String matching likely won't cut it, so some form of language processing may be needed, which is essentially what we intended to avoid in the first place by going with link-based topic identification.

Proposal 2

Compute the topic frequency within the same wiki, pretty much as we do with the IDF component.
This requires a small adaptation of the current logic.
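
A minimal sketch of what this amounts to, assuming sections are represented as lists of linked topics (function and parameter names are illustrative, not the pipeline's actual code):

```python
import math
from collections import Counter

def section_topic_scores(sections, doc_freq, n_articles):
    """TF-IDF ranking of section topics within a single wiki.

    sections: {section title: [topics linked from that section]}
    doc_freq: {topic: number of articles on this wiki linking it}
              (the existing IDF component)
    n_articles: total number of articles on this wiki
    """
    scores = {}
    for title, topics in sections.items():
        counts = Counter(topics)
        total = sum(counts.values()) or 1
        scores[title] = {
            # Term frequency within the section times a smoothed
            # inverse document frequency.
            topic: (count / total)
            * math.log(n_articles / (1 + doc_freq.get(topic, 0)))
            for topic, count in counts.items()
        }
    return scores
```

Note that, because of the link-once guideline discussed in the caveats below, the per-section counts are typically 0 or 1.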

Caveats

The score is potentially biased.
See the following discussion with @matthiasmullie:

a “topic” is extracted based on links, which (per wiki guidelines) only appear once. As such, I wonder how reliable the frequency is (never more than 1 for any topic mentioned).

Currently, a given topic may appear more than once in a given page, due to links occurring in templates, typically infoboxes. This is something we plan to filter out as per T318092: [M] Exclude certain sections from having topics in the section topics pipeline, but we may want to revert that decision.
Are we sure the guideline is actually enforced on all Wikipedias? From my manual checks on en, fr, it, pt, and es it seems so, but is that something we can reliably check?

I doubt whether it’s enforced, but it’s certainly recommended to minimize duplicate links, so we should expect far fewer links (an expected maximum of 1) than the number of times the topic is actually present
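
To make that bias concrete, a toy example (values are made up): link-based extraction caps a topic's term frequency at the number of links, not the number of actual occurrences.

```python
from collections import Counter

# Hypothetical section: the topic is linked once (per the guideline)
# but occurs five more times as plain, unlinked text.
links_in_section = ["Jupiter"]   # what link-based extraction sees
plain_text_mentions = 6          # how often the topic actually occurs

tf = Counter(links_in_section)["Jupiter"]
print(tf)  # 1 -- the same TF as a topic mentioned only in passing
```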

Acceptance Criteria

  • Evaluate the two solutions - document their viability and level of effort in this ticket
  • Choose which solution to move forward with and document that in T318324 - already implemented as part of this spike
  • Figure out how to slice a meaningful sample to be used for multiple evaluations

Update

We agreed with Research to start with proposal 2 and evaluate on the sample.
We will split the check among team members to have more eyes on the data.

Outcome

We opted for proposal 2.
The main reason is that TF-IDF is a baseline score, known to work effectively on relatively long texts; it won't be magic when sections are short anyway.
Proposal 1 builds on top of proposal 2 and requires high effort for a likely small improvement.

Event Timeline

mfossati changed the task status from Open to In Progress. Oct 24 2022, 1:15 PM
mfossati claimed this task.

Proposal 2 implementation merged.
Moving to blocked, pending data sample evaluation.

Sample sent out via the cross-team process.
@AUgolnikova-WMF and I discussed the benefits and level of effort for proposal 1: the main conclusion is that we expect relatively high effort for a likely small improvement.

I suggest waiting for proposal 2's evaluation and additional feedback from Research, then proceeding with proposal 2 if there are no objections.

@mfossati do you have an update on the status of the evaluation for proposal 2? thanks!

@mfossati and @AUgolnikova-WMF:

In this Slack thread, Alexandra says we will

keep the current approach and don't over-engineer trying to get somewhat "perfect" scores (as already discussed with Marco)

Does that mean we can close this ticket?

@mfossati do you have an update on the status of the evaluation for proposal 2? thanks!

I sent the data check request as per our formal cross-team process, solicited Research during the last meeting, and personally pinged @diego and @MunizaA. I haven't seen or received any update so far; I'll follow up again at the next meeting.

In this Slack thread, Alexandra says we will

keep the current approach and don't over-engineer trying to get somewhat "perfect" scores (as already discussed with Marco)

Does that mean we can close this ticket?

The implementation can be considered done, but it'd be best to have additional eyes on the data. That's what I was waiting for before closing. I think we shouldn't wait more than one week; see also my comment above.

As mentioned before in our meetings, the main problem we have is the confusing usage of "Blue Links" as a synonym for "topics". In NLP, topics are either categories or clusters of documents. The second important problem we have is the lack of an evaluation task or guidelines. If we are using links as tags, and we want to evaluate the importance/relevance of such tags, we need a task, because relevance depends on the context.

Therefore, without a clear use case or evaluation criteria, my suggestion is to go with the solution that is easier to implement, and my impression is that this is proposal 2.

@mfossati do you have an update on the status of the evaluation for proposal 2? thanks!

I sent the data check request as per our formal cross-team process, solicited Research during the last meeting, and personally pinged @diego and @MunizaA. I haven't seen or received any update so far; I'll follow up again at the next meeting.

Sorry, I was on vacation. I have filled in the eswiki tab here. I don't see an Urdu tab to be evaluated by Muniza.

mfossati updated the task description.

As mentioned before in our meetings, the main problem we have is the confusing usage of "Blue Links" as a synonym for "topics". In NLP, topics are either categories or clusters of documents.

Fully agree. Blue links are minted by the community for several reasons. I think we're trying to encode them as keywords in NLP jargon.

The second important problem we have is the lack of an evaluation task or guidelines. If we are using links as tags, and we want to evaluate the importance/relevance of such tags, we need a task, because relevance depends on the context.

From my perspective, the core task is keyword extraction. That's also why we agreed on TF-IDF as the baseline ranking score. I tried to provide a quite informal evaluation task and guidelines in the data check call, but I'd love to hear more structured thoughts!

Sorry, I was on vacation. I have filled in the eswiki tab here. I don't see an Urdu tab to be evaluated by Muniza.

No worries and thanks for doing such a tedious task! I'll make sure Urdu is there and ping @MunizaA.

I think we can now close this spike.