
[M] Exclude certain sections from having topics in the section topics pipeline
Closed, ResolvedPublic

Description

Requirements
Based on an initial examination of section topics, some sections should not have topics generated and stored in the section topics pipeline:

  • (M) References
  • (M) External links
  • (M) Further reading
  • (XS) last section - Update: this rule is too strong. It actually wipes out plenty of potentially useful sections; see /user/mfossati/section_topics/last_section_titles on HDFS for a detailed dataset
  • We can decide which sections to exclude in a similar way to the add links task, T279519
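
A minimal sketch of the title-based exclusion listed above, assuming sections arrive as (title, payload) pairs. The names `EXCLUDED_SECTION_TITLES`, `is_excluded_section` and `keep_sections` are hypothetical and not taken from the pipeline code, and the denylist only covers the English titles named in this task. Other languages would need their own titles, for instance from the section alignments mentioned in the complexity breakdown.

```python
# Hypothetical denylist: the three English titles named in this task.
EXCLUDED_SECTION_TITLES = frozenset(
    {"references", "external links", "further reading"}
)


def normalize_title(title: str) -> str:
    """Collapse runs of whitespace and casefold, so ' External  Links ' matches."""
    return " ".join(title.split()).casefold()


def is_excluded_section(title: str, denylist=EXCLUDED_SECTION_TITLES) -> bool:
    """Return True if the section title is on the denylist."""
    return normalize_title(title) in denylist


def keep_sections(sections, denylist=EXCLUDED_SECTION_TITLES):
    """Filter an iterable of (title, payload) pairs, dropping denylisted titles."""
    return [
        (title, payload)
        for title, payload in sections
        if not is_excluded_section(title, denylist)
    ]
```

For example, `keep_sections([("History", 1), ("References", 2)])` keeps only the History entry. Matching on exact normalized titles is deliberately conservative, in line with the finding that dropping the last section wholesale removes useful content.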

Excluded:

  • (XL) tree type sections (if possible?)
  • (S) Infoboxes - can be used to understand what are important sections on the article level
  • (L) Sections without textual content

Should we exclude templates?

Usage Note:
Note that section topics will be used for section-level image suggestions, and certain sections are to be excluded from receiving image suggestions as per https://phabricator.wikimedia.org/T311730. The sections excluded from having topics are a subset of the sections excluded from having images recommended.

Estimated complexity breakdown

Complexity varies depending on what we want to exclude:

  • sections without textual content may be tricky. L complexity to figure that out
  • references, external links, further reading can be tackled with section alignments machine-learned by Research. M complexity
  • infoboxes should be easy - S
  • last section with category links is trivial - XS
  • tree-type sections look like the upper bound, as I have no idea - XL

Event Timeline

@mfossati feel free to add your observations on section topics examination

Two highlights from a manual check of 50 random output samples in 5 languages:

  1. due to T314865: Include section zero in the data pipeline, we currently extract infobox links, and we should filter them out
  2. category links in the form of Category:Something get attached to the last section. This results in null topic QIDs

I had the same thought as T279519#6985649, which can quickly resolve this.
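
A rough sketch of the category-link filter implied by the second highlight, assuming link targets are available as plain strings. `CATEGORY_PREFIXES`, `is_category_link` and `drop_category_links` are hypothetical names, not taken from the pipeline, and the prefix tuple only covers the English `Category:` namespace. Other wikis use localized namespace names, which would have to be added.

```python
# Assumption: English namespace only; localized names would be appended here.
CATEGORY_PREFIXES = ("category:",)


def is_category_link(target: str, prefixes=CATEGORY_PREFIXES) -> bool:
    """Return True if the link target points into a category namespace."""
    # str.startswith accepts a tuple, so several prefixes are checked at once.
    return target.strip().casefold().startswith(prefixes)


def drop_category_links(targets, prefixes=CATEGORY_PREFIXES):
    """Remove category links so they never reach the topic QID lookup."""
    return [t for t in targets if not is_category_link(t, prefixes)]
```

Filtering the links themselves, rather than dropping the whole last section, leaves the section's other links available for topic generation.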

AUgolnikova-WMF updated the task description.
CBogen renamed this task from Exclude certain sections from having topics in the section topics pipeline to [M] Exclude certain sections from having topics in the section topics pipeline. Sep 27 2022, 3:37 PM
CBogen updated the task description.
mfossati added a subscriber: MunizaA.

Merge request reviewed by @MunizaA (thanks for the blazing fast response! ❤), feedback integrated, code merged!