Requirements
Based on an initial examination of section topics, some sections should not have topics generated and stored in the section topics pipeline:
- (M) References
- (M) External links
- (M) Further reading
- (XS) Last section
  - Update: this rule is too strong. It actually wipes out plenty of potentially useful sections; see /user/mfossati/section_topics/last_section_titles on HDFS for a detailed dataset.
  - We can decide which sections to exclude in a similar way to the add links task, T279519.
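As a rough illustration of the title-based exclusions listed above, a minimal Python sketch could look like the following. It is not the pipeline's actual code: the names `EXCLUDED_SECTION_TITLES`, `normalize_title`, `should_generate_topics` and `filter_sections` are hypothetical, and the denylist covers only the three English titles named in this task. Other languages would need aligned titles, for example from the section alignments mentioned in the complexity breakdown.

```python
# Hypothetical denylist: only the three English titles named in this task.
EXCLUDED_SECTION_TITLES = {"references", "external links", "further reading"}


def normalize_title(title: str) -> str:
    """Lowercase the title and collapse underscores and repeated whitespace."""
    return " ".join(title.replace("_", " ").split()).lower()


def should_generate_topics(section_title: str) -> bool:
    """Return False for sections whose normalized title is on the denylist."""
    return normalize_title(section_title) not in EXCLUDED_SECTION_TITLES


def filter_sections(sections):
    """Keep only (title, wikitext) pairs that should get section topics."""
    return [(title, body) for title, body in sections if should_generate_topics(title)]
```

For example, `filter_sections([("History", "..."), ("External links", "...")])` keeps only the `History` entry. Matching on normalized titles keeps the rule cheap, but it will miss title variants that are not on the list.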
Excluded:
- (XL) tree type sections (if possible?)
- (S) Infoboxes - can be used to understand which sections are important at the article level
- (L) Sections without textual content
Should we exclude templates?
Usage Note:
Note that section topics will be used for section-level image suggestions, and certain sections are excluded from having images suggested for them, as per https://phabricator.wikimedia.org/T311730. The sections excluded from having topics are a subset of the sections excluded from having images recommended.
Estimated complexity breakdown
Complexity varies depending on what we want to exclude:
- sections without textual content may be tricky. L complexity to figure that out
- references, external links, further reading can be tackled with section alignments machine-learned by Research. M complexity
- infoboxes should be easy - S
- last section with category links is trivial - XS
- tree-type sections look like the upper bound, as I have no idea - XL
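For the "sections without textual content" item, one possible heuristic is to strip common wikitext markup and check whether any letters or digits remain. The sketch below only illustrates that idea under stated assumptions: `strip_markup` and `has_textual_content` are hypothetical names, the regular expressions are a simplification and not a real wikitext parser, and cases such as nested links inside file captions or `<gallery>` contents are not handled.

```python
import re

# Simplified patterns; a real wikitext parser would be more reliable.
_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
_REF = re.compile(r"<ref[^>]*/>|<ref[^>]*>.*?</ref>", re.DOTALL | re.IGNORECASE)
_TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")  # matches innermost {{...}} only
_TABLE = re.compile(r"\{\|.*?\|\}", re.DOTALL)
_MEDIA_OR_CATEGORY = re.compile(
    r"\[\[(?:File|Image|Category):[^\[\]]*\]\]", re.IGNORECASE
)
_TAG = re.compile(r"<[^>]+>")


def strip_markup(wikitext: str) -> str:
    """Remove comments, references, templates, tables, media and category links."""
    text = _COMMENT.sub("", wikitext)
    text = _REF.sub("", text)
    # Repeat until stable so nested templates are removed innermost-first.
    previous = None
    while previous != text:
        previous = text
        text = _TEMPLATE.sub("", text)
    text = _TABLE.sub("", text)
    text = _MEDIA_OR_CATEGORY.sub("", text)
    return _TAG.sub("", text)


def has_textual_content(wikitext: str, min_chars: int = 1) -> bool:
    """Return True if at least min_chars letters or digits survive stripping."""
    remaining = strip_markup(wikitext)
    return sum(1 for char in remaining if char.isalnum()) >= min_chars
```

Raising `min_chars` would also drop sections that contain only a few stray characters. Whether a threshold like that is appropriate is an open question this sketch does not answer.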