[L] Exclude sections-tables from having section topics
Closed, Resolved (Public)

Description

Context
Based on the section topics sample for evaluation, many sections are tables with lists of blue links (sporting events, names, etc.). Topics created from these blue links are not meaningful.

AC
Do not generate topics for sections that have:

  • only tables without textual content
  • only unordered or ordered lists without textual content

Also generate one-off approximate stats on the number of excluded sections and articles per wiki of focus, and share them with product (@AUgolnikova-WMF).
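A minimal sketch of the check described in the AC, assuming each section body is available as a raw wikitext string with its heading already stripped. The function name `has_only_tables_or_lists` and the regex-based detection are illustrative assumptions, not the actual section topics code; nested tables and templates are not handled.

```python
import re

# Wikitext table: from a line starting with "{|" to a line starting with "|}".
TABLE_RE = re.compile(r"^\{\|.*?^\|\}[^\n]*$", re.DOTALL | re.MULTILINE)
# Wikitext list item: a line starting with "*" (unordered) or "#" (ordered).
LIST_ITEM_RE = re.compile(r"^[*#].*$", re.MULTILINE)


def has_only_tables_or_lists(section_wikitext: str) -> bool:
    """Return True if the section has a table or list and no other text."""
    without_tables, n_tables = TABLE_RE.subn("", section_wikitext)
    remainder, n_items = LIST_ITEM_RE.subn("", without_tables)
    has_structure = n_tables > 0 or n_items > 0
    # Whatever is left after removing tables and list items counts as text.
    return has_structure and not remainder.strip()
```

Under this sketch, a section with any leading sentence before its table is kept, which matches the "without textual content" wording above.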

Update

  • Snapshot: 2022-12-19
  • total rows before: 1.4 B (1,395,136,427)
  • total rows now: 1.2 B (1,260,350,442)
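
| wiki | total sections | excluded sections |
| --- | --- | --- |
| ar | 2,015,385 | 124,791 |
| bn | 298,255 | 21,450 |
| cs | 1,220,022 | 139,434 |
| es | 4,253,439 | 380,613 |
| id | 1,032,808 | 81,749 |
| pt | 612,413 | 69,272 |
| ru | 4,433,197 | 539,602 |
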
NOTE: the solution only processes sections. If we want to exclude articles, we should use the Wikidata ontology, similarly to image suggestions.
NOTE: we currently extract top-level sections only, so lists or tables contained in subsections will not be excluded (example), since subsections are considered as textual content. We may adjust the behavior by extracting lower-level sections, too.

Iteration 2

Based on the note above:

  • exclude lists and tables appearing in subsections without text
  • count the number of excluded sections
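One possible reading of these two points, sketched under assumptions: the `Section` class, its `has_text` flag, and both function names are hypothetical and only illustrate the idea of looking into subsections instead of treating them as text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Section:
    """Hypothetical section node: a text flag plus nested subsections."""
    title: str
    has_text: bool  # True if the body holds prose outside lists/tables
    subsections: List["Section"] = field(default_factory=list)


def should_exclude(section: Section) -> bool:
    """Exclude a section only if neither it nor any subsection has prose."""
    if section.has_text:
        return False
    return all(should_exclude(sub) for sub in section.subsections)


def count_excluded(sections: List[Section]) -> int:
    """Count top-level sections that would be excluded."""
    return sum(should_exclude(s) for s in sections)


results = Section("Results", False, [Section("2020", False), Section("2021", False)])
history = Section("History", False, [Section("Early years", True)])
print(count_excluded([results, history]))  # only "Results" is excluded
```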

Results

  • Snapshot: 2023-02-06
  • total rows at iteration 1: 500 M (500,382,423)
  • total rows at iteration 2: 469 M (469,334,379)
  • difference: 31 M (31,048,044)
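As a quick sanity check of the figures above, the difference and its relative size can be recomputed from the two row counts. The variable names are illustrative only.

```python
rows_iteration_1 = 500_382_423
rows_iteration_2 = 469_334_379

# Rows removed by the iteration 2 filter.
difference = rows_iteration_1 - rows_iteration_2
# Fraction of iteration 1 rows that iteration 2 removed.
share = difference / rows_iteration_1

print(f"difference: {difference:,}")
print(f"share of iteration 1 rows removed: {share:.1%}")
```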

Observations

While this iteration has improved the filter, it still doesn't cover other sections with standard lists/tables that should be excluded as well, typically those that also contain leading text, links, or templates.
See the following examples: 1, 2, 3, 4, 5.

Iteration 3

I propose to take all the above examples into account by just excluding all sections that contain at least one list or table.
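A rough sketch of this stricter rule, assuming section bodies are raw wikitext strings. The name `contains_list_or_table` and the regex are assumptions for illustration; the actual implementation may detect lists and tables differently.

```python
import re

# A table opener ("{|") or a list item ("*" or "#") at the start of a line.
LIST_OR_TABLE_RE = re.compile(r"^(?:\{\||[*#])", re.MULTILINE)


def contains_list_or_table(section_wikitext: str) -> bool:
    """Return True if the section holds at least one list item or table."""
    return LIST_OR_TABLE_RE.search(section_wikitext) is not None
```

Unlike the earlier iterations, leading text no longer protects a section: `"Intro text.\n* [[A]]"` would be excluded under this rule.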

Results

The following table shows the number of unique unfiltered sections per iteration per target wiki and the difference with the previous iteration.

NOTE: iteration 3 excluded unexpected sections, mainly due to the ambiguity of # being both an ordered list item in wikitext and a link anchor. Iteration 3 bis fixes that, thus keeping more sections.

| wiki | iteration 1 | iteration 2 | iteration 3 | iteration 3 bis | 1 - 2 | 2 - 3 | 3 - 3 bis |
| --- | --- | --- | --- | --- | --- | --- | --- |
| arwiki | 1,871,261 | 1,848,318 | 1,626,252 | 1,662,699 | 22,943 | 222,066 | -36,447 |
| bnwiki | 276,957 | 273,280 | 237,289 | 247,158 | 3,677 | 35,991 | -9,869 |
| cswiki | 1,080,315 | 1,057,102 | 901,001 | 934,944 | 23,213 | 156,101 | -33,943 |
| enwiki | 15,035,011 | 14,730,882 | 11,782,661 | 12,771,248 | 304,129 | 2,948,221 | -988,587 |
| eswiki | 3,812,674 | 3,700,500 | 3,085,701 | 3,216,167 | 112,174 | 614,799 | -130,466 |
| frwiki | 5,119,685 | 4,978,695 | 4,030,943 | 4,222,696 | 140,990 | 947,752 | -191,753 |
| idwiki | 944,783 | 931,481 | 804,290 | 839,938 | 13,302 | 127,191 | -35,648 |
| ptwiki | 699,468 | 683,154 | 565,811 | 593,640 | 16,314 | 117,343 | -27,829 |
| ruwiki | 3,897,698 | 3,842,851 | 3,264,390 | 3,397,947 | 54,847 | 578,461 | -133,557 |
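The `#` ambiguity mentioned in the note can be illustrated with a small sketch. How iteration 3 actually matched `#` is not stated in this ticket, so both functions below are hypothetical: one shows how an unanchored check over-matches on link anchors, the other restricts the match to the start of a line.

```python
import re

ORDERED_ITEM_RE = re.compile(r"^#", re.MULTILINE)


def naive_has_ordered_list(wikitext: str) -> bool:
    # Over-matches: any "#" counts, including link anchors.
    return "#" in wikitext


def has_ordered_list(wikitext: str) -> bool:
    # Only a "#" at the start of a line is treated as a list item.
    return ORDERED_ITEM_RE.search(wikitext) is not None


prose = "See [[Rome#History|the history of Rome]]."
listing = "# first item\n# second item"
```

With these inputs, `prose` is flagged only by the naive check, while `listing` is flagged by both.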

Recap

The following table shows, for the last iteration, the number of unique sections remaining and how many were filtered out as date and media topics:

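| wiki | total sections | filtered |
| --- | --- | --- |
| arwiki | 1,649,850 | 12,849 |
| bnwiki | 245,166 | 1,992 |
| cswiki | 930,071 | 4,873 |
| enwiki | 12,618,647 | 152,601 |
| eswiki | 3,186,276 | 29,891 |
| frwiki | 4,186,061 | 36,635 |
| idwiki | 835,627 | 4,311 |
| ptwiki | 591,763 | 1,877 |
| ruwiki | 3,357,439 | 40,508 |
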
NOTE: Additional Acceptance Criteria: Currently the main use case for section topics is section-level image suggestions, for which we want tables excluded, but we are not sure whether we will want them excluded for future use cases. Therefore, ensure that the table exclusion logic is *optional* in the code so that it can be removed later for use cases in which we want section topics for sections with tables.
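One way to keep the exclusion optional is a flag on the selection step, sketched below. The function `select_sections`, its `exclude_lists_and_tables` parameter, and the regex are assumptions for illustration and do not reflect the merged code.

```python
import re
from typing import Dict, List

# A table opener ("{|") or a list item ("*" or "#") at the start of a line.
LIST_OR_TABLE_RE = re.compile(r"^(?:\{\||[*#])", re.MULTILINE)


def select_sections(
    sections: Dict[str, str], exclude_lists_and_tables: bool = True
) -> List[str]:
    """Return titles of sections eligible for topic generation."""
    if not exclude_lists_and_tables:
        # Exclusion switched off: every section stays eligible.
        return list(sections)
    return [
        title
        for title, text in sections.items()
        if not LIST_OR_TABLE_RE.search(text)
    ]


sections = {"History": "Prose.", "Results": "{|\n| [[A]]\n|}"}
```

A future use case that wants topics for table sections would call `select_sections(sections, exclude_lists_and_tables=False)` without touching the filter itself.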

Event Timeline

MarkTraceur renamed this task from Exclude sections-tables from having section topics to [L] Exclude sections-tables from having section topics. Dec 1 2022, 5:57 PM

Can we add lists to this too? Some sections are entirely enclosed with <ul></ul> tags

mfossati changed the task status from Open to In Progress. Jan 16 2023, 11:05 AM
mfossati claimed this task.

Thanks @mfossati,

NOTE: the solution only processes sections. If we want to exclude articles, we should use the Wikidata ontology, similarly to image suggestions.

You mean that we won't be excluding articles that are all tables, as we do for article image suggestions? I think it is mentioned as part of https://phabricator.wikimedia.org/T311730 to inherit exclusions on the article level from article-level suggestions:

Exclude sections of articles with specific instanceof values (inherited from article-level image suggestions). See articles excluded here

@AUgolnikova-WMF, I'm not sure I understand: do we want to exclude these articles in Section-Topics or in Section-Level-Image-Suggestions? This ticket targets the former; T311730: [L] Exclude certain sections from having generated image suggestions targets the latter.

@mfossati We can exclude them in Section-Level-Image-Suggestions as part of T311730: [L] Exclude certain sections from having generated image suggestions

Sounds good, closing this one then.

Re-opened to work on iteration 2 (subsections) -- see ticket description. This will be in section topics code, and then duplicated into section alignment code in T330841.

mfossati changed the task status from Open to In Progress. Thu, Mar 2, 4:07 PM

Merged, closing. Follow-up work in T330848