Context
The section topics evaluation sample shows that many sections are just tables with lists of blue links (sporting events, names, etc.). Topics created from these blue links are not meaningful.
AC
Do not generate topics for sections that have:
- only tables without textual content
- only unordered or ordered lists without textual content
Additionally:
- generate one-off approximate stats on the number of excluded sections and articles per wiki of focus and share them with product @AUgolnikova-WMF
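The two exclusion rules above can be sketched as a line-based heuristic over a section's wikitext. This is only an illustration under stated assumptions, not the filter actually used: the function names `classify_lines` and `should_skip` are hypothetical, tables are assumed to be delimited by `{|` and `|}`, and list items are assumed to start with `*` or `#`.

```python
def classify_lines(wikitext: str) -> dict:
    """Count non-blank lines that belong to tables, lists, or plain text."""
    counts = {"table": 0, "list": 0, "text": 0}
    in_table = False
    for raw in wikitext.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("{|"):
            in_table = True  # assumed table opener
        if in_table:
            counts["table"] += 1
            if line.startswith("|}"):
                in_table = False  # assumed table closer
        elif line[0] in "*#":
            counts["list"] += 1  # assumed unordered/ordered list item
        else:
            counts["text"] += 1
    return counts


def should_skip(wikitext: str) -> bool:
    """True when the section has tables and/or lists but no textual content."""
    c = classify_lines(wikitext)
    return c["text"] == 0 and (c["table"] + c["list"]) > 0


print(should_skip("* [[A]]\n* [[B]]"))        # list only
print(should_skip("Intro text.\n* [[A]]"))    # list with leading text
```

A real implementation would need a proper wikitext parser, since this heuristic ignores templates, definition lists, and lines such as redirects that also start with `#`.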
Update
- Snapshot: 2022-12-19
- total rows before: 1.4 B (1,395,136,427)
- total rows now: 1.3 B (1,260,350,442)
wiki | total sections | excluded sections |
ar | 2,015,385 | 124,791 |
bn | 298,255 | 21,450 |
cs | 1,220,022 | 139,434 |
es | 4,253,439 | 380,613 |
id | 1,032,808 | 81,749 |
pt | 612,413 | 69,272 |
ru | 4,433,197 | 539,602 |
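To make the per-wiki figures easier to compare, the share of excluded sections can be derived from the two columns above. A minimal sketch, with the numbers copied from the table and a hypothetical helper name `excluded_share`:

```python
# Figures copied from the table above: wiki -> (total sections, excluded sections).
stats = {
    "ar": (2_015_385, 124_791),
    "bn": (298_255, 21_450),
    "cs": (1_220_022, 139_434),
    "es": (4_253_439, 380_613),
    "id": (1_032_808, 81_749),
    "pt": (612_413, 69_272),
    "ru": (4_433_197, 539_602),
}


def excluded_share(total: int, excluded: int) -> float:
    """Percentage of sections excluded by the filter."""
    return 100.0 * excluded / total


share = {wiki: excluded_share(t, e) for wiki, (t, e) in stats.items()}

# Print wikis from the highest to the lowest share of excluded sections.
for wiki, pct in sorted(share.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{wiki}: {pct:.1f}% of sections excluded")
```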
Iteration 2
Based on the note above:
- exclude lists and tables appearing in subsections without text
- count the number of excluded sections
Results
- Snapshot: 2023-02-06
- total rows at iteration 1: 500 M (500,382,423)
- total rows at iteration 2: 469 M (469,334,379)
- difference: 31 M (31,048,044)
Observations
While this iteration improves the filter, it still misses other sections with standard lists/tables that should also be excluded, typically ones where the list or table is accompanied by leading text, links, or templates.
See the following examples: 1, 2, 3, 4, 5.
Iteration 3
I propose to take all the above examples into account by just excluding all sections that contain at least one list or table.
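The proposed rule is stricter than the previous ones: a single list or table is enough to exclude the section, regardless of surrounding text. A rough sketch, assuming raw wikitext input where tables open with `{|` and list items start with `*` or `#` at the beginning of a line (the name `has_list_or_table` is hypothetical):

```python
import re

# Assumption: "{|" opens a table, "*" or "#" at line start marks a list item.
LIST_OR_TABLE = re.compile(r"^\s*(\{\||[*#])", re.MULTILINE)


def has_list_or_table(wikitext: str) -> bool:
    """True when the section contains at least one list item or table opener."""
    return LIST_OR_TABLE.search(wikitext) is not None


print(has_list_or_table("Intro.\n* [[A]]"))          # text plus a list
print(has_list_or_table("Only plain prose here."))   # no list or table
```

The trade-off is recall for precision: sections with useful prose next to a short list are dropped as well.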
Results
The following table shows the number of unique unfiltered sections per iteration per target wiki and the difference from the previous iteration.
wiki | iteration 1 | iteration 2 | iteration 3 | iteration 3 bis | 1 - 2 | 2 - 3 | 3 - 3 bis |
arwiki | 1,871,261 | 1,848,318 | 1,626,252 | 1,662,699 | 22,943 | 222,066 | -36,447 |
bnwiki | 276,957 | 273,280 | 237,289 | 247,158 | 3,677 | 35,991 | -9,869 |
cswiki | 1,080,315 | 1,057,102 | 901,001 | 934,944 | 23,213 | 156,101 | -33,943 |
enwiki | 15,035,011 | 14,730,882 | 11,782,661 | 12,771,248 | 304,129 | 2,948,221 | -988,587 |
eswiki | 3,812,674 | 3,700,500 | 3,085,701 | 3,216,167 | 112,174 | 614,799 | -130,466 |
frwiki | 5,119,685 | 4,978,695 | 4,030,943 | 4,222,696 | 140,990 | 947,752 | -191,753 |
idwiki | 944,783 | 931,481 | 804,290 | 839,938 | 13,302 | 127,191 | -35,648 |
ptwiki | 699,468 | 683,154 | 565,811 | 593,640 | 16,314 | 117,343 | -27,829 |
ruwiki | 3,897,698 | 3,842,851 | 3,264,390 | 3,397,947 | 54,847 | 578,461 | -133,557 |
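The difference columns can be recomputed from the iteration counts as a sanity check. A small sketch with the counts copied from the table above (the `deltas` helper is a hypothetical name); a negative value in the last column means iteration 3 bis keeps more sections than iteration 3:

```python
# wiki -> (iteration 1, iteration 2, iteration 3, iteration 3 bis), from the table above.
counts = {
    "arwiki": (1_871_261, 1_848_318, 1_626_252, 1_662_699),
    "bnwiki": (276_957, 273_280, 237_289, 247_158),
    "cswiki": (1_080_315, 1_057_102, 901_001, 934_944),
    "enwiki": (15_035_011, 14_730_882, 11_782_661, 12_771_248),
    "eswiki": (3_812_674, 3_700_500, 3_085_701, 3_216_167),
    "frwiki": (5_119_685, 4_978_695, 4_030_943, 4_222_696),
    "idwiki": (944_783, 931_481, 804_290, 839_938),
    "ptwiki": (699_468, 683_154, 565_811, 593_640),
    "ruwiki": (3_897_698, 3_842_851, 3_264_390, 3_397_947),
}


def deltas(row):
    """Differences between consecutive iterations: (1 - 2, 2 - 3, 3 - 3 bis)."""
    return tuple(prev - cur for prev, cur in zip(row, row[1:]))


for wiki, row in counts.items():
    print(wiki, deltas(row))
```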
Recap
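The following table shows the number of unique sections remaining after the last iteration and how many were additionally filtered out because of date and media topics. For each wiki, total sections equals the iteration 3 bis count minus the filtered count: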
wiki | total sections | filtered |
arwiki | 1,649,850 | 12,849 |
bnwiki | 245,166 | 1,992 |
cswiki | 930,071 | 4,873 |
enwiki | 12,618,647 | 152,601 |
eswiki | 3,186,276 | 29,891 |
frwiki | 4,186,061 | 36,635 |
idwiki | 835,627 | 4,311 |
ptwiki | 591,763 | 1,877 |
ruwiki | 3,357,439 | 40,508 |
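The recap figures line up with the iteration 3 bis counts reported earlier: subtracting the filtered column from the iteration 3 bis count gives the total sections column. A short check, with all numbers copied from the two tables and an illustrative `rows` layout:

```python
# wiki -> (iteration 3 bis count, recap total sections, recap filtered).
rows = {
    "arwiki": (1_662_699, 1_649_850, 12_849),
    "bnwiki": (247_158, 245_166, 1_992),
    "cswiki": (934_944, 930_071, 4_873),
    "enwiki": (12_771_248, 12_618_647, 152_601),
    "eswiki": (3_216_167, 3_186_276, 29_891),
    "frwiki": (4_222_696, 4_186_061, 36_635),
    "idwiki": (839_938, 835_627, 4_311),
    "ptwiki": (593_640, 591_763, 1_877),
    "ruwiki": (3_397_947, 3_357_439, 40_508),
}

# Collect any wiki where "3 bis - filtered" does not equal the recap total.
mismatches = [
    wiki for wiki, (bis, total, filtered) in rows.items()
    if bis - filtered != total
]
print(mismatches or "all rows consistent")
```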