**Context**
The [[ https://docs.google.com/spreadsheets/d/1typAKEG8yE6H3uqITbSTA4KXnzATB3EHMP70P4c4MEU/edit#gid=1801906281 | section topics sample ]] for evaluation shows that many sections are just tables or lists of blue links (sporting events, names, etc.). Topics created from these blue links are not meaningful.
**AC**
Do not generate topics for sections that have:
- [x] only tables without textual content
- [x] only unordered or ordered lists without textual content

Also:
- [x] generate one-off approximate stats on the number of excluded sections and articles per wiki of focus and share them with product @AUgolnikova-WMF
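
As a rough illustration of the two exclusion criteria above, the following sketch checks whether a section's wikitext consists only of tables and list items. It is not the actual implementation: the function name and the regexes are assumptions made for this example, and it ignores nested or indented tables, definition lists, and templates.

```python
import re

# Assumption: a table opens with "{|" and closes with "|}", each at the start of a line.
# The non-greedy match does not handle nested tables.
TABLE_RE = re.compile(r"^\{\|.*?^\|\}", re.DOTALL | re.MULTILINE)
# Assumption: list items start with "*" (unordered) or "#" (ordered).
LIST_ITEM_RE = re.compile(r"^[*#]")


def is_list_or_table_only(wikitext: str) -> bool:
    """Return True if the section has at least one list or table and no other text."""
    without_tables, n_tables = TABLE_RE.subn("", wikitext)
    lines = [line.strip() for line in without_tables.splitlines() if line.strip()]
    n_items = sum(1 for line in lines if LIST_ITEM_RE.match(line))
    has_text = n_items < len(lines)  # any remaining line that is not a list item
    return (n_tables > 0 or n_items > 0) and not has_text
```

With this rule, a section containing a single sentence next to a list is kept, because the sentence counts as textual content.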
==Update==
- Snapshot: `2022-12-19`
- total rows before: **1.4 B** (1,395,136,427)
- total rows now: **1.3 B** (1,260,350,442)
| wiki | total sections | excluded sections
| ar | 2,015,385 | 124,791
| bn | 298,255 | 21,450
| cs | 1,220,022 | 139,434
| es | 4,253,439 | 380,613
| id | 1,032,808 | 81,749
| pt | 612,413 | 69,272
| ru | 4,433,197 | 539,602
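
To put the counts above in proportion, the share of excluded sections per wiki can be derived with a few lines of Python. The figures are copied from the table; the names `STATS` and `excluded_share` are made up for this sketch.

```python
# (total sections, excluded sections) per wiki, copied from the table above
STATS = {
    "ar": (2_015_385, 124_791),
    "bn": (298_255, 21_450),
    "cs": (1_220_022, 139_434),
    "es": (4_253_439, 380_613),
    "id": (1_032_808, 81_749),
    "pt": (612_413, 69_272),
    "ru": (4_433_197, 539_602),
}


def excluded_share(total: int, excluded: int) -> float:
    """Fraction of a wiki's sections that were excluded."""
    return excluded / total


for wiki, (total, excluded) in STATS.items():
    print(f"{wiki}: {excluded_share(total, excluded):.1%}")
```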
NOTE: the solution only processes sections. If we want to exclude **articles**, we should use the Wikidata ontology, similarly to [image suggestions](https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/blob/5c4a2ba7c37fcb9e06172396e8bf082affbfa774/image_suggestions/cassandra.py#L39).
NOTE: we currently extract **top-level** sections only, so lists or tables contained in subsections will **not** be excluded ([example](https://it.wikipedia.org/wiki/Paul_McGann#Filmografia_parziale)), since subsections are treated as textual content. We may adjust this behavior by extracting lower-level sections, too.
==Iteration 2==
Based on the note above:
- [x] exclude lists and tables appearing in subsections without text
- [x] count the number of excluded sections
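
One way to picture the subsection rule is a recursive check over the section tree. The sketch below is only an assumption about how this could look, not the code that was shipped: the `Section` class, its boolean fields, and `is_excludable` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Section:
    title: str
    has_prose: bool          # the section's own body has text outside lists/tables
    has_list_or_table: bool  # the section's own body has a list or table
    subsections: List["Section"] = field(default_factory=list)


def is_excludable(section: Section) -> bool:
    """A section is excludable if neither it nor any subsection carries prose."""
    if section.has_prose:
        return False
    if not all(is_excludable(sub) for sub in section.subsections):
        return False
    # Empty sections without subsections are kept: there is nothing to exclude.
    return section.has_list_or_table or bool(section.subsections)
```

Under these assumptions, a top-level section whose subsections contain only lists would be excluded, while a single subsection with prose keeps the whole section.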
===Results===
- Snapshot: `2023-02-06`
- total rows at iteration 1: **500 M** (500,382,423)
- total rows at iteration 2: **469 M** (469,334,379)
- difference: **31 M** (31,048,044)
The following table shows the number of unique sections per iteration per target wiki.
| **wiki** | **iteration 1** | **iteration 2** | **difference**
| arwiki| 1,871,261| 1,848,318| 22,943|
| bnwiki| 276,957| 273,280| 3,677|
| cswiki| 1,080,315| 1,057,102| 23,213|
| enwiki|15,035,011|14,730,882|304,129|
| eswiki| 3,812,674| 3,700,500|112,174|
| frwiki| 5,119,685| 4,978,695|140,990|
| idwiki| 944,783| 931,481| 13,302|
| ptwiki| 699,468| 683,154| 16,314|
| ruwiki| 3,897,698| 3,842,851| 54,847|
===Observations===
While this iteration has improved the filter, it still misses other sections with standard lists/tables that should be excluded as well, typically because the list or table is preceded by leading text, links, or templates.
See the following examples: [1](https://en.wikipedia.org/w/index.php?oldid=1135030799#Player_records), [2](https://pt.wikipedia.org/w/index.php?oldid=64983782#Deputados_estaduais_eleitos), [3](https://fr.wikipedia.org/w/index.php?oldid=195819049#Par_%C3%89tat), [4](https://en.wikipedia.org/w/index.php?oldid=1128988322#SH_52), [5](https://es.wikipedia.org/w/index.php?oldid=144392198#Resultados).
I propose to take them all into account by just excluding **all** sections that **contain at least one** list or table.
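
A minimal sketch of the proposed rule, assuming sections are available as `(title, wikitext)` pairs. The function names and regexes are illustrative assumptions; in particular, any line starting with `#` or `*` counts as a list item, which would also match lines such as `#REDIRECT`.

```python
import re

# Assumption: a table opens with "{|" at the start of a line (leading whitespace allowed).
TABLE_START_RE = re.compile(r"^\s*\{\|", re.MULTILINE)
# Assumption: list items start with "*" (unordered) or "#" (ordered).
LIST_ITEM_RE = re.compile(r"^[*#]", re.MULTILINE)


def contains_list_or_table(wikitext: str) -> bool:
    """Return True if the section body has at least one list item or table."""
    return bool(TABLE_START_RE.search(wikitext) or LIST_ITEM_RE.search(wikitext))


def keep_sections(sections):
    """Keep only (title, wikitext) pairs whose body has no list or table."""
    return [(title, text) for title, text in sections if not contains_list_or_table(text)]
```

This rule is much stricter than the previous iterations: a section with several paragraphs and one short list would be dropped as well, so the trade-off is worth checking against the sample.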