**Context**
Based on the [[ https://docs.google.com/spreadsheets/d/1typAKEG8yE6H3uqITbSTA4KXnzATB3EHMP70P4c4MEU/edit#gid=1801906281 | section topics sample ]] for evaluation, many sections are just tables with lists of blue links (sporting events, names, etc.). Topics created from these blue links are not meaningful.
**AC**
Do not generate topics for sections that have:
- [x] only tables without textual context
- [x] only unordered or ordered lists without textual content
- [x] generate one-off approximate stats on the number of excluded sections and articles per wiki of focus and share with product @AUgolnikova-WMF
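
The exclusion rule above can be illustrated with a minimal sketch. This is not the pipeline's actual code: `is_only_tables_or_lists` and both regular expressions are hypothetical, and they assume the section body is available as raw wikitext. The heuristic strips tables and list items and checks whether anything is left. It ignores nested tables, definition lists, and templates.

```python
import re

# Hypothetical patterns, for illustration only.
# A wikitext table runs from "{|" to "|}" (nested tables are not handled).
TABLE_RE = re.compile(r"\{\|.*?\|\}", re.DOTALL)
# Unordered ("*") and ordered ("#") list items start at the beginning of a line.
LIST_ITEM_RE = re.compile(r"^[*#].*$", re.MULTILINE)


def is_only_tables_or_lists(section_wikitext: str) -> bool:
    """Return True if the section holds tables and/or list items and nothing else."""
    without_tables, n_tables = TABLE_RE.subn("", section_wikitext)
    remainder, n_items = LIST_ITEM_RE.subn("", without_tables)
    # Exclude only when something was stripped and no other text remains.
    return (n_tables + n_items) > 0 and not remainder.strip()


if __name__ == "__main__":
    print(is_only_tables_or_lists("* [[A]]\n* [[B]]\n"))
    print(is_only_tables_or_lists("Intro text.\n* [[A]]\n"))
```

In this sketch a subsection heading such as `=== Sub ===` counts as remaining text, so a top-level section that wraps its list in a subsection would be kept.
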
==Update==
- Snapshot: `2022-12-19`
- total rows before: **1.4 B** (1,395,136,427)
- total rows now: **1.3 B** (1,260,350,442)
| wiki | total sections | excluded sections
| ----- | ----- | -----
| ar | 2,015,385 | 124,791
| bn | 298,255 | 21,450
| cs | 1,220,022 | 139,434
| es | 4,253,439 | 380,613
| id | 1,032,808 | 81,749
| pt | 612,413 | 69,272
| ru | 4,433,197 | 539,602
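
To compare wikis, the counts in the table can be turned into exclusion rates. The snippet below is only a sketch: the figures are copied from the table, while `SECTION_STATS` and `exclusion_rates` are hypothetical names used for illustration.

```python
# Per-wiki (total sections, excluded sections), copied from the table above.
SECTION_STATS = {
    "ar": (2_015_385, 124_791),
    "bn": (298_255, 21_450),
    "cs": (1_220_022, 139_434),
    "es": (4_253_439, 380_613),
    "id": (1_032_808, 81_749),
    "pt": (612_413, 69_272),
    "ru": (4_433_197, 539_602),
}


def exclusion_rates(stats: dict[str, tuple[int, int]]) -> dict[str, float]:
    """Return the fraction of excluded sections for each wiki."""
    return {wiki: excluded / total for wiki, (total, excluded) in stats.items()}


if __name__ == "__main__":
    rates = exclusion_rates(SECTION_STATS)
    # Print wikis from the highest to the lowest exclusion rate.
    for wiki, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{wiki}: {rate:.1%} of sections excluded")
```
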
NOTE: the solution only processes sections. If we want to exclude **articles**, we should use the Wikidata ontology, similarly to [image suggestions](https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/blob/5c4a2ba7c37fcb9e06172396e8bf082affbfa774/image_suggestions/cassandra.py#L39).
NOTE: we currently extract **top-level** sections only, so lists or tables contained in subsections will **not** be excluded ([example](https://it.wikipedia.org/wiki/Paul_McGann#Filmografia_parziale)), because subsections count as textual content. We may adjust this behavior by extracting lower-level sections, too.