T323505 ([L] Exclude sections-tables from having section topics) filters out sections that contain standard tables or lists without leading text.
By standard we mean tables and lists rendered with base wikitext markup, i.e., starting with {|, #, and * for tables, ordered lists, and unordered lists respectively.
However, this leaves behind:
- sections with some text, then table(s) or list(s)
- a tail of non-standard tables and lists, which includes but is not limited to specific templates
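
To illustrate the standard check and the first case it leaves behind, here is a minimal sketch. The function name and the sample inputs are hypothetical and are not taken from the pipeline's code; the sketch only assumes that a section is flagged when its first non-blank line starts with one of the markers listed above.

```python
# Markers that open a standard table or list in wikitext.
STANDARD_MARKERS = ("{|", "#", "*")


def starts_with_table_or_list(section_wikitext: str) -> bool:
    """Return True if the first non-blank line opens a standard table or list.

    Hypothetical helper for illustration, not the pipeline's implementation.
    """
    for line in section_wikitext.splitlines():
        if line.strip():
            # str.startswith accepts a tuple of prefixes.
            return line.startswith(STANDARD_MARKERS)
    return False  # empty section


only_list = "* first item\n* second item"
text_then_table = "Some leading text.\n{|\n| cell\n|}"

print(starts_with_table_or_list(only_list))        # True
print(starts_with_table_or_list(text_then_table))  # False: the table is missed
```

The second call shows why sections with some text followed by a table slip through a check that only looks at how the section starts.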
The plan is to use Enterprise HTML dumps and mwparserfromhtml as an effective way to detect all kinds of tables and lists. Since this would require a rewrite of the parsing logic, which is currently bound to wikitext markup, we will do this in a separate script.
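
As a rough illustration of why HTML makes detection easier, the sketch below looks for table and list elements in the HTML of a single section. It uses Python's standard html.parser as a stand-in and does not show the mwparserfromhtml API; the class and function names are hypothetical, and it assumes that the section's HTML has already been isolated.

```python
from html.parser import HTMLParser

# Elements that any table or ordered/unordered list ends up as in HTML,
# regardless of whether it came from base markup or from a template.
TABLE_OR_LIST_TAGS = {"table", "ul", "ol"}


class TableOrListFinder(HTMLParser):
    """Record whether a table or list element was seen (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag in TABLE_OR_LIST_TAGS:
            self.found = True


def has_table_or_list(section_html: str) -> bool:
    finder = TableOrListFinder()
    finder.feed(section_html)
    finder.close()
    return finder.found


print(has_table_or_list("<p>Intro.</p><table><tr><td>1</td></tr></table>"))  # True
print(has_table_or_list("<p>Only prose.</p>"))                               # False
```

Unlike the wikitext check, this does not depend on where the table or list appears in the section or on which markup produced it.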
Acceptance Criteria:
- Create a script to extract and store section titles that have lists or tables from HTML dumps
- Use the stored table to filter the SLIS pipeline
- Ensure that the table exclusion logic is *optional* in the code so that it can be removed later for use cases in which we want section topics for sections with tables.
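
One way to keep the exclusion optional is to make the stored table an optional input, so that omitting it leaves the data untouched. The sketch below is a plain-Python illustration under that assumption; the function name, the row shape, and the key layout are hypothetical and do not reflect the pipeline's actual code.

```python
from typing import Iterable, List, Optional, Set, Tuple

# Hypothetical key identifying a section: (wiki, page title, section title).
SectionKey = Tuple[str, str, str]


def filter_sections(
    sections: Iterable[SectionKey],
    sections_with_tables: Optional[Set[SectionKey]] = None,
) -> List[SectionKey]:
    """Drop sections listed in sections_with_tables; keep everything if it is None."""
    if sections_with_tables is None:
        # Table filter disabled: pass all sections through.
        return list(sections)
    return [s for s in sections if s not in sections_with_tables]


sections = [
    ("eswiki", "Madrid", "Historia"),
    ("eswiki", "Madrid", "Demografía"),
]
stored = {("eswiki", "Madrid", "Demografía")}

print(filter_sections(sections))          # filter off: both sections kept
print(filter_sections(sections, stored))  # filter on: only "Historia" kept
```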
Update:
- snapshot: 2023-03-06
- total rows at a46b40fa (includes dates and media topics filter): 1.18 B (1,183,740,528)
- total rows at b482a93e (includes ar, cs, es, id, pt, ru table filter and section length filter): 255 M (255,016,670)
- difference: 929 M (928,723,858)
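
The difference can be checked directly from the two row counts above. The variable names are only for illustration.

```python
rows_before = 1_183_740_528  # total rows at a46b40fa
rows_after = 255_016_670     # total rows at b482a93e

removed = rows_before - rows_after
print(f"{removed:,}")  # 928,723,858

# Share of rows removed between the two commits.
print(f"{removed / rows_before:.1%}")
```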
More details on the table filter at https://gitlab.wikimedia.org/repos/structured-data/section-topics/-/merge_requests/21#note_22935 and https://gitlab.wikimedia.org/repos/structured-data/section-topics/-/merge_requests/21#note_22936