
[XL] Exclude sections with non-standard tables and lists
Closed, Resolved, Public

Description

T323505 ([L] Exclude sections-tables from having section topics) filters out sections that contain standard tables and lists without leading text.
By standard we mean tables and lists rendered with base wikitext markup, i.e., starting with {| for tables, # for ordered lists, and * for unordered lists.
However, this leaves behind:

  • sections with some text, then table(s) or list(s)
  • a long tail of non-standard tables and lists, including but not limited to those rendered by specific templates
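
To make the two gaps concrete, here is a minimal sketch of a leading-markup check on section wikitext. It only approximates the behavior described above and is not the code from T323505; the helper name `starts_with_standard_markup` and the sample inputs are hypothetical.

```python
import re

# Base wikitext markers: "{|" opens a table, "#" an ordered list item,
# "*" an unordered list item.
STANDARD_MARKUP = re.compile(r"^(\{\||#|\*)")


def starts_with_standard_markup(section_wikitext: str) -> bool:
    """Return True if the first non-blank line opens a standard table or list.

    Hypothetical helper, for illustration only.
    """
    for line in section_wikitext.splitlines():
        stripped = line.strip()
        if stripped:
            return bool(STANDARD_MARKUP.match(stripped))
    return False


# Caught: the section consists of a list with no leading text.
print(starts_with_standard_markup("* first\n* second"))
# Missed: some text, then a list.
print(starts_with_standard_markup("Notable members:\n* first\n* second"))
# Missed: a list rendered through a template instead of base markup.
print(starts_with_standard_markup("{{flatlist|\n* first\n* second\n}}"))
```

The last two inputs correspond to the two bullet points above: a check bound to leading wikitext markup cannot see them.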

The plan is to use Enterprise HTML dumps and mwparserfromhtml as an effective way to detect all kinds of tables and lists. Since this would require a rewrite of the parsing logic, which is currently bound to wikitext markup, we will do this in a separate script.
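
A rough sketch of the HTML-based idea follows. It deliberately uses Python's standard `html.parser` rather than mwparserfromhtml, so that no library API has to be guessed; the class and function names are hypothetical, and treating every `h2` to `h6` heading as a section boundary is a simplifying assumption about the dump HTML.

```python
from html.parser import HTMLParser

TABLE_OR_LIST_TAGS = {"table", "ul", "ol", "dl"}
HEADING_TAGS = {"h2", "h3", "h4", "h5", "h6"}


class SectionTableListFinder(HTMLParser):
    """Collect titles of sections that contain any table or list element."""

    def __init__(self):
        super().__init__()
        self.current_title = None  # None while still in the lead section
        self._in_heading = False
        self._heading_parts = []
        self.flagged = []  # section titles, in document order

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._in_heading = True
            self._heading_parts = []
        elif tag in TABLE_OR_LIST_TAGS and self.current_title is not None:
            if self.current_title not in self.flagged:
                self.flagged.append(self.current_title)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._in_heading:
            self._in_heading = False
            self.current_title = "".join(self._heading_parts).strip()

    def handle_data(self, data):
        if self._in_heading:
            self._heading_parts.append(data)


def sections_with_tables_or_lists(html: str) -> list[str]:
    finder = SectionTableListFinder()
    finder.feed(html)
    finder.close()
    return finder.flagged


sample = (
    "<h2>History</h2><p>Plain prose.</p>"
    "<h2>Awards</h2><p>Some text first.</p><table><tr><td>x</td></tr></table>"
    "<h2>Members</h2><div class='flatlist'><ul><li>a</li></ul></div>"
)
print(sections_with_tables_or_lists(sample))
```

Because templates are already expanded in the HTML, the template-rendered list in the hypothetical "Members" section is detected in the same way as the plain table that follows leading text in "Awards".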

Acceptance Criteria:

  • Create a script to extract and store section titles that have lists or tables from HTML dumps
  • Use the stored table to filter the SLIS pipeline
  • Ensure that the table exclusion logic is *optional* in the code so that it can be removed later for use cases in which we want section topics for sections with tables.
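
The third criterion can be sketched as a filter that is skipped when no exclusion set is supplied. This is plain Python to stay self-contained; the function name, the row layout, and the key shape `(page_id, section_title)` are all hypothetical, and the actual pipeline may well express the same thing as a join against the stored table.

```python
def filter_sections(rows, sections_to_exclude=None):
    """Drop rows whose (page_id, section_title) is in the exclusion set.

    rows: iterable of (page_id, section_title, topic) tuples (assumed layout).
    sections_to_exclude: set of (page_id, section_title) pairs, or None to
    disable the table/list filter entirely.
    """
    if sections_to_exclude is None:
        return list(rows)
    return [row for row in rows if (row[0], row[1]) not in sections_to_exclude]


rows = [
    (1, "History", "Q1"),
    (1, "Awards", "Q2"),
    (2, "Members", "Q3"),
]
excluded = {(1, "Awards"), (2, "Members")}

print(filter_sections(rows, excluded))  # filter enabled
print(filter_sections(rows))            # filter disabled: all rows kept
```

Keeping the exclusion set as an optional argument means the filter can be turned off for use cases that want section topics for sections with tables, without touching the rest of the pipeline.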

Update

  • snapshot: 2023-03-06
  • total rows at a46b40fa (includes dates and media topics filter): 1.18 B (1,183,740,528)
  • total rows at b482a93e (includes ar, cs, es, id, pt, ru table filter and section length filter): 255 M (255,016,670)
  • difference: 929 M (928,723,858)
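
As a quick sanity check of the figures above, the difference and the relative reduction can be recomputed from the two row counts:

```python
rows_before = 1_183_740_528  # total rows at a46b40fa
rows_after = 255_016_670     # total rows at b482a93e

removed = rows_before - rows_after
print(f"{removed:,}")                 # 928,723,858
print(f"{removed / rows_before:.1%}")  # share of rows removed, about 78.5%
```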

More details on the table filter at https://gitlab.wikimedia.org/repos/structured-data/section-topics/-/merge_requests/21#note_22935 and https://gitlab.wikimedia.org/repos/structured-data/section-topics/-/merge_requests/21#note_22936

Event Timeline

The plan here is for @mfossati to check in after spending 1 day on this to define a clear stopping point.

This comment was removed by CBogen.
CBogen renamed this task from Exclude sections with non-standard tables and lists to [XL] Exclude sections with non-standard tables and lists. Mar 8 2023, 5:37 PM
CBogen updated the task description.
mfossati changed the task status from Open to In Progress. Mar 10 2023, 3:39 PM
mfossati claimed this task.
mfossati added a subscriber: matthiasmullie.

Pipeline changes plus major review of the tables detection script at https://gitlab.wikimedia.org/repos/structured-data/section-topics/-/merge_requests/21.
Moving back to code review, CC @matthiasmullie.