
{Machine Readability}{lists} Improve List Parsing in Structured Contents
Open, Medium · Public · 5 Estimated Story Points

Description

Improve the parsing of lists in Structured Contents. The current parser struggles with nested lists, lists inside templates or infoboxes, and hybrid structures, which leads to incomplete or inaccurate JSON output.

To Do
[x] Review existing list parser and document findings
[x] Refactor parser to correctly handle:
    • Unordered, ordered, and definition lists
    • Nested / hierarchical lists
    • Lists inside infoboxes
  • Integrate improved lists into Structured Content On-demand dev
  • Add a flag to enable/disable the improved list parsing
  • QA in dev and validate against representative articles (some examples in PRD)

Acceptance Criteria

  • Lists are represented as hierarchical JSON structures (no flattening)
  • Lists inside infoboxes are included in structured output
  • The empty-lists bug is fixed
  • Known list-related parsing issues from prior feedback are resolved
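
To illustrate the "no flattening" criterion, here is a minimal sketch of what a hierarchical representation could look like. It assumes the parser works on rendered HTML and uses Python's standard `html.parser`. The function `parse_lists` and the field names `type`, `items`, `text`, and `sublists` are hypothetical and are not the Structured Contents schema. Definition lists and infobox handling are left out.

```python
from html.parser import HTMLParser

# Hypothetical mapping from HTML list tags to a "type" label in the output.
LIST_TAGS = {"ul": "unordered", "ol": "ordered"}


class ListTreeParser(HTMLParser):
    """Collects <ul>/<ol> lists into nested dicts instead of a flat item array."""

    def __init__(self):
        super().__init__()
        self.roots = []  # top-level lists found in the document
        self.stack = []  # currently open list and item nodes, outermost first

    def _pop(self):
        node = self.stack.pop()
        if "text" in node:  # item node: collapse runs of whitespace
            node["text"] = " ".join(node["text"].split())
        return node

    def handle_starttag(self, tag, attrs):
        if tag in LIST_TAGS:
            node = {"type": LIST_TAGS[tag], "items": []}
            parent = self.stack[-1] if self.stack else None
            if parent is not None and "sublists" in parent:
                parent["sublists"].append(node)  # nested under the open <li>
            else:
                self.roots.append(node)  # lists outside any <li> are top-level
            self.stack.append(node)
        elif tag == "li" and self.stack:
            if "text" in self.stack[-1]:  # previous <li> had no closing tag
                self._pop()
            item = {"text": "", "sublists": []}
            self.stack[-1]["items"].append(item)
            self.stack.append(item)

    def handle_endtag(self, tag):
        if tag == "li" and self.stack and "text" in self.stack[-1]:
            self._pop()
        elif tag in LIST_TAGS:
            # Pop through any unclosed <li> down to the list being closed.
            while self.stack:
                if "items" in self._pop():
                    break

    def handle_data(self, data):
        # Text only counts when the innermost open node is a list item.
        if self.stack and "text" in self.stack[-1]:
            self.stack[-1]["text"] += data


def parse_lists(html):
    """Return every top-level list in `html` as a nested structure."""
    parser = ListTreeParser()
    parser.feed(html)
    parser.close()
    return parser.roots
```

With this shape, an input such as `<ul><li>Fruit<ul><li>Apple</li></ul></li></ul>` keeps "Apple" inside the `sublists` of "Fruit" rather than as a sibling of it, and the result can be serialized directly with `json.dumps`.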

Resources

  • More info in PRD

Event Timeline

JArguello-WMF renamed this task from Feasibility of releasing parsed lists to {ists} Feasibility of releasing parsed lists. (Feb 6 2025, 4:13 PM)
REsquito-WMF renamed this task from {ists} Feasibility of releasing parsed lists to {Machine Readability}{lists} Feasibility of releasing parsed lists. (Feb 20 2025, 8:48 AM)
SDelbecque-WMF renamed this task from {Machine Readability}{lists} Feasibility of releasing parsed lists to {Machine Readability}{lists} Improve List Parsing in Structured Contents. (Jan 14 2026, 8:02 AM)
SDelbecque-WMF updated the task description.
JArguello-WMF lowered the priority of this task from High to Medium. (Jan 14 2026, 9:06 PM)
E.Enabulele updated Other Assignee, removed: E.Enabulele.

Creating investigation doc with findings

This ticket will be resolved when we mark it as done in Asana; in the meantime, it will live here as well.