Improve the parsing of lists in Structured Contents. The current parser struggles with nested lists, lists inside templates/infoboxes, and hybrid structures, leading to incomplete or inaccurate JSON output.
To Do
[x] Review existing list parser and document findings
[x] Refactor parser to correctly handle:
- Unordered, ordered, and definition lists
- Nested / hierarchical lists
- Lists inside infoboxes
- Integrate improved lists into Structured Content On-demand dev
- Add a flag to enable/disable the improved list parsing
- QA in dev and validate against representative articles (some examples in PRD)
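One possible shape for the enable/disable flag, as a minimal sketch only. The environment variable name `SC_IMPROVED_LISTS` and the function `select_list_parser` are hypothetical and not taken from the existing codebase; the real flag may well live in service config rather than the environment.

```python
import os

# Values treated as "on" for the hypothetical flag.
_TRUTHY = {"1", "true", "yes", "on"}


def select_list_parser(env=None):
    """Return which list parser to use: "improved" or "legacy".

    `env` is any mapping of settings; it defaults to the process
    environment. SC_IMPROVED_LISTS is an assumed name for illustration.
    """
    env = os.environ if env is None else env
    raw = env.get("SC_IMPROVED_LISTS", "false")
    return "improved" if raw.strip().lower() in _TRUTHY else "legacy"
```

Defaulting to the legacy parser when the flag is unset keeps the dev rollout opt-in while QA is in progress.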
Acceptance Criteria
- Lists are represented as hierarchical JSON structures (no flattening)
- Lists inside infoboxes are included in structured output
- The empty-list bug is fixed
- Known list-related parsing issues from prior feedback are resolved
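To illustrate what "hierarchical, no flattening" could look like, here is a rough sketch that nests list items by depth. It assumes wikitext-style input where a run of `*` (unordered) or `#` (ordered) markers gives the nesting level; the function name `build_list_tree` and the output keys `type`, `text`, and `items` are illustrative assumptions, not the actual Structured Contents schema. Definition lists are left out of this sketch.

```python
import json


def build_list_tree(lines):
    """Nest wikitext-style list lines into a tree of dicts.

    Each node is {"type", "text", "items"}; children go in "items".
    The key names are assumptions for illustration only.
    """
    root = {"items": []}
    # Stack of (depth, node); the root sits at depth 0 and is never popped.
    stack = [(0, root)]
    for line in lines:
        marker = line[: len(line) - len(line.lstrip("*#"))]
        if not marker:
            continue  # not a list line; ignored in this sketch
        depth = len(marker)
        node = {
            "type": "ordered" if marker[-1] == "#" else "unordered",
            "text": line[depth:].strip(),
            "items": [],
        }
        # Pop back until the top of the stack is a shallower ancestor.
        while stack[-1][0] >= depth:
            stack.pop()
        stack[-1][1]["items"].append(node)
        stack.append((depth, node))
    return root["items"]


if __name__ == "__main__":
    sample = ["* Fruit", "** Apple", "** Pear", "* Vegetables", "## Carrot"]
    print(json.dumps(build_list_tree(sample), indent=2))
```

In this sketch "Apple" and "Pear" end up under the `items` of "Fruit" instead of appearing as siblings of it, which is the property the first acceptance criterion asks for.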
Resources
- More info in PRD