Structured Contents currently includes only lead and infobox images. Expand it to also include images that appear lower in the article, inside sections, while preserving each image's structured metadata and its context (the section it appears in).
To Do
- Review current image parser
- Review Parsoid HTML patterns for inline images and galleries
- Expand WME schema where needed
- Implement parser to extract images from article body and capture structured metadata: source URL, caption, alt text, width/height, image type if available, and preserve context (inside sections, on paragraph level)
- Integrate extracted images into Structured Content On-demand dev
- Add "name", "media_type", and "identifier" fields in dev
- Retroactively add enable/disable flag
- QA in dev
- Manual QA pass: identify a subset of ~100 articles with complex layouts to validate our selectors
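A rough sketch of the extraction step in the list above, for discussion only. It assumes body images arrive as Parsoid-style `<figure typeof="mw:File/...">` elements with an `<img>` and optional `<figcaption>`, and that section context can be read from the most recent heading. The class name, function name, output field names, and sample HTML are all hypothetical and are not the WME schema; galleries and inline `<span typeof="mw:File">` images are not handled here.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h2", "h3", "h4", "h5", "h6"}


def _to_int(value):
    """Convert an attribute value to int, or None if absent or non-numeric."""
    return int(value) if value and value.isdigit() else None


class BodyImageParser(HTMLParser):
    """Collects <figure> images with the enclosing section heading (hypothetical sketch)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.images = []
        self.section = None        # text of the most recent heading, None in the lead
        self.paragraphs_seen = 0   # <p> tags seen since the last heading
        self._heading = None       # text buffer while inside a heading
        self._figure = None        # record being built while inside a <figure>
        self._caption = None       # text buffer while inside a <figcaption>

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag in HEADING_TAGS:
            self._heading = []
        elif tag == "p":
            self.paragraphs_seen += 1
        elif tag == "figure":
            self._figure = {
                "section": self.section,
                "preceding_paragraphs": self.paragraphs_seen,
                "image_type": a.get("typeof"),
                "source_url": None, "alt_text": None,
                "width": None, "height": None, "caption": None,
            }
        elif tag == "figcaption" and self._figure is not None:
            self._caption = []
        elif tag == "img" and self._figure is not None:
            self._figure.update(
                source_url=a.get("src"),
                alt_text=a.get("alt"),
                width=_to_int(a.get("width")),
                height=_to_int(a.get("height")),
            )

    def handle_data(self, data):
        if self._heading is not None:
            self._heading.append(data)
        elif self._caption is not None:
            self._caption.append(data)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._heading is not None:
            self.section = "".join(self._heading).strip()
            self._heading = None
            self.paragraphs_seen = 0   # paragraph count restarts per section
        elif tag == "figcaption" and self._caption is not None:
            self._figure["caption"] = "".join(self._caption).strip() or None
            self._caption = None
        elif tag == "figure" and self._figure is not None:
            self.images.append(self._figure)
            self._figure = None


def extract_body_images(html):
    """Return a list of image records found in <figure> elements."""
    parser = BodyImageParser()
    parser.feed(html)
    parser.close()
    return parser.images


# Made-up sample input shaped like Parsoid output; not taken from a real article.
SAMPLE_HTML = """
<section data-mw-section-id="0"><p>Lead paragraph.</p></section>
<section data-mw-section-id="1"><h2 id="History">History</h2>
<p>Some text.</p>
<figure typeof="mw:File/Thumb"><a href="./File:Example.jpg" class="mw-file-description"><img resource="./File:Example.jpg" src="//upload.wikimedia.org/example/220px-Example.jpg" width="220" height="165" alt="An example"/></a><figcaption>An <b>example</b> caption</figcaption></figure>
</section>
"""

images = extract_body_images(SAMPLE_HTML)
for image in images:
    print(image)
```

Keeping the section and paragraph counters on the parser itself means each image record carries its context without a second pass over the document; the real selectors still need to be validated against the complex-layout articles in the QA subset.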
Acceptance Criteria
- Inline images appear in WME feeds similarly to lead/infobox images.
- Metadata fields are populated for the extracted images where available
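One possible way to check the second criterion during QA, as a sketch only: count, per field, how many extracted image records are missing a value. The field names and the record shape are assumptions for illustration and do not reflect the final WME schema.

```python
# Hypothetical field names; adjust to whatever the schema ends up using.
REQUIRED_FIELDS = ("source_url", "caption", "alt_text", "width", "height")


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required field names that are absent or empty in one image record."""
    return [field for field in required if record.get(field) in (None, "")]


def summarize_missing(records):
    """Count, per field, how many records are missing that field."""
    counts = {}
    for record in records:
        for field in missing_fields(record):
            counts[field] = counts.get(field, 0) + 1
    return counts


# Made-up records for illustration.
sample_records = [
    {"source_url": "//example.org/a.jpg", "caption": "A", "alt_text": "",
     "width": 220, "height": 165},
    {"source_url": "//example.org/b.jpg", "caption": None, "alt_text": "B",
     "width": 100},
]

summary = summarize_missing(sample_records)
print(summary)
```

Because the criterion says "where available", a non-zero count is a prompt to look at the source article, not automatically a failure.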
Resources
- More info in PRD