
{Machine Readability}{Images} Parsed Images in article sections for Structured Contents
Closed, Resolved | Public | 5 Estimated Story Points

Description

Structured Contents currently includes only lead and infobox images. Expand it to also include images that appear lower in the article, inside sections, while preserving each image's structured metadata and its context (the section it appears in).

To Do

  • Review current image parser
  • Review Parsoid HTML patterns for inline images and galleries
  • Expand WME schema where needed
  • Implement a parser to extract images from the article body and capture structured metadata (source URL, caption, alt text, width/height, and image type if available) while preserving context (inside sections, at paragraph level)
  • Integrate extracted images into Structured Content On-demand dev
  • Add "name", "media_type", and "identifier" in dev
  • Retroactively add enable/disable flag
  • QA in dev
    • Manual QA pass: identify a subset of ~100 articles with complex layouts to validate our selectors.
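
The parser step above can be sketched roughly as follows. This is a minimal illustration, not the WME implementation: it assumes Parsoid-style markup in which sections are `<section data-mw-section-id="…">` elements and block images are `<figure>` elements wrapping an `<img>` and a `<figcaption>`. The `SectionImage` record, its field names, and `extract_section_images` are hypothetical. Galleries and inline (non-figure) images are not handled here.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import List, Optional


@dataclass
class SectionImage:
    # Hypothetical record; field names are illustrative, not the WME schema.
    section_id: Optional[str]
    source_url: str
    alt_text: Optional[str]
    width: Optional[int]
    height: Optional[int]
    caption: str = ""


def _to_int(value: Optional[str]) -> Optional[int]:
    """Convert a numeric attribute value to int, or return None."""
    return int(value) if value and value.isdigit() else None


class SectionImageParser(HTMLParser):
    """Collect <img> elements inside <figure>, tagged with the enclosing <section>."""

    def __init__(self) -> None:
        super().__init__()
        self.images: List[SectionImage] = []
        self._sections: List[Optional[str]] = []  # stack of open section ids
        self._current: Optional[SectionImage] = None  # image of the open <figure>
        self._in_figure = False
        self._in_caption = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "section":
            # Assumed attribute name for the section identifier.
            self._sections.append(a.get("data-mw-section-id"))
        elif tag == "figure":
            self._in_figure = True
        elif tag == "img" and self._in_figure and a.get("src"):
            self._current = SectionImage(
                section_id=self._sections[-1] if self._sections else None,
                source_url=a["src"],
                alt_text=a.get("alt"),
                width=_to_int(a.get("width")),
                height=_to_int(a.get("height")),
            )
        elif tag == "figcaption" and self._in_figure:
            self._in_caption = True

    def handle_endtag(self, tag):
        if tag == "section" and self._sections:
            self._sections.pop()
        elif tag == "figcaption":
            self._in_caption = False
        elif tag == "figure":
            # Emit the image once its figure closes, so the caption is complete.
            if self._current is not None:
                self._current.caption = self._current.caption.strip()
                self.images.append(self._current)
            self._current = None
            self._in_figure = False

    def handle_data(self, data):
        # Caption text may be split across nested inline tags; concatenate it.
        if self._in_caption and self._current is not None:
            self._current.caption += data


def extract_section_images(html: str) -> List[SectionImage]:
    parser = SectionImageParser()
    parser.feed(html)
    parser.close()
    return parser.images


# Made-up sample input for illustration only.
SAMPLE_HTML = """
<section data-mw-section-id="0"><p>Lead text.</p></section>
<section data-mw-section-id="1"><h2>History</h2>
<figure typeof="mw:File/Thumb"><a href="./File:Example.jpg"><img src="//upload.example.org/Example.jpg" alt="An example" width="220" height="165"/></a><figcaption>Example <i>caption</i></figcaption></figure>
</section>
"""

for image in extract_section_images(SAMPLE_HTML):
    print(image)
```

Tracking sections with a stack keeps each image tied to its innermost enclosing section, which is one way to preserve the "inside sections" context the task asks for.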

Acceptance Criteria

  • Inline images appear in WME feeds similarly to lead/infobox images.
  • Metadata fields are populated for the extracted images where available
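
One way to check the second criterion during QA is to report which metadata fields are empty per image and how often each field is populated across a sample. The sketch below is an assumption-laden illustration: the field names in `EXPECTED_FIELDS` and the dict-shaped image records are hypothetical and would need to match the actual WME schema.

```python
from typing import Dict, Iterable, List

# Hypothetical field names, loosely following the metadata listed in the To Do items.
EXPECTED_FIELDS = ("source_url", "caption", "alt_text", "width", "height")


def missing_fields(image: Dict[str, object]) -> List[str]:
    """Return the expected fields that are absent, None, or empty for one image."""
    return [f for f in EXPECTED_FIELDS if image.get(f) in (None, "")]


def field_coverage(images: Iterable[Dict[str, object]]) -> Dict[str, float]:
    """Return, per field, the fraction of images in which that field is populated."""
    records = list(images)
    if not records:
        return {f: 0.0 for f in EXPECTED_FIELDS}
    return {
        f: sum(f not in missing_fields(img) for img in records) / len(records)
        for f in EXPECTED_FIELDS
    }
```

Because fields are only required "where available", low coverage for a field such as `caption` is a prompt for manual review rather than a failure on its own.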

Resources

  • More info in PRD

Event Timeline

SDelbecque-WMF updated the task description.

Reviewing the Parsoid image output and will start on the schema today.

Adding the additional fields and identifier, and removing the flags from infoboxes for images.

LDlulisa-WMF updated the task description.

This ticket will be resolved when we mark it as done in Asana; in the meantime, it will live here.