
[Spike] Write a static e2e test to evaluate data quality of structured-contents API {Timebox: 5 days}
Closed, Resolved
Public

Description

User Story: “As a developer, I want a fast e2e test framework for structured content,
so that I can run CI/CD smoke tests post-deployment to see whether a new deployment breaks anything in our API.”

Acceptance criteria

  1. Use a headless browser (Playwright) to load a Wikipedia article and take a screenshot of the abstract and infobox
  2. Return a score showing the similarity of expected vs. observed text, as a percentage each for the abstract and the infobox

ToDo

  • Use a headless browser (Playwright) to load a Wikipedia article and take a screenshot of the abstract and infobox
  • Inside the infobox, OCR each <tr> element and save the text of each infobox row
  • Use Tesseract.js to OCR the screenshots and extract the pixel coordinates of the text
  • Save all the "expected" text as metadata in variables
  • Call our structured contents API and download the JSON
  • Read the JSON content for Abstract and compare it to the expected text
  • Read the JSON for each row of the infobox and compare it to the expected text for each row
  • Return a score showing the similarity of expected vs. observed text, as a percentage each for the abstract and the infobox
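
The last three steps above (compare the abstract, compare each infobox row, return two percentages) could be sketched roughly as follows. This is only an illustration: the `SectionText` shape, the `scoreReport` and `wordOverlap` names, and the idea of pairing rows by index are assumptions, not the actual structured-contents schema or the team's implementation. The similarity function is injectable so that a stricter metric can be swapped in later.

```typescript
// Hypothetical shape: the real structured-contents JSON schema is not shown in this task.
interface SectionText {
  abstract: string;
  infoboxRows: string[];
}

// A similarity function returns a value between 0 (no match) and 1 (full match).
type Similarity = (expected: string, observed: string) => number;

// Placeholder metric: the share of expected words that also occur in the observed text.
const wordOverlap: Similarity = (expected, observed) => {
  const words = (s: string) => s.toLowerCase().split(/\s+/).filter(Boolean);
  const want = words(expected);
  if (want.length === 0) return 1; // nothing expected, nothing missing
  const have = new Set(words(observed));
  return want.filter((w) => have.has(w)).length / want.length;
};

function scoreReport(
  expected: SectionText,
  observed: SectionText,
  sim: Similarity = wordOverlap,
): { abstractPercent: number; infoboxPercent: number } {
  // Assumption: expected and observed rows are in the same order, so they are paired by index.
  // A row missing from the API response is compared against the empty string.
  const rowScores = expected.infoboxRows.map((row, i) =>
    sim(row, observed.infoboxRows[i] ?? ""),
  );
  const infobox =
    rowScores.length === 0
      ? 1
      : rowScores.reduce((sum, score) => sum + score, 0) / rowScores.length;
  return {
    abstractPercent: Math.round(sim(expected.abstract, observed.abstract) * 100),
    infoboxPercent: Math.round(infobox * 100),
  };
}
```

Averaging the per-row scores weights every infobox row equally regardless of its length; weighting by character count would be an equally defensible choice.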

Test Strategy

This is an e2e test.

Checklist for testing

  • visually compare the screenshot snippets with the OCRed text
  • manually compare the expected and observed text, and check the result against the score reported by the static test
  • The e2e test should run as part of our existing e2e automation tests

Things to consider:
  • Accuracy of OCR text from images
  • Fuzzy matching of words between expected and observed text: expect issues with spacing, hyphens, newlines, extra symbols, and Unicode characters
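
One possible way to handle the fuzzy-matching concerns above is to normalize both texts before comparing them, then turn an edit distance into a percentage. The sketch below is an assumption about how this could be done, not a method prescribed by the task: the `normalize`, `levenshtein`, and `similarityPercent` names are hypothetical, and the punctuation list is deliberately small and would likely need tuning against real OCR output.

```typescript
// Reduce both texts to a comparable form so that only real content differences remain.
function normalize(text: string): string {
  return text
    .normalize("NFKC") // fold compatibility characters (ligatures, non-breaking spaces)
    .replace(/-\s*\n\s*/g, "") // rejoin words hyphenated across a line break
    .replace(/[\u2010-\u2015\u2212]/g, "-") // map Unicode dashes and the minus sign to "-"
    .replace(/[.,;:!?"'()\[\]{}“”‘’«»…•·*]/g, "") // drop common punctuation and symbols
    .replace(/\s+/g, " ") // collapse runs of whitespace, including newlines
    .trim()
    .toLowerCase();
}

// Classic Levenshtein edit distance, computed with two rolling rows.
function levenshtein(a: string, b: string): number {
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Similarity as a percentage: 100 means identical after normalization.
function similarityPercent(expected: string, observed: string): number {
  const e = normalize(expected);
  const o = normalize(observed);
  const longest = Math.max(e.length, o.length);
  if (longest === 0) return 100; // two empty texts count as a perfect match
  return (1 - levenshtein(e, o) / longest) * 100;
}
```

Note that rejoining hyphenated line breaks also merges genuinely hyphenated compounds that happen to wrap at the hyphen, so a real implementation may want to treat that rule as optional.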

Event Timeline

Protsack.stephan renamed this task from [Spike] Write a static e2e test to evaluate data quality of structured-contents API to [Spike] Write a static e2e test to evaluate data quality of structured-contents API {Timebox: 5 days}. Jan 10 2024, 3:29 PM
JArguello-WMF claimed this task.
JArguello-WMF subscribed.

Superseded by the Metrics Framework.