User Story: “As a developer, I want a fast e2e test framework for structured content,
so that I can run CI/CD smoke tests post-deployment to see whether a new deployment introduces breaking changes in our API.”
Acceptance criteria
- Use a headless browser (Playwright) to load a Wikipedia article and take a screenshot of the abstract and of the infobox
- Return a score showing the similarity of expected vs observed text, as a percentage, one for the abstract and one for the infobox
ToDo
- Use a headless browser (Playwright) to load a Wikipedia article and take a screenshot of the abstract and of the infobox
- Inside the infobox, OCR each <tr> row and save the text of each infobox row
- Use Tesseract.js to OCR the screenshots and extract the pixel coordinates of the text
- Save all the "expected" text as metadata in variables
- Call our structured contents API and download the JSON
- Read the JSON content for Abstract and compare it to the expected text
- Read the JSON for each row of the infobox and compare it to the expected text for each row
- Return a score showing the similarity of expected vs observed text, as a percentage, one for the abstract and one for the infobox
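The scoring step could look something like the sketch below. It is only an illustration, not a chosen design: it assumes character-level Levenshtein distance as the similarity measure, pairs infobox rows by index, and counts a missing row as 0%. The names `ArticleText`, `SimilarityReport`, `similarityPercent` and `scoreArticle` are hypothetical, and the expected (OCR) and observed (API JSON) text is assumed to have been extracted already.

```typescript
// Hypothetical shapes: the text for one article, and the report we return.
interface ArticleText {
  abstract: string;
  infoboxRows: string[];
}

interface SimilarityReport {
  abstract: number; // percentage, 0 to 100
  infobox: number; // percentage, 0 to 100, averaged over rows
}

// Levenshtein edit distance, keeping only two rows of the DP table.
function levenshtein(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// 100 means identical strings; 0 means every character had to change.
function similarityPercent(expected: string, observed: string): number {
  const longest = Math.max(expected.length, observed.length);
  if (longest === 0) return 100; // two empty strings are treated as a match
  return (1 - levenshtein(expected, observed) / longest) * 100;
}

// Assumption: rows are compared by position; a row missing on either side scores 0.
function scoreArticle(expected: ArticleText, observed: ArticleText): SimilarityReport {
  const rowCount = Math.max(expected.infoboxRows.length, observed.infoboxRows.length);
  let total = 0;
  for (let i = 0; i < rowCount; i++) {
    total += similarityPercent(expected.infoboxRows[i] ?? "", observed.infoboxRows[i] ?? "");
  }
  return {
    abstract: similarityPercent(expected.abstract, observed.abstract),
    infobox: rowCount === 0 ? 100 : total / rowCount,
  };
}
```

A smoke test could then assert that both percentages stay above a threshold agreed by the team. Pairing rows by index is the simplest option; matching rows by their label would be more robust if the API and the rendered page order rows differently.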
Test Strategy
This is an e2e test
Checklist for testing
- Visually compare the screenshot snippets with the OCRed text
- Manually compare the expected and observed text, and check the result against what we see in the static test
- The e2e test should run as part of our existing e2e automation tests
Things to consider:
- Accuracy of OCR text from images
- Fuzzy matching of words between expected and observed text: we'll have issues with spacing, hyphens, new lines, extra symbols and unicode characters
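One way to soften the spacing, hyphen, new-line and unicode issues is to normalize both texts before comparing them. The sketch below is a starting point under stated assumptions, not a settled approach: it assumes comparison should be case-insensitive, that all dash variants can be treated as a plain hyphen, and that a word-level match rate is an acceptable fuzzy score. `normalizeForComparison` and `wordMatchPercent` are hypothetical names.

```typescript
// Bring OCR text and API text into a common form before comparing.
function normalizeForComparison(text: string): string {
  return text
    .normalize("NFKC") // fold compatibility forms, e.g. ligatures and non-breaking spaces
    .replace(/[\u2010-\u2015\u2212]/g, "-") // unicode hyphens, dashes and minus sign
    .replace(/[\u2018\u2019]/g, "'") // curly single quotes
    .replace(/[\u201C\u201D]/g, '"') // curly double quotes
    .replace(/\s+/g, " ") // collapse new lines, tabs and repeated spaces
    .trim()
    .toLowerCase(); // assumption: case differences should not lower the score
}

// Percentage of expected words that also appear in the observed text.
function wordMatchPercent(expected: string, observed: string): number {
  const expectedWords = normalizeForComparison(expected).split(" ").filter(Boolean);
  const observedWords = new Set(normalizeForComparison(observed).split(" ").filter(Boolean));
  if (expectedWords.length === 0) return 100; // nothing expected, nothing missing
  const hits = expectedWords.filter((word) => observedWords.has(word)).length;
  return (hits / expectedWords.length) * 100;
}
```

Word-level matching tolerates reordering and line wrapping, but a single OCR misread still fails the whole word, so it may be worth combining with a character-level measure for short infobox rows.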