
Test structured-contents parsing with the new WMF update in T335512
Closed, Resolved · Public · 3 Estimated Story Points

Assigned To
Authored By
ROdonnell-WMF
Dec 12 2023, 6:42 PM
Referenced Files
F41610922: image.png
Dec 18 2023, 12:21 PM
F41610920: image.png
Dec 18 2023, 12:21 PM
F41610918: image.png
Dec 18 2023, 12:21 PM
F41610915: image.png
Dec 18 2023, 12:21 PM
F41610912: image.png
Dec 18 2023, 12:21 PM
F41610910: image.png
Dec 18 2023, 12:21 PM
F41610907: image.png
Dec 18 2023, 12:21 PM

Description

User Story: “As a developer, I want to run tests on the structured-contents parsing and check that the T335512 WMF changes to the Parsoid version don't break our API features so that I can upgrade the parser logic to Parsoid V2.8.”

Acceptance criteria
Use the WMF branch in T335512 and validate the structured content unit tests

ToDo

  • [x] Change the WMF submodule to use the branch that Prabhat created for Parsoid 2.8
  • [x] Run the snapshot tests and check for breaking changes
  • [x] Update the go-query HTML queries and extract them into a configuration file (ask Ricardo what his preference is for abstracting HTML dependencies)

Test Strategy

Use the normal parser.go snapshot unit tests.
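
For anyone unfamiliar with the snapshot approach, here is a minimal sketch of the idea, not the actual parser.go test code. The ParsedArticle type and both function names are hypothetical stand-ins: the parser output is serialised to JSON and compared byte-for-byte against a stored snapshot, so any change caused by the new Parsoid HTML shows up as a mismatch.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// ParsedArticle is a hypothetical stand-in for the structured-contents
// output; the real type in parser.go will differ.
type ParsedArticle struct {
	Name       string   `json:"name"`
	Abstract   string   `json:"abstract"`
	Categories []string `json:"categories"`
}

// Snapshot serialises the parser output in a stable, indented form so it
// can be stored next to the tests and diffed by eye.
func Snapshot(a ParsedArticle) ([]byte, error) {
	return json.MarshalIndent(a, "", "  ")
}

// MatchesSnapshot returns nil when the current output is byte-identical to
// the stored snapshot (ignoring surrounding whitespace), and an error
// describing the mismatch otherwise.
func MatchesSnapshot(snapshot []byte, got ParsedArticle) error {
	cur, err := Snapshot(got)
	if err != nil {
		return err
	}
	if !bytes.Equal(bytes.TrimSpace(snapshot), cur) {
		return fmt.Errorf("snapshot mismatch for %q:\nwant: %s\ngot:  %s", got.Name, snapshot, cur)
	}
	return nil
}
```

In a real test the snapshot bytes would be read from a file recorded with the old Parsoid HTML, and the ParsedArticle would come from parsing the new HTML.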

Checklist for testing

  • [x] Check for 2.8 breaking changes in the parser.go HTML queries
  • [x] Abstract HTML selection out of code and into config file?
  • [x] Re-run tests until the parser works correctly using Parsoid v2.8 HTML
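
One possible shape for abstracting the HTML selection into a config file is sketched below. Everything here is an assumption for illustration: the Selectors struct, its field names, the JSON layout, and the selector strings themselves are not taken from parser.go, and the preferred format is still to be agreed.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Selectors holds the CSS selectors the parser depends on. The field names
// and the JSON keys are hypothetical.
type Selectors struct {
	Infobox  string `json:"infobox"`
	Section  string `json:"section"`
	Category string `json:"category"`
}

// defaultSelectors is an illustrative config document. The selector values
// are assumptions about the Parsoid HTML, not the ones used in production.
const defaultSelectors = `{
  "infobox": "table.infobox",
  "section": "section[data-mw-section-id]",
  "category": "link[rel='mw:PageProp/Category']"
}`

// LoadSelectors parses a JSON config and rejects it when any selector is
// missing, so a bad config fails at startup rather than during parsing.
func LoadSelectors(raw []byte) (Selectors, error) {
	var s Selectors
	if err := json.Unmarshal(raw, &s); err != nil {
		return s, fmt.Errorf("parse selector config: %w", err)
	}
	if s.Infobox == "" || s.Section == "" || s.Category == "" {
		return s, errors.New("selector config is missing a required entry")
	}
	return s, nil
}
```

With something like this in place, a Parsoid upgrade that renames a class or attribute would only need a config change: the parser would pass s.Infobox and friends to the HTML query calls instead of hard-coded strings.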

Event Timeline

ROdonnell-WMF added a subscriber: JArguello-WMF.

@JArguello-WMF I'm adding this to this Sprint because it's a breaking change and will eventually affect my code in structured-contents. I'd like to preempt any breaking changes before the Xmas holidays.

I ran an experiment on two samples of 500 random articles each. Small differences were found, and only in Infobox and Sections. The good news is that Templates, abstracts, and categories are identical.

Differences in samples

----------------------------------------------------
| Sample  | Infobox different | Sections different |
----------------------------------------------------
| 1st 500 | 17                | 19                 |
----------------------------------------------------
| 2nd 500 | 2                 | 14                 |
----------------------------------------------------

So, the difference rate is 0.36% for Sections and 0.16% for Infoboxes. On Monday, I'll investigate why they are different.

After comparing the old and new output for the flagged articles, the JSON turned out to be identical. The screenshots show some examples, with the differences highlighted in RED and GREEN. The only difference is a prefix I put in the JSON to identify whether it's OLD or NEW JSON.
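
A comparison that ignores the OLD/NEW marker would avoid this kind of false positive. The sketch below assumes the marker is a top-level key, here called "_version". That key name and the function are hypothetical, since the exact form of the prefix is not recorded in this task.

```go
package main

import (
	"encoding/json"
	"reflect"
)

// EqualIgnoring reports whether two JSON objects are equal once the listed
// top-level keys have been removed from both. It compares decoded values,
// so key order and whitespace do not matter.
func EqualIgnoring(a, b []byte, ignore ...string) (bool, error) {
	var ma, mb map[string]any
	if err := json.Unmarshal(a, &ma); err != nil {
		return false, err
	}
	if err := json.Unmarshal(b, &mb); err != nil {
		return false, err
	}
	for _, k := range ignore {
		delete(ma, k)
		delete(mb, k)
	}
	return reflect.DeepEqual(ma, mb), nil
}
```

Calling EqualIgnoring(oldJSON, newJSON, "_version") would then report the two outputs as equal whenever the marker is the only thing that differs.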

The only minor difference in one category list output is that the order of items differs, but this is expected.
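
Since the ordering difference is expected, the comparison for category lists can be made order-insensitive. This is a minimal sketch with a hypothetical helper name, not the code used in the experiment.

```go
package main

import "sort"

// SameItems reports whether two string slices contain the same items with
// the same multiplicities, regardless of order. The inputs are copied
// before sorting so the caller's slices are left untouched.
func SameItems(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	as := append([]string(nil), a...)
	bs := append([]string(nil), b...)
	sort.Strings(as)
	sort.Strings(bs)
	for i := range as {
		if as[i] != bs[i] {
			return false
		}
	}
	return true
}
```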

In conclusion, after running experiments on 4,000 articles, the difference is near zero. The test code reported a 0.2% error rate, but on further inspection all of these were false positives.

[Screenshots: seven image.png attachments (F41610907 to F41610922) showing the OLD vs. NEW JSON comparisons]

ROdonnell-WMF moved this task from Sign Off to Done on the Wikimedia Enterprise (Sprint 52) board.