
Infobox defect: parsing descendent tables causes concatenated text in first JSON field value
Closed, Resolved | Public | 3 Estimated Story Points

Assigned To
Authored By
ROdonnell-WMF
Jun 19 2023, 11:21 AM
Referenced Files
F37109375: image.png
Jun 19 2023, 11:42 AM
F37109373: image.png
Jun 19 2023, 11:42 AM
F37109370: image.png
Jun 19 2023, 11:42 AM
F37109332: image.png
Jun 19 2023, 11:21 AM
F37109328: image.png
Jun 19 2023, 11:21 AM

Description

Sub ticket of: https://phabricator.wikimedia.org/T339232

A flaw in table parsing: if there is a table within a table in the infobox, the first field contains all of the nested content concatenated together. The individual cells are then output correctly after it. Example:

JSON infobox

image.png (55×220 px, 11 KB)

Original HTML (see the <table> inside an infobox <tr>)
image.png (58×220 px, 12 KB)

Acceptance criteria

First field in this scenario should only have the text in that table cell, not all the descendant text

ToDo

  • Check for a <tr> among the descendants; if one exists, change the text extraction so that it collects the text without traversing the embedded <tr>
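
The ToDo item can be illustrated with a minimal sketch. The ticket does not show the parser's code, so this is an assumption-laden illustration in Python using the standard library's xml.etree.ElementTree on a well-formed fragment. The names naive_text, get_text_shallow and the sample markup are hypothetical and are not the project's GetText().

```python
import xml.etree.ElementTree as ET

# Hypothetical infobox row: the first cell holds a label plus a nested table.
ROW = """
<tr>
  <td>Founded
    <table>
      <tr><td>1990</td></tr>
      <tr><td>Berlin</td></tr>
    </table>
  </td>
</tr>
"""

def naive_text(node):
    """Collect every descendant text node (reproduces the concatenation defect)."""
    return " ".join("".join(node.itertext()).split())

def _collect(node, skip_tag, parts):
    # Text directly inside this node, before its first child.
    if node.text:
        parts.append(node.text)
    for child in node:
        # Do not descend into an embedded <tr>; its cells are emitted separately.
        if child.tag != skip_tag:
            _collect(child, skip_tag, parts)
        # Tail text belongs to the parent, so keep it even for a skipped child.
        if child.tail:
            parts.append(child.tail)

def get_text_shallow(node, skip_tag="tr"):
    """Collect text under node without traversing nested <skip_tag> elements."""
    parts = []
    _collect(node, skip_tag, parts)
    return " ".join("".join(parts).split())

cell = ET.fromstring(ROW.strip()).find("td")
print(naive_text(cell))        # text of the cell plus all nested rows
print(get_text_shallow(cell))  # text of the cell only
```

Only children are checked against the skipped tag, so calling the helper on a <tr> itself still returns that row's own text. Whether a shared text helper can change behaviour like this safely depends on its other callers, which is the side-effect concern raised in this ticket.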

Checklist for testing

  • Manually run cli gen in the parser project.
Things to consider:
  • Check that other JSON outputs don't get side effects with the change in parser selectors.

Event Timeline

Adding a check for <tr> nodes in GetText() that skips the traversal when the node is a <tr> element node. Need to check with Stephan: this will have side effects if other code uses the GetText() method.

image.png (197×689 px, 30 KB)

This solves the two examples in the parent ticket: the first field in a section is fixed and no longer contains repeated text from subsequent fields.

image.png (311×2 px, 68 KB)

image.png (950×1 px, 113 KB)

ROdonnell-WMF changed the task status from Open to In Progress. Jun 19 2023, 11:42 AM

Adding logic to check whether there is a <th><td> sequence in the infobox row. If there is, the <th> is the label and the first <td> after it is the data.
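
A rough sketch of the <th><td> check described in the comment above, again in Python with xml.etree.ElementTree and well-formed sample markup. The function names label_and_data and cell_text and the sample rows are hypothetical; the actual parser logic is not shown in this ticket and may differ.

```python
import xml.etree.ElementTree as ET

def cell_text(cell):
    """Whitespace-normalized text of a cell (plain traversal, kept simple here)."""
    return " ".join("".join(cell.itertext()).split())

def label_and_data(row):
    """Return (label, data) if the row has a <th> directly followed by a <td>.

    The <th> is treated as the label and the first <td> after it as the data.
    Returns None when the row contains no such sequence.
    """
    cells = [c for c in row if c.tag in ("th", "td")]
    for first, second in zip(cells, cells[1:]):
        if first.tag == "th" and second.tag == "td":
            return cell_text(first), cell_text(second)
    return None

# Hypothetical rows for illustration only.
field_row = ET.fromstring("<tr><th>Founded</th><td>1990</td><td>Berlin</td></tr>")
header_row = ET.fromstring("<tr><th>History</th></tr>")

print(label_and_data(field_row))   # label/data pair for a regular field row
print(label_and_data(header_row))  # None: no <th><td> sequence in this row
```

Rows without the sequence return None, so a caller can fall back to whatever handling already exists for section headers or full-width cells.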