
Implement infobox fixes for structured-contents [1 week]
Closed, ResolvedPublic

Description

User Story: “As a client API user, I want to use structured-contents API,
so that I can see all the infobox JSON is correctly parsed and the content defects are fixed.”

Acceptance criteria
Read the Citation Research report on Infobox Feedback Investigation and implement the recommendations

ToDo

  • ABC News, "Headquarters" infobox key. One entry has a comma in the "values" array strings. Optimise the getText function to check whether the element starts or ends with a punctuation mark (, : ; etc.) or a space, and treat it accordingly.
  • Slogan infobox value has a <br> in wikitext that has been removed from the API response. Optimise the getText function to check for <br> tags and use a comma to separate the element instead of a space.
  • FetLife, URL entry has too many spaces: "text": "fetlife .com". Root Cause: The goquery library, which is used to convert HTML to text, uses space to separate elements. Optimise the getText function to check for <wbr> tags and not separate the element with a space or comma.
  • Toyota: missing space in the "Parent" value after ownership; the list is also flat, without any boundary between items. The infobox parser does not handle lists properly and treats them as text. Optimise the getText function to check for <li> tags and separate the elements with commas.
  • Monday Night Football "Presented by" and "Presented by" are also flat lists without boundaries. The flat lists result from the infobox parser not handling lists appropriately. Additionally, the getText method adds a space to <a> tag elements to adequately concatenate sentences and avoid follow-on sentences. Optimise the getText function to check for <li> tags and separate the elements with commas to solve the issue of flat lists. Because a space is added at the end of an <a> tag element, check whether the character after the element is a punctuation mark and, if so, avoid adding the space.
  • Google Chrome infobox. Our current parser is not optimised for HTML table parsing. Until we have a Table parser this should be delayed.
  • More info: Google Chrome: "Initial release" key:
    • "name": "Initial releaseWindows XPWindows XPmacOS, LinuxmacOS, LinuxMulti platform"
    • "value": "Beta / September 2, 2008; 15 years ago1.0 / December 11, 2008; 14 years agoPreview /...
  • Zillow Key people missing entry....
    • Jeremy Hofmann is not listed
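
The separator rules proposed in the list above can be sketched as follows. This is a minimal illustration using only the Go standard library, not the actual getText implementation: the function name joinText, the regular-expression tokenizer and the punctuation set are all assumptions made for the example, and a real fix would work on the goquery node tree instead. HTML entities are not decoded here.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Separator kinds, ordered so that a stronger kind overrides a weaker one.
const (
	sepSpace = iota // default between elements (the behaviour the task describes)
	sepGlue         // <wbr>: join with nothing
	sepComma        // <br> or <li>: item boundary
)

// tagRe matches simple tags such as <br>, <li class="x"> or </a>.
// Assumption: no ">" inside attribute values.
var tagRe = regexp.MustCompile(`</?([a-zA-Z][a-zA-Z0-9]*)[^>]*>`)

// punctuation lists characters that should never be preceded by a separator.
const punctuation = ",.:;!?)"

// joinText is a hypothetical stand-in for getText.
func joinText(html string) string {
	var out strings.Builder
	sep := sepSpace
	emit := func(text string) {
		text = strings.TrimSpace(text)
		if text == "" {
			return
		}
		if out.Len() > 0 {
			prev := out.String()
			switch {
			case sep == sepGlue || strings.ContainsRune(punctuation, rune(text[0])):
				// <wbr>, or the chunk starts with punctuation: write no separator.
			case sep == sepComma && !strings.ContainsRune(",;:", rune(prev[len(prev)-1])):
				out.WriteString(", ")
			default:
				out.WriteString(" ")
			}
		}
		out.WriteString(text)
		sep = sepSpace
	}
	pos := 0
	for _, m := range tagRe.FindAllStringSubmatchIndex(html, -1) {
		emit(html[pos:m[0]]) // text before this tag
		pos = m[1]
		kind := sepSpace
		switch strings.ToLower(html[m[2]:m[3]]) {
		case "br", "li": // opening and closing tags both count as a boundary
			kind = sepComma
		case "wbr":
			kind = sepGlue
		}
		if kind > sep {
			sep = kind
		}
	}
	emit(html[pos:]) // trailing text after the last tag
	return out.String()
}

func main() {
	fmt.Println(joinText("fet<wbr>life.com"))                     // fetlife.com
	fmt.Println(joinText("<ul><li>Alpha</li><li>Beta</li></ul>")) // Alpha, Beta
	fmt.Println(joinText("First line<br>Second line"))            // First line, Second line
	fmt.Println(joinText(`See <a href="/x">Example</a>, then more`))
}
```

The inputs in main are made-up fragments, not the markup of the articles named above. Keeping a single pending separator makes the precedence explicit: a list or <br> boundary wins over <wbr>, which wins over the default space.
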
Test Strategy

Use the normal Parser snapshot testing framework and add the above articles to the testing suite.

  • Luvo
  • Ricardo
Description (optional)

Read the doc report on the Infobox Feedback investigation to get more context on each defect. I recommend we don't implement an HTML table parser to fix the Google Chrome defect. Table parsing will be problematic. Also, the Zillow defect seems to be fixed or was not a defect.

Event Timeline

JArguello-WMF renamed this task from Implement infobox fixes for structured-contents to Implement infobox fixes for structured-contents [1 week]. Apr 11 2024, 1:43 PM

Toyota - replace the values array with an array containing only 1 item: a Markdown string of the list items.
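
A minimal sketch of that idea, assuming the list items have already been extracted as plain strings. The helper name listToValues and the "- " bullet style joined by newlines are assumptions for illustration; the comment above does not specify the exact Markdown form.

```go
package main

import (
	"fmt"
	"strings"
)

// listToValues (hypothetical helper) collapses list items into a
// one-element slice holding a single Markdown bullet list.
func listToValues(items []string) []string {
	lines := make([]string, 0, len(items))
	for _, it := range items {
		if it = strings.TrimSpace(it); it != "" {
			lines = append(lines, "- "+it)
		}
	}
	if len(lines) == 0 {
		return nil // nothing to report for an empty list
	}
	return []string{strings.Join(lines, "\n")}
}

func main() {
	// Placeholder items, not the real Toyota infobox content.
	values := listToValues([]string{"Item one", "Item two"})
	fmt.Printf("%q\n", values)
}
```
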

Google Chrome - table parser will need more work. There is an RFC document under discussion. Once the table parser is ready, this defect can be readdressed.

The Google Chrome infobox has a strange editor layout. It has 3 inner tables within the infobox: one that spans just an infobox-data cell, and two more that span an infobox-full-data-row cell. What makes it difficult for the infobox parser is that all these inner tables use infobox-label and infobox-data at two different levels. This means that if we parse <tr> elements we'll get the parent row (with all the inner <tr> elements concatenated in one JSON row) and then individual JSON rows for each of the inner rows. There is not much we can do about that.
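
The duplication can be reproduced on a small scale. The sketch below is an illustration only: the fragment is a hand-written, heavily simplified stand-in for the Chrome infobox, and it is parsed with the standard library's encoding/xml (which needs well-formed markup) in place of the real goquery-based parser. The names node, text and rowTexts are hypothetical.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// node is a minimal element tree, sufficient for well-formed markup only.
type node struct {
	XMLName  xml.Name
	Text     string `xml:",chardata"`
	Children []node `xml:",any"`
}

// nestedInfobox is a simplified, hypothetical fragment: an infobox row whose
// data cell holds an inner table that reuses the label/data classes.
const nestedInfobox = `<table class="infobox"><tr>` +
	`<th class="infobox-label">Initial release</th>` +
	`<td class="infobox-data"><table>` +
	`<tr><th class="infobox-label">Windows XP</th><td class="infobox-data">Beta</td></tr>` +
	`<tr><th class="infobox-label">macOS, Linux</th><td class="infobox-data">Preview</td></tr>` +
	`</table></td></tr></table>`

func parse(fragment string) (node, error) {
	var root node
	err := xml.Unmarshal([]byte(fragment), &root)
	return root, err
}

// text concatenates an element's own text and all descendant text.
func text(n node) string {
	var b strings.Builder
	b.WriteString(strings.TrimSpace(n.Text))
	for _, c := range n.Children {
		b.WriteString(text(c))
	}
	return b.String()
}

// rowTexts visits every <tr>, at any depth, the way a naive row parser would.
func rowTexts(n node) []string {
	var rows []string
	if n.XMLName.Local == "tr" {
		rows = append(rows, text(n))
	}
	for _, c := range n.Children {
		rows = append(rows, rowTexts(c)...)
	}
	return rows
}

func main() {
	root, err := parse(nestedInfobox)
	if err != nil {
		panic(err)
	}
	for i, row := range rowTexts(root) {
		fmt.Printf("row %d: %q\n", i, row)
	}
}
```

With this fragment the walk yields three rows: the parent row, whose text is every inner cell run together, followed by the two inner rows on their own.
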

We could flatten the rows, but that would break other pages. We could skip the parent row, but that breaks other pages.

For the moment, it's best to say this is a "known issue" with bad editor markup. We should run some analytics to see if this HTML/CSS pattern is common in WMF articles. If it is not, then make an editorial change and ask for the infobox to be corrected in WMF. If it's common, we need a way to resolve this issue cleanly. As yet I have no clean solution.
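
One possible shape for that analytics check is sketched below, under the assumption that "the pattern" means infobox-label cells appearing at more than one table nesting depth within the same infobox. The function names are hypothetical, and encoding/xml is used only to keep the example self-contained; real article HTML is not guaranteed to be well-formed XML, so a production run would need a proper HTML parser.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// node keeps element names and attributes; text content is not needed here.
type node struct {
	XMLName  xml.Name
	Attrs    []xml.Attr `xml:",any,attr"`
	Children []node     `xml:",any"`
}

// hasClass reports whether the element's class attribute contains class.
func hasClass(n node, class string) bool {
	for _, a := range n.Attrs {
		if a.Name.Local != "class" {
			continue
		}
		for _, c := range strings.Fields(a.Value) {
			if c == class {
				return true
			}
		}
	}
	return false
}

// labelDepths records the table nesting depth of every infobox-label cell.
func labelDepths(n node, depth int, seen map[int]bool) {
	if n.XMLName.Local == "table" {
		depth++
	}
	if hasClass(n, "infobox-label") {
		seen[depth] = true
	}
	for _, c := range n.Children {
		labelDepths(c, depth, seen)
	}
}

// hasNestedLabels reports whether labels occur at two or more table depths.
func hasNestedLabels(fragment string) (bool, error) {
	var root node
	if err := xml.Unmarshal([]byte(fragment), &root); err != nil {
		return false, err
	}
	seen := map[int]bool{}
	labelDepths(root, 0, seen)
	return len(seen) > 1, nil
}

func main() {
	// Hypothetical fragments standing in for a flat and a nested infobox.
	flat := `<table class="infobox"><tr><th class="infobox-label">A</th><td class="infobox-data">B</td></tr></table>`
	nested := `<table class="infobox"><tr><th class="infobox-label">A</th><td class="infobox-data">` +
		`<table><tr><th class="infobox-label">C</th><td class="infobox-data">D</td></tr></table></td></tr></table>`
	for name, frag := range map[string]string{"flat": flat, "nested": nested} {
		ok, err := hasNestedLabels(frag)
		fmt.Println(name, ok, err)
	}
}
```

Running a check like this over a sample of articles and counting the positives would give a rough measure of how common the markup is.
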