Change Details

User Story: “As a client API user, I want to use the structured-contents API, so that I can see that all the infobox JSON is correctly parsed and the content defects are fixed.”

**Acceptance criteria**

Read the Citation Research report on the Infobox Feedback Investigation and implement its recommendations.

**ToDo**

- [X] **ABC News**, "Headquarters" infobox key. One entry in the "values" array strings contains a comma. Optimise the getText function to check whether the element starts or ends with a punctuation mark (, : ; etc.) or a space, and treat it accordingly.
- [X] **Slogan** infobox value has a in wikitext that has been removed from the API response. Optimise the getText function to check for tags and use a comma to separate the element instead of a space.
- [X] **FetLife**, URL entry has too many spaces: "text": "fetlife .com". Root cause: the goquery library, which is used to convert HTML to text, uses a space to separate elements. Optimise the getText function to check for tags and not separate the element with a space or comma.
- [X] **Toyota** is missing a space in the "Parent" value after ownership; the list is also flat, without any boundary between items. The infobox parser does not handle lists properly and treats them as text. Optimise the getText function to check for <li> tags and separate the elements with commas.
- [X] **Monday Night Football** "Presented by" and "Presented by" are also flat lists without boundaries. The flat lists result from the infobox parser not handling lists appropriately. Additionally, the getText method adds a space to <a> tag elements to adequately concatenate sentences and avoid follow-on sentences. Optimise the getText function to check for <li> tags and separate the elements with commas to solve the issue of flat lists. Because a space is added at the end of an <a> tag element, check whether the character after the element is a punctuation mark and, if so, avoid adding the space.
- [X] **Google Chrome** infobox. Our current parser is not optimised for HTML table parsing. Until we have a table parser, this should be delayed.
  - More info: Google Chrome, "Initial release" key:
    - "name": "Initial releaseWindows XPWindows XPmacOS, LinuxmacOS, LinuxMulti platform"
    - "value": "Beta / September 2, 2008; 15 years ago1.0 / December 11, 2008; 14 years agoPreview /...
- [X] **Zillow** Key people missing entry....
  - Jeremy Hofmann is not listed

===== Test Strategy =====

Use the normal Parser snapshot testing framework and add the above articles to the testing suite.

- [x] Luvo
- [ ] Ricardo

===== Description (optional) =====

Read the doc report on the Infobox Feedback investigation to get more context on each defect. I recommend we don't implement an HTML table parser to fix the Google Chrome defect. Table parsing will be problematic. Also, the Zillow defect seems to be fixed or was not a defect.
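
For reviewers, the two getText rules described in the ToDo list (comma-separated <li> items, and no space before trailing punctuation) can be sketched roughly as below. This is a minimal illustration, not the actual change: the `Node` type, the `text`/`el` helpers, `needsSpace`, and this `getText` body are all hypothetical stand-ins that use only the standard library, whereas the real parser works on goquery selections. The sample strings are made up, apart from the "fetlife .com" case quoted above.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// Node is a hypothetical, simplified element tree used only for this sketch.
type Node struct {
	Tag      string // empty for a text node, otherwise a tag name such as "a" or "li"
	Text     string // content of a text node
	Children []*Node
}

// text and el are illustrative helpers for building small trees.
func text(s string) *Node { return &Node{Text: s} }

func el(tag string, children ...*Node) *Node {
	return &Node{Tag: tag, Children: children}
}

// needsSpace reports whether a separating space belongs between prev and next.
// No space is added next to existing whitespace or before closing punctuation.
func needsSpace(prev, next string) bool {
	if prev == "" || next == "" {
		return false
	}
	last, _ := utf8.DecodeLastRuneInString(prev)
	first, _ := utf8.DecodeRuneInString(next)
	if unicode.IsSpace(last) || unicode.IsSpace(first) {
		return false
	}
	return !strings.ContainsRune(",.:;!?)", first)
}

// getText flattens a node to text (assumed behaviour, not the real implementation).
func getText(n *Node) string {
	if n.Tag == "" {
		return n.Text
	}
	// Lists: keep item boundaries by joining <li> children with commas.
	if n.Tag == "ul" || n.Tag == "ol" {
		var items []string
		for _, c := range n.Children {
			if c.Tag != "li" {
				continue
			}
			if t := strings.TrimSpace(getText(c)); t != "" {
				items = append(items, t)
			}
		}
		return strings.Join(items, ", ")
	}
	// Other elements: concatenate children, adding a space only where needed.
	var b strings.Builder
	for _, c := range n.Children {
		part := getText(c)
		if needsSpace(b.String(), part) {
			b.WriteByte(' ')
		}
		b.WriteString(part)
	}
	return b.String()
}

func main() {
	list := el("ul", el("li", text("ESPN")), el("li", el("a", text("ABC"))))
	fmt.Println(getText(list))

	url := el("span", el("a", text("fetlife")), text(".com"))
	fmt.Println(getText(url))
}
```

In this sketch the list prints as a comma-separated value and the URL fragments are joined without the stray space. Table-style infoboxes such as the Google Chrome one are deliberately out of scope here.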