Implement infobox fixes for structured-contents [1 week]
Closed, ResolvedPublic
Assigned To

Authored By

	ROdonnell-WMF
	Dec 11 2023, 3:02 PM

Description

User Story: “As a client API user, I want to use the structured-contents API
so that I can see that all infobox JSON is correctly parsed and the content defects are fixed.”

Acceptance criteria
Read the Citation Research report on Infobox Feedback Investigation and implement the recommendations

ToDo

ABC News, "Headquarters" infobox key: one entry has a comma in the "values" array strings. Optimise the getText function to check whether an element starts or ends with a punctuation mark (, : ; etc.) or a space, and treat it accordingly.
Slogan infobox value: the wikitext contains a <br> that has been removed from the API response. Optimise the getText function to check for <br> tags and use a comma, instead of a space, to separate the elements.
FetLife, URL entry has too many spaces: "text": "fetlife .com". Root cause: the goquery library, which is used to convert HTML to text, uses a space to separate elements. Optimise the getText function to check for <wbr> tags and not separate the elements with a space or comma.
Toyota: missing space in the "Parent" value after ownership; the list is also flat, without any boundary between items. The infobox parser does not handle lists properly and treats them as text. Optimise the getText function to check for <li> tags and separate the elements with commas.
Monday Night Football: "Presented by" and "Presented by" are also flat lists without boundaries. The flat lists result from the infobox parser not handling lists appropriately. Additionally, the getText method adds a space after <a> tag elements so that sentences concatenate correctly and do not run together. Optimise the getText function to check for <li> tags and separate the elements with commas to solve the issue of flat lists. Because a space is added at the end of an <a> tag element, check whether the character after the element is a punctuation mark and, if so, avoid adding the space.
Google Chrome infobox. Our current parser is not optimised for HTML table parsing. Until we have a Table parser this should be delayed.
More info: Google Chrome: "Initial release" key:
- "name": "Initial releaseWindows XPWindows XPmacOS, LinuxmacOS, LinuxMulti platform"
- "value": "Beta / September 2, 2008; 15 years ago1.0 / December 11, 2008; 14 years agoPreview /...
Zillow: "Key people" is missing an entry.
- Jeremy Hofmann is not listed
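
The separator rules in the getText items above (<br> becomes a comma, <wbr> joins with nothing, <li> items are comma-separated, and no space is added before punctuation) can be sketched as follows. This is a minimal illustration, not the actual parser code: the Node type, the text and elem helpers, and the token kinds are all hypothetical stand-ins so the example runs without goquery, and the punctuation sets are assumptions.

```go
package main

import "strings"

// Node is a hypothetical, simplified HTML node (not a goquery type).
// Text nodes have an empty Tag.
type Node struct {
	Tag      string
	Text     string
	Children []*Node
}

// text and elem are illustrative helpers for building small trees.
func text(s string) *Node { return &Node{Text: s} }

func elem(tag string, kids ...*Node) *Node { return &Node{Tag: tag, Children: kids} }

type kind int

const (
	textTok  kind = iota // a text fragment
	commaTok             // boundary from <br> or the end of an <li>: join with ", "
	glueTok              // boundary from <wbr>: join with nothing
)

type token struct {
	kind kind
	text string
}

// flatten walks the tree and emits text fragments and boundary tokens.
func flatten(n *Node, out *[]token) {
	switch n.Tag {
	case "":
		if t := strings.TrimSpace(n.Text); t != "" {
			*out = append(*out, token{kind: textTok, text: t})
		}
	case "br":
		*out = append(*out, token{kind: commaTok})
	case "wbr":
		*out = append(*out, token{kind: glueTok})
	default:
		for _, c := range n.Children {
			flatten(c, out)
		}
		if n.Tag == "li" {
			*out = append(*out, token{kind: commaTok})
		}
	}
}

const (
	leading  = ",.;:!?)" // assumed set: a fragment starting with one of these attaches directly
	trailing = ",;:"     // assumed set: output ending with one of these needs no extra comma
)

// getText joins fragments with a space by default and applies the rules above.
func getText(root *Node) string {
	var toks []token
	flatten(root, &toks)

	var sb strings.Builder
	sep := " "
	for _, t := range toks {
		switch t.kind {
		case commaTok:
			sep = ", "
		case glueTok:
			sep = ""
		case textTok:
			if sb.Len() > 0 {
				out := sb.String()
				switch {
				case strings.IndexAny(t.text, leading) == 0:
					// Fragment starts with punctuation: write no separator.
				case sep == ", " && strings.LastIndexAny(out, trailing) == len(out)-1:
					sb.WriteByte(' ') // avoid a doubled comma
				default:
					sb.WriteString(sep)
				}
			}
			sb.WriteString(t.text)
			sep = " "
		}
	}
	return sb.String()
}
```

For example, under these assumptions getText(elem("td", text("fetlife"), elem("wbr"), text(".com"))) yields "fetlife.com", and a <ul> with items A and B yields "A, B".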

Test Strategy

Use the normal Parser snapshot testing framework and add the above articles to the testing suite.

Luvo
Ricardo

Description (optional)

Read the doc report on the Infobox Feedback investigation to get more context on each defect. I recommend we don't implement an HTML table parser to fix the Google Chrome defect. Table parsing will be problematic. Also, the Zillow defect seems to be fixed or was not a defect.

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		E.Enabulele	T351803 {Investigation} RCA on infoboxes feedback
		Resolved		ROdonnell-WMF	T353153 Implement infobox fixes for structured-contents [1 week]

Event Timeline

ROdonnell-WMF created this task. Dec 11 2023, 3:02 PM

Restricted Application added a subscriber: Aklapper. Dec 11 2023, 3:02 PM

ROdonnell-WMF updated the task description. Dec 11 2023, 3:03 PM

ROdonnell-WMF added a parent task: T351803: {Investigation} RCA on infoboxes feedback.

ROdonnell-WMF updated the task description. Dec 13 2023, 11:12 AM

SDelbecque-WMF moved this task from Incoming to To Be Estimated/To Be Discussed on the Wikimedia Enterprise board. Jan 10 2024, 3:02 PM

LDlulisa-WMF moved this task from To Be Estimated/To Be Discussed to Incoming on the Wikimedia Enterprise board. Jan 10 2024, 3:18 PM

LDlulisa-WMF moved this task from Incoming to To Be Estimated/To Be Discussed on the Wikimedia Enterprise board. Jan 10 2024, 3:33 PM

JArguello-WMF moved this task from To Be Estimated/To Be Discussed to Engineering Backlog (DevOps, Maintenance, Tech debt) on the Wikimedia Enterprise board. Apr 4 2024, 1:22 PM

JArguello-WMF moved this task from Engineering Backlog (DevOps, Maintenance, Tech debt) to Machine Readability PB on the Wikimedia Enterprise board. Apr 9 2024, 1:54 PM

SDelbecque-WMF moved this task from Machine Readability PB to To Be Estimated/To Be Discussed on the Wikimedia Enterprise board. Apr 11 2024, 12:49 PM

JArguello-WMF renamed this task from Implement infobox fixes for structured-contents to Implement infobox fixes for structured-contents [1 week]. Apr 11 2024, 1:43 PM

JArguello-WMF moved this task from To Be Estimated/To Be Discussed to Estimated /Discussed on the Wikimedia Enterprise board.

REsquito-WMF moved this task from Estimated /Discussed to To Be Estimated/To Be Discussed on the Wikimedia Enterprise board. Apr 24 2024, 11:40 AM

REsquito-WMF moved this task from To Be Estimated/To Be Discussed to Estimated /Discussed on the Wikimedia Enterprise board.

JArguello-WMF moved this task from Estimated /Discussed to Sprint 59 on the Wikimedia Enterprise board. May 2 2024, 1:33 PM

JArguello-WMF edited projects, added Wikimedia Enterprise (Sprint 59); removed Wikimedia Enterprise.

JArguello-WMF moved this task from Sprint 59 to Sprint 60 on the Wikimedia Enterprise board.

JArguello-WMF edited projects, added Wikimedia Enterprise (Sprint 60); removed Wikimedia Enterprise (Sprint 59).

ROdonnell-WMF claimed this task. May 3 2024, 8:13 PM

ROdonnell-WMF moved this task from Next Up to In Progress on the Wikimedia Enterprise (Sprint 60) board.

ROdonnell-WMF updated the task description. May 3 2024, 8:29 PM

Toyota - replace the values array with an array containing only one item: a Markdown string of the list items.
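
A rough sketch of that idea, assuming the list items have already been extracted as plain strings. The function name listValue and the "- " bullet style are assumptions for illustration only; the ticket does not specify the exact Markdown shape.

```go
package main

import "strings"

// listValue is a hypothetical helper: it collapses list items into a
// single-element slice holding one Markdown bullet list.
func listValue(items []string) []string {
	lines := make([]string, 0, len(items))
	for _, it := range items {
		// Skip empty items so the Markdown has no blank bullets.
		if it = strings.TrimSpace(it); it != "" {
			lines = append(lines, "- "+it)
		}
	}
	if len(lines) == 0 {
		return nil
	}
	return []string{strings.Join(lines, "\n")}
}
```

With this shape, the item boundaries survive inside the string, so a client can split on newlines if it needs the individual entries.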

Google Chrome - the table parser will need more work. There is an RFC document under discussion. Once the table parser is ready, this defect can be readdressed.

ROdonnell-WMF updated the task description. May 3 2024, 9:39 PM

ROdonnell-WMF updated the task description. May 3 2024, 10:11 PM

ROdonnell-WMF updated the task description. May 3 2024, 10:19 PM

ROdonnell-WMF updated the task description. May 7 2024, 9:18 AM

ROdonnell-WMF updated the task description. May 7 2024, 9:26 AM

The Google Chrome infobox has a strange editor layout: it has 3 inner tables within the infobox. One spans just an infobox-data cell, and two more span an infobox-full-data-row cell. What makes this difficult for the infobox parser is that all these inner tables use infobox-label and infobox-data at two different levels. This means that if we parse <tr> elements, we get the parent row (with all the inner rows concatenated into one JSON row) and then individual JSON rows for each of the inner rows. There is not much we can do about that.

We could flatten the rows, but that would break other pages. We could skip the parent row, but that breaks other pages.

For the moment, it's best to say this is a "known issue" with bad editor markup. We should run some analytics to see if this HTML/CSS pattern is common in WMF articles. If it is not, then make an editorial change and ask for the infobox to be corrected in WMF. If it is common, we need a way to resolve this issue cleanly. As yet I have no clean solution.
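
For the analytics step, one possible check is to flag infobox rows that contain an inner table whose cells reuse the infobox-label or infobox-data classes. The sketch below is only an illustration of that pattern test: the Node type is a hypothetical stand-in for a parsed HTML tree (with a single class per node for simplicity), and the function names are not from the parser.

```go
package main

// Node is a hypothetical, simplified HTML element with a single class.
type Node struct {
	Tag      string
	Class    string
	Children []*Node
}

// elem is an illustrative helper for building small trees.
func elem(tag, class string, kids ...*Node) *Node {
	return &Node{Tag: tag, Class: class, Children: kids}
}

// isLabelledCell reports whether n is a cell using the infobox cell classes.
func isLabelledCell(n *Node) bool {
	return (n.Tag == "th" || n.Tag == "td") &&
		(n.Class == "infobox-label" || n.Class == "infobox-data")
}

// containsLabelledCell reports whether any descendant of n is such a cell.
func containsLabelledCell(n *Node) bool {
	for _, c := range n.Children {
		if isLabelledCell(c) || containsLabelledCell(c) {
			return true
		}
	}
	return false
}

// hasNestedInfoboxTable reports whether an outer infobox row contains an
// inner <table> that reuses infobox-label / infobox-data cells.
func hasNestedInfoboxTable(row *Node) bool {
	for _, c := range row.Children {
		if c.Tag == "table" && containsLabelledCell(c) {
			return true
		}
		if hasNestedInfoboxTable(c) {
			return true
		}
	}
	return false
}

// countAffected counts the rows that show the nested pattern.
func countAffected(rows []*Node) int {
	n := 0
	for _, r := range rows {
		if hasNestedInfoboxTable(r) {
			n++
		}
	}
	return n
}
```

Running countAffected over the rows of each article's infobox would give a rough per-article signal of how widespread the nested layout is, which is the input the decision above depends on.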

ROdonnell-WMF moved this task from In Progress to MR on the Wikimedia Enterprise (Sprint 60) board. May 7 2024, 9:33 AM

JArguello-WMF updated the task description. May 8 2024, 1:11 PM

ROdonnell-WMF moved this task from MR to QA on the Wikimedia Enterprise (Sprint 60) board. May 15 2024, 10:58 AM

JArguello-WMF moved this task from QA to Sign Off on the Wikimedia Enterprise (Sprint 60) board. May 16 2024, 1:04 PM

ROdonnell-WMF moved this task from Sign Off to Done on the Wikimedia Enterprise (Sprint 60) board. May 16 2024, 4:40 PM

JArguello-WMF closed this task as Resolved. May 23 2024, 2:04 PM