Improve our plain text parsing of Wikipedia sections
Closed, Declined (Public)

Assigned To

None

Authored By

	ROdonnell-WMF
	Jul 19 2023, 2:40 PM

Description

As a customer, I want to use the MR API to read the plain text of Wikipedia articles without any processing on my side. Customers are already using the Hugging Face Wikipedia dataset, and we'd like to give them a comparable (or better quality) experience.

Compare the current plain text output to the Hugging Face format, and note where "Stephan's section parser" differs from Hugging Face.
Merge the applicable plain text elements (header, title, paragraph, lists, etc.) into one single property. Remove any unnecessary markup from the wiki Parsoid output.
Give customers a choice to extract the heading, paragraphs, and lists per section or to see everything as one big text chunk.
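
The last two points could be sketched roughly as below. This is only an illustration under stated assumptions: the section shape (a dict with "heading", "paragraphs", and "list_items" holding HTML fragments) and the function names are hypothetical, not the MR API schema or the existing section parser. Markup is stripped with Python's standard html.parser rather than any Parsoid-specific tooling.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and drops all tags (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_markup(html_fragment):
    """Return the text content of an HTML fragment, whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html_fragment)
    parser.close()
    return " ".join("".join(parser.chunks).split())


def section_to_text(section):
    """Flatten one section (assumed dict shape) into a single plain text string."""
    lines = [strip_markup(section.get("heading", ""))]
    lines += [strip_markup(p) for p in section.get("paragraphs", [])]
    items = [strip_markup(i) for i in section.get("list_items", [])]
    lines += ["- " + text for text in items if text]
    return "\n".join(line for line in lines if line)


def render(sections, merged=False):
    """Return one string per section, or one big text chunk if merged=True."""
    texts = [section_to_text(s) for s in sections]
    return "\n\n".join(texts) if merged else texts


if __name__ == "__main__":
    example = [
        {
            "heading": "<h2>History</h2>",
            "paragraphs": ["<p>Founded in <b>2001</b>.</p>"],
            "list_items": ["<li><a href='x'>First</a> item</li>"],
        }
    ]
    print(render(example, merged=True))
```

The merged flag is one possible way to expose the customer choice described above; a real implementation might instead return both forms as separate properties.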

Related Objects

Status	Assigned	Task
Resolved	None	T350102 OKR 4.3
Open	None	T342109 {Machine Readability} Parsing Sections
Declined	None	T342261 Improve our plain text parsing of Wikipedia sections

Event Timeline

ROdonnell-WMF created this task. Jul 19 2023, 2:40 PM

ROdonnell-WMF renamed this task from Comapre and pick best approaches to plain text parsing of Wikidata sections to [Investigation] Choose best approaches to plain text parsing of Wikidata sections. Jul 19 2023, 2:58 PM

ROdonnell-WMF renamed this task from [Investigation] Choose best approaches to plain text parsing of Wikidata sections to Improve our plain text parsing of Wikidata sections.

ROdonnell-WMF renamed this task from Improve our plain text parsing of Wikidata sections to Improve our plain text parsing of Wikipedia sections. Jul 20 2023, 1:49 PM

JArguello-WMF moved this task from Incoming to Engineering Backlog (DevOps, Maintenance, Tech debt) on the Wikimedia Enterprise board. Jul 24 2023, 5:57 PM

creynolds subscribed. Aug 18 2023, 2:08 AM

@SDelbecque-WMF can you review this ticket and see how it fits into your OKR plans? If it's not relevant, can we archive this ticket? Should this ticket be added to the {Machine-Readability} grouping?

Stephanie will create tickets that align with the OKRs