
Improve our plain text parsing of Wikipedia sections
Closed, Declined (Public)

Description

As a customer, I want to use the MR API and read the plain text from Wikipedia articles without any processing on my side. Customers are already using the Hugging Face Wikipedia dataset, so we'd like to give them a comparable (or better-quality) experience.

  • Compare the current plain text output to the Hugging Face format, and see how "Stephan's section parser" differed from the Hugging Face output.
  • Merge the applicable plain text sections (header, title, paragraph, lists, etc.) into a single property. Remove any unnecessary markup from the wiki Parsoid output.
  • Give customers a choice: extract the heading, paragraphs, and lists as separate sections, or get everything as one big text chunk.
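
To illustrate the last two points, here is a minimal sketch of how one section could be returned either as separate parts or as one merged text chunk. It is not the MR API: the `Part` type, the `kind` values, the `render_section` function, and the `merged` flag are all hypothetical names chosen for this example, and the input is assumed to already be stripped of Parsoid markup.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Part:
    # Hypothetical part kinds: "heading", "paragraph", or "list".
    kind: str
    # A string for headings and paragraphs, a list of strings for lists.
    text: Union[str, List[str]]


def part_to_text(part: Part) -> str:
    """Render one part as plain text; list items become one line each."""
    if part.kind == "list":
        return "\n".join(f"- {item}" for item in part.text)
    return part.text


def render_section(parts: List[Part], merged: bool = False):
    """Return the parts separately, or joined into one chunk when merged=True."""
    texts = [part_to_text(p) for p in parts]
    if merged:
        # One big text chunk, with a blank line between parts.
        return "\n\n".join(texts)
    # Structured output: one entry per heading, paragraph, or list.
    return [{"type": p.kind, "text": t} for p, t in zip(parts, texts)]
```

Calling `render_section(parts)` keeps the heading, paragraphs, and lists apart, while `render_section(parts, merged=True)` gives the single-property form described in the second bullet.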

Related Objects

Status    Subtype  Assigned  Task
Resolved           None
Open               None
Declined           None

Event Timeline

ROdonnell-WMF renamed this task from Comapre and pick best approaches to plain text parsing of Wikidata sections to [Investigation] Choose best approaches to plain text parsing of Wikidata sections. Jul 19 2023, 2:58 PM
ROdonnell-WMF renamed this task from [Investigation] Choose best approaches to plain text parsing of Wikidata sections to Improve our plain text parsing of Wikidata sections.
ROdonnell-WMF renamed this task from Improve our plain text parsing of Wikidata sections to Improve our plain text parsing of Wikipedia sections. Jul 20 2023, 1:49 PM

@SDelbecque-WMF can you review this ticket and see how it fits into your OKR plans? If it's not relevant, can we archive this ticket? Should this ticket be added to the {Machine-Readability} grouping?

Stephanie will create tickets that align with the OKRs.