As a customer, I want to use the MR API to read the plain text of Wikipedia articles without any processing on my side. Customers are already using the Hugging Face Wikipedia dataset, and we'd like to give them a comparable (or better quality) experience.
- Compare the current plain text output to the Hugging Face format, and check where "Stephan's section parser" differed from Hugging Face.
- Merge the applicable plain text sections (header, title, paragraph, lists, etc.) into one single property. Remove any unnecessary markup from the wiki Parsoid output.
- Give customers the choice to extract headings, paragraphs, and lists by section, or to receive everything as one big text chunk.
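
The two output modes described above could look roughly like the sketch below. This is only an illustration under stated assumptions: the `Section` shape, the field names, and the `merged` flag are all hypothetical and do not reflect the actual MR API schema or the Hugging Face format.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    # Hypothetical shape of one parsed section; not the actual API schema.
    heading: str
    paragraphs: list[str] = field(default_factory=list)
    list_items: list[str] = field(default_factory=list)


def section_to_text(section: Section) -> str:
    """Join a section's heading, paragraphs, and list items into plain text."""
    parts = [section.heading, *section.paragraphs]
    parts.extend(f"- {item}" for item in section.list_items)
    # Skip empty strings so a missing heading leaves no blank line behind.
    return "\n".join(part for part in parts if part)


def extract_plain_text(sections: list[Section], merged: bool = False):
    """Return one text chunk if merged is True, else one string per section."""
    texts = [section_to_text(s) for s in sections]
    return "\n\n".join(texts) if merged else texts


# Example input (made-up content, for illustration only).
sections = [
    Section("Intro", ["First paragraph."]),
    Section("History", ["Founded."], ["item a", "item b"]),
]
per_section = extract_plain_text(sections)            # list of strings
one_chunk = extract_plain_text(sections, merged=True)  # single string
```

Keeping the per-section rendering in one helper means both modes produce identical text, so the merged chunk is always just the sections joined by blank lines.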