Objective
Review the Wikipedia plain text datasets used by LLM and data science teams to evaluate whether our clients would benefit from a simple plain text output of our data.
- Identify popular datasets used by data science teams and LLM companies
- Compare ease-of-use for ingesting our Wikipedia dumps and MR structured contents JSON to these plain text datasets
- Get feedback from clients on whether our Sections JSON is "good enough" for their plain text ingestion, or whether they would benefit from pure plain text
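
To make the ease-of-use comparison concrete, the sketch below shows roughly what a client would have to write to turn a sectioned JSON article into plain text. The field names (`name`, `sections`, `has_parts`, `type`, `value`) and the sample article are assumptions for illustration only and may not match our actual Sections JSON schema.

```python
import json


def flatten_parts(parts):
    """Recursively collect text lines from a nested list of parts.

    Assumes (hypothetically) that sections carry a "name" and nested
    "has_parts", and that leaf parts carry their text in "value".
    """
    lines = []
    for part in parts:
        if part.get("type") == "section":
            name = part.get("name")
            if name:
                lines.append(name)
            lines.extend(flatten_parts(part.get("has_parts", [])))
        elif "value" in part:
            lines.append(part["value"])
    return lines


def article_to_plain_text(article):
    """Join the article title and all section text with blank lines."""
    lines = [article.get("name", "")]
    lines.extend(flatten_parts(article.get("sections", [])))
    return "\n\n".join(line for line in lines if line)


# Hypothetical sample article, not real API output.
sample = json.loads("""
{
  "name": "Example",
  "sections": [
    {"type": "section", "name": "",
     "has_parts": [{"type": "paragraph", "value": "Intro text."}]},
    {"type": "section", "name": "History",
     "has_parts": [{"type": "paragraph", "value": "Early years."}]}
  ]
}
""")

plain_text = article_to_plain_text(sample)
print(plain_text)
```

If clients report that a step like this is trivial for them, that would suggest the Sections JSON is "good enough"; if they need extra handling for lists, tables, or references, that is a point in favor of a dedicated plain text output.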
Deliverable
Write a report on which datasets are popular and what kinds of use cases they open for WME. Evaluate whether plain text should be added to our APIs and, if so, whether it should be delivered within structured content or as compressed plain text dumps.