Investigate Hugging Face and popular plain text Wikipedia datasets (competitor analysis)
Open, Needs Triage, Public


Review the Wikipedia plain text datasets used by LLM and data science teams to evaluate whether our clients would benefit from a simple plain text output of our data.

  • Identify popular datasets used by data science teams and LLM companies
  • Compare the ease of ingesting our Wikipedia dumps and MR structured contents JSON with the ease of ingesting these plain text datasets
  • Get feedback from clients on whether our Sections JSON is "good enough" for their plain text ingestion, or whether they would benefit from pure plain text

Write a report on which datasets are popular and what kinds of use cases they open for WME. Evaluate whether plain text should be added to our APIs and, if so, whether it should be delivered within structured content or as plain compressed dumps.

Event Timeline

Reference links to the competitor datasets:

  • 8,500 downloads in the last month
  • 40K downloads in the last month