Create structured contents snapshot of people articles
Open, Needs Triage · Public

Description

As a PM, I want to create a sample set of Structured Contents for EN Wikipedia namespace 0 articles about people that have infoboxes, so that I can share it with researchers.

Acceptance criteria
A snapshot of EN Wikipedia that contains only articles about people, each with its infoboxes parsed.

ToDo

  • Download the English Wikipedia snapshot
  • Create a subset of all EN Wikipedia pages about people (approx. 2M)
  • Run the infobox parser on that subset of pages
  • Create a new snapshot of people articles with parsed infoboxes
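
The filtering step in the list above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the pipeline that will actually be used: it assumes the snapshot is newline-delimited JSON with one article per line, that each article carries a `main_entity.identifier` Wikidata QID and an `infoboxes` list, and that a set of person QIDs (`people_qids`, hypothetical) has been prepared beforehand. The function name and all field names are placeholders.

```python
import json
from typing import Iterable, Iterator, Set


def filter_people_with_infoboxes(
    lines: Iterable[str], people_qids: Set[str]
) -> Iterator[dict]:
    """Yield articles about people that have at least one parsed infobox.

    Assumed (hypothetical) record layout, one JSON object per line:
      {"name": ..., "main_entity": {"identifier": "Q..."}, "infoboxes": [...]}
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the NDJSON input
        article = json.loads(line)
        qid = (article.get("main_entity") or {}).get("identifier")
        # Keep the article only if it is in the people subset
        # and the infobox parser produced a non-empty result.
        if qid in people_qids and article.get("infoboxes"):
            yield article


def write_snapshot(articles: Iterable[dict], out_path: str) -> int:
    """Write the filtered articles as NDJSON and return how many were written."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for article in articles:
            out.write(json.dumps(article, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Because the filter is a generator, the input can be streamed line by line from an open file handle, so the full snapshot never has to be held in memory.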
Description

This will open up collaboration opportunities that help quantify the quality of WME data in the LLM space.

Event Timeline

Protsack.stephan renamed this task from Create full People sampleset to Create structured contents snapshot of people articles. Wed, May 15, 12:00 PM
Protsack.stephan updated the task description.
Protsack.stephan updated the task description.