
Run scraper on samples from several wikis
Closed, Resolved (Public)

Description

  • Blocked by subtasks to prepare the scraper
  • Choose a handful of wikis to test our scraper against: enwiki, dewiki, and fawiki (the last because it's common there to use <ref> directly, without creating the tag via a template).
  • Pull the first 10k lines for these wikis, without decompressing or transferring the entire tarballs.
MIX_ENV=prod mix run cut-samples.exs enwiki dewiki fawiki
  • Scrape each sample to a separate directory
mix run parse_wiki.exs samples/dewiki-NS0-20230320-ENTERPRISE-HTML-sample10000.ndjson --output reports/dewiki-20230320-sample10000-references.ndjson
mix run parse_wiki.exs samples/enwiki-NS0-20230320-ENTERPRISE-HTML-sample10000.ndjson --output reports/enwiki-20230320-sample10000-references.ndjson
mix run parse_wiki.exs samples/fawiki-NS0-20230320-ENTERPRISE-HTML-sample10000.ndjson --output reports/fawiki-20230320-sample10000-references.ndjson
  • Run aggregation on each sample
mix run aggregate.exs reports/dewiki-20230320-sample10000-references.ndjson > reports/dewiki-20230320-sample10000-references-summary.ndjson
mix run aggregate.exs reports/enwiki-20230320-sample10000-references.ndjson > reports/enwiki-20230320-sample10000-references-summary.ndjson
mix run aggregate.exs reports/fawiki-20230320-sample10000-references.ndjson > reports/fawiki-20230320-sample10000-references-summary.ndjson
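
For illustration, the "first 10k lines without decompressing or transferring the entire tarball" step can be approximated with a plain shell pipeline. This is only a sketch of the streaming idea, not what cut-samples.exs actually does: the helper name first_lines is hypothetical, and it assumes the dump is a gzipped tar whose member is line-delimited JSON.

```shell
#!/bin/sh
# Sketch only: stream the first N lines out of a gzipped tarball.
# first_lines is a hypothetical helper, not part of cut-samples.exs.
first_lines() {
    archive=$1   # path to a .tar.gz file, or "-" to read the archive from stdin
    count=$2     # number of lines to keep
    # -O writes member contents to stdout instead of extracting to disk.
    # head exits after $count lines, which closes the pipe and stops tar
    # (and anything feeding it) before the rest of the archive is read.
    tar -xzOf "$archive" | head -n "$count"
}
```

With a remote dump, the same helper could read from a download stream, for example `curl -sL "$DUMP_URL" | first_lines - 10000 > samples/sample.ndjson`, where `$DUMP_URL` and the output path are placeholders. Because the pipe closes once head has enough lines, only the leading part of the archive should need to be transferred.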

Event Timeline

awight updated the task description.
awight moved this task from Doing to Done on the WMDE-TechWish-Sprint-2023-04-05 board.