- Blocked by subtasks to prepare the scraper
- Choose a handful of wikis to test our scraper against: enwiki, dewiki, and fawiki (fawiki because it's common there to use `<ref>` tags without creating them via template)
- Pull the first 10k lines for these wikis, without decompressing or transferring the entire tarballs.
MIX_ENV=prod mix run cut-samples.exs enwiki dewiki fawiki
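The cut-samples step presumably streams the compressed dump and lets `head` cut it off, so the archive is never fully expanded or transferred. A minimal local sketch of that idea, assuming the dump is a tar.gz wrapping a single NDJSON member (the small generated file stands in for a real dump):

```shell
# Stand-in for a (huge) NDJSON dump member, then pack it the way the
# Enterprise dumps are packed (tar.gz).
seq 1 5000 > dump.ndjson
tar -czf dump.tar.gz dump.ndjson
rm dump.ndjson

# -xzO extracts the member to stdout; head stops the pipe after 1000 lines,
# so only a prefix of the archive is ever decompressed.
tar -xzOf dump.tar.gz | head -n 1000 > sample1000.ndjson
wc -l < sample1000.ndjson   # prints 1000
```

Against a remote dump the same pipeline would start from `curl -s <url> |` instead of a local file, so nothing beyond the sampled prefix gets transferred either.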
- Scrape each sample to a separate directory
mix run parse_wiki.exs samples/dewiki-NS0-20230320-ENTERPRISE-HTML-sample10000.ndjson --output reports/dewiki-20230320-sample10000-references.ndjson
mix run parse_wiki.exs samples/enwiki-NS0-20230320-ENTERPRISE-HTML-sample10000.ndjson --output reports/enwiki-20230320-sample10000-references.ndjson
mix run parse_wiki.exs samples/fawiki-NS0-20230320-ENTERPRISE-HTML-sample10000.ndjson --output reports/fawiki-20230320-sample10000-references.ndjson
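The three parse invocations differ only in the wiki name, so a loop can template them and avoid copy-paste errors. A dry-run sketch (it only prints the commands; it assumes the 20230320 dump date is shared by all three samples):

```shell
DATE=20230320   # dump date, assumed identical across the three wikis
for wiki in enwiki dewiki fawiki; do
  input="samples/${wiki}-NS0-${DATE}-ENTERPRISE-HTML-sample10000.ndjson"
  output="reports/${wiki}-${DATE}-sample10000-references.ndjson"
  # dry run: drop the leading echo to actually invoke mix
  echo mix run parse_wiki.exs "$input" --output "$output"
done
```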
- Run aggregation on each sample
mix run aggregate.exs reports/dewiki-20230320-sample10000-references.ndjson > reports/dewiki-20230320-sample10000-references-summary.ndjson
mix run aggregate.exs reports/enwiki-20230320-sample10000-references.ndjson > reports/enwiki-20230320-sample10000-references-summary.ndjson
mix run aggregate.exs reports/fawiki-20230320-sample10000-references.ndjson > reports/fawiki-20230320-sample10000-references-summary.ndjson
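The aggregate step can be looped the same way, with one difference worth noting: aggregate.exs writes its summary to stdout, so the output file comes from shell redirection rather than an --output flag. A dry-run sketch:

```shell
DATE=20230320
for wiki in enwiki dewiki fawiki; do
  report="reports/${wiki}-${DATE}-sample10000-references.ndjson"
  summary="reports/${wiki}-${DATE}-sample10000-references-summary.ndjson"
  # the summary is captured by redirecting stdout, not via a flag;
  # the command is only printed here
  echo "mix run aggregate.exs ${report} > ${summary}"
done
```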
- Post the results somewhere the whole team can review