Change Details

**Goal**
- Surface metric numbers from scraper data
- Think about how we support self-serve for accessing scraper data

**Steps**
- See spreadsheet prepared in {T363327}.
- Retrieve the data from the results
- Document instructions on how the data can be found/extracted

**Metrics**: Should be retrievable from current scraper results

[x] # of duplicate (identical) refs in a given wiki
  * `identical_refs_count` in column E gives the absolute number of identical refs.
[x] # of articles with at least one identical ref
  * `pages_with_identical_refs_count` in column J
  * `proportion_of_pages_with_identical_refs` in column AF for this number as a proportion of total pages.
[] # of articles with more than 25 refs that have at least one identical reference
[] proportion of duplicate refs in articles with >25 refs vs. proportion of duplicates in articles with <25 refs, split by wiki.
  - Assumption: longer reference lists have more duplicates because they are hard to find and manage
[x] # of articles without references
  * `pages_with_refs_count` in column O for the number of pages with at least one ref.
  * `proportion_of_pages_with_refs` in column AI for this number as a proportion of total pages.
  * Requested metric can be found with `page_count` - `pages_with_refs_count`
[] ratio of references to paragraphs per wiki **( TBD: Can we even do that without a code change to the scraper and a re-run? )**
  * `wikitext_length_average` in column C is a good proxy for paragraph count.
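
The "articles without references" metric above is derived rather than read directly from a column, so a small sketch may help with self-serve extraction. It assumes the spreadsheet has been exported as CSV with a header row using the column names quoted above. The `wiki` column name and every number in the sample are made-up placeholders, not real scraper results.

```python
import csv
import io

# Hypothetical CSV export of the spreadsheet. The `wiki` column name and all
# numbers are illustrative placeholders, not actual scraper output.
SAMPLE_CSV = """wiki,page_count,pages_with_refs_count
examplewiki,1000,750
otherwiki,200,50
"""


def pages_without_refs(csv_text):
    """Return {wiki: (count, proportion)} for pages that have no refs."""
    results = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        page_count = int(row["page_count"])
        with_refs = int(row["pages_with_refs_count"])
        # Requested metric: page_count - pages_with_refs_count
        without_refs = page_count - with_refs
        # Guard against an empty wiki to avoid dividing by zero.
        proportion = without_refs / page_count if page_count else 0.0
        results[row["wiki"]] = (without_refs, proportion)
    return results


if __name__ == "__main__":
    for wiki, (count, proportion) in pages_without_refs(SAMPLE_CSV).items():
        print(f"{wiki}: {count} pages without refs ({proportion:.1%})")
```

The same pattern (read the named columns, then subtract or divide) should apply to the other derived numbers, provided the export keeps the header names unchanged.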