Status | Assigned | Task
---|---|---
Resolved | awight | T347677 Run scraper on recent months for German Wikipedia to get reference dynamics over time
Open | None | T348100 Request: changelog for Enterprise API HTML dumps
Resolved | None | T348304 Scraper: mode to re-aggregate already existing page summaries
Resolved | None | T350145 Debug scraper "similar ref" measure
Event Timeline
Oops, something to remember for next time: I chose the 2023-07-01 dumps, but these expired at the three-month mark yesterday and the input file was removed. In the future, pick a newer or preferably the most recent dump.
To be transparent, we've run into some data quality issues which need to be resolved before this metric can be reliably calculated. Dump sizes fluctuate erratically and don't seem to track the actual article counts on each wiki.
@Lea_WMDE @Lena_WMDE Unfortunately I'll have to mark this task as stalled: I can't find two dewiki dump files with reasonably correct data. What I've discovered about the upstream data so far also throws the entire scraper results into doubt; we shouldn't trust any absolute numbers.
Perhaps we can leave our goals abstract for now, e.g. "a 10% increase in ref reuse over the next year", with a reliable calculation of the absolute numbers as one of the deliverables.
After checking the dumps from June, the dewiki page count does match the actual number of Main namespace pages. So we can at least rely on the static numbers for dewiki from our original scraper run.
I'm going to keep digging until I find another dump file with reasonably correct data.
Tech note about my ad-hoc approach to a health check:
```shell
tar xzf /mnt/nfs/dumps-clouddumps1001.wikimedia.org/other/enterprise_html/runs/20230801/dewiki-NS0-20230801-ENTERPRISE-HTML.json.tar.gz -O | wc -l
```
Then check that this row count is very close to the site-reported article count, which is currently 2,841,034.
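This could be automated. Here's a sketch of a pass/fail version of the same check; the `check_dump_size` function name and the 1% tolerance are my assumptions, not anything the scraper currently does:

```shell
# Flag a dump as suspect if its row count deviates more than 1%
# from the site-reported article count (threshold is arbitrary).
check_dump_size() {
  rows=$1
  live=$2
  diff=$(( rows > live ? rows - live : live - rows ))
  # diff * 100 <= live  <=>  diff <= 1% of live
  if [ $(( diff * 100 )) -le "$live" ]; then
    echo "OK: $rows rows vs $live articles"
  else
    echo "SUSPECT: $rows rows vs $live articles"
  fi
}

check_dump_size 2841100 2841034
check_dump_size 1900000 2841034
```

The live article count could come from the site statistics rather than being hard-coded, but the comparison logic is the interesting part.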
We now have two data points from two months apart: https://docs.google.com/spreadsheets/d/1q71Swzxpf2U4shhSJl8fry-CHg1RauJXBlXN_PHSlk4/edit#gid=1115046357
They show that 785,056 refs (+5.2%) and 42,497 pages were added in two months, and the average number of refs per page increased from 5.23 to 5.42. I'm not entirely comfortable with these numbers; they feel high. We should take additional data points and cross-check with other sources.
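For what it's worth, the reported deltas are at least internally consistent. A quick back-of-the-envelope check, using only the figures above (the implied totals are derived, not measured):

```shell
# Cross-check the two-month deltas for internal consistency.
awk 'BEGIN {
  refs_added = 785056             # reported as a +5.2% increase
  june_refs  = refs_added / 0.052 # implied June ref total
  june_pages = june_refs / 5.23   # implied June page count
  aug_refs   = june_refs + refs_added
  aug_pages  = june_pages + 42497
  printf "implied June refs: %.0f\n", june_refs
  printf "implied August refs/page: %.2f\n", aug_refs / aug_pages
}'
```

The implied August average comes out to 5.42, matching the reported figure, so whatever is wrong is consistent across the derived numbers.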
Cram time... Let's find the number of "similar" references added over two months:
```shell
gunzip -c < reports/2023-06-01/dewiki-20230601-page-summary.ndjson.gz \
  | jq ".similar_ref_count" \
  | awk '{s+=$1} END {print s}'
gunzip -c < reports/dewiki-20230801-page-summary.ndjson.gz \
  | jq ".similar_ref_count" \
  | awk '{s+=$1} END {print s}'
```
June: 10,844,568
August: 10,266,593
Aaaargh, the data is garbage again. Drilling down, I can find articles where the number of similar ref pairs is greater than the number of references, mostly in articles with hundreds of refs. My hunch is that we're not eliminating refs after they've been counted, so when 2 refs are similar they will be counted as 1 similarity, 3 refs will be counted as 3 similarities, and 4 refs as 6 similarities (triangular numbers).
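If that hunch is right, the inflation follows the triangular numbers: counting every unordered pair among n mutually similar refs yields n(n-1)/2 "similarities", whereas eliminating each ref once matched (one plausible fix) would yield n-1. A quick sketch of the two counting schemes:

```shell
# Naive pair counting vs. counting with elimination, for a group
# of n mutually similar refs.
for n in 2 3 4 5; do
  echo "n=$n pairs=$(( n * (n - 1) / 2 )) after-elimination=$(( n - 1 ))"
done
```

The pair column reproduces the 1, 3, 6 progression seen in the bad data, which also explains why articles with hundreds of refs are hit hardest: the over-count grows quadratically with group size.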