
Run scraper on recent months for German Wikipedia to get reference dynamics over time
Closed, Resolved, Public

Event Timeline

Oops, something to remember for next time: I chose the 2023-07-01 dump, but it reached its 3-month expiry yesterday and the input file was removed. In future, pick a newer dump, preferably the most recent one.

To be transparent, we've run into some data quality issues which need to be resolved before this metric can be reliably calculated. Dump sizes fluctuate erratically and don't seem to track the actual number of articles on each wiki.

awight added subscribers: Lena_WMDE, Lea_WMDE.

@Lea_WMDE @Lena_WMDE Unfortunately I'll have to mark this task as stalled: I can't find two dewiki dump files which have reasonably correct data. What I've discovered about the upstream data so far also throws the entire set of scraper results into doubt, so we shouldn't trust any absolute numbers.

Perhaps we can leave our goals abstract for now, i.e. "10% increase in ref reuse over the next year", with a reliable calculation of the absolute numbers as one of the deliverables.

After checking the dumps from June, the dewiki page count does match the actual number of Main namespace pages. So we can at least rely on the static numbers for dewiki from our original scraper run.

I'm going to keep digging until I find another dump file with reasonably correct data.

Tech note about my ad-hoc approach to a health check:

tar xzf /mnt/nfs/dumps-clouddumps1001.wikimedia.org/other/enterprise_html/runs/20230801/dewiki-NS0-20230801-ENTERPRISE-HTML.json.tar.gz -O | wc -l

Then check to see that this row count is very close to the site-reported article count which is currently 2,841,034.
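The comparison itself can be scripted. A minimal sketch, assuming a 1% tolerance counts as "very close"; the helper name `within_tolerance`, the tolerance, and the sample row count are made up for illustration, only the article count comes from above:

```shell
# Hypothetical helper: succeeds when the dump's row count is within
# tolerance_pct percent of the site-reported article count.
within_tolerance() {
  local rows=$1 expected=$2 tolerance_pct=${3:-1}
  awk -v r="$rows" -v e="$expected" -v t="$tolerance_pct" 'BEGIN {
    d = r - e
    if (d < 0) d = -d              # absolute difference
    if (d * 100 <= e * t) exit 0   # close enough
    exit 1
  }'
}

# 2839000 is an invented row count; 2841034 is the article count quoted above.
if within_tolerance 2839000 2841034 1; then
  echo "dump looks plausible"
else
  echo "dump looks suspect"
fi
```

In practice the first argument would be the output of the `tar ... | wc -l` pipeline above.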

We now have two data points from two months apart: https://docs.google.com/spreadsheets/d/1q71Swzxpf2U4shhSJl8fry-CHg1RauJXBlXN_PHSlk4/edit#gid=1115046357

They show that 785,056 refs (+5.2%) and 42,497 pages were added in two months, and the average number of refs per page increased from 5.23 to 5.42. I'm not entirely comfortable with these numbers; they feel high. We should take additional data points and cross-check with other sources.
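One cheap cross-check needs no extra data: the quoted figures should at least agree with each other. A rough sketch, assuming the +5.2% is relative to the earlier total and that the averages are total refs divided by total pages; the function name is hypothetical, and all inputs are the rounded values quoted above, so the result is only approximate:

```shell
# Hypothetical helper: derive the implied number of added pages from
# refs added, percentage growth, and the before/after refs-per-page averages.
implied_pages_added() {
  awk -v added="$1" -v pct="$2" -v avg_before="$3" -v avg_after="$4" 'BEGIN {
    refs_before  = added / (pct / 100)   # total refs at the first data point
    refs_after   = refs_before + added
    pages_before = refs_before / avg_before
    pages_after  = refs_after / avg_after
    printf "%d\n", pages_after - pages_before
  }'
}

implied_pages_added 785056 5.2 5.23 5.42
```

With these inputs the implied page growth lands in the same ballpark as the reported 42,497, so the figures are at least internally consistent. That says nothing about whether the underlying dumps are correct.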

awight removed awight as the assignee of this task. Oct 6 2023, 7:51 AM

Now processing the 2023-10-01 dump as a sanity check.

Cram time... Let's find the number of "similar" references added over two months:

gunzip -c < reports/2023-06-01/dewiki-20230601-page-summary.ndjson.gz | jq ".similar_ref_count" | tee dewiki-20230601-similar_ref_count.txt | awk '{s+=$1} END {print s}'
gunzip -c < reports/dewiki-20230801-page-summary.ndjson.gz | jq ".similar_ref_count" | tee dewiki-20230801-similar_ref_count.txt | awk '{s+=$1} END {print s}'

June: 10,844,568
August: 10,266,593

Aaaargh, the data is garbage again. Drilling down, I can find articles where the number of similar ref pairs is greater than the number of references, mostly in articles with hundreds of refs. My hunch is that we're not eliminating refs after they've been counted, so every pair within a group of similar refs is counted separately: 2 similar refs are counted as 1 similarity, 3 refs as 3 similarities, and 4 refs as 6 similarities (triangular numbers, i.e. n(n-1)/2 for n mutually similar refs).
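To illustrate the arithmetic behind this hunch, here is a small sketch. It is not the scraper's actual code. `pair_count` models the suspected behaviour (one count per unordered pair), and `redundant_count` is one possible alternative that counts only the refs repeating an earlier one; both names are hypothetical:

```shell
# Unordered pairs among n mutually similar refs: n*(n-1)/2.
pair_count() {
  local n=$1
  echo $(( n * (n - 1) / 2 ))
}

# Alternative: each ref after the first in a group counts once (n-1).
redundant_count() {
  local n=$1
  echo $(( n > 0 ? n - 1 : 0 ))
}

for n in 2 3 4 10; do
  printf '%s refs: %s pairs, %s redundant\n' \
    "$n" "$(pair_count "$n")" "$(redundant_count "$n")"
done
```

For 10 mutually similar refs the pair count is already 45, more than the number of refs in the group. That matches the symptom seen in articles with hundreds of refs.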

awight claimed this task.
awight moved this task from Watching / Epic to Done on the WMDE-TechWish-Maintenance-2023 board.