Status | Assigned | Task
---|---|---
Resolved | awight | T347677 Run scraper on recent months for German Wikipedia to get reference dynamics over time
Open | None | T348100 Request: changelog for Enterprise API HTML dumps
Resolved | None | T348304 Scraper: mode to re-aggregate already existing page summaries
Resolved | None | T350145 Debug scraper "similar ref" measure
Event Timeline
Oops, something to remember for next time: I chose the 2023-07-01 dumps, but these expired at the three-month mark yesterday and the input file was removed. In the future, pick a newer or preferably the most recent dump.
To be transparent, we've run into some data quality issues which need to be resolved before this metric can be reliably calculated. Dump sizes fluctuate erratically and don't seem to track the actual article counts on each wiki.
@Lea_WMDE @Lena_WMDE Unfortunately I'll have to mark this task as stalled: I can't find two dewiki dump files with reasonably correct data. What I've discovered about the upstream data so far also throws the entire scraper results into doubt; we shouldn't trust any absolute numbers.
Perhaps we can leave our goals abstract for now, e.g. "a 10% increase in ref reuse over the next year", with a reliable calculation of the absolute numbers as one of the deliverables.
After checking the dumps from June, the dewiki page count does match the actual number of Main namespace pages. So we can at least rely on the static numbers for dewiki from our original scraper run.
I'm going to keep digging until I find another dump file with reasonably correct data.
Tech note about my ad-hoc approach to a health check:
```shell
tar xzf /mnt/nfs/dumps-clouddumps1001.wikimedia.org/other/enterprise_html/runs/20230801/dewiki-NS0-20230801-ENTERPRISE-HTML.json.tar.gz -O | wc -l
```
Then check that this row count is very close to the site-reported article count, which is currently 2,841,034.
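This could be automated. Here's a sketch of a pass/fail version of the same check; the `check_dump_size` function name and the 1% tolerance are my assumptions, not anything the scraper currently does:

```shell
# Flag a dump as suspect if its row count deviates more than 1%
# from the site-reported article count (threshold is arbitrary).
check_dump_size() {
  rows=$1
  live=$2
  diff=$(( rows > live ? rows - live : live - rows ))
  # diff * 100 <= live  <=>  diff <= 1% of live
  if [ $(( diff * 100 )) -le "$live" ]; then
    echo "OK: $rows rows vs $live articles"
  else
    echo "SUSPECT: $rows rows vs $live articles"
  fi
}

check_dump_size 2841100 2841034
check_dump_size 1900000 2841034
```

The live article count could come from the site statistics rather than being hard-coded, but the comparison logic is the interesting part.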
We now have two data points from two months apart: https://docs.google.com/spreadsheets/d/1q71Swzxpf2U4shhSJl8fry-CHg1RauJXBlXN_PHSlk4/edit#gid=1115046357
They show that 785,056 refs (+5.2%) and 42,497 pages were added in two months, and the average number of refs per page increased from 5.23 to 5.42. I'm not entirely comfortable with these numbers; they feel high. We should take additional data points and cross-check with other sources.
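For what it's worth, the reported deltas are at least internally consistent. A quick back-of-the-envelope check, using only the figures above (the implied totals are derived, not measured):

```shell
# Cross-check the two-month deltas for internal consistency.
awk 'BEGIN {
  refs_added = 785056             # reported as a +5.2% increase
  june_refs  = refs_added / 0.052 # implied June ref total
  june_pages = june_refs / 5.23   # implied June page count
  aug_refs   = june_refs + refs_added
  aug_pages  = june_pages + 42497
  printf "implied June refs: %.0f\n", june_refs
  printf "implied August refs/page: %.2f\n", aug_refs / aug_pages
}'
```

The implied August average comes out to 5.42, matching the reported figure, so whatever is wrong is consistent across the derived numbers.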
Cram time... Let's find the number of "similar" references added over two months:
```shell
gunzip -c < reports/2023-06-01/dewiki-20230601-page-summary.ndjson.gz \
  | jq ".similar_ref_count" \
  | awk '{s+=$1} END {print s}'
gunzip -c < reports/dewiki-20230801-page-summary.ndjson.gz \
  | jq ".similar_ref_count" \
  | awk '{s+=$1} END {print s}'
```
June: 10,844,568
August: 10,266,593
Aaaargh, the data is garbage again. Drilling down, I can find articles where the number of similar ref pairs is greater than the number of references, mostly in articles with hundreds of refs. My hunch is that we're not eliminating refs after they've been counted, so when 2 refs are similar they will be counted as 1 similarity, 3 refs will be counted as 3 similarities, and 4 refs as 6 similarities (triangular numbers).
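If that hunch is right, the inflation follows the triangular numbers: counting every unordered pair among n mutually similar refs yields n(n-1)/2 "similarities", whereas eliminating each ref once matched (one plausible fix) would yield n-1. A quick sketch of the two counting schemes:

```shell
# Naive pair counting vs. counting with elimination, for a group
# of n mutually similar refs.
for n in 2 3 4 5; do
  echo "n=$n pairs=$(( n * (n - 1) / 2 )) after-elimination=$(( n - 1 ))"
done
```

The pair column reproduces the 1, 3, 6 progression seen in the bad data, which also explains why articles with hundreds of refs are hit hardest: the over-count grows quadratically with group size.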