Scraper: mode to re-aggregate already existing page summaries
Closed, Resolved, Public

Assigned To

None

Authored By

	awight
	Oct 6 2023, 7:00 AM

Description

I've needed to run the scraper in this mode several times now, and each time I wrote a one-off patch. It's not entirely trivial because the automatic wiki discovery code needs to be either bypassed or retargeted to look for existing page summary files.
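
As a rough illustration of the "retargeted discovery" option, the sketch below lists page summary files that already exist in an output directory instead of discovering wikis from a dump. It is written in Python for illustration only and is not taken from the scraper. The function name `find_page_summaries` and the `-page-summary.ndjson` suffix are assumptions.

```python
from pathlib import Path


def find_page_summaries(output_dir, suffix="-page-summary.ndjson"):
    """Map each wiki name to its existing page summary file.

    The suffix is a hypothetical naming scheme, not necessarily the one
    the scraper really uses.
    """
    found = {}
    for path in sorted(Path(output_dir).glob(f"*{suffix}")):
        # Strip the suffix to recover the wiki name, e.g. "dewiki".
        wiki = path.name[: -len(suffix)]
        found[wiki] = path
    return found
```

The returned mapping could then stand in for the list of wikis that the normal discovery step would have produced.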

Example use cases:

Implemented deduplication, want to reprocess files with this in place.
Added a new aggregate column which can be calculated from the old page summaries.

Schema changes would also be a concern here, but maybe aggregations can be made robust to missing input columns.
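
One possible shape for a re-aggregation that tolerates missing input columns and applies deduplication is sketched below. It is Python for illustration only, assuming page summaries are stored as one JSON object per line. The column names (`page_id`, `ref_count`, `ref_reuse_count`) and the function `reaggregate` are hypothetical.

```python
import json


def reaggregate(lines, sum_columns, key="page_id"):
    """Sum numeric columns over page summaries, one JSON object per line.

    Rows repeating an already-seen key are skipped (deduplication), and a
    column missing from a row counts as 0 instead of raising an error.
    All column names are hypothetical.
    """
    totals = {column: 0 for column in sum_columns}
    seen = set()
    pages = 0
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        row = json.loads(line)
        row_key = row.get(key)
        if row_key is not None:
            if row_key in seen:
                continue  # duplicate page summary
            seen.add(row_key)
        pages += 1
        for column in sum_columns:
            # Old summaries may predate a column; treat it as 0.
            totals[column] += row.get(column) or 0
    return {"page_count": pages, **totals}
```

Treating a missing column as 0 is only one choice. For averages or ratios it may be better to track how many rows actually carried the column.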

Code to review: https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/merge_requests/93

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		awight	T347677 Run scraper on recent months for German Wikipedia to get reference dynamics over time
		Resolved		None	T348304 Scraper: mode to re-aggregate already existing page summaries

Event Timeline

awight created this task. Oct 6 2023, 7:00 AM

Restricted Application added a subscriber: Aklapper. Oct 6 2023, 7:00 AM

awight moved this task from Backlog to Doing on the WMDE-TechWish-Maintenance-2023 board. Oct 6 2023, 7:01 AM

awight removed awight as the assignee of this task. Oct 6 2023, 7:38 AM

awight moved this task from Doing to Review on the WMDE-TechWish-Maintenance-2023 board.

awight updated the task description.

WMDE-Fisch moved this task from Review to Done on the WMDE-TechWish-Maintenance-2023 board. Oct 9 2023, 7:53 AM

Tobi_WMDE_SW added a project: WMDE-TechWish-Sprint-2023-11-22. Nov 22 2023, 7:14 AM

Tobi_WMDE_SW moved this task from Sprint Backlog to Demo on the WMDE-TechWish-Sprint-2023-11-22 board. Nov 22 2023, 7:16 AM

thiemowmde moved this task from Demo to Done on the WMDE-TechWish-Sprint-2023-11-22 board. Nov 23 2023, 9:09 AM

thiemowmde closed this task as Resolved. Dec 13 2023, 11:19 AM
