
Scraper: mode to re-aggregate already existing page summaries
Closed, Resolved (Public)

Description

I've needed to run the scraper in this mode several times now, and each time I wrote a one-off patch. It isn't entirely trivial, because the automatic wiki discovery code needs to be either bypassed or retargeted to look for existing page summary files.
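
To illustrate the "retargeted discovery" option, here is a minimal sketch in Python. It is not the scraper's actual code: the function name `discover_summaries`, the directory layout, and the `-page-summary.ndjson` file suffix are all assumptions made up for the example. The idea is only that discovery walks an output directory for existing summary files instead of listing dumps to scrape.

```python
from pathlib import Path


def discover_summaries(root, suffix="-page-summary.ndjson"):
    """Map each wiki name to its existing page summary file under root.

    The suffix is a hypothetical naming convention, not the project's real one.
    """
    found = {}
    # rglob searches root and all subdirectories; sorting keeps runs reproducible.
    for path in sorted(Path(root).rglob(f"*{suffix}")):
        wiki = path.name[: -len(suffix)]  # e.g. "dewiki" from "dewiki-page-summary.ndjson"
        found[wiki] = path
    return found
```

A re-aggregation mode could then iterate over the returned mapping and feed each file to the aggregation step, skipping the dump download and page parsing entirely.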

Example use cases:

  • Implemented deduplication, want to reprocess files with this in place.
  • Added a new aggregate column which can be calculated from the old page summaries.

Schema changes would also be a concern here, but maybe aggregations can be made robust to missing input columns.
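
One way aggregations might tolerate older summaries is to treat an absent column as "no data" and count how often that happens, rather than failing. The sketch below assumes page summaries are newline-delimited JSON objects with numeric columns; that format, the function name `aggregate`, and the column names in the usage note are assumptions for illustration, not taken from the scraper.

```python
import json
from collections import Counter


def aggregate(lines, sum_columns):
    """Sum the given columns over page summary rows, tolerating missing columns."""
    totals = Counter()
    missing = Counter()
    pages = 0
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        row = json.loads(line)
        pages += 1
        for col in sum_columns:
            value = row.get(col)
            if value is None:
                # Column absent (or null) in this row, e.g. an older schema.
                missing[col] += 1
            else:
                totals[col] += value
    return {
        "pages": pages,
        "totals": {col: totals[col] for col in sum_columns},
        "missing": {col: missing[col] for col in sum_columns},
    }
```

For example, `aggregate(['{"ref_count": 2}', '{"ref_count": 3, "new_col": 1}'], ["ref_count", "new_col"])` sums `ref_count` over both rows and reports that `new_col` was missing from one of them. Reporting the missing count alongside the totals makes it visible when an aggregate is based on only part of the input.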

Code to review: https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/merge_requests/93