
Scraper: change concurrency to parallelize batches in one wiki
Closed, Resolved · Public

Description

Concurrency across wikis (status quo)

The scraper was originally written to parallelize across wikis, but we found this to be an inefficient dimension to split on.

Here are some of the shortcomings of this approach:

  • When the job is nearing completion, we will likely be processing a handful of the largest wikis. Splitting by wiki yields a small number of very large jobs. We lose concurrency exactly when we need it most.
  • If we also parallelize batches within each wiki, overall concurrency would be M × N (wikis × batches per wiki) if left unregulated. More sophisticated (and fragile) process pooling becomes necessary, e.g. to cap API concurrency.

Concurrency in batches (proposed)

This task recommends that concurrency be implemented only across small batches of articles within a single wiki.
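
A minimal sketch of what batch-level concurrency could look like, assuming a hypothetical per-batch worker function `scrape_batch` and arbitrary batch/worker sizes (none of these names are taken from the actual codebase):

```python
import concurrent.futures
from itertools import islice


def batched(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch


def scrape_wiki(article_titles, scrape_batch, batch_size=50, max_workers=4):
    """Scrape one wiki by fanning its articles out as small batches.

    Concurrency stays bounded by `max_workers` regardless of wiki size,
    so the tail end of the job (a few huge wikis) still keeps every
    worker busy.
    """
    with concurrent.futures.ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(scrape_batch, batch)
            for batch in batched(article_titles, batch_size)
        ]
        for future in concurrent.futures.as_completed(futures):
            future.result()  # surface any per-batch errors
```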

While we make this change, let's also refactor the mapdata wart into ordinary columns under the existing kartographer_maps analysis plugin rather than keeping it as a standalone (and warty) pipeline stage. Note: this is expected to increase mapdata API concurrency. That increase should at least be monitored and ideally limited, but it's not obvious how to do so without hurting overall concurrency.
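
One possible way to cap mapdata API calls without throttling the rest of the pipeline would be a shared semaphore held only around the HTTP request. This is a sketch, not code from the merge request; `fetch_mapdata`, the limit of 4, and the request parameters are all hypothetical.

```python
import threading

import requests

# Hypothetical cap: bounds concurrent mapdata API requests only, so the
# surrounding kartographer_maps analysis keeps its full batch concurrency.
MAPDATA_API_LIMIT = threading.BoundedSemaphore(4)


def fetch_mapdata(session: requests.Session, url: str, params: dict) -> dict:
    """Fetch one mapdata API response while holding the shared semaphore."""
    with MAPDATA_API_LIMIT:
        response = session.get(url, params=params)
    response.raise_for_status()
    return response.json()
```

The trade-off noted above remains: a small limit protects the API but can stall batch workers that are waiting on the semaphore.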

Implementation

Code to review: https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/merge_requests/86