Concurrency across wikis (status quo)
===
The scraper was originally written to parallelize across wikis, but we found this to be an inefficient dimension to split on.
Here are some of the shortcomings of this approach:
* When the job is nearing completion, we will likely be processing a handful of the largest wikis. Splitting by wiki yields a small number of very large jobs. We lose concurrency exactly when we need it most.
* If we also parallelize batches within each wiki, overall concurrency becomes M × N if left unregulated. More sophisticated (and fragile) process pooling becomes necessary, e.g. to cap API concurrency.
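To make the M × N point concrete, here is a minimal sketch of the kind of extra regulation that two-level parallelism forces on us: a global semaphore capping in-flight API calls across all wiki workers. All names (`fetch`, `scrape_wiki`, the wiki/article data) are hypothetical stand-ins, not the scraper's actual structure.

```python
import asyncio

async def fetch(article, api_sem):
    # A shared semaphore caps in-flight API requests across ALL wikis,
    # regardless of how many wiki and batch workers are running.
    async with api_sem:
        await asyncio.sleep(0)  # stand-in for the real API call
        return article

async def scrape_wiki(wiki, articles, api_sem):
    # Per-wiki fan-out: without the semaphore, total concurrency
    # would be (number of wikis) x (articles in flight per wiki).
    return await asyncio.gather(*(fetch(a, api_sem) for a in articles))

async def main():
    api_sem = asyncio.Semaphore(4)  # global API concurrency cap
    wikis = {"aawiki": ["A1", "A2"], "abwiki": ["B1"]}
    return await asyncio.gather(
        *(scrape_wiki(w, arts, api_sem) for w, arts in wikis.items())
    )

print(asyncio.run(main()))
```

The semaphore has to be threaded through every layer, which is exactly the fragility the bullet above complains about.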
Concurrency in batches (proposed)
===
This task recommends that concurrency be implemented //only// as small batches of articles.
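One way to picture the recommendation: flatten all articles (from every wiki) into one stream, cut it into small batches, and feed those to a fixed worker pool. Concurrency then stays level even when only the largest wikis remain. This is only an illustrative sketch under assumed names (`process_batch`, `scrape`), not the task's prescribed implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def batched(iterable, size):
    """Yield successive fixed-size batches from an iterable."""
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch

def process_batch(batch):
    # Stand-in for scraping one small batch of articles.
    return [f"scraped:{wiki}/{title}" for wiki, title in batch]

def scrape(articles, batch_size=50, workers=8):
    # Concurrency is a single fixed pool over article batches, so the
    # last stretch of a run (a few huge wikis) parallelizes as well as
    # the first, and there is only one knob to regulate.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = []
        for done in pool.map(process_batch, batched(articles, batch_size)):
            results.extend(done)
        return results

# Articles from many wikis, flattened into one stream.
stream = [("enwiki", f"Article_{i}") for i in range(5)] + [("frwiki", "Paris")]
print(scrape(stream, batch_size=2, workers=3))
```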
While we make this change, let's also refactor the mapdata wart into ordinary columns under the existing kartographer_maps analysis plugin, rather than keeping it as a standalone (and warty) pipeline stage. Note that this is expected to increase mapdata API concurrency; that should at least be monitored and ideally limited, but it's not obvious how to limit it without hurting overall concurrency.