
Scraper: change concurrency to parallelize batches in one wiki
Closed, Resolved · Public

Description

Concurrency across wikis (status quo)

The scraper was originally written to parallelize across wikis, but we found this to be an inefficient dimension to split on.

Here are some of the shortcomings of this approach:

  • When the job is nearing completion, we will likely be processing a handful of the largest wikis. Splitting by wiki yields a small number of very large jobs. We lose concurrency exactly when we need it most.
  • If we also parallelize batches within each wiki, overall concurrency would be M × N (wikis × batches per wiki) if left unregulated. More sophisticated (and fragile) process pooling becomes necessary, e.g. to cap API concurrency.

Concurrency in batches (proposed)

This task recommends that concurrency be implemented only across small batches of articles within a single wiki.
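
A minimal sketch of what batch-level concurrency could look like, assuming a hypothetical per-batch worker function `scrape_batch` and arbitrary batch/worker sizes (none of these names are taken from the actual codebase):

```python
import concurrent.futures
from itertools import islice


def batched(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch


def scrape_wiki(article_titles, scrape_batch, batch_size=50, max_workers=4):
    """Scrape one wiki by fanning its articles out as small batches.

    Concurrency stays bounded by `max_workers` regardless of wiki size,
    so the tail end of the job (a few huge wikis) still keeps every
    worker busy.
    """
    with concurrent.futures.ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(scrape_batch, batch)
            for batch in batched(article_titles, batch_size)
        ]
        for future in concurrent.futures.as_completed(futures):
            future.result()  # surface any per-batch errors
```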

While we make this change, let's also refactor the mapdata wart into ordinary columns under the existing kartographer_maps analysis plugin rather than keeping it as a standalone (and warty) pipeline stage. Note: this is expected to increase mapdata API concurrency. That increase should at least be monitored and ideally limited, but it's not obvious how to do so without hurting overall concurrency.
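
One possible way to cap mapdata API calls without throttling the rest of the pipeline would be a shared semaphore held only around the HTTP request. This is a sketch, not code from the merge request; `fetch_mapdata`, the limit of 4, and the request parameters are all hypothetical.

```python
import threading

import requests

# Hypothetical cap: bounds concurrent mapdata API requests only, so the
# surrounding kartographer_maps analysis keeps its full batch concurrency.
MAPDATA_API_LIMIT = threading.BoundedSemaphore(4)


def fetch_mapdata(session: requests.Session, url: str, params: dict) -> dict:
    """Fetch one mapdata API response while holding the shared semaphore."""
    with MAPDATA_API_LIMIT:
        response = session.get(url, params=params)
    response.raise_for_status()
    return response.json()
```

The trade-off noted above remains: a small limit protects the API but can stall batch workers that are waiting on the semaphore.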

Implementation

Code to review: https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/merge_requests/86