
Scraper: tune concurrency after changes to forking strategy
Closed, Resolved · Public

Description

We need another iteration of concurrency tuning after T345012, which switched to processing one wiki at a time and also integrated API requests into the main pipeline.

The proportion of pages with mapdata now affects the ratio of CPU to IO (not ideal), so for this task we should pick a test wiki whose proportion of pages with maps is representative of the overall proportion across all wikis.

The goal is to maximize the number of pages processed over time, without pinning the CPUs at 100%.
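The tuning procedure amounts to sweeping candidate concurrency levels and measuring pages processed per second at each. A minimal asyncio sketch of that loop follows; the per-page work is a stand-in (`asyncio.sleep`), and the function names and page counts are illustrative, not the scraper's actual code:

```python
import asyncio
import time

async def process_page(page, limit):
    """Stand-in for real page work (parse a dump entry, maybe call the API)."""
    async with limit:
        await asyncio.sleep(0.001)

async def throughput(n_pages, concurrency):
    """Process n_pages with at most `concurrency` in flight; return pages/s."""
    limit = asyncio.Semaphore(concurrency)
    start = time.monotonic()
    await asyncio.gather(*(process_page(p, limit) for p in range(n_pages)))
    return n_pages / (time.monotonic() - start)

for c in (5, 10, 20, 40):
    rate = asyncio.run(throughput(500, c))
    print(f"concurrency {c}: {rate:.0f} pages/s")
```

With real CPU- and IO-bound work in place of the sleep, the sweep surfaces the kind of throughput curve recorded in the results below.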

Test conditions

plwikivoyage 20230820 snapshot, which includes 13,095 pages, 664 of which include maps. Runs were on the Cloud VPS server with filesystem access to the dump file.

Results

concurrency 320: 27s
concurrency 160: 35s
concurrency 80: 45s
concurrency 40: 73s
concurrency 30: 70s
concurrency 20: 49s
concurrency 15: 41s
concurrency 12: 38s
concurrency 11: 40s
concurrency 10: 36s
concurrency 9: 43s
concurrency 7: 56s
concurrency 5: 70s

Timings are repeatable to within a few seconds across runs.

(Worksheet with charts)

image.png (383×611 px, 21 KB)

This overall picture is interesting because two minima emerge. The higher minimum comes from massively parallel API requests: sampling network connections manually showed at least 36 concurrent requests. However, it would be rude to use this configuration; the API:Etiquette page even recommends making all requests serially.

Zooming in on the lower minimum,

image.png (375×607 px, 23 KB)

The sweet spot is at concurrency=10, which results in an API concurrency of 5-7 and almost the same throughput as the much higher concurrency settings: 360 pages per second, and an average of roughly 2 pages with maps per second. This also has implications for the type of machine we should run on: an 8-core box would be sufficient.

Implementation

Code to review:

Event Timeline

A total of 107M pages were found in these dumps; 5.1M of them had a map, i.e. 4.8% of all pages.

Picking a wiki of 10-20k pages gives a total run time of several minutes, so plwikivoyage satisfies our criteria at 13k pages, 5% of which have a map.

awight moved this task from Backlog to Review on the WMDE-TechWish-Maintenance-2023 board.
awight updated the task description.

This all sounds very reasonable. Unfortunately, the one unmerged patch is still a draft. How should we proceed?

Note for the future: if the Kartographer analysis module is disabled, we can turn concurrency up by a lot, and should run the tuning again. In general, API concurrency should be limited separately from the rest of the process; the nice way to do this would be to isolate a request pool and allow other processing to continue.
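An isolated request pool could be sketched with two independent semaphores, so that API politeness does not throttle CPU-bound dump processing. This is a hypothetical asyncio sketch; the pool sizes, sleeps, and helper names are illustrative, not the scraper's actual code:

```python
import asyncio

async def fetch_mapdata(page_id, api_pool):
    # Only API calls contend for this small, etiquette-friendly pool.
    async with api_pool:
        await asyncio.sleep(0.005)  # stand-in for the HTTP request
        return {"page": page_id, "mapdata": []}

async def process_page(page_id, has_map, page_pool, api_pool):
    # All pages contend for the larger processing pool.
    async with page_pool:
        await asyncio.sleep(0.001)  # stand-in for parsing the dump entry
        if has_map:
            return await fetch_mapdata(page_id, api_pool)
        return {"page": page_id}

async def main():
    page_pool = asyncio.Semaphore(10)  # overall concurrency from the tuning above
    api_pool = asyncio.Semaphore(6)    # separate, smaller API limit
    # ~5% of pages have maps, matching the proportion seen in the dumps.
    pages = [(i, i % 20 == 0) for i in range(200)]
    return await asyncio.gather(
        *(process_page(i, m, page_pool, api_pool) for i, m in pages))

results = asyncio.run(main())
print(len(results))
```

Because the API pool is independent of the page pool, pages without maps keep flowing at full speed even while the few map requests wait their turn.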