We need another iteration of concurrency tuning after T345012, which switched to processing one wiki at a time and integrated API requests into the main pipeline.
As a side effect, the proportion of pages with mapdata now affects the ratio of CPU to I/O work, so the test wiki should have a proportion of pages with maps representative of the overall proportion across all wikis.
The goal is to maximize the number of pages processed over time, without pinning the CPUs at 100%.
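The knob being tuned can be sketched as a single bound on how many pages are in flight at once. This is a minimal illustration, not the pipeline's actual code; all names here (`process_page`, `process_wiki`, the sleep standing in for parse plus API work) are assumptions:

```python
import asyncio

# Hypothetical sketch of the tuning parameter: one "concurrency" knob
# bounds how many pages are processed simultaneously.
CONCURRENCY = 10  # the value being swept in the results below

async def process_page(page):
    # Placeholder for per-page work (parsing, and an API request
    # when the page has mapdata).
    await asyncio.sleep(0.001)
    return page

async def process_wiki(pages, concurrency=CONCURRENCY):
    sem = asyncio.Semaphore(concurrency)

    async def bounded(page):
        async with sem:  # at most `concurrency` pages in flight
            return await process_page(page)

    return await asyncio.gather(*(bounded(p) for p in pages))

results = asyncio.run(process_wiki(range(100)))
```

Raising `CONCURRENCY` trades CPU saturation against idle I/O wait, which is what the sweep below measures.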
Test conditions
plwikivoyage 20230820 snapshot: 13,095 pages, 664 of which include maps. Run on the Cloud VPS server with filesystem access to the dump file.
Results
concurrency 320: 27s
concurrency 160: 35s
concurrency 80: 45s
concurrency 40: 73s
concurrency 30: 70s
concurrency 20: 49s
concurrency 15: 41s
concurrency 12: 38s
concurrency 11: 40s
concurrency 10: 36s
concurrency 9: 43s
concurrency 7: 56s
concurrency 5: 70s
Timings are repeatable to within a few seconds across runs.
This overall picture is interesting because two minima emerge. The minimum at high concurrency exists because of massively parallel API requests, up to at least 36 concurrent requests when sampling network connections manually. However, it's considered rude to use this configuration; the API:Etiquette page even recommends making all requests serially.
Zooming in on the lower minimum, the sweet spot is at concurrency=10, which resulted in an API concurrency of 5-7 and almost the same throughput as the much higher concurrency setting: 360 pages per second, and an average of roughly 2 pages with maps per second. This also has implications for the type of machine we should run on: an 8-core box would be sufficient.
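As a quick sanity check of the headline throughput figure, dividing the page count from the test conditions by the concurrency=10 run time:

```python
pages = 13_095     # pages in the plwikivoyage 20230820 snapshot
runtime_s = 36     # wall-clock seconds at concurrency 10
throughput = pages / runtime_s
print(round(throughput))  # roughly 360 pages per second
```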
Implementation
Code to review: