
Debug why the dump scraper isn't fully concurrent
Closed, Declined · Public

Description

The scraper has configurable concurrency at the top level, processing each wiki in a separate thread. Overall concurrency can be set via a variable in config/prod.exs.
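A minimal sketch of what that setting might look like, assuming a standard Elixir config file; the application and key names here are assumptions, not the scraper's actual configuration:

```elixir
# config/prod.exs (hypothetical sketch)
import Config

config :scrape_wiki_html_dump,
  # number of wikis processed in parallel at the top level (name assumed)
  max_concurrency: 8
```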

However, in practice concurrency starts at the intended level but quickly drops to something very low, with only 1-4 output files being actively written. Why is this happening? The most likely explanation is that the worker threads are blocking on a shared resource, probably the mapdata API requests. This could happen if the HTTP library defaults to a single shared connection or worker, effectively serializing the requests.
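To illustrate the hypothesis: if the scraper uses HTTPoison/hackney (an assumption), a small shared connection pool would throttle all wiki workers to a handful of in-flight mapdata requests. A sketch of one thing to test, with a dedicated, larger pool; the pool name and sizes are placeholders:

```elixir
# Hypothetical diagnostic sketch, assuming HTTPoison/hackney is the HTTP client.
# A shared pool with few connections serializes requests across wiki workers;
# giving the mapdata fetch its own, larger pool would test that theory.
:hackney_pool.start_pool(:mapdata, timeout: 15_000, max_connections: 50)

# Requests would then opt into that pool explicitly:
HTTPoison.get!("https://example.org/w/api.php?action=query", [],
  hackney: [pool: :mapdata]
)
```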

Event Timeline

Diagnostic patch: https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/merge_requests/82

The production run appears to be about 97% of the way through parsing the input file of the very last wiki (enwiki) before the final aggregation. There would be no benefit in stopping now, but once that phase is complete we may want to stop ahead of the mapdata step to merge the patch for higher API concurrency. However, if the mapdata fetch is already mostly complete we should not interrupt it, because that step is unfortunately not restartable (to be confirmed).
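For reference, "higher API concurrency" in a step like the mapdata fetch could look like the sketch below, assuming something like Task.async_stream drives the requests; the module, function, and URL list are placeholders, not the patch itself:

```elixir
# Hypothetical sketch of raising per-step concurrency for an API fetch.
defmodule MapdataFetch do
  def run(urls) do
    urls
    |> Task.async_stream(&fetch_mapdata/1,
      # default max_concurrency is System.schedulers_online(); raising it lets
      # the step be bounded by the API rather than by local CPU count
      max_concurrency: 16,
      timeout: 60_000
    )
    |> Enum.map(fn {:ok, result} -> result end)
  end

  # stand-in for the real HTTP call
  defp fetch_mapdata(url), do: {:ok, url}
end
```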

I couldn't resist taking a look, so I'm moving this to our current backlog.