
Debug why the dump scraper isn't fully concurrent
Closed, Declined · Public

Description

The scraper has configurable concurrency at the top level, processing each wiki in a separate thread. Overall concurrency can be set via a variable in config/prod.exs.
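A minimal sketch of what that setting might look like, assuming a standard Elixir config file; the application and key names here are assumptions, not the scraper's actual configuration:

```elixir
# config/prod.exs (hypothetical sketch)
import Config

config :scrape_wiki_html_dump,
  # number of wikis processed in parallel at the top level (name assumed)
  max_concurrency: 8
```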

However, in practice concurrency starts at the intended level but quickly drops to something very low, with only 1-4 output files being actively written. Why is this happening? The most likely explanation is that the worker threads are blocking on a shared resource, probably the mapdata API requests. This could happen if the HTTP library defaults to a single shared connection or worker, effectively serializing the requests.
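To illustrate the hypothesis: if the scraper uses HTTPoison/hackney (an assumption), a small shared connection pool would throttle all wiki workers to a handful of in-flight mapdata requests. A sketch of one thing to test, with a dedicated, larger pool; the pool name and sizes are placeholders:

```elixir
# Hypothetical diagnostic sketch, assuming HTTPoison/hackney is the HTTP client.
# A shared pool with few connections serializes requests across wiki workers;
# giving the mapdata fetch its own, larger pool would test that theory.
:hackney_pool.start_pool(:mapdata, timeout: 15_000, max_connections: 50)

# Requests would then opt into that pool explicitly:
HTTPoison.get!("https://example.org/w/api.php?action=query", [],
  hackney: [pool: :mapdata]
)
```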

Event Timeline

Diagnostic patch: https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/merge_requests/82

The production run appears to be about 97% of the way through parsing the input file of the very last wiki (enwiki) before the final aggregation. There would be no benefit in stopping now, but once that phase is complete we may want to stop ahead of the mapdata step to merge the patch for higher API concurrency. However, if the mapdata fetch is already mostly complete we should not interrupt it, because that step is unfortunately not restartable (to be confirmed).
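For reference, "higher API concurrency" in a step like the mapdata fetch could look like the sketch below, assuming something like Task.async_stream drives the requests; the module, function, and URL list are placeholders, not the patch itself:

```elixir
# Hypothetical sketch of raising per-step concurrency for an API fetch.
defmodule MapdataFetch do
  def run(urls) do
    urls
    |> Task.async_stream(&fetch_mapdata/1,
      # default max_concurrency is System.schedulers_online(); raising it lets
      # the step be bounded by the API rather than by local CPU count
      max_concurrency: 16,
      timeout: 60_000
    )
    |> Enum.map(fn {:ok, result} -> result end)
  end

  # stand-in for the real HTTP call
  defp fetch_mapdata(url), do: {:ok, url}
end
```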

I couldn't resist taking a look, so I'm moving this to our current backlog.