Do basic performance profiling to check whether we're doing anything egregiously expensive. This doesn't need to be a thorough analysis.
Overall timing for all 4 steps of processing dewiki (2,791,185 Main namespace pages):
    real    757m8.327s
    user    3686m51.122s
    sys     945m53.288s
That comes to 61 pages/s of wall-clock time on a 16-core machine, and roughly 80 ms/page of CPU (user) time.
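(For the record: 757m8s ≈ 45,428 s of wall time, and 2,791,185 pages / 45,428 s ≈ 61 pages/s; 3686m51s ≈ 221,211 s of user time, and 221,211 s / 2,791,185 pages ≈ 79 ms/page.)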
The concurrency could be tuned better: it looks like Flow settles at roughly 30% CPU load during the main scrape job. Its defaults are probably chosen for more CPU-intensive jobs, and this one does a lot of IO and memory traffic (i.e. pushing HTML to the userspace parsing NIF threads). For a single wiki we could try increasing Flow's stage count by 3x (sketched below), but this might happen naturally when running multiple jobs.
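A minimal sketch of that tuning, assuming the scrape job builds its pipeline with Flow.from_enumerable. The :stages and :max_demand options are real Flow options; the pipeline shape, pages, and scrape_page/1 are placeholders:

    # Flow defaults :stages to System.schedulers_online(); for a job
    # that mostly waits on IO and the parsing NIF, oversubscribing
    # (here 3x) should keep the cores busier.
    pages
    |> Flow.from_enumerable(
      stages: System.schedulers_online() * 3,
      # Small demand batches suit large items like whole HTML pages.
      max_demand: 1
    )
    |> Flow.map(&scrape_page/1)
    |> Flow.run()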
Following the instructions at https://blog.appsignal.com/2022/04/26/using-profiling-in-elixir-to-improve-performance.html to instrument pipeline.exs, and then converting the fprof output to callgrind format with https://github.com/isacssouza/erlgrind, I profiled a small run over a wiki with only a few hundred pages. The results are hard for me to interpret, but one thing that jumps out is that the edit distance algorithm is extremely hot: comparing every pair of references makes it quadratic in the number of refs, so a wiki of several hundred pages already caused millions of ref body comparisons. String.length itself is expensive, so I made tiny optimizations to call it less. Hard to see what the next optimization might be.
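For reference, the instrumentation boiled down to the stock :fprof recipe from that post, roughly as below. Pipeline.run_wiki/1 and "smallwiki" stand in for our actual entry point:

    # Trace one run, build the profile, and write an analysis file
    # that erlgrind can convert to the callgrind format.
    :fprof.apply(&Pipeline.run_wiki/1, ["smallwiki"])
    :fprof.profile()
    :fprof.analyse(dest: ~c"pipeline.fprof")
    # Then convert outside Elixir with erlgrind (see its README),
    # e.g. erlgrind pipeline.fprof pipeline.cgrind, and open the
    # result in kcachegrind/qcachegrind.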
String operations in Elixir are Unicode-aware (https://nietaki.com/2023/04/21/elixir-string-operations-seem-slow-and-why-its-a-good-thing/), which explains some of the slowness: String.length/1 has to walk the whole binary counting graphemes, while byte_size/1 is O(1). I'll adapt the "cheap" code branches to compare byte length where appropriate.
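A sketch of what the cheap branch could look like; similar?/3, edit_distance/2, and max_dist are placeholders for whatever the real comparison helper looks like:

    defp similar?(a, b, max_dist) do
      cond do
        # Identical binaries: nothing to measure.
        a == b ->
          true

        # byte_size/1 is O(1) on binaries. A codepoint is at most
        # 4 bytes in UTF-8, so a byte-length gap bigger than
        # 4 * max_dist can't be bridged by max_dist codepoint edits.
        # For a grapheme-level distance treat the factor as a
        # heuristic, since grapheme clusters can be longer.
        abs(byte_size(a) - byte_size(b)) > max_dist * 4 ->
          false

        true ->
          edit_distance(a, b) <= max_dist
      end
    end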