
Scraper: profile pipeline
Closed, Resolved · Public

Description

Do basic performance profiling to check whether we're doing anything egregiously expensive. This doesn't need to be a thorough analysis.

Overall timing for all 4 steps of processing dewiki (2,791,185 Main namespace pages):

real    757m8.327s
user    3686m51.122s
sys     945m53.288s

61 pages/s of wall time on a 16-core machine, and roughly 80 ms/page of CPU (user) time.
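(Arithmetic: 757 min ≈ 45,430 s of wall time, so 2,791,185 pages / 45,430 s ≈ 61 pages/s; 3,687 min of user time ≈ 221,210 s, so 221,210 s / 2,791,185 pages ≈ 79 ms/page.)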

The concurrency could be tuned better: it looks like Flow only reaches about 30% CPU load during the main scrape job. Its defaults are probably chosen for more CPU-intensive jobs, and this one does a lot of IO and memory movement (i.e. pushing HTML to the userspace parsing NIF threads). For a single wiki, we could try increasing the thread count by 3x, but this might happen naturally when running multiple jobs.
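A minimal sketch of that tuning, assuming the scrape step is built on Flow.from_enumerable/2; `pages` and `scrape_page/1` are placeholder names, not the real pipeline code:

```elixir
pages
|> Flow.from_enumerable(
  # Flow defaults to one stage per scheduler, which suits CPU-bound work;
  # for this IO- and memory-heavy step we could try roughly 3x that.
  stages: System.schedulers_online() * 3,
  # Smaller demand keeps less HTML buffered in each stage at a time.
  max_demand: 10
)
|> Flow.map(&scrape_page/1)
|> Flow.run()
```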

Following the instructions at https://blog.appsignal.com/2022/04/26/using-profiling-in-elixir-to-improve-performance.html to instrument pipeline.exs, and then converting the fprof output to callgrind format with https://github.com/isacssouza/erlgrind, I profiled a small run over a wiki with only a few hundred pages. The results are hard for me to interpret, but one thing that jumps out is that the edit distance algorithm is extremely hot because we compare every pair of references, so the work grows quadratically with the number of refs. A wiki of several hundred pages already caused millions of ref body comparisons. String.length itself is expensive, so I made tiny optimizations to call it less. It's hard to see what the next optimization might be.
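For reference, the fprof wrapping from that blog post boils down to something like the following sketch; `Scraper.Pipeline.run/1` and `small_wiki` stand in for whatever pipeline.exs actually calls:

```elixir
# Trace one small run under fprof, then analyse into a file that
# erlgrind can convert to callgrind format for kcachegrind.
:fprof.apply(&Scraper.Pipeline.run/1, [small_wiki])
:fprof.profile()
:fprof.analyse(dest: ~c"pipeline.fprof")
# Then run erlgrind over pipeline.fprof and open the result in kcachegrind.
```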

String operations in Elixir are Unicode-aware (https://nietaki.com/2023/04/21/elixir-string-operations-seem-slow-and-why-its-a-good-thing/), which explains some of the slowness. I'll adapt the "cheap" code branches to compare byte length where appropriate.
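As a sketch of what I mean (illustrative names only, not the scraper's actual code): byte_size/1 is O(1) on a binary, while String.length/1 walks every grapheme, and since a string never has more graphemes than bytes, the byte size gives a free upper bound on the grapheme count:

```elixir
defmodule CheapChecks do
  # Cheap test for "is this string at most `limit` graphemes long?"
  # If the byte count is already within the limit, the grapheme count
  # must be too, so the expensive String.length/1 call is skipped.
  def at_most?(s, limit) when is_binary(s) do
    byte_size(s) <= limit or String.length(s) <= limit
  end

  # Equal strings necessarily have equal byte sizes, so a size mismatch
  # rules out equality without comparing contents.
  def same?(a, b) when is_binary(a) and is_binary(b) do
    byte_size(a) == byte_size(b) and a == b
  end
end
```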

Event Timeline

If anyone wants to play with this in the future, the Flow stages should still be tuned, using instrumentation like https://teamon.me/2016/measuring-visualizing-genstage-flow-with-gnuplot/. Especially once multiple wikis are being processed, the arbitrary default number of stages won't make any sense. I don't think the multiple Flow pipelines will play together nicely, either.
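The instrumentation in that post amounts to logging a timestamp as items pass through each stage and plotting the log afterwards; a rough sketch (the labels and placement are arbitrary):

```elixir
defmodule FlowTrace do
  # Log "<label> <microsecond timestamp>" for each item passing through,
  # then hand the item on unchanged; the resulting log can be plotted
  # with gnuplot to see where the pipeline stalls.
  def mark(item, label) do
    IO.puts(:stderr, "#{label} #{System.monotonic_time(:microsecond)}")
    item
  end
end

# Illustrative usage inside the pipeline:
# flow
# |> Flow.map(&FlowTrace.mark(&1, "scrape_in"))
# |> Flow.map(&scrape_page/1)
# |> Flow.map(&FlowTrace.mark(&1, "scrape_out"))
```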

A short pilot run shows that CPU usage is roughly 50%. Doubling the number of top-level (wiki) stages to 32 increases CPU utilization to about 60%, which is better, and suggests that for now the job is IO-bound (NFS).

I'll patch the application to allow configurable concurrency and will set the production job to 32 stages.
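Roughly what I have in mind for the configurable part; the `:scraper` app name and `:wiki_stages` key are hypothetical:

```elixir
defmodule Scraper.Concurrency do
  # Read the wiki-level stage count from application config, falling back
  # to Flow's usual default of one stage per scheduler.
  def wiki_stages do
    Application.get_env(:scraper, :wiki_stages, System.schedulers_online())
  end
end

# The production job would then set, e.g. in config/runtime.exs:
#   config :scraper, wiki_stages: 32
# and the wiki-level Flow would start with:
#   Flow.from_enumerable(wikis, stages: Scraper.Concurrency.wiki_stages())
```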

awight claimed this task.