
Scraper: profile pipeline
Closed, Resolved · Public

Description

Do basic performance profiling to check whether we're doing anything egregiously expensive. This doesn't need to be a thorough analysis.

Overall timing for all 4 steps of processing dewiki (2,791,185 Main namespace pages):

real    757m8.327s
user    3686m51.122s
sys     945m53.288s

61 pages/s of wall time on a 16-core machine, and roughly 80 ms/page of CPU (user) time.
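(Arithmetic: 757 min ≈ 45,430 s of wall time, so 2,791,185 pages / 45,430 s ≈ 61 pages/s; 3,687 min of user time ≈ 221,210 s, so 221,210 s / 2,791,185 pages ≈ 79 ms/page.)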

The concurrency could be tuned better: it looks like Flow only reaches about 30% CPU load during the main scrape job. Its defaults are probably chosen for more CPU-intensive jobs, and this one does a lot of IO and memory movement (i.e. pushing HTML to the userspace parsing NIF threads). For a single wiki, we could try increasing the thread count by 3x, but this might happen naturally when running multiple jobs.
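A minimal sketch of that tuning, assuming the scrape step is built on Flow.from_enumerable/2; `pages` and `scrape_page/1` are placeholder names, not the real pipeline code:

```elixir
pages
|> Flow.from_enumerable(
  # Flow defaults to one stage per scheduler, which suits CPU-bound work;
  # for this IO- and memory-heavy step we could try roughly 3x that.
  stages: System.schedulers_online() * 3,
  # Smaller demand keeps less HTML buffered in each stage at a time.
  max_demand: 10
)
|> Flow.map(&scrape_page/1)
|> Flow.run()
```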

Following the instructions at https://blog.appsignal.com/2022/04/26/using-profiling-in-elixir-to-improve-performance.html to instrument pipeline.exs, and then converting the fprof output to callgrind format with https://github.com/isacssouza/erlgrind, I profiled a small run over a wiki with only a few hundred pages. The results are hard for me to interpret, but one thing that jumps out is that the edit distance algorithm is extremely hot because we compare every pair of references, so the work grows quadratically with the number of refs. A wiki of several hundred pages already caused millions of ref body comparisons. String.length itself is expensive, so I made tiny optimizations to call it less. It's hard to see what the next optimization might be.
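For reference, the fprof wrapping from that blog post boils down to something like the following sketch; `Scraper.Pipeline.run/1` and `small_wiki` stand in for whatever pipeline.exs actually calls:

```elixir
# Trace one small run under fprof, then analyse into a file that
# erlgrind can convert to callgrind format for kcachegrind.
:fprof.apply(&Scraper.Pipeline.run/1, [small_wiki])
:fprof.profile()
:fprof.analyse(dest: ~c"pipeline.fprof")
# Then run erlgrind over pipeline.fprof and open the result in kcachegrind.
```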

String operations in Elixir are Unicode-aware (https://nietaki.com/2023/04/21/elixir-string-operations-seem-slow-and-why-its-a-good-thing/), which explains some of the slowness. I'll adapt the "cheap" code branches to compare byte length where appropriate.
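As a sketch of what I mean (illustrative names only, not the scraper's actual code): byte_size/1 is O(1) on a binary, while String.length/1 walks every grapheme, and since a string never has more graphemes than bytes, the byte size gives a free upper bound on the grapheme count:

```elixir
defmodule CheapChecks do
  # Cheap test for "is this string at most `limit` graphemes long?"
  # If the byte count is already within the limit, the grapheme count
  # must be too, so the expensive String.length/1 call is skipped.
  def at_most?(s, limit) when is_binary(s) do
    byte_size(s) <= limit or String.length(s) <= limit
  end

  # Equal strings necessarily have equal byte sizes, so a size mismatch
  # rules out equality without comparing contents.
  def same?(a, b) when is_binary(a) and is_binary(b) do
    byte_size(a) == byte_size(b) and a == b
  end
end
```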

Event Timeline

If anyone wants to play with this in the future, the Flow stages should still be tuned, using instrumentation like https://teamon.me/2016/measuring-visualizing-genstage-flow-with-gnuplot/. Especially once multiple wikis are being processed, the arbitrary default number of stages won't make any sense. I don't think the multiple Flow pipelines will play together nicely, either.
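The instrumentation in that post amounts to logging a timestamp as items pass through each stage and plotting the log afterwards; a rough sketch (the labels and placement are arbitrary):

```elixir
defmodule FlowTrace do
  # Log "<label> <microsecond timestamp>" for each item passing through,
  # then hand the item on unchanged; the resulting log can be plotted
  # with gnuplot to see where the pipeline stalls.
  def mark(item, label) do
    IO.puts(:stderr, "#{label} #{System.monotonic_time(:microsecond)}")
    item
  end
end

# Illustrative usage inside the pipeline:
# flow
# |> Flow.map(&FlowTrace.mark(&1, "scrape_in"))
# |> Flow.map(&scrape_page/1)
# |> Flow.map(&FlowTrace.mark(&1, "scrape_out"))
```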

A short pilot run shows that CPU usage is roughly 50%. Doubling the number of top-level (wiki) stages to 32 increases CPU utilization to about 60%, which is better, and suggests that for now the job is IO-bound (NFS).

I'll patch the application to allow configurable concurrency and will set the production job to 32 stages.
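Roughly what I have in mind for the configurable part; the `:scraper` app name and `:wiki_stages` key are hypothetical:

```elixir
defmodule Scraper.Concurrency do
  # Read the wiki-level stage count from application config, falling back
  # to Flow's usual default of one stage per scheduler.
  def wiki_stages do
    Application.get_env(:scraper, :wiki_stages, System.schedulers_online())
  end
end

# The production job would then set, e.g. in config/runtime.exs:
#   config :scraper, wiki_stages: 32
# and the wiki-level Flow would start with:
#   Flow.from_enumerable(wikis, stages: Scraper.Concurrency.wiki_stages())
```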

awight claimed this task.