Problem: HTML dump scraper throughput drops off sharply over time.
Hypothesis 1: some threads fail to halt
Evidence for: Performance degrades but never recovers. The secondary process count decreases over a long timeframe at the end of the job, as if each thread is shutting down one at a time.
Test: Kill worker threads after a maximum time limit. Log skipped pages.
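A minimal sketch of that test, assuming Python and a `process_page` placeholder for the real per-article work (the timeout value and function names are hypothetical):

```python
import logging
import multiprocessing as mp

PAGE_TIMEOUT_S = 120  # hypothetical per-page limit; tune above the slowest legitimate article


def process_page(title, wikitext):
    # Placeholder for the real per-article processing.
    return len(wikitext)


def process_with_timeout(title, wikitext):
    """Run one page in a child process; kill it and log if it exceeds the limit."""
    with mp.Pool(1) as pool:
        result = pool.apply_async(process_page, (title, wikitext))
        try:
            return result.get(timeout=PAGE_TIMEOUT_S)
        except mp.TimeoutError:
            # Leaving the `with` block terminates the stuck worker.
            logging.warning("skipped page after timeout: %s", title)
            return None
```

If the throughput curve flattens once stuck pages are killed, that would confirm H1; the skipped-page log then identifies the offending articles.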
Hypothesis 2: exponential ref comparisons explode
Evidence against:
- The article with the highest number of references is https://de.wikipedia.org/wiki/Liste_von_neuzeitlich_ausgestorbenen_Weichtieren , with 832kB of wikitext and 5300 refs. I synthesized a dump file containing just the four (!) copies of this article that appear in the real dump and ran processing: the job allocated only 75MB of memory, finished in 1m20s, and the output looked fine.
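A quick sanity check on the scale involved: if deduplication compares every pair of refs, the cost grows quadratically in the ref count (not exponentially). Assuming pairwise comparison, even the worst article stays in the low millions of comparisons, consistent with the 1m20s result above:

```python
import math

refs = 5300
pairs = math.comb(refs, 2)  # n*(n-1)/2 pairwise comparisons
print(pairs)  # 14_042_350 — large, but nowhere near a blow-up
```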
- Articles that seem to cause the drop in performance aren't timing out.
Hypothesis 3: triggered by a specific article feature
Evidence for: The problem doesn't appear until roughly the 1.6-millionth article, after which performance degrades quickly. Fast-forwarding the stream to this point reproduces the problem immediately, whether we fast-forward by skipping lines within the program (so the data still passes through all of the tar / gunzip / line-split stream transformers) or by starting from a prepared snippet of the input file.
Evidence against: Performance curves look the same even if the articles around the inflection point have been skipped.
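For reference, the line-skipping fast-forward can be sketched as below. The dump layout (a tar of gzipped, line-delimited files) and the 1.6M skip count are taken from the notes above; the function names are hypothetical:

```python
import gzip
import itertools
import tarfile

SKIP_ARTICLES = 1_600_000  # hypothetical fast-forward point from the observed inflection


def article_lines(dump_path):
    """Yield one line per article from a tar of gzipped line-delimited files."""
    with tarfile.open(dump_path) as tar:
        for member in tar:
            if not member.isfile():
                continue
            with gzip.open(tar.extractfile(member), "rt") as lines:
                yield from lines


def fast_forward(lines, skip=SKIP_ARTICLES):
    # islice discards the first `skip` articles while still pulling every
    # byte through the upstream transformers, matching the test described above.
    return itertools.islice(lines, skip, None)
```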
Hypothesis 4: articles are sorted in order of increasing difficulty
Evidence for: As with H3, processing a snippet from later in the stream seems to reproduce exactly the performance profile of the corresponding rows in a full run.
Understanding how the dumps are ordered would help here.
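One cheap way to probe H4 without understanding the dump format: record a difficulty proxy per position (e.g., article byte length) while streaming, then check whether it trends upward. A positive least-squares slope over the whole run would support the sorted-by-difficulty hypothesis. A sketch, with the proxy choice being an assumption:

```python
def trend_slope(sizes):
    """Least-squares slope of size vs. stream position.

    A clearly positive slope means the proxy (e.g., article byte length)
    grows with position, supporting the increasing-difficulty hypothesis.
    """
    n = len(sizes)
    mean_x = (n - 1) / 2
    mean_y = sum(sizes) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(sizes))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var
```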