Investigate scraper performance drop-off
Closed, Resolved, Public


Problem: HTML dump scraper throughput drops off quickly over time.

Hypothesis 1: some threads fail to halt

Evidence for: Performance degrades but never recovers. Secondary process count decreases over a long timeframe at the end of the job, as if each thread is shutting down one at a time.

Test: Kill worker threads after a maximum time limit. Log skipped pages.
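A minimal sketch of this test, assuming each page can be handed to its own OS process so that a stuck worker can actually be killed. The scraper's real worker model is not shown in this task, and `run_with_timeout`, `process_pages` and the demo commands are hypothetical names used only for illustration.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd in a child process; return its stdout, or None if it timed out."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before raising.
        return None
    return done.stdout


def process_pages(pages, timeout_s):
    """pages: iterable of (page_id, cmd). Returns (results, skipped_ids)."""
    results, skipped = {}, []
    for page_id, cmd in pages:
        out = run_with_timeout(cmd, timeout_s)
        if out is None:
            skipped.append(page_id)
            # Log the skipped page so it can be inspected later.
            print(f"skipped page {page_id}: exceeded {timeout_s}s", file=sys.stderr)
        else:
            results[page_id] = out.strip()
    return results, skipped


# Hypothetical stand-ins for real page workers: one fast, one that hangs.
demo_pages = [
    ("fast", [sys.executable, "-c", "print('ok')"]),
    ("stuck", [sys.executable, "-c", "import time; time.sleep(60)"]),
]
```

Calling `process_pages(demo_pages, timeout_s=3)` should return the fast page's output and list the stuck page as skipped. If the skipped list stays empty on a real run while throughput still drops, that counts against this hypothesis.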

Hypothesis 2: exponential ref comparisons explode

Evidence against:

  • the highest number of references is found in one article, with 832 kB of wikitext and 5300 refs. I synthesized a dump file containing just the (four!) copies of this article that appear in the real dump and ran the processing on it: the job allocated 75 MB of memory, took 1m20s to finish, and the outputs looked fine.
  • Articles which seem to cause the drop in performance aren't timing out.

Hypothesis 3: triggered by a specific article feature

Evidence for: The problem doesn't seem to appear until roughly the 1.6 millionth article, after which performance degrades quickly. If you fast-forward the stream to this point, the problem appears immediately. This happens whether we fast-forward by skipping lines within the program (and therefore still run the data through all of the tar / gunzip / line split stream transformers) or start from a prepared snippet of the input file.
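The in-program fast-forward described above could be sketched as follows. This assumes the dump is consumed as one JSON object per line; `fast_forward` and the demo rows are hypothetical, not the scraper's actual code.

```python
import io
import itertools
import json


def fast_forward(lines, skip):
    """Yield parsed rows after discarding the first `skip` lines.

    The skipped lines are still read from `lines`, so any upstream
    stream transformers keep doing their work for them.
    """
    for line in itertools.islice(lines, skip, None):
        if line.strip():
            yield json.loads(line)


# Made-up demo input standing in for the decompressed, line-split stream.
demo = io.StringIO('{"n": 0}\n{"n": 1}\n{"n": 2}\n{"n": 3}\n')
rows = list(fast_forward(demo, 2))
```

Starting from a prepared snippet instead corresponds to calling the same function with `skip=0` on the smaller file, which is what makes the two approaches comparable.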

Evidence against: Performance curves look the same even if the articles around the inflection point have been skipped.

Hypothesis 4: articles are sorted in order of increasing difficulty

Evidence for: Similar to H3, snippets taken from later in the stream show exactly the same performance profile as the corresponding rows in a full run.

Understanding the way dumps are ordered would be helpful.

Code to review

Event Timeline

WIP on the low-level-concurrency branch will let us experiment with per-page timeouts and debugging.

awight moved this task from Sprint Backlog to Doing on the WMDE-TechWish-Sprint-2024-04-12 board.
awight updated the task description.

Very surprisingly to me, Hypothesis 4 seems to be the only validated theory. I haven't yet identified what makes the last articles harder to process, but the performance characteristics are almost perfectly repeatable when going back and forth between sets of articles at the beginning vs. the end of the dump. Initial articles can be processed at ~1.5k articles/s, and final articles at ~250 articles/s.

In this example, the segment on the left is processing the tail articles starting at the 2.6M'th row, and on the right we're processing the first articles in the dump.

image.png (689×1 px, 66 KB)
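For comparing segments numerically rather than by eye, a rate per time window can be derived from per-article completion times. This is only a sketch: it assumes the scraper can log a completion timestamp (seconds since job start) for each article, and `throughput_per_window` is a hypothetical helper.

```python
from collections import Counter


def throughput_per_window(finish_times, window_s=10.0):
    """Map each window's start time (s) to its articles-per-second rate."""
    # Bucket every completion timestamp into a fixed-width window.
    counts = Counter(int(t // window_s) for t in finish_times)
    return {w * window_s: n / window_s for w, n in sorted(counts.items())}


# Made-up timestamps: four articles finish in the first window, one in the second.
demo_rates = throughput_per_window([0.5, 1.0, 1.5, 2.5, 11.0], window_s=10.0)
```

Windows in which no article finished are simply absent from the result, so a stalled job shows up as missing keys rather than as zero rates.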

awight moved this task from Doing to Tech Review on the WMDE-TechWish-Sprint-2024-04-12 board.

Well, it could be simple after all. Articles at the end are on average twice as long (by HTML length).

cat dewiki-20240220-head-page-summary.ndjson | jq '.html_length' | jq -s 'add/length'

cat dewiki-20240220-+2.6M-page-summary.ndjson | jq '.html_length' | jq -s 'add/length'

Average wikitext_length is 4533 for the initial articles and 11308 at the end.

Average ref_count is 3.2 at the beginning and 9.2 at the end.
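The same averages can be computed for all three fields in one pass. The sketch below assumes each line of a page-summary file is a JSON object carrying `html_length`, `wikitext_length` and `ref_count`; `field_means`, `ratios` and the demo rows are hypothetical and the demo numbers are made up.

```python
import io
import json
from statistics import fmean

FIELDS = ("html_length", "wikitext_length", "ref_count")


def field_means(lines, fields=FIELDS):
    """Return the mean of each field over all non-empty NDJSON lines."""
    rows = [json.loads(line) for line in lines if line.strip()]
    return {f: fmean(row[f] for row in rows) for f in fields}


def ratios(head, tail):
    """Return tail/head for every field, i.e. how much larger the tail is."""
    return {f: tail[f] / head[f] for f in head}


# Made-up rows standing in for the head and tail summary files.
head = field_means(io.StringIO(
    '{"html_length": 100, "wikitext_length": 40, "ref_count": 2}\n'
    '{"html_length": 300, "wikitext_length": 60, "ref_count": 4}\n'
))
tail = field_means(io.StringIO(
    '{"html_length": 400, "wikitext_length": 150, "ref_count": 9}\n'
))
growth = ratios(head, tail)
```

Applied to the averages reported above, the tail-to-head ratio works out to roughly 2.5 for wikitext_length (11308 / 4533) and roughly 2.9 for ref_count (9.2 / 3.2).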

WMDE-Fisch claimed this task.