Problem: HTML dump scraper throughput drops off sharply over time.
== Hypothesis 1: some threads fail to halt ==
Evidence: Performance degrades and never recovers. Toward the end of the job, the secondary process count decreases slowly over a long timeframe, as if the threads are shutting down one at a time.
Test: Kill worker threads after a maximum time limit. Log skipped pages.
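A minimal sketch of this test, assuming each page is handled by a killable worker. Python threads cannot be force-stopped, so the sketch wraps the per-page work in a multiprocessing.Process that can be terminated; process_page, PAGE_TIMEOUT, and the log format are illustrative, not taken from the real scraper.

```
import logging
import multiprocessing as mp

PAGE_TIMEOUT = 60  # seconds; hypothetical limit, tune to realistic page times

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("scraper")

def process_page(title):
    ...  # stand-in for the real per-page scraping work

def run_with_timeout(title):
    # Run one page in its own process so a hung worker can be killed.
    worker = mp.Process(target=process_page, args=(title,))
    worker.start()
    worker.join(PAGE_TIMEOUT)
    if worker.is_alive():
        worker.terminate()  # force-stop the stuck worker
        worker.join()
        log.warning("timed out after %ss, skipping page: %s", PAGE_TIMEOUT, title)

if __name__ == "__main__":
    run_with_timeout("Example page")
```

If the skipped-page log fills up with entries as throughput drops, that would support the hypothesis; an empty log would point elsewhere.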
== Hypothesis 2: exponential ref comparisons explode ==
Evidence against: the article with the most references is https://de.wikipedia.org/wiki/Liste_von_neuzeitlich_ausgestorbenen_Weichtieren , with 832 kB of wikitext and 5300 refs. I synthesized a dump file containing only the (four!) copies of this article that appear in the real dump and ran it through processing: the job allocated 75 MB of memory, finished in 1m20s, and the output looked fine.
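For reference, a sketch of how such a synthetic dump could be built, assuming the dump is newline-delimited JSON with one page object per line (as in Wikimedia Enterprise HTML dumps) and a name field holding the page title; the filenames and the field name are assumptions, not from the actual setup.

```
import json

TARGET = "Liste von neuzeitlich ausgestorbenen Weichtieren"

# Filenames are illustrative; "name" as the title field is an assumption.
with open("dewiki-NS0-html.ndjson", encoding="utf-8") as src, \
     open("synthetic-dump.ndjson", "w", encoding="utf-8") as dst:
    for line in src:
        page = json.loads(line)
        if page.get("name") == TARGET:  # keep every copy, duplicates included
            dst.write(line)
```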