This will simplify how we share monitoring duty during the long-running scrape job.
Today
Yesterday
Stalled waiting for WMF legal review.
Well, it could be simple after all. Articles at the end are on average twice as long (by HTML length).
In this example, the segment on the left is processing the tail articles starting at roughly row 2.6M, and on the right we're processing the first articles in the dump.
Very surprisingly to me, Hypothesis 4 seems to be the only validated theory. I haven't yet identified what makes the last articles harder to process, but the performance characteristics are almost perfectly repeatable when going back and forth between sets of articles at the beginning vs. the end of the dump. Initial articles can be processed at ~1.5k articles/s, and final articles at ~250 articles/s.
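A quick way to sanity-check the length claim straight from the dump (a sketch, assuming the article HTML lives under .article_body.html; adjust to the actual schema):

# average HTML length of the first 10k articles
tar xzf dewiki-NS0-20240201-ENTERPRISE-HTML.json.tar.gz -O | head -n 10000 | jq '.article_body.html | length' | awk '{sum+=$1} END {print sum/NR}'
# average HTML length of 10k articles from the tail segment, starting around row 2.6M
tar xzf dewiki-NS0-20240201-ENTERPRISE-HTML.json.tar.gz -O | tail -n +2600000 | head -n 10000 | jq '.article_body.html | length' | awk '{sum+=$1} END {print sum/NR}'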
Fri, Apr 19
WIP on the low-level-concurrency branch will let us experiment with per-page timeouts and debugging.
Thu, Apr 18
This may be related to T362894: Data quality: HTML dumps contain unexplainably outdated revisions of some pages. The duplicates seem to have various revision ids, here's a set showing that the article is included three times with the same title and page id, but at different versions:
tar xzf dewiki-NS0-20240201-ENTERPRISE-HTML.json.tar.gz -O | jq 'select(.name == "10.000 B.C.") | .identifier,.version.identifier'
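To gauge how widespread the duplication is, something along these lines should count the page ids that occur more than once (untested sketch; .identifier is the page id, as above):

tar xzf dewiki-NS0-20240201-ENTERPRISE-HTML.json.tar.gz -O | jq -r '.identifier' | sort -n | uniq -d | wc -l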
@BTullis Thanks for highlighting this possibility! I tried the Conda environment as you suggested and it works perfectly for our needs. Even at high concurrency, the performance seems to be the same as in the bare metal environment I had cobbled together previously.
Wed, Apr 17
Still seeing extreme swings in performance, following the same shape as before. Now with additional metrics:
Tue, Apr 16
Pulling this in because it would be nice to have, to debug the slowdown we see after the first 20 minutes or so.
Some of these packages already appear in debmonitor:
Hmm, spot-checking is only turning up articles which were created or moved after the snapshot date.
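For reference, a sketch of one way to spot-check a title (TITLE is a placeholder): ask the API for the timestamp of the first revision and compare it with the 20240201 snapshot date; page moves would need a logevents query instead.

curl -s 'https://de.wikipedia.org/w/api.php?action=query&prop=revisions&titles=TITLE&rvdir=newer&rvlimit=1&rvprop=timestamp&format=json' | jq -r '.query.pages[].revisions[0].timestamp'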
Duplicates: each copy of a page comes with a different revid, and checking the final counts we can see that our deduplication did catch the extra copies during the aggregation step:
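(A minimal sketch of the kind of count check meant here, with hypothetical file names: compare raw rows against distinct page ids and rows after aggregation.)

# raw rows vs. distinct page ids vs. rows after aggregation (file names are placeholders)
wc -l < scrape_raw.ndjson
jq -r '.identifier' scrape_raw.ndjson | sort -nu | wc -l
wc -l < scrape_aggregated.ndjson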
Bringing this task into our sprint because it has data quality implications and probably blocks scraping for the moment.
Oookay there are all kinds of things happening. Diffing the two lists, we can see that the scraper is still producing duplicates:
+1._Buch_Samuel
+1._Buch_Samuel
+1._Buch_Samuel
+1._Buch_Samuel
+1._Buch_Samuel
analytics-mysql dewiki -B -e 'select page_title from page where page_namespace=0 and page_is_redirect=0' > dewiki_all_pages_db.txt
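For completeness, a sketch of the companion steps (the scraped-output file name is hypothetical; titles need underscores to match the DB convention):

# scraped titles, normalized to the DB title form
jq -r '.name | gsub(" "; "_")' scraped_articles.ndjson | sort > dewiki_all_pages_scraped.txt
sort -o dewiki_all_pages_db.txt dewiki_all_pages_db.txt
# titles only in the scraped list (including extra copies) show up with a leading '+'
diff -u dewiki_all_pages_db.txt dewiki_all_pages_scraped.txt | grep '^+[^+]'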
Deleted articles shouldn't show up in either list, so Hypothesis #2 is also looking unlikely.
ApiQueryAllPages uses the page title to carry continuation state, which is very reasonable! This would only be fooled by page renames happening during the dump interval, which is possible but not likely to add up to 1%. Hypothesis #1 is looking unlikely.
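The continuation behaviour can be seen directly on the API, where the next page title is carried in apcontinue:

curl -s 'https://de.wikipedia.org/w/api.php?action=query&list=allpages&apnamespace=0&apfilterredir=nonredirects&aplimit=2&format=json' | jq -r '.continue.apcontinue'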
There's still a small (<1%) gap in page count. Splitting a small investigation out as T362659.
Performance graph (articles per second, peak is ~1k) shows very odd artifacts; these could be real, or the wiki could be ordered so that longer articles come later:
Mon, Apr 15
WMDE Technical Wishes is relying on the 100% sampling rate for an upcoming experiment. Please let us know ahead of time if there are any further plans to sample in 2024.
Looks like we can use VisualEditorFeatureUse for our needs.
Looks like the usages have successfully been switched over already, so let's just try dropping the duplicate Popups module and see what happens.