At least 1% of articles are missing from the HTML dumps: the article count reported by the meta=siteinfo&siprop=statistics API is higher than the count of unique pages found in the dumps. This task is done when we have an explanation for why the numbers differ.
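A minimal sketch of the comparison, for reference. It assumes the dump is newline-delimited JSON with one page object per line, keyed by a page ID field named "identifier". That field name and the helper names are assumptions for illustration, not confirmed against the dump schema. The API side reads the `articles` value from a parsed `action=query&meta=siteinfo&siprop=statistics&format=json` response.

```python
import json


def article_count_from_siteinfo(response):
    """Read the article count from a parsed siteinfo statistics response."""
    return response["query"]["statistics"]["articles"]


def count_unique_pages(ndjson_lines, key="identifier"):
    """Count distinct pages in an iterable of NDJSON lines.

    `key` is an assumed field name for the page ID; adjust to the real schema.
    """
    seen = set()
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        seen.add(json.loads(line)[key])
    return len(seen)


def missing_fraction(expected, found):
    """Fraction of expected articles that were not found in the dump."""
    return (expected - found) / expected


# Toy data: the API claims 200 articles, the dump holds 198 unique pages
# plus one duplicated row.
api_response = {"query": {"statistics": {"articles": 200}}}
dump_lines = [json.dumps({"identifier": i}) for i in range(198)]
dump_lines.append(json.dumps({"identifier": 0}))  # duplicate row

expected = article_count_from_siteinfo(api_response)
found = count_unique_pages(dump_lines)
print(f"missing: {missing_fraction(expected, found):.2%}")
```

Deduplicating on the page ID matters here, since a dump that repeats rows would otherwise hide part of the gap.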
Hypothesis #1: the enterprise dump source code gets its article list by making repeated API calls to the list=allpages action API endpoint. This seems inherently unstable, as the underlying query is re-run unpredictably and the cursor works by row count rather than by something stable like page id.
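To illustrate the failure mode this hypothesis describes, here is a toy simulation. It does not call the API and makes no claim about how list=allpages actually continues. It only assumes, as the hypothesis does, a cursor that counts rows, and compares it with a cursor keyed on page id. All names are hypothetical.

```python
def walk_by_offset(snapshots, batch_size):
    """Page through a list with a row-count cursor.

    snapshots[i] is the table as seen by request i (the last snapshot is
    reused for later requests), modelling a query re-run on each call.
    """
    seen, offset, request = [], 0, 0
    while True:
        table = snapshots[min(request, len(snapshots) - 1)]
        batch = table[offset:offset + batch_size]
        if not batch:
            return seen
        seen.extend(batch)
        offset += len(batch)
        request += 1


def walk_by_key(snapshots, batch_size):
    """Page through the same data with a cursor keyed on the last id seen."""
    seen, last, request = [], None, 0
    while True:
        table = snapshots[min(request, len(snapshots) - 1)]
        remaining = [pid for pid in table if last is None or pid > last]
        batch = remaining[:batch_size]
        if not batch:
            return seen
        seen.extend(batch)
        last = batch[-1]
        request += 1


# Page 2 is deleted between the first and second request.
before = [1, 2, 3, 4, 5, 6]
after = [1, 3, 4, 5, 6]
snapshots = [before, after]

print(walk_by_offset(snapshots, 3))  # page 4 still exists but is skipped
print(walk_by_key(snapshots, 3))     # every live page is visited
```

In this model a single deletion shifts every later row up by one, so the row-count cursor silently skips a page that still exists. If the real cursor behaves this way, the loss would grow with the number of deletions during a dump run.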
Hypothesis #2: perhaps deleted articles are counted in the stats but not in the dump? The sitestats endpoint does omit redirects (rows with page_is_redirect != 0) but seems to count deleted pages.