Wikidata dumps currently come directly from the SQL servers.
The general process is to iterate through all pages and slowly write all content to files (possibly in multiple threads).
An alternative solution could be for Wikidata to produce two event streams, one of RDF and one of JSON output, into Hadoop, once T120242: Consistent MediaWiki state change events | MediaWiki events as source of truth and T215001: Revisions missing from mediawiki_revision_create are complete.
To avoid waiting for T120242 or T215001, this could be implemented differently: a service would take a reliable and consistent input (such as MediaWiki recent changes) and populate a reliable stream of content in Kafka by requesting that content from Wikidata.
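As a rough illustration of that service, here is a minimal Python sketch. It assumes each change carries an entity ID and a revision ID, and that content is fetched from Special:EntityData in both JSON and Turtle form. The `Change` type, the topic names, the User-Agent string, and the `fetch` callable are all hypothetical; a real service would read from the recent changes feed and hand each message to a Kafka producer, with retries and ordering guarantees that are left out here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Tuple
from urllib.request import Request, urlopen

# Special:EntityData serves an entity in a given format at a given revision.
ENTITY_DATA_URL = (
    "https://www.wikidata.org/wiki/Special:EntityData/"
    "{entity_id}.{fmt}?revision={revision}"
)

# Hypothetical topic names, one per output format.
TOPICS = (("json", "wikidata.entity.json"), ("ttl", "wikidata.entity.rdf"))


@dataclass(frozen=True)
class Change:
    """One entry from the (assumed reliable) change feed."""
    entity_id: str
    revision: int


def entity_data_url(change: Change, fmt: str) -> str:
    return ENTITY_DATA_URL.format(
        entity_id=change.entity_id, fmt=fmt, revision=change.revision
    )


def fetch_over_http(url: str) -> bytes:
    """Default fetcher; the User-Agent value is a placeholder."""
    request = Request(url, headers={"User-Agent": "entity-stream-sketch/0.1"})
    with urlopen(request) as response:
        return response.read()


def build_messages(
    changes: Iterable[Change],
    fetch: Callable[[str], bytes] = fetch_over_http,
) -> Iterator[Tuple[str, str, bytes]]:
    """Yield (topic, key, value) for each change, in both formats.

    The entity ID is used as the message key so that all revisions of
    one entity would land in the same partition.
    """
    for change in changes:
        for fmt, topic in TOPICS:
            yield topic, change.entity_id, fetch(entity_data_url(change, fmt))
```

Keeping the fetcher injectable means the message-building logic can be exercised without touching Wikidata or Kafka, for example by passing `fetch=lambda url: url.encode()`.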
Dumps could then be created directly from Hadoop, which I imagine would take far less time. Users would get fresher data, and services such as Wikidata-Query-Service, which sometimes have to reload from dumps, would benefit too.
If we could quickly push this data to Kafka too, we would likely see some reduction in load on the s8 DB servers, as dump generation would no longer need to run against them. I'm sure the DBAs would appreciate this.
The new query service Flink updater could also make use of the RDF stream, instead of consuming MediaWiki revision create events and then requesting Special:EntityData.
This would likely also open up the doors a little more to subset dumps and other fun like that.
The wikidata_entities data set, which is currently imported into Hadoop weekly, would also be fresh for analysis, rather than 1-2 weeks behind.