During data reload, the Munging step takes a significant amount of time. A quick look at resource consumption indicates that munging is CPU bound, so it should be possible to introduce some parallelism to increase throughput.
Per-item data are mostly independent, so different items could easily be processed in parallel; however, that would require splitting the incoming data per item (note that item data do not necessarily have the item URI as subject: there are also statements, references, values, sitelinks, etc.).
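For illustration, here is a minimal sketch of what the parallel step could look like, assuming the dump has already been split into independent per-entity chunks; `mungeChunk` is a hypothetical stand-in for the real munger logic, not the actual API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelMunge {
    // Hypothetical stand-in for the real per-chunk munging (statement
    // rewriting, value/reference handling, etc.).
    static String mungeChunk(String entityChunk) {
        return entityChunk;
    }

    public static void main(String[] args) throws Exception {
        List<String> chunks = List.of("chunkForQ1", "chunkForQ2", "chunkForQ3");
        ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        List<Future<String>> results = new ArrayList<>();
        // Per-entity chunks are independent, so they can be munged concurrently.
        for (String chunk : chunks) {
            results.add(pool.submit(() -> mungeChunk(chunk)));
        }
        // Collecting the futures in submission order keeps the output stable.
        for (Future<String> r : results) {
            System.out.println(r.get());
        }
        pool.shutdown();
    }
}
```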
Running the munger in multiple threads roughly doubled its speed (wall-clock time dropped from ~1371m to ~731m, a ~1.9x speedup):

old: real 1371m34.618s  user 1854m48.672s  sys 24m44.480s
new: real  731m20.495s  user 1798m42.176s  sys 30m7.888s
I should have linked https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/553758 to this task.
Since the RDF parser is the limiting factor, I think we will have to do the entity delimitation without an RDF parser if we want to further improve the speed of this step.
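For example, a purely textual scan could delimit entities without invoking the parser at all. This sketch assumes the dump follows the usual serialization convention where each entity block starts with a `data:Q…` (or `data:P…`) subject line (the `data:` prefix being bound to Special:EntityData); that is a property of how the dump happens to be written, not something Turtle guarantees:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class EntityDelimiter {
    // Assumption: a new entity block in the dump begins with a subject line
    // like "data:Q42 a schema:Dataset ;". This is a textual convention of
    // the Wikidata dumps, not part of the Turtle grammar.
    static boolean startsNewEntity(String line) {
        return line.startsWith("data:Q") || line.startsWith("data:P");
    }

    static List<String> splitEntities(BufferedReader reader) throws IOException {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            if (startsNewEntity(line) && current.length() > 0) {
                chunks.add(current.toString());
                current.setLength(0);
            }
            current.append(line).append('\n');
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        // Note: the first chunk holds the @prefix header and any pre-entity
        // statements; a real implementation would prepend it to each chunk
        // so every chunk stays parseable on its own.
        return chunks;
    }
}
```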
We could also consider switching to the N-Triples (nt) format, which I'm sure would be a lot faster to parse, if the size overhead is acceptable.
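For comparison, N-Triples is line-oriented with fully expanded IRIs, so even pulling the subject out of a statement needs no parser state. This toy example only illustrates why delimitation would be cheap; it ignores blank-node subjects and the statement/reference/value routing mentioned above:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NtSubject {
    // In N-Triples every statement is one self-contained line, so the
    // subject IRI can be read with a stateless regex.
    private static final Pattern SUBJECT = Pattern.compile("^<([^>]+)>");

    static String subjectOf(String ntLine) {
        Matcher m = SUBJECT.matcher(ntLine);
        // Returns null for comments, blank lines, and blank-node subjects.
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "<http://www.wikidata.org/entity/Q42> "
                + "<http://www.w3.org/2000/01/rdf-schema#label> \"Douglas Adams\"@en .";
        System.out.println(subjectOf(line)); // http://www.wikidata.org/entity/Q42
    }
}
```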
I'm going to mark this as resolved, as the munge step can now be performed in Hadoop as a standalone step. This means it is still useful to other users and not only as part of the streaming updater.
The main part of this was done in https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/623621; thanks to @dcausse for guiding me through it.
I'm currently writing a post about how to make use of this in a context outside of the WMF cluster and will try to link back here when done.