
WDQS Munger should be multi threaded
Closed, Resolved · Public


During data reload, the munging step takes a significant amount of time. A quick look at resource consumption indicates that munging is CPU-bound. It is probably possible to introduce some parallelism to increase throughput.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Per-item data are mostly independent, so different items can easily be processed in parallel. However, that would require splitting the incoming data per item (note that item data do not necessarily have the item URI as subject: there are statements, references, values, sitelinks, etc.).

Separating

  • parsing
  • munging
  • writing

into multiple threads nearly doubled the speed of the munger (first run: before the change, second run: after):

real    1371m34.618s
user    1854m48.672s
sys     24m44.480s

real    731m20.495s
user    1798m42.176s
sys     30m7.888s

I should have linked to this task.
Since the RDF parser is the limiting factor, I think we will have to do the entity delimitation without an RDF parser if we want to further improve the speed of this step.
We could also consider switching to the N-Triples (nt) format, which I'm sure will be a lot faster to parse, if the size overhead is acceptable.
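
To illustrate why a line-based format helps: N-Triples puts one triple per line, so entity delimitation can in principle be done by scanning raw lines without building RDF objects. The sketch below assumes, purely for illustration, that the caller can recognise the first line of each entity with a cheap string test; `split_entities` and `is_boundary` are hypothetical names, and the real dump layout may not satisfy this assumption.

```python
def split_entities(lines, is_boundary):
    """Yield lists of raw N-Triples lines, one list per entity.

    `is_boundary(line)` is a caller-supplied (hypothetical) test that returns
    True for the line that starts a new entity. Lines seen before the first
    boundary (e.g. a dump header) are yielded as their own chunk.
    """
    chunk = []
    for line in lines:
        if is_boundary(line) and chunk:
            yield chunk
            chunk = []
        chunk.append(line)
    if chunk:
        yield chunk


# Usage with an assumed boundary marker (not confirmed by the task):
# chunks = split_entities(open_lines, lambda l: " <http://schema.org/about> " in l)
```

Each yielded chunk could then be handed to a separate worker for parsing and munging. A plain substring test like the one in the usage comment can misfire if the marker appears inside a literal, so a real implementation would need a stricter check.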

The real solution will come from moving all this processing to Hadoop as part of the new streaming updater.

Addshore changed the task status from Declined to Resolved. Oct 10 2020, 9:15 AM
Addshore added a subscriber: Addshore.

I'm going to mark this as resolved, as the munge step can now be performed in Hadoop as a standalone step, meaning it is also useful to other users and not only as part of the streaming updater.
The main part of this was done in thanks to @dcausse for guiding me through my dream.
I'm currently writing a post about how to make use of this in a context outside of the WMF cluster and will try to link back here when done.