Right now, each image triggers an imageinfo request to the MediaWiki API, and the generated HTML depends on the response to that request. This adds an unnecessary async dependency to the parse (even if the requests are batched and overlapped with other activity).
It should be possible to generate a "normalized" HTML output during the regular parse, using only information available in the wikitext, and then post-process that output at the end based on a single bulk API request (covering images, redlinks, disambiguation links, and whatever else needs database state). This is hinted at in this Wikitext 2.0 note.
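The two-pass idea above could look roughly like the sketch below. It is only an illustration under stated assumptions, not how the parser is or should be implemented: the simplified `[[File:...]]` syntax, the placeholder markup (`data-title`, `data-requested-width`, `missing-image`), the function names, and the shape of the image info dictionary are all hypothetical. The dictionary stands in for whatever a single bulk request at the end would return.

```python
import html
import re

# Hypothetical, simplified image syntax: [[File:Name.jpg|opt|opt|...]].
IMAGE_LINK = re.compile(r"\[\[File:([^|\]]+)((?:\|[^|\]]*)*)\]\]")
# Hypothetical placeholder markup emitted by pass 1.
PLACEHOLDER = re.compile(r'<img data-title="([^"]*)" data-requested-width="(\d*)">')


def parse(wikitext):
    """Pass 1: emit normalized HTML from the wikitext alone (no API calls).

    Returns the HTML plus the list of titles that need a lookup later.
    Captions and other options are ignored to keep the sketch short.
    """
    titles = []

    def render(match):
        title = "File:" + match.group(1).strip()
        options = [o for o in match.group(2).split("|") if o]
        width = next(
            (o[:-2] for o in options if o.endswith("px") and o[:-2].isdigit()), ""
        )
        if title not in titles:
            titles.append(title)
        safe_title = html.escape(title, quote=True)
        return f'<img data-title="{safe_title}" data-requested-width="{width}">'

    return IMAGE_LINK.sub(render, wikitext), titles


def postprocess(html_text, info_by_title):
    """Pass 2: rewrite placeholders using the result of one bulk lookup."""

    def fill(match):
        info = info_by_title.get(html.unescape(match.group(1)))
        if info is None:
            # Unknown file: keep the title, mark it as missing.
            return f'<span class="missing-image" data-title="{match.group(1)}"></span>'
        requested = int(match.group(2)) if match.group(2) else info["width"]
        width = min(requested, info["width"])  # never upscale in this sketch
        height = round(info["height"] * width / info["width"])
        src = html.escape(info["url"], quote=True)
        return f'<img src="{src}" width="{width}" height="{height}">'

    return PLACEHOLDER.sub(fill, html_text)


# Usage: parse first, do one bulk lookup for all pending titles, then post-process.
normalized, pending = parse("Intro [[File:Foo.jpg|thumb|200px|A caption]] end")
bulk_info = {"File:Foo.jpg": {"url": "https://example.org/Foo.jpg", "width": 800, "height": 600}}
final = postprocess(normalized, bulk_info)
```

The point of the split is that `parse` never waits on the database: everything it emits is derivable from the wikitext, and the only state-dependent step is the final rewrite over the collected titles.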
The general push here is to make wikitext as self-sufficient as possible at parse time, and to use post-processing to transform the output based on database state. Our current redlink and disambiguation-link parse strategies are two steps towards that goal. This image parsing strategy is another.