The wikitext
<gallery>File:Example.jpg|thumb|''Hello''</gallery>
converts to
<ul ... data-mw='{"name":"gallery","attrs":{},"body":{"extsrc":"File:Example.jpg|thumb|''Hello''"}}'>
<li>
<div ...><span ...><a ...><img ... /></a></span></div>
<div class="gallerytext"><i>Hello</i></div>
</li>
</ul>
Note that the caption data exists both as wikitext in data-mw.body.extsrc and as HTML in the gallerytext div.
Parsoid prefers to use the data-mw wikitext when converting back to wikitext, so a change made only to the gallerytext HTML would not survive the round trip. This means we would need to detect when a caption is edited for the first time and generate an attribute transaction on the parent gallery, which is very messy.
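To make the cost concrete, here is a minimal sketch of the kind of rewrite the parent gallery's data-mw would need whenever a caption changes. It is only an illustration, not Parsoid or editor code: the function name `replace_caption` and its parameters are hypothetical, and it assumes the simplification that each line of extsrc is one gallery item whose caption is the text after the last pipe.

```python
import json


def replace_caption(data_mw_json: str, line_index: int, new_caption: str) -> str:
    """Return data-mw JSON with the caption of one gallery line replaced.

    Hypothetical helper. Assumes each line of body.extsrc is one gallery
    item and that the caption is whatever follows the last '|'.
    """
    data_mw = json.loads(data_mw_json)
    lines = data_mw["body"]["extsrc"].split("\n")

    # Split off the text after the last pipe; sep is '' if there is no pipe.
    head, sep, _old_caption = lines[line_index].rpartition("|")
    if sep:
        lines[line_index] = head + "|" + new_caption
    else:
        # No caption yet: append one to the bare file name.
        lines[line_index] = lines[line_index] + "|" + new_caption

    data_mw["body"]["extsrc"] = "\n".join(lines)
    return json.dumps(data_mw)


original = json.dumps({
    "name": "gallery",
    "attrs": {},
    "body": {"extsrc": "File:Example.jpg|thumb|''Hello''"},
})
updated = replace_caption(original, 0, "''Goodbye''")
```

Even in this simplified form, the edited caption has to be converted back to wikitext and spliced into a string owned by a different node, which is the messiness described above.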