An exception to the requirement "If they do anything that brings their focus outside the caption box, like opening image details, opening the caption onboarding, or tapping the inactive publish button, this triggers the red warning text." applies when the user moves focus away without having entered any characters in the caption box. In that case, the caption input should still show the placeholder text.
When there is no caption text (whether in the initial state or after the user entered and then deleted text), the placeholder should be shown and the validation warning should not be shown.
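
The rule above can be sketched as a small pure function. This is only an illustration, not the app's actual implementation: the names `CaptionDisplayState`, `captionDisplayState`, and `focusLeftBox` are hypothetical, and it assumes that "no caption text" means a zero-length string (whitespace-only input is treated as text here, since the requirement speaks of "any characters").

```typescript
// Hypothetical shape describing what the caption box should display.
interface CaptionDisplayState {
  showPlaceholder: boolean;
  showWarning: boolean;
}

// text: current contents of the caption input.
// focusLeftBox: true once the user has moved focus outside the caption box
// (e.g. opened image details or tapped the inactive publish button).
function captionDisplayState(text: string, focusLeftBox: boolean): CaptionDisplayState {
  // Assumption: "empty" means zero characters, with no trimming.
  const isEmpty = text.length === 0;
  return {
    // Placeholder is shown whenever there is no caption text,
    // whether initially or after the user deleted what they typed.
    showPlaceholder: isEmpty,
    // Warning applies only when there is text and focus has left the box.
    showWarning: !isEmpty && focusLeftBox,
  };
}
```

Under these assumptions, `captionDisplayState("", true)` yields the placeholder without the warning, while `captionDisplayState("A cat", true)` yields the warning without the placeholder.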