Currently, the tutorial shown to the user in the first step has to be an image. That probably works for Wikimedia Commons, but it is extremely troublesome for smaller wikis that cannot afford the time and resources to design a tutorial image properly. Additionally, any change to the tutorial requires the community to go through the hassle of updating the image, which is much slower than editing some wikitext. A much simpler approach would be to add an option for the tutorial to be just a piece of wikitext. That would also make the wizard more consistent with other upload methods, which always use MW messages to explain the wiki's upload rules.
We had this problem on Nonsensopedia (https://nonsa.pl/). In the end, we hacked the extension's code to load a message instead of an image (you can find a code diff here).
I propose removing the image-based tutorial and replacing it with a piece of wikitext (a system message). That would still allow images to be used if someone really wanted them, and it would greatly simplify the tutorial code.
As for whether this solution works and fits in with the rest of the UI… it does, IMO: