Change Details

After doBlockLevels() sees an empty line, it starts hunting for a place to insert a <p> start tag. It does this by checking whether each line contains any of various open or close tags (div, blockquote, th, etc.). If it gets a match, it skips the line and continues on to the next line. If there is no match, it inserts the <p> at the start of the line.

With a <p> tag already open, the algorithm for </p> insertion is much the same in the opposite direction, i.e. the presence of div etc. forces immediate insertion of </p> at the start of the line (regardless of where in the line the actual tag was found).

This is not equivalent to an HTML parsing algorithm. As an example of the mischief this can cause, consider the input:

```
aaaa

<div></div><span id="1">
bbbb
</span>
<span id="2">
cccc
<div></div>
</span>
```

This produces the output:

```
<p>aaaa
</p>
<div></div><span id="1">
<p>bbbb
</span>
<span id="2">
cccc
</p>
<div></div>
<p></span>
</p>
```

The empty line following "aaaa" sets a flag, and the first empty div suppresses insertion of the <p> on that line, so the "bbbb" line gets the <p>. The second empty div then triggers insertion of </p> before it. Thus the paragraph straddles the boundary between the first two spans. The final </span> does not suppress anything, so it gets wrapped in its own paragraph.

This is a reduction of the Rachael Flatt test case on [[https://www.mediawiki.org/wiki/Parsing/Replacing_Tidy|Parsing/Replacing Tidy]] and so blocks T89331, since Depurate handles such broken HTML differently to Tidy.
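The line-by-line behaviour described above can be modelled with a short Python sketch. This is a simplified illustration, not the doBlockLevels() implementation: the function name `wrap_paragraphs`, the `BLOCK_TAG` pattern and its tag list are assumptions made for the example, and the sample input assumes one item per line with a blank line after "aaaa".

```python
import re

# Hypothetical tag list for illustration; the real list in doBlockLevels() differs.
BLOCK_TAG = re.compile(
    r"</?(?:div|blockquote|table|tr|td|th|ul|ol|li|pre|h[1-6])\b", re.IGNORECASE
)


def wrap_paragraphs(text):
    """Naive line-based paragraph wrapping, modelled on the description above."""
    out = []
    in_para = False  # True while a <p> is open
    for line in text.split("\n"):
        if BLOCK_TAG.search(line):
            # A block tag anywhere on the line closes any open paragraph
            # before the line, and no <p> is ever opened on this line.
            if in_para:
                out.append("</p>")
                in_para = False
            out.append(line)
        elif line.strip() == "":
            # An empty line ends the current paragraph; the next suitable
            # line will receive a fresh <p>.
            if in_para:
                out.append("</p>")
                in_para = False
        else:
            # No block tag found: open a paragraph at the start of the line.
            if not in_para:
                line = "<p>" + line
                in_para = True
            out.append(line)
    if in_para:
        out.append("</p>")
    return "\n".join(out)


SAMPLE_INPUT = "\n".join([
    "aaaa",
    "",
    '<div></div><span id="1">',
    "bbbb",
    "</span>",
    '<span id="2">',
    "cccc",
    "<div></div>",
    "</span>",
])

if __name__ == "__main__":
    print(wrap_paragraphs(SAMPLE_INPUT))
```

Because the decision is made per line rather than per element, the sketch never notices that a <span> is still open when it emits <p> or </p>, which is how the paragraph ends up straddling the two spans.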