Structured Discussions (erstwhile Flow) has stored HTML in the database since its inception. Parsoid's HTML has been upgraded and changed since then. Parsoid carries backward-compatibility (b/c) code to support html -> wt for older HTML versions, and it has also grown more lenient in what HTML it accepts. Even so, it would be useful to upgrade the stored HTML so that the b/c code can be removed from Parsoid without fear of breaking wikitext editing support on these older Flow posts.
In the future, once Flow's content has been upgraded, we could consider upgrading the stored HTML again periodically, whenever Parsoid updates its HTML version.
- SD currently does not store <head>, so the Parsoid HTML version number of a stored post isn't known. T209114: Store <head> (including Parsoid version number) for HTML Flow content will address that for the future. For now, though, any upgrade script will have to convert HTML -> wt and then reparse that wt -> HTML to upgrade the stored HTML.
- T148258: html2wt for links: Ignore data-parsoid and rel types more aggressively and generate the expected canonical forms is the bug that previous attempts to update the HTML tripped on. T148258#3738352 indicates that adding an https: prefix to hrefs might solve a large part of that problem.
- We will need to write a script to perform this upgrade of stored content (after backing up the content first).
- Test runs of this script will let us identify any other html -> wt -> html bugs that need addressing.
- Given that there are about 800K posts, we will need some mechanism for verifying that the upgrade didn't break anything.
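
The steps in the list above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual upgrade script: `html2wt` and `wt2html` are hypothetical callables standing in for whatever Parsoid conversion calls the script ends up using, `add_https_prefix` assumes the problematic hrefs are the protocol-relative ones (`//host/...`), and the round-trip stability check is just one possible verification mechanism.

```python
import re
from typing import Callable, NamedTuple

# Matches href="//..." or href='//...' (protocol-relative links).
# Assumption: these are the hrefs that need the https: prefix.
_PROTOCOL_RELATIVE_HREF = re.compile(r"""(\bhref\s*=\s*)(["'])//""")


def add_https_prefix(html: str) -> str:
    """Rewrite protocol-relative hrefs to explicit https: URLs."""
    return _PROTOCOL_RELATIVE_HREF.sub(r"\g<1>\g<2>https://", html)


class UpgradeResult(NamedTuple):
    post_id: int
    new_html: str
    stable: bool  # True if the new HTML serializes back to the same wikitext


def upgrade_post(
    post_id: int,
    old_html: str,
    html2wt: Callable[[str], str],  # hypothetical: stored HTML -> wikitext
    wt2html: Callable[[str], str],  # hypothetical: wikitext -> current HTML
) -> UpgradeResult:
    """Upgrade one post via HTML -> wt -> HTML and flag unstable round trips."""
    wikitext = html2wt(add_https_prefix(old_html))
    new_html = wt2html(wikitext)
    # Verification heuristic (an assumption, not a guarantee of correctness):
    # the upgraded HTML should convert back to the wikitext it was made from.
    stable = html2wt(new_html) == wikitext
    return UpgradeResult(post_id, new_html, stable)
```

With around 800K posts, a script along these lines would probably write only the stable results back and log the unstable post IDs for manual review, which is also one way the test runs could surface remaining html -> wt -> html bugs.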