Page MenuHomePhabricator

Upgrade Parsoid HTML stored in the StructuredDiscussions tables
Open, MediumPublic

Description

Structured Discussions (erstwhile Flow) has had HTML stored in the database since its inception. Parsoid's HTML has upgraded and changed since then. While Parsoid has had backward compatibility code to support html -> wt for older versions, and has also grown more lenient in what HTML it expects, it would be useful if the HTML is upgraded so that Parsoid code can remove the b/c code without fear of breaking wikitext editing support on these older flow posts.

In the future, once Flow's content has been upgraded, whenever Parsoid updates its HTML versions, we could consider periodically upgrading the stored HTML.

Some notes based on a conversation between @ssastry, @Catrope and @SBisson:

Event Timeline

ssastry triaged this task as Medium priority.Nov 8 2018, 10:16 PM
ssastry created this task.
Restricted Application added a subscriber: Aklapper. · View Herald Transcript

@ssastry To aid in our prioritization/triage: roughly when would you like this to be done?

@ssastry To aid in our prioritization/triage: roughly when would you like this to be done?

Ideally, before we begin our PHP port in earnest. Is end of Jan 2019 feasible?

Change 500784 had a related patch set uploaded (by Catrope; owner: Catrope):
[mediawiki/extensions/Flow@master] Add maintenance script to reserialize revision content

https://gerrit.wikimedia.org/r/500784

Change 500784 merged by jenkins-bot:
[mediawiki/extensions/Flow@master] Add maintenance script to reserialize revision content

https://gerrit.wikimedia.org/r/500784

@JTannerWMF It would be helpful for the Growth Team to prioritize this work and set up a process for running upgrades of Parsoid HTML. On the Parsoid end, we are beginning to accumulate a lot of code cruft hanging on to code to support these old versions and we are going to be making more changes in the future. So, it is time for StructuredDiscussions stored content to have a reliable process for upgrading stored content. The Parsing Team will fix any bugs in Parsoid's HTML -> WT conversion that is currently blocking this work. I normally chat with @Catrope directly but I figured it might be useful to actually carve out from space at the team level for this work.

@ssastry -- thanks for bringing this back up. Our team is going to talk about when we can fit this into our work, but I wanted to ask for more specifics around when it is needed for your work. Is there a deadline you can give us from your perspective?

Ideally end-Jan 2021, but probably no later than Q3.

@Catrope -- I'm untagging the Growth team on this but leaving you assigned, because I think this task follows you even as you've moved to a new team. Does that sound right?

marcella added a project: Growth-Team.
marcella added a subscriber: marcella.

Sorry for the miscommunication about this ticket, but for now StructuredDiscussions maintenance remains part of the Growth team responsibilities, and isn't part of the Vue team work. For that reason I'm re-assigning this to @kostajh and adding it back to the Growth team boards. If you've got any questions, please let me know, and thanks a lot for helping to sort this out.

Here are some notes and questions. Mostly questions so far 😄

In the future, once Flow's content has been upgraded, whenever Parsoid updates its HTML versions, we could consider periodically upgrading the stored HTML.

@ssastry I'm not sure if I understand the "we could consider periodically upgrading" part of this sentence. This task overall is about migrating older Parsoid HTML to the latest format (side note: is that 2.0.0 currently?). Are you saying that after this initial upgrade is done, conceivably we would need to run the upgrade script again on a periodic basis as the Parsoid HTML version changes from e.g. 2.0.0 to 2.0.1 or 2.1.0? Or is this task about a one-off migration?

SD currently does not store <head> => Parsoid HTML version number isn't known. T209114: Store <head> (including Parsoid version number) for HTML Flow content will address that for the future. But, for now, any upgrade script will have to convert HTML -> wt and reparse that wt -> HTML to upgrade stored HTML.

So, a few months after this task (T209120) was authored, T209114: Store <head> (including Parsoid version number) for HTML Flow content was completed and that should be working in production -- T209114 ensures that all new Parsoid HTML written/read by Flow is in the new format and includes the Parsoid version number.

Confirming then that the idea now is to target Flow content en masse prior to the patch for T209114 being deployed in Feb 2019, and our maintenance script should do essentially the same work that was implemented in T209114 but as a batch process on the server-side.

T148258: html2wt for links: Ignore data-parsoid and rel types more aggressively and generate the expected canonical forms is the bug that previous attempts to update HTML tripped on. T148258#3738352 indicates adding https: prefix to hrefs might more or less solve a large part of that problem.

I'm confused about what we need to do here. Is it adding the https:// prefix, or the idea mentioned in the task description of "As for as Flow is concerned, one solution that Flow could adopt would be to strip data-parsoid and rel info entirely from links and Parsoid does the right thing just by analyzing the href." Our maintenance script converts from html -> wikitext -> html, so stripping data-parsoid and rel-info would be done implicitly by that process. Or are you saying that html -> wikitext -> html is not going to satisfy the task description, and instead we need to do (old) html -> (new) html conversion with Parsoid without going through wikitext?

We will need to write a script to perform this upgrade of stored content...

Growth team wrote a script for this in 2018, that takes old Flow Parsoid HTML content, converts to wikitext, and converts back to the latest Parsoid HTML. I don't see anything about https prefixing in that patch, though, so I guess that's still something we need to do?

... (after backing up the content first).

This seems tricky. Has this type of thing been done before? What does the process look like for backing up the content? Who would we coordinate with on that? How do we restore in the event of problems, and how long of a window do we have to detect problems? If we run the script and an hour later there are lots of (we'd need to quantify that) problems, we could restore the previous content, but what about a week or several weeks later?

Test runs of this script will let us identify any other html -> wt -> html bugs that need addressing.

Where should the test runs take place, and who will be responsible for QA on those test runs?

Given that there are about ~800K posts, we will need some mechanism of verifying that the upgrade didn't break anything.

I don't have any thoughts right now on how we would do that. @ssastry do you have suggestions?


Now for the questions @marcella asked me to think about:

what needs to be done?

From what I can tell, the order of operations is roughly:

  1. clarify the answers to the questions I've written in my comments here and update the task description
  2. determine a test and QA process for running the script and verifying its results
  3. determine a backup/restore process
  4. do the migration

what are the risks?

An insufficient test/QA process could result in data loss (missing content, formatting problems, etc) and (I think) potentially make Flow boards utilizing this HTML unusable.

An insufficient backup/restore process could mean the inability to recover from those problems.

Due to general unfamiliarity with our team on the internals of Flow and its Parsoid integration and our team's focus being concentrated in other areas of work, there's a higher risk of errors.

how long will it take?

Hard to say, unfortunately.

If the maintenance script we already have works perfectly, then we still need to figure out the protocols around deployment, as well as backup/restore, which might be complicated.

If the maintenance script test runs reveal problems, then we will need unknown amounts of time to fix issues as they come up.

As an aside, this is a pretty major context shift for Growth-Team, so making time to properly focus on this and understand the problem space will be challenging.

@ssastry does your team have any capacity to help us with this? The Contributors-Team -> Growth-Team maintains Flow for historical reasons, due to personnel who were heavily involved in its development, none of whom are on our team anymore. Specifically, looking at the existing maintenance script written for this task, and testing it out in --dry-run mode would be helpful, but really any assistance your team could provide would be most welcome.

Can we rely on the normal DB backups? Especially if we keep good track of what revisions were changed, as long those exist the original value can be recovered.
Would QA involve just looking at HTML diffs produced by the dry-run of the script, or would it happen on the actual converted Flow pages?

Here are some notes and questions. Mostly questions so far 😄

In the future, once Flow's content has been upgraded, whenever Parsoid updates its HTML versions, we could consider periodically upgrading the stored HTML.

@ssastry I'm not sure if I understand the "we could consider periodically upgrading" part of this sentence. This task overall is about migrating older Parsoid HTML to the latest format (side note: is that 2.0.0 currently?). Are you saying that after this initial upgrade is done, conceivably we would need to run the upgrade script again on a periodic basis as the Parsoid HTML version changes from e.g. 2.0.0 to 2.0.1 or 2.1.0? Or is this task about a one-off migration?

Parsoid HTML version will keep updating and ideally, the Flow stored version will keep up so we don't have to have version b/c for ever. So, this task is initially about the first migration, but also about having a process for running regular migrations in the future. My guess is that (a) this first migration will yield you a bunch of scripts and process that can be reused (b) the later migrations will be much simpler if done whenever Parsoid's HTML version changes -- this will greatly reduce problems like what we have now where we have a bigger pile of version changes that have accumulated over time.

SD currently does not store <head> => Parsoid HTML version number isn't known. T209114: Store <head> (including Parsoid version number) for HTML Flow content will address that for the future. But, for now, any upgrade script will have to convert HTML -> wt and reparse that wt -> HTML to upgrade stored HTML.

So, a few months after this task (T209120) was authored, T209114: Store <head> (including Parsoid version number) for HTML Flow content was completed and that should be working in production -- T209114 ensures that all new Parsoid HTML written/read by Flow is in the new format and includes the Parsoid version number.

Confirming then that the idea now is to target Flow content en masse prior to the patch for T209114 being deployed in Feb 2019, and our maintenance script should do essentially the same work that was implemented in T209114 but as a batch process on the server-side.

You should run the update on *all* content. Having the version number makes the job easier in some cases (for your script by knowing what changed) or for Parsoid where it inspects version numbers for specific b/c handling. You implemented T209114 to assist HTML version migration, but it doesn't eliminate the need for it.

T148258: html2wt for links: Ignore data-parsoid and rel types more aggressively and generate the expected canonical forms is the bug that previous attempts to update HTML tripped on. T148258#3738352 indicates adding https: prefix to hrefs might more or less solve a large part of that problem.

I'm confused about what we need to do here. Is it adding the https:// prefix, or the idea mentioned in the task description of "As for as Flow is concerned, one solution that Flow could adopt would be to strip data-parsoid and rel info entirely from links and Parsoid does the right thing just by analyzing the href." Our maintenance script converts from html -> wikitext -> html, so stripping data-parsoid and rel-info would be done implicitly by that process. Or are you saying that html -> wikitext -> html is not going to satisfy the task description, and instead we need to do (old) html -> (new) html conversion with Parsoid without going through wikitext?

So, I am suggesting that HTML -> WT conversion results are improved and will be more reliable if you strip data-parsoid & rel info from links before you post your original HTML to Parsoid for HTML -> WT conversion.

We will need to write a script to perform this upgrade of stored content...

Growth team wrote a script for this in 2018, that takes old Flow Parsoid HTML content, converts to wikitext, and converts back to the latest Parsoid HTML. I don't see anything about https prefixing in that patch, though, so I guess that's still something we need to do?

See above. I'll go back and look at T148258 if https:// addition is still needed.

... (after backing up the content first).

This seems tricky. Has this type of thing been done before? What does the process look like for backing up the content? Who would we coordinate with on that? How do we restore in the event of problems, and how long of a window do we have to detect problems? If we run the script and an hour later there are lots of (we'd need to quantify that) problems, we could restore the previous content, but what about a week or several weeks later?

@Tgr seemed to have some ideas there. We can help you figure out how to do QA on this. When we ported Parsoid from JS to PHP, we had a HTML comparison script to let us identify problems (and which it did) and we could perhaps adapt that for your needs. You may not be able to use the script as is and it probably need adaptation. @Catrope might have some thoughts around that.

Test runs of this script will let us identify any other html -> wt -> html bugs that need addressing.

Where should the test runs take place, and who will be responsible for QA on those test runs?

Perhaps @Catrope or @Tgr have thoughts on where these test runs should take place.

Given that there are about ~800K posts, we will need some mechanism of verifying that the upgrade didn't break anything.

I don't have any thoughts right now on how we would do that. @ssastry do you have suggestions?

See above.

As an aside, this is a pretty major context shift for Growth-Team, so making time to properly focus on this and understand the problem space will be challenging.

@ssastry does your team have any capacity to help us with this? The Contributors-Team -> Growth-Team maintains Flow for historical reasons, due to personnel who were heavily involved in its development, none of whom are on our team anymore. Specifically, looking at the existing maintenance script written for this task, and testing it out in --dry-run mode would be helpful, but really any assistance your team could provide would be most welcome.

We are happy to help but this effort is best driven by you all. @marcella: we might still need to borrow some of @Catrope's time - it may just be that he needs to review the strategy or consult at appropriate points but having him involved in some fashion might make everyone's life a bit easier.

Some brief additional comments:

@ssastry wrt the "https:// addition" and the "strip data-parsoid and rel info" it seems like this is code that could conceivably be part of parsoid. That is, if we're already running the text through in essentially html2html mode, we can have a special "flow conversion" flag that does some of these hacks for improved html2html output. Since some folks outside the org have also adopted Flow, it seems like these conversion scripts should be 'supported' to some degree through one release cycle at least so that all users of Flow can upgrade their HTML -- and then after that some of the older b/c compatibility hacks can be removed, and instead start using a more release-synchronized b/c policy --- ie, each MW/Flow/Parsoid release supports html2html upgrade from the most recent release (or the most recent LTS release) but b/c past that point can be removed.

WRT to the 'backups': presumably initially the scripts will run in a 'dry run' mode, where instead of writing the results back to the database the instead feed the output to an appropriate HTML comparison script, which reports summary statistics and any problems found. Once this process yields 'clean enough' results, you'd actually do the conversion. As long as the conversion shows up as an edit, it would get a separate revision #, and could be "undone" just by reverting to the previous revision. It should show up in most cases as a 'null edit'.

Of course we'd also have to query or track those reverts, since the b/c code can't be removed until/unless all the conversion edits "stick".

Some brief additional comments:

@ssastry wrt the "https:// addition" and the "strip data-parsoid and rel info" it seems like this is code that could conceivably be part of parsoid. That is, if we're already running the text through in essentially html2html mode, we can have a special "flow conversion" flag that does some of these hacks for improved html2html output.

We'll have to build that endpoint, but doable.

Agreed that HTML transformations that facilitate legacy HTML -> wikitext conversion should be part of Parsoid, not Flow.

When we ported Parsoid from JS to PHP, we had a HTML comparison script to let us identify problems (and which it did) and we could perhaps adapt that for your needs. You may not be able to use the script as is and it probably need adaptation. @Catrope might have some thoughts around that.

The goal with the Parsoid port was no difference in the output, so an automated test could detect problems. Here, changing the output is the goal. A script could generate diffs (the migration script is already able to do that in dry-run mode), but who would determine whether those are the expected diffs, and how?

Perhaps @Catrope or @Tgr have thoughts on where these test runs should take place.

I think the risky part here is the HTML transformation, and that can be easily tested in production, we just wouldn't store the result. We'd want to make a test run or two for the storage manipulation logic, but it doesn't need a whole lot of QA as it is conceptually fairly simple.

As long as the conversion shows up as an edit, it would get a separate revision #, and could be "undone" just by reverting to the previous revision. It should show up in most cases as a 'null edit'.

I think that leaving old revisions with outdated HTML would break undo functionality (Flow seems to do undo actions via three-way merge, so presumably the three affected revisions are converted to wikitext first). So either we remove that, or we have to convert each revision in-place (which is what the current script does).

This comment was removed by ssastry.

I removed that last comment since I wrote that without thinking enough about it. Will revisit and write later.