Page MenuHomePhabricator

Investigate technical needs for native mobile diffs, including supporting APIs
Closed, ResolvedPublicSpike

Description

[to be updated]

Event Timeline

Restricted Application changed the subtype of this task from "Task" to "Spike". · View Herald TranscriptJul 23 2019, 6:48 PM
JMinor renamed this task from Compare revision spike to Investigate technical needs for native mobile diffs, including supporting APIs.Jul 23 2019, 9:24 PM
JMinor triaged this task as Medium priority.

Some API notes so far:

The MediaWiki API we have to work with is https://www.mediawiki.org/wiki/API:Compare. This returns changes formatted inside an html blob with table rows and cell tags. This formatting is added in MediaWiki's TableDiffFormatter class. It looks like mobile web overrides this in it's extension with InlineDiffFormatter & InlineDifferenceEngine. I can't find an API endpoint support for inline however.

One other thing I've found is MediaWiki's ArrayDiffFormatter. This seems like it could spit out diff changes in a more structured format for us, but I don't think it's accessible from an API endpoint.

We could make an MCS endpoint that strips table formatting from the compare API response, but I think better would be to see if ArrayDiffFormatter can give us what we need and expose that. Also note there's a UnifiedDiffFormatter too.

First pass at a structured contract. This is based on the changes here.

Single-column mobile view:

also mocked up a double-column option in case we decide to take it that far:

I chatted with @JoeWalsh and we discussed a few things related to this:

  • There is no structured diff data API because of performance reasons. The diffing modules used in production are C++ code that generates HTML directly. Fun times.
  • I'm not sure if Special:MobileDiff has an API, I would assume it doesn't (but ping @phuedx). In any case, we can get the HTML using the fun apioutput skin: https://en.wikipedia.org/wiki/Special:MobileDiff/910663282?useskin=apioutput
    • For any 2 random range of revisions on inline diffs? To be investigated.
  • There is Visual_diffs that VE did. It is also available as a beta feature on desktop that you can enable on preferences. The way it works is by fetching two parsoid HTML revisions and computing the inline diff on the client itself. I'm assuming with JS. It may be worth going asking the Editing-team if the visual diffs feature is self contained enough that it could be API-ified or bundled in the App as a JS library to run on a webview (or maybe as an API like page/mobile-html). Ping @marcella

@Jhernandez thanks for your notes! I'm attaching my finalized contract anyways just for documentation purposes but understood that it won't work with the C++ code.

I looked at the Special:MobileDiff apioutput skin and I think I can glean most of what I need from it. The two things that I don't have are nice-to-have's in this order of importance:

  • A section title (or ID) for each change. This is to display to the user what section the change occurred in, in case the two context lines on either side do not show the section.
  • Extra context chunks to display to the user in a collapsible/expandable format. I figure this is the bigger ask but if it seems like something that's possible let me know and I can get more details on how many lines would be needed. The expectation is not to let the user view the whole article but just provide some extra context lines beyond the default 2.

I'm not sure if VE's visual diffs will work - the ask is to show a wikitext/markup diff rather than a diff of the rendered result.

It does seem like I'm able to get a range of revisions on inline diffs - I'm using this as an example (there's 1 revision in between): https://test.m.wikipedia.org/wiki/Special:MobileDiff/375751...399929?useskin=apioutput

  • There is no structured diff data API because of performance reasons. The diffing modules used in production are C++ code that generates HTML directly. Fun times.

Perhaps we could request that wikidiff2 be modified to output a data structure representing the diff and/or the HTML representing the diff, i.e. extract the HTML outputting into a Formatter class.

Unfortunately, I can find no such API.

Unfortunately, I can find no such API.

Fortunately, I was wrong!

I believe that the compare API will return a diff for a single revision or a revision range, e.g. using the revisions used in comments above):

Note well that the API outputs HTML.

Perhaps we could request that wikidiff2 be modified to output a data structure representing the diff and/or the HTML representing the diff

Data structure representing the diff would be preferable to HTML so we aren't parsing through HTML (or both but, seems like HTML is already covered in the compare API).

After chatting with @pheudx sounds like there's no guarantee that Special:MobileDiff will remain in place and unchanged so we shouldn't depend on it. Assuming we can't get a structured diff from anywhere, is it possible to get these nice-to-have's out of the compare API? Also tagging @Mholloway @Eevans @WDoranWMF in case they have thoughts.

  • A section title/ID for each change. This is to display to the user what section the change occurred in, in case the two context lines on either side do not show the section.
  • Extra context chunks to display to the user in a collapsible/expandable format. I figure this is the bigger ask but if it seems like something that's possible let me know and I can get more details on how many lines would be needed. The expectation is not to let the user view the whole article but just provide some extra context lines beyond the default 2.

@Jdlrobson also tagging Jon - it was pointed out to me that the work in https://phabricator.wikimedia.org/T117279 could help us here. We are working on supporting inline diffs in iOS - it looks like this would save us a step with pulling the separated diff from API:Compare and merging them to be inline. Is there any ETA on when https://phabricator.wikimedia.org/T117279 might make it into prod?

@Tsevener no plan right now. I'd love to work on this, but it's not on our product roadmap so far. Without that, this is reliant on me finding a partner in crime to help me during 10% time (with code review and pushing towards this goal) this might happen sooner. I'd also happily help out if there was a cross team initiative.

“if the visual diffs feature is self contained enough that it could be API-ified or bundled in the App as a JS library to run on a webview (or maybe as an API like page/mobile-html)”

The answer to that is the thing I’m supposed to be checking into re: visual diff performance on T229467. If it’s in a place where it can be put in a web worker then the answer is fairly-inherently “yes”.

My vote would be to use Visual Diffs if it can easily be made to work for you, and otherwise to look seriously at updating wikidiff2 as @phuedx suggested to break HTML styling out into a Formatter class. Regarding the latter, I would much rather see us invest our effort in modernizing the core platform than paper over its shortcomings with another Node.js service.

T117279 looks interesting as well and seems to have a lot of support. Without a deeper dive I'm not sure how that would be helped or hindered by updating wikidiff2. In general I'd prefer to see issues fixed as far upstream as possible.

Come to think of it, I vaguely remember @jlinehan poking around in wikidiff2 for some reason a little while back, so let's ping him for comment, too. :)

Just as a reminder, Toni mentioned:

I'm not sure if VE's visual diffs will work - the ask is to show a wikitext/markup diff rather than a diff of the rendered result.

So it seems like visual diffs are out for the moment.

The lack of a wikitext structured data diffs API is definitely a core platform deficiency. I'm uncertain as to how to proceed given none of the options are particularly great, and going in to a "make diffs better project" is a big project with lots of ramifications. I'm pinging @WDoranWMF just as a FYI, this is a need from the product teams and a gap on the core platform's APIs, maybe you can surface it to the PMs in your team.

@Mholloway @Jhernandez I have been exploring adding the wikidiff2 c++ engine in our Xcode project and it seems to be working well so far. It's not the best long-term solution but it does allow me to change the formatting without cross-platform ramifications, and I have access to the inline diff. I'm no C++ developer but if we decide to continue down this path maybe I could submit a PR to wikidiff2 of my tweaks and someone could take it from there? It's essentially a (hopefully temporary) fork local on the device so, another not-particularly-great solution.

@Tsevener are you tweaking it to output different HTML or to output structured data?

We did a spike some time ago, and found it non-trivial to make those diff engines return structured data.

@Jhernandez well, not really structured. I just replaced the html with json formatting but it's all still being passed around through the same execute method that returns a String. App-side I'm able to instantiate an object from the string returned through a json serializer. Here's example from my InlineDiffJSON class vs InlineDiff.

InlineDiff:

void InlineDiff::printContext(const String & input)
{
	printWrappedLine("<div class=\"mw-diff-inline-context\">", input, "</div>\n");
}

InlineDiffJSON:

void InlineDiffJSON::printContext(const String & input)
{
    if (!noResults)
        result += ",";
    printWrappedLine("{\"type\": \"context\", \"text\": ", "\"" + escape_json(input) + "\"", ", \"highlight-ranges\": []}");
    noResults = false;
}

If the input contained a \" it corrupted the json hence the need for an escape_json method. Anyways it probably won't fly for an API, I was going for the bare minimum changes in the gateway classes to the diff engine.

We can have a REST API which delivers a JSON diff produced by @Tsevener's patch running on the server side, if that's what you need. For performance reasons, structured data isn't available in PHP, but as @Tsevener has evidently noticed, structured data is available in C++.

Alternatively, you could continue to run wikidiff2 on the client side, but it's potentially very slow.

@Tsevener wrote:

A section title/ID for each change. This is to display to the user what section the change occurred in, in case the two context lines on either side do not show the section.

Traditionally we've provided line numbers, which are not shown anywhere else and are obviously not ideal for the user. If we develop a feature to map line numbers to section names in PHP, and a wikidiff2 feature to accept such a mapping and use it for output formatting, then we should probably aim to use that feature for all output targets, not just the iOS app, to make best use of that investment. It's not like desktop users understand line numbers any better than iOS users.

Alternatively, you could try to interpret the line numbers on the client side, after unserializing the JSON.

Extra context chunks to display to the user in a collapsible/expandable format.

Assuming you have string offsets or line numbers available on the client side, as well as the revision text, implementing this should be pretty straightforward.

Would you want a REST API that bundles the complete wikitext of the two revisions as well as the JSON diff?

@tstarling yeah, for our mocks the revision compare screen shows line numbers and the single revision changes screen just shows section titles. I figured it couldn't hurt to have both in an API response but if section title is a performance hit maybe we can pass in a parameter for whether or not you return the section titles?

After more talking with the team it sounds like 2 context lines around a block of changes is all that's needed which seems to be how the engine is already set up. So returning the complete revision wikitexts won't be necessary, all we would need is the json diff.

@Eevans @WDoranWMF Here's a new iteration of what a structured inline response could look like. We're trying to support moved paragraphs and phrases in this one, which complicates things a bit so let me know if anything is unclear. Inspiration for moved phrases came from here.

Note I have "type" key values as strings for diff line objects and highlight objects but they'd come from a strict list of types so would probably be better as an Int / enum type. But keeping them strings here just for clarity's sake.

If it helps here's the order of line diff objects that I tried to capture in this example:

{context section title line}
{context empty line}
{delete empty line}
{context populated line}
{context empty line}
{change line with both add, delete, and move phrase ranges within}
{context empty line}
{add populated line}
{move source line}
{context populated line}
{move destination line }
{context empty line}
{context populated line}

Whoops, sorry @Eevans for mis-tagging you.

@Tsevener It should be possible for us to integrate your updated wikidiff2 server side so we can serve via the new API. What are the next steps from your side/what's a likely timeline for it to be available?

@WDoranWMF there were still a few things left for me to tackle based on the latest contract (sectionTitle, moving paragraphs, restoring 2 context lines, maybe moving phrases if it's simple). I expect I could have a PR ready later next week with it. Would my approach mentioned here work? That is just replacing the html string output with an escaped JSON string output. Can't say I have the ability or time in the near future to rework the engine to output a structured object. So if that's necessary my work might not be salvageable.

@Tsevener @tstarling is happy to review the approach during our current sprint(this week and next) if you can ping us when the PR is available and we can go from. Does that work for you?

@WDoranWMF @eprodromou PR has been made. I am tracking it with this task - https://phabricator.wikimedia.org/T232231. I wasn't sure how to test this outside of Xcode so some of the changes (hhvm, php-named classes) were made blindly, and I see that tests are failing already. But hopefully this is an okay start. I am not handling moving phrases yet but it seemed doable after looking at the code again. I might be able to do a followup with that depending on bandwidth next week.

@Tsevener sorry for delay in replying. I've asked @tstarling to give input and added him as a reviewer.