Page MenuHomePhabricator

RFC: API for retrieval and saving of top-level HTML elements / sections by element ID
Closed, ResolvedPublic0 Estimated Story Points

Description

Section-level content retrieval and editing has the potential to speed up typical read and edit use cases, which is especially important on mobile. As discussed in T87556, a reasonable granularity for this would be top-level elements (direct children of <body>) or sections. Since the concept of sections is still a bit fluid and might end up being marked up as <section> elements, we decided to focus on the element retrieval portion first, without making assumptions on whether those elements are <section> elments or paragraphs, headings or tables.

Implementation sketch for the retrieval portion

  • mark top-level elements specially in Parsoid, in a way that's easy to match with a regexp; use that in RB to build up an offset index of element id to HTML byte range, and store it
    • alternative to regexps: build an offset map while serializing in Parsoid & return that in the pagebundle (possibly in the XMLSerializer)
  • In RESTBase: given a title, revision and section id(s), use the offset index to extract the requested sections from the stored HTML, and return them in a JSON blob with one entry per section.
    • URI layout (draft): /page/html/{title}?sections=mwAWI,mwAWM for latest revision, /page/html/{title}/123456?sections=mwAWI,mwAWM for a specific revision.

Implementation sketch for the save portion

  • Client sends a 'changeset' JSON blob with an array of element replacements:
{
  changes: [
      { 
         id: "mwXXOriginalId",
         replace: "<p id='mwXXOriginalId'>Modified content</p><p>Possibly new content</p>"
      },
      {
         id: "mwXXOtherId",
         prepend: "<p>Some content</p>"
      },
      {
         id: "mwXXOtherId",
         append: "<p>Some content</p>"
      }
  ]
}

Alternatively, we could also support replace semantics only, at the cost of needing to send possibly untouched adjacent sections back:

{
  "mwXXOriginalId": "<p id='mwXXOriginalId'>Modified content</p><p>Possibly new content</p>",
  "mwXXOtherId": "<p>Some content</p><p id='mwXXOtherId'>Possibly unchanged original content for element</p>"
}

We'll also need a special ID to support editing blank pages / replacing the entire content.

  • RB or Parsoid extracts the original HTML for each section, and converts it using the original data-parsoid to wikitext
  • operations are applied by concatenating new section's wikitext with slices of original wikitext based on original DSR offsets for adjacent sections, resulting in a full wikitext revision
  • RB returns this wikitext to the client, or directly saves the new revision on behalf of the client. The latter could be faster & more convenient. RESTBase could start a parse back to HTML in parallel with saving, reducing the time it takes to have fully parsed HTML in storage right after an edit. Minor complication is the need to check for templates parametrized on revision id.

Other remarks

  • Element-based changesets look like they could possibly be a building block of a simple collaborative editing solution. Finer-grained changes to an element could also be more compactly described in an object describing the transforms relative to the original content.

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

@Catrope, @TrevorParscal, @Esanders, @Jdforrester-WMF: We are getting ready to start work on this. It would be great to get your input on the API design. Does the changeset representation make sense to you? Do you prefer replace semantics only?

How would this look with multiple sections inserted at a time. In terms of performance, we've already thrown away the majority of the data, so I'm not too worried about including an extra paragraph. Ultimately I think it would be better if the API was simple to implement.

@Esanders, agreed on simplicity and limited impact of including the original paragraph / section. Multiple inserted sections would just be multiple sections in the replacement content.

We discussed this on IRC. Outcomes:

  • Use simple replace semantics for changesets. Simplicity is more important than a micro-optimization for the relatively rare append case.
  • Determine a magic ID for replacing the entire body (for edits on empty content). Candidates: mw0, mw_body; Requirements:
    • Cannot be in the body, even if users supply it
    • Needs to be a valid JSON key
GWicke renamed this task from RFC: API for retrieval and saving of top-level HTML elements by element ID to RFC: API for retrieval and saving of top-level HTML elements / sections by element ID.May 6 2015, 5:49 PM

Quick update:

  • As of last week, Parsoid now emits offsets of top-level sections keyed on id in data-parsoid. This lets us use cheap substring operations to get a top-level section in RESTBase, with the main cost being data-parsoid loading (~10ms) and JSON parsing (~16ms).
  • An alternative for finer grained section / content retrieval (potentially useful for use cases like Labeled Section Transclusion (T100139)) would be to parse the HTML. Timings for Obama (1.1M):
    • SAX parse via libxmljs (node) and no-op handlers: 64ms
    • XML DOM parse via libxmljs (node): 16ms
      • XPATH match for ID (ex: dom.find('//*[@id = "mw123"]')) : 15ms
      • XPATH match for class (ex: dom.find("//*[contains(concat(' ', normalize-space(@class), ' '), ' interlanguage-link ')]"): 34ms
    • HTML5 DOM parse via Mozilla's html5ever: 32ms
      • full round-trip with serialization: 60ms
    • HTML5 DOM parse via domino (node): 220ms

Patch implementing the section retrieval portion at https://github.com/wikimedia/restbase/pull/250

@mobrovac also implemented a wikitext save end point internally. The main missing bit is html to wikitext conversion with individual modified sections:

{
  "mwXXOriginalId": "<p id='mwXXOriginalId'>Modified content</p><p>Possibly new content</p>",
  "mwXXOtherId": "<p>Some content</p><p id='mwXXOtherId'>Possibly unchanged original content for element</p>"
}

Fast (as in latency) implementation option:

  • Serialize each modified section to wikitext. For this, we need to extract the original corresponding HTML and pass it to Parsoid, along with data-parsoid for the entire page.
  • Use the wikitext offsets for the remaining sections (sorted by offset) to combine the original wikitext of the revision with the serialized sections, and return it.

Simpler implementation option:

  • Replace each modified section in the original HTML (from storage), using the HTML section offset information
  • serialize the modified full HTML, original full HTML and data-parsoid as for a regular /transform/html/to/wikitext request, and return the result

Created a PR for html->wikitext translation with sections: https://github.com/wikimedia/restbase/pull/270

The PR has been merged. Let's keep this ticket open until we actually deploy it.

The fix has been deployed. Closing.

A small utility function to simplify working with this section API in full-document edit applications like VE is available at https://gist.github.com/gwicke/f012b257e74223055065fe23a359bdc8.

GWicke reopened this task as Open.EditedAug 12 2016, 4:00 PM

Re-opening to track the recent discussion about improving full-document editing.

The VE team would like to avoid needing to serialize unmodified sections of the document. This means that an "append after" operation that does not require supplying unchanged content would be desirable.

In practice, this could look like this:

{
   mwABC: '<p>Modified section, as in current scheme</p>',
   mwBBB: {
      append: '<p>New content, inserted after mwBBB</p>'
   }
}

Each section operation can either be

  • a string replacing the pre-existing content of a section, or
  • an object describing an 'append' operation after the given section.

With this in place, the diff function posted above will not need content for unmodified sections any more, and we can take the presence of content as indication of modification, instead of the boolean modified flag.

If we're extending the schema, just implement the outline mode we proposed. It is completely trivial for the client and server to implement:

[
   { id: 'mwBBB' },
   { id: 'mwAAA' },
   { content: 'New' }
]

This can be computed in a single pass at convert time, by simply checking if a node has been flagged as modified. I have both server and client code written and tested end-to-end on my machine. Anything else is trying to fit a square peg in a round hole. We've already sunk far too much time into this.

Here is the small change to the RESTBase schema required to support it: https://github.com/wikimedia/restbase/pull/650/files

And the patch to the VE converter to skip over unmodified content: https://gerrit.wikimedia.org/r/#/c/303441/8/modules/ve-mw/dm/ve.dm.MWConverter.js

@Esanders: We are trying to keep our API surface small and orthogonal. Having two times two APIs (transform and save) for section editing is not very orthogonal, and complicates the API for all users.

To illustrate why small APIs matter, imagine a request from a user with a frequent need to bold an entire paragraph, and asking you to add a "bold paragraph" button. While it is clearly a win for the user to avoid the need to select the paragraph first, you might still decide not to add a second button, as the confusion for other users from having two buttons does not necessarily outweigh the relatively minor saving for the specific user.

So, if you want to make the case for a second API, then it would be helpful if you could document why the inconvenience of using a single API is so large as to outweigh the cost of adding a second API.

@Esanders: We are trying to keep our API surface small and orthogonal. Having two times two APIs (transform and save) for section editing is not very orthogonal, and complicates the API for all users.

You don't have the two times two problem if you don't add two different endpoints for the two different formats. You could have one endpoint that takes either the existing replace format or Ed's outline format, provided they are easy to distinguish (array vs object is already a distinguishing property, but perhaps you'd like something more elegant).

You could have one endpoint that takes either the existing replace format or Ed's outline format, provided they are easy to distinguish (array vs object is already a distinguishing property, but perhaps you'd like something more elegant).

This doesn't avoid the problem of needing to document two different APIs, with different request formats. Trying to do this in a single entry point tends to be harder than in two separate ones. For example, swagger-ui can display example requests, but this only works well for a single request format.

Getting it to work with the documentation tool of the day is a relatively minor issue. I'm sure we can work around that.

@Esanders: Paths are not what is expensive here. The expensive part is documenting two different request formats with overlapping use cases, as well as when to use each of them.

That is not expensive. What's expensive is spending a whole week trying to shoe horn our use case into an exsiting API that was designed for something else - for the sake of having marginally simpler documentation. We've already sunk far too many person hours into this. Let's just merge our code and move on - VE (and I'm sure Services) have far more important things to work on.

@Esanders: Clearly we have different perspectives and priorities. I am trying to figure out which underlying problems you are trying to solve & how we can fit them into the larger API. You have a specific solution in mind, and are reluctant to describe the underlying problems, and consider alternative solutions. As a result, this takes longer than necessary.

We talked about doing this by changing the API so that the replace value is no longer a string, but instead a list, each element of which is either a string or an { id: 'mwXXX' } object, like this:

{
    mwABC: [ 'foo', { id: 'mwDEF' }, 'bar', ... ],
    mwBBB: ...,
}

In this example, the prior text for section 'mwABC' is to be replaced by 'foo' + ( the prior text for section 'mwDEF' ) + 'bar' .

This means an append can be expressed as

{ mwXXX: [ { id: 'mwXXX' }, 'text to be appended' ] }

A prepend can be expressed as

{ mwXXX: [ 'text to be prepended', { id: 'mwXXX' } ] }

A move of mwYYY to before mwXXX can be expressed as

{ mwXXX: [ { id: 'mwYYY' }, { id: 'mwXXX' } ], mwYYY: [] }

A move of mwYYY to after mwXXX can be expressed as

{ mwXXX: [ { id: 'mwXXX' }, { id: 'mwYYY' } ], mwYYY: [] }

A deletion of several sections must be expressed as blanking each such section:

{ mwAAA: [], mwBBB: [], mwCCC: [], ... }

Thank you, @dchan. That's an excellent summary.

Next steps:

Services will:

  • Implement array syntax & reference support in section edit API.
    • Most of us are leaning towards deprecating & removing plain string support, so only arrays would be allowed for simplicity & generality. The API is marked as unstable, but we are not aware of any actual section API users, so this should not affect anybody. Input on whether to remove plain strings would be appreciated.
  • Update [the sectionDiff utility function](https://gist.github.com/gwicke/f012b257e74223055065fe23a359bdc8) to emit this format, given
    • an array of original section ids, and
    • an array of new section ids / strings: [{id: 'a'}, 'foo', {id: 'c'}]

Once this is ready, the function can be hooked up in VisualEditor, and saves will only transmit changed content over the wire.

  • an array of new section ids / strings: [{id: 'a'}, 'foo', {id: 'c'}]

And to clarify, the IDs can be reordered, so [{id: 'b'}, {id: 'a'}] is valid.

GWicke raised the priority of this task from Medium to High.Aug 19 2016, 3:54 PM
GWicke added a project: Services-next.

To clarify - what happens if one of the sections from the old document is not present as a key in the section replacement object? Is it preserved or removed?

@Pchelolo, it is removed. The semantics are "replace original section xyz with this new content".

T143356 discusses how section editing could cleanly support separate data-mw retrieval / editing. Please have a look, especially at the last comments.

Concretely, with data-mw support section edit requests would look like this:

{
  mwABC: {
    html: ['Foo', { id: 'mwABC' }],
    "data-mw": { /* data-mw for new content, or empty object if all inline */ }
  }
}

@Esanders, @dchan: Any objections from your side against moving straight to the { html: [ ... ] } format, so that we can support separate data-mw as well?

I had a brief in-person chat about moving straight to { html: [ ... ] } with @Esanders, and he saw no issue with doing so from the VE perspective. Now waiting for the Parsoid team feedback on T143356 before we can tackle the public section API change.

BTW, as we're changing the API in the backwards incompatible way, let's also tackle T116336 and accept application/json instead of weird formData and json combinations

I'm working on the client-side code at https://github.com/gwicke/parsoid-dom-utils/blob/master/lib/section_diff.js.

We also decided to vary the section replacement syntax slightly, using an array again: T143356#2580945

Let us know if you have concerns about this change.

Per IRL discussion - we either need VirtualRestService support for the JSON API, or we need the new API to support form data encoding.

Per IRL discussion - we either need VirtualRestService support for the JSON API, or we need the new API to support form data encoding.

PR for supporting both: https://github.com/wikimedia/hyperswitch/pull/59

Personally I don't think it makes sense to have just one of the APIs be JSON when the rest are form encoded, just because the main parameter is a JSON string. It means that identical parameters, e.g. scrub_wikitext, are passed in different ways between two endpoints, which I think will be confusing for future clients.

AFAICT the issue is with documentation/validation, but we shouldn't break our API because of the limitations of our current tools.

Change 308014 had a related patch set uploaded (by Ppchelko):
MultiHTTPClient: Support sending JSON POST requests.

https://gerrit.wikimedia.org/r/308014

Hm, my thinking is really moving back and forth on this.. On one hand it's weird to have JSON embedded inside form-encoded param. But on the other hand, when we introduce separate data-mw field to the other endpoints - we will eventually have the same problem anyway since data-mw is JSON and we're not going to change other endpoint accept types to JSON since they're used. @GWicke what do you think?

@Pchelolo, I commented on the more general question of JSON / form data interaction at T111748#2603225.

@GWicke So, having read that I would say that the best/fasted way to unblock VE Team is to merge my PR that would allow supplying both JSON and form-data and continue the conversation

@Pchelolo, if we go with a form-primary model, then I would propose to change the swagger spec to require a changes object containing the section diffs directly (without another changes entry inside the JSON). We wouldn't have a body parameter any more, so the hyperswitch coercion patch wouldn't be needed in its current form. JSON submission would still be supported, using the request format as documented right now.

Yesterday I've been trying a lot of options and I think that status quo plus PR https://github.com/wikimedia/hyperswitch/pull/59 is the best we can do right now without investing a lot of time in sagger-ui improvement.

With this we get validation support, more-or-less nice doc UI, support for both form-data and JSON.

When we introduce data-mw parameter to other transform endpoints we can seamlessly switch them to JSON without breaking existing clients because form-data will also be supported.

Yesterday I've been trying a lot of options and I think that status quo plus PR https://github.com/wikimedia/hyperswitch/pull/59 is the best we can do right now without investing a lot of time in sagger-ui improvement.

With this we get validation support, more-or-less nice doc UI, support for both form-data and JSON.

When we introduce data-mw parameter to other transform endpoints we can seamlessly switch them to JSON without breaking existing clients because form-data will also be supported.

This has now been merged, and should go out next week.

The change was deployed, so now the section-changes transform endpoint consumes both application/json and form-data. Resolving.

Per discussion IRL, each section needs to support specifying id and html, mostly for data-mw preservation.

@Esanders The PR fixing that is merged: https://github.com/wikimedia/restbase/pull/664 so the current RB master should work for your local testing. We will deploy it later this week as we accumulate more changes (or I can deploy it ASAP if you need)

PR deployed, so it should work right now, resolving again..

Change 308014 abandoned by Ppchelko:
MultiHTTPClient: Support sending JSON POST requests.

Reason:
Not needed any more

https://gerrit.wikimedia.org/r/308014