We are getting ready to integrate Parsoid with RESTBase. This task tracks our progress and what's left to do.
Generally, we need a way to communicate multiple MIME entities, both in POST requests to Parsoid and in responses from Parsoid back to RESTBase. We could use `multipart/related` ([RFC 2387](https://tools.ietf.org/html/rfc2387)) or `multipart/form-data` ([RFC 2388](https://www.ietf.org/rfc/rfc2388.txt)), but both seem cumbersome, especially for the response. Instead, we can use a simple JSON encoding whose structure is otherwise essentially identical to MIME.
The possible downside of JSON is efficiency: it requires considerably more escaping, especially for HTML embedded in JSON. However, we could address that later by optionally using a binary encoding like [msgpack](http://msgpack.org/) without changing the message structure itself.
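To get a rough feel for the escaping cost, here is a minimal sketch that counts the extra bytes `JSON.stringify` adds to an HTML string. The helper name `jsonOverhead` and the sample markup are illustrative assumptions, not part of RESTBase or Parsoid.

```javascript
// Measure how much larger an HTML string becomes once embedded in JSON.
// `jsonOverhead` is a hypothetical helper used only for illustration.
function jsonOverhead(html) {
    const raw = Buffer.byteLength(html, 'utf8');
    const encoded = Buffer.byteLength(JSON.stringify(html), 'utf8');
    // Subtract the two surrounding quotes so only escaping is counted.
    return { raw, encoded, escapedBytes: encoded - raw - 2 };
}

// Every double quote and newline in the markup costs one extra byte.
const sample = '<a rel="mw:WikiLink" href="./Foo">Foo</a>\n';
console.log(jsonOverhead(sample));
```

Attribute-heavy HTML pays this cost on every quoted attribute, which is why a binary encoding might be worth revisiting later.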
Another option to consider would be to flatten the structure by using names like `original.html`. This would make it easy to encode the same message as `multipart/related`.
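As an illustration of the flattening idea, the sketch below turns a nested message into a map keyed by dotted names such as `original.html`. The function names `isPart` and `flattenMessage`, and the rule that anything with `headers` and `body` counts as a part, are assumptions made for this example.

```javascript
// Returns true if `value` looks like a MIME-style part: { headers, body }.
function isPart(value) {
    return value !== null && typeof value === 'object'
        && 'headers' in value && 'body' in value;
}

// Flatten a nested message into a map from dotted names to parts or scalars.
// Parts are kept whole; only the containers around them are flattened.
function flattenMessage(message, prefix = '') {
    const flat = {};
    for (const [key, value] of Object.entries(message)) {
        const name = prefix ? `${prefix}.${key}` : key;
        const isContainer = value !== null && typeof value === 'object'
            && !Array.isArray(value) && !isPart(value);
        if (isContainer) {
            Object.assign(flat, flattenMessage(value, name));
        } else {
            flat[name] = value;
        }
    }
    return flat;
}

const htmlType = 'text/html;profile=mediawiki.org/specs/html/1.0.0';
const message = {
    html: { headers: { 'content-type': htmlType }, body: 'the modified HTML' },
    original: {
        revid: 12345,
        html: { headers: { 'content-type': htmlType }, body: 'the original HTML' }
    }
};
console.log(Object.keys(flattenMessage(message)));
// prints [ 'html', 'original.revid', 'original.html' ]
```

Each flattened name could then serve as the name of one entity in a `multipart/related` message.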
== On-demand generation of HTML and data-parsoid ==
Request to RESTBase: `GET /{domain}/v1/page/{name}/html/{revision}`
RESTBase checks whether the HTML for that revision is found in storage. If not, it asks Parsoid to generate it.
For a normal GET request, RESTBase checks whether the predecessor revision is found in storage (currently the predecessor revision is passed in an x-parsoid header, but we could also try the pagecontent revision table). If it is, RESTBase retrieves that revision's data and posts it to Parsoid:
```
POST /v2/{domain}/html/{name}/{revision}
{
  previous: {
    revid: 12345, // The previous revision ID
    html: {
      headers: {
        'content-type': 'text/html;profile=mediawiki.org/specs/html/1.0.0'
      },
      body: "the original HTML"
    },
    'data-parsoid': {
      headers: {
        'content-type': 'application/json;profile=mediawiki.org/specs/data-parsoid/0.0.1'
      },
      body: {}
    }
  }
}
```
For a `no-cache` request, RESTBase instead first checks whether the *current* revision is found in storage. If it is, it sends the data for that in the `original` key:
```
POST /v2/{domain}/html/{name}/{revision}
{
  original: {
    revid: 12345, // The original revision ID
    html: {
      headers: {
        'content-type': 'text/html;profile=mediawiki.org/specs/html/1.0.0'
      },
      body: "the original HTML"
    },
    'data-parsoid': {
      headers: {
        'content-type': 'application/json;profile=mediawiki.org/specs/data-parsoid/0.0.1'
      },
      body: {}
    }
  }
}
```
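The choice between the `previous` and `original` keys could be sketched roughly as follows. This is not RESTBase's actual code: `buildParsoidRequestBody`, the `Map`-based `storage` stand-in, and the empty body sent when nothing is stored are all assumptions for illustration.

```javascript
// `storage` is a hypothetical in-memory stand-in for RESTBase storage:
// a Map from revision ID to that revision's html and data-parsoid parts.
function buildParsoidRequestBody(storage, { revision, previousRevision, noCache }) {
    // no-cache requests reuse the current revision under `original`;
    // normal requests pass the predecessor under `previous`.
    const key = noCache ? 'original' : 'previous';
    const revid = noCache ? revision : previousRevision;
    const stored = storage.get(revid);
    if (!stored) {
        // Assumption: with nothing in storage, Parsoid gets an empty body.
        return {};
    }
    return {
        [key]: {
            revid,
            html: stored.html,
            'data-parsoid': stored['data-parsoid']
        }
    };
}

const storage = new Map([[12345, {
    html: {
        headers: { 'content-type': 'text/html;profile=mediawiki.org/specs/html/1.0.0' },
        body: 'the original HTML'
    },
    'data-parsoid': {
        headers: { 'content-type': 'application/json;profile=mediawiki.org/specs/data-parsoid/0.0.1' },
        body: {}
    }
}]]);

// Normal GET for revision 12346 whose predecessor 12345 is in storage.
console.log(Object.keys(buildParsoidRequestBody(storage,
    { revision: 12346, previousRevision: 12345, noCache: false })));
```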
This entry point returns both HTML and data-parsoid in one JSON blob, which RESTBase stores in the html and data-parsoid buckets and also returns to the client.
Example response from Parsoid:
```
{
  revid: 12346, // The new revision ID (maybe?)
  html: {
    headers: {
      'content-type': 'text/html;profile=mediawiki.org/specs/html/1.0.0'
    },
    body: "the new HTML"
  },
  'data-parsoid': {
    headers: {
      'content-type': 'application/json;profile=mediawiki.org/specs/data-parsoid/0.0.1'
    },
    body: {}
  }
}
```
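Splitting such a response into per-bucket entries might look like the sketch below. The function `splitResponse` and the shape of the stored entries (`revid`, `contentType`, `body`) are hypothetical; the real bucket schema is not specified here.

```javascript
// Split a Parsoid response into one entry per storage bucket.
// The entry shape is an assumption made for this example.
function splitResponse(response) {
    const buckets = {};
    for (const name of ['html', 'data-parsoid']) {
        const part = response[name];
        if (!part || !part.headers || part.body === undefined) {
            throw new Error(`Parsoid response is missing the ${name} part`);
        }
        buckets[name] = {
            revid: response.revid,
            contentType: part.headers['content-type'],
            body: part.body
        };
    }
    return buckets;
}

const buckets = splitResponse({
    revid: 12346,
    html: {
        headers: { 'content-type': 'text/html;profile=mediawiki.org/specs/html/1.0.0' },
        body: 'the new HTML'
    },
    'data-parsoid': {
        headers: { 'content-type': 'application/json;profile=mediawiki.org/specs/data-parsoid/0.0.1' },
        body: {}
    }
});
console.log(buckets.html.contentType);
```

Failing loudly on a missing part keeps a half-complete response from being stored in only one of the two buckets.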
=== Status ===
- Parsoid: Slightly simpler format done & deployed. Needs to be updated to indicate the MIME types as documented above.
- RESTBase: Prototype implementation.
  - should pass to Parsoid:
    - previous revision's HTML + data-parsoid OnEdit
    - current revision's HTML + data-parsoid OnDependencyChange
== html2wt conversion support ==
Request to RESTBase: `POST /{domain}/v1/transform/html/to/wt`. For an existing page, title and revision are passed: `POST /{domain}/v1/transform/html/to/wt/{title}/{revision}`
RESTBase retrieves the original HTML, wikitext & data-parsoid from storage & POSTs them to Parsoid:
```
POST parsoid/v2/{domain}/wt/{title}/{oldid}
{
  "html": {
    "headers": {
      "content-type": "text/html;profile=mediawiki.org/specs/html/1.0.0"
    },
    "body": "<html>The modified HTML</html>"
  },
  "original": {
    "revid": 12345,
    "wikitext": {
      "headers": {
        "content-type": "text/plain;profile=mediawiki.org/specs/wikitext/1.0.0"
      },
      "body": "the original wikitext"
    },
    "html": {
      "headers": {
        "content-type": "text/html;profile=mediawiki.org/specs/html/1.0.0"
      },
      "body": "the original HTML"
    },
    "data-parsoid": {
      "headers": {
        "content-type": "application/json;profile=mediawiki.org/specs/data-parsoid/0.0.1"
      },
      "body": {
        "ids": {}
      }
    }
  }
}
```
If the modified `html` is just a string, Parsoid is expected to assume that it is in the latest HTML version.
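One way to handle the bare-string case is to normalize it into the full part structure up front. The sketch below assumes, purely for illustration, that the latest HTML version is the `1.0.0` profile used in the examples; `normalizeHtmlPart` and `LATEST_HTML_CONTENT_TYPE` are hypothetical names.

```javascript
// Assumption: the latest HTML spec version is the 1.0.0 profile shown above.
const LATEST_HTML_CONTENT_TYPE =
    'text/html;profile=mediawiki.org/specs/html/1.0.0';

// Wrap a bare HTML string in a part; pass full parts through unchanged.
function normalizeHtmlPart(html) {
    if (typeof html === 'string') {
        return {
            headers: { 'content-type': LATEST_HTML_CONTENT_TYPE },
            body: html
        };
    }
    return html;
}

console.log(normalizeHtmlPart('<html>The modified HTML</html>'));
```

After normalization, the rest of the conversion code only has to deal with one shape of `html`.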
Parsoid returns the wikitext, which RESTBase returns to the client without any caching:
```
{
  wikitext: {
    headers: {
      'content-type': 'text/plain;profile=mediawiki.org/specs/wikitext/1.0.0'
    },
    body: "the modified wikitext"
  }
}
```
We could also just return the wikitext with the appropriate content-type header, but there is something to be said for consistency & the ability to return additional metadata (like warnings or errors) in the future.
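To make the trade-off concrete, here is a rough sketch of converting the JSON structure into a plain response with a content-type header. `toPlainResponse` is a hypothetical helper, and the `warnings` key is an invented example of future metadata, not something Parsoid returns today.

```javascript
// Convert the wikitext structure into a plain response whose content type
// travels in a header instead of the body.
function toPlainResponse(envelope) {
    const { wikitext, ...extra } = envelope;
    return {
        headers: { 'content-type': wikitext.headers['content-type'] },
        body: wikitext.body,
        // Top-level keys other than `wikitext` have no slot in the plain form.
        droppedKeys: Object.keys(extra)
    };
}

const plain = toPlainResponse({
    wikitext: {
        headers: { 'content-type': 'text/plain;profile=mediawiki.org/specs/wikitext/1.0.0' },
        body: 'the modified wikitext'
    },
    warnings: ['hypothetical warning'] // invented metadata key for illustration
});
console.log(plain.body, plain.droppedKeys);
```

The `droppedKeys` list shows what the plain form cannot carry, which is the consistency argument made above.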
=== Status ===
- Parsoid WIP implementation: https://gerrit.wikimedia.org/r/#/c/165685/
- RESTBase to be done
=== Remaining tasks ===
* [ ] augment the POST /{domain}/v1/transform/<html|wt>/to/html endpoint to:
* [ ] allow an optional bodyOnly field **PR in [[ https://github.com/wikimedia/restbase/pull/153 | restbase#153 ]]**
* [x] accept an application/json body
* [ ] accept a multipart/form-data body **PR in [[ https://github.com/wikimedia/restbase/pull/154 | restbase#154 ]]**
* [ ] add html-to-wikitext support
* [ ] store wt
* see [[ https://github.com/wikimedia/parsoid/blob/master/lib/mediawiki.ApiRequest.js#L224 | how Parsoid gets wt from the MW API ]]