
RESTBase / Parsoid integration - waiting for parsoid deploy
Closed, Resolved · Public · 0 Estimated Story Points

Description

We are getting ready to integrate Parsoid with RESTBase. This task tracks our progress and what's left to do.

Generally, we need a way to communicate multiple MIME entities both in POST requests to Parsoid, and in responses from Parsoid back to RESTBase. We could use RFC 2387 (multipart/related) or RFC 2388 (multipart/form-data), but this seems a bit cumbersome, especially for the response part. Instead, we can just use a simple JSON encoding with a structure that's otherwise basically identical to MIME.

A possible downside of using JSON is efficiency (quite a bit more escaping, especially for HTML-in-JSON). However, we could address that later by optionally using a binary encoding like msgpack without changing the message structure itself.

Another option to consider would be to flatten the structure by using names like original.html. This would make it easy to encode the same message as multipart/related.
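To make the two layouts concrete, here is a small sketch contrasting the nested JSON structure with the flattened dotted-name variant. The key names follow the examples in this task, but the `flatten` helper and its leaf-detection heuristic (an object with a `headers` key counts as a MIME part) are purely illustrative, not part of any spec.

```javascript
// Sketch only: nested vs. flattened encodings of the same message.
// Key names follow the examples in this task; nothing here is normative.
var nested = {
  html: { headers: { 'content-type': 'text/html' }, body: '<p>new</p>' },
  original: {
    html: { headers: { 'content-type': 'text/html' }, body: '<p>old</p>' }
  }
};

// Flatten nested keys into dotted names like "original.html", which
// could double as part names in a multipart/related body. Treats any
// object carrying a 'headers' key as a leaf MIME part (a heuristic).
function flatten(obj, prefix, out) {
  out = out || {};
  Object.keys(obj).forEach(function (key) {
    var val = obj[key];
    var name = prefix ? prefix + '.' + key : key;
    if (val && typeof val === 'object' && !val.headers) {
      flatten(val, name, out);
    } else {
      out[name] = val;
    }
  });
  return out;
}
```

Flattening `nested` yields keys `html` and `original.html`, each still holding its own headers and body.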

Implementation notes

  • Parsoid: Slightly simpler format done & deployed. Needs to be updated to indicate the mime types as documented above.
  • RESTBase: Prototype implementation.
    • should pass to Parsoid:
      • previous revision's HTML + data-parsoid OnEdit
      • current revision's HTML + data-parsoid OnDependencyChange
  • Parsoid WIP implementation: https://gerrit.wikimedia.org/r/#/c/165685/
  • RESTBase to be done

Event Timeline


@GWicke: In the example responses from Parsoid, you're showing the pagebundle but the route examples say /html/.

Also, in the cache scenario (wt2html), we try to reuse the expansions from the previous revision, as in,
https://gerrit.wikimedia.org/r/#/c/165685/10/api/routes.js

but, for no-cache,

> For a no-cache request, RESTBase instead first checks whether the *current* revision is found in storage. If it is, it sends the data for that in the original key:

what, if anything, does Parsoid do with the original data?

> what, if anything, does Parsoid do with the original data?

If the original data contains the wikitext source, then Parsoid won't have to fetch it. I'll assume you meant that.

> @GWicke: In the example responses from Parsoid, you're showing the pagebundle but the route examples say /html/.

Do you mean html2wt or wt2html in the issue summary?

> For a no-cache request, RESTBase instead first checks whether the *current* revision is found in storage. If it is, it sends the data for that in the original key:

> what, if anything, does Parsoid do with the original data?

It reuses template expansions or images, depending on which of the two triggered the update. The reason for the update is currently relayed in an x-parsoid header.

> Do you mean html2wt or wt2html in the issue summary?

wt2html

> It reuses template expansions or images, depending on which of the two triggered the update. The reason for the update is currently relayed in an x-parsoid header.

I was thinking it only did that with the previous revision data, but I guess it's both then.
https://gerrit.wikimedia.org/r/#/c/165685/12/api/routes.js

can you review https://gerrit.wikimedia.org/r/#/c/165685/ when you get a chance?

@Arlolra: See my review there.

As you noted in your response, we also need a way to request more targeted updates. This includes the current wt2html updates (re-render templates or re-render images), but we should use a mechanism that can also cover more refined requests in the future:

  • re-sanitization of HTML / upgrade of HTML spec
  • update of transclusions affected by certain templates only
  • update of specific images only

A key in the request seems to make most sense. Perhaps something like this?

{
   update: {
      templates: true,
      images: false,
      spec: true
   },
   original: { ... }
}

Not all of these need to be set all the time; only the actually requested updates need to be set to a truthy value. In the future, we could add an object value with finer-grained options instead, to support targeted template / image updates.
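A sketch of how a consumer might interpret such an update object. The flag names (templates, images, spec) come from the proposal above; the dispatch function and the returned task names are purely illustrative.

```javascript
// Sketch: decide which re-render work an update request asks for.
// Flag names (templates, images, spec) come from the proposal above;
// the returned task list is purely illustrative.
function requestedUpdates(update) {
  update = update || {};
  var tasks = [];
  if (update.templates) { tasks.push('re-expand transclusions'); }
  if (update.images) { tasks.push('re-render images'); }
  if (update.spec) { tasks.push('re-sanitize / upgrade HTML spec'); }
  return tasks;
}
```

With the example request above (`templates: true, images: false, spec: true`), only the template and spec updates would be scheduled.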

I updated the patch with that.

Is it feasible to run this Parsoid patch in my local dev environment so that I can develop against it?

EDIT: I found instructions here.

@Jdouglas: Yes, that should work. See https://www.mediawiki.org/wiki/Parsoid/Setup#Manual_installation_on_Linux_or_Mac_OS_X

Basic summary:

# clone parsoid
git clone ssh://<user>@gerrit.wikimedia.org:29418/mediawiki/services/parsoid

cd parsoid

# check out the patch with the git commandline shown in gerrit:
git fetch ssh://<user>@gerrit.wikimedia.org:29418/mediawiki/services/parsoid refs/changes/85/165685/13 && git checkout FETCH_HEAD
npm install

node api/server

This should bring up parsoid locally. The wikipedia domains are all configured by default.

I converted the POST parsoid/v2/{domain}/wt/{title}/{oldid} example above to a JSON file, and attempted to run it locally, but something went wrong:

$ curl -H "Content-Type: application/json" -X POST -d @T75955.json localhost:8000/v2/en.wikipedia.org/wt/Main_Page/1
Cannot call method on undefined
TypeError: Cannot call method on undefined
    at Object.ES.CheckObjectCoercible (/home/jdouglas/code/parsoid/node_modules/es6-shim/es6-shim.js:152:17)
    at Object.ES.ToObject (/home/jdouglas/code/parsoid/node_modules/es6-shim/es6-shim.js:164:26)
    at Function.Object.keys (/home/jdouglas/code/parsoid/node_modules/es6-shim/es6-shim.js:905:38)
    at Object.DU.applyDataParsoid (/home/jdouglas/code/parsoid/lib/mediawiki.DOMUtils.js:1958:10)
    at Promise.then.timeout.then.apiUtils.jsonResponse.wikitext.headers.content-type (/home/jdouglas/code/parsoid/api/routes.js:275:7)
    at /home/jdouglas/code/parsoid/node_modules/es6-shim/es6-shim.js:1220:20
    at Object._onImmediate (/home/jdouglas/code/parsoid/node_modules/es6-shim/es6-shim.js:1177:28)
    at processImmediate [as _immediateCallback] (timers.js:345:15)

Any debugging pointers?

EDIT: It works if I s/wt/html/g in the URL.

This appears to work for pagebundle in place of wt.

curl -s -H "Content-Type: application/json" -X POST -d @T75955.json localhost:8000/v2/en.wikipedia.org/pagebundle/Main_Page/1
{
    "data-parsoid": {
        "body": {
            "counter": 1,
            "ids": {
                "mwAA": {
                    "dsr": [
                        0,
                        949,
                        0,
                        0
                    ]
                },
                "mwAQ": {
                    "dsr": [
                        928,
                        949,
                        0,
                        0
                    ]
                }
            }
        },
        "headers": {
            "content-type": "application/json;profile=mediawiki.org/specs/data-parsoid/0.0.1"
        }
    },
    "html": {
        "body": "<!DOCTYPE html>\n<html prefix=\"dc: http://purl.org/dc/terms/ mw: http://mediawiki.org/rdf/\" about=\"http://en.wikipedia.org/wiki/Special:Redirect/revision/12345\"><head prefix=\"mwr: http://en.wikipedia.org/wiki/Special:Redirect/\"><meta property=\"mw:articleNamespace\" content=\"0\"/><link rel=\"dc:replaces\" resource=\"mwr:revision/12342\"/><meta property=\"dc:modified\" content=\"2002-02-16T19:00:49.000Z\"/><meta about=\"mwr:user/0\" property=\"dc:title\" content=\"210.55.230.17\"/><link rel=\"dc:contributor\" resource=\"mwr:user/0\"/><meta property=\"mw:revisionSHA1\" content=\"1ec7fece702a00b9d99c0e848eea8c524d12fe3f\"/><meta property=\"dc:description\" content=\"*\"/><meta property=\"mw:parsoidVersion\" content=\"0\"/><link rel=\"dc:isVersionOf\" href=\"//en.wikipedia.org/wiki/Main_Page\"/><title>Main_Page</title><base href=\"//en.wikipedia.org/wiki/\"/><link rel=\"stylesheet\" href=\"//en.wikipedia.org/w/load.php?modules=mediawiki.legacy.commonPrint,shared|mediawiki.skinning.elements|mediawiki.skinning.content|mediawiki.skinning.interface|skins.vector.styles|site|mediawiki.skinning.content.parsoid&amp;only=styles&amp;debug=true&amp;skin=vector\"/></head><body id=\"mwAA\" lang=\"en\" class=\"mw-content-ltr sitedir-ltr ltr mw-body mw-body-content mediawiki\" dir=\"ltr\"><p id=\"mwAQ\">the original wikitext</p></body></html>",
        "headers": {
            "content-type": "text/html;profile=mediawiki.org/specs/html/1.0.0"
        }
    }
}

This works for wt when I replace `"body": {}` with the following:

"body": {
    "counter": 1,
    "ids": {
        "mwAA": {
            "dsr": [
                0,
                949,
                0,
                0
            ]
        },
        "mwAQ": {
            "dsr": [
                928,
                949,
                0,
                0
            ]
        }
    }
}

EDIT: The key is having body.ids -- at least "body": { "ids": {} }

Here's the source of the problem: https://github.com/wikimedia/parsoid/blob/04721c235f9cc6d4ffdca8418e73d602aecfebba/lib/mediawiki.DOMUtils.js#L1963

Would it be better to allow for an id-less body?

e.g.

Object.keys( dp.ids || {} ).forEach(function ( key ) {

@Jdouglas: Passing in data-parsoid without ids in the body could be considered a bug in the client worth highlighting. If that passed unnoticed, the result would likely be wikitext corruption in edits, which would make many people unhappy. If we send the 'original' object, then it should contain both html *and* a valid data-parsoid attribute, both from storage.

Parsoid also needs to support html2wt conversion without the 'original' object (for new page creation for example). That might be a simpler mode to test for the first iteration.

Gotcha. Does Parsoid have any infrastructure for validation? A 400 with a reason might be more informative to an erroneous client than a 500 with an obscure es6-shim stack trace.

@Jdouglas I think validation is fairly embryonic at this point. Definitely agreed on 400 being better than 500.

Understandable, considering we're operating without a type system. :]

@Jdouglas: To be fair, JSON schema validation is not usually happening for free just by having stronger typing.
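A framework-agnostic sketch of the kind of early validation discussed above. The function name, check, and error message are hypothetical; the point is that the missing `body.ids` case would surface as a structured 400-style result instead of an es6-shim stack trace.

```javascript
// Sketch: validate the 'original' object before processing. Returns
// null when the payload looks usable, or an object that a route
// handler could turn into a 400 response. Purely illustrative.
function validateOriginal(original) {
  // 'original' is optional (e.g. for new page creation)
  if (!original) { return null; }
  var dp = original['data-parsoid'];
  if (!dp || !dp.body || typeof dp.body.ids !== 'object' || dp.body.ids === null) {
    return { status: 400, reason: "original['data-parsoid'].body.ids is required" };
  }
  return null;
}
```

A route handler could then respond with the reason string on a 400 rather than letting the missing key propagate into a 500.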

Change 165685 merged by Jdouglas:
API v2 parsing and serialization routes

https://gerrit.wikimedia.org/r/165685

@GWicke:

> If the modified html is just a string, then Parsoid is expected to assume that it's in the latest html version.

Does this refer to a POST to parsoid/v2/{domain}/wt/{title}/{oldid} with a request body containing raw HTML?

@Jdouglas: I think we can always wrap things into a full request on the restbase end, and just add the latest configured parsoid mime type if we didn't receive any from the client. So we can perhaps just strike this as a parsoid requirement.

When there's no title or revision, i.e. POST /{domain}/v1/transform/html/to/wt, what should be the Parsoid request URL?

POST parsoid/v2/{domain}/wt/{title}/{oldid}

Parsoid appears to allow gibberish:

POST parsoid/v2/{domain}/wt/___NOT_A_REAL_TITLE_1231___/

Should we designate a special placeholder title? Or should we update Parsoid to respond to POST parsoid/v2/{domain}/wt?

@Jdouglas, basically title and oldid are optional, so both POST parsoid/v2/{domain}/wt/{title} and POST parsoid/v2/{domain}/wt should work. If the request without title doesn't work in parsoid right now, then we could work around it by inventing one like 'Main_Page' for now. IIRC that's what Parsoid used to do as well if the title is missing.

Edit: Parsoid uses the configured mainpage for the wiki, which happens to be 'Main Page' for enwiki, but differs for other languages.

Oh, my mistake. The new patch already supports POST parsoid/v2/{domain}/wt.

The transform endpoint is now functioning: https://github.com/wikimedia/restbase/pull/98

After #98, is there anything else needed for T75955?

Answer: yes: https://coveralls.io/files/412841510#L93

I think we're pretty close to marking this one as done. Anyone want to do a final check of the tests to verify they cover the scope of this task?

On-demand generation of HTML and data-parsoid:

https://github.com/wikimedia/restbase/blob/3412329b102e69bb25482d9f21a74151bd00abdd/test/features/parsoid/ondemand.js

html2wt conversion support:

https://github.com/wikimedia/restbase/blob/3412329b102e69bb25482d9f21a74151bd00abdd/test/features/parsoid/transform.js

Strictly speaking https://github.com/wikimedia/restbase/blob/3412329b102e69bb25482d9f21a74151bd00abdd/test/features/parsoid/ondemand.js#L87 does not really test whether the original content was indeed passed to parsoid. Maybe we could let parsoid emit a header if it received the original properly. We could then assert that header in the response.

> test whether the original content was indeed passed to parsoid

@GWicke For specification purposes, can you define "original" in this context?

Does it always mean "penultimate" with respect to the state of the wiki behind {domain}?

> Anyone want to do a final check of the tests to verify they cover the scope of this task?

We're not quite there yet. The Parsoid API accepts a body flag that reduces the scope of the response.

@GWicke do you know who is using this feature? Is it something RESTBase should provide via its own API? If so, how will it affect storage? We might not want to store the body.innerHTML, but instead store the entire content. The drawback to this is that we'll have to replicate some of the same DOM parsing that Parsoid currently handles to be able to subset the content on demand.

> Anyone want to do a final check of the tests to verify they cover the scope of this task?

> We're not quite there yet. The Parsoid API accepts a body flag that reduces the scope of the response.

> @GWicke do you know who is using this feature?

AFAIK this is currently used by Flow, likely also VE.

> Is it something RESTBase should provide via its own API?

Yes. We talked about supporting a plain POST body with the html/wt in 'content' and optional other fields like a 'bodyOnly' boolean. I would vote for using this option in the transform API, as it is easy enough and more efficient for binary content when using multipart/form-data encoding. We can always add JSON as an option later.

Does the following look correct?

User -> RESTBase:

{
  method: 'post',
  uri: '/{domain}/v1/transform/<html|wt>/to/html',
  headers: {
    'content-type': 'application/json',
  },
  body: {
    content: 'Hello, world!',
    bodyOnly: true,
  },
}

RESTBase -> Parsoid:

{
  method: 'post',
  uri: 'http://parsoid-lb.eqiad.wikimedia.org/{domain}/FAKE_PAGE_NAME',
  headers: {
    'content-type': 'application/x-www-form-urlencoded',
  },
  body: '$BODY',
}

Where $BODY is the form-urlencoded representation of:

{
  oldid: 123, // optional
  html: 'Hello, world!',
  body: true, // optional
}

> test whether the original content was indeed passed to parsoid

> @GWicke For specification purposes, can you define "original" in this context?

> Does it always mean "penultimate" with respect to the state of the wiki behind {domain}?

On second thought, does it mean "the revision for which new changes are being submitted"?

@Jdouglas, I think we need to always request the full HTML from Parsoid, so that our storage has the full version. If bodyOnly was requested in the POST, then we should probably pluck out the body with a regexp in restbase.
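A sketch of the regexp-based plucking mentioned above. The helper name is made up, and a real implementation would need to decide how to handle malformed markup; the naive pattern is defensible here only because the HTML is machine-generated by Parsoid.

```javascript
// Sketch: extract the inner HTML of <body> from a full Parsoid
// document. A naive regexp suffices when the markup is machine-
// generated and well-formed; this is not a general HTML parser.
function pluckBody(html) {
  var match = /<body[^>]*>([\s\S]*)<\/body>/.exec(html);
  return match ? match[1] : html; // fall back to the input unchanged
}
```

RESTBase could then always store the full document and apply `pluckBody` only when the client asked for `bodyOnly`.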

Regarding multipart/form-data vs. application/json: Should we promote the former for the transform API?

Ok, so it sounds like we need only augment the POST /{domain}/v1/transform/<html|wt>/to/html endpoint to:

  • allow an optional bodyOnly field
  • accept a multipart/form-data body rather than JSON

Sound right?

>   • accept a multipart/form-data body rather than JSON

We could also accept both.

The other missing bit is that we currently don't actually fetch wikitext from the MW API. We need to do so (and perhaps store it) for proper html2wt support.

> The other missing bit is that we currently don't actually fetch wikitext from the MW API. We need to do so (and perhaps store it) for proper html2wt support.

Can you point me to the docs on how to get Wikitext from the MW API?

> Can you point me to the docs on how to get Wikitext from the MW API?

action=raw ought to help
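For reference, action=raw is an index.php parameter rather than an api.php action: it returns the raw wikitext of a page, and oldid pins a specific revision. A sketch of building such a URL (the helper name is made up, and the `/w/index.php` path assumes the standard Wikimedia script path):

```javascript
// Sketch: build an index.php?action=raw URL for a given page revision.
// action=raw returns the raw wikitext; oldid pins a specific revision.
// Assumes the standard /w/index.php script path used on Wikimedia wikis.
function rawWikitextUrl(domain, title, oldid) {
  var url = 'https://' + domain + '/w/index.php?action=raw' +
            '&title=' + encodeURIComponent(title);
  if (oldid !== undefined) {
    url += '&oldid=' + oldid;
  }
  return url;
}
```

For example, `rawWikitextUrl('en.wikipedia.org', 'Main_Page', 1)` yields `https://en.wikipedia.org/w/index.php?action=raw&title=Main_Page&oldid=1`.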

What is the multipart/form-data counterpart to the following application/json?

{
    "html": {
        "headers": {
            "content-type": "text/html;profile=mediawiki.org/specs/html/1.0.0"
        },
        "body": "<html>The modified HTML</html>"
    },
    "original": {
        "revid": 12345,
        "wikitext": {
            "headers": {
                "content-type": "text/plain;profile=mediawiki.org/specs/wikitext/1.0.0"
            },
            "body": "the original wikitext"
        },
        "html": {
            "headers": {
                "content-type": "text/html;profile=mediawiki.org/specs/html/1.0.0"
            },
            "body": "the original HTML"
        },
        "data-parsoid": {
            "headers": {
                "content-type": "application/json;profile=mediawiki.org/specs/data-parsoid/0.0.1"
            },
            "body": {
                "ids": {}
            }
        }
    }
}

Should each multipart field contain JSON? For example:

--------------------------d0f6b49b1b041887
Content-Disposition: form-data; name="html"; filename="html"
Content-Type: application/json

{
    "headers": {
        "content-type": "text/html;profile=mediawiki.org/specs/html/1.0.0"
    },
    "body": "<html>The modified HTML</html>"
}

--------------------------d0f6b49b1b041887
Content-Disposition: form-data; name="original"; filename="original"
Content-Type: application/json

{
    "revid": 12345,
    "wikitext": {
        "headers": {
            "content-type": "text/plain;profile=mediawiki.org/specs/wikitext/1.0.0"
        },
        "body": "the original wikitext"
    },
    "html": {
        "headers": {
            "content-type": "text/html;profile=mediawiki.org/specs/html/1.0.0"
        },
        "body": "the original HTML"
    },
    "data-parsoid": {
        "headers": {
            "content-type": "application/json;profile=mediawiki.org/specs/data-parsoid/0.0.1"
        },
        "body": {
            "ids": {}
        }
    }
}

--------------------------d0f6b49b1b041887--

@Jdouglas, is there a use case for this? The discussion about multipart/form-data is about the RESTBase API, not the Parsoid v2 API.

@Jdouglas, your example is from the Parsoid v2 API, while the RESTBase transform API can just use flat posts. We make the assumption that the transform API does not need to deal with mis-matches in the posted content, as the use cases this is intended for (VE, edit widgets etc) aren't long-running & don't involve storage.

multipart/form-data does support per-field content-type headers, but by default those are not set afaik.

Notes from conversation with @GWicke:

  • We don't want the client to pass in original -- RESTBase will always find that on its own (e.g. from storage)
  • We don't need to allow the client to specify the content type yet
  • Assume client is always using the latest spec for whichever Content-Type they're using
  • Client might know the specific Content-Type, but we shouldn't break if we can fallback to a functioning default
Jdouglas updated the task description.

This task is getting rather unwieldy. I'm going to split it up into better-digestible subtasks.

Latest iteration of the RESTBase transform API: https://github.com/wikimedia/restbase/pull/173

Testing revealed an issue with selser, possibly on the Parsoid side: T89411

GWicke renamed this task from "RESTBase / Parsoid integration" to "RESTBase / Parsoid integration: done, waiting for parsoid patch / deploy". Feb 14 2015, 6:11 PM
GWicke claimed this task.
GWicke renamed this task from "RESTBase / Parsoid integration: done, waiting for parsoid patch / deploy" to "RESTBase / Parsoid integration - waiting for parsoid deploy". Feb 14 2015, 9:56 PM

Resolved by Parsoid deploy yesterday.