
RESTBase / Parsoid integration - waiting for parsoid deploy
Closed, Resolved · Public · 0 Estimated Story Points

Description

We are getting ready to integrate Parsoid with RESTBase. This task tracks our progress and what's left to do.

Generally, we need a way to communicate multiple MIME entities both in POST requests to Parsoid, and in responses from Parsoid back to RESTBase. We could use RFC 2387 (multipart/related) or RFC 2388 (multipart/form-data), but this seems a bit cumbersome, especially for the response part. Instead, we can just use a simple JSON encoding with a structure that's otherwise basically identical to MIME.

A possible downside of using JSON is efficiency (quite a bit more escaping, especially for HTML-in-JSON). However, we could address that later by optionally using a binary encoding like msgpack without changing the message structure itself.

Another option to consider would be to flatten the structure by using names like original.html. This would make it easy to encode the same message as multipart/related.
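To make the two layouts concrete, here is a small sketch contrasting the nested JSON structure with the flattened dotted-name variant. The key names follow the examples in this task, but the `flatten` helper and its leaf-detection heuristic (an object with a `headers` key counts as a MIME part) are purely illustrative, not part of any spec.

```javascript
// Sketch only: nested vs. flattened encodings of the same message.
// Key names follow the examples in this task; nothing here is normative.
var nested = {
  html: { headers: { 'content-type': 'text/html' }, body: '<p>new</p>' },
  original: {
    html: { headers: { 'content-type': 'text/html' }, body: '<p>old</p>' }
  }
};

// Flatten nested keys into dotted names like "original.html", which
// could double as part names in a multipart/related body. Treats any
// object carrying a 'headers' key as a leaf MIME part (a heuristic).
function flatten(obj, prefix, out) {
  out = out || {};
  Object.keys(obj).forEach(function (key) {
    var val = obj[key];
    var name = prefix ? prefix + '.' + key : key;
    if (val && typeof val === 'object' && !val.headers) {
      flatten(val, name, out);
    } else {
      out[name] = val;
    }
  });
  return out;
}
```

Flattening `nested` yields keys `html` and `original.html`, each still holding its own headers and body.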

Implementation notes

  • Parsoid: Slightly simpler format done & deployed. Needs to be updated to indicate the mime types as documented above.
  • RESTBase: Prototype implementation.
    • should pass to Parsoid:
      • previous revision's HTML + data-parsoid OnEdit
      • current revision's HTML + data-parsoid OnDependencyChange
  • Parsoid WIP implementation: https://gerrit.wikimedia.org/r/#/c/165685/
  • RESTBase to be done

Event Timeline


@GWicke: In the example responses from Parsoid, you're showing the pagebundle but the route examples say /html/.

Also, in the cache scenario (wt2html), we try to reuse the expansions from the previous revision, as in,
https://gerrit.wikimedia.org/r/#/c/165685/10/api/routes.js

but, for no-cache,

> For a no-cache request, RESTBase instead first checks whether the *current* revision is found in storage. If it is, it sends the data for that in the original key:

what, if anything, does Parsoid do with the original data?

> what, if anything, does Parsoid do with the original data?

If the original data contains the wikitext source, then Parsoid won't have to fetch it. I'll assume you meant that.

> @GWicke: In the example responses from Parsoid, you're showing the pagebundle but the route examples say /html/.

Do you mean html2wt or wt2html in the issue summary?

> For a no-cache request, RESTBase instead first checks whether the *current* revision is found in storage. If it is, it sends the data for that in the original key:

> what, if anything, does Parsoid do with the original data?

It reuses template expansions or images, depending on which of the two triggered the update. The reason for the update is currently relayed in an x-parsoid header.

> Do you mean html2wt or wt2html in the issue summary?

wt2html

> It reuses template expansions or images, depending on which of the two triggered the update. The reason for the update is currently relayed in an x-parsoid header.

I was thinking it only did that with the previous revision data, but I guess it's both then.
https://gerrit.wikimedia.org/r/#/c/165685/12/api/routes.js

can you review https://gerrit.wikimedia.org/r/#/c/165685/ when you get a chance?

@Arlolra: See my review there.

As you noted in your response, we also need a way to request more targeted updates. This includes the current wt2html updates (re-render templates or re-render images), but we should use a mechanism that can also cover more refined requests in the future:

  • re-sanitization of HTML / upgrade of HTML spec
  • update of transclusions affected by certain templates only
  • update of specific images only

A key in the request seems to make most sense. Perhaps something like this?

{
   update: {
      templates: true,
      images: false,
      spec: true
   },
   original: { ... }
}

Not all of these need to be set all the time; only the actually requested updates need to be set to a truthy value. In the future, we could add an object value with finer-grained options instead, to support targeted template / image updates.
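A sketch of how a consumer might interpret such an update object. The flag names (templates, images, spec) come from the proposal above; the dispatch function and the returned task names are purely illustrative.

```javascript
// Sketch: decide which re-render work an update request asks for.
// Flag names (templates, images, spec) come from the proposal above;
// the returned task list is purely illustrative.
function requestedUpdates(update) {
  update = update || {};
  var tasks = [];
  if (update.templates) { tasks.push('re-expand transclusions'); }
  if (update.images) { tasks.push('re-render images'); }
  if (update.spec) { tasks.push('re-sanitize / upgrade HTML spec'); }
  return tasks;
}
```

With the example request above (`templates: true, images: false, spec: true`), only the template and spec updates would be scheduled.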

I updated the patch with that.

Is it feasible to run this Parsoid patch in my local dev environment so that I can develop against it?

EDIT: I found instructions here.

@Jdouglas: Yes, that should work. See https://www.mediawiki.org/wiki/Parsoid/Setup#Manual_installation_on_Linux_or_Mac_OS_X

Basic summary:

# clone parsoid
git clone ssh://<user>@gerrit.wikimedia.org:29418/mediawiki/services/parsoid

cd parsoid

# check out the patch with the git commandline shown in gerrit:
git fetch ssh://<user>@gerrit.wikimedia.org:29418/mediawiki/services/parsoid refs/changes/85/165685/13 && git checkout FETCH_HEAD
npm install

node api/server

This should bring up parsoid locally. The wikipedia domains are all configured by default.

I converted the POST parsoid/v2/{domain}/wt/{title}/{oldid} example above to a JSON file, and attempted to run it locally, but something went wrong:

$ curl -H "Content-Type: application/json" -X POST -d @T75955.json localhost:8000/v2/en.wikipedia.org/wt/Main_Page/1
Cannot call method on undefined
TypeError: Cannot call method on undefined
    at Object.ES.CheckObjectCoercible (/home/jdouglas/code/parsoid/node_modules/es6-shim/es6-shim.js:152:17)
    at Object.ES.ToObject (/home/jdouglas/code/parsoid/node_modules/es6-shim/es6-shim.js:164:26)
    at Function.Object.keys (/home/jdouglas/code/parsoid/node_modules/es6-shim/es6-shim.js:905:38)
    at Object.DU.applyDataParsoid (/home/jdouglas/code/parsoid/lib/mediawiki.DOMUtils.js:1958:10)
    at Promise.then.timeout.then.apiUtils.jsonResponse.wikitext.headers.content-type (/home/jdouglas/code/parsoid/api/routes.js:275:7)
    at /home/jdouglas/code/parsoid/node_modules/es6-shim/es6-shim.js:1220:20
    at Object._onImmediate (/home/jdouglas/code/parsoid/node_modules/es6-shim/es6-shim.js:1177:28)
    at processImmediate [as _immediateCallback] (timers.js:345:15)

Any debugging pointers?

EDIT: It works if I s/wt/html/g in the URL.

This appears to work for pagebundle in place of wt.

curl -s -H "Content-Type: application/json" -X POST -d @T75955.json localhost:8000/v2/en.wikipedia.org/pagebundle/Main_Page/1
{
    "data-parsoid": {
        "body": {
            "counter": 1,
            "ids": {
                "mwAA": {
                    "dsr": [
                        0,
                        949,
                        0,
                        0
                    ]
                },
                "mwAQ": {
                    "dsr": [
                        928,
                        949,
                        0,
                        0
                    ]
                }
            }
        },
        "headers": {
            "content-type": "application/json;profile=mediawiki.org/specs/data-parsoid/0.0.1"
        }
    },
    "html": {
        "body": "<!DOCTYPE html>\n<html prefix=\"dc: http://purl.org/dc/terms/ mw: http://mediawiki.org/rdf/\" about=\"http://en.wikipedia.org/wiki/Special:Redirect/revision/12345\"><head prefix=\"mwr: http://en.wikipedia.org/wiki/Special:Redirect/\"><meta property=\"mw:articleNamespace\" content=\"0\"/><link rel=\"dc:replaces\" resource=\"mwr:revision/12342\"/><meta property=\"dc:modified\" content=\"2002-02-16T19:00:49.000Z\"/><meta about=\"mwr:user/0\" property=\"dc:title\" content=\"210.55.230.17\"/><link rel=\"dc:contributor\" resource=\"mwr:user/0\"/><meta property=\"mw:revisionSHA1\" content=\"1ec7fece702a00b9d99c0e848eea8c524d12fe3f\"/><meta property=\"dc:description\" content=\"*\"/><meta property=\"mw:parsoidVersion\" content=\"0\"/><link rel=\"dc:isVersionOf\" href=\"//en.wikipedia.org/wiki/Main_Page\"/><title>Main_Page</title><base href=\"//en.wikipedia.org/wiki/\"/><link rel=\"stylesheet\" href=\"//en.wikipedia.org/w/load.php?modules=mediawiki.legacy.commonPrint,shared|mediawiki.skinning.elements|mediawiki.skinning.content|mediawiki.skinning.interface|skins.vector.styles|site|mediawiki.skinning.content.parsoid&amp;only=styles&amp;debug=true&amp;skin=vector\"/></head><body id=\"mwAA\" lang=\"en\" class=\"mw-content-ltr sitedir-ltr ltr mw-body mw-body-content mediawiki\" dir=\"ltr\"><p id=\"mwAQ\">the original wikitext</p></body></html>",
        "headers": {
            "content-type": "text/html;profile=mediawiki.org/specs/html/1.0.0"
        }
    }
}

This works for wt when I replace `"body": {}` with the following:

"body": {
    "counter": 1,
    "ids": {
        "mwAA": {
            "dsr": [
                0,
                949,
                0,
                0
            ]
        },
        "mwAQ": {
            "dsr": [
                928,
                949,
                0,
                0
            ]
        }
    }
}

EDIT: The key is having body.ids -- at least "body": { "ids": {} }

Here's the source of the problem: https://github.com/wikimedia/parsoid/blob/04721c235f9cc6d4ffdca8418e73d602aecfebba/lib/mediawiki.DOMUtils.js#L1963

Would it be better to allow for an id-less body?

e.g.

Object.keys( dp.ids || {} ).forEach(function ( key ) {

@Jdouglas: Passing in data-parsoid without ids in the body could be considered a bug in the client worth highlighting. If that passed unnoticed, the result would likely be wikitext corruption in edits, which would make many people unhappy. If we send the 'original' object, then it should contain both html *and* a valid data-parsoid attribute, both from storage.

Parsoid also needs to support html2wt conversion without the 'original' object (for new page creation for example). That might be a simpler mode to test for the first iteration.

Gotcha. Does Parsoid have any infrastructure for validation? A 400 with a reason might be more informative to an erroneous client than a 500 with an obscure es6-shim stack trace.

@Jdouglas I think validation is fairly embryonic at this point. Definitely agreed on 400 being better than 500.

Understandable, considering we're operating without a type system. :]

@Jdouglas: To be fair, JSON schema validation is not usually happening for free just by having stronger typing.
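A framework-agnostic sketch of the kind of early validation discussed above. The function name, check, and error message are hypothetical; the point is that the missing `body.ids` case would surface as a structured 400-style result instead of an es6-shim stack trace.

```javascript
// Sketch: validate the 'original' object before processing. Returns
// null when the payload looks usable, or an object that a route
// handler could turn into a 400 response. Purely illustrative.
function validateOriginal(original) {
  // 'original' is optional (e.g. for new page creation)
  if (!original) { return null; }
  var dp = original['data-parsoid'];
  if (!dp || !dp.body || typeof dp.body.ids !== 'object' || dp.body.ids === null) {
    return { status: 400, reason: "original['data-parsoid'].body.ids is required" };
  }
  return null;
}
```

A route handler could then respond with the reason string on a 400 rather than letting the missing key propagate into a 500.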

Change 165685 merged by Jdouglas:
API v2 parsing and serialization routes

https://gerrit.wikimedia.org/r/165685

@GWicke:

> If the modified html is just a string, then Parsoid is expected to assume that it's in the latest html version.

Does this refer to a POST to parsoid/v2/{domain}/wt/{title}/{oldid} with a request body containing raw HTML?

@Jdouglas: I think we can always wrap things into a full request on the restbase end, and just add the latest configured parsoid mime type if we didn't receive any from the client. So we can perhaps just strike this as a parsoid requirement.

When there's no title or revision, i.e. POST /{domain}/v1/transform/html/to/wt, what should be the Parsoid request URL?

POST parsoid/v2/{domain}/wt/{title}/{oldid}

Parsoid appears to allow gibberish:

POST parsoid/v2/{domain}/wt/___NOT_A_REAL_TITLE_1231___/

Should we designate a special placeholder title? Or should we update Parsoid to respond to POST parsoid/v2/{domain}/wt?

@Jdouglas, basically title and oldid are optional, so both POST parsoid/v2/{domain}/wt/{title} and POST parsoid/v2/{domain}/wt should work. If the request without title doesn't work in parsoid right now, then we could work around it by inventing one like 'Main_Page' for now. IIRC that's what Parsoid used to do as well if the title is missing.

Edit: Parsoid uses the configured mainpage for the wiki, which happens to be 'Main Page' for enwiki, but differs for other languages.

Oh, my mistake. The new patch already supports POST parsoid/v2/{domain}/wt.

The transform endpoint is now functioning: https://github.com/wikimedia/restbase/pull/98

After #98, is there anything else needed for T75955?

Answer: yes: https://coveralls.io/files/412841510#L93

I think we're pretty close to marking this one as done. Anyone want to do a final check of the tests to verify they cover the scope of this task?

On-demand generation of HTML and data-parsoid:

https://github.com/wikimedia/restbase/blob/3412329b102e69bb25482d9f21a74151bd00abdd/test/features/parsoid/ondemand.js

html2wt conversion support:

https://github.com/wikimedia/restbase/blob/3412329b102e69bb25482d9f21a74151bd00abdd/test/features/parsoid/transform.js

Strictly speaking https://github.com/wikimedia/restbase/blob/3412329b102e69bb25482d9f21a74151bd00abdd/test/features/parsoid/ondemand.js#L87 does not really test whether the original content was indeed passed to parsoid. Maybe we could let parsoid emit a header if it received the original properly. We could then assert that header in the response.

> test whether the original content was indeed passed to parsoid

@GWicke For specification purposes, can you define "original" in this context?

Does it always mean "penultimate" with respect to the state of the wiki behind {domain}?

> Anyone want to do a final check of the tests to verify they cover the scope of this task?

We're not quite there yet. The Parsoid API accepts a body flag that reduces the scope of the response.

@GWicke do you know who is using this feature? Is it something RESTBase should provide via its own API? If so, how will it affect storage? We might not want to store the body.innerHTML, but instead store the entire content. The drawback to this is that we'll have to replicate some of the same DOM parsing that Parsoid currently handles to be able to subset the content on demand.

> Anyone want to do a final check of the tests to verify they cover the scope of this task?

> We're not quite there yet. The Parsoid API accepts a body flag that reduces the scope of the response.

> @GWicke do you know who is using this feature?

AFAIK this is currently used by Flow, likely also VE.

> Is it something RESTBase should provide via its own API?

Yes. We talked about supporting a plain POST body with the html/wt in 'content' and optional other fields like a 'bodyOnly' boolean. I would vote for using this option in the transform API, as it is easy enough and more efficient for binary content when using multipart/form-data encoding. We can always add JSON as an option later.

Does the following look correct?

User -> RESTBase:

{
  method: 'post',
  uri: '/{domain}/v1/transform/<html|wt>/to/html',
  headers: {
    'content-type': 'application/json',
  },
  body: {
    content: 'Hello, world!',
    bodyOnly: true,
  },
}

RESTBase -> Parsoid:

{
  method: 'post',
  uri: 'http://parsoid-lb.eqiad.wikimedia.org/{domain}/FAKE_PAGE_NAME',
  headers: {
    'content-type': 'application/x-www-form-urlencoded',
  },
  body: '$BODY',
}

Where $BODY is the form-urlencoded representation of:

{
  oldid: 123, // optional
  html: 'Hello, world!',
  body: true, // optional
}

> test whether the original content was indeed passed to parsoid

> @GWicke For specification purposes, can you define "original" in this context?

> Does it always mean "penultimate" with respect to the state of the wiki behind {domain}?

On second thought, does it mean "the revision for which new changes are being submitted"?

@Jdouglas, I think we need to always request the full HTML from Parsoid, so that our storage has the full version. If bodyOnly was requested in the POST, then we should probably pluck out the body with a regexp in restbase.
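A sketch of the regexp-based plucking mentioned above. The helper name is made up, and a real implementation would need to decide how to handle malformed markup; the naive pattern is defensible here only because the HTML is machine-generated by Parsoid.

```javascript
// Sketch: extract the inner HTML of <body> from a full Parsoid
// document. A naive regexp suffices when the markup is machine-
// generated and well-formed; this is not a general HTML parser.
function pluckBody(html) {
  var match = /<body[^>]*>([\s\S]*)<\/body>/.exec(html);
  return match ? match[1] : html; // fall back to the input unchanged
}
```

RESTBase could then always store the full document and apply `pluckBody` only when the client asked for `bodyOnly`.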

Regarding multipart/form-data vs. application/json: Should we promote the former for the transform API?

Ok, so it sounds like we need only augment the POST /{domain}/v1/transform/<html|wt>/to/html endpoint to:

  • allow an optional bodyOnly field
  • accept a multipart/form-data body rather than JSON

Sound right?

>   • accept a multipart/form-data body rather than JSON

We could also accept both.

The other missing bit is that we currently don't actually fetch wikitext from the MW API. We need to do so (and perhaps store it) for proper html2wt support.

> The other missing bit is that we currently don't actually fetch wikitext from the MW API. We need to do so (and perhaps store it) for proper html2wt support.

Can you point me to the docs on how to get Wikitext from the MW API?

> Can you point me to the docs on how to get Wikitext from the MW API?

action=raw ought to help
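For reference, action=raw is an index.php parameter rather than an api.php action: it returns the raw wikitext of a page, and oldid pins a specific revision. A sketch of building such a URL (the helper name is made up, and the `/w/index.php` path assumes the standard Wikimedia script path):

```javascript
// Sketch: build an index.php?action=raw URL for a given page revision.
// action=raw returns the raw wikitext; oldid pins a specific revision.
// Assumes the standard /w/index.php script path used on Wikimedia wikis.
function rawWikitextUrl(domain, title, oldid) {
  var url = 'https://' + domain + '/w/index.php?action=raw' +
            '&title=' + encodeURIComponent(title);
  if (oldid !== undefined) {
    url += '&oldid=' + oldid;
  }
  return url;
}
```

For example, `rawWikitextUrl('en.wikipedia.org', 'Main_Page', 1)` yields `https://en.wikipedia.org/w/index.php?action=raw&title=Main_Page&oldid=1`.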

What is the multipart/form-data counterpart to the following application/json?

{
    "html": {
        "headers": {
            "content-type": "text/html;profile=mediawiki.org/specs/html/1.0.0"
        },
        "body": "<html>The modified HTML</html>"
    },
    "original": {
        "revid": 12345,
        "wikitext": {
            "headers": {
                "content-type": "text/plain;profile=mediawiki.org/specs/wikitext/1.0.0"
            },
            "body": "the original wikitext"
        },
        "html": {
            "headers": {
                "content-type": "text/html;profile=mediawiki.org/specs/html/1.0.0"
            },
            "body": "the original HTML"
        },
        "data-parsoid": {
            "headers": {
                "content-type": "application/json;profile=mediawiki.org/specs/data-parsoid/0.0.1"
            },
            "body": {
                "ids": {}
            }
        }
    }
}

Should each multipart field contain JSON? For example:

--------------------------d0f6b49b1b041887
Content-Disposition: form-data; name="html"; filename="html"
Content-Type: application/json

{
    "headers": {
        "content-type": "text/html;profile=mediawiki.org/specs/html/1.0.0"
    },
    "body": "<html>The modified HTML</html>"
}

--------------------------d0f6b49b1b041887
Content-Disposition: form-data; name="original"; filename="original"
Content-Type: application/json

{
    "revid": 12345,
    "wikitext": {
        "headers": {
            "content-type": "text/plain;profile=mediawiki.org/specs/wikitext/1.0.0"
        },
        "body": "the original wikitext"
    },
    "html": {
        "headers": {
            "content-type": "text/html;profile=mediawiki.org/specs/html/1.0.0"
        },
        "body": "the original HTML"
    },
    "data-parsoid": {
        "headers": {
            "content-type": "application/json;profile=mediawiki.org/specs/data-parsoid/0.0.1"
        },
        "body": {
            "ids": {}
        }
    }
}

--------------------------d0f6b49b1b041887--

@Jdouglas, is there a use case for this? The discussion about multipart/form-data is about the RESTBase API, not the Parsoid v2 API.

@Jdouglas, your example is from the Parsoid v2 API, while the RESTBase transform API can just use flat posts. We make the assumption that the transform API does not need to deal with mis-matches in the posted content, as the use cases this is intended for (VE, edit widgets etc) aren't long-running & don't involve storage.

multipart/form-data does support per-field content-type headers, but by default those are not set afaik.

Notes from conversation with @GWicke:

  • We don't want the client to pass in original -- RESTBase will always find that on its own (e.g. from storage)
  • We don't need to allow the client to specify the content type yet
  • Assume client is always using the latest spec for whichever Content-Type they're using
  • Client might know the specific Content-Type, but we shouldn't break if we can fallback to a functioning default
Jdouglas updated the task description.

This task is getting rather unwieldy. I'm going to split it up into better-digestible subtasks.

Latest iteration of the RESTBase transform API: https://github.com/wikimedia/restbase/pull/173

Testing revealed an issue with selser, possibly on the Parsoid side: T89411

GWicke renamed this task from "RESTBase / Parsoid integration" to "RESTBase / Parsoid integration: done, waiting for parsoid patch / deploy". Feb 14 2015, 6:11 PM
GWicke claimed this task.
GWicke renamed this task from "RESTBase / Parsoid integration: done, waiting for parsoid patch / deploy" to "RESTBase / Parsoid integration - waiting for parsoid deploy". Feb 14 2015, 9:56 PM

Resolved by Parsoid deploy yesterday.