html2wt conversion support
Closed, Resolved (Public)

Description

Common requirements

Input/output formats (to substitute into {input} and {output} in the use cases below):

  • html -> html
  • html -> wikitext
  • wikitext -> html

For html as an output format, an optional bodyOnly field causes the response to include only the innerHTML of the DOM body.
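
As a rough illustration of the requirements above, a client or router could check a requested conversion against the three listed pairs before building a request. This is only a sketch: `SUPPORTED_TRANSFORMS`, `is_supported` and `accepts_body_only` are hypothetical names, not part of RESTBase.

```python
# Hypothetical helpers, not part of RESTBase.
# The set mirrors the three conversions listed in the requirements.
SUPPORTED_TRANSFORMS = {
    ("html", "html"),
    ("html", "wikitext"),
    ("wikitext", "html"),
}


def is_supported(input_format: str, output_format: str) -> bool:
    """Return True if {input} -> {output} is one of the listed conversions."""
    return (input_format, output_format) in SUPPORTED_TRANSFORMS


def accepts_body_only(output_format: str) -> bool:
    """bodyOnly is only meaningful when the output format is html."""
    return output_format == "html"
```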

Use case: raw content transformation

Request to RESTBase: POST /{domain}/v1/transform/{input}/to/{output}.

TBD

  • Who the first round of users of this feature will be
  • Request content type, headers, body model
  • Response status, content type, headers, body model

Use case: transformation of content from an existing page

  • Title and revision are passed: POST /{domain}/v1/transform/{input}/to/{output}/{title}/{revision}

TBD

  • Who the first round of users of this feature will be
  • Request headers, body model
  • Response status, content type, headers, body model

Implementation notes

For html-to-wikitext support, we'll need to keep the original wikitext in storage. See how Parsoid gets wt from the MW API.

RESTBase retrieves the original html, wikitext & data-parsoid from storage & POSTs them to Parsoid:

POST parsoid/v2/{domain}/wt/{title}/{oldid}

{
    "html": {
        "headers": {
            "content-type": "text/html;profile=mediawiki.org/specs/html/1.0.0"
        },
        "body": "<html>The modified HTML</html>"
    },
    "original": {
        "revid": 12345,
        "wikitext": {
            "headers": {
                "content-type": "text/plain;profile=mediawiki.org/specs/wikitext/1.0.0"
            },
            "body": "the original wikitext"
        },
        "html": {
            "headers": {
                "content-type": "text/html;profile=mediawiki.org/specs/html/1.0.0"
            },
            "body": "the original HTML"
        },
        "data-parsoid": {
            "headers": {
                "content-type": "application/json;profile=mediawiki.org/specs/data-parsoid/0.0.1"
            },
            "body": {
                "ids": {}
            }
        }
    }
}

If the modified html is just a string, then Parsoid is expected to assume that it's in the latest html version.
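
The request body above could be assembled along these lines. This is a sketch only: the helper names `part` and `build_html2wt_request` are hypothetical, and the content types are copied from the example rather than taken from any RESTBase module.

```python
import json

# Content types copied from the example request above.
HTML_TYPE = "text/html;profile=mediawiki.org/specs/html/1.0.0"
WIKITEXT_TYPE = "text/plain;profile=mediawiki.org/specs/wikitext/1.0.0"
DATA_PARSOID_TYPE = "application/json;profile=mediawiki.org/specs/data-parsoid/0.0.1"


def part(content_type, body):
    """Wrap a body in the {headers, body} envelope used in the example."""
    return {"headers": {"content-type": content_type}, "body": body}


def build_html2wt_request(modified_html, revid, original_wikitext,
                          original_html, data_parsoid):
    """Build the JSON-serializable body for the html-to-wikitext POST."""
    return {
        "html": part(HTML_TYPE, modified_html),
        "original": {
            "revid": revid,
            "wikitext": part(WIKITEXT_TYPE, original_wikitext),
            "html": part(HTML_TYPE, original_html),
            "data-parsoid": part(DATA_PARSOID_TYPE, data_parsoid),
        },
    }


payload = build_html2wt_request(
    "<html>The modified HTML</html>",
    12345,
    "the original wikitext",
    "the original HTML",
    {"ids": {}},
)
body = json.dumps(payload)  # string to send as the POST body
```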

Parsoid returns the wikitext, which RESTBase returns to the client without any caching:

{
    "wikitext": {
        "headers": {
            "content-type": "text/plain;profile=mediawiki.org/specs/wikitext/1.0.0"
        },
        "body": "the modified wikitext"
    }
}

We could also just return the wikitext with the appropriate content-type header, but there is something to be said for consistency & the ability to return additional metadata (like warnings or errors) in the future.
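
To make the trade-off concrete, here is a sketch of how a caller might unwrap the enveloped response into a content type and a plain body. The function name `unwrap_wikitext` is hypothetical, and the envelope shape is assumed to match the example above.

```python
from typing import Tuple


def unwrap_wikitext(response: dict) -> Tuple[str, str]:
    """Return (content_type, body) from the enveloped Parsoid response.

    Assumes the {"wikitext": {"headers": ..., "body": ...}} shape shown above.
    """
    wikitext = response["wikitext"]
    return wikitext["headers"]["content-type"], wikitext["body"]


example_response = {
    "wikitext": {
        "headers": {
            "content-type": "text/plain;profile=mediawiki.org/specs/wikitext/1.0.0"
        },
        "body": "the modified wikitext",
    }
}
content_type, wikitext_body = unwrap_wikitext(example_response)
```

Extra top-level keys (for example warnings) could be added to the envelope later without changing this function, which is the consistency argument made above.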

Event Timeline

Jdouglas claimed this task.
Jdouglas raised the priority of this task from to High.
Jdouglas updated the task description.

The raw content transformation use case has a few TBDs:

  • Who the first round of users of this feature will be
  • Request content type, headers, body model
  • Response status, content type, headers, body model

Based on most recent conversations, I've put the following together:

Initial users

Editors using VisualEditor, via the VisualEditor team.

Request

  • Method: POST
  • Content-Type: multipart/form-data
  • Body parts:
    • data: the raw content (either html or wikitext) to be transformed
    • bodyOnly: whether to return only the html body innerHTML
    • Possible other values in the future, e.g. title, revision, etc.

Response

  • Status: 201
  • Content-Type: either text/html for html, or text/plain for wikitext
    • e.g. text/plain;profile=mediawiki.org/specs/wikitext/1.0.0
    • e.g. text/html;profile=mediawiki.org/specs/html/1.0.0
  • Body model: raw content, either html or wikitext

The transformation of content from an existing page use case has a few TBDs:

  • Who the first round of users of this feature will be
  • Request headers, body model
  • Response status, content type, headers, body model

Based on most recent conversations, I've put the following together:

Initial users

Editors using VisualEditor, via the VisualEditor team.

Request

  • Method: POST
  • Content-Type: multipart/form-data
  • Body parts:
    • data: the raw content (either html or wikitext) to be transformed
    • bodyOnly: whether to return only the html body innerHTML
    • Possible other values in the future, e.g. title, revision, etc.

Response

  • Status: 201
  • Content-Type: application/json
  • Body model:
{
  "<output format>": {
    "headers": {
      "content-type": "<output content type>"
    },
    "body": "<output content>"
  }
}

From https://phabricator.wikimedia.org/T88456#1015672:

We were talking about a flat POST API as in the current Parsoid v1 API, possibly slightly generalized / cleaned up for the /transform hierarchy.

Perhaps something like this:

POST /{domain}/v1/transform/wikitext/to/html/{title}/{oldid}
Content-type: multipart/form-data

wikitext: '== Foo =='
bodyOnly: 'true'

POST /{domain}/v1/transform/html/to/wikitext/{title}/{oldid}
Content-type: multipart/form-data

html: '<html>...</html>'

Chatted with @GWicke, came up with the following changes:

Request

  • Body parts:
    • <input format name>: the raw content (either html or wikitext) to be transformed

Response

  • Content-Type: <output content type>
  • Body: (the raw output content)

Based on more conversations and comments above, I've put the following together:

The raw content transformation use case and the transformation of content from an existing page use case both have a few TBDs:

  • Who the first round of users of this feature will be
  • Request content type, headers, body model
  • Response status, content type, headers, body model

Here are the current proposals for these:

Initial users

For both use cases, the users are those of Parsoid: https://www.mediawiki.org/wiki/Parsoid/Users

Request

Use case: raw content transformation

  • Method: POST
  • Content-Type: multipart/form-data
  • Body parts:
    • (input format name): (the raw content to be transformed)
    • bodyOnly: true or false -- whether to return only the html body innerHTML
    • Possible other values in the future, e.g. title, revision, etc.

Use case: transformation of content from an existing page

  • Method: POST
  • Content-Type: multipart/form-data
  • Body parts:
    • (input format name): (the raw content to be transformed)
    • bodyOnly: true or false -- whether to return only the html body innerHTML
    • Possible other values in the future, e.g. title, revision, etc.

Examples

POST /{domain}/v1/transform/wikitext/to/html/{title}/{oldid}
Content-type: multipart/form-data

wikitext: '== Foo =='
bodyOnly: 'true'

POST /{domain}/v1/transform/html/to/wikitext/{title}/{oldid}
Content-type: multipart/form-data

html: '<html>...</html>'

Response

Use case: raw content transformation

  • Status: 201
  • Content-Type: <output content type>
    • e.g. text/plain;profile=mediawiki.org/specs/wikitext/1.0.0
    • e.g. text/html;profile=mediawiki.org/specs/html/1.0.0
  • Body: (the raw output content)

Use case: transformation of content from an existing page

  • Status: 201
  • Content-Type: <output content type>
    • e.g. text/plain;profile=mediawiki.org/specs/wikitext/1.0.0
    • e.g. text/html;profile=mediawiki.org/specs/html/1.0.0
  • Body: (the raw output content)

Here's the latest, based on our conversations. @GWicke, @mobrovac please comment with any corrections you'd like to make.


The raw content transformation use case and the transformation of content from an existing page use case both have a few TBDs:

  • Who the first round of users of this feature will be
  • Request content type, headers, body model
  • Response status, content type, headers, body model

Here are the current proposals for these:

Initial users

Nobody in particular.

Request

Use case: raw content transformation

  • Method: POST
  • Content-Type: multipart/form-data
  • Body parts:
    • (input format name): (the raw content to be transformed)
    • bodyOnly: true or false -- whether to return only the html body innerHTML
    • Possible other values in the future, e.g. title, revision, etc.

Use case: transformation of content from an existing page

  • Method: POST
  • Content-Type: multipart/form-data
  • Body parts:
    • (input format name): (the raw content to be transformed)
    • bodyOnly: true or false -- whether to return only the html body innerHTML
    • Possible other values in the future, e.g. title, revision, etc.

Examples

POST /{domain}/v1/transform/wikitext/to/html/{title}/{oldid}
Content-type: multipart/form-data

wikitext: '== Foo =='
bodyOnly: 'true'

POST /{domain}/v1/transform/html/to/wikitext/{title}/{oldid}
Content-type: multipart/form-data

html: '<html>...</html>'
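
For reference, a sketch of how the multipart/form-data body for the first example could be encoded by hand. The function name `encode_multipart` and the fixed boundary are illustrative assumptions; a real client would normally let its HTTP library do this.

```python
import uuid
from typing import Dict, Optional, Tuple


def encode_multipart(fields: Dict[str, str],
                     boundary: Optional[str] = None) -> Tuple[str, bytes]:
    """Encode simple text fields as multipart/form-data.

    Returns the Content-Type header value and the encoded request body.
    """
    boundary = boundary or uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines.append("--" + boundary)
        lines.append('Content-Disposition: form-data; name="%s"' % name)
        lines.append("")  # blank line separates part headers from content
        lines.append(value)
    lines.append("--" + boundary + "--")  # closing delimiter
    lines.append("")
    body = "\r\n".join(lines).encode("utf-8")
    return "multipart/form-data; boundary=" + boundary, body


# Fields from the wikitext-to-html example above; the boundary is arbitrary.
form_content_type, form_body = encode_multipart(
    {"wikitext": "== Foo ==", "bodyOnly": "true"},
    boundary="example-boundary",
)
```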

Response

Use case: raw content transformation

  • Status: 200
  • Content-Type: <output content type>
    • e.g. text/plain;profile=mediawiki.org/specs/wikitext/1.0.0
    • e.g. text/html;profile=mediawiki.org/specs/html/1.0.0
  • Body: (the raw output content)

Use case: transformation of content from an existing page

  • Status: 200
  • Content-Type: <output content type>
    • e.g. text/plain;profile=mediawiki.org/specs/wikitext/1.0.0
    • e.g. text/html;profile=mediawiki.org/specs/html/1.0.0
  • Body: (the raw output content)

title and revision parameter support is critical for the release. Support for them is a precondition for selective serialization as used by VE. These parameters are currently part of the URL.

Should we talk about changes to the spec instead of trying to replicate it here?

First, let's establish the specification. Then we can make sure it's covered by tests, and make any necessary changes to the implementation.

Chatted with @GWicke, and we decided the spec shall be as implemented in https://github.com/wikimedia/restbase/blob/26cb090655ab710c58ecf35501645e89ea6f8f14/specs/mediawiki/v1/content.yaml#L272:

/{module:transform}/html/to/wikitext{/title}{/revision}:
  post:
    tags:
      - Transforms
    description: Transform HTML to wikitext
    consumes:
      - multipart/form-data
    produces:
      - text/plain; profile=mediawiki.org/specs/wikitext/1.0.0
    parameters:
      - name: domain
        in: path
        description: The top-level API domain
        type: string
        required: true
        default: en.wikipedia.org
      - name: title
        in: path
        description: The page title
        type: string
        required: false
      - name: revision
        in: path
        description: The page revision
        type: integer
        required: false
      - name: html
        in: formData
        description: The HTML to transform
        type: string
        required: true
    x-backend-request:
      uri: /{domain}/sys/parsoid/transform/html/to/wikitext{/title}{/revision}
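
The `{/title}{/revision}` parts of the backend URI are optional path segments, in a style resembling RFC 6570 path-segment expansion. The sketch below shows one way such a template could be expanded. It is an illustration only, not RESTBase's actual template code, and `expand` is a hypothetical name.

```python
import re
from urllib.parse import quote

# Matches {name} (required) and {/name} (optional path segment).
_VAR = re.compile(r"\{(/?)(\w+)\}")


def expand(template: str, params: dict) -> str:
    """Expand {name} and {/name} placeholders in a URI template."""
    def repl(match):
        optional = match.group(1) == "/"
        name = match.group(2)
        value = params.get(name)
        if value is None:
            if optional:
                return ""  # drop the whole segment, including its slash
            raise KeyError(name)
        encoded = quote(str(value), safe="")
        return "/" + encoded if optional else encoded
    return _VAR.sub(repl, template)


BACKEND_URI = "/{domain}/sys/parsoid/transform/html/to/wikitext{/title}{/revision}"

with_page = expand(BACKEND_URI, {
    "domain": "en.wikipedia.org",
    "title": "Main Page",
    "revision": 12345,
})
without_page = expand(BACKEND_URI, {"domain": "en.wikipedia.org"})
```

Note that this sketch does not guard against a revision being passed without a title. A stricter version would reject that combination.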

Additional requirement discovered during standup: ensure that, internally, Selective Serialization is being used.

GWicke closed this task as Resolved. (Edited Feb 13 2015, 1:09 AM)

Resolving this issue, as the functionality is now implemented. The remaining known issue with selser is tracked in the more general T75955.