
Implement something similar to the RESTBase 'section' API to provide wikitext structure information
Open, Medium | Public | 0 Estimated Story Points

Description

We would like to build a smarter wikitext editor with (for example) smart template/infobox collapsing.

But parsing wikitext on the client is a losing proposition.

Let's do something similar to the [sections API](http://restbase.wikimedia.org/en.wikipedia.org/v1/?doc#resource_Mobile) (T94890) used by mobile: deliver a full JSON description of the structure of a given wikitext revision. It should include wikitext offsets for:

  • The start/end of sections
  • The start/end of templates (and their names, so that certain templates can be identified for special treatment)
  • The start/end of extension tags (and the extension name)
  • What else? Links? Media, for contextual popovers?

As a strawman, the result could look like:

{
  sections: [ 0, 10, 50, 70],
  templates: [ ['foo', 10, 14], ['infobox', 17, 23] ],
  media: [ ['Foobar.jpg', 10, 17] ],
  extensions: {
    gallery: [ [ 40, 45], [60, 65] ]
  }
}

The wikitext editor can use these source offsets to collapse regions of wikitext by default to allow a cleaner editing experience.
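
To make the collapsing idea concrete, here is a minimal sketch of how a client might use such offsets. Everything in it is an assumption layered on the strawman: the function names `collectRanges` and `collapseRegions` are hypothetical, offsets are taken to be half-open `[start, end)` indices into a JavaScript string (the real API might use bytes or code points instead), and the placeholder labels are invented for illustration.

```javascript
// Hypothetical helper: flatten the strawman structure into sortable ranges.
// Assumes offsets are half-open [start, end) indices into the wikitext string.
function collectRanges(structure) {
  const ranges = [];
  for (const [name, start, end] of structure.templates || []) {
    ranges.push({ start, end, label: `{{${name} ...}}` });
  }
  for (const [name, spans] of Object.entries(structure.extensions || {})) {
    for (const [start, end] of spans) {
      ranges.push({ start, end, label: `<${name} .../>` });
    }
  }
  // Sort by start offset; on ties, put the longer (outer) range first.
  return ranges.sort((a, b) => a.start - b.start || b.end - a.end);
}

// Replace each collapsible range with a short placeholder label.
function collapseRegions(wikitext, structure) {
  let out = '';
  let pos = 0;
  for (const range of collectRanges(structure)) {
    // Skip ranges that begin inside a region that is already collapsed.
    if (range.start < pos) continue;
    out += wikitext.slice(pos, range.start) + range.label;
    pos = range.end;
  }
  return out + wikitext.slice(pos);
}
```

A real editor would presumably fold these regions in its own document model rather than rewrite the string, but the range handling would look much the same.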

To go further, perhaps we should export source offsets for things like bold face, italics, and links to allow the wikitext editor to do syntax highlighting? (Or not -- perhaps syntax highlighting is best done with imperfect-but-okay regexps on the client side.)
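
For comparison, the "imperfect-but-okay regexps" option could look roughly like the sketch below. The pattern and the `findInlineTokens` name are illustrative assumptions, not an existing implementation; the regexp deliberately ignores nesting, apostrophe runs longer than three, and links that span lines.

```javascript
// Deliberately simple patterns: '''bold''', ''italic'' and [[target|label]].
// These will mis-tokenize nested or unbalanced markup.
const WIKITEXT_TOKEN_RE =
  /'''(.+?)'''|''(.+?)''|\[\[([^\]|]+)(?:\|[^\]]*)?\]\]/g;

// Return { type, start, end } for each match, with half-open offsets
// into the input string.
function findInlineTokens(wikitext) {
  const tokens = [];
  for (const match of wikitext.matchAll(WIKITEXT_TOKEN_RE)) {
    let type = 'link';
    if (match[1] !== undefined) type = 'bold';
    else if (match[2] !== undefined) type = 'italic';
    tokens.push({
      type,
      start: match.index,
      end: match.index + match[0].length
    });
  }
  return tokens;
}
```

The output has the same offset-based shape the server-side proposal would produce, so an editor could switch between the two sources without changing its highlighting code.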

Event Timeline

cscott raised the priority of this task from to Needs Triage.
cscott updated the task description.
cscott added projects: RESTBase, Parsoid.
cscott subscribed.

Parsoid already provides these "section offsets" (basically for all children of <body>) and they are stored in RESTBase. So, once we add true <section> wrappers, the offsets should translate over.

Halfak updated the task description.

@Halfak, those offsets (for all wikitext constructs) are already available as part of data-parsoid (which is currently considered private information, though I know that Google is using it), and they are also stored in RESTBase. Without these offsets, selective serialization (to minimize dirty diffs in VE and other HTML client edits) wouldn't work.

Right. The idea is to expose some JSON structure built from the data-parsoid information via a stable-ish RESTBase API, so we don't expose the raw data-parsoid stuff to the client. (Also so that the section information gets properly cached, so the page doesn't have to be parsed on the fly when the user fires up the wikitext editor.)

As a strawman, consider a result object like:

{
  sections: [ 0, 10, 50, 70],
  templates: [ ['foo', 10, 14], ['infobox', 17, 23] ],
  media: [ ['Foobar.jpg', 10, 17] ],
  extensions: {
    gallery: [ [ 40, 45], [60, 65] ]
  }
}

I picked features which are likely to be large and cumbersome in the wikitext, and therefore would benefit from collapse. And the smart editor could use the media information to display popover previews, and/or choose not to hide certain whitelisted templates and/or template strings which are shorter than a specified length.
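
The whitelist and length rule described above could be sketched as follows. The `templatesToCollapse` function and its `keepVisible` and `minLength` options are hypothetical names, and the default threshold of 40 characters is an arbitrary assumption; the input is the `templates` array from the strawman.

```javascript
// Decide which templates to collapse by default.
// templates: array of [name, start, end] entries, as in the strawman.
// keepVisible: template names that should never be collapsed (whitelist).
// minLength: templates shorter than this many characters stay expanded.
function templatesToCollapse(templates, { keepVisible = [], minLength = 40 } = {}) {
  const visible = new Set(keepVisible.map((name) => name.toLowerCase()));
  return templates.filter(
    ([name, start, end]) =>
      !visible.has(name.toLowerCase()) && end - start >= minLength
  );
}
```

Matching names case-insensitively is a simplification; MediaWiki treats only the first letter of a title as case-insensitive on most wikis.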

To clarify, there are two section APIs. The first one, provided as /page/html/{title}/{revision}?sections=id1,id2, returns the sections with the (Parsoid) IDs id1 and id2. The other, /page/mobile-html-sections, is tailored to serving relevant HTML and metadata for display on mobile clients (currently only the native Android app).

A big note for the first endpoint: these are not sections as displayed on the wiki (denoted by /=+/ formatting in wikitext), but rather Parsoid sections, i.e. page elements.

"Page elements" == top level HTML tags? If/when we can do T114407: Add <section> tags for sections., then wiki sections will be a subset of these.

cscott renamed this task from Extend RESTBase 'section' API to include more wikitext structure information to Implement something similar to the RESTBase 'section' API to provide wikitext structure information. Oct 1 2015, 7:32 PM
cscott added a project: VisualEditor.
This comment was removed by cscott.

We actually store those section offsets both in data-parsoid, and separately in another table for faster lookup. We could expose just the offsets fairly efficiently. We could also provide a section retrieval API for wikitext using this information.
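
As a rough illustration of section retrieval from offsets alone, assuming the API returned a sorted array of section start offsets like the `sections` array in the strawman (that interpretation, and the `sectionWikitext` name, are assumptions):

```javascript
// Return the wikitext of one section, given a sorted array of section
// start offsets. A section is assumed to run until the next section's
// start, or to the end of the text for the last section.
function sectionWikitext(wikitext, sectionStarts, index) {
  if (!Number.isInteger(index) || index < 0 || index >= sectionStarts.length) {
    throw new RangeError(`no section with index ${index}`);
  }
  const start = sectionStarts[index];
  const end =
    index + 1 < sectionStarts.length ? sectionStarts[index + 1] : wikitext.length;
  return wikitext.slice(start, end);
}
```

On the server side the same lookup would presumably slice the stored revision text, so a client could request a single section without downloading the whole page.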

But all of this will have a more useful granularity with T114072: <section> tags for MediaWiki sections.

For marking up / extracting inline content elements, there are some thoughts in T105845#1650013.

Arlolra triaged this task as Medium priority. Nov 18 2015, 3:43 AM
Arlolra subscribed.