Page MenuHomePhabricator

Contributor gets page source
Closed, ResolvedPublic5 Estimated Story Points

Description

As a Contributor, I want to get the source code for a page, so that I can edit it locally.

GET /page/{title}
Returns the page as JSON. Title is escaped for slashes

Payload: empty

Notable request headers:

Status:
200 – this is the page
403 - user isn't authorized to read the page
404 – page does not exist (never created or deleted)

Notable response headers: none

Body: JSON

  • id: numeric id of the page
  • key: prefixed DB key of the page, like "Talk:Main_Page"
  • title: title for display, like "Talk:Main Page"
  • latest: latest revision of the page, object with these properties
    • id: revision ID
    • timestamp: revision timestamp
  • content_model: content model for the main slot of page
  • license: Object for the preferred license of the page, including these properties:
    • spdx: SPDX code
    • url: URL for the license
    • title: title of the license
  • other_licenses: array of objects with {spdx, url, title} for other licenses the page is available under
  • source: source of the main slot for editing; usually wikitext, but depends on content_model

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
mobrovac added a subscriber: mobrovac.EditedAug 22 2019, 1:14 PM

Status:
200 – this is the page
304 – not modified

It should be probably noted (for completeness) that 3xx status codes imply no return body whatsoever.

307 – page is redirect; Location: header gives location of redirect page

It would be better to use 301 and 302 here as we already use them in the existing REST API. 301 should be used for permanent moves such as Main%20Page -> Main_Page, while 302 should be used for user-set redirects, i.e. pages that have #REDIRECT.

404 – there has never been such a page
410 – page was deleted

IMHO 410 should not be used since, per RfC2616, 410 indicates a permanent removal of a resource. However, in the case of MW, one can again create a page with the same title. Therefore, 404 is the appropriate response.

Body: JSON
id: id of the page
version_id: version ID of the page
title: title of the page
text: wikitext of the page

If, as a reader, I want to get a page, I will likely not be happy with getting a JSON response. The metadata about the page itself can either be delivered through headers (part of it will be, such as the revid which is part of the ETag) or we can have an appropriate REST URI just for that. Returning HTML directly would be more appropriate here, IMHO. Furthermore, by returning wikitext this end point violates the API/FORMAT architecture principle which states that APIs MUST be designed to enable clients to process content without the need to parse or manipulate wikitext.

Like @mobrovac I think it's very unclear what sort of reader would want a page as JSON, and what should be in the JSON. Is the user here really a reader, or are they an editor who wants to subsequently edit the page via T230843?

I think if we had some user stories that featured actual real-live users, then it would be easier to figure out what the requirements are.

tstarling renamed this task from Reader gets a page to REST: Reader gets a page.Aug 26 2019, 6:39 AM
tstarling added a project: MediaWiki-REST-API.
eprodromou updated the task description. (Show Details)
eprodromou edited subscribers, added: eprodromou; removed: EvanProdromou.

@mobrovac I appreciate the feedback. I've made changes to the status codes per your recommendations, including outputting text of the page in HTML (although this will need another pass for pages that are not wikitext, like JS and CSS). I appreciate the link to the architecture principles, and I agree that using the rendered version makes more sense.

I'll add a separate endpoint for getting the "source code" for a page, like /page/{title}/source, which will be more useful for the Contributor persona.

I'm leaving the JSON body type for the output. Developers are conditioned to work with JSON in the current generation of APIs, even if, as you say, most of the metadata could be encoded in HTTP headers.

For these use cases, I've intentionally elided the role of the developer as an intermediary persona, since it makes them complex and wordy. Can we assume that a developer will be using these endpoints to serve the requirements of their own end users? "As a Developer, I want an API endpoint to fetch the page to show to the Reader, because they want to read the page."

@tstarling It's always useful to question our assumptions!

I'll see if I can come up with some supporting data for the hypothesis that a) people want to read our content and b) JSON is a good format for APIs.

For the purposes of the CRUD interface, this needs to return whatever is sent with PUT, and that probably means wikitext. The path could be a subresource /page/{title}/wikitext, following the principle that subresources should be used for "larger, more expensive, less-frequently-used properties, or properties with different access rights" (ref), which would seem to describe the wikitext of the page. For HTML delivery we could use /page/{title}/html, but that doesn't seem to be necessary for the CRUD sprint.

The /page/{title} endpoint could contain links to the subresources, as suggested for document endpoints at https://restfulapi.net/resource-naming/ .

I don't think it's correct to send a 302 for a wikitext endpoint since that would prevent the redirect from being edited. This is why we don't use 302 responses for redirects in the web UI. It would be fine for there to be a reader-focused view in which redirects are delivered with a 302 code, although we should check the whole redirect chain and avoid sending clients into redirect loops. But for an editor-focused CRUD interface, redirects are just a kind of document.

"version_id" in the response should be called revision_id or structured as {"revision": { "id": ... }}.

"title" should be specified as the PDBK as discussed.

The key in the JSON response should be "html" not "text", maybe that was left in since before @mobrovac's request that this endpoint deliver HTML.

eprodromou updated the task description. (Show Details)Oct 1 2019, 8:27 PM
eprodromou updated the task description. (Show Details)
eprodromou renamed this task from REST: Reader gets a page to Contributor gets page source.Oct 1 2019, 8:30 PM
eprodromou triaged this task as Medium priority.
eprodromou updated the task description. (Show Details)
eprodromou updated the task description. (Show Details)Oct 1 2019, 8:32 PM

@tstarling I agree on most of these points. I changed "text" to "source" to make it clear it's wikitext source (or whatever source format the page has).

I changed the title so it's clear this is for getting the source. We'll keep the GET /page/{title} and PUT /page/{title} endpoints parallel, and we can keep the POST /page creation endpoint with the same inputs, too.

I updated the revision_id property.

I broke up the title_key and display_title into separate properties, so the client can use whichever one is necessary without having to do any processing on it.

eprodromou added a subscriber: EvanProdromou.
eprodromou updated the task description. (Show Details)Oct 11 2019, 2:12 AM

The path could be a subresource /page/{title}/wikitext, following the principle that subresources should be used for "larger, more expensive, less-frequently-used properties, or properties with different access rights" (ref), which would seem to describe the wikitext of the page. For HTML delivery we could use /page/{title}/html, but that doesn't seem to be necessary for the CRUD sprint.

The /page/{title} endpoint could contain links to the subresources, as suggested for document endpoints at https://restfulapi.net/resource-naming/ .

I very much agree with this approach, it's much more flexible than what's currently proposed.

I very much agree with this approach, it's much more flexible than what's currently proposed.

I'd really like the GET and PUT URLs to be the same.

eprodromou updated the task description. (Show Details)Oct 28 2019, 8:19 PM

I'd really like the GET and PUT URLs to be the same.

Given that we don't have the PUT endpoint specified yet, this in not a valid response to the concerns Tim and me has had.

Bug even with having the same paths, perhaps adding /wikitext in the PUT endpoint will make sense as well. Given that we have Parsoid in core now, we can imagine having PUT endpoint where you can post HTML. We have it in RESTBase as an experiment, it's totally possible (not that it's necessarily a good idea). Also, with MCR, a revision could possibly have some other content-types in it, so you can envision creating a new revision with some other type of content. So, I wouldn't set in stone the fact that the PUT endpoint will not have a suffix just yet.

This ship has probably sailed with /revision/bare being in production, but if we are to add /wikitext, perhaps we should design for MCR right away. I would imagine {entity}/{slot_id}/{representation_type} to be an ultimate goal. Backwards-compatible move to a design like this is impossible. So, I think this needs to be /page/{title}/main/wikitext if we were to be really extensible and future-proof

I'd really like the GET and PUT URLs to be the same.

Given that we don't have the PUT endpoint specified yet, this in not a valid response to the concerns Tim and me has had.

PUT endpoint is here: T230843

Bug even with having the same paths, perhaps adding /wikitext in the PUT endpoint will make sense as well.

I'd like to have every type have a main representation at /<type>/<id>. I'd like that main representation to be read-write if the type is read-write.

Given that we have Parsoid in core now, we can imagine having PUT endpoint where you can post HTML.

Agreed. I think PUT /page/<title>/with_html (JSON with embedded HTML) and PUT /page/<title>/html (bare HTML) could both be useful, especially for visual editors.

We should design for MCR right away.

Based on my conversations with @tstarling last week, we're adding MCR endpoints to the design. I'm going to add it in a backwards-compatible way that favors the main slot over any others.

eprodromou updated the task description. (Show Details)Nov 10 2019, 7:14 PM

I added the content_model property to the page type.

eprodromou updated the task description. (Show Details)Nov 12 2019, 8:02 PM

contributors: array of User objects for the contributors to the page, each with these properties
id: user ID
name: user name

Hm... This ain't gonna be quick nor it is gonna be easy on the database. We can totally do it, there's a method in RevisionStore for getting this list, but it will require a full page history scan and a join, and it's DISTNCT. We need to try out this query on some larger pages first.

Also, including this makes it much harder to cache the result. All the rest of the properties seem to be properties of the latest revision of the page, this is the property of the whole page. So, much more conditions for cache invalidation. What's the use-case for having this info in this endpoint? Does it worth introducing extra complexity and load?

What's the use-case for having this info in this endpoint? Does it worth introducing extra complexity and load?

One of our architecture principles is:

API/LICENSE: data formats and APIs that provide access to user generated content MUST be designed to provide easy access to all relevant information about authorship and licensing.

(emphasis mine) This is a "MUST", so we really don't have a lot of options.

This is a "MUST", so we really don't have a lot of options.

  • Can we only provide an author of the latest revision?
  • Can we limit the number of contributors to X?
  • Can we introduce a separate more in-depth contributors endpoint into the history API and add a link to it in here?

Here's the query that RevisionStore::getAuthorsBetween would use if we have no limit and no from/to constraints:

SELECT DISTINCT actor_rev_user.actor_user AS rev_user,actor_rev_user.actor_name AS rev_user_text,temp_rev_user.revactor_actor AS rev_actor FROM revision JOIN revision_actor_temp temp_rev_user ON ((temp_rev_user.revactor_rev = rev_id)) JOIN actor actor_rev_user ON ((actor_rev_user.actor_id = temp_rev_user.revactor_actor)) WHERE rev_page = 534366 AND ((rev_deleted & 4) = 0);

It takes ~0.8 seconds to execute on Barack_Obama. This endpoint looks like we can expect it to have way more traffic than the history count endpoints, so this performance doesn't seem acceptable. Or not? What's the expected usage of this in comparing to other endpoints? The result is cacheable, but cache should be invalidated under a lot of conditions.

I'm not advocating for dropping this info from the response, I'm presenting the data so that we know the tradeoffs.

WDoranWMF reassigned this task from eprodromou to nnikkhoui.Dec 3 2019, 5:12 PM
WDoranWMF set the point value for this task to 5.Dec 3 2019, 5:43 PM

@Pchelolo you seemed to touch on this earlier, but regarding content_model, I see a todo comment in Revision.php to remove the getContentModel() function. It states:

with MCR there no longer is a single model associated with a revision.

So with MCR each slot has its own content model, not each page ya? Should we modify this endpoint to instead give back slots on a given page? I've heard some about MCR, but not sure whether its supposed to be taken into consideration for this endpoint!

Yeah, that's why the definition was changed to

content_model: content model for the main slot of page

I've heard some about MCR, but not sure whether its supposed to be taken into consideration for this endpoint!

There's been a lot of discussion on this, and it's been decided to postpone it. So everything you return is for the main slot.

@Pchelolo oh - i missed that...ok thank you for clarifying.

eprodromou updated the task description. (Show Details)Dec 4 2019, 4:47 PM
eprodromou updated the task description. (Show Details)Dec 4 2019, 5:53 PM
eprodromou updated the task description. (Show Details)

@nnikkhoui I changed the "license" and "other_licenses" elements to objects, so we have some more options. Instead of just a SPDX code, we also have an URL and a name.

You should be able to get the URL for the license either from $wgRightsURL or $wgRightsPage. See https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/master/includes/DefaultSettings.php#7124 for descriptions of these settings. If it's a page, you'll need to turn it into a fully-qualified URL.

You should be able to get the title property from $wgRightsText, or from $wgRightsPage. Check the details in the default settings file.

Getting the SPDX code is trickier. If it were me, I'd use the data at the SPDX licenses table to map URLs to codes. But that might be a hassle. See what you can do and I'm happy leaving this field null for now.

Change 555560 had a related patch set uploaded (by Nikki Nikkhoui; owner: Nikki Nikkhoui):
[mediawiki/core@master] GET Page Source Endpoint

https://gerrit.wikimedia.org/r/555560

Change 555560 merged by jenkins-bot:
[mediawiki/core@master] GET Page Source Endpoint

https://gerrit.wikimedia.org/r/555560

Now you can add integration tests to the endpoint.

Change 563614 had a related patch set uploaded (by Nikki Nikkhoui; owner: Nikki Nikkhoui):
[mediawiki/tools/api-testing@master] Etag Utility Function

https://gerrit.wikimedia.org/r/563614

Change 563614 abandoned by Nikki Nikkhoui:
Etag Utility Function

https://gerrit.wikimedia.org/r/563614

eprodromou closed this task as Resolved.Mar 11 2020, 6:11 PM

Looks good. Thanks @nnikkhoui