Contributor gets page source
Closed, ResolvedPublic5 Estimated Story Points
Actions

Description

As a Contributor, I want to get the source code for a page, so that I can edit it locally.

GET /page/{title}
Returns the page as JSON. Title is escaped for slashes

Payload: empty

Notable request headers:

Status:
200 – this is the page
403 - user isn't authorized to read the page
404 – page does not exist (never created or deleted)

Notable response headers: none

Body: JSON

id: numeric id of the page
key: prefixed DB key of the page, like "Talk:Main_Page"
title: title for display, like "Talk:Main Page"
latest: latest revision of the page, object with these properties
- id: revision ID
- timestamp: revision timestamp
content_model: content model for the main slot of page
license: Object for the preferred license of the page, including these properties:
- spdx: SPDX code
- url: URL for the license
- title: title of the license
other_licenses: array of objects with {spdx, url, title} for other licenses the page is available under
source: source of the main slot for editing; usually wikitext, but depends on content_model

Details

	Subject	Repo	Branch	Lines +/-
	Etag Utility Function	mediawiki/tools/api-testing	master	+15 -0
	GET Page Source Endpoint	mediawiki/core	master	+262 -3

Customize query in gerrit

Related Objects
Search...

Status	Assigned	Task
Invalid	None	T229661 Core REST API in MediaWiki
Resolved	None	T229662 Minimal client REST API
Resolved	• nnikkhoui	T229663 Contributor gets page source
Invalid	None	T234382 Implement GET /page/{title}
Invalid	apaskulin	T234387 Document page source endpoint
Declined	• nnikkhoui	T240294 Add $wgRightsSPDXId to production settings

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

EvanProdromou moved this task from Epic Backlog to User Story Tasks being created on the Platform Team Workboards (Epics) board.Aug 2 2019, 1:24 PM

EvanProdromou added a project: Story.

EvanProdromou updated the task description. (Show Details)Aug 20 2019, 11:20 PM

EvanProdromou moved this task from User Story Tasks being created to User Stories Needing EM Sign Off on the Platform Team Workboards (Epics) board.Aug 20 2019, 11:31 PM

• kchapman moved this task from User Stories Needing EM Sign Off to User Stories Ready for Engineering on the Platform Team Workboards (Epics) board.Aug 21 2019, 12:14 AM

• eprodromou moved this task from Epics to Green on the Platform Team Workboards board.Aug 21 2019, 11:41 AM

• eprodromou edited projects, added Platform Team Workboards (Green); removed Platform Team Workboards (Epics).

• eprodromou moved this task from Backlog to Next Sprint on the Platform Team Workboards (Green) board.

Status:
200 – this is the page
304 – not modified

It should be probably noted (for completeness) that 3xx status codes imply no return body whatsoever.

307 – page is redirect; Location: header gives location of redirect page

It would be better to use 301 and 302 here as we already use them in the existing REST API. 301 should be used for permanent moves such as Main%20Page -> Main_Page, while 302 should be used for user-set redirects, i.e. pages that have #REDIRECT.

404 – there has never been such a page
410 – page was deleted

IMHO 410 should not be used since, per RfC2616, 410 indicates a permanent removal of a resource. However, in the case of MW, one can again create a page with the same title. Therefore, 404 is the appropriate response.

Body: JSON
id: id of the page
version_id: version ID of the page
title: title of the page
text: wikitext of the page

If, as a reader, I want to get a page, I will likely not be happy with getting a JSON response. The metadata about the page itself can either be delivered through headers (part of it will be, such as the revid which is part of the ETag) or we can have an appropriate REST URI just for that. Returning HTML directly would be more appropriate here, IMHO. Furthermore, by returning wikitext this end point violates the API/FORMAT architecture principle which states that APIs MUST be designed to enable clients to process content without the need to parse or manipulate wikitext.

• mobrovac mentioned this in T230848: Reader gets file description.Aug 23 2019, 11:08 AM

Like @mobrovac I think it's very unclear what sort of reader would want a page as JSON, and what should be in the JSON. Is the user here really a reader, or are they an editor who wants to subsequently edit the page via T230843?

I think if we had some user stories that featured actual real-live users, then it would be easier to figure out what the requirements are.

tstarling renamed this task from Reader gets a page to REST: Reader gets a page.Aug 26 2019, 6:39 AM

tstarling added a project: MediaWiki-REST-API.

• eprodromou claimed this task.Aug 27 2019, 1:01 PM

• eprodromou updated the task description. (Show Details)

• eprodromou edited subscribers, added: • eprodromou; removed: EvanProdromou.

@mobrovac I appreciate the feedback. I've made changes to the status codes per your recommendations, including outputting text of the page in HTML (although this will need another pass for pages that are not wikitext, like JS and CSS). I appreciate the link to the architecture principles, and I agree that using the rendered version makes more sense.

I'll add a separate endpoint for getting the "source code" for a page, like /page/{title}/source, which will be more useful for the Contributor persona.

I'm leaving the JSON body type for the output. Developers are conditioned to work with JSON in the current generation of APIs, even if, as you say, most of the metadata could be encoded in HTTP headers.

For these use cases, I've intentionally elided the role of the developer as an intermediary persona, since it makes them complex and wordy. Can we assume that a developer will be using these endpoints to serve the requirements of their own end users? "As a Developer, I want an API endpoint to fetch the page to show to the Reader, because they want to read the page."

@tstarling It's always useful to question our assumptions!

I'll see if I can come up with some supporting data for the hypothesis that a) people want to read our content and b) JSON is a good format for APIs.

WDoranWMF edited projects, added Platform Team Workboards (User Stories); removed Platform Team Workboards (Green).Sep 3 2019, 5:07 PM

For the purposes of the CRUD interface, this needs to return whatever is sent with PUT, and that probably means wikitext. The path could be a subresource /page/{title}/wikitext, following the principle that subresources should be used for "larger, more expensive, less-frequently-used properties, or properties with different access rights" (ref), which would seem to describe the wikitext of the page. For HTML delivery we could use /page/{title}/html, but that doesn't seem to be necessary for the CRUD sprint.

The /page/{title} endpoint could contain links to the subresources, as suggested for document endpoints at https://restfulapi.net/resource-naming/ .

I don't think it's correct to send a 302 for a wikitext endpoint since that would prevent the redirect from being edited. This is why we don't use 302 responses for redirects in the web UI. It would be fine for there to be a reader-focused view in which redirects are delivered with a 302 code, although we should check the whole redirect chain and avoid sending clients into redirect loops. But for an editor-focused CRUD interface, redirects are just a kind of document.

"version_id" in the response should be called revision_id or structured as {"revision": { "id": ... }}.

"title" should be specified as the PDBK as discussed.

The key in the JSON response should be "html" not "text", maybe that was left in since before @mobrovac's request that this endpoint deliver HTML.

• eprodromou updated the task description. (Show Details)Oct 1 2019, 8:27 PM

• eprodromou updated the task description. (Show Details)

• eprodromou renamed this task from REST: Reader gets a page to Contributor gets page source.Oct 1 2019, 8:30 PM

• eprodromou triaged this task as Medium priority.

• eprodromou updated the task description. (Show Details)

• eprodromou updated the task description. (Show Details)Oct 1 2019, 8:32 PM

@tstarling I agree on most of these points. I changed "text" to "source" to make it clear it's wikitext source (or whatever source format the page has).

I changed the title so it's clear this is for getting the source. We'll keep the GET /page/{title} and PUT /page/{title} endpoints parallel, and we can keep the POST /page creation endpoint with the same inputs, too.

I updated the revision_id property.

I broke up the title_key and display_title into separate properties, so the client can use whichever one is necessary without having to do any processing on it.

• eprodromou mentioned this in T230842: Contributor creates a page.Oct 1 2019, 8:43 PM

• eprodromou moved this task from Backlog to Waiting to be Scheduled on the Platform Team Workboards (User Stories) board.Oct 1 2019, 11:00 PM

• eprodromou reassigned this task from • eprodromou to EvanProdromou.Oct 1 2019, 11:28 PM

• eprodromou claimed this task.Oct 8 2019, 3:59 PM

• eprodromou added a subscriber: EvanProdromou.

• eprodromou mentioned this in T231588: Implement GET Page Revision.Oct 8 2019, 7:48 PM

• eprodromou mentioned this in T235239: Curator gets revision source.Oct 11 2019, 1:40 AM

• eprodromou updated the task description. (Show Details)Oct 11 2019, 2:12 AM

WDoranWMF moved this task from Waiting to be Scheduled to Scheduled in Upcoming Sprint on the Platform Team Workboards (User Stories) board.Oct 11 2019, 5:27 PM

The path could be a subresource /page/{title}/wikitext, following the principle that subresources should be used for "larger, more expensive, less-frequently-used properties, or properties with different access rights" (ref), which would seem to describe the wikitext of the page. For HTML delivery we could use /page/{title}/html, but that doesn't seem to be necessary for the CRUD sprint.

The /page/{title} endpoint could contain links to the subresources, as suggested for document endpoints at https://restfulapi.net/resource-naming/ .

I very much agree with this approach, it's much more flexible than what's currently proposed.

In T229663#5576117, @Pchelolo wrote:

I very much agree with this approach, it's much more flexible than what's currently proposed.

I'd really like the GET and PUT URLs to be the same.

• eprodromou moved this task from Scheduled in Upcoming Sprint to Backlog on the Platform Team Workboards (User Stories) board.Oct 21 2019, 2:47 PM

tstarling mentioned this in T230846: Reader gets language links.Oct 25 2019, 1:08 AM

• eprodromou updated the task description. (Show Details)Oct 28 2019, 8:19 PM

I'd really like the GET and PUT URLs to be the same.

Given that we don't have the PUT endpoint specified yet, this in not a valid response to the concerns Tim and me has had.

Bug even with having the same paths, perhaps adding /wikitext in the PUT endpoint will make sense as well. Given that we have Parsoid in core now, we can imagine having PUT endpoint where you can post HTML. We have it in RESTBase as an experiment, it's totally possible (not that it's necessarily a good idea). Also, with MCR, a revision could possibly have some other content-types in it, so you can envision creating a new revision with some other type of content. So, I wouldn't set in stone the fact that the PUT endpoint will not have a suffix just yet.

This ship has probably sailed with /revision/bare being in production, but if we are to add /wikitext, perhaps we should design for MCR right away. I would imagine {entity}/{slot_id}/{representation_type} to be an ultimate goal. Backwards-compatible move to a design like this is impossible. So, I think this needs to be /page/{title}/main/wikitext if we were to be really extensible and future-proof

In T229663#5645280, @Pchelolo wrote:

I'd really like the GET and PUT URLs to be the same.

Given that we don't have the PUT endpoint specified yet, this in not a valid response to the concerns Tim and me has had.

PUT endpoint is here: T230843

Bug even with having the same paths, perhaps adding /wikitext in the PUT endpoint will make sense as well.

I'd like to have every type have a main representation at /<type>/<id>. I'd like that main representation to be read-write if the type is read-write.

Given that we have Parsoid in core now, we can imagine having PUT endpoint where you can post HTML.

Agreed. I think PUT /page/<title>/with_html (JSON with embedded HTML) and PUT /page/<title>/html (bare HTML) could both be useful, especially for visual editors.

We should design for MCR right away.

Based on my conversations with @tstarling last week, we're adding MCR endpoints to the design. I'm going to add it in a backwards-compatible way that favors the main slot over any others.

• eprodromou mentioned this in T237866: Reader reads a page with slots.Nov 10 2019, 5:30 PM

I added the content_model property to the page type.

• eprodromou updated the task description. (Show Details)Nov 12 2019, 8:02 PM

• eprodromou closed subtask T234382: Implement GET /page/{title} as Invalid.Nov 14 2019, 9:05 PM

• eprodromou closed subtask T234387: Document page source endpoint as Invalid.

• eprodromou mentioned this in T229662: Minimal client REST API.Nov 14 2019, 9:34 PM

• eprodromou edited projects, added Platform Team Workboards (Green); removed Platform Team Workboards (User Stories).Dec 2 2019, 8:43 PM

WDoranWMF moved this task from Next Sprint to Tasks Ready for Estimation on the Platform Team Workboards (Green) board.Dec 2 2019, 9:27 PM

contributors: array of User objects for the contributors to the page, each with these properties
id: user ID
name: user name

Hm... This ain't gonna be quick nor it is gonna be easy on the database. We can totally do it, there's a method in RevisionStore for getting this list, but it will require a full page history scan and a join, and it's DISTNCT. We need to try out this query on some larger pages first.

Also, including this makes it much harder to cache the result. All the rest of the properties seem to be properties of the latest revision of the page, this is the property of the whole page. So, much more conditions for cache invalidation. What's the use-case for having this info in this endpoint? Does it worth introducing extra complexity and load?

In T229663#5707760, @Pchelolo wrote:

What's the use-case for having this info in this endpoint? Does it worth introducing extra complexity and load?

One of our architecture principles is:

API/LICENSE: data formats and APIs that provide access to user generated content MUST be designed to provide easy access to all relevant information about authorship and licensing.

(emphasis mine) This is a "MUST", so we really don't have a lot of options.

This is a "MUST", so we really don't have a lot of options.

Can we only provide an author of the latest revision?
Can we limit the number of contributors to X?
Can we introduce a separate more in-depth contributors endpoint into the history API and add a link to it in here?

Here's the query that RevisionStore::getAuthorsBetween would use if we have no limit and no from/to constraints:

SELECT DISTINCT actor_rev_user.actor_user AS rev_user,actor_rev_user.actor_name AS rev_user_text,temp_rev_user.revactor_actor AS rev_actor FROM revision JOIN revision_actor_temp temp_rev_user ON ((temp_rev_user.revactor_rev = rev_id)) JOIN actor actor_rev_user ON ((actor_rev_user.actor_id = temp_rev_user.revactor_actor)) WHERE rev_page = 534366 AND ((rev_deleted & 4) = 0);

It takes ~0.8 seconds to execute on Barack_Obama. This endpoint looks like we can expect it to have way more traffic than the history count endpoints, so this performance doesn't seem acceptable. Or not? What's the expected usage of this in comparing to other endpoints? The result is cacheable, but cache should be invalidated under a lot of conditions.

I'm not advocating for dropping this info from the response, I'm presenting the data so that we know the tradeoffs.

WDoranWMF reassigned this task from • eprodromou to • nnikkhoui.Dec 3 2019, 5:12 PM

WDoranWMF set the point value for this task to 5.Dec 3 2019, 5:43 PM

WDoranWMF moved this task from Tasks Ready for Estimation to Ready on the Platform Team Workboards (Green) board.Dec 3 2019, 6:03 PM

@Pchelolo you seemed to touch on this earlier, but regarding content_model, I see a todo comment in Revision.php to remove the getContentModel() function. It states:

with MCR there no longer is a single model associated with a revision.

So with MCR each slot has its own content model, not each page ya? Should we modify this endpoint to instead give back slots on a given page? I've heard some about MCR, but not sure whether its supposed to be taken into consideration for this endpoint!

Yeah, that's why the definition was changed to

content_model: content model for the main slot of page

I've heard some about MCR, but not sure whether its supposed to be taken into consideration for this endpoint!

There's been a lot of discussion on this, and it's been decided to postpone it. So everything you return is for the main slot.

@Pchelolo oh - i missed that...ok thank you for clarifying.

• eprodromou updated the task description. (Show Details)Dec 4 2019, 4:47 PM

• eprodromou mentioned this in T239843: Reader gets list of contributors to a page.Dec 4 2019, 5:20 PM

• eprodromou updated the task description. (Show Details)Dec 4 2019, 5:53 PM

• eprodromou updated the task description. (Show Details)

@nnikkhoui I changed the "license" and "other_licenses" elements to objects, so we have some more options. Instead of just a SPDX code, we also have an URL and a name.

You should be able to get the URL for the license either from $wgRightsURL or $wgRightsPage. See https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/master/includes/DefaultSettings.php#7124 for descriptions of these settings. If it's a page, you'll need to turn it into a fully-qualified URL.

You should be able to get the title property from $wgRightsText, or from $wgRightsPage. Check the details in the default settings file.

Getting the SPDX code is trickier. If it were me, I'd use the data at the SPDX licenses table to map URLs to codes. But that might be a hassle. See what you can do and I'm happy leaving this field null for now.

WDoranWMF moved this task from Ready to Doing on the Platform Team Workboards (Green) board.Dec 5 2019, 4:56 PM

Change 555560 had a related patch set uploaded (by Nikki Nikkhoui; owner: Nikki Nikkhoui):
[mediawiki/core@master] GET Page Source Endpoint

https://gerrit.wikimedia.org/r/555560

gerritbot added a project: Patch-For-Review.Dec 6 2019, 6:12 PM

• nnikkhoui moved this task from Doing to Waiting for Review on the Platform Team Workboards (Green) board.Dec 9 2019, 3:42 PM

Jdforrester-WMF changed the status of subtask T240294: Add $wgRightsSPDXId to production settings from Open to Stalled.Dec 15 2019, 7:19 PM

• nnikkhoui closed subtask T240294: Add $wgRightsSPDXId to production settings as Declined.Dec 19 2019, 6:00 PM

Change 555560 merged by jenkins-bot:
[mediawiki/core@master] GET Page Source Endpoint

https://gerrit.wikimedia.org/r/555560

Maintenance_bot removed a project: Patch-For-Review.Jan 8 2020, 5:11 PM

ReleaseTaggerBot added a project: MW-1.35-notes (1.35.0-wmf.15; 2020-01-14).Jan 8 2020, 6:00 PM

Now you can add integration tests to the endpoint.