Page MenuHomePhabricator

REST route handler extension interface RFC
Closed, ResolvedPublic

Description

This is an RFC for the Handler interface of the proposed new REST API for MediaWiki.

The purpose of this RFC is to establish a bare-bones interface for core and extension REST API handlers. We want to achieve consensus on the elements proposed, to the extent necessary to avoid backwards-incompatible changes in the foreseeable future.

This RFC is not intended to be feature-complete and is not a general discussion of the REST API. There will presumably be other RFCs on other elements of the REST API prior to its public deployment.

There is a work-in-progress patch showing the proposed details at Gerrit 508972.

Glossary

  • “Route” - a method name and URL pattern for a group of related URLs, with named parameters.
  • “Handler” - a code unit (function or class or object) responsible for accepting HTTP requests to a given route and
  • “Router” - accepts HTTP requests for the REST API, determines which route the request maps best to, and executes the handler, returning the results as an HTTP response.

Route declaration

There would be an extension attribute called RestRoutes, which is an array of routes, for example:

"RestRoutes": [
  {
    "method": "GET",
    "path": "/user/{userName}/hello",
    "class": "HelloHandler",
    "parameters": [
      {
        "name": "userName",
        "in": "path",
        "required": true,
        "type": "string"
      }
    ]
  }
]

The keys in this JSON route object are:

  • method: A string HTTP method name. Or an array of method names, which is equivalent to declaring multiple routes, identical except for their method.
  • path: A path template. Parameter names are enclosed in braces. A parameter must be a whole path component, where path components are separated by slashes.
  • class, factory, args, calls: Parameters passed to ObjectFactory::getObjectFromSpec() to create the Handler object. Further considerations and future directions for object construction and dependency injection are discussed at T222409.
  • parameters: The syntax in this example is derived from OpenAPI.

Routes in core would be declared in a JSON file, say includes/Rest/coreRoutes.json.

RequestInterface

The proposed RequestInterface is identical to PSR-7's ServerRequestInterface, except:

  • The cloning mutation methods "with..." are omitted. These could be added at a later date, when the need arises.
  • Mutation of attributes is supported via setAttributes()
  • Prefixed cookie names are supported with getCookie()
  • Instead of getParsedBody(), we provide getPostParams(), which is strictly a $_POST wrapper, it is not used for other request body deserialization. This modification came out of a Gerrit discussion of the drawbacks of getParsedBody(). In our opinion, getParsedBody() represented an inappropriate mixing of $_POST with extensible validation and deserialization of other types of request body.
  • Instead of getAttributes(), we have getPathParams().

The WIP patch proposes two implementations of this interface: RequestFromGlobals and RequestData. RequestFromGlobals gets its data on demand directly from the global state, for example the superglobals. RequestData gets its data from an associative array passed to its constructor, with sensible defaults for omitted parameters. RequestData is intended for internal requests and testing.

Why can't we inject global state into RequestData for use in an externally triggered request? There are three details that make this inconvenient:

  • RequestInterface has getParsedBody(), which returns structured data representing the body of the POST request. In RequestFromGlobals the parsing is done on demand via an injected BodyParserFactory. In RequestData the parsed body is injected into the constructor.
  • RequestInterface has getBody() which returns a stream representing the request body. In RequestFromGlobals, this stream is automatically derived from php://input. In RequestData, the request body, if any, is a string, converted to a stream by the constructor.
  • RequestInterface has getUploadedFiles(), which returns an array of UploadedFileInterface objects. In RequestFromGlobals this array is lazy-initialized from $_FILES. In Request, the objects are constructed by the caller.

In summary, RequestData and RequestFromGlobals have different constructor parameters, responding to the different needs of internal versus external callers.

I'm proposing that RequestFromGlobals gets its data from the superglobals directly, instead of from $wgRequest. My feeling is that WebRequest, with its direct access to the superglobals in the base class, is awkward for testability, and could be improved by having it take a Rest\RequestInterface in its constructor. The superglobals in WebRequest would be replaced by injected data from a RequestInterface.

ResponseInterface

The proposed ResponseInterface is similar to PSR-7's ResponseInterface, except:

  • For convenience and efficiency, the cloning mutator methods "with..." have been replaced by non-cloning mutator methods "set...". We considered PSR-7's rationale for making ResponseInterface immutable, but did not accept it.
  • getRawHeaderLines() was added, for convenient access to header lines.
  • setCookie() was added, for cookie prefixing and serialization.

The response contains the body text as a Psr\Http\Message\StreamInterface object. This may be a simple wrapper around a string. Guzzle (already a core dependency) provides several convenient classes derived from StreamInterface. Using Guzzle, a stream can be constructed which reads from a local file or HTTP client request.

Handler

Handler is the abstract base class for route handlers.

Handler itself requires no constructor parameters. Constructor parameters may be used by subclasses to identify a particular route to a shared handler.

State is injected into the handler object by the Router shortly after its construction, using an init() method. Handler provides accessors such as:

  • getRequest(): The RequestInterface object.
  • getConfig(): The configuration array of the route, decoded from the JSON.
  • getResponseFactory(): A ResponseFactory object, containing a variety of convenient factory functions for creation of a response object.
  • getValidator(): A helper object for parameter validation.

Router::executeHandler() is equivalent to ApiBase::executeAction(), and performs the following tasks:

  • Parameter validation
  • Authorization
  • Conditional 304 responses
  • Calling of the abstract execute() function
  • Optionally, construction of the Response object from a scalar or array return value.

The following methods will have default implementations, but may also be overridden by the subclass:

  • getParameters(): Allowing the parameter declaration to be dynamically constructed, instead of derived from JSON
  • getLastModified(), getETag(): The last-modified time and ETag for 304 handling, similar to ApiBase::getConditionalRequestData()
  • There will presumably also be documentation/reflection methods, for example an equivalent to ApiBase::getSummaryMessage()

SimpleHandler

SimpleHandler maps path template parameters to formal parameters of the run() method, providing slightly simpler code:

class SimpleHandler extends Handler {
	public function execute() {
		$params = array_values( $this->getRequest()->getAttributes() );
		return $this->run( ...$params );
	}
}

class HelloHandler extends SimpleHandler {
	public function run( $name ) {
		return "Hello, $name!";
	}
}

Using a factory function, handler classes may be anonymous.

Router

class Router {
	public function execute( RequestInterface $request ): ResponseInterface { ... }
}

Router::execute() takes a request object, matches the path against the list of known path templates, and constructs a handler. It sets the request attributes to be the path template parameters derived from the template match. It executes the handler, and returns the response object.

Router catches exceptions from the Handler. If the exception is derived from RestException, the exception is not logged, the exception is just converted to a normal response. Other exceptions will be dealt with in a similar way to the Action API.

It is proposed to make Router be a service. This is controversial because it implies that internal requests will be encouraged.

EntryPoint

There will be a new entry point file, rest.php, which invokes the static method MediaWiki\Rest\EntryPoint::main(). This method constructs a request object, executes the router, and outputs the response.

The root path for routing is defined by $wgRestPath.

Open questions

  • What should standard responses look like? For example, standard 404s and parameter validation errors. Should they be JSON? Localised?
  • ResponseFactory methods and their arguments need to be defined. What sorts of responses will endpoints typically need?
  • Is ResponseFactory responsible for localisation?
  • Do iteration/index routes require any special handling?
  • The current path template matcher allows only string literals and wildcards. Conflicting paths are not allowed. Is this sufficient? This means that for example the existing RESTBase routes /lists/setup and /lists/{id} would conflict, preventing registration. If parameter validation was integrated with path template matching, such routes could be achieved. This would come at the cost of increased complexity of the matcher, and parameter validation errors would typically lead to an unexplained 404 instead of being customisable by the handler.
  • The Handler base class is going to be quite large. Not quite the kitchen sink of ApiBase (2700 lines), because substantial functionality like RequestInterface and ResponseFactory are split out, but it could easily be 500 lines. Is that OK?
  • Should Handler implement IContextSource? Or provide getContext()?
  • How should deserialized parameters be provided to the handler? The Action API and OpenAPI both have a concept of parameter deserialization, which occurs simultaneously with validation. Parameters are converted from strings to a specified type.

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Anomie added a comment.May 6 2019, 6:35 PM

AIUI we want to do that by discouraging people from using MediaWikiServices::getInstance(), and eventually deprecating it (there's no reason for a DI container to be a singleton).

That doesn't solve the problem of someone deciding they want to inject the entry point handler into their class from the wiring file.

I note that seems to assume the body is always in some standard format (JSON or the like).

Not really, the framework would just skip parsing unknown formats (and the parsed body will be the same as the raw body) and leave it to the module.

It seems like what you really want is a method on the Request class that tries to parse the body as JSON, rather than the Router parsing it.

Tgr added a comment.May 6 2019, 9:48 PM

That doesn't solve the problem of someone deciding they want to inject the entry point handler into their class from the wiring file.

I see what you mean. But there is another use for getting almost everything from wiring files, which is that the site owner can replace them with local customizations (instead of having to patch/fork MediaWiki). I'd really like to support that.

A bit hacky, but I guess we could 1) use defines to mark the entry point, so the router can only be used in rest.php (and tests) and 2) mark somehow that the router is processing a request, so you cannot do "nested" processing.

tstarling added a comment.EditedMay 6 2019, 11:36 PM

Most are just to generate the right version when manually calling header( "HTTP/1.1 404 Not Found" ) or the like. The one in OutputHandler was added in rSVN19996 to send a Content-Length header to HTTP/1.0 clients; it doesn't specify why it's not also including it for 1.1 clients.

It's because "Transfer-Encoding: chunked" was introduced in HTTP/1.1, and Apache sends it by default. RFC 2616 section 4.4 says "The Content-Length header field MUST NOT be sent if these two lengths are different (i.e., if a Transfer-Encoding header field is present)". In HTTP/1.0, sending a Content-Length header allows the end of the body to be detected without closing the connection, thus allowing keepalive. This HTTP/1.0 keepalive was the only kind of keepalive supported by Squid at the time.

bearND added a subscriber: bearND.May 7 2019, 4:00 PM

I wrote a "path template matcher" class based on the tree building code I previously pasted. It's decoupled from the router and could go in a separate library. I wrote a benchmark script for it, which uses path templates like '/6/{8}/{9}/1/2/{7}'. They are random templates of length 1-6 where 30% of the path components are parameters, filtered in advance to remove conflicts. For 1000 such templates, it takes about 12ms to build the tree, and 1.3µs to match a path against it. The paths were prefiltered so that they always matched a template. This is a very rough model for intended usage. But it supports the proposition that tree based routing is an efficient approach to matching templates, but that a cache of the routing tree is required.

Seems like a not very impactful performance optimization, given that the API module itself will probably be 2-3 magnitudes slower. Or are you thinking of a scenario where the router also handles caching?

I think 2-3 orders of magnitude, i.e. 0.1% to 1% overhead, is an appropriate performance target. If it was only 1 order of magnitude (10% overhead) then we would call it low-hanging fruit.

I wanted to add a note about caching headers. @Tgr got the algorithm just about correct; chapter 11 of "Restful Web APIs" covers the whole complicated mess really nicely, and I recommend the whole book for anyone interested in API design.

One reason I wanted to get ETag and Last-Modified date into to the framework, or at least under discussion, is that this HTTP caching mechanism tends to be a source of performance hot spots early on with APIs. Including it as part of the framework is a nudge for handler implementers to get it correct the first time.

Alternatively, they can be left out of the framework as long as we implement them in the handler. It might be possible to add a helper method...?

public function process(Request $request, Response $response) {
    var $skip = $this->doCacheStuff($request, $response, $this->page->lastModified, md5($this->page->text));
    if ($skip) {
      return;
    } else {
        // do actually processing here
    }  
}

Explicit Expires headers also help, but so many of the entities in our system (page, media file, user) are volatile that it might not be useful. I guess page version and media file versions are close enough to immutable that it would be good to add Expires headers for them.

I wrote a "path template matcher" class based on the tree building code I previously pasted. It's decoupled from the router and could go in a separate library. I wrote a benchmark script for it, which uses path templates like '/6/{8}/{9}/1/2/{7}'. They are random templates of length 1-6 where 30% of the path components are parameters, filtered in advance to remove conflicts. For 1000 such templates, it takes about 12ms to build the tree, and 1.3µs to match a path against it. The paths were prefiltered so that they always matched a template. This is a very rough model for intended usage. But it supports the proposition that tree based routing is an efficient approach to matching templates, but that a cache of the routing tree is required.

This is great news.

Change 508972 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/core@master] [WIP] REST API

https://gerrit.wikimedia.org/r/508972

tstarling updated the task description. (Show Details)May 10 2019, 6:54 AM
tstarling updated the task description. (Show Details)May 13 2019, 12:42 AM

Some questions copied from the task description:

  • How should MediaWiki policies like cookie name prefixes be handled? Should RequestInterface and ResponseInterface be extended as necessary? Should they wrap $wgRequest?
  • How should deserialized parameters be provided to the handler? The Action API and OpenAPI both have a concept of parameter deserialization, which occurs simultaneously with validation. Parameters are converted from strings to a specified type.
  • How should MediaWiki policies like cookie name prefixes be handled? Should RequestInterface and ResponseInterface be extended as necessary? Should they wrap $wgRequest?

We probably should apply the prefixes and such, since we probably do want them applied on cookies set from the REST handlers.

To do that, we will need to extend ResponseInterface with a cookie-setting method. It needn't actually wrap WebResponse, instead EntryPoint could use the collected data to call WebResponse's existing method. Adding a cookie-getting method to RequestInterface seems less needed, but it'd also be simple enough. WebRequest::getCookie() seems simple enough to just copy if we omit the Unicode normalization bit that's only intended for $_GET anyway.

  • How should deserialized parameters be provided to the handler? The Action API and OpenAPI both have a concept of parameter deserialization, which occurs simultaneously with validation. Parameters are converted from strings to a specified type.

I see you already are passing the positional parameters via the 'attributes'. You could add validated $_GET and $_POST parameters in that way too easily enough.

Post data in JSON format could be more difficult to validate and handle, but on the other hand the JSON already has scalar types so conversion may be less useful. You might just pull in some existing JSON schema validation library for the validation. Or just leave validation of the JSON data up to the handlers entirely.

Question from @Krinkle from Gerrit: can getLastModified() be omitted? You can construct an ETag from a timestamp, so ETag is a more generic facility.

tstarling updated the task description. (Show Details)May 14 2019, 3:14 AM

To do that, we will need to extend ResponseInterface with a cookie-setting method. It needn't actually wrap WebResponse, instead EntryPoint could use the collected data to call WebResponse's existing method. Adding a cookie-getting method to RequestInterface seems less needed, but it'd also be simple enough. WebRequest::getCookie() seems simple enough to just copy if we omit the Unicode normalization bit that's only intended for $_GET anyway.

OK, so we already have RequestInterface::getCookieParams(), which will continue to return the raw $_COOKIE array per PSR-7. I added RequestInterface::getCookie() which gets a prefixed cookie from this array, with the prefix injected into the constructor. In ResponseInterface I added setCookie(), which has the same parameters as WebResponse::setCookie(), and I added getCookies(), which just returns the collected parameters. EntryPoint calls WebResponse::setCookie().

tstarling updated the task description. (Show Details)May 14 2019, 6:39 AM
Tgr added a comment.May 14 2019, 10:54 AM

Question from @Krinkle from Gerrit: can getLastModified() be omitted? You can construct an ETag from a timestamp, so ETag is a more generic facility.

Last-Modified can handle the situation where you get older content than what you already have (due to replication lag, for example), with an ETag that would be a cache miss and the user would get outdated content (e.g. a version of the article where they don't see the edit they just made). So I don't think either mechanism is strictly more generic than the other.

Question from @Krinkle from Gerrit: can getLastModified() be omitted? You can construct an ETag from a timestamp, so ETag is a more generic facility.

If we want to support clients using If-Modified-Since in their requests, we probably need getLastModified() and the Last-Modified header. I don't know if we can rely on all clients also implementing ETag/If-None-Match.

Krinkle added a comment.EditedMay 14 2019, 3:19 PM

Question from @Krinkle from Gerrit: can getLastModified() be omitted? You can construct an ETag from a timestamp, so ETag is a more generic facility.

If we want to support clients using If-Modified-Since in their requests, we probably need getLastModified() and the Last-Modified header. I don't know if we can rely on all clients also implementing ETag/If-None-Match.

Are there specific performance-sensitive use cases where we expect legacy clients to need old LastModified-based validation? For a new API built in 2019, I don't think we should invest resources for those use cases. Especially considering this is only a performance optimisation. The handling of this header is entirely optional. The API works perfectly fine without it.

For modern API clients, would ETag be harder or less likely to be supported than Last-Modified? Both are an opaque value to read from a response and reflect back in a future request.

From a server perspective, every resource, can be provided an E-Tag without additional complexity or overhead. (Resources that are only addressed by their content, can use derive a tag from their address or content hash). Timestamp-based validation, on the other hand, would require significant extra storage to be allocated, which I assume we will not always do. This is among the reasons why ResourceLoader has switched from timestamps to E-Tags only. (Dynamically generated modules are generated at run-time, they are what they are at any given moment in time. These can be validated by their content hash. They cannot have a last-modified timestamp unless we were to permanently store all these hashes solely for the purpose of tracking when we first observed them. This is a system we had in 2015 and choose to get rid of because it was slow, didn't scale well (yet), and ultimately wasn't necessary.)

I have found at least two problems with Last-Modified:

  • Reverting a resource to a previous version should invalidate the cache, but would not if the timestamp runs backwards as a result. This could be worked around by requiring a separate layer of storage in the backend separate from the content tracking to track when the link was updated. Requiring that extra storage is not trivial, and sometimes infeasible. We ran into this with Git-based resources, for example.
  • Using timestamps for version identification is not sufficiently precise. This will cause bugs and would require additional workarounds, if indeed feasible, for resources generated with close-enough timestamps.

How will we address these issues?

Last-Modified can handle the situation where you [would otherwise] get older content than what you already have (due to replication lag, for example), with an ETag that would be a cache miss and the user would get outdated content [..]

This is an interesting case. You're right that a LastModified-based system would correctly prefer the newer object in that case. But, I'm not sure that's a feature. I would classify that as a coincidence, a side-effect from the two larger problems described above.

First and foremost, this is imho not the responsibility of the rest API to handle. But rather something like ChronologyProtector to only expose users about information in a forwards-progressing manner.

Anyhow, the Rest API may indeed need forward-consistency. One way to solve that for E-Tag (which we're gonna have to do either way, because the replication issues applies there as well, is to include opaque domain-specific information in the tag that the server can use to ensure forward-consistency. For page histories, this could be revision ID+timestamp of the top entry and the API module checking if a newer revision exists on that page. And for some module, a timestamp might even suffice (which could of course be the E-Tag and parsed accordingly).

In any event, a simple timestamp can't be the solution for all cases. If we find that internally (not exposed to client) plain timestamps are a common solution that suffices for some cases, we could add a utility method for that. But I'd still recommend to externally standardise on E-Tag for consistency.

For modern API clients, would ETag be harder or less likely to be supported than Last-Modified?

It may depend on the HTTP library and such said clients are using. Remember that our third-party API-consuming software ecosystem is not generally all that good at updating things until they break, and sometimes not even then. They're also really good at reusing old code that worked in other projects in the past.

Both are an opaque value to read from a response and reflect back in a future request.

RFC 7232 § 3.3 does not define the timestamp in If-Modified-Since as opaque, it defines it as an HTTP-date and specifically allows for the client to generate its own value.

From a server perspective, every resource, can be provided an E-Tag without additional complexity or overhead. (Resources that are only addressed by their content, can use derive a tag from their address or content hash).

You can only derive the tag from the address if it's an immutable resource. If you're fetching the wikitext content of a revision, sure, but even then access controls (e.g. RevDel status) can change. But a page's wikitext depends on its current revision, a revision's or page's HTML depends on transcluded templates and such, and so on.

For the content hash, you have to either have stored the hash or already fetched the content. You argue against the former in the case of timestamps, and if the goal is to reduce response time (rather than response bandwidth) the latter is unlikely to gain you much.

Dynamically generated modules are generated at run-time, they are what they are at any given moment in time. These can be validated by their content hash. They cannot have a last-modified timestamp unless we were to permanently store all these hashes solely for the purpose of tracking when we first observed them.

You could non-permanently store the hash→timestamp or name→(hash,timestamp) mapping in memcached or the like. Or you could just not have a last-modified timestamp for those resources, or consider them "last modified" at the current moment. You probably couldn't do the latter two for RL because it would break caching when you combine modules' contents into one response, but that sort of combining seems unlikely to be a use case here. For our use cases, I suspect "just not have a last-modified timestamp for those resources" would work fine.

  • Reverting a resource to a previous version should invalidate the cache, but would not if the timestamp runs backwards as a result. This could be worked around by requiring a separate layer of storage in the backend separate from the content tracking to track when the link was updated. Requiring that extra storage is not trivial, and sometimes infeasible. We ran into this with Git-based resources, for example.

For MediaWiki page histories, the "revert" creates a new revision row with a new timestamp referencing the old content. For a revert in a git repository, the same thing happens.

I suppose you could get reversed timestamps if you're looking at a file mtime-based timestamp and a rollback from 1.X.0-wmf.Y to 1.X.0-wmf.Y-1 (which changes the deployment tree used, not the filesystem). Or if you're using something like git-restore-mtime. That sounds like another case that's likely to be encountered for RL but not so much here.

  • Using timestamps for version identification is not sufficiently precise. This will cause bugs and would require additional workarounds, if indeed feasible, for resources generated with close-enough timestamps.

If a client only supports If-Modified-Since, it's up to the client to deal with such issues. All we can do is provide RFC-compliant behavior for them to work with.

Bottom line: IMO we should keep getLastModified() as implemented, where a handler can return null if determining the timestamp is too much trouble.

tstarling updated the task description. (Show Details)May 16 2019, 3:24 AM

Latest task description edit: RequestInterface::getParsedBody() has been removed, replaced by getPostParams().

Conversation copied from Gerrit, to be continued here:


I wrote:
Is there a strong reason to have the REST routing framework in core? If not, I strongly advise to put it into a separate repository right away.

This would prevent tight coupling and cyclic dependencies, and force us to figure out how to inject the MW specific functionality that is currently used - I suppose that would primarily be the ObjectFactory. This could be resolved by moving ObjectFactory, or the proposed ObjectFactoryService, into a separate library as well.

For the topic of defining core routes, this would just mean that the location of the route definition file would have to be injected into the router.


Tim replied:

The following dependencies exist in the current code:

  • RequestFromGlobals calls WebRequest::getGlobalRequestURL().
  • Router takes a BagOStuff parameter, which is a core class.
  • EntryPoint depends on many things.

The following dependencies are proposed:

  • ResponseFactory might use FormatJson, which is a core class.
  • ResponseFactory will probably use MediaWiki's i18n.
  • Body parsing might use FormatJson.
  • Request validation may need i18n for error message output.

As I said in T223240, ResponseFactory provides policy, not basic functionality. Its job is to give the responses a consistent MediaWiki flavour. Router depends on it for 404 responses and exception responses.

This could be resolved by moving ObjectFactory, or the proposed ObjectFactoryService, into a separate library as well.

ObjectFactory is already in a separate library.

I think it would be necessary to:

  • Introduce and libraryize a new interface for BagOStuff
  • Split RequestFactory into a (mostly unused) library base class and a MediaWiki-specific derived class
  • Move WebRequest::getGlobalRequestURL() to the REST library (or to its own library), and have WebRequest call the new function, to avoid code duplication

EntryPoint would remain in core.

I don't really think it is warranted. I planned on making PathTemplateMatcher into a separate library, but not the rest of it. But I'd be interested to know what others think.

I think it would be necessary to:

  • Introduce and libraryize a new interface for BagOStuff

Instead of creating a library that just contains an abstract K/V store or cache interface, the REST library could define an interface for what it needs in terms of caching, with a trivial nor null implementation. MW can then provide an implementation wrapping a BagOStuff. This approach avoids overgeneralization of interfaces and proliferation of micro-libraries, at the price of creating (potentially many) trivial adapter classes.

  • Split RequestFactory into a (mostly unused) library base class and a MediaWiki-specific derived class

That doesn't sound too terrible.

  • Move WebRequest::getGlobalRequestURL() to the REST library (or to its own library), and have WebRequest call the new function, to avoid code duplication

Maybe this information can be injected cdirectly, instead of injecting a way to ask for it?

EntryPoint would remain in core.

Yes.

I don't really think it is warranted. I planned on making PathTemplateMatcher into a separate library, but not the rest of it. But I'd be interested to know what others think.

I think new code should make an effort to avoid strong coupling between components. Putting it into a separate library isn't necessary for that, but is a n easy way to enforce the separation from core code, ensuring no cycles can creep in. And it has the nice side effect of being re-usable for other projects.

In the principles, we have OPEN/REUSE: software components that implement broadly applicable functionality SHOULD be designed for reusability and be published for re-use. It seems to me this applies here.

Decoupling comes with some overhead, but I think it's a fair price to pay for improved maintainability. I'd be happy to put in the necessary work to keep the rest routing code independent of MediaWiki.

In T221177#5188930, @daniel wrote that @tstarling wrote:
  • Introduce and libraryize a new interface for BagOStuff

BagOStuff is in includes/libs/, which in theory means it's in a state where it could be put into a library as-is (i.e. it has no dependencies on MediaWiki). MediaWiki-specific implementations such as SqlBagOStuff are in includes/objectcache/.

  • Move WebRequest::getGlobalRequestURL() to the REST library (or to its own library), and have WebRequest call the new function, to avoid code duplication

I'd go with moving it to RequestFromGlobals, and letting WebRequest call it there. This goes well with your suggestion of eventually having WebRequest wrap the new RequestInterface.

I don't really think it is warranted. I planned on making PathTemplateMatcher into a separate library, but not the rest of it. But I'd be interested to know what others think.

I have no strong opinion. If we do do it, I think the plan you laid out sounds sensible.

  • Move WebRequest::getGlobalRequestURL() to the REST library (or to its own library), and have WebRequest call the new function, to avoid code duplication

Maybe this information can be injected cdirectly, instead of injecting a way to ask for it?

Some code somewhere needs to calculate it from the globals. RequestFromGlobals is the obvious place.

BagOStuff is in includes/libs/, which in theory means it's in a state where it could be put into a library as-is (i.e. it has no dependencies on MediaWiki). MediaWiki-specific implementations such as SqlBagOStuff are in includes/objectcache/.

That is T146257, which is apparently close to completion. RESTBagOStuff depends on MultiHttpClient. @Krinkle suggested leaving RESTBagOStuff out of the library to avoid that problem.

I'm leaning towards a descriptive name for this library, either Rest or RestServer. The Gerrit repository would be mediawiki/libs/Rest, the Composer name would be wikimedia/rest, and the namespace would be Wikimedia\Rest.

Coming a bit late to the party, perhaps, but here are some of my (high-level) thoughts on this subject.

Who is the API for vs by whom the API is built

This was perhaps discussed elsewhere (and I'm simply not aware of it), but the ticket and its discussion seem to be based on the (unspoken) assumption that modules/extensions have to be able to register their own REST API paths and end points independently, much in the way the Action API currently works. I'd like to challenge that assumption as I think it brings about a conundrum. I agree 100% that doing so brings a lot of flexibility to the platform and allows for easier development. However, IMO, the API should be user-centric. Having an API allows to expose the functionalities to clients, but it should be done in such a way as to (i) hide implementation and software-level decisions from the client; and (ii) allow the client to have a coherent, unified view of the data we expose to them. By accepting the premise that each module/extension would have its own hierarchy/prefix, we put the burden of understanding the platform's internal structure and module organisation onto the client prior to them using the API. One of the great benefits of the current WM REST API is that it presents a coherent, logical hierarchy to the end user because it's hand-crafted, in this way lowering the barrier to entry. For example, the mobile content service provides output for both /page/mobile-html/ as well as /feed/* end points - these fall under different sub-hierarchies because it makes more sense to present them to the client in such a way, regardless of the underlying implementation. Giving complete freedom to modules and hand-crafting the hierarchy are (likely) two extremes on the spectrum of the possible choices, so I think we should try to balance them.

Versioning

Due to the aforementioned assumption, versioning has been abandoned altogether. However, having versioning allows us to signal to consumers of our data in which stage/shape/form the proposed APIs come. This is very valuable information to them, and allows them to plan accordingly. I am not sure that dropping versioning support because of implementation decisions is a good way to go.

Currently, in the REST API there two levels of versioning: a global one (denoted by the v1 prefix) and individual, end point-specific versioning (coming in the form of Accept headers). The global one refers to the overall look and contents of the hierarchy: we guarantee that stable end points will not be changed/removed/etc. On the per-end-point basis, they can be stable, unstable and experimental, giving different guarantees about their existence and the level of reliance clients can have. Furthermore, each end point may choose to declare a specific content version: if the content outline changes in any way, the version is to be increased following the rules of semantic versioning.

Caching

In the current implementation of the REST API, the general flow is the following: for a request, (i) check the storage (based on the request params); (ii) if there is a cache hit, return it; (iii) else, have the handler execute it; (iv) store the response before returning it. It honours two caching directives: no-cache (short-circuits (ii)) and no-store (short-circuits (iv)). We use the former for background updates (e.g. re-rendering HTML of pages when an included template changed) and the latter for delivering language variants (we store only the canonical variant and convert the others in-place).

Relatedly, I agree that ETag / If-Modified-Since / If-Match should be preferred to Last-Modified. However, they can all be used together. Each response can contain a Last-Modified header, allowing clients to perform client-side request short-circuiting if warranted. That said, it should be based on when the data that is being delivered has been last modified, not when the code delivering it has changed.

Route Handling

I agree that parameter validation should be made part of the framework itself rather than left to each handler, especially when we'll have the path parameter types defined in advance. For route conflicts, IMHO the best would be to be able to pick the obvious choice, if it exists. For example, given /foo/{var} and /foo/bar, the latter should be picked when the path is /foo/bar and the former should be used for everything else. Tim's suggestion of using prefix-based trees plays nicely into that. RESTBase currently does not have path conflict resolution; instead, the first route that matches is picked. This is both good and bad, as it allows us to list the higher-priority route first.

I am not sure I understand the need for typed path parameters, though. Allowing for both /foo/{var:int} and /foo/{var:string} seems like trying to conflate multiple routes/resources into one, and as such is something that should be avoided and discouraged.

I'm leaning towards a descriptive name for this library, either Rest or RestServer. The Gerrit repository would be mediawiki/libs/Rest, the Composer name would be wikimedia/rest, and the namespace would be Wikimedia\Rest.

Seems sane to me.

This was perhaps discussed elsewhere (and I'm simply not aware of it), but the ticket and its discussion seem to be based on the (unspoken) assumption that modules/extensions have to be able to register their own REST API paths and end points independently, much in the way the Action API currently works.

That's because they do, unless you're wanting to force every site administrator to configure the router endpoints manually for every extension they install.

But that doesn't seem to actually matter for the point you're making in the rest of the paragraph.

Requiring an extension's modules to have a consistent prefix was discussed early on (T221177#5142765), but was eventually dropped (T221177#5145852). In the code Tim implemented there's nothing preventing an MCS extension from providing both /page/mobile-html/ and /feed/* end points as long as those don't collide with some other endpoint being defined by core or another extension.

Currently, in the REST API there two levels of versioning: a global one (denoted by the v1 prefix) and individual, end point-specific versioning (coming in the form of Accept headers). The global one refers to the overall look and contents of the hierarchy: we guarantee that stable end points will not be changed/removed/etc. On the per-end-point basis, they can be stable, unstable and experimental, giving different guarantees about their existence and the level of reliance clients can have. Furthermore, each end point may choose to declare a specific content version: if the content outline changes in any way, the version is to be increased following the rules of semantic versioning.

Since the endpoints aren't under central control (see above), we can't really make the guarantees that you're claiming for the global version number.

If any core handler or any extension wants to include a version number or stability indicator in its endpoint paths, documentation, or response data, there's nothing stopping it from doing so. If every core handler and extension (or site-level overrides) wants to collaboratively register all paths to include the appearance of a global version number, there's nothing stopping that either. The semantics of the paths (beyond the existence of {var} path components) are opaque to the router, it just routes and lets the handler do the rest.

In the current implementation of the REST API, the general flow is the following: for a request, (i) check the storage (based on the request params); (ii) if there is a cache hit, return it; (iii) else, have the handler execute it; (iv) store the response before returning it. It honours two caching directives: no-cache (short-circuits (ii)) and no-store (short-circuits (iv)). We use the former for background updates (e.g. re-rendering HTML of pages when an included template changed) and the latter for delivering language variants (we store only the canonical variant and convert the others in-place).

Let's be careful not to confuse "HTTP caching" and "response caching". The former is ETags and Last-Modified dates and other HTTP headers that let things like Varnish work and let clients receive a 304. The latter is putting data into something like memcached.

Most of the discussion here has been around "HTTP caching", we haven't yet given much consideration to "response caching" which is what it sounds like you're talking about in the statement quoted above. At the moment there's nothing stopping a handler from doing all that itself at the start of its execute method. At some point including some sort of generalized caching support would likely be a good idea, but what form that might take is an open question and will likely be a separate RFC.

That said, [Last-Modified] should be based on when the data that is being delivered has been last modified, not when the code delivering it has changed.

Since it's entirely up to the handler, that's likely outside the scope of this RFC. Here we're just providing the mechanism for handlers to specify etags and last-modified dates and using them to implement the relevant HTTP-specified behavior.

I note @Krinkle is coming at it from experience with ResourceLoader, where the data being delivered is often derived entirely from static JS, CSS, and LESS files for which "when the data has been last modified" and "when the code was changed" are equivalent.

I agree that parameter validation should be made part of the framework itself rather than left to each handler, especially when we'll have the path parameter types defined in advance.

Note that at the moment we're not defining path parameter types at the routing level. /foo/{var}/bar will be routed first without consideration for any type restrictions on {var}. Only after routing has determined the Handler will those restrictions be checked.

For route conflicts, IMHO the best would be to be able to pick the obvious choice, if it exists. For example, given /foo/{var} and /foo/bar, the latter should be picked when the path is /foo/bar and the former should be used for everything else. Tim's suggestion of using prefix-based trees plays nicely into that. RESTBase currently does not have path conflict resolution; instead, the first route that matches is picked. This is both good and bad, as it allows us to list the higher-priority route first.

Currently we're just declaring /foo/{var} and /foo/bar as being a conflict. IMO this is probably more sane, as it prevents accidentally winding up with a situation where you actually do want {var} to be "bar".

We don't want to have a "first route that matches" rule, since when routes are coming in from extensions it would easily result in the choice depending on the extension load ordering which is unlikely to be very useful behavior.

I am not sure I understand the need for typed path parameters, though. Allowing for both /foo/{var:int} and /foo/{var:string} seems like trying to conflate multiple routes/resources into one, and as such is something that should be avoided and discouraged.

The point would be to allow /foo/{var:int} and /foo/bar to coexist without being flagged as a conflict, since "bar" is not a valid representation of an integer. At the moment though, we've decided not to include that feature.

This was perhaps discussed elsewhere (and I'm simply not aware of it), but the ticket and its discussion seem to be based on the (unspoken) assumption that modules/extensions have to be able to register their own REST API paths and end points independently, much in the way the Action API currently works.

That's because they do, unless you're wanting to force every site administrator to configure the router endpoints manually for every extension they install.

If I understand Marko directly, the idea is for extensions to hook into the handling of pre-defined routes, instead of defining their own.

I agree that the former is preferable (in the same way that extensions now can hook into the handling of API actions), but that the later will also be necessary, especially for extensions that introduce new concepts (resources). E.g. Wikibase should serve entity JSON via the pre-defined "content data" route, but there is no place for running SPARQL queries in the core API.

Since the endpoints aren't under central control (see above), we can't really make the guarantees that you're claiming for the global version number.

That still raises the question of versioning the core API. I think it would be good to do that, but agree that it's difficult to think of a way to embed the version in the path in a sensible way if every route can have its own version. We may have to resort to custom headers or something.

Joe added a subscriber: Joe.May 23 2019, 5:27 AM
tstarling renamed this task from Route Handler Interface RFC to REST route handler extension interface RFC.May 23 2019, 6:21 AM
tstarling updated the task description. (Show Details)

Task description update: I'm explicitly limiting the scope so that we can move towards approval.

Tgr added a comment.May 24 2019, 12:07 PM

I don't really think it is warranted. I planned on making PathTemplateMatcher into a separate library, but not the rest of it. But I'd be interested to know what others think.

A bit late to comment on this but IMO making the REST framework a standalone Composer module is a good idea in theory, and a bad idea to do now.

For one thing, I think there should be more discussion about librarization. There are two levels of it: extracting utility components which have wide applicability outside Wikimedia (and typically don't change much) which is mostly about good OS citizenship / developer engagement, and does not affect code quality much, and tearing the entire codebase into independent modules (a harsh way of enforcing decoupling, basically). There was a 2014 librarization RfC but it's not very clear which version of librarization it was about (it talks about un-monolithizing MediaWiki, but also about not breaking up core) and there was not much discussion of the cost ("componentization" makes life easier for contributors, but harder for maintainers who need to deal with the overhead of versioning, updating depending modules and so on). Also, some of its assumptions didn't really withstand the test of time - it talks about a service-oriented architecture, but it's not very clear whether we are still committed to that today; it talks about the ability to swap out components, but since then we have a dependency injection framework which already makes that possible (and wasn't really written with componentization in mind). So IMO there should be a discussion about the librarization of not-really-reusable components before we put much effort into it.

For another thing, the librarization project is IMO not very mature - a lot of the low-hanging fruit got done, but the fundamental questions were not touched. How does dependency injection work for a librarized component? How does it integrate with the hook system? The i18n system? Would it use /vendor (so version bump + double review for every patch that needs to be deployed within a reasonable timeframe)?

What I'm trying to say is that the MediaWiki REST framework is a Big Project, librarization (if it's to proceed beyond the current level of extracting simple utilities) is a Big Project, and it's generally a bad idea to make one of those depend on another one. Extracting the router later will not be any handler, so I'd plan for that instead.
(Also, more specifically to the current task, IMO the BagOStuff interface is a trainwreck, and librarizing it would just make it harder to fix.)

@Tgr I see your point, and I share your concern about the annoying procedures for a version bump.

Can we agree for now that the rest routing should live in libs OR a separate repo, and nothing under the Wikimedia/REST namespace (or whatever we may end up using) should depend on anything that is not itself a library (either a real composer component, or a pseudo-library in libs)? Can we also agree to enforce this via automated tests?

I'd really like to discuss the other points from your comment, but I feel like this is the wrong place. Perhaps we can discuss this on the talk page of https://www.mediawiki.org/wiki/Core_Platform_Team/Projects/Librarization?

Tgr added a comment.May 24 2019, 12:32 PM

Currently, in the REST API there two levels of versioning: a global one (denoted by the v1 prefix) and individual, end point-specific versioning (coming in the form of Accept headers).

As others have said, global versioning does not make much sense for a system where extensions can dynamically register themselves. (I'd also question its value, given the RESTBase version was never incremented.) It's probably good to have some kind of prefix that ensures we can have a completely different future API (say, GraphQL) in a different namespace, but that's all what's needed.

End-point versioning seems like a good idea, the question is, does it need framework support? The cache would have to be split on it, but that can be handled via Vary. Being able to register different handlers for different Accept versions might be a nice convenience over having to register a proxy handler, though.

In the current implementation of the REST API, the general flow is the following: for a request, (i) check the storage (based on the request params); (ii) if there is a cache hit, return it; (iii) else, have the handler execute it; (iv) store the response before returning it. It honours two caching directives: no-cache (short-circuits (ii)) and no-store (short-circuits (iv)). We use the former for background updates (e.g. re-rendering HTML of pages when an included template changed) and the latter for delivering language variants (we store only the canonical variant and convert the others in-place).

I guess there could be direct accessors to those directives on the request object, by the same logic by which we have direct methods for setting Last-Modified and ETag (to making correct caching as easy as possible for API module developers).

Tgr added a comment.EditedMay 24 2019, 12:35 PM

Can we agree for now that the rest routing should live in libs OR a separate repo, and nothing under the Wikimedia/REST namespace (or whatever we may end up using) should depend on anything that is not itself a library (either a real composer component, or a pseudo-library in libs)?

There is no harm in putting it in libs, and IMO we should be conscious about dependencies and avoid them were possible, but if using Message or Hooks seems to be the sane thing somewhere, I don't think we should go out of our way to avoid it just because those haven't been librarized yet.

Can we also agree to enforce this via automated tests?

I don't see how that would work, short of something very hacky like a custom autoloader.

I'd really like to discuss the other points from your comment, but I feel like this is the wrong place. Perhaps we can discuss this on the talk page of https://www.mediawiki.org/wiki/Core_Platform_Team/Projects/Librarization?

Sure; wasn't aware that it existed as a current project. I copied my comment there.

Can we agree for now that the rest routing should live in libs OR a separate repo, and nothing under the Wikimedia/REST namespace (or whatever we may end up using) should depend on anything that is not itself a library (either a real composer component, or a pseudo-library in libs)?

There is no harm in putting it in libs, and IMO we should be conscious about dependencies and avoid them were possible, but if using Message or Hooks seems to be the sane thing somewhere, I don't think we should go out of our way to avoid it just because those haven't been librarized yet.

The way to work around that is to define interfaces for these as needed by the rest router, and then inject a mediawiki-specific implementation. Or we just make them libraries. Or better, make libraries with better versions of them, with legacy implementations based on them.

TechCom is hosting a meeting on IRC in #wikimedia-office to discuss this IRC on Wednesday May 29 at 21:00 UTC/23:00 CEST/14:00 PDT

There are two levels of it: extracting utility components which have wide applicability outside Wikimedia (and typically don't change much) which is mostly about good OS citizenship / developer engagement, and does not affect code quality much, and tearing the entire codebase into independent modules (a harsh way of enforcing decoupling, basically). There was a 2014 librarization RfC but it's not very clear which version of librarization it was about (it talks about un-monolithizing MediaWiki, but also about not breaking up core)

The first was the intention, I think. Although I can't say people with the second intention didn't also support it.

Also, some of its assumptions didn't really withstand the test of time - it talks about a service-oriented architecture, but it's not very clear whether we are still committed to that today;

Personally, I'm not sure how committed "we" ever were to that.

How does dependency injection work for a librarized component?

The same as for a non-librarized component, wiring code inside MediaWiki passes services into the component's entry-class's constructor. Any dependencies injected would have to already have been similarly librarized, or it wouldn't make much sense. Thus, for example, the use of PSR-3 logging, and the fact that the config code I wrote for Parsoid-PHP has wrapper interfaces for MediaWiki to implement instead of Parsoid-PHP just using MediaWiki's Config, RevisionStore, and so on.

How does it integrate with the hook system?

Unless we librarize the hook system, it doesn't directly. A few places in Parsoid-PHP we have the MediaWiki implementation of a method call a hook.

I don't think the hook system would be all that hard to librarize, actually, other than the fact that a lot of code still uses the $wgHooks global and often for more than just registering for the hooks at startup. Where it starts to really get weird is when people also want to refactor it at the same time, like for example T154674 and T154675, or if people wanted to tie librarization with T212482.

The i18n system?

Same. I'm feeling like I may have to shave that yak at some point soon.

We currently have MessageSpecifier in includes/libs that in theory could be librarized, but the one time I tried to actually make a non-Message implementation (for unit testing) I quickly got pushback that it lacked too many necessary features (e.g. the different kinds of params).

Would it use /vendor (so version bump + double review for every patch that needs to be deployed within a reasonable timeframe)?

That has been how it has worked so far, yes.

What I'm trying to say is that the MediaWiki REST framework is a Big Project, librarization (if it's to proceed beyond the current level of extracting simple utilities) is a Big Project, and it's generally a bad idea to make one of those depend on another one. Extracting the router later will not be any handler, so I'd plan for that instead.

IMO making it a library from the start may not be the best plan, versus developing it with the library part in includes/libs/ and extracting it when it's done and proven.

Can we also agree to enforce this via automated tests?

I wouldn't want to make that a blocker on actual development. But if you make it reasonable to do, I wouldn't oppose enabling the CI tests.

With your recent post to wikitech-l about analyzing dependencies, I believe you could indeed make it happen.

Tarrow added a subscriber: Tarrow.May 24 2019, 5:04 PM

There will be a new entry point file, rest.php, which invokes the static method MediaWiki\Rest\EntryPoint::main().

Apologies if this was answered above, I have not read the entire thread.

How do you envision the urls relative to the root? For instance, will this be:

  1. https://en.wikipedia.org/rest.php/user/{userName}/hello (requires no changes to Apache's config)
  2. https://en.wikipedia.org/user/{userName}/hello (will not work unless you do this on index.php instead of rest.php or specifying the routes in Apache)
  3. https://en.wikipedia.org/rest/user/{userName}/hello (will require updating Apache's config to rewrite /rest to /rest.php)

or is this something we'll just allow the user to configure however they want? (excluding #2)

EvanProdromou added a comment.EditedMay 24 2019, 6:21 PM

Task description update: I'm explicitly limiting the scope so that we can move towards approval.

I don't understand the limitation. Won't REST API routes that are implemented in core use this same interface? Or is that we don't need an RFC for interfaces that are only used in core?

I think it's only the JSON definition that is specific to extensions.

How do you envision the urls relative to the root? For instance, will this be:

The path would be https://en.wikipedia.org/w/rest.php/user/{userName}/hello. Much like how one path for a page view is https://en.wikipedia.org/w/index.php/Main_page, and occasionally you'll see a MediaWiki wiki using that.

On Wikimedia sites we might configure rewrite rules to map that to something less odd-looking, like we do so the canonical URL for that page view is actually https://en.wikipedia.org/wiki/Main_page. Or we might not bother. I see a few things in the current patch that might not work right if we did that, but per T221177#5207250 those are probably out of scope here.

Tgr added a comment.May 25 2019, 9:57 PM

We currently have MessageSpecifier in includes/libs that in theory could be librarized, but the one time I tried to actually make a non-Message implementation (for unit testing) I quickly got pushback that it lacked too many necessary features (e.g. the different kinds of params).

Also fallback keys, forcing the language, probably some kind of abstraction for the useDatabase flag... maybe setting the title for {{FULLPAGENAME}} although that's only used in a handful of messages.

maybe setting the title for {{FULLPAGENAME}} although that's only used in a handful of messages.

I think that last one crosses the line from something that belongs in a generic interface to something that belongs in Mediawiki's implementation of the interface.

How do you envision the urls relative to the root? For instance, will this be:

  1. https://en.wikipedia.org/rest.php/user/{userName}/hello (requires no changes to Apache's config)

In the WIP patch, it is configured via $wgRestPath, which defaults to {$wgScriptPath}/rest.php. In WMF production, it would presumably be aliased to some other place.

tstarling updated the task description. (Show Details)May 29 2019, 8:45 PM

Task description edit: Handler::outerExecute() -> Router::executeHandler() per CR comment by @Tgr .

In the IRC meeting, people wrote:

<TimStarling> I think there was also a comment from anomie to the effect that typed wildcards could be done and maybe that they would be a good idea

Yes, for a simple set of types anyway. We could easily enough support integers specially during the routing, for example.

Also enumerations could be looked at as mostly syntactic sugar for multiple routes: if the enum has values "foo", "bar", and "baz", the only difference between "/path/{var:enum:foo|bar|baz}/stuff" and the three routes "/path/foo/stuff", "/path/bar/stuff", and "/path/baz/stuff" is that your handler gets a variable with the "foo"/"bar"/"baz".

<TimStarling> well, for that we would need a combined routing tree, just for 405s

There are a limited number of available methods and generating a 404 or 405 probably isn't a critical path. We might just run ->match() for every method that has at least one registered path instead of trying to make a separate 405 tree.

Otherwise a 405 tree would need somewhat different logic for add() where it currently just raises a conflict.

<TimStarling> I imagine there will be a method like Handler::getValidator()
<TimStarling> rather than having the whole validation framework in the Handler base class

For non-form-data POST bodies, that's my plan. We could potentially piggyback on that to bring back getParsedBody() functionality too.

For pathinfo parameters, $_GET, and $_POST, my plan is one validator that works based on "settings arrays" much like the Action API does. The REST version will probably differentiate based on source though, where for the Action API $_GET and $_POST are almost entirely interchangeable.

<duesen> "Do iteration/index routes require any special handling?"
<duesen> Paging is a recurring theme, and something that we are doing in multiple places in core, and have gotten wrong a few times.

At the level of this RFC, I don't think we need anything in code, just documentation as to the standard way of doing it. It sounds like everyone agrees that a client-opaque query string parameter with a consistent name ("continue"?) is the way to go.

As a completely separate task, a better abstraction to replace "Pager" that would be usable to represent the business logic parts from existing API query modules would be what we'd need. Then special pages, Action API modules, and REST API handlers could all wrap the same business logic classes to render paged queries appropriately.

<duesen> I'm dubious about that to be honest... it seems like putting presentation logic into the wrong place. but then, human readable error messages are rather convenient. and let's please not use numeric error codes. for anything, ever...

A standard method for clients communicating desired output language to handlers might be useful (uselang paramater? Accept-Language header?). As far as actual i18n here, error messages seem about the only thing that would need it.

As for machine-readable error codes, I like the Action API approach of using short ASCII strings generally matching /^[a-z][a-z0-9-]*$/.

<Nikerabbit> it seems handy to be able to request localised error messages, but then again the client knows better and can give better error message... one thing that the action api lacks is a list of possible errors that can happen for a given request

The Action API lacks that because often enough the possible errors come from MediaWiki business logic, and sometimes even extension hooks. It used to have a list but I removed it in rMWf0a6435f3b72: API: Remove action=paraminfo 'props' and 'errors' result properties because it was never kept up to date. I expect REST API handlers will find the same thing.

And for that same reason, I doubt the client really does know better about all errors. More likely there are a few specific errors it might want to handle specially (e.g. handle "badtoken" as "fetch a new CSRF token and retry"), and for the rest it would be happy to not have to duplicate MediaWiki's i18n to give something more useful than a generic "operation failed".

<tgr> I feel for something so basic and wide-ranging as changing the API paradigm, if no one shows up to the discussion, we must be doing something wrong
<duesen> tgr: well, the scope of this rfc is rather narrow. another rfc will be needed for the CRUD endpoint. that's much more visible to the outside
<duesen> we could have an RFC proposing to retire the action. I'm sure that wopuld draw attention :)

If you meant "proposing to retire the action API", let's not encourage the angry mob to break out the torches and pitchforks. As far as I'm aware there is in no way any plan to have the REST API replace the Action API, and many operations (e.g. page editing) will remain available in both.

As I've said before, the Action API is really good at returning specific information about many pages, but not very good at caching. A REST API will be good at returning all the information about one page with caching, but probably not very good at batch queries. Restbase has historically had an advantage for people who wanted to write their code as a service and/or in a language other than PHP.

For reference, here's the MeetBot record of yesterday's IRC meeting:
https://tools.wmflabs.org/meetbot/wikimedia-office/2019/wikimedia-office.2019-05-29-21.02.html.

Minutes:

https://phabricator.wikimedia.org/T221177 (duesen, 21:02:52)
https://gerrit.wikimedia.org/r/c/mediawiki/core/+/508972/12 (duesen, 21:08:00)
<TimStarling> there's no validation of POST bodies proposed for core (duesen, 21:32:41)
TimStarling> the extension interface includes the Handler base class, RequestInterface and ResponseInterface (duesen, 21:32:52)
<TimStarling> we are doing things similarly to PSR-7 because it was a nice conceptual starting point, they had some good ideas (duesen, 21:38:17)
TimStarling> we are not using PSR-7 because they also had some bad ideas, especially immutable responses (duesen, 21:38:33)
for paging, use a (standard) query parameter, and a (standard) field on the output. headers could also be used, but are more obswcure (duesen, 21:47:33)
TimStarling> ResponseFactory may need some i18n (duesen, 22:00:00)
<tgr> [de] we need to differentiate between content language and response language (duesen, 22:01:02)

Full log: https://tools.wmflabs.org/meetbot/wikimedia-office/2019/wikimedia-office.2019-05-29-21.02.log.html

<Nikerabbit> it seems handy to be able to request localised error messages, but then again the client knows better and can give better error message... one thing that the action api lacks is a list of possible errors that can happen for a given request

The Action API lacks that because often enough the possible errors come from MediaWiki business logic, and sometimes even extension hooks. It used to have a list but I removed it in rMWf0a6435f3b72: API: Remove action=paraminfo 'props' and 'errors' result properties because it was never kept up to date. I expect REST API handlers will find the same thing.
And for that same reason, I doubt the client really does know better about all errors. More likely there are a few specific errors it might want to handle specially (e.g. handle "badtoken" as "fetch a new CSRF token and retry"), and for the rest it would be happy to not have to duplicate MediaWiki's i18n to give something more useful than a generic "operation failed".

My experience with building an API for Wikibase confirms this.

<tgr> I feel for something so basic and wide-ranging as changing the API paradigm, if no one shows up to the discussion, we must be doing something wrong
<duesen> tgr: well, the scope of this rfc is rather narrow. another rfc will be needed for the CRUD endpoint. that's much more visible to the outside
<duesen> we could have an RFC proposing to retire the action. I'm sure that wopuld draw attention :)

If you meant "proposing to retire the action API", let's not encourage the angry mob to break out the torches and pitchforks. As far as I'm aware there is in no way any plan to have the REST API replace the Action API, and many operations (e.g. page editing) will remain available in both.

Yea - I was, indeedn joking.

As I've said before, the Action API is really good at returning specific information about many pages, but not very good at caching. A REST API will be good at returning all the information about one page with caching, but probably not very good at batch queries. Restbase has historically had an advantage for people who wanted to write their code as a service and/or in a language other than PHP.

I completely agree that REST is a good alternative only for some, use cases currently covered by the action API. Specifically, most query modules don't fit the REST paradigm very well.

This might be naive, but why are immutable Responses so bad? Couldn't we provide a mechanism to replace the response rather than mutating it?

This might be naive, but why are immutable Responses so bad? Couldn't we provide a mechanism to replace the response rather than mutating it?

That's what PSR-7's interface does: it has various methods that return a new ResponseInterface object similar to $this but with some bit of state different. The down side is that it means every change has to allocate a new object and copy the data from the old, only for the old one to typically be immediately discarded. When performance is a concern, all that cloning can add up.

dbarratt added a comment.EditedMay 30 2019, 5:56 PM

The down side is that it means every change has to allocate a new object and copy the data from the old, only for the old one to typically be immediately discarded. When performance is a concern, all that cloning can add up.

Have we benchmarked this?

Change 513532 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/vagrant@master] Add new MediaWiki REST API endpoint

https://gerrit.wikimedia.org/r/513532

There are a limited number of available methods and generating a 404 or 405 probably isn't a critical path. We might just run ->match() for every method that has at least one registered path instead of trying to make a separate 405 tree.

Good idea, done in https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/513545

It appears (from what I can tell) that PHP (as of version 7?) implements a copy-on-write strategy for cloning objects. If this is the case, then the performance concerns with implementing PSR-7 do not appear to be justified.

Anomie added a comment.Jun 3 2019, 3:10 PM

It appears (from what I can tell) that PHP (as of version 7?) implements a copy-on-write strategy for cloning objects.

Even if it did, the write would happen immediately after the clone in this use case. But let's test it quick with a bit of a toy example.

<?php

class Test {
    private $foo = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.';

    private $id = 0;

    public function setId( $newId ) {
        $this->id = $newId;
    }

    public function withId( $newId ) {
        $ret = clone $this;
        $ret->id = $newId;
        return $ret;
    }
}

$t0 = microtime( true );
$obj = new Test;
for ( $i = 0; $i < 1e7; $i++ ) {
    $obj->setId( $i );
}
$t1 = microtime( true );
echo "Setter: " . ( $t1 - $t0 ) . "\n";
-- Takes about 0.30s on my laptop in 7.2.11, and 0.25s in 7.3.4

$t0 = microtime( true );
$obj = new Test;
for ( $i = 0; $i < 1e7; $i++ ) {
    $obj = $obj->withId( $i );
}
$t1 = microtime( true );
echo "Cloning: " . ( $t1 - $t0 ) . "\n";
-- Takes about 0.83s on my laptop in 7.2.11, and 0.78s in 7.3.4

Feel free to run your own tests and post the results here.

So we're talking about a difference of (0.83 / 1e7) - (0.30 / 1e7) = ~0.000000053 seconds (if my math is correct). Seems like a micro-optimization at best.

To put this another way, you have to do this operation at least 18,867 times in order to see a difference of 1 millisecond.

Anomie added a comment.Jun 3 2019, 4:26 PM

OTOH, a real implementation would likely have more data in the object that would have to be copied.

OTOH, a real implementation would likely have more data in the object that would have to be copied.

You're right.

class Test {
	
	private $foo;
	
	private $id = 0;

	public function __construct() {
		$this->foo = random_bytes( 1e7 );
	}

	public function setId( $newId ) {
		$this->id = $newId;
	}

	public function withId( $newId ) {
		$ret = clone $this;
		$ret->id = $newId;
		return $ret;
	}
}

$t0 = microtime( true );
$obj = new Test;
for ( $i = 0; $i < 1e7; $i++ ) {
	$obj->setId( $i );
}
$t1 = microtime( true );
echo "Setter: " . ( $t1 - $t0 ) . "\n";

$t0 = microtime( true );
$obj = new Test;
for ( $i = 0; $i < 1e7; $i++ ) {
	$obj = $obj->withId( $i );
}
$t1 = microtime( true );
echo "Cloning: " . ( $t1 - $t0 ) . "\n";

Result:

Setter: 1.0590550899506
Cloning: 1.8972351551056

(1.8813560009003/1e7) - (1.0553591251373/1e7) = 0.0000000825996875763 seconds

Following negative comments on libraryization from @Anomie and @Tgr above, I'm not planning to split the implementation right now. It can stay in includes/Rest for the initial commit.

@tstarling have all of your concerns with PSR-7 been resolved? If so, can we move forward with a proper implementation?

daniel added a comment.Jun 4 2019, 2:03 PM

Following negative comments on libraryization from @Anomie and @Tgr above, I'm not planning to split the implementation right now. It can stay in includes/Rest for the initial commit.

I agree that it would be premature to put the rest router into a separate repo right now, given the overhead for maintaining mediawiki-vendor. However, I still advocate for putting the new code under /libs/, and have a very good look at which bits of core the current code depends on that would make this problematic. Using everything directly without any pesky extra abstractions is easier, but it's also what got us into cyclic dependency hell in the first place.

I can't find any strong opinions against isolation from core code above. Am I missing any compelling arguments?

TechCom is placing this on Last Call ending June 19 at 22:00 PDT/June 20th 05:00 UTC/07:00 CEST

tstarling updated the task description. (Show Details)Jun 11 2019, 1:12 AM

Task description edit reflecting the rename of getAttributes() to getPathParams() as suggested in Gerrit.

So are we going to implement PSR-7 properly? If so, can the task description be updated? If not, could it be explained why we wont be?

TK-999 added a subscriber: TK-999.Jun 13 2019, 9:49 PM

Is https://phabricator.wikimedia.org/T221177#5219503 considered valid for the final version of the RfC? i.e. will the base path for the REST routes be configurable / respect existing $wgScriptPath values?

Change 508972 merged by jenkins-bot:
[mediawiki/core@master] REST API initial commit

https://gerrit.wikimedia.org/r/508972

Tgr added a comment.Jun 13 2019, 10:26 PM

Is https://phabricator.wikimedia.org/T221177#5219503 considered valid for the final version of the RfC? i.e. will the base path for the REST routes be configurable / respect existing $wgScriptPath values?

Yeah, that part just got merged. (Can be changed if there's a good reason for it, but it matches how we handle all other MediaWiki entry points.)

So are we going to implement PSR-7 properly? If so, can the task description be updated? If not, could it be explained why we wont be?

No we are not going to implement PSR-7. The reasons are: because we don't have any reason to do so, and because we want to do things that are incompatible with the spec. PSR-7 is simple and lightweight, and provided a useful starting point for discussion, but is not suitable as a handler interface for MediaWiki, due to MediaWiki's scale and complexity.

Approved by TechCom after last call.

we don't have any reason to do so

Isn't interoperability enough of a reason? Or are we ok intentionally being incompatible with the existing PHP ecosystem?

because we want to do things that are incompatible with the spec.

Can you provide some examples? I may have missed it in the thread (apologies).

PSR-7 is simple and lightweight, and provided a useful starting point for discussion, but is not suitable as a handler interface for MediaWiki, due to MediaWiki's scale and complexity.

I'm not sure I understand this. On one hand you're saying it's "simple and lightweight" and therefore is performant and can be built on top of, but on the other hand saying it's also not feature-rich enough for our scale(?) and complexity. Those two things feel at odds with each other. Can you provide some examples for some ways in which it is not suitable?

WDoranWMF closed this task as Resolved.Wed, Jul 17, 12:15 AM

Change 513532 merged by jenkins-bot:
[mediawiki/vagrant@master] Add new MediaWiki REST API endpoint

https://gerrit.wikimedia.org/r/513532