Page MenuHomePhabricator

REST API Parameter Validation
Closed, ResolvedPublic1 Story Points

Details

Related Gerrit Patches:

Event Timeline

EvanProdromou removed tstarling as the assignee of this task.May 14 2019, 11:44 AM
EvanProdromou added a subscriber: tstarling.

Most important types to validate are in the routes to finish the Parsoid REST API, T221738. Looks like string and integer...?

Tgr added a comment.EditedMay 14 2019, 2:41 PM

Seems like https://www.mediawiki.org/wiki/Parsoid/API has more parameters than the OpenAPI spec (the one at https://en.wikipedia.org/api/rest_v1/, anyway). More endpoints, too. The OpenAPI spec has scalar parameters only, the mw.org one has some more complex ones (e.g. page bundles).

Brainstorming a bit, please give feedback.

Scope

I'd like to be able to use this parameter validation for the Action API too, to avoid having two different code bases that will probably behave subtly differently. But we'll see if the needs of both APIs are sufficiently well aligned that that actually makes sense.

Action API overview

The Action API's parameters all come in as $_GET or $_POST data, and with the exception of POST file uploads for action=upload are restricted to NFC-normalized UTF-8. If necessary that could be worked around, but we haven't found a need as of yet.

The Action API works on the basis of "parameter definitions", which can be thought of as associative arrays of settings. These settings include:

  • Parameter type.
    • For integer and limit types, minimum and maximum allowed values. Limit types can have two maximums, for users with and without the 'apihighlimits' user right.
      • Also, a flag for whether the integer min/max are advisory or should be enforced.
    • For multi-value enumerated-type parameters, whether "*" is a magic value meaning "all values of the enumeration".
    • For the 'namespace' type, extra "namespaces" to include (typically used for NS_SPECIAL and/or NS_MEDIA).
    • For string types, maximum length (in bytes or Unicode codepoints).
  • Default value.
  • Whether the parameter accepts multiple values, those being separated by U+007C (|) or U+001F at the client's discretion.
    • Whether the set of multiple values preserves duplicates.
    • Maximum number of values accepted, for users with and without the 'apihighlimits' user right.
  • Flags for requiredness, deprecation, sensitiveness (i.e. "don't log"),
  • A mechanism for rudimentary structured data, e.g. that setting slots=foo|bar means that parameters text-foo and text-bar should exist (and use the provided definition) while others such as text-baz should not.
  • Various settings for controlling auto-generated documentation.

Data types recognized by the Action API are:

  • boolean: Works like an HTML checkbox, any value (including 0 and the empty string) is considered "true" while only the absence of the parameter is "false".
  • integer: Nominally just /^[+-]?\d+$/, but in practice anything allowed by PHP (int)$value is currently accepted. I'd like to fix that.
  • limit: Like 'integer', but also accepts the string 'max' to mean the maximum allowed value for the current user.
  • namespace: A namespace number (only those known to MediaWiki). By default not including NS_SPECIAL or NS_MEDIA.
  • submodule: The name of an Action API submodule available via the module's ApiModuleManager.
  • tags: A string naming an existing, explicitly-defined change tag. Despite the plural name, it doesn't force the parameter to be multi-value.
  • timestamp: A timestamp in any format recognized by Timestamp, or the string 'now' to mean the current timestamp. The validator normalizes to TS_MW format.
  • user: A MediaWiki user name or IP, which the validator will normalize.
  • upload: A file uploaded via POST.
  • Several different string types:
    • string: A non-empty string where <input type="text"> would be the expected form widget. Not expected to contain newlines, but they aren't forbidden either.
    • text: A non-empty string where <textarea> would be a valid form widget.
    • password: Like 'string', but the recommended widget is <input type="password">. The "sensitive" flag is treated as always true for these.
    • NULL: Any string. Really this is the "no type specified" type.

Also of note is that the Action API "framework" makes it easy for modules to use CSRF tokens: the module need only implement a method needsToken() method to return the name of a token-type, and the "framework" adds a parameter with the proper definition and checks the token before executing the module. That's currently done outside the scope of the parameter validation, but could potentially be incorporated if we instead add a 'token' type.

Currently most of the validation is hard-coded, but last year I wrote I01a1247d as an attempt at refactoring it. I'd probably use that as a basis for this work.

REST API guesswork

REST API "parameters" will come in four types: path positional parameters, query string ($_GET) parameters, form-data POST parameters (including possibly file uploads), and POSTed data in other formats such as JSON. The first three can hopefully share definition and validation code with the Action API, while the fourth likely can't.

We'll also likely have to differentiate the first three somehow, where the action API generally doesn't care about the source.

We may need to add additional types for REST. In particular, we'll likey want a boolean type that accepts values such as "", "0", "f", "false", "off", and "no" as meaning false, "1", "t", "true", "on", and "yes" as meaning true, and rejects unrecognized values.

For "POSTed data in other formats such as JSON", if we do anything we'd likely want to define an generic "parse and validate" interface and have the Handler somehow specify which format(s) it wants to use. The main benefit would be code sharing, so e.g. every JSON-using handler doesn't have to reinvent parsing and validation. For the MVP implementation, though, we'd likely only make the interfaces and maybe a "null" validator.

Proposal

REST API Handlers would have a method to return parameter definitions similar to the parameter definitions returned by the Action API's ::getAllowedParams() method. Code similar to what I have in I01a1247d will be used to validate parameters against the definitions in both the Action API and the REST API; the Action API port can serve as a test bed since complex REST stuff isn't quite there yet. Since it seems to be what we're aiming for, I'll try to refactor that code as library part in includes/libs/ParamValidator that could be librarized once it's stable plus any MediaWiki-specific parts in includes/ParamValidator.

Other notes:

  • While the existing Action API parameter definitions include a bunch of help-generation i18n stuff, the ParamValidator itself won't touch that and will ignore it when present.
  • I hope to do error reporting from ParamValidator in a format that doesn't involve human-readable messages at all, but that MediaWiki code can easily munge into Message or Status objects. Compare css-sanitizer's error reporting, which reports codes plus relevant data and TemplateSandbox uses those codes as i18n message keys.
  • This would mean REST handlers could potentially use Action API style multi-value parameters. It would be up to a style guide to forbid it if we don't want that.
  • Possible yak to be shaved: T224197: ConvertibleTimestamp inconsistent timezone handling in accepted time strings. Ok, I just nerd-sniped myself into shaving that yak.

I like @Anomie's proposal a lot. Some minor points:

The Action API works on the basis of "parameter definitions", which can be thought of as associative arrays of settings. These settings include:

  • Parameter type.
    • For integer and limit types, minimum and maximum allowed values. Limit types can have two maximums, for users with and without the 'apihighlimits' user right.
      • Also, a flag for whether the integer min/max are advisory or should be enforced.
    • For multi-value enumerated-type parameters, whether "*" is a magic value meaning "all values of the enumeration".
    • For the 'namespace' type, extra "namespaces" to include (typically used for NS_SPECIAL and/or NS_MEDIA).

None of these will likely be used or usable in REST paths.

    • For string types, maximum length (in bytes or Unicode codepoints).
  • Default value.
  • Whether the parameter accepts multiple values, those being separated by U+007C (|) or U+001F at the client's discretion.
    • Whether the set of multiple values preserves duplicates.
    • Maximum number of values accepted, for users with and without the 'apihighlimits' user right.
  • Flags for requiredness, deprecation, sensitiveness (i.e. "don't log"),

Multi-value is probably also not needed/wanted for rest paths, nor do optional values, really (would be implemented as a different route, afaiu)

  • A mechanism for rudimentary structured data, e.g. that setting slots=foo|bar means that parameters text-foo and text-bar should exist (and use the provided definition) while others such as text-baz should not.

Could perhaps be applied to POSTed JSON.

  • Various settings for controlling auto-generated documentation.

Would use a different path prefix? Or just content negotiation?

Data types recognized by the Action API are:

The data types could be re-used in POSTed JSON.

  • boolean: Works like an HTML checkbox, any value (including 0 and the empty string) is considered "true" while only the absence of the parameter is "false".
  • limit: Like 'integer', but also accepts the string 'max' to mean the maximum allowed value for the current user.
  • namespace: A namespace number (only those known to MediaWiki). By default not including NS_SPECIAL or NS_MEDIA.

None of these is really usable in paths. Namespace names might be.

  • integer: Nominally just /^[+-]?\d+$/, but in practice anything allowed by PHP (int)$value is currently accepted. I'd like to fix that.
  • submodule: The name of an Action API submodule available via the module's ApiModuleManager.
  • tags: A string naming an existing, explicitly-defined change tag. Despite the plural name, it doesn't force the parameter to be multi-value.
  • timestamp: A timestamp in any format recognized by Timestamp, or the string 'now' to mean the current timestamp. The validator normalizes to TS_MW format.
  • user: A MediaWiki user name or IP, which the validator will normalize.

Usable in paths.

  • Several different string types:
    • string: A non-empty string where <input type="text"> would be the expected form widget. Not expected to contain newlines, but they aren't forbidden either.
    • text: A non-empty string where <textarea> would be a valid form widget.
    • password: Like 'string', but the recommended widget is <input type="password">. The "sensitive" flag is treated as always true for these.
    • NULL: Any string. Really this is the "no type specified" type.

All but string make no sense outside a POST (form or JSON, both).

I'm missing a "title" type, btw.

Also of note is that the Action API "framework" makes it easy for modules to use CSRF tokens: the module need only implement a method needsToken() method to return the name of a token-type, and the "framework" adds a parameter with the proper definition and checks the token before executing the module. That's currently done outside the scope of the parameter validation, but could potentially be incorporated if we instead add a 'token' type.

Having a generalized mechanims for this sounds good, but the concern seems rather different from parameter validation, both in terms of semantics and in terms of implementation.

REST API guesswork

REST API "parameters" will come in four types: path positional parameters, query string ($_GET) parameters, form-data POST parameters (including possibly file uploads), and POSTed data in other formats such as JSON. The first three can hopefully share definition and validation code with the Action API, while the fourth likely can't.

It seems to me like types could be re-used for JSON literals. It also seems like many types are not useful in positional (path) parameters.

We'll also likely have to differentiate the first three somehow, where the action API generally doesn't care about the source.

I think the declaration would need to specify whether it's for a path param, a GET (which also accepts POSTed form fields), a POSTed form field, or a POSTed JSON structure. For JSON, the parameter name would probably be a path, like "search/query/flags/case_sensitive" or something. Validation beyond that would need a proper schema.

We may need to add additional types for REST. In particular, we'll likey want a boolean type that accepts values such as "", "0", "f", "false", "off", and "no" as meaning false, "1", "t", "true", "on", and "yes" as meaning true, and rejects unrecognized values.

For use in the path, or in JSON? Booleans in the path seem bad. A two-value enum should probably be used instead.

For "POSTed data in other formats such as JSON", if we do anything we'd likely want to define an generic "parse and validate" interface and have the Handler somehow specify which format(s) it wants to use. The main benefit would be code sharing, so e.g. every JSON-using handler doesn't have to reinvent parsing and validation. For the MVP implementation, though, we'd likely only make the interfaces and maybe a "null" validator.

Null is probably fine for the MVP.

An alternative approach to up-front validation would be on-the-fly validation: wrap the parsed JSON in an object that has methods like getInt() and getString(), etc. Each takes a path (an array, or a string split on "/") and an optional default. If no default is given, an error is raised when the requested path is missing.

Proposal

REST API Handlers would have a method to return parameter definitions similar to the parameter definitions returned by the Action API's ::getAllowedParams() method. Code similar to what I have in I01a1247d will be used to validate parameters against the definitions in both the Action API and the REST API; the Action API port can serve as a test bed since complex REST stuff isn't quite there yet. Since it seems to be what we're aiming for, I'll try to refactor that code as library part in includes/libs/ParamValidator that could be librarized once it's stable plus any MediaWiki-specific parts in includes/ParamValidator.

Sounds sensible.

Other notes:

  • While the existing Action API parameter definitions include a bunch of help-generation i18n stuff, the ParamValidator itself won't touch that and will ignore it when present.

I'd be interested to know what the plan is for self-documentation of the REST router.

  • I hope to do error reporting from ParamValidator in a format that doesn't involve human-readable messages at all, but that MediaWiki code can easily munge into Message or Status objects. Compare css-sanitizer's error reporting, which reports codes plus relevant data and TemplateSandbox uses those codes as i18n message keys.

+1.

Tgr added a comment.May 24 2019, 4:18 PM
  • NFC and other normalization should IMO be limited to known content types. Path and query parameters should always be normalized, text/wikitext/HTML/XML/JSON likewise, POST/PUT bodies with a different type should be treated as binary data (but the handler should be able to indicate otherwise).
  • POST/PUT bodies will typically be JSON and require appropriate validation. Which presumably means JSON Schema (I don't think there's any mainstream alternative), the question is which version? OpenAPI 3 uses draft 5, justinrainbow/json-schema (by far the most popular PHP schema validator) seems to support draft 4 (which is pretty much the same), the Opis validator supports draft 7 (the current version), Swaggest supports both 5 and 7. At a glance the 4/6 differences (6 and 7 are mostly the same) are not terribly useful. So we should probably go for 4/5 for OpenAPI compatibility.
  • We'd like to be able to express the AI module spec as OpenAPI. OpenAPI's parameter schema (used for path, query, header and cookie parameters) normally specifies things by a combination of type and format, where type is string, number, integer or bool, and format can be int32, int64, float, double, byte (which means base64-encoded), binary (any string value), password (this is more of a security flag than a real format), date and date-time (RFC 3339 / ISO 8601). Custom formats are allowed but generic clients have no way to interpret them. There are a bunch of JSON Schema validators that can be added to this: minimum/maximum for numbers, minLength/maxLength, enum and pattern for strings, required. (Pattern is a Javascript regexp - that seems like a PITA to handle accurately.) Non-validation related properties are default, description (must be markdown/CommonMark), example and deprecated. (There are more validation and non-validation options but they are not really useful, and most are meant to be used for parameters in the body.)
    • "Mixed" types like limit seem hard to express in OpenAPI, although I guess they can always be stringly typed.
    • OpenAPI does not seem to support parameters in form-data, as opposed to handling them as properties on the body (which would be described with a single JSON schema). That seems like a reasonable
  • A bit off scope, but I think we should rethink CSRF tokens for the REST API. They are a nuisance for clients, non-JS clients don't need them because there's no way to maliciously trigger a request with a cookie, same-origin JS clients don't need them because the API can tell with an Origin check that they are reliable, so it would be nice to make them framework-level and more optional.
  • The action API user type is canonicalized. How would that look in a caching-friendly REST API? Would the framework / validator issue a redirect? (If we do that, I imagine we'd want similar support for titles and files as well.)
  • FWIW OpenAPI does recognize piped lists, although I can't see any reason to use them.
Tgr added a comment.May 24 2019, 4:48 PM

The Action API works on the basis of "parameter definitions", which can be thought of as associative arrays of settings. These settings include:

  • Parameter type.
    • For integer and limit types, minimum and maximum allowed values. Limit types can have two maximums, for users with and without the 'apihighlimits' user right.
      • Also, a flag for whether the integer min/max are advisory or should be enforced.
    • For multi-value enumerated-type parameters, whether "*" is a magic value meaning "all values of the enumeration".
    • For the 'namespace' type, extra "namespaces" to include (typically used for NS_SPECIAL and/or NS_MEDIA).

None of these will likely be used or usable in REST paths.
...
Multi-value is probably also not needed/wanted for rest paths, nor do optional values, really (would be implemented as a different route, afaiu)

How will the API expose query modules (which include limits and filters, which use all of the features above)?

  • One option is not at all - they already exist as action API modules, they wouldn't really benefit from caching, and they benefit from the action API capability to combine multiple modules in a request. I guess this depends on whether consider this project to be building a parallel API for endpoints which match REST semantics well and require caching, or as the start of a (no doubt very long) migration which replaces the action API with a different paradigm.
  • Another option is to force them to use POST requests, which breaks REST semantics but there isn't really any pragmatic reason not to. That makes it is to handle any kind of complex parameter schema - validation will still be needed, but it doesn't necessarily have to use the same mechanism as the path/query parameters.
  • Or we can use GET endpoints with query parameters for limit, filter etc, in which case the features above will definitely be needed.

The Action API works on the basis of "parameter definitions", which can be thought of as associative arrays of settings. These settings include:

  • Parameter type.
    • For integer and limit types, minimum and maximum allowed values. Limit types can have two maximums, for users with and without the 'apihighlimits' user right.
      • Also, a flag for whether the integer min/max are advisory or should be enforced.
    • For multi-value enumerated-type parameters, whether "*" is a magic value meaning "all values of the enumeration".
    • For the 'namespace' type, extra "namespaces" to include (typically used for NS_SPECIAL and/or NS_MEDIA).

None of these will likely be used or usable in REST paths.

Used? Possibly not. Usable? If someone writes a parameter definition for a path parameter that uses these features, they'll function.

Multi-value is probably also not needed/wanted for rest paths,

REST APIs sometimes have query-string parameters too, don't forget.

I haven't used many REST APIs, but I can think that one way to do some limited batch operations would be a path like /foo/bar/ID1|ID2|ID3|.../baz. I think I've even seen that sort of thing, although I can't remember where that might have been and they might have used a different delimiter.

nor do optional values, really (would be implemented as a different route, afaiu)

Sure, optional parameters can be implemented as multiple routes pointing to the same Handler, with some of the routes lacking one or more of the path parameters. We could put some syntactic sugar for that in the path-template syntax (currently we haven't), or the multiple routes can be individually registered. The Handler still needs to deal with being called via the routes where the parameter wasn't provided, and supplying "defaults" for the path parameters seems a straightforward way to do it.

  • A mechanism for rudimentary structured data, e.g. that setting slots=foo|bar means that parameters text-foo and text-bar should exist (and use the provided definition) while others such as text-baz should not.

Could perhaps be applied to POSTed JSON.

Yeah, this one probably wouldn't be used in a REST API when it could use posted JSON instead. Other than the hazard of developers doing something weird, there seems to be no harm in the validator having features that are never used by REST API handlers or Action API modules but are used by the other API.

  • Various settings for controlling auto-generated documentation.

Would use a different path prefix? Or just content negotiation?

I think you're confused here. This doesn't have anything to do with how that documentation is made available to users. These settings are like

  • Use this i18n message to describe the parameter, instead of the default one derived from the module and param names.
  • Append this i18n message after that first one. Intended for conditional notes or standardized notes.
  • For this enumerated-value parameter, generate a <dl> that documents each value using a separate i18n message for each value. And optionally use specified i18n messages for certain values instead of deriving one from the module name, param name, and value.

And as I mentioned elsewhere, all these are outside the scope of this task anyway. The validator will ignore these settings, it would be the help generator that uses them.

Data types recognized by the Action API are:

The data types could be re-used in POSTed JSON.

As I mentioned, I don't think we should use parameter definitions for validation of posted JSON at all. We should pull in something like https://packagist.org/packages/justinrainbow/json-schema and use that to validate the JSON against a schema.

I'm missing a "title" type, btw.

Yeah, the Action API doesn't currently have that. Part of abstracting the validation will involve making it easier to add new types.

Having a generalized mechanims for [CSRF tokens] sounds good, but the concern seems rather different from parameter validation, both in terms of semantics and in terms of implementation.

I agree. You could kind of fit it into validation if you look at it like a more complicated version of /^[+-]?\d+$/ or checking the value against an enumeration, but just because you can doesn't mean it's a great idea. It would probably be better for the REST framework to inject a $_GET parameter upon a similar signal from the handler that CSRF is needed.

REST API "parameters" will come in four types: path positional parameters, query string ($_GET) parameters, form-data POST parameters (including possibly file uploads), and POSTed data in other formats such as JSON. The first three can hopefully share definition and validation code with the Action API, while the fourth likely can't.

It seems to me like types could be re-used for JSON literals. It also seems like many types are not useful in positional (path) parameters.

If types aren't useful in path parameters, then people can just not use them in path parameters.

As for JSON, rather than halfway doing it with a more complicated version of parameter definitions designed for "scalar|scalar[]", why not just use a solution designed from the start for hierarchical data structures?

We'll also likely have to differentiate the first three somehow, where the action API generally doesn't care about the source.

I think the declaration would need to specify whether it's for a path param, a GET (which also accepts POSTed form fields), a POSTed form field, or a POSTed JSON structure. For JSON, the parameter name would probably be a path, like "search/query/flags/case_sensitive" or something. Validation beyond that would need a proper schema.

Yes, the first part is the same thing I said.

But again I don't think the bit about posted JSON should be done in this way though, it's too different and trying to shoehorn it into the same mechanism seems too likely to overcomplicate the mechanism and still be inadequate for JSON. My plan at the moment is to punt on validating posted JSON entirely, with a @todo pointing out that JSON schemas would be the way to go.

We may need to add additional types for REST. In particular, we'll likey want a boolean type that accepts values such as "", "0", "f", "false", "off", and "no" as meaning false, "1", "t", "true", "on", and "yes" as meaning true, and rejects unrecognized values.

For use in the path, or in JSON? Booleans in the path seem bad. A two-value enum should probably be used instead.

For use in $_GET and $_POST. Boolean-typed path parameters seem unlikely (but I won't go out of my way to prevent someone from defining one), and as I said I think JSON should be validated with JSON schemas.

An alternative approach to up-front validation would be on-the-fly validation: wrap the parsed JSON in an object that has methods like getInt() and getString(), etc. Each takes a path (an array, or a string split on "/") and an optional default. If no default is given, an error is raised when the requested path is missing.

*shrug* Could work. If someone wants to use that, they should write it.

I'd be interested to know what the plan is for self-documentation of the REST router.

No idea. That'll be a different task.

  • NFC and other normalization should IMO be limited to known content types. Path and query parameters should always be normalized, text/wikitext/HTML/XML/JSON likewise, POST/PUT bodies with a different type should be treated as binary data (but the handler should be able to indicate otherwise).

POSTs as application/x-www-form-urlencoded and the parts of multipart/form-data that wind up in $_POST should probably be normalized too. Validation of other content types, and of other actions like PUT, seems outside the scope of this task beyond possibly providing an interface for validators of other types to implement and a way for handlers to specify the validator to use.

  • POST/PUT bodies will typically be JSON and require appropriate validation. Which presumably means JSON Schema (I don't think there's any mainstream alternative), the question is which version? OpenAPI 3 uses draft 5, justinrainbow/json-schema (by far the most popular PHP schema validator) seems to support draft 4 (which is pretty much the same), the Opis validator supports draft 7 (the current version), Swaggest supports both 5 and 7. At a glance the 4/6 differences (6 and 7 are mostly the same) are not terribly useful. So we should probably go for 4/5 for OpenAPI compatibility.

I plan to not care about any of that for this task. For this task I'll make an interface for non-form-data content validation, and someone else can write an implementation of that interface for whichever version of JSON schemas they want to use. Different handlers can even use different implementations if they want. As for which to write and bundle in the REST library or in MediaWiki core, I'll leave that decision to someone else.

  • We'd like to be able to express the AI module spec as OpenAPI.

I don't know what an "AI module spec" is, unless AI is a typo for API.

I'm skeptical of several of the requirements you went on to list. We're unlikely to care about width of integers, while min and max are already included. I doubt making "byte" a first-class type (rather than validating as "string" and letting the Handler base64-decode it) would be terribly useful at this level, but will be easy to add if we find otherwise. MediaWiki's idea of timestamps has historically been more permissive while at the same time not supporting all of ISO 8601 (such as "2019-W21-5T17:07:24Z" or "2019-143T17:07:24Z" or "2019-05"; I'm adding support for non-Z timezones in gerrit:512201), and I'm not sure there's much benefit in changing that. So far the Action API doesn't seem to have had much need for dates rather than timestamps either.

And you're extremely unlike to get parameter documentation in markdown/CommonMark format rather than wikitext out of MediaWiki, while for this task itself parameter documentation is out of scope.

  • A bit off scope, but I think we should rethink CSRF tokens for the REST API. They are a nuisance for clients, non-JS clients don't need them because there's no way to maliciously trigger a request with a cookie, same-origin JS clients don't need them because the API can tell with an Origin check that they are reliable, so it would be nice to make them framework-level and more optional.

I think Security-Team would disagree with you on this, but feel free to ask them. CSRF tokens aren't intended for non-browser clients, and they're not intended for same-origin JS clients. They're intended to defend against malicious webpages that embed an auto-posting form in an iframe or the like.

On the other hand, CSRF tokens are probably outside the scope of this task specific task.

  • The action API user type is canonicalized. How would that look in a caching-friendly REST API? Would the framework / validator issue a redirect? (If we do that, I imagine we'd want similar support for titles and files as well.)

That's a good question, and also outside the scope of this task. You could ask the same about integer representations with leading 0s, non-NFC strings where those get normalized, and so on. When you file the task for that question, possibly more interesting is that the possibility of having $_GET parameters is also part of the REST API and for caching you probably want them in a canonical order. That seems to never have been resolved in T66214, BTW.

How will the API expose query modules (which include limits and filters, which use all of the features above)?

[...]

  • One option is not at all - they already exist as action API modules, they wouldn't really benefit from caching, and they benefit from the action API capability to combine multiple modules in a request.

This, IMO. I've said many times before that the Action API action=query is good at batch fetching specific pieces of data about many different pages, while REST APIs are generally better at fetching all the data about one page at a time with heavy caching.

We could easily enough make another Action API endpoint (action=pageinfo?) for fetching all the data about one page, although the usual REST API's use of positional parameters in pathinfo makes the heavy caching easier to do there. We could also make a REST-seeming API for batch fetching data about many pages, although actually doing that well seems to be getting away from REST's ideas of "resources" and into the space occupied by things like GraphQL instead.

I guess this depends on whether consider this project to be building a parallel API for endpoints which match REST semantics well and require caching, or as the start of a (no doubt very long) migration which replaces the action API with a different paradigm.

The former, IMO. I think it likely that we'll have many of the same operations (e.g. edit a page) available via both, and if that's done with proper sharing of the business-logic code and the Action API modules and REST API handlers are thin wrappers that just handle converting request parameters into calls and returned data into responses I don't think that would be a bad thing as it reduces the need for clients to have to talk to two different APIs at one site.

Tgr added a comment.May 26 2019, 5:37 PM

Validation of other content types, and of other actions like PUT, seems outside the scope of this task beyond possibly providing an interface for validators of other types to implement and a way for handlers to specify the validator to use.

Filed as T224375: REST API Developer declares JSON validation parameters.

I don't know what an "AI module spec" is, unless AI is a typo for API.

A typo, yes.

I'm skeptical of several of the requirements you went on to list.

They aren't really requirements - we want to be able to translate a MediaWiki REST API spec into OpenAPI 3 but the other direction is not meaningful, so there is no need to support something just because OpenAPI supports it. It's just a list of things we can easily express in OpenAPI.

We're unlikely to care about width of integers

Width is for strings, obviously. IIRC that's supported in the action API.

I doubt making "byte" a first-class type (rather than validating as "string" and letting the Handler base64-decode it) would be terribly useful at this level, but will be easy to add if we find otherwise.

For URL parameters, JSON bodies, x-www-form-urlencoded and form-data the encoding is part of the protocol, so there is no point in handling it on the type level. I think that type mainly exists because OpenAPI allows specifying header and cookie parameters (something we should IMO consider), and those are underspecified and and it's up to the handler to specify the encoding. So if we want to have cookie/header parameters (and header we definitely should, I'm not sure about cookie) then having such a type seems like a good idea. Of course we could just make the handler specify it as string and take care of the rest manually, but you could say the same about integer etc. - providing this kind of functionality for the handlers is exactly what one would expect a validation helper to do.

MediaWiki's idea of timestamps has historically been more permissive while at the same time not supporting all of ISO 8601 (such as "2019-W21-5T17:07:24Z" or "2019-143T17:07:24Z" or "2019-05"; I'm adding support for non-Z timezones in gerrit:512201), and I'm not sure there's much benefit in changing that. So far the Action API doesn't seem to have had much need for dates rather than timestamps either.

As long as it accepts ISO 8601 dates, we can put that in the generated OpenAPI spec. I don't think it's a problem if that spec is more restrictive than the actual API. OTOH not accepting formats that the OpenAPI datetime definition allows would be a problem.

And you're extremely unlike to get parameter documentation in markdown/CommonMark format rather than wikitext out of MediaWiki, while for this task itself parameter documentation is out of scope.

Converting wikitext to markdown on a best effort basis does not seem super hard, but yeah, different task.

On the other hand, CSRF tokens are probably outside the scope of this task specific task.

Fair. A somewhat more specific version of the same issue is already filed at T126257: The API should not require CSRF tokens for an OAuth request, maybe we can revisit that once we have decided how authentication should work for the API.

[canonization is] a good question, and also outside the scope of this task. You could ask the same about integer representations with leading 0s, non-NFC strings where those get normalized, and so on. When you file the task for that question, possibly more interesting is that the possibility of having $_GET parameters is also part of the REST API and for caching you probably want them in a canonical order. That seems to never have been resolved in T66214, BTW.

Filed T224365: REST API URL canonization about that.

We're unlikely to care about width of integers

Width is for strings, obviously. IIRC that's supported in the action API.

Your summary of OpenAPI stuff included "int32" and "int64", that's what I was referring to. And also separate "float" and "double".

Change 515143 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] API: Abstract out parameter validation

https://gerrit.wikimedia.org/r/515143

Hi all!
I haven't read all of this discussion so ignore me if this has already been considered.

I'd appreciate if all URI parameters were required to be lower cased, and if passed in from client with upper case chars, they be lower cased before they are handed to the request handler. Parameter names often make their way into case insensitive (e.g. SQL) systems, and dealing with different casing there can be a pain. I'm not sure if you are also planning on validating POST bodies, but if you do, I'd appreciate the same lower case restrition there.

This would likely mean preferring snake to camel case for parameter names, e.g. page_title instead of pageTitle.

Thank you!

Parameter names often make their way into case insensitive (e.g. SQL) systems, and dealing with different casing there can be a pain.

This does not seem like a reasonable justification for the proposed restriction. While SQL field names are (in some cases) case-insensitive, there is no requirement or reasonable expectation that SQL field names would directly match query string or post parameter names. SQL field values typically are case sensitive.

SQL field values typically are case sensitive.

SQL field values are case sensitive, I'm not asking for a requirement for parameter values, just for their names.

there is no requirement or reasonable expectation that SQL field names would directly match query string or post parameter names

Indeed, but often they do. People designing schemas for either the Mediawiki MySQL database, or in my case for event schemas (which do get persisted to case insenistive analytics SQL databases) then have to make translation decisions between camel case style parameters to something else. Imagine a terrible parameter name like wikiUserTextNameOrIP. If we wanted to save REST API request events to a database, we have to decide what to do with this parameter. If we just do a string to lower, we get wikiusertextnameorip field. Likely we'll snake case it to wiki_user_text_name_or_ip. It'd be nice if the parameter was just called that in the first place.

My example is contrived, but this kind of thing happens fairly often.

Tgr added a comment.Jun 11 2019, 2:07 PM

RESTBase uses lowercase underscore notation, and the action API is also lowercase (except it smashes words together without any delimiter, which is a bit ugly) so I imagine we'll go with that as a naming guideline, whether or not it gets enforced by the software (I'd rather not).

If I were saving generic requests to a database, I'd be more likely to have a table with fields "param_name" and "param_value" rather than trying to make every possible parameter name a separate field. Or serialize all the parameters to a single text field. If I were designing storage for a specific handler, I'd be more likely to be using a getWikiUserTextNameOrIP() method so the code doing the DB insertion didn't have to know or care how exactly that value was determined.

Tgr added a comment.EditedJun 11 2019, 2:09 PM

That also goes for "flat" POST bodies (form-data etc). For JSON bodies, see T224375: REST API Developer declares JSON validation parameters - I imagine there won't be much point in having a convention there as part of the JSON schema will usually come from some domain logic.

Ottomata added a comment.EditedJun 11 2019, 3:00 PM

If I were saving generic requests to a database, I'd be more likely to have a table with fields "param_name" and "param_value" rather than trying to make every possible parameter name a separate field.

It depends on the data model. Sometimes having key,value pairs works, this is a Map type, and we can now support that in JSONSchema and in Hive. But Maps have strongly typed keys (always strings in our case) and values. E.g. a params Map field in a database table can't have both string and integer values.

If I were designing storage for a specific handler, I'd be more likely to be using a getWikiUserTextNameOrIP() method so the code doing the DB insertion didn't have to know or care how exactly that value was determined.

Indeed, I agree that there shouldn't be a requirement that the storage field names and the code variables/methods should match. I just want to reduce confusion for developers and users of the data. A developer that wants statistics on API usage will find it much less confusing if the API parameters matched the fields used in an analytics context.

RESTBase uses lowercase underscore notation, and the action API is also lowercase (except it smashes words together without any delimiter, which is a bit ugly) so I imagine we'll go with that as a naming guideline, whether or not it gets enforced by the software (I'd rather not).

A naming guideline/convention might be good enough. ...What is 'that' in this case? The lowercase underscore notation or the smashed together words? :)

Tgr added a comment.Jun 13 2019, 8:46 AM

What is 'that' in this case? The lowercase underscore notation or the smashed together words? :)

Underscore notation.

Change 517076 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] rest: Use ParamValidator library, add BodyValidator

https://gerrit.wikimedia.org/r/517076

Change 515143 merged by jenkins-bot:
[mediawiki/core@master] API: Abstract out parameter validation

https://gerrit.wikimedia.org/r/515143

Anomie set the point value for this task to 1.Aug 21 2019, 6:04 PM

Story point number is pretty arbitrary. It doesn't represent time already spent, of course. I have no idea whether review will really take 1 day or many, both from a "time waiting" perspective and whether the review calls for major changes or is accepted pretty much as-is.

Change 517076 merged by jenkins-bot:
[mediawiki/core@master] rest: Use ParamValidator library, add BodyValidator

https://gerrit.wikimedia.org/r/517076

Anomie closed this task as Resolved.Sep 6 2019, 4:52 PM

I think this task is done now. Implementing different BodyValidators and the OpenAPI stuff should be tracked as their own tasks.