Page MenuHomePhabricator

REST API i18n
Open, Needs TriagePublic

Description

Internationalisation has emerged as a dependency for the REST API. There are i18n-shaped holes in several merged and WIP patches. Since we're doing everything the "right way", with minimal coupling, this ideally means injecting a message localisation service into ResponseFactory.

There was some discussion in San Antonio about introducing a constructible value class similar to MessageSpecifier. A limitation of MessageSpecifier is that unlike Message, it does not have a concept of parameter types. Message has approximately 9 parameter types: raw, num, duration, expiry, period, size, bitrate, plaintext and list. A reduced set of parameter types would help to make the message formatter interface narrower. However, it could cause difficulties if we need to convert a Message object returned by MediaWiki internals into a REST API message.

We can also look at IApiMessage as a precedent. It has getApiData() and getApiCode(), which allows the action API to provide extended error information in JSON-formatted responses.

There is the question of what language to use, or whether to support having multiple languages in a single response. The action API has the errorlang parameter, which may be "content", "uselang" (apparently a misspelling of userlang) or an actual language code. Should the injected message formatter have a concept of user and content language? If we support multiple languages in a response, we won't be able to inject a message formatter which is initialised for a single language. There will either need to be a factory class, or the formatting functions will need to have a language parameter.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJun 26 2019, 3:11 AM

Some options:

  1. Refactor the existing Message class into a service and value object model. The service interface and value objects can theoretically be libraryized. Use the resulting classes directly from REST.
  2. Introduce a RestMessageFormatter and RestMessageValue. The concrete implementation of RestMessageFormatter will convert the RestMessageValue to a Message and then have it convert itself to a string. This is basically the same as 1. but private to the REST API and may be duplicative of eventual efforts to do 1.
  3. Just use Message directly from ResponseFactory. Eventually refactor this to use service injection after 1. is done.

I think these are all on the table.

daniel added a comment.Jul 2 2019, 8:44 PM
  1. Introduce a RestMessageFormatter and RestMessageValue. The concrete implementation of RestMessageFormatter will convert the RestMessageValue to a Message and then have it convert itself to a string. This is basically the same as 1. but private to the REST API and may be duplicative of eventual efforts to do 1.

I prefer this one, because it isolates the code from the i18n mechanism (unlike 3), and dosn't require a new i18n mechanism to be specified (like 1) which would have to try to cover all bases.

Looping a Message object returned from inside application logic through this mechanism would indeed be tricky. Providing a mechanism for that would defy the purpose of the isolation layer. If we need to do this, we should bind to Message directly. I very much hope though that we don't actually need that. What would be the use case?

I see two uses for an i18n system in the REST router: localized error messages that can be shown to the user directly, and self-documentation. I'll talk about error messages in a separate comment. For self-documentation, I wonder if we need markup. If plain text was sufficient, that would make the i18n mechanism a lot simpler. The output for self-documentation should probably be HTML, but could be generated from plain text messages.

daniel added a comment.Jul 2 2019, 8:54 PM

For error handling, the client needs an error code. For some such codes, it may want to do something smart, but it can't really know about all possible codes. In order to show something informative to the user, it needs access to a localized version of the error message, returned directly with the response, or separately via some kind of error code lookup API. Using a separate API call for this would remove the need to somehow convey the desired language for error messages with every request.

On the server side, there is two cases to be considered:

  1. An exception is raised directly by a request handler. That handler knows what the error code for the problem should be, and it is attached to the exception that is raised. A single RestHandlerException class would probably be sufficient. If the handler and error code are defined in an extension, the localization for that error code also needs to live in that extension. The localization key for the error should be the same as the error code (plus a known prefix or suffix, perhaps). The RestRouter needs to know where and how to look up the localization messages, that is, they have to be registered/loaded somehow.
  2. An exception is raised further down inside the application logic, which is not expected/handled by the handler, and bubbles up into the router. The router show have an ExceptionLocalizer component which knows about some kinds of exceptions (RestHandlerException of course, but also some other common kinds of exceptions), and has a reasonable fallback for unknown exceptions.

In both cases, an exception is mapped to an error code, and the error code is mapped to a message. In the first case, the mapping of the error code to the exception is done by the code that throws the exception, in the second case it is done by the code that catches the exception.

Tgr added a comment.Jul 8 2019, 11:48 AM
  1. Refactor the existing Message class into a service and value object model. The service interface and value objects can theoretically be libraryized. Use the resulting classes directly from REST.

In the long term IMO this is the only option that makes sense. (In the short term, using a temporary workaround to avoid blocking REST API development on it makes sense). It was discussed a lot in the past but we did not have a task about it yet, so filed as T227447: Librarize i18n-related PHP classes in MediaWiki.

A reduced set of parameter types would help to make the message formatter interface narrower.

raw and list make sense in a generic interface. num is just decoration: it can be provided as some kind of formatting hint, and message implementations ignoring it will still render reasonably. The same is somewhat true of plaintext, although the result will look more broken, but still decipherable. duration, expiry, timePeriod, size and bitrate are all variations on localizing quantities; they could be implemented by a common parameter type that includes a quantity, a unit and a hint. Implementations ignoring the hint and just displaying the raw quantity and unit would still be usable, if a bit awkward.

That said, I don't see much value in making the interface narrower. It's not like we expect a large number of different rendering implementations. And with the exception of raw and plaintext, these are all pretty straightforward preprocessing of the parameter, and the exposing the preprocessing logic (ie. number formatting, rendering duration, expiry etc) as a library is valuable on its own right.

We can also look at IApiMessage as a precedent. It has getApiData() and getApiCode(), which allows the action API to provide extended error information in JSON-formatted responses.

Conceptually IApiMessage doesn't really make sense IMO. It would be better to have an exception type that contains a message (or a list of messages, like StatusValue) plus an error code (which extensions contain anyway) and arbitrary extra data.

"uselang" (apparently a misspelling of userlang)

I always thought it meant "use this language" (cf. useskin and setlang).

Should the injected message formatter have a concept of user and content language? If we support multiple languages in a response, we won't be able to inject a message formatter which is initialised for a single language. There will either need to be a factory class, or the formatting functions will need to have a language parameter.

Message::inLanguage() could be part of the interface; if it's never called, the language will be chosen by the formatter (and there is no reason the formatter couldn't be aware of site configuration and request context).
That leaves the user / content language flag, ie. Message::inContentLanguage(). Is there a need of that? I'd expect it's rare for a library to be needing both, so it could be decided by the calling application.

For self-documentation, I wonder if we need markup.

Hyperlinks, definitely. Inline formatting can be useful (see e.g. the action API manual on formatting). Lists and paragraphs are helpful occasionally.

In both cases, an exception is mapped to an error code, and the error code is mapped to a message.

A message is a combination of a message key and some parameters, where the parameters might have formatting metadata attached, and might be messages themselves. If you try to avoid using some sort of message class, you'll just end up with an array representation of it, with all the internal options exposed. I don't think that's an improvement; we have been trying to move away from it in MediaWiki for a while now.

Anomie added a comment.Jul 8 2019, 9:42 PM

The action API has the errorlang parameter, which may be [...] "uselang" (apparently a misspelling of userlang)

No, it means "same as the effective value of the 'uselang' parameter", as documented.

MediaWiki in general deals with two languages: the wiki's content language and the user's UI language (uselang, or user preference, or the content language). My understanding is that the content language is supposed to be used for "content" that gets cached generally, while the UI language is for bits of the UI that are uncached or are cached per user. But over the years things have generally gotten a bit muddled here, with stuff like the int magic word putting UI into cached content and with the introduction of a "page language" and a "parser target language" alongside the content language rather than as a replacement for it.

The action API adds a third, errorlang, which is basically a subdivision of the "UI language" concept to differentiate "UI data fetched from the API" and "errors/warnings generated by the API itself". But for the most part the consideration of languages in the action API lies in the underlying business logic code rather than in the API code itself.

There are also special cases, particularly in the action API, where a language is specifically specified through a different mechanism. One we'll very likely want to keep is the use of English when generating server-side log messages, including for some Exception subclasses' calls to parent::__construct() for eventual return by Exception::getMessage().

We can also look at IApiMessage as a precedent. It has getApiData() and getApiCode(), which allows the action API to provide extended error information in JSON-formatted responses.

More generically, it's something of a hacky extension of Message to include machine-readable data in the same object as the i18n data.

There will either need to be a factory class, or the formatting functions will need to have a language parameter.

That's a good question, and one that will also be relevant if/when I get around to the MessageSpecifier thing.

  1. Refactor the existing Message class into a service and value object model. The service interface and value objects can theoretically be libraryized. Use the resulting classes directly from REST.

I like this as the longer-term solution. I also note that both of the other options refer back to this one as something of an unwritten assumption that this is the eventual way to go.

The question is whether we want to wait for this, and if not what to do in the mean time: throw together something that tries to be an abstraction (and probably have to refactor it later), or just use MediaWiki's existing Message infrastructure and plan to refactor it later.

For self-documentation, I wonder if we need markup. If plain text was sufficient, that would make the i18n mechanism a lot simpler. The output for self-documentation should probably be HTML, but could be generated from plain text messages.

The ability to use hyperlinks and semantic HTML markup (e.g. <var>) has been really useful in the time since the Action API was converted from plain text (MW 1.25).

Using a separate API call for this would remove the need to somehow convey the desired language for error messages with every request.

On the other hand, it would mean an extra round-trip every time an error occurs. We should probably run that by API users and/or see what other APIs with internationalized errors/warnings do.

  1. An exception is raised directly by a request handler. That handler knows what the error code for the problem should be, and it is attached to the exception that is raised. A single RestHandlerException class would probably be sufficient.

This is basically what the Action API does with ApiUsageException.

If the handler and error code are defined in an extension, the localization for that error code also needs to live in that extension. The localization key for the error should be the same as the error code (plus a known prefix or suffix, perhaps).

The Action API generally uses the "known prefix" model, that being "apierror-" or "apiwarn-". But if passed a key without one of those prefixes, it just uses the whole key as the code.

It also has the ability to specify the code and key separately for cases where multiple messages don't need different codes, so for example you can have one "integer out of range code" with messages like "Must be greater than X", "Must be less than Y", or "Must be between X and Y" depending on whether the range is partially unbounded or not.

The RestRouter needs to know where and how to look up the localization messages, that is, they have to be registered/loaded somehow.

That would most likely be best left to the i18n service rather than being something RestRouter needs to actually care about.

  1. An exception is raised further down inside the application logic, which is not expected/handled by the handler, and bubbles up into the router. The router show have an ExceptionLocalizer component which knows about some kinds of exceptions (RestHandlerException of course, but also some other common kinds of exceptions), and has a reasonable fallback for unknown exceptions.

The Action API doesn't handle any specific Exceptions or Errors beyond its own ApiUsageException, but when configured to provide details of unexpected exceptions it does recognize two interfaces (that in the future would be provided by the i18n library):

  • ILocalizedException interface → calls $ex->getMessageObject() to get an i18n Message.
  • MessageSpecifier interface → calls Message::newFromSpecifier( $ex ) to get an i18n Message.
  • anything else → just uses $ex->getMessage() as plain text.

In all three cases it then wraps it in an i18n message that just says "[$REQID] Exception caught: $MSG". When not configured to show exception details, it just says "[$REQID] Caught exception of type $TYPE".

Any other special handling of exceptions is done further up the stack by code that catches it and throws an ApiUsageException in its place.

The error code is always "internal_api_error" (or will be once gerrit:472228 is merged). By the time you get to the generic handler, there's not likely much you can know about any unexpected exception type you might receive to do something other than say "I got this" with a generic code, particularly if you're intending to not be coupled to the rest of the system.

P.S. The use of ILocalizedException versus MessageSpecifier is somewhat comparable to PHP's built-in IteratorAggregate versus Iterator interfaces.

The same is somewhat true of plaintext, although the result will look more broken, but still decipherable.

I'm not sure about that one. If it gets ignored then I think you're liable to have an XSS.

That said, I don't see much value in making the interface narrower. It's not like we expect a large number of different rendering implementations.

It seems to me that anything using the library is going to have to provide a rendering interface. At best we could provide a generic implementation where the "parse" step is a no-op or just handles a handful of "templates" like {{PLURAL}}.

And with the exception of raw and plaintext, these are all pretty straightforward preprocessing of the parameter, and the exposing the preprocessing logic (ie. number formatting, rendering duration, expiry etc) as a library is valuable on its own right.

At some point there you're pretty close to just reimplementing locales.

Conceptually IApiMessage doesn't really make sense IMO. It would be better to have an exception type that contains a message (or a list of messages, like StatusValue) plus an error code (which extensions contain anyway) and arbitrary extra data.

Not all business logic throws exceptions to indicate errors. That's basically the point of StatusValue, to indicate errors and warnings without throwing an exception. Also the Action API has a code for each error and warning produced, not just one code that tries to cover everything.

Thus IApiMessage exists to associate machine-readable code and data with the Message object in the StatusValue.

I always thought it meant "use this language" (cf. useskin and setlang).

The index.php parameter probably means that. The way it was implemented it worked for the Action API too in many contexts, so supporting it mostly just meant declaring the parameter so it wouldn't be unrecognized. Tim was talking about the value for the errorlang parameter though, not the uselang parameter itself.

Message::inLanguage() could be part of the interface; if it's never called, the language will be chosen by the formatter (and there is no reason the formatter couldn't be aware of site configuration and request context).
That leaves the user / content language flag, ie. Message::inContentLanguage(). Is there a need of that? I'd expect it's rare for a library to be needing both, so it could be decided by the calling application.

On the other hand, how often does the code creating the Message (versus the code rendering it) actually care? In code I've written, I can think of plenty of times where I chain ->inLanguage()->parse() or the like but not any where I call ->inLanguage() on something I'm returning up the stack.

But this calls for some actual investigation rather than just anecdotal memory.

Implementations ignoring the hint and just displaying the raw quantity and unit would still be usable, if a bit awkward.

That's the solution I would prefer.

That said, I don't see much value in making the interface narrower. It's not like we expect a large number of different rendering implementations.

Having multiple narrow interfaces that focus on the need of the respective caller removes the need to come up with a complex solution that covers all use cases.

Conceptually IApiMessage doesn't really make sense IMO. It would be better to have an exception type that contains a message (or a list of messages, like StatusValue) plus an error code (which extensions contain anyway) and arbitrary extra data.

I agree, but I'd like to point out again that we should try to be clear what kind of code should know about message keys and i18n parameters and such, and what code should not.

Ideally, there is a clear separation between application logic (which knows nothing about i18n or JSON or HTTP requests), and "interaction layer" (aka "presentation layer") code that acts as a glue between the application logic and the interaction protocol (be it REST+JSON or HTML forms or CLI parameters and stdout).

As long as application logic only throws "semantic" exceptions that don't involve i18n, which then get caught and converted by the interaction layer, I'm happy.

I note that this is pretty much the opposite of what the Status object does. Status makes i18n the responsibility of the application logic.

Should the injected message formatter have a concept of user and content language? If we support multiple languages in a response, we won't be able to inject a message formatter which is initialised for a single language. There will either need to be a factory class, or the formatting functions will need to have a language parameter.

In my mind, a single formatter targets a single language. If we need to target two languages, we need to inject two formatters.

Message::inLanguage() could be part of the interface

I really don't like that. The message object should not need to know about languages at all.

Hyperlinks, definitely. Inline formatting can be useful (see e.g. the action API manual on formatting). Lists and paragraphs are helpful occasionally.

Shouldn't lists and paragraphs be built by surrounding logic, from multiple messages?

Anyway, I really want a "wikitext light" spec for use in i18n, with a strict and easy to implement grammar and just enough functionality to cover our internationalization needs.

In both cases, an exception is mapped to an error code, and the error code is mapped to a message.

A message is a combination of a message key and some parameters, where the parameters might have formatting metadata attached, and might be messages themselves.

Yes, agreed. Though the parameters could be derived from fields in the exception when exceptions are mapped to messages. This doesn't have to be done by the code that throws the exception. Whether that would be desirable or not depends on where the exception is thrown. Exceptions that originate in the interaction layer can already contain full message specs. Exceptions originating from "pure" application logic can't - they would have to be caught and converted in the interaction layer, or mapped later.

There will either need to be a factory class, or the formatting functions will need to have a language parameter.

That's a good question, and one that will also be relevant if/when I get around to the MessageSpecifier thing.

I'd prefer the formatter to be bound to a specific target language. If we need a mix of languages in the output for some reason, we'd have to inject multiple formatters (or inject a factory).

  1. Refactor the existing Message class into a service and value object model. The service interface and value objects can theoretically be libraryized. Use the resulting classes directly from REST.

I like this as the longer-term solution. I also note that both of the other options refer back to this one as something of an unwritten assumption that this is the eventual way to go.

I'm a bit worried that we'll end up with something clunky that tried to hard to fit all the use cases. Perhaps multiple more specific and lightweight libraries would be better.

I also note that such a library would have to specify a markup, and that spec would ideally not be "any wikitext". Because that would introduce a logical dependency on MediaWiki core and make it effectively impossible to use the lib without MW core.

The question is whether we want to wait for this, and if not what to do in the mean time: throw together something that tries to be an abstraction (and probably have to refactor it later), or just use MediaWiki's existing Message infrastructure and plan to refactor it later.

My recommendation is to have a light weight abstraction specific to the use case at hand. That makes it easy to pivot later.

The ability to use hyperlinks and semantic HTML markup (e.g. <var>) has been really useful in the time since the Action API was converted from plain text (MW 1.25).

Ok. We'll need a spec for the markup we want to support.

The RestRouter needs to know where and how to look up the localization messages, that is, they have to be registered/loaded somehow.

That would most likely be best left to the i18n service rather than being something RestRouter needs to actually care about.

That assumes "the i18n service" is something that is defined outside of MW core. If we want to get rid of cyclic dependencies, then things used by core cannot depend on things defined in core.

The Action API doesn't handle any specific Exceptions or Errors beyond its own ApiUsageException, but when configured to provide details of unexpected exceptions it does recognize two interfaces (that in the future would be provided by the i18n library):

  • ILocalizedException interface → calls $ex->getMessageObject() to get an i18n Message.
  • MessageSpecifier interface → calls Message::newFromSpecifier( $ex ) to get an i18n Message.
  • anything else → just uses $ex->getMessage() as plain text.

In all three cases it then wraps it in an i18n message that just says "[$REQID] Exception caught: $MSG". When not configured to show exception details, it just says "[$REQID] Caught exception of type $TYPE".

The distinction between "user level" exceptions that always get shown in full detail, and "internal" exceptions that get obscured, is a good point. I forgot about the need to distinguish this.

The same is somewhat true of plaintext, although the result will look more broken, but still decipherable.

I'm not sure about that one. If it gets ignored then I think you're liable to have an XSS.

How can you have an XSS in plain text? Something that is plain text must of course never be used in HTML without escaping.

It seems to me that anything using the library is going to have to provide a rendering interface. At best we could provide a generic implementation where the "parse" step is a no-op or just handles a handful of "templates" like {{PLURAL}}.

There is limited support for message rendering in JS. We could use that as a model.

Not all business logic throws exceptions to indicate errors. That's basically the point of StatusValue, to indicate errors and warnings without throwing an exception. Also the Action API has a code for each error and warning produced, not just one code that tries to cover everything.

StatusValue is an artifact from the times when PHP didn't have exceptions. It's still useful for warnings I guess. But I find it problematic that it drags i18n concerns into the application logic.

On the other hand, how often does the code creating the Message (versus the code rendering it) actually care? In code I've written, I can think of plenty of times where I chain ->inLanguage()->parse() or the like but not any where I call ->inLanguage() on something I'm returning up the stack.

Indeed. Let's not perpetuate the mistakes of Message. The target language and the target format should be a concern of the formatter, not the language spec.

Anomie added a comment.Jul 9 2019, 2:47 PM

I'm a bit worried that we'll end up with something clunky that tried to hard to fit all the use cases. Perhaps multiple more specific and lightweight libraries would be better.

I'm wary of "multiple more specific libraries". We don't want to wind up like npm, where there's a library for testing if a number is even, which depends on one for testing if a number is odd, which depends on one for testing if something is a number...

I also note that such a library would have to specify a markup

All it really needs to specify is that placeholders are "$1", "$2", and so on. {{PLURAL}} and {{GENDER}} and maybe {{int}} would probably also be useful.

I'm wary of going much beyond that, as it runs the risk of conflicting with someone wanting to use something like markdown rather than wikitext for their app's i18n.

My recommendation is to have a light weight abstraction specific to the use case at hand. That makes it easy to pivot later.

It means you have one more layer of abstraction that may not map well to the underlying layer it's abstracting, that has to be maintained indefinitely or eventually refactored out.

That assumes "the i18n service" is something that is defined outside of MW core. If we want to get rid of cyclic dependencies, then things used by core cannot depend on things defined in core.

Look at the various PSR libraries: a thing used by core depends on an interface defined as a PSR library, but the implementation of the interface is defined in core (or in a third library unknown to the original thing).

The same is somewhat true of plaintext, although the result will look more broken, but still decipherable.

I'm not sure about that one. If it gets ignored then I think you're liable to have an XSS.

How can you have an XSS in plain text? Something that is plain text must of course never be used in HTML without escaping.

You seem to have missed the context. We're talking about the Message class's "plaintext" parameter type, not messages rendered to plain text.

$msg = new Message( 'key' ); // i18n is "Foo bar: $1"
$msg->params( Message::plaintextParam( "<b>baz</b>" ) );
$s = $msg->parse();

The content of $s should be "Foo bar: &lt;b&gt;baz&lt;/b&gt;", because the parameter was plaintext-type. If the renderer ignores that type and treats it as any generic parameter, it may generate "Foo bar: <b>baz</b>" instead and render the bold tags.

MediaWiki itself is pretty safe from actual XSS by this sort of thing because the wikitext parser is already designed to deal with untrusted input. Other parsers someone might want to use for rendering i18n may not be as safe.

StatusValue is an artifact from the times when PHP didn't have exceptions.

That is not correct.

Status dates back to 2008 (54e2b71bd1), as a generalization of "FileRepoStatus" from 2007 (ca76169b). MediaWiki has used exceptions since 2006 (410986a).

StatusValue was derived from Status in 2015 (c15caa6d) along with MessageSpecifier as part of an earlier attempt to avoid having code in libs depend on non-libs MW code. The StatusValue part of that split doesn't seem to be too bad, but as we've found the MessageSpecifier interface is not sufficient.

tstarling added a comment.EditedJul 12 2019, 5:22 AM

As @Anomie says, exceptions were introduced to MediaWiki (by me), and later, Status was introduced to MediaWiki (also by me). I asked on wikitech-l what exceptions should be used for, and the consensus was that they should generally be used only for unexpected errors, such as when the developer has made a mistake, not for cases where the user does something wrong. Nobody supported following Python's model of using exceptions for everything, and making try/catch a routine part of flow control. The convention thus established was that "ordinary" errors should be handled by returning a special value. Status was introduced in response to this. It's only very recently that Daniel has been trying to change this convention by adding try/catch everywhere.

In Lee's phase3, false was normally used as an error return, and errors leading to an error page immediately being displayed were handled with an early exit. The false return convention persists in many places. I converted the early exits to exceptions, with the exception object describing the error page. The idea of exceptions as error page descriptors fits in nicely with REST's HttpException.

But this is a long tangent.

It sounds like what we need for REST i18n is:

  • A MessageFormatterFactory injected into Router. Router will provide this to handlers.
  • MessageFormatterFactory->get($lang) returns a MessageFormatter
  • MessageFormatter has methods such as formatHtml() and formatPlain() which produce a string from a MessageValue
  • MessageValue is a constructible value object representing message keys and parameters. As in Message, parameters have types. The list type has subtypes, and the value of a list parameter is an array of further typed parameters. It thus makes sense to encapsulate parameters in a value object, say MessageParam.
  • A new subclass of HttpException will be introduced, say LocalizedHttpException, which accepts a MessageValue as its first constructor parameter instead of a string.

Router will decide what language it uses to format LocalizedHttpException for output. For now, that will be $wgLang, injected one way or another. Handlers that use the provided MessageFormatterFactory will also need to make their own decision about what language to use, for example based on services injected into the Handler's constructor.

In the interests of maintaining our focus on REST, I recommend deferring the following tasks:

  • Libraryization of MessageFormatterFactory and MessageValue.
  • Any interface which exposes message retrieval from storage and the format in which messages are stored. This includes specifying a reduced subset of wikitext for message parsing.
  • Reducing the set of parameter types.

From the point of view of a consumer of MessageFormatterFactory, {{PLURAL}} etc. is not exposed. All it knows is that it sends in message keys and parameters and gets back text. Whether the formatter does a good job of plurals is not specified. Where the localised text comes from is not specified.

Whether the formatter does a good job with all parameter types is also not exposed by the public interface. I think parameter types should be defined in a forwards-compatible way, so that it's not an error for a MessageParam to be constructed with an unknown type. The only truly special type is the list type.

I agree with the plan Tim outlined above.

The only thing I'd change is exposing the MessageFormatterFactory to the Handlers. Most handlers shouldn't need it, and those that do should pull it in like any other service they need. I see no reason to make it special.

Oh, and we have to make sure that LocalizedHttpException still has a machine readable error code. If that is identical to the message key, that should be stated explicitly.


Tangent re the use of exceptions:

It seems I was wrong about why the history of Status objects. Thanks to Anomie and Tim for clearing that up!

For the record, I don't think we should generally use exceptions to represent user errors, though I'm also not totally opposed to the idea. What I do think, and what does lead to some more try/catch around the code, is that the concept of internationalization messages should not be present in low level code. If storage layer code is asked to retrieve something that doesn't exist, it should generally throw an exception. Similarly, value objects should "defend" against invalid state, ensuring that any such object that exist conforms to the requirements of the specified use case (e.g. a TitleValue can only represent a valid title, never an invalid one). It's the responsibility of higher level code to either check first, or catch the exception, and construct a message that represents to problem to the user.

We should probably have a conversation about best practices for Exceptions somewhere...

I agree with the plan Tim outlined above.
The only thing I'd change is exposing the MessageFormatterFactory to the Handlers. Most handlers shouldn't need it, and those that do should pull it in like any other service they need. I see no reason to make it special.

Added strikethrough accordingly.

Something that came up during more detailed design: I think MessageFormatter actually just needs format(), not formatText() and formatHtml(). Message has parse() and escaped(), which conflate declaration of the storage format with a desire for HTML output. Both set HTML output, but parse() assumes wikitext storage and escaped() assumes plain text storage. Similarly, Message::plain() declares a storage format with no preprocessor tags. If we're not exposing the storage format, then they don't really make sense.

So I now propose that MessageFormatter::format() will correspond to Message::text(), which just does preprocessing. It would be up to the caller to interpret the result as either plain text or HTML.

Ideally, MediaWiki would know the format of its messages and would be able to convert them to whatever format is desired by the caller. But that doesn't seem like a reasonable dependency for this task. We could add that feature in the future, and then implement formatHtml() which would convert from any input format to HTML.

If a handler throws a LocalizedHttpException, what JSON should be provided? Should the message parameters be included in the response, along with their types? I think any localized text should have the language code attached, for example [ 'message-en' => 'File not found' ].

So I now propose that MessageFormatter::format() will correspond to Message::text(), which just does preprocessing. It would be up to the caller to interpret the result as either plain text or HTML.

That strikes me as rather dangerous. The caller has no way to know whether the result is safe to be used as HTML.
It seems like the only safe option would be to always treat the output of Message::text() as plain text.

This also touches on the question of parameter handling. Thinking about in in this context, "raw" parameters seem dangerous - they should be "html" parameters or "wikitext" parameters, etc. Similarly, the behavior of parameters representing quantities may change depending on the output format.

Ideally, MediaWiki would know the format of its messages and would be able to convert them to whatever format is desired by the caller. But that doesn't seem like a reasonable dependency for this task. We could add that feature in the future, and then implement formatHtml() which would convert from any input format to HTML.

I agree that this would be ideal, and I also agree that we can't block this task on on overhaul of MediaWiki's i18n system. But perhaps we can get the desired effect on top of the existing i18n. One way would be to simply require all messages used in the REST API to be in a specific format. The other is to encode the information about the format in the message key as file-extension like a suffix: e.g. rest-handle-my-path-key-error.txt would be known to be plain text, rest-handle-my-path-key-error.html would be known to be HTML, and rest-handle-my-path-key-error.wiki would be wikitext.

In summary, I think it's important that the caller of a format method can be sure whether they are getting back HTML or not, without having any knowledge about the message itself. This is even more critical if we pass these messages on to clients; they definitely need to know whether it is safe for them to treat the message they received from the server as HTML or not.

Change 523126 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/core@master] [WIP] MessageFormatter

https://gerrit.wikimedia.org/r/523126

The current situation is that any messages which are output as raw HTML need to be added to $wgRawHtmlMessages, a very short list at present. That's not what I assumed when @Anomie said that <var> etc. are essential in help text. It's not too bad for security, although a caller can shoot itself in the foot if it tries hard enough. There are actually four formats: in addition to plain text, HTML and wikitext, we also have filtered HTML, for example the api-help-title message is sent into OutputPage::setPageTitle() which calls Sanitizer::removeHTMLtags(). If all messages accessed via a MessageFormatter instance are the same source format, then they can all use the same sanitization, removing the need to expose different formatting methods. The source format can just be passed to the MessageFormatter constructor.

We've been discussing the intricacies of message input and output formats in more detail at T227447: Librarize i18n-related PHP classes in MediaWiki, we should probably have most of the discussion there instead.

To briefly address some of the discussion above: In general, we need the code doing the outputting to be able to know what it's getting back: output safe for HTML output (whether parsed or escaped) or output not safe for HTML output, which doesn't depend on the input format. That could be accomplished easily enough by multiple methods on a MessageFormatter, by a $options parameter to a single method, or by multiple MessageFormatters. The same goes for that "whether parsed or escaped" distinction. As far as I can tell the decision comes down to whether we want to have to inject one MessageFormatter with a slightly more complex interface or potentially have to inject multiple MessageFormatters (or the MessageFormatterFactory).

@Anomie said

In general, we need the code doing the outputting to be able to know what it's getting back: output safe for HTML output (whether parsed or escaped) or output not safe for HTML output, which doesn't depend on the input format

Yes, indeed. The Formatter has to guarantee what it is returning from the format() method. We could just say that it's always HTML and be done. Or we have two methods, one for plain text and one for HTML - in the same interface, or in two interfaces.

What I would recommend against (after having made this mistake in Wikibase) is to have two implementations of the same interface that return different formats, make different guarantees. The format() method itself should guarantee what it returns, this info should not be derived from the parameter name or position or documentation in the calling code.

We could just say that it's always HTML and be done.

That part wouldn't work, because some code paths need plain text. As for the rest, probably better to raise it on the other task.