Page MenuHomePhabricator

Librarize i18n-related PHP classes in MediaWiki
Open, Needs TriagePublic

Description

Librarizing the cross-cutting aspects of 18n logic is a prerequisite for librarizing many other components. There have been lots of fragmented discussions about this so filing this task to have a more central place to track it.
(This is about the PHP side only; see T143463 for the Javascript i18n code.)

On the PHP side, the ubiquitous i18n class is Message, which is a bit of a kitchen sink, tightly coupled to database and cache logic; it needs an interface and a minimal implementation that just carries keys and parameters around. MessageSpecifier was an attempt at that, but there seems to be consensus that it's missing a number of necessary features:

  • parameter formatting - raw, escaped, number, duration etc. (we might want to consider dropping the XXXParams methods first to simplify things, see T120649#2745260)
  • specifying a set of fallback message keys in the same object
  • specifying the language
  • probably some kind of abstraction for Message::$useDatabase such as cusutomized/default

If message loading/parsing/etc is stripped from the value-object successor of Message, there will be need for some kind of MessageRenderer service interface for taking a message and turning it into a string (or maybe a Message since it's unlikely we can get rid of __toString-based auto-rendering in MediaWiki anytime soon).

The other relevant MediaWiki class is MessageLocalizer, which is basically a Message factory for injecting some of the above settings (language etc) based on global state. Not sure if that would need to be separate from the renderer service.

We might also want to provide a less-minimal implementation that allows using MediaWiki-style i18n features ({{PLURAL}}, {{GENDER}} etc) in non-MediaWiki code (both to provide a nice user experience for our libraries when they are used outside of MediaWiki, and for standalone apps such as PHP-based Toolforge tools).

Prior art:

Event Timeline

Another approach (instead of a renderer service) would be to inject a MessageLocalizer-ish service into all libraries, and require all message objects to be created by that service, and the interface version of Message would specify the output methods (text(), parse() etc) instead of the input ones. But that would mean keeping the current mess of Message rendering itself (or using a service callback, which is still not nice and makes serialization more problematic). So probably not a good idea.

Possibly relevant (although maybe not something to attempt at the same time given the complexity) is that currently Message is something of a security nightmare: the code turning it into a string (and choosing whether to escape) doesn't know what format the message is in; the repository holding the message (ie. Translatewiki) doesn't know how it's going to be escaped. $wgRawHtmlMessages was an attempt to specify that, but it's a band-aid at best (and doesn't actually prevent using an undeclared message as raw HTML). Fixing that will probably involve some changes to ow Message is called, e.g. unbundling formatting (plain text, block etc) and escaping (raw HTML or not). See also @Simetrical's comments in T21291#4468578.

I don't want to lick this cookie just yet, but I do hope to try to work on this sometime soon. Or at least parts of it, I doubt I'll bother with the "less-minimal implementation" bit.

My current plan is

  • Go the MessageRenderer route, so the MessageSpecifier value object can be sanely newable and mutable rather than requiring everything everywhere that wants to return a message to have some factory injected.
    • I note Message has three basic "parsing" modes: plain() doesn't do any parsing, text() does template expansion but returns wikitext, and parse() fully parses wikitext→HTML. There's also escaped() which does template expansion then passes the wikitext through htmlspecialchars(), and parseAsBlock() which skips the step of removing the wrapping <p> tag the parser probably generated. Open question: Should MessageRenderer include all these, or should it just have "render()"? In the latter case, should MediaWiki's implementation have separate MessageRenderers for each transformation or use a flag (see the next top-level bullet).
    • Open question: Pass the target language to MessageRenderer's methods, or only to its constructor (and have a MessageRendererFactory to get one for a particular language)?
  • I'm leaning towards pushing all the flags like "inLanguage" and "useDatabase" into the MessageRenderer's rendering method. Although I might have to change that plan if it turns out my intuition is wrong when I think that the flags are mainly set by the code rendering the Message rather than the code creating it.
  • Bonus: I want to make MessageSpecifier implement JsonSerializable (and a static constructor to reverse the process too).
  • I want MediaWiki's Message class to become a (deprecated) wrapper around MessageRenderer(Factory) + MessageSpecifier, much like how Revision wrapped RevisionStore+RevisionRecord.
  • I may fold the concept of machine-readable codes and data into MessageSpecifier or an auxiliary interface too.
    • For example, a MessageSpecifier representing a "integer out of range" error might have any of three different message keys, for "Must be greater than $1", "Must be less than $2", and "Must be between $1 and $2". It can be useful for the same message object to also contain a machine-readable 'code' indicating that it's an "integer out of range" error message regardless of the specific i18n message, as well as machine-readable data to indicate the minimum and maximum rather than requiring callers to know that $1 and $2 are the min and max (rather than $2 and $3, or whatever).
    • Yes, this is basically MediaWiki's IApiMessage.

Possibly relevant (although maybe not something to attempt at the same time given the complexity) is that currently Message is something of a security nightmare: the code turning it into a string (and choosing whether to escape) doesn't know what format the message is in; the repository holding the message (ie. Translatewiki) doesn't know how it's going to be escaped. $wgRawHtmlMessages was an attempt to specify that, but it's a band-aid at best (and doesn't actually prevent using an undeclared message as raw HTML). Fixing that will probably involve some changes to ow Message is called, e.g. unbundling formatting (plain text, block etc) and escaping (raw HTML or not). See also @Simetrical's comments in T21291#4468578.

I don't think it's that simple. The problem is that, fundamentally, we're just dealing with strings and the interpretation of the input-string (from translatewiki) depends on how the outputting code uses it:

  • If the outputting code calls ->plain() and shoves it into HTML, the input was treated as an HTML string.
  • If the outputting code calls ->text() and shoves it into HTML, the input was treated as some sort of hybrid of wikitext-templates and HTML.
  • If the outputting code calls ->plain() and runs it through htmlspecialchars(), the input was treated as a plain-text string.
  • If the outputting code calls ->text() and runs it through htmlspecialchars(), or calls ->encoded(), and shoves it into HTML, the input was treated as some sort of hybrid of wikitext-templates and plain text.
  • If the outputting code calls ->plain() and outputs it in a non-HTML context, the input was treated as plain text string.
  • If the outputting code calls ->text() and outputs it in a non-HTML context, the input was treated as some sort of hybrid of wikitext-templates and plain text.
  • If the outputting code calls ->parse(), the input was treated as a wikitext string.
  • If the outputting code calls ->plain() but then passes it to the Parser, the input was also treated as a wikitext string (as were most of the parameters, but differently than if ->parse() was called directly).
  • If the outputting code calls ->text() but then passes it to the Parser, the input was also treated as a wikitext string that gets templates weirdly double-expanded.

And it's even worse when you consider that most parameters have a type too.

To fix it, we'd probably have to add some concept of the type the input and each parameter thinks it's supposed to be and have the rendering methods transform that type appropriately. Only then can a taint-checker catch if you're outputting the result in the wrong way. Just adding a "forHTML()" method as suggested in T21291#4468578 won't work because it doesn't know how the input thought it was intended to be used.

Re: escaping, basically messages are processed in two stages: first either leave as is or preprocess or fully parse, then either escape or leave as is (conceptually, sanitizing the HTML would be the third option, but that happens as part of the parse). Some combinations do not make sense (specifically, if you parse a message, there's no point in escaping it), and sometimes the escaping step or even both are done manually by the caller, not by Message, but these are the fundamental use cases. The first stage corresponds to the format of the message - the message author and the programmer calling the Message rendering method need to be on the same page as to whether the message can contain templates, and whether it can contain wikimarkup. The second stage corresponds to the use site of the message - the programmer calling the Message rendering method needs to know whether the message is plain text or escaped or sanitized HTML and manually sanitize accordingly if needed before inserting in the output. That these two choices have to happen at the same time, in the same code location, makes things unnecessarily complicated.

So one thing to do would be to have the rendering methods return UntrustedString with an internal flag as to whether it is actually untrusted (so parse or escape would return a trusted one, text/plain an untrusted one) and then UntrustedString would have a getRaw and getSafe method, with getSafe adding another layer of HTML escaping if the internal flag is set to untrusted. (Or possibly separate getSafeForHTML, getSafeForJS, getSafeForSQL, although I'm not sure the non-HTML use cases exist.) Sometimes you'll still need to get the raw text and manually parse or escape it, but 1) a method name like getRaw is IMO an obvious warning flag to the programmer in a way text isn't, 2) more importantly, the code that creates the Message object (and knows which message we are talking about and what format it is using) can decide the appropriate formatting method to use, while the code that inserts the message in the output (and knows whether that's a HTML output or a plain text one) can decide the appropriate escaping mechanism, with the benefit of the object having some internal flags to prevent double-escaping.
I think this is more or less what Simetrical was saying, although I might be misunderstanding him wildly.

Of course, ideally the format would be included with the i18n files as metadata, instead of Message-constructing code having to sort-of set it, but I have no idea how hard that would be.

Re: escaping, basically messages are processed in two stages: first either leave as is or preprocess or fully parse, then either escape or leave as is (conceptually, sanitizing the HTML would be the third option, but that happens as part of the parse).

Not really. Message processing actually works something like this:

  • First, most parameters are injected into the message-string.
    • "raw", "plaintext", and Message parameters aren't.
    • Also "list" parameters that contain those are a bit tricky: depending on the types of the list items, it might be done here as a standard parameter or later as either a "raw" or Message type.
  • The message-string is processed.
    • plain() does nothing here.
    • text() preprocesses.
    • escaped() preprocesses and then HTML-encodes.
    • parse() and parseAsBlock() parse.
  • Then those parameters not done in the first step are injected.
    • "raw"-type parameters are injected as-is.
    • "plaintext"-type parameters get HTML-encoded for escaped(), parse(), and parseAsBlock().
    • Message-type parameters get processed with the same call as in the second bullet.

Valid usage patterns are, in my opinion,

  1. Calling plain() or text(), then outputting in a plaintext context.
    • I'd include "json-encoding and putting in a string in JS" as a plaintext context.
  2. Calling plain() or text(), then escaping to output as a plain-text string in HTML.
    • Although this doesn't really handle "raw" types correctly: they're escaped when that wasn't the intention.
  3. Calling escaped(), then outputting in HTML.
    • This is case #2 for text() but done correctly for "raw" parameters.
  4. Calling parse() (or parseAsBlock()) then outputting in HTML.
  5. The case of "raw HTML messages" is when you call plain() or text() then output in HTML without escaping. The trick is in making sure translatewiki and anything supplying a parameter knows this is the use case.
  6. There's also MediaWiki:Common.js and the like, which do the same but outputting as JS or CSS source without escaping. Same trick, less difficult thanks to the pseudo-file-extension.

Note that, given "raw", "plaintext", and Message parameter types, re-parsing (directly or by passing ->text() as a standard parameter to another Message) does not seem valid. Unfortunately we have code and patterns from before "raw", "plaintext", and Message parameters were supported that don't follow that model.

I guess your "UntrustedString" is trying to implicitly get to that by making the choice between #1 and #2 more explicit? And hoping the taint checker will then catch the "raw HTML message" case when it's not intended? That solution seems a bit heavy-weight to me though. We could as well make more targeted changes:

  • Use ".html" like we already do ",js" and ",css" to flag messages intended as raw HTML for translatewiki and anyone else looking at the message rather than the outputting code.
  • #2 is the only case that really has post-Message escaping for HTML output, since #3 exists for the text() case. We could introduce a new mode/method that does like #3 for plain() to replace the post-escaping in #2 and more correctly handle "raw".
  • Mark output from plain() and text() as tainted for HTML output.

BTW, I note that "raw"-type parameters are themselves somewhat problematic. Most likely they're being used to inject HTML into the message output, but in that case they're really only compatible with the eventual stringification using parse() (or parseAsBlock()). The other use case I'd guess at being common is from before "plaintext"-type parameters existed when someone wanted a plain text parameter and "knew" the value was not going to be an XSS.

  1. more importantly, the code that creates the Message object (and knows which message we are talking about and what format it is using) can decide the appropriate formatting method to use,

That seems to be overly focusing on the rare "raw HTML message" use case. Adding significant complication to every use of Message for a very rare use case seems excessive. It may also be confusing "source format" and "output format".

IMO anything actually intending to be using a raw HTML message should probably be handling it end-to-end from Message creation to stringification, nothing should return a Message intended as raw HTML to something else and nothing should output a Message received from other code as raw HTML. If there's a extant use case that requires that, I'd like to see it.

In some non-MediaWiki code base where raw HTML messages were common, flagging the input-format at Message creation time as you suggest, but rather than wrapping strings in an object I'd just make parse() do the right thing (i.e. not sending the HTML through the wikitext parser) so outputting unescaped data from plain() would always be an error and using parse() would always be the right thing to do.

IMO I would not consider "raw HTML messages" as a valid usage pattern. Let's just plan how to get rid of the few remaining cases.

I'd like to pitch that one of the reasons I wrote monkey-i18n was to demonstrate the embedding feature that makes it possible to avoid lego messages:

$manager = MonkeyI18n\Manager::defaultManager();
$manager->registerMessages( 'en', [
	'ho' => 'Hello {{EMBED:$n|&amp;}} &nbsp;',
] );
$factory = $manager->getMessageFactory();
echo $factory( 'ho' )
	->escapeWith( 'htmlspecialchars' )
	->with( 'n', new MonkeyI18n\Param\Unescaped( "<tag>$1</tag>" ) ) . "\n";

Output:

Hello <tag>$&amp;amp;</tag> &amp;nbsp;

It's not clear to me what exactly your example is supposed to be showing. You seem to have a parameter that itself contains parameter-replacements, but without a clear indication as to what pattern that would replace.

$n in the message gets replaced (in a safe way) by MonkeyI18n\Param\Unescaped( "<tag>$1</tag>" ), which gets $1 from the second param of {{EMBED}} (&amp;) in the message and gets escaped correctly.

There is no clean and safe way to do such thing with current Message class.

This need comes up semi-frequently in UI messages that want to embed a link, a button or other HTML element inside a sentence.

I'll admit I don't deal much with the fancier bits of the UI, but in my limited experience a link is generally done with wikitext and a button or other widget doesn't seem very likely to need to take data from the "outer" message. But I can see how there are cases where the added complication might be required for complicated links or the like, though.

I'm not sure I like the ->with( 'n', new MonkeyI18n\Param\Unescaped( "<tag>$1</tag>" ) ) syntax though. Would there be any reason to be using this with anything other than the "Param\Unescaped"? Could we just do ->withLink( 'n', "<tag>$1</tag>" )?

I'd like to keep it generic, not specific to links. Sometimes you just want wrap a part of sentence inside <span class=...></span> for styling. Other than that, I welcome ideas for better syntax.

Is "wrap some part of the message in HTML" be a generic enough summary? That is, would ->someMethodName( 'n', $html ) with the $html string containing only $1 as the replacement be the right level of functionality?

Or we could go in a different direction. Instead of it being something separate from the parameters, it could be something more like Message::rawParam() except with a similar $1 replacement. Or was ->with() intended to be the generic parameter-specifying mechanism in your i18n library?

Either way, when generating plain text the {{EMBED:id|word}} should probably just produce "word".

Or was ->with() intended to be the generic parameter-specifying mechanism in your i18n library?

Yeah it's equivalent to Message:param. Nothing prevents having shortcuts for certain types of parameters. In my experiment I actually require all parameters to be type, with automatic "boxing" of primitive values.

I ranted about a few gaps in our current Message interface, specifically the (lack of) interaction between RawMessage and MessageLocalizer, here: T227233#5305488.

@Catrope: So in the new stuff, you'd probably have different problems. Please let me know your take on this.

MessageValue doesn't concern itself with the language at all, nor with whether the message exists, it just holds the message key[1] and parameters. There's also no real equivalent to RawMessage, and hopefully we can avoid having to add one.[2] An appropriate "MessageFormatter",[3] fetched from a factory service,[4] will handle everything else: it encapsulates at least the Language, and probably[3] also useDatabase and any other flags, and will have the methods for existence checks and stringification and whatever else is needed.

So that tagShortDescriptionMessage() method would most likely either (a) take some sort of MessageFormatter in place of the MessageLocalizer and would return string|false, or (b) just return new MessageValue( "tag-$tag" ) and code receiving that MessageValue would be responsible for doing the ->exists() and ->isDisabled() checks and handling the results appropriately.

[1]: Currently it doesn't even support MediaWiki's fallback-keys concept. The hope is that we can avoid having to add that, or can hack it into MediaWiki's implementation (key as implode( '|', $keys ) maybe?) without having it be part of the library's interface.
[2]: Within MediaWiki you could use the "rawmessage" message for the common case where the raw message is just "$1".
[3]: The current plan is that there won't actually be a "MessageFormatter" class/interface though. Instead there is ITextFormatter and I guess there will be IHtmlFormatter, so static analysis can track whether the output is HTML-safe or not.
[4]: There's currently MessageFormatterFactory, which currently only allows for getting an ITextFormatter for a particular language. It'll probably gain a method to get an IHtmlFormatter too, but for the rest of the Message flags I suppose we'll probably wind up with a MessageFormatterFactoryFactory in MediaWiki. Yay DI across library boundaries!