Page MenuHomePhabricator

Use CommentStoreComment throughout MediaWiki
Open, Needs TriagePublic

Description

The CommentStoreComment class, encapsulating a comment (edit summary, log message, etc.), was added as part of the big comment refactoring in T166732. It can contain not only a plain string text, but also a Message and additional JSON data. If a Message is present, the text is supposed to reflect that message in the content language, but the Message may be used to render the comment in the user language instead. Wikibase currently achieves a similar result via the FormatAutocomments hook, but CommentStoreComment has the potential to be a nicer and more general solution.

However, the class is currently not used to its full potential. CommentStore is responsible for saving and loading CommentStoreComments, but most of the places that use it to load a comment discard the comment object pretty soon and only use its text, equivalent to the old comment field prior to the comment refactoring. We should change this, use CommentStoreComment objects as much as possible, and render them properly, including translation via the message if possible.


How exactly this proper rendering looks like, though, is not yet clear. The most conservative approach, and what I think seems to have been intended during the design of the new comment system, is to take the ->text() format of the message and format it just like a regular comment (i. e. Linker::formatComment: sanitize, format /*autocomments*/, format [[links]]). However, this system is fairly limited in the formatting of the message, and it also means that any comment syntax in the message will be formatted, whether that was intended or not (e. g. plaintext params of the message would not really be plain text). In other words, while the message may contain structured information about the comment, this approach would not make use of it. To me, this would be passing up an opportunity to solve bugs like T186035 and T215929. (It is possible to fix those bugs in other ways – @Anomie suggested using invisible characters, like U+200B, to escape parts of the message – but that would have to happen before constructing the message.)

Another approach is to ->parse() the message, if it exists, and bypass the usual comment formatting. This gives the creators of the CommentStoreComments, e. g. Wikibase, much greater flexibility: they can use full Wikitext parsing in the summary, and parameters would be correctly escaped according to their type (e. g. plaintext params would stay plain text). However, this is not fully compatible with the original comment model: that the comment text corresponds to the ->text() of the comment message in the content language. Instead, the comment text here would contain arbitrary Wikitext, and anyone who uses the comment text instead of the CommentStoreComment proper would just see the Wikitext. Within MediaWiki, this would mean we should migrate to CommentStoreComment fairly quickly; but users of the comment.comment_text field on the replica databases, or of the comment revision prop in the API (as opposed to parsedcomment, which would now actually ->parse() the comment), would still be affected.

@daniel had an idea to mitigate this problem, though: creators of CommentStoreComments, e. g. Wikibase, could override the ->text field of the CommentStoreComment to be a more suitable textual form of the message, instead of the ->text() of the message, before passing it to CommentStore (or PageUpdater::saveRevision(), etc.). In the trivial case, this could just be “textual comment not supported”, but they could also format a simpler version of the message, or check if arguments need to be escaped with U+200B or similar.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMar 12 2019, 4:41 PM

How exactly this proper rendering looks like, though, is not yet clear.

That's really the main question here. If it weren't for that, this would be a pretty straightforward Technical-Debt task.

The most conservative approach, and what I think seems to have been intended during the design of the new comment system, is to take the ->text() format of the message and format it just like a regular comment (i. e. Linker::formatComment: sanitize, format /*autocomments*/, format [[links]]).

That's correct, as far as it goes.

A secondary intention was that something doing its own thing could generate a "comment" from scratch based on the context and the non-Message data in $commentStoreComment->data, while other code not knowing about its specialness could still have usable comment_text.

To me, this would be passing up an opportunity to solve bugs like T186035 and T215929. (It is possible to fix those bugs in other ways – @Anomie suggested using invisible characters, like U+200B, to escape parts of the message – but that would have to happen before constructing the message.)

IMO the invisible characters would be a better solution to those two tasks.

but users of the comment.comment_text field on the replica databases, or of the comment revision prop in the API (as opposed to parsedcomment, which would now actually ->parse() the comment), would still be affected.

Yes, this sort of thing would be a problem. Particularly since not every use case actually wants HTML-formatted comments.

@daniel had an idea to mitigate this problem, though: creators of CommentStoreComments, e. g. Wikibase, could override the ->text field of the CommentStoreComment to be a more suitable textual form of the message, instead of the ->text() of the message, before passing it to CommentStore (or PageUpdater::saveRevision(), etc.). In the trivial case, this could just be “textual comment not supported”, but they could also format a simpler version of the message, or check if arguments need to be escaped with U+200B or similar.

I can't say I like this much either. In particular I'd -1 any code I saw using the “textual comment not supported” cop-out. In the end you're getting back to solving your other tasks with the invisible-character "escaping" anyway, so why not just use that as the solution?

BTW, U+0080 or U+0083 might be more candidates for the "escape" character, with the benefit that they're less common than something like U+200B.

hoo added a subscriber: hoo.Mar 12 2019, 5:23 PM

In particular I'd -1 any code I saw using the “textual comment not supported” cop-out.

For Wikibase, a more likely cop-out would be “added [en] label” – still using an interface message, but omitting the label value (the part that causes problems when parsed as wikitext). That way we don’t have to copy the “nasty regex” from Linker::formatAutocomments just to detect what exactly we need to escape.

In the end you're getting back to solving your other tasks with the invisible-character "escaping" anyway, so why not just use that as the solution?

Because it’s not a full solution. Even leaving aside the autocomments and linking stuff, you already mentioned another problem on Gerrit:

The main drawback is that it'll turn the "Created a claim" auto-comment into a bogus section link, rather than just displaying it in grey. I suppose you could do something with the 'FormatAutocomments' hook to avoid that if you really want.

So we have to mess with the summary both before and after rendering it, and the rendering doesn’t even give us a lot of formatting options.

Particularly since not every use case actually wants HTML-formatted comments.

Then why are we even using the Message class for comments? It already doesn’t have an output format that’s completely suitable for comment formatting (->text() is an approximation that doesn’t escape autocomments and links in supposedly plaintext params), and now you’re bringing up more output formats that can’t match any existing Message format at all, as far as I can tell (because everything except plain and text is geared towards HTML).

Then why are we even using the Message class for comments?

Because Wikibase wants i18n.

Then why are we even using the Message class for comments?

Because Wikibase wants i18n.

But it looks like that should be done at render time. Trying to do it when saving the comment sounds nice, but as the discussion above shows, is rather problematic.

So, as Lucas is saying, the solution may be giving up on the idea of using a Message object when writing the comment, and instead supply plain text and data. Or only allow messages with no parameters.

If we want to remove the Message handling from CommentStore and let Wikibase do its i18n based on ->data (in either a modified 'FormatAutocomments' that gets the data or some higher-level hook added to Linker::formatComment()), I can't think of any major objection at the moment.

The main benefit of the built-in Message handling is that it's easier for a caller in the simple case where an i18n message is used to generate comment_text and we want to effectively re-generate comment_text from the same message (potentially in a different language) at render time.

Then why are we even using the Message class for comments?

Because Wikibase wants i18n.

Well, Message doesn’t have to be the only way to do i18n. We can add some hook that lets us completely bypass the Message and comment_text, and instead directly return HTML based on the comment_data, but then other extensions (or core) might run into the same problem later. And since nothing uses comment_data yet as far as I’m aware, now we still have the opportunity to find some better solution in core, instead of having Wikibase work around the issue.

By the way, I noticed that this is not the first time that Message’s output formats aren’t quite adequate – API errors and warnings have a situation that feels similar to me. Non-legacy errors support several error formats, including wikitext (i. e. $msg->text()) and html ($msg->parse()), but also plaintext, “intended for human display in HTML-incapable clients.” And to generate this plaintext, which doesn’t directly match any Message output format, the ApiErrorFormatter takes the ->text() of the message and then strips a bit of markup with a custom function, in “a minimal, best-effort transformation to make the message … more readable”. We could apply a similar transformation to a message to get its comment_text (except in this case the target format isn’t quite plain text), while fully parsing the message in places where HTML is expected.

Well, Message doesn’t have to be the only way to do i18n. We can add some hook that lets us completely bypass the Message and comment_text, and instead directly return HTML based on the comment_data,

That'd be what I'd recommend if something really wants to do something beyond what standard comment formatting can handle. To what extent we want to make that generically possible versus encouraging developers not to get too fancy about comment formatting is an open question.

By the way, I noticed that this is not the first time that Message’s output formats aren’t quite adequate – API errors and warnings have a situation that feels similar to me. Non-legacy errors support several error formats, including wikitext (i. e. $msg->text()) and html ($msg->parse()), but also plaintext, “intended for human display in HTML-incapable clients.” And to generate this plaintext, which doesn’t directly match any Message output format, the ApiErrorFormatter takes the ->text() of the message and then strips a bit of markup with a custom function, in “a minimal, best-effort transformation to make the message … more readable”. We could apply a similar transformation to a message to get its comment_text (except in this case the target format isn’t quite plain text), while fully parsing the message in places where HTML is expected.

I suppose that depends on how much difference there is between API error messages and what you're wanting to do with comments. In the API errors and warnings the only HTMLish markup normally used is <var>, <kbd>, <samp>, and <code>, and the semantic meaning of those tags in unformatted text is basically preserved by converting them into quotation marks.

I suppose that depends on how much difference there is between API error messages and what you're wanting to do with comments.

As far as I’m aware, both Wikibase and WikibaseSchema really don’t want to do much actual formatting. They want parts that look like autocomments but aren’t linked to sections; they want wikilinks; and they want parts that aren’t subject to any formatting at all. If we strip all HTML tags without replacement and just keep the text content, turning

<span dir="auto"><span class="autocomment">Changed claim: </span></span> <a href="/wiki/Property:P119" title="‎place of burial‎ | ‎location of grave, resting place, place of ash-scattering, etc, (e.g. town/city or cemetery) for a person or animal. There may be several places: e.g. re-burials, cenotaphs, parts of body buried separately.‎"><span class="wb-itemlink"><span class="wb-itemlink-label" dir="ltr" lang="en">place of burial</span> <span class="wb-itemlink-id">(P119)</span></span></a>: <a href="/wiki/Q533697" title="‎Highgate Cemetery‎ | ‎place of burial in north London, England‎"><span class="wb-itemlink"><span class="wb-itemlink-label" dir="ltr" lang="en">Highgate Cemetery</span> <span class="wb-itemlink-id">(Q533697)</span></span></a>

into

Changed claim: place of burial (P119): Highgate Cemetery (Q533697)

that’s already a decent plain-text representation of the comment that also works well enough when interpreted as comment text. (/* Changed claim: */ … would also work.)

We could say that a CommentStoreComment created from a Message generates the “best-effort text representation” by applying the following transformations:

  • replace /* and [[ with /█* and [█[ in plaintext message arguments, where stands for some invisible Unicode character
  • take the message ->text()
  • strip all HTML tags from the result (Sanitizer::stripAllTags())

Since this doesn’t parse the message, for a typical Wikibase summary the result might look like this:

Changed claim: [[Property:P119]]: [[Q533697]]

And coming back to @daniel’s idea, if an extension isn’t satisfied with that it can still override the CommentStoreComment’s ->text with an even better representation. (Though I think we wouldn’t actually do this in Wikibase.)

Addshore moved this task from incoming to monitoring on the Wikidata board.Mar 25 2019, 4:06 PM