Page MenuHomePhabricator

Double quotes in sitelink are displayed as HTML entity in automated edit summary
Closed, ResolvedPublic

Description

Problem:
We seem to have encoding issues in the edit summaries for some automated changes to sitelinks.

List of steps to reproduce (step by step, including full links if applicable):

  • move a page with a special character in its page title on the client wiki (e.g. Wikipedia)
  • view the diff of the resulting automated edit that changes the sitelink on the repository (e.g. Wikidata)

What happens?:

  • page title is shown wrong in the edit summary that was automatically generated for the page move

What should have happened instead?:

  • page title is shown without encoding issue

Browser information, screenshots/mockups, other information, etc. (if applicable):
Screenshot from an example diff with broken encoding in the summary:

obrazek.png (332×1 px, 61 KB)

Acceptance criteria (optional):

  • special characters in sitelinks are handled without encoding issues when used in an automated edit summary for a sitelink change

Notes:

  • check if this is just happening for page moves or also page deletions
  • see also T187912 for a similar issue in a nearby place

Event Timeline

Change 414510 had a related patch set uploaded (by D3r1ck01; owner: Alangi Derick):
[mediawiki/extensions/Wikibase@master] Remove double escaping in sitelinks

https://gerrit.wikimedia.org/r/414510

Change 414510 abandoned by D3r1ck01:
Remove double escaping in autocomments (sitelink changed)

https://gerrit.wikimedia.org/r/414510

Lydia_Pintscher claimed this task.
Lydia_Pintscher subscribed.

This seems solved now \o/

Ahhhhhh now I see it. Sorry. I was looking at the Item page title and the html page title which all looked fine. Ok now it's clear. Thanks.

@ItamarWMDE I wonder then if it's the same issue as T187912.

Lydia_Pintscher renamed this task from Double quotes in title are displayed as HTML entity in comments to Double quotes in sitelink change are displayed as HTML entity in automated edit summary.Aug 5 2022, 9:20 AM
Lydia_Pintscher renamed this task from Double quotes in sitelink change are displayed as HTML entity in automated edit summary to Double quotes in sitelink are displayed as HTML entity in automated edit summary.

It seems pretty similar, but a different message is involved. It just might be that resolving one, resolves the other as well. We should probably spruce up the description of this task though, to make it more "good first task"y

Internally, the comment is stored as /* clientsitelink-update:0|nlwiki|nlwiki:Koninklijk erkende Fanfare "St. Cecilia", Bocholtz|nlwiki:Fanfare St. Cecilia Bocholtz */. This is fine.
The unwanted escaping happens when the /* ... */ part is magically transformed, probably in AutoCommentFormatter::formatAutoComment:

// no message requires wikitext params and some args are user-controlled
$args = array_map( 'wfEscapeWikiText', $args );

// render the autocomment
$auto = $msg->params( $args )->parse();

At first, I thought the quotes are first escaped by wfEscapeWikiText and second by Message::parse... but in fact, they are escaped earlier and wfEscapeWikiText is the second time.

In core, CommentFormatter::preprocessRevComment calls CommentParser::preprocess which calls CommentParser::preprocessInternal with $unsafe = false and that calls Sanitizer::escapeHtmlAllowEntities which includes this:

# It seems wise to escape ' as well as ", as a matter of course.  Can't
# hurt. Use ENT_SUBSTITUTE so that incorrectly truncated multibyte characters
# don't cause the entire string to disappear.
$html = htmlspecialchars( $html, ENT_QUOTES | ENT_SUBSTITUTE );

So this is the first escaping which treats ' (see T187912) and " in particular. It actually does hurt...

When I checked the escapeHtmlAllowEntities and the signature of htmlspecialchars function, it seems like it doesn't matter if you put or remove ENT_QUOTES | ENT_SUBSTITUTE as flags, because it's the default value anyway.

function htmlspecialchars(string $string, int $flags = ENT_QUOTES|ENT_SUBSTITUTE, ?string $encoding = null, bool $double_encode = true)

I'm not sure if it's totally safe to unescape those characters. Instead, we can change the way we process auto comments and usernames in comments, and we can use preprocessUnsafe instead of preprocess in the CommentFormatter::preprocessRevComment:

	private function preprocessRevComment(
		CommentParser $parser,
		Authority $authority,
		RevisionRecord $revRecord,
		$samePage = false,
		$isPublic = false,
		$useParentheses = true
	) {
		if ( $revRecord->getComment( RevisionRecord::RAW ) === null ) {
			return "";
		}
		if ( $revRecord->audienceCan(
			RevisionRecord::DELETED_COMMENT,
			$isPublic ? RevisionRecord::FOR_PUBLIC : RevisionRecord::FOR_THIS_USER,
			$authority )
		) {
			$comment = $revRecord->getComment( RevisionRecord::FOR_THIS_USER, $authority );
			$block = $parser->preprocess(
				$comment ? $comment->text : '',
				$revRecord->getPageAsLinkTarget(),
				$samePage,
				null,
				true
			);
			$block = $this->wrapCommentWithBlock( $block, $useParentheses );
		} else {
			$block = " <span class=\"comment\">" . wfMessage( 'rev-deleted-comment' )->escaped() . "</span>";
		}
		if ( $revRecord->isDeleted( RevisionRecord::DELETED_COMMENT ) ) {
			$class = \Linker::getRevisionDeletedClass( $revRecord );
			return " <span class=\"$class comment\">$block</span>";
		}
		return $block;
	}

Change 832978 had a related patch set uploaded (by Hasan Akgün (WMDE); author: Hasan Akgün (WMDE)):

[mediawiki/core@master] Unescape double and single qoutes while formatting comments.

https://gerrit.wikimedia.org/r/832978

The ampersand is also affected, since it can also be used in titles – example:

image.png (490×1 px, 122 KB)

Change 832978 abandoned by Hasan Akgün (WMDE):

[mediawiki/core@master] Unescape double and single qoutes while formatting comments.

Reason:

I will try to solve this on Wikibase instead of MediaWiki core

https://gerrit.wikimedia.org/r/832978

Change 833014 had a related patch set uploaded (by Hasan Akgün (WMDE); author: Hasan Akgün (WMDE)):

[mediawiki/extensions/Wikibase@master] Unescape double and single qoutes while formatting auto-comments.

https://gerrit.wikimedia.org/r/833014

Change 833014 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Unescape double and single qoutes while formatting auto-comments.

https://gerrit.wikimedia.org/r/833014