"[[New York ]]<nowiki/>population" appears instead of "[[New York]] population" in translations
Closed, Resolved (Public)

Description

Examine the following pieces of wiki syntax:

  • [[New York ]]<nowiki/>population
  • [[New York]] population

They create precisely the same HTML:

<li><a href="/wiki/New_York" title="New York">New York</a> population</li>

The only difference, of course, is that the first example has dirtier wiki syntax.

Can Parsoid avoid adding this <nowiki/> and instead normalize it immediately to [[New York]] population?

This probably doesn't happen in VisualEditor, because VE's UI removes the trailing space, but ContentTranslation doesn't do it. It makes sense to me to handle this at the Parsoid level.
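To make the request concrete, here is a minimal JavaScript sketch of the before/after on a plain wikitext string. The function name is hypothetical, and this is not how Parsoid does it (the Parsoid code linked later in this task works on the DOM in lib/html2wt/normalizeDOM.js); it only illustrates the intended result.

```javascript
// Hypothetical helper, not part of Parsoid: moves whitespace that sits just
// before "]]" to just after it and drops the <nowiki/> separator.
function moveTrailingSpaceOutOfLink(wikitext) {
  // Group 1: link content up to its last non-space, non-bracket character.
  // Group 2: the trailing whitespace inside the brackets.
  const pattern = /\[\[([^\[\]]*?[^\[\]\s])(\s+)\]\]<nowiki\s*\/>/g;
  return wikitext.replace(pattern, '[[$1]]$2');
}

console.log(moveTrailingSpaceOutOfLink('[[New York ]]<nowiki/>population'));
// prints: [[New York]] population
```

A link with no trailing space, such as [[Link]]<nowiki/>Word, is left untouched by this sketch, because removing the <nowiki/> there would change the rendered output.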

Event Timeline

Amire80 raised the priority of this task from to Medium.
Amire80 updated the task description.
Amire80 added subscribers: Amire80, eranroz, Arlolra.

I think that this might be the same problem. This is what it looked like (and what was wanted) in the table cell:

[[Link]]
Word

But when it saved, the result was

[[Link]]<nowiki/>Word

which renders differently (and screws up the column widths).

No, this is most likely because there is no space either at the end of the link or after it. It's a different problem. There may be different user behaviors that lead to this, but I guess that the most common reason is that people highlight a part of a word and mark it as a link; in wiki syntax this would turn the whole word into a link, but VE adds a <nowiki/>. The rendered result is different.

However, when there is a space before the closing ]] and a <nowiki/> after, MediaWiki moves the space out of the link when outputting the page for reading. This is a tad surprising, but totally sensible in practice.
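The partial-word case can be sketched roughly as follows. This is only an illustration with hypothetical names, not VE's or Parsoid's code. It assumes an English-style link trail (lowercase ASCII letters directly after the closing brackets); the pattern differs per wiki, so it is passed in as a parameter.

```javascript
// Assumption: English-style link trail. Other wikis use a different pattern.
const ENGLISH_LINKTRAIL = /^[a-z]+/;

// Hypothetical helper: serialize a link followed by plain text, inserting
// <nowiki/> only when the following text would otherwise be absorbed into
// the link as a trail.
function linkThenText(target, following, linktrail = ENGLISH_LINKTRAIL) {
  const link = `[[${target}]]`;
  return linktrail.test(following)
    ? `${link}<nowiki/>${following}`
    : `${link}${following}`;
}

console.log(linkThenText('popul', 'ation'));
// prints: [[popul]]<nowiki/>ation
console.log(linkThenText('New York', ' population'));
// prints: [[New York]] population
```

In the first call the separator is needed to keep only "popul" linked; in the second the space already ends the link, so no separator is emitted.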

cscott renamed this task from "[[New York ]]<nowiki/>population" should be automatically changed to "[[New York]] population" to "[[New York ]]<nowiki/>population" should be automatically changed to "[[New York ]] population" (note embedded space). Nov 2 2015, 4:55 PM
cscott set Security to None.
Amire80 renamed this task from "[[New York ]]<nowiki/>population" should be automatically changed to "[[New York ]] population" (note embedded space) to "[[New York ]]<nowiki/>population" should be automatically changed to "[[New York]] population". Nov 2 2015, 5:02 PM

Regarding the task title change: I am not proposing to leave the space inside the square brackets, but to move it outside of the square brackets.

I don't think Parsoid would serialize this as [[New York ]] in the first place; it seems we'd be more likely to serialize it as [[New York|New York ]]. We're pretty conservative about matching link text with title text, I believe.

Is the proposal to move all trailing whitespace out of link text?

cscott: we already do migrate trailing whitespace out of the text. https://github.com/wikimedia/parsoid/blob/f0d77afc0b952b96daa582f5de6ac8d3c20b4413/lib/html2wt/normalizeDOM.js#L210-L232

But, we somehow aren't normalizing the wikilink to the [[Foo]] form.

[subbu@earth html2wt] echo "[[Foo|Foo ]] bar" | parse.js --wt2wt --scrubWikitext
[[Foo|Foo]] bar
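The missing step, collapsing a piped link whose label equals its target, could look something like this on a wikitext string. It is a sketch with a hypothetical function name, not Parsoid's serializer logic, and it only handles the exact-match case (it ignores case and underscore differences that MediaWiki treats as equivalent).

```javascript
// Hypothetical helper: rewrite [[Foo|Foo]] as [[Foo]] when target and label
// are character-for-character identical.
function collapseRedundantPipe(wikitext) {
  // \1 is a backreference: the label must repeat the captured target exactly.
  return wikitext.replace(/\[\[([^\[\]|]+)\|\1\]\]/g, '[[$1]]');
}

console.log(collapseRedundantPipe('[[Foo|Foo]] bar'));
// prints: [[Foo]] bar
```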

@Amire80, we already do what you request as far as I know. See below:

[subbu@earth html2wt] parse.js --html2wt < /tmp/html
[[Foo|Foo ]]<nowiki/>bar

[subbu@earth html2wt] parse.js --html2wt --scrubWikitext < /tmp/html
[[Foo|Foo]] bar

CX passes in the scrubWikitext API param as far as we know. If not, it is worth checking that it is being passed in.

We already do this: https://github.com/wikimedia/parsoid/blob/master/lib/html2wt/normalizeDOM.js#L240

λ (master) cat t
<a href="/wiki/New_York" title="New York">New York </a>population

λ (master) cat t | node bin/parse --html2wt
[[New York|New York ]]<nowiki/>population

λ (master) cat t | node bin/parse --html2wt --scrubWikitext
[[New York]] population

Hmmmm.

Then this might be a ContentTranslation bug. I see it happening frequently in edits created by ContentTranslation, for example here. It's in Hebrew, but I think that it's easy to see it in the first line of the wiki text.

We recently changed scrubWikitext to scrub_wikitext (T115102). Was it done correctly? You can see the current code at https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FContentTranslation.git/master/api%2FApiContentTranslationPublish.php#L110

Change 252665 had a related patch set uploaded (by Amire80):
Send both scrubWikitext and scrub_wikitext

https://gerrit.wikimedia.org/r/252665

Amire80 renamed this task from "[[New York ]]<nowiki/>population" should be automatically changed to "[[New York]] population" to "[[New York ]]<nowiki/>population" appears instead of "[[New York]] population" in translations. Nov 12 2015, 11:23 AM
Amire80 claimed this task.

The reason for this seems to be twofold:

  • There is currently a bug in RESTBase: it does not translate scrub_wikitext to scrubWikitext before calling Parsoid for transforms that do not supply a revision ID.
  • CX's API supplies only the title, not the revision.

I will tackle the first point today (likely to be deployed today as well). For the second point, please append the revision to the RESTBase URI in your transform request.
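Both points can be sketched in a few lines. All names here are hypothetical and this is not RESTBase's or CX's actual code; the path layout in particular is an assumption about the transform endpoint and should be checked against the RESTBase API documentation.

```javascript
// Hypothetical sketch of the first point: rename the snake_case flag to the
// camelCase name Parsoid expects, regardless of whether a revision was given.
function toParsoidParams(body) {
  const out = { ...body };
  if ('scrub_wikitext' in out) {
    out.scrubWikitext = out.scrub_wikitext;
    delete out.scrub_wikitext;
  }
  return out;
}

// Hypothetical sketch of the second point: append the revision to the
// transform path when the caller knows it. The path shape is an assumption.
function transformPath(title, revision) {
  const base = `/transform/html/to/wikitext/${encodeURIComponent(title)}`;
  return revision === undefined ? base : `${base}/${revision}`;
}

console.log(toParsoidParams({ html: '<p>x</p>', scrub_wikitext: true }));
console.log(transformPath('New York', 12345));
// prints: /transform/html/to/wikitext/New%20York/12345
```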

Change 252665 abandoned by Amire80:
Send both scrubWikitext and scrub_wikitext

https://gerrit.wikimedia.org/r/252665

I will tackle the first point today (likely to be deployed today as well).

The fix has been deployed in production.

Arrbee subscribed.

Resolved due to upstream action.