Page MenuHomePhabricator

Parsoid is generating [[Foo|Foo]] instead of [[Foo]] for some VE edits
Closed, ResolvedPublic8 Story Points

Description

In some cases, the Visual Editor creates links like

[[Page Name|Page Name]]

which should just be

[[Page Name]]

This is potentially related to, but not a duplicate of T56947, as this specific bug seems like it could be resolved by an automatic cleanup of the generated wikitext.

One example is this edit (link to "Webforum") and I will try to provide some more in the next days.

Related Objects

Mentioned In
T163549: Only lint pages that have wikitext contentmodel
T162920: In multi-content/template-block scenarios, Linter displays "--" in the "Through a template"? column
T161151: Parsoid should resolve template paths before providing them to Linter
T164006: Suggestion: API for fetching lint errors for a specific revision
T163091: Parsoid: Add API endpoint to get lint errors for arbitrary wikitext
T37247: content-holding <div> should only contain the page text
T164792: Add class mw-parser-output to Parsoid's output
T141226: Missing data-mw content in wikitext leads to html2wt exceptions
T50463: Parsoid: Links created with VE are piped links even when linktrail could be used
Mentioned Here
rGPARa182c227f09d: T141226: Fix logic when unpacking dom fragments
T37247: content-holding <div> should only contain the page text
T141226: Missing data-mw content in wikitext leads to html2wt exceptions
T161151: Parsoid should resolve template paths before providing them to Linter
T162920: In multi-content/template-block scenarios, Linter displays "--" in the "Through a template"? column
T163091: Parsoid: Add API endpoint to get lint errors for arbitrary wikitext
T163549: Only lint pages that have wikitext contentmodel
T164006: Suggestion: API for fetching lint errors for a specific revision
T164792: Add class mw-parser-output to Parsoid's output
T136653: Parsoid doesn't recognize interwiki shortcuts in the href attribute
T56947: VisualEditor: When user changes a link anchor which has the same link target, suggest that they may wish to change the link target too

Event Timeline

Cirdan created this task.Dec 13 2016, 6:59 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptDec 13 2016, 6:59 PM
Cirdan updated the task description. (Show Details)Dec 13 2016, 7:04 PM
ssastry renamed this task from Visual Editor creates unnecessary pipelinks to Parsoid is generating [[Foo|Foo]] instead of [[Foo]] for some VE edits.Dec 14 2016, 9:32 PM
ssastry reopened this task as Open.
ssastry triaged this task as High priority.
ssastry added a project: Parsoid.

FWIW, I cannot reproduce the bug right now on that title. I tried on a couple links. I also went back to the oldid that caused the problem and I couldn't reproduce it on that specific link either. It will be helpful to get more diffs and maybe we will get a sense of what is going on when we look at them together.

Jdforrester-WMF added a subscriber: Jdforrester-WMF.

(Thanks for re-opening.)

jhsoby added a subscriber: jhsoby.Dec 14 2016, 10:47 PM
Elitre added a subscriber: Elitre.Dec 15 2016, 7:27 AM
Jdforrester-WMF set the point value for this task to 8.Feb 3 2017, 9:29 PM
Cirdan rescinded a token.Feb 11 2017, 5:27 PM

I put a few requests for help out.

This is probably a variant of this bug: https://no.wikipedia.org/w/index.php?title=Colin_Archer&diff=17212741&oldid=17211304

Instead of adding the markup for bold text outside the links, VE created a piped link with the bold markup as display text for the link.

Elitre added a comment.Mar 6 2017, 7:32 PM

Hadibe at de.wiki provided a new set of examples there.

This is probably a variant of this bug: https://no.wikipedia.org/w/index.php?title=Colin_Archer&diff=17212741&oldid=17211304
Instead of adding the markup for bold text outside the links, VE created a piped link with the bold markup as display text for the link.

This is a different "bug" and is not directly related to this bug report. Parsoid shouldn't be emitting [[Foo|Foo]] at all and that is what we are looking for reproduction steps for.

Hadibe at de.wiki provided a new set of examples there.

Those examples are a bit different. [[Holly Hunter|'''Holly Hunter''']] and others are expected and existing known behavior. But [[Foo|Foo]] is unexpected and shouldn't even be happening. So, that is the one that requires reproduction steps.

Thanks! If there is a way to figure out what the editor tried to do and reproduce it in VE, that would help us figure out what is going on.

Thanks! Reproduced. The crucial step here is the copy-paste of an existing piped link. This leads to HTML of this form:

<p><a href="./Foo" rel="mw:WikiLink" title="Foo" id="mwAg">bar</a></p>
<p><a href="./Foo" rel="mw:WikiLink" title="Foo" id="mwAg">Foo</a></p>

Note that the id is duplicated on the copied element. This causes Parsoid to reuse data-parsoid information from the previous copied link. That data-parsoid indicates that the [[..]] should be output as a piped link and that is what causes the [[Foo|Foo]] output. The piped syntactical information overrides other information because that is the only way to preserve syntactical information from source wikitext like [[Foo|Foo]] or [[Foo|Fools]] that showed up in source.

I have lost track of how VE handles copy-paste and ids but at first glance, it appears that VE should not be preserving the element id on copy. @Esanders @Jdforrester-WMF

Do we know if it's an internal copy vs. an internal cut? We have to keep the id otherwise we'd force reserialisation, I thought?

Do we know if it's an internal copy vs. an internal cut? We have to keep the id otherwise we'd force reserialisation, I thought?

Aha ... cut vs copy. Yes, for cut to move to a different place, preserving id ensures formatting is preserved. But, if a copy, the formatting is duplicated. I don't know what I did in my reproduction there. But, yes, if it is a cut and multiple-copy .. the formatting will indeed be duplicated across those multiple copy sites. Not sure what a good solution for this is.

That apart, the reproduction steps in https://www.mediawiki.org/wiki/Topic:Tkuliwqrx24yzckh says copy, not cut. So, that indicates that the id should not be preserved inside VE, and if it is, that looks like a VE bug.

This is actually a bit more complicated, I suppose. In some cases, duplicating formatting is safe / good, and in other cases, duplicating formatting leads to unintended consequences like this bug report here. Long-term, in Parsoid, we are striving to reduce reliance on this source-level information, but till then, we will have edge case scenarios like this.

Yeah…

  • If I edit some content, I want it to serialise the same as the original (so retain IDs to make that happen).
  • If I copy some content, I want it to serialise the same as the original (so copy across IDs to make that happen).
  • If I move some content, I want it to serialise the same as the original (so copy across IDs to make that happen).

BUT

  • If I copy some content, and then edit it in a way that would change the serialisation options, I might want it to serialise differently from the original, but client-side I don't know what will cause this (so… help).

I think this is completely a Parsoid issue. In the past we didn't preserve data-parsoid/IDs on copy but users complained that serialisation info was lost on duplication. In this case Parsoid needs to make a decision about whether or not to keep the piping rule if the link text is changed, it should not be left up to the client. In VE we try to preserve serialisation unless there has been a modification, so I think it would be reasonable for Parsoid to normalise a link if the text or target is changed.

Since it is a copy, it is always going to show up as new content during html2wt. So that, in itself, is insufficient information to determine whether to use / discard data-parsoid info. The real problem here is what I indicated in T153107#3208202. Not yet sure how to address that.

To clarify - if you have a link which is marked as "always serialise to piped", but you also know that the link/label has been modified, then you should ignore that marking, and normalize it. I assume you have all this information available?

To clarify - if you have a link which is marked as "always serialise to piped", but you also know that the link/label has been modified, then you should ignore that marking, and normalize it. I assume you have all this information available?

Yes, that makes sense and we do that already ... see below.

[subbu@earth parsoid] cat /tmp/wt
[[Foo|bar]]
[subbu@earth parsoid] parse.js < /tmp/wt > /tmp/o.html
[subbu@earth parsoid] sed 's/bar/Foo/g;' < /tmp/o.html > /tmp/n.html
[subbu@earth parsoid] parse.js --html2wt < /tmp/n.html ### without dom-diff and modification information
[[Foo|Foo]]
[subbu@earth parsoid] parse.js --html2wt --selser --oldtextfile /tmp/wt --oldhtmlfile /tmp/o.html < /tmp/n.html #### selser uses dom-diff and has modification information
[[Foo]]

Anyway, will have to look at the code to see what information we are missing where ... is it a problem with the dom-diff code or is it some edge case in the serialization.

copied the HTML for the modified link and reran selser and I get:

[subbu@earth parsoid] parse.js --html2wt --selser --oldtextfile /tmp/wt --oldhtmlfile /tmp/o.html < /tmp/n.html
[[Foo]]

[[Foo|Foo]]

So, it is clearly missing some modification flag in freshly inserted content in this specific case ... looks like some unhandled edge case.

ssastry claimed this task.May 2 2017, 9:24 PM

Change 351519 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] WIP: T153107: Fix unhandled detection of modified link content

https://gerrit.wikimedia.org/r/351519

The fix for this is simple (a 1-line fix), but I promptly tripped over T136653 which needs fixing since it caused parser test failures.

ssastry moved this task from Backlog to In Progress on the Parsoid board.May 2 2017, 10:41 PM

Go the case this morning with one of my edits. Search for "Pays de Fougères" in the diff.

Previous state: [[Pays de Fougères|le Pays de Fougères]]
Current state: le [[Pays de Fougères|Pays de Fougères]]

I've removed the "le" by using my Delete keyboard key, quit the link label input by using my Left arrow key and then typed "le" again.

Change 351519 merged by jenkins-bot:
[mediawiki/services/parsoid@master] T153107: Fix unhandled detection of modified link content

https://gerrit.wikimedia.org/r/351519

ssastry closed this task as Resolved.May 11 2017, 2:01 PM

Yet to be deployed.

Restricted Application added a project: User-Ryasmeen. · View Herald TranscriptMay 11 2017, 2:01 PM
Trizek-WMF reopened this task as Open.EditedAug 29 2017, 1:54 PM

I have experienced the problem with my volunteer account on https://fr.wikipedia.org/wiki/Le_Goff.

I've copy/pasted a list of people generated from Wikidata, where labels and targets for links are identical: https://fr.wikipedia.org/w/index.php?title=Le_Goff&diff=next&oldid=140157606 They haven't been merged to a single link. [[Alain Le Goff|Alain Le Goff]] was supposed to become [[Alain Le Goff]] and so on. I've done an other edit since, but nothing changed.

I think the task is about the system not generating those anymore, but it won't do magic for already existing, copy/pasted cases.
@Arlolra?

I think the task is about the system not generating those anymore, but it won't do magic for already existing, copy/pasted cases.
@Arlolra?

There's a regression test in the patch above that specifically mentions the copy/paste scenario. It should be normalized for all new and modified content.

I've copy/pasted a list of people generated from Wikidata, where labels and targets for links are identical: https://fr.wikipedia.org/w/index.php?title=Le_Goff&diff=next&oldid=140157606 They haven't been merged to a single link. [[Alain Le Goff|Alain Le Goff]] was supposed to become [[Alain Le Goff]] and so on.

Hmm, indeed Parsoid considered this modified content but chose to use the piped form. Can you provide steps to reproduce what you did? ie. How did you generate the list from Wikidata and copy/paste it to VE?

I've done an other edit since, but nothing changed.

Selective serialization would prevent further edits from cleaning up any unmodified portion of the page.

Can you provide steps to reproduce what you did? ie. How did you generate the list from Wikidata and copy/paste it to VE?

I use one of the Dicare tools (the interface is only in French), which generates lists of people with the same family name, based on Wikidata. The goal is to create disambiguation pages. http://tools.dicare.org/noms/homonymie.php

  1. Generate the list for Le Goff (Q24707465): http://tools.dicare.org/noms/homonymie.php?id=Q24707465&fallback=en+de+es+it+nl+br+ca (button 'gérérer")
  2. The wikitext is copied on a wiki page, using wikitext editor.
  3. All links are like this (bug to be fixed): [[Alain_Le_Goff|Alain Le Goff]].
  4. I do a simple search and replace on underscore signs to have [[Alain Le Goff|Alain Le Goff]] (I don't know how to remove Alain_Le_Goff| and all similare cases at once).
  5. I save the page

I use one of the Dicare tools (the interface is only in French), which generates lists of people with the same family name, based on Wikidata. The goal is to create disambiguation pages. http://tools.dicare.org/noms/homonymie.php

  1. Generate the list for Le Goff (Q24707465): http://tools.dicare.org/noms/homonymie.php?id=Q24707465&fallback=en+de+es+it+nl+br+ca (button 'gérérer")
  2. The wikitext is copied on a wiki page, using wikitext editor.

Sorry, this point needs some clarification.

The output of that tool is some wikitext (wikicode) and you paste it in a wikitext editor? At what point do you put it in VE?

  1. All links are like this (bug to be fixed): [[Alain_Le_Goff|Alain Le Goff]].
  2. I do a simple search and replace on underscore signs to have [[Alain Le Goff|Alain Le Goff]] (I don't know how to remove Alain_Le_Goff| and all similare cases at once).
  3. I save the page

The output of that tool is some wikitext (wikicode) and you paste it in a wikitext editor? At what point do you put it in VE?

In the meantime, @Envlh has fixed the tool. All links were like [[Alain_Le_Goff|Alain Le Goff]] due to a change on how Wikidata items were stored.

ssastry moved this task from In Progress to Backlog on the Parsoid board.Sep 11 2017, 7:09 PM

What's the status of this task?

I didn't had time to make more edits that would have reproduced the issue.

ssastry changed the task status from Open to Stalled.Dec 15 2017, 10:14 PM

If there haven't been other cases other than the one Benoit reported, isn't this resolved?

I haven't found other cases either.

ssastry closed this task as Resolved.Dec 19 2017, 2:58 PM

Please reopen if there are other instances that show up.