Join hyphenated words across pages
Open, Low, Public

Description

Normally, when transcluding a sequence of pages with the <pages/> tag, a white space is added between every page and the next. This is good in most cases, but when a word is hyphenated at the end of a page, and continues in the next page, the space is not desirable. Currently a variety of different templates are used to circumvent this problem.

My proposal is to introduce a <hyphen/> tag. In the Page namespace, it will simply render as a - (a minus sign). However the <pages/> tag should prevent the generation of the white space if a <hyphen/> is present at the very end of the page or section, so that the two halves of the word are effectively joined together.

Example: name<hyphen/>space

Bug T60729 is also related to this.

Event Timeline

Candalua raised the priority of this task to Needs Triage.
Candalua updated the task description.
Candalua added a subscriber: Candalua.

Instead of introducing another self-closing tag that won't "work" using the expanded tag syntax, why not just get the wiki-markup/parser/Parsoid to properly recognize & process the soft-hyphen character (&shy;) in those eol, hyphenated-word across page-break instances?

Even better; make SHY a formal magic-word that automatically eliminates the current auto-space, carriage-return, line-feed thing when found at the end of a line.

Either way, I'm not sure the elimination of the automatically added space is feasible regardless of the hyphen approach being proposed; @Tpt ?

why not just get the wiki-markup/parser/Parsoid to properly recognize & process the soft-hyphen character (&shy;) in those eol, hyphenated-word across page-break instances?

I believe it is doable and not so difficult to implement. I am not sure that using the soft hyphen is the most contributor-friendly thing to do, but it is definitely better than introducing a new tag. Using the regular hyphen is maybe something to look at. It would be less semantic but friendlier to type with a keyboard or edit with the VisualEditor. But there may be a conflict with other use cases of hyphens.

Even better; make SHY a formal magic-word that automatically eliminates the current auto-space, carriage-return, line-feed thing when found at the end of a line.

It is doable with the current PHP parser but maybe not with Parsoid. But with CSS 3 I believe it would be doable to do such style changes even on the Parsoid output.

@Phe @Hsarrazin @Aubrey @micru What do you think about it?

PS: Wikipedia article about soft hyphen: https://en.wikipedia.org/wiki/Soft_hyphen

is this something that would allow NOT to use template tiret and tiret2 (on frws) anymore ?

replace that template that's really delicate to add (having to look forward to next page) by a simple markup to have the 2 parts of the word stuck together again ?

Well, I'm for it, yes, big time !!

As for "how it should work", I'm no tech. A magic word that could be added in place of the hyphen would be great, yes :)

last night I saw a bot running on frwiki, removing space at the end of pages (something called pywikibot touch edit https://fr.wikisource.org/w/index.php?title=Page:Bronte_-_Shirley_et_Agnes_Grey.djvu/547&curid=661141&diff=5301259&oldid=2060905)

@Phe @Tpt Does it have anything to do with this, or is it fixing something else ?

is this something that would allow NOT to use template tiret and tiret2 (on frws) anymore ?

No, frws will still be able to use tiret and tiret2. And, in fact, I think it will maybe be even possible to implement tiret and tiret2 using <hyphen />

last night I saw a bot running on frwiki, removing space at the end of pages (something called pywikibot touch edit https://fr.wikisource.org/w/index.php?title=Page:Bronte_-_Shirley_et_Agnes_Grey.djvu/547&curid=661141&diff=5301259&oldid=2060905)

I believe it was a purge operation done by @Billinghurst. But I am not sure.

Re the bot and touch. Yes it is cycling through pages due to another bug fix. The reason that the edit occurs is from yet another code change that occurred in our history. So we have the choice of which bug is less/more troublesome ... no listing of pages to file: or that touch does the space removal, which would occur the next time that the page was edited.

@Candalua: As you added Developer-Wishlist 2017, could you elaborate how fixing this would make a developer's life better/easier?

Tgr added a subscriber: Tgr.

This is not in scope for the wishlist as it is not about a feature that would help development.

Change 435834 had a related patch set uploaded (by Candalua; owner: Candalua):
[mediawiki/extensions/ProofreadPage@master] Suppress page separator before a hyphen

https://gerrit.wikimedia.org/r/435834

I submitted a patch for a possible solution. After some research, I decided to go for the "regular hyphen solution", as it is the most intuitive for the users, and the least likely to create problems with parsers, Visual Editor, and everything (but I would appreciate some feedback about this).

There are a few use cases where the hyphen should be kept, such as words which actually contain a hyphen, but I don't think we should cover those, as they are very rare and we will still be able to solve them with the usual transclusion templates like Tiret/Tiret2.
In any case, I included the possibility to configure a different "word joiner" instead of the hyphen.

Here are some tech details:

  • there is a new configuration variable to identify the "word joiner", which defaults to "-".
  • the sequence "word joiner + page separator" is replaced with an empty string after the parsing of the wikitext. As can be seen in the code, this is done through the use of a placeholder for the separator, because we need to mark the position of the separator before the parsing. The placeholder is also a new config variable which defaults to:
__PAGESEPARATOR__

written as if it were a magic word, to make it very improbable that it appears as legitimate wikitext. (I'm open to suggestions for an even better value.)

Of course it needs to be said that a wrong configuration of the word joiner and/or the placeholder can potentially break ProofreadPage transclusion.
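
The two steps listed above can be sketched roughly as follows. This is a minimal Python illustration and not the extension's PHP code: the function name `join_pages` and its parameters are hypothetical, the separator is assumed to be a plain space, and the wikitext parsing that happens between marking and replacing is left out.

```python
PLACEHOLDER = "__PAGESEPARATOR__"  # default placeholder value named in the comment above


def join_pages(pages, joiner="-", separator=" ", placeholder=PLACEHOLDER):
    """Join page texts, dropping 'joiner + separator' at page boundaries.

    Hypothetical helper, for illustration only.
    """
    # Step 1: mark every page boundary with the placeholder.
    marked = placeholder.join(pages)
    # (In the extension, the wikitext would be parsed at this point.)
    # Step 2: a joiner directly before a boundary removes both of them.
    if joiner:
        marked = marked.replace(joiner + placeholder, "")
    # Step 3: every remaining boundary becomes the normal separator.
    return marked.replace(placeholder, separator)


print(join_pages(["name-", "space"]))        # namespace
print(join_pages(["first page", "second"]))  # first page second
```

In this sketch, passing a joiner that never occurs in page text (for example a pseudo-magic word) leaves every boundary as a plain separator, which corresponds to the behaviour before the change.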

An example:

Let us assume that the page Page:foo.djvu/1 ends with the wikitext hyphen- and the page Page:Foo.djvu/2 starts with ated

Current state:

The output of <pages index="Foo.djvu" from="1" to="2" /> is hyphen- ated.

If Candalua's proposal is deployed and activated:

The output of <pages index="Foo.djvu" from="1" to="2" /> is hyphenated.

This operation would be executed after preprocessing but before parsing.

Let assume that the page Page:foo.djvu/1 contains the wikitext hyphen- and the page Page:Foo.djvu/2 contains -nated

Obviously you mean: Page:foo.djvu/1 ends with the wikitext hyphen- and Page:Foo.djvu/2 starts with ated

@Candalua we may have cases like

the page Page:foo.djvu/1 contains Anglo- and the page Page:Foo.djvu/2 contains -American or
the page Page:foo.djvu/1 contains Anglo- and the page Page:Foo.djvu/2 contains American

which both should be merged as Anglo-American. And this cannot be done automatically, without the user deciding whether the hyphen is necessary in a particular case.

@Candalua we may have cases like

the page Page:foo.djvu/1 contains Anglo- and the page Page:Foo.djvu/2 contains -American or
the page Page:foo.djvu/1 contains Anglo- and the page Page:Foo.djvu/2 contains American

which both should be merged as Anglo-American. And this cannot be done automatically, without the user deciding whether the hyphen is necessary in a particular case.

Exactly. As it is not possible to automatically distinguish each case, my change always removes the hyphen (which is correct for the majority of cases). These other cases will have to be managed as before, with a transclusion template.

Change 435834 merged by jenkins-bot:
[mediawiki/extensions/ProofreadPage@master] Suppress page separator before a hyphen

https://gerrit.wikimedia.org/r/435834

The change is now live, everybody please try and test it.
I have updated the documentation at https://www.mediawiki.org/wiki/Help:Extension:ProofreadPage, and I plan to send a mass message to all Scriptoriums in the next few days. Then, if no issues arise after some time, I'll close this task.

The subject of this task is slightly misleading: I see no <hyphen/> tag introduced.

Candalua renamed this task from add <hyphen/> tag to ProofreadPage so that <pages/> doesn't put a whitespace between pages to Join hyphenated words across pages. Sep 28 2018, 10:10 AM

@Candalua The minus sign seems to have special meaning on zh.ws and it is not a hyphenation sign there (and no space is added there when merging pages).
I think this feature should be totally disabled there.

See also examples in my comment there:
https://zh.wikisource.org/w/index.php?title=Wikisource%3A%E5%86%99%E5%AD%97%E9%97%B4&type=revision&diff=1503150&oldid=1502692

In other wikis, some hyphens at the page end are intentional. And this usage is broken now. (E.g. "queen -" / "mother" in en.ws; transcluded previously as "queen - mother")

IMO, the wikis should have been notified about this change (preferably before it took effect...). They need to review the hyphen-at-the-end-of-page usage.

As a positive, I note that I did not find any broken and previously correct usage in pl.ws, yet; but about 1000 pages still need to be reviewed.

Have you considered the case of German (pre-1996 orthography reform) where "ck" hyphenates as "kk"; e.g. "backer" becomes "bak-ker"? My personal view is there are too many edge cases like this for hyphenation to be managed automatically. I think the patch should be rolled back.

@Ankry, @Hesperian: sorry for having been rather bold. As of now, task discussion normally happens here on phabricator only: we should really improve this by communicating more often between communities, and warn them of upcoming changes (I have created this distribution list that can now be used to contact all Wikisources).

Now, if you want to return to the previous behaviour on a particular wiki, this can be done by setting $wgProofreadPagePageJoiner to something other than "-", for example a pseudo-magic word like __PAGEJOIN__. I will open a site request for zh.source, and for de.source too if you can provide me some consensus about that.

@Hesperian I think this change does not break anything concerning old German orthography, as it does not fully replace the method currently used for joining hyphenated words. The only valid reason for disabling this change would be if it is discouraged as a potential source of bad habits among editors. As I noted, it was not intended to fix all possible hyphenation cases: for some cases the current merging method should still be used.

@Candalua As I noted already, the cs.ws community likes this change. Also, in pl.ws we think it may be OK (I checked most of the affected pages and it seems that this change breaks only pages that were already broken, so nothing is worse here than it was before).
In en.ws I have found (and fixed) one page that was broken: "queen -" / "mother"; earlier rendered as "queen - mother", now as "queen mother". However, this is a large wiki with a lot of cases that may need a review.
In zh.ws this change is useless and I think it should be disabled, IMO. Unsure about other non-Latin, non-Greek and non-Cyrillic scripts.

However, for future changes, I think it would be better to implement them as disabled by default and let wikis decide whether they want them enabled or not. We have too many various scripts in use and editors with various habits, and it is hard to predict all possible cases.

The hyphens at the end of these zhwikisource pages are not real hyphens, but are part of the MediaWiki-Language-converter syntax. We need to identify these usages and not join them incorrectly.

@Midleading: as per discussion above, I have already opened a request to disable this functionality on zh.source: T205826. I don't understand why it has not been done yet.

Yeah, but regardless of that, if this feature can be made aware that the hyphens may be part of language converter syntax and does not make mistakes because of it, we can use this feature on zhwikisource.

Yeah, but regardless of that, if this feature can be made aware that the hyphens may be part of language converter syntax and does not make mistakes because of it, we can use this feature on zhwikisource.

After introducing T60729 in zhwikisource, this feature should be no-op there, so being active in zh.ws (and ja.ws) seems to me to be pointless.

Do you mean using the hyphen-at-the-page-end merging instead of T60729 ? Or just showing the hyphen in the Page namespace while dropping it in main ns?

Have you considered the case of German (pre-1996 orthography reform) where "ck" hyphenates as "kk"; e.g. "backer" becomes "bak-ker"? My personal view is there are too many edge cases like this for hyphenation to be managed automatically. I think the patch should be rolled back.

Is this construction used anywhere in dewikisource content? Can you point out an example? Pointing out that it is really useful, not only a theoretical possibility, may be helpful when deciding whether to implement it or not.

However, as I can see, current implementation ignores the context, so supporting kk/ck hyphenation would need a separate implementation (as well as supporting the -{some_character}- construction used by the language converter in zh.ws).

Any chance with this change working on cross-page references?

<ref name="note">hyphen-</ref><ref follow="note">ated</ref> currently outputs as hyphen- ated

What you say it renders as is a hyphen, not a minus sign. A minus sign looks quite different from a hyphen. Thus:

3 − 5

3 - 5

As reported here, that doesn't work if pages contain section tag(s). Maybe replace recursiveTagParse by recursiveTagParseFully?

I just tried to add recursiveTagParseFully and made a quick test.

The first page contains start- and the second one <section start="foo"/>\nend (where \n is a line break). The rendering is start end with both recursiveTagParseFully and recursiveTagParse. So, recursiveTagParseFully does not solve the problem.
I believe what we should do is investigate how to handle line breaks at the beginning of pages.

Hmm. Is this really an issue with LST as such? The example at frWS uses the pseudo-LST ## section name ## syntax provided by a local Gadget, which is what forces a newline. So far as I know, raw <section begin="section name" /> syntax does not force a newline and should work out of the box for this.

Working around this with raw LST tags should be acceptable for the relatively few cases of this, or, alternately, there's no particular reason the LST gadget can't be fixed to not force the newline in this situation.

Is there any other situation where we need to deal with newlines like this? They should only be being preserved here due to the LST syntax, because otherwise MW and PRP conspire to strip leading whitespace (which in turn requires workarounds like <nowrap /> and {{nop}}/{{nopt}}).

Any chance with this change working on cross-page references?

<ref name="note">hyphen-</ref><ref follow="note">ated</ref> currently outputs as hyphen- ated

Yes, this would be a very nice expansion of the feature (but should probably be a new task), that would probably eliminate 99% of the remaining uses for the hws/hwe templates on enWS (which I think are tiret/tiret2 on frWS).

But I suspect that that would actually have to be implemented in Extension:Cite, since it's Cite that is joining the content of <ref name="foo"> <ref follow="foo">. And I'm pretty sure this behaviour would only be desirable in the context of PRP and not all the other places Cite is used, so unlike the page-end hyphens this patch dealt with, I don't think you could enable that by default even if implemented.

@thiemowmde (since I know you looked at that code relatively recently) Any chance you could provide some general thoughts or pointers on how it might be (sanely) possible to get two ref=name/ref=follow refs to remove the hyphen and suppress the space in the example quoted above? Ideally in such a way that when the two refs have been transcluded by PRP it happens automatically, but I'm guessing that would create some rather unfortunate mutual dependencies between the two extensions.

This is obviously way outside the scope of the work you were doing in this area, but if you have any ideas or pointers that would let me at least write this up as a coherent new Phab task it would be really helpful.

My team reworked the Cite codebase a while ago when we worked on Book-Referencing. Unfortunately we currently don't have resources to support anything Cite-related. Even if we did, I'm not sure we would pick this up. My estimation as an engineer is that the feature discussed in this ticket requires rather high investments for not much benefit.

I can provide some ideas that might be worth exploring, if you want.

  • Maybe there is a Unicode feature you can use? Think of &shy;. What we need is a dash character that is rendered when it's at the end of a line (&shy; unfortunately doesn't do this), but becomes invisible when it's between two characters (this is literally what &shy; is for).
  • It should be possible to teach the Cite extension what it means when the content of a reference ends with &shy; (or the literal U+00AD character). It would need to do two things: Replace the trailing &shy; with - when the references are rendered independently, and merge them without a space when they are.
  • Another idea is to introduce some syntax that allows to render (partly) different content when the references are merged. It might even be possible to come up with a (mostly) pure CSS solution.

It might look something like this:

<ref name="foo">
    <span class="cite-follow-hide">hyphen-</span>
    <span class="cite-follow-show">hyphenated</span>
</ref>
<ref follow="foo">
    <span class="cite-follow-hide">ated</span>
    <span class="cite-follow-show"></span>
</ref>

This HTML syntax can be wrapped in a template, if you want. The only thing that needs to be changed in the Cite code is to output two different CSS classes when such references are merged vs. when they are not.

There are certainly more options.

Would it be possible to have a hook that PRP can tie into and process the text as needed? Then Cite can delegate responsibility to PRP for how this works, and all page-joining-hyphen shenanigans can be handled in the same place (PRP) with the same configurations (e.g. wgProofreadPagePageJoiner).

Any chance with this change working on cross-page references?

<ref name="note">hyphen-</ref><ref follow="note">ated</ref> currently outputs as hyphen- ated

For hyphenated references that flow over pages, just stick the hyphenated word in a <noinclude> tag set on the first page, and then on the following page set the same word inside an <includeonly> tag set. Isn't trying to create something super complex overkill for the number of occurrences when there is a ready and easy solution existing in wiki code? I think in ten+ years I will have done it two or three times, and I do plenty of references.

<noinclude> doesn't apply here. I would not reuse it for another purpose. But yes, the basic idea is the same as the one with the CSS classes I described.

I'm afraid I don't understand how this is related to ProofreadPage. I would avoid complex solutions like new hooks or anything that creates dependencies between two extensions.

<noinclude> doesn't apply here. I would not reuse it for another purpose. But yes, the basic idea is the same as the one with the CSS classes I described.

I'm afraid I don't understand how this is related to ProofreadPage. I would avoid complex solutions like new hooks or anything that creates dependencies between two extensions.

I was more wondering why we were fussing with edge cases that seldom occur and have a perfectly adequate solution within wikitext.

@thiemowmde The whole issue here is that you might have one PageNS page in PRP that has a reference like <ref name="p1">This ref-</ref> and continues on another Page: NS page with <ref follow="p1">erence is split</ref>.

When viewed in the Page NS, the first page should read "This ref-" (i.e. with hyphen), and when transcluded together, the text should read (and copy-paste) as "This reference is split" (i.e. without hyphen).

This is something that would only reasonably happen at Wikisource, in the context of PRP transclusion. Hence the idea to delegate to the PRP extension for reuse of the existing logic and configuration. Notably, PRP already does exactly this for the main content.

@Billinghurst certainly it's not a common situation, but it would make more intuitive sense if the refs and main text content were handled the same. You and I get it because we remember when {{hws/e}} were needed everywhere, but IMO it adds another little layer to the barrier to entry. WS processes are hard enough without saying "hyphens are handled automagically across page breaks except for refs when you use {{hws/e}}". Handling refs and main content the same would essentially allow hws/e to be completely deprecated except for some really bizarre edge cases, which simplifies messaging for new users.

Ah, I see. <includeonly> works because of the way the individual pages are merged.

<ref name="foo">
    hyphen<noinclude>-</noinclude><includeonly>ated</includeonly>
</ref>
<ref follow="foo">
    <noinclude>ated</noinclude>
</ref>

@Inductiveload, I know the features are used together, but I don't see how they technically depend on each other. I had a quick look but can't find any code in the ProofreadPage extension that would do anything with hyphens. German Wikisource apparently bypasses the issue entirely. English Wikisource appears to use the footer in combination with templates that render different content depending on the namespace (just another solution that works pretty much the same as the <noinclude> concept). Can't you use these templates in references?

My main points are:

  • While the suggested <hyphen /> tag might be convenient in many situations, it won't solve everything. Not all languages use hyphens. Sometimes a hyphenated word is spelled differently or uses different capitalization. The question is then: What's better? A mixture of two or more solutions of varying complexity, or a single solution that's a little bit more complex?
  • Every universal solution I can think of will probably have the same level of complexity as the solution you currently use.

@thiemowmde The existing PRP hyphen removal logic is handled in PageTagParser.php:

		$separator = $this->context->getConfig()->get( 'ProofreadPagePageSeparator' );
		$joiner = $this->context->getConfig()->get( 'ProofreadPagePageJoiner' );

This allows the client Wikisources to configure their own hyphen characters and page separators. For example, zhWS doesn't use hyphens and doesn't add spaces between pages, because Chinese doesn't have hyphenation or spaces between "words".

Configuration on the WMF wikifarm is in InitialiseSettings.php:

'wgProofreadPagePageSeparator' => [
	'default' => '&#32;',
	'jawikisource' => '', // T195873
	'thwikisource' => '', // T252610
	'zhwikisource' => '', // T194875
],

'wgProofreadPagePageJoiner' => [
	'default' => '-',
	'zhwikisource' => '__PAGEJOIN__', // T205826
],

Thus, delegating reference-joining logic to PRP would allow Wikisources to transparently unify the handling of continued references in exactly the same way as continued pages, using the exact same configuration.
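
To make the per-wiki configuration above concrete, here is a small Python sketch of how a "default plus per-wiki override" table of this shape can be read. The values are copied from the settings quoted above; the helper `setting_for` is hypothetical, and the real resolution on the WMF wikifarm may involve more rules than this simple lookup.

```python
# Values copied from the configuration quoted above.
SETTINGS = {
    "wgProofreadPagePageSeparator": {
        "default": "&#32;",
        "jawikisource": "",
        "thwikisource": "",
        "zhwikisource": "",
    },
    "wgProofreadPagePageJoiner": {
        "default": "-",
        "zhwikisource": "__PAGEJOIN__",
    },
}


def setting_for(wiki, name, settings=SETTINGS):
    """Return the per-wiki override if present, else the 'default' value.

    Hypothetical helper: a simplified lookup, for illustration only.
    """
    per_wiki = settings[name]
    return per_wiki.get(wiki, per_wiki["default"])


print(setting_for("enwikisource", "wgProofreadPagePageJoiner"))  # -
print(setting_for("zhwikisource", "wgProofreadPagePageJoiner"))  # __PAGEJOIN__
```

Under this reading, a wiki with no override joins on "-" and separates pages with a space entity, while zhWS gets an empty separator and a joiner that should never occur in real page text.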

With respect to the templates {{hws}} and {{hwe}}, these have been deprecated on enWS for some years for the main page content, since PRP became able to do this and the built-in PRP hyphen handling is much simpler. However, they are retained, mostly because references still need them (though this is not that common, because continued references are themselves not common).

Ah, thanks for the insight! Yes, in theory it would be possible to make Cite just re-use the existing ProofreadPage configuration, if it exists. Or copy-paste it with a wgCite… prefix. The implementation can be even more trivial than in ProofreadPage: Check if the previous <ref>'s text ends with the …PageJoiner string. If it does, remove it and merge without any separator. Otherwise merge with the …PageSeparator. The relevant line of code is https://phabricator.wikimedia.org/diffusion/ECIT/browse/master/src/ReferenceStack.php$147.

Don't forget the Parsoid extension needs to be updated accordingly: https://phabricator.wikimedia.org/diffusion/GPAR/browse/master/src/Ext/Cite/References.php$193. It's probably best to sync with the Parsoid team before implementing anything.
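
The merge rule described above for Cite (check whether the previous reference text ends with the joiner; if so, drop it and merge without a separator; otherwise merge with the separator) could look roughly like this. It is a Python sketch only, not Cite or Parsoid code: the function `merge_follow_ref` and its default arguments are hypothetical.

```python
def merge_follow_ref(previous, follow, joiner="-", separator=" "):
    """Merge the text of a <ref follow="..."> into the ref it continues.

    Hypothetical helper, for illustration only.
    """
    if joiner and previous.endswith(joiner):
        # Drop the trailing joiner and glue the two halves together.
        return previous[: -len(joiner)] + follow
    # No trailing joiner: keep the usual separator between the parts.
    return previous + separator + follow


print(merge_follow_ref("This ref-", "erence is split"))  # This reference is split
print(merge_follow_ref("First part", "second part"))     # First part second part
```

As with the page-level feature, this always removes a trailing joiner, so a reference that ends in a genuine hyphen (the "Anglo-" / "American" case) would still need a manual workaround.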

I think the idea of joining follow-refs with the PRP hyphenation rules should be a separate task because I think people are getting confused here. PRP already supports the headline functionality "Join hyphenated words across pages", and has done since May 2018 in 09e69dab933607533042d8343cab63b958703685. So I suggest to close this as resolved and deal with edge cases like refs separately in a subtask.