
Join hyphenated words across pages
Open, LowPublic

Description

Normally, when transcluding a sequence of pages with the <pages/> tag, a white space is added between every page and the next. This is good in most cases, but when a word is hyphenated at the end of a page, and continues in the next page, the space is not desirable. Currently a variety of different templates are used to circumvent this problem.

My proposal is to introduce a <hyphen/> tag. In the Page namespace, it will simply render as a - (a minus sign). However the <pages/> tag should prevent the generation of the white space if a <hyphen/> is present at the very end of the page or section, so that the two halves of the word are effectively joined together.

Example: name<hyphen/>space

Bug T60729 is also related to this.

Event Timeline

Candalua created this task.Jul 2 2015, 1:01 PM
Candalua updated the task description.
Candalua raised the priority of this task from to Needs Triage.
Candalua added a subscriber: Candalua.
Restricted Application added a subscriber: Aklapper.Jul 2 2015, 1:02 PM
jayvdb awarded a token.Jul 2 2015, 1:03 PM
jayvdb added a subscriber: jayvdb.
GOIII added subscribers: Tpt, GOIII.Jul 3 2015, 9:27 PM

Instead of introducing another self-closing tag that won't "work" using the expanded tag syntax, why not just get the wiki-markup/parser/Parsoid to properly recognize & process the soft-hyphen character (&shy;) in those eol, hyphenated-word across page-break instances?

Even better; make SHY a formal magic-word that automatically eliminates the current auto-space, carriage-return, line-feed thing when found at the end of a line.

Either way, I'm not sure the elimination of the automatically added space is feasible regardless of the hyphen approach being proposed; @Tpt ?

why not just get the wiki-markup/parser/Parsoid to properly recognize & process the soft-hyphen character (&shy;) in those eol, hyphenated-word across page-break instances?

I believe it is doable and not so difficult to implement. I am not sure that use of the soft-hyphen is the most contributor-friendly thing to do, but it is definitely better than introducing a new tag. Using the regular hyphen is maybe something to look at. It would be less semantic but friendlier to type with a keyboard or edit with the VisualEditor. But there may be a conflict with other use cases of hyphens.

Even better; make SHY a formal magic-word that automatically eliminates the current auto-space, carriage-return, line-feed thing when found at the end of a line.

It is doable with the current PHP parser but maybe not with Parsoid. But with CSS 3 I believe it would be doable to do such style changes even on the Parsoid output.

@Phe @Hsarrazin @Aubrey @micru What do you think about it?

PS: Wikipedia article about soft hyphen: https://en.wikipedia.org/wiki/Soft_hyphen

is this something that would allow NOT to use template tiret and tiret2 (on frws) anymore ?

replace that template that's really delicate to add (having to look forward to next page) by a simple markup to have the 2 parts of the word stuck together again ?

Well, I'm for it, yes, big time !!

As for "how it should work", I'm no tech. A magic word that could be added in place of the hyphen would be great, yes :)

last night I saw a bot running on frwiki, removing space at the end of pages (something called pywikibot touch edit https://fr.wikisource.org/w/index.php?title=Page:Bronte_-_Shirley_et_Agnes_Grey.djvu/547&curid=661141&diff=5301259&oldid=2060905)

@Phe @Tpt Does it have anything to do with this, or is it fixing something else ?

is this something that would allow NOT to use template tiret and tiret2 (on frws) anymore ?

No, frws will still be able to use tiret and tiret2. And, in fact, I think it will maybe be even possible to implement tiret and tiret2 using <hyphen />

last night I saw a bot running on frwiki, removing space at the end of pages (something called pywikibot touch edit https://fr.wikisource.org/w/index.php?title=Page:Bronte_-_Shirley_et_Agnes_Grey.djvu/547&curid=661141&diff=5301259&oldid=2060905)

I believe it was a purge operation done by @Billinghurst. But I am not sure.

Re the bot and touch. Yes it is cycling through pages due to another bug fix. The reason that the edit occurs is from yet another code change that occurred in our history. So we have the choice of which bug is less/more troublesome ... no listing of pages to file: or that touch does the space removal, which would occur the next time that the page was edited.

Yann added a subscriber: Yann.Feb 27 2016, 11:01 PM
Restricted Application added a subscriber: StudiesWorld.Feb 27 2016, 11:01 PM
This comment was removed by Billinghurst.
Candalua triaged this task as Low priority.Sep 26 2016, 1:35 PM

@Candalua: As you added Developer-Wishlist 2017, could you elaborate how fixing this would make a developer's life better/easier?

Tgr added a subscriber: Tgr.

This is not in scope for the wishlist as it is not about a feature that would help development.

Change 435834 had a related patch set uploaded (by Candalua; owner: Candalua):
[mediawiki/extensions/ProofreadPage@master] Suppress page separator before a hyphen

https://gerrit.wikimedia.org/r/435834

Candalua claimed this task.EditedMay 29 2018, 8:14 AM

I submitted a patch for a possible solution. After some research, I decided to go for the "regular hyphen solution", as it is the most intuitive for the users, and the least likely to create problems with parsers, Visual Editor, and everything (but I would appreciate some feedback about this).

There are a few use cases where the hyphen should be kept, such as words which actually contain a hyphen, but I don't think we should cover those, as they are very rare and we will still be able to solve them with the usual transclusion templates like Tiret/Tiret2.
In any case, I included the possibility to configure a different "word joiner" instead of the hyphen.

Here are some tech details:

  • there is a new configuration variable to identify the "word joiner", which defaults to "-".
  • the sequence "word joiner + page separator" is replaced with an empty string after the parsing of the wikitext. As can be seen in the code, this is done through the use of a placeholder for the separator, because we need to mark the position of the separator before the parsing. The placeholder is also a new config variable which defaults to:
__PAGESEPARATOR__

written as if it were a magic word, to make it very improbable that it is present as legitimate wikitext. (I'm open to suggestions for an even better value.)
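The two steps above can be sketched roughly as follows. This is only an illustration in Python, not the ProofreadPage code: the function name `transclude`, the `parse` stand-in and the single-space `SEPARATOR` are assumptions made for the example; only the "-" and `__PAGESEPARATOR__` defaults come from the description above.

```python
JOINER = "-"                        # word joiner default, as described above
PLACEHOLDER = "__PAGESEPARATOR__"   # separator placeholder default, as described above
SEPARATOR = " "                     # assumed: the white space normally put between pages


def transclude(pages, parse=lambda text: text):
    """Join page texts, dropping 'joiner + separator' at page boundaries."""
    # 1. Mark every page boundary with the placeholder before parsing.
    marked = PLACEHOLDER.join(pages)
    # 2. Parse the wikitext (identity stand-in; the real parser is far more involved).
    parsed = parse(marked)
    # 3. A joiner directly followed by the placeholder marks a split word:
    #    remove both so that the two halves are joined.
    parsed = parsed.replace(JOINER + PLACEHOLDER, "")
    # 4. Every remaining placeholder becomes the ordinary separator.
    return parsed.replace(PLACEHOLDER, SEPARATOR)


print(transclude(["hyphen-", "ated"]))
print(transclude(["first page", "second page"]))
```

In this sketch a page only counts as ending with the joiner if "-" is literally its last character; trailing white space would prevent the match.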

Of course it needs to be said that a wrong configuration of the word joiner and/or the placeholder can potentially break ProofreadPage transclusion.

Tpt added a comment.EditedMay 29 2018, 8:53 PM

An example:

Let's assume that the page Page:foo.djvu/1 ends with the wikitext hyphen- and the page Page:Foo.djvu/2 starts with ated

Current state:

The output of <pages index="Foo.djvu" from="1" to="2" /> is hyphen- ated.

If Candalua's proposal is deployed and activated:

The output of <pages index="Foo.djvu" from="1" to="2" /> is hyphenated.

This operation would be executed after preprocessing but before parsing.

Let assume that the page Page:foo.djvu/1 contains the wikitext hyphen- and the page Page:Foo.djvu/2 contains -nated

Obviously you mean: Page:foo.djvu/1 ends with the wikitext hyphen- and Page:Foo.djvu/2 starts with ated

Ankry added a subscriber: Ankry.May 30 2018, 1:00 PM

@Candalua we may have cases like

the page Page:foo.djvu/1 contains Anglo- and the page Page:Foo.djvu/2 contains -American or
the page Page:foo.djvu/1 contains Anglo- and the page Page:Foo.djvu/2 contains American

which both should be merged as Anglo-American. And this cannot be done automatically without the user's decision on whether the hyphen is necessary in a particular case or not.

@Candalua we may have cases like

the page Page:foo.djvu/1 contains Anglo- and the page Page:Foo.djvu/2 contains -American or
the page Page:foo.djvu/1 contains Anglo- and the page Page:Foo.djvu/2 contains American

which both should be merged as Anglo-American. And this cannot be done automatically without the user's decision on whether the hyphen is necessary in a particular case or not.

Exactly. As it is not possible to automatically distinguish each case, my change always removes the hyphen (which is correct for the majority of cases). These other cases will have to be managed as before, with a transclusion template.
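To make the trade-off concrete, here is a minimal Python sketch of an "always remove the page-final hyphen" rule. The function name `merge` and its parameters are hypothetical, and this is not the code of the patch; it only shows why a context-free rule cannot tell a hyphenation break from a genuinely hyphenated word.

```python
def merge(first, second, joiner="-", separator=" "):
    """Merge two page texts with a context-free 'drop the final joiner' rule."""
    if first.endswith(joiner):
        # The rule cannot know whether the hyphen belongs to the word,
        # so it is always dropped together with the separator.
        return first[: -len(joiner)] + second
    return first + separator + second


print(merge("hyphen-", "ated"))       # a plain hyphenation break
print(merge("Anglo-", "American"))    # the hyphen was part of the word
print(merge("Anglo-", "-American"))   # only the page-final hyphen is dropped
```

In this sketch the second call loses the hyphen that belongs to the word, which is the kind of case that still needs a transclusion template. The third call keeps one hyphen only because the second page starts with its own; whether the deployed extension behaves the same way is not confirmed in this thread.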

Tgr removed a subscriber: Tgr.May 30 2018, 2:35 PM
Magol added a subscriber: Magol.Jul 7 2018, 9:27 AM

Change 435834 merged by jenkins-bot:
[mediawiki/extensions/ProofreadPage@master] Suppress page separator before a hyphen

https://gerrit.wikimedia.org/r/435834

The change is now live, everybody please try and test it.
I have updated the documentation at https://www.mediawiki.org/wiki/Help:Extension:ProofreadPage, and I plan to send a mass message to all Scriptoriums in the next few days. Then if no issues arise after some time I'll close this task.

Ankry added a comment.Sep 27 2018, 4:44 PM

The subject of this task is slightly misleading: I see no <hyphen/> tag introduced.

Candalua renamed this task from add <hyphen/> tag to ProofreadPage so that <pages/> doesn't put a whitespace between pages to Join hyphenated words across pages.Sep 28 2018, 10:10 AM

@Candalua The minus sign seems to have special meaning on zh.ws and it is not a hyphenation sign there (and no space is added there when merging pages).
I think this feature should be totally disabled there.

See also examples in my comment there:
https://zh.wikisource.org/w/index.php?title=Wikisource%3A%E5%86%99%E5%AD%97%E9%97%B4&type=revision&diff=1503150&oldid=1502692

In other wikis, some hyphens at the page end are intentional, and this usage is broken now. (E.g. "queen -" / "mother" in en.ws; transcluded previously as "queen - mother")

IMO, the wikis should have been notified about this change (preferably before it took effect...). They need to review the hyphen-at-the-end-of-page usage.

As a positive, I note that I did not find any broken and previously correct usage in pl.ws, yet; but about 1000 pages still need to be reviewed.

Have you considered the case of German (pre-1996 orthography reform) where "ck" hyphenates as "kk"; e.g. "backer" becomes "bak-ker"? My personal view is there are too many edge cases like this for hyphenation to be managed automatically. I think the patch should be rolled back.

@Ankry, @Hesperian: sorry for having been rather bold. As of now, task discussion normally happens here on phabricator only: we should really improve this by communicating more often between communities, and warn them of upcoming changes (I have created this distribution list that can now be used to contact all Wikisources).

Now, if you want to return to the previous behaviour on a particular wiki, this can be done by setting $ProofreadPagePageJoiner to something other than "-", for example a pseudo-magic word like __PAGEJOIN__. I will open a site request for zh.source, and for de.source too if you can provide me some consensus about that.
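As a rough illustration of what such a per-wiki setting would do, here is a small Python sketch. The `PAGE_JOINER` dict, the wiki keys and the `join_pages` function are hypothetical and only model the idea; real configuration is done through the extension's PHP settings, not like this.

```python
# Hypothetical per-wiki joiner settings, for illustration only.
PAGE_JOINER = {
    "default": "-",
    "zhwikisource": "__PAGEJOIN__",
}


def join_pages(wiki, first, second, separator=" "):
    """Join two page texts using the joiner configured for the given wiki."""
    joiner = PAGE_JOINER.get(wiki, PAGE_JOINER["default"])
    if first.endswith(joiner):
        # Joiner found at the page end: drop it and the separator.
        return first[: -len(joiner)] + second
    # No joiner: the ordinary separator is kept.
    return first + separator + second


print(join_pages("enwikisource", "queen -", "mother"))
print(join_pages("zhwikisource", "queen -", "mother"))
print(join_pages("zhwikisource", "hyphen__PAGEJOIN__", "ated"))
```

With the joiner set to a pseudo-magic word, a regular hyphen at the page end is left alone and the separator is inserted as before, while editors can still opt in to joining by typing the pseudo-magic word explicitly.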

Ankry added a comment.EditedOct 1 2018, 2:37 PM

@Hesperian I think this change does not break anything concerning old German orthography, as it does not fully replace the currently used method of joining hyphenated words. The only valid reason for disabling this change may be if it is discouraged because it may be a source of bad habits among editors. As I noted, it was not intended to fix all possible hyphenation cases: for some cases the current merging method should still be used.

@Candalua As I noticed already, the cs.ws community likes this change. Also, in pl.ws we think it may be OK (I checked most of the affected pages and it seems that this change breaks only pages that were already broken - so nothing worse here than it was before).
In en.ws I have found (and fixed) one page that was broken: "queen -" / "mother"; earlier rendered as "queen - mother", now as "queen mother". However, this is a large wiki with a lot of cases that may need a review.
In zh.ws this change is useless and I think it should be disabled, IMO. Unsure about other non-Latin, non-Greek and non-Cyrillic scripts.

However, for future changes, I think it would be better to implement them as disabled by default and let wikis decide whether they want them enabled or not. We have too many different scripts in use and editors with various habits, and it is hard to predict all possible cases.

The hyphens at the end of these zhwikisource pages are not real hyphens, but are part of the MediaWiki-Language-converter syntax. We need to identify these usages and not join them incorrectly.

@Midleading: as per discussion above, I have already opened a request to disable this functionality on zh.source: T205826. I don't understand why it has not been done yet.

Yeah, but regardless of that, if this feature can be made aware that the hyphens may be part of the language converter syntax, and not make mistakes because of it, we can use this feature on zhwikisource.

Ankry added a comment.Jan 2 2019, 9:41 AM

Yeah, but regardless of that, if this feature can be made aware that the hyphens may be part of the language converter syntax, and not make mistakes because of it, we can use this feature on zhwikisource.

After introducing T60729 in zhwikisource, this feature should be no-op there, so being active in zh.ws (and ja.ws) seems to me to be pointless.

Do you mean using the hyphen-at-the-page-end merging instead of T60729 ? Or just showing the hyphen in the Page namespace while dropping it in main ns?

Ankry added a comment.Jan 2 2019, 10:52 AM

Have you considered the case of German (pre-1996 orthography reform) where "ck" hyphenates as "kk"; e.g. "backer" becomes "bak-ker"? My personal view is there are too many edge cases like this for hyphenation to be managed automatically. I think the patch should be rolled back.

Is this construction used anywhere in dewikisource content? Can you point out an example? Pointing out that it is really useful, not only a theoretical possibility, may be helpful when deciding whether to implement it or not.

However, as I can see, current implementation ignores the context, so supporting kk/ck hyphenation would need a separate implementation (as well as supporting the -{some_character}- construction used by the language converter in zh.ws).
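For the language converter case, a context-aware check could look roughly like the Python sketch below. It is purely hypothetical and not something ProofreadPage does: the function name `ends_with_real_hyphen` and the regular expression are assumptions, and the check only recognises a -{...}- block that is opened and closed on the same page, without nesting.

```python
import re

# A language converter block such as -{text}- sitting at the very end of a page.
# Assumption: the block is not nested and does not span several pages.
CONVERTER_TAIL = re.compile(r"-\{[^{}]*\}-\Z")


def ends_with_real_hyphen(page_text):
    """Return True if the page ends with "-" that is not the end of a -{...}- block."""
    if not page_text.endswith("-"):
        return False
    return CONVERTER_TAIL.search(page_text) is None


print(ends_with_real_hyphen("hyphen-"))
print(ends_with_real_hyphen("text -{abc}-"))
print(ends_with_real_hyphen("plain text"))
```

A block that is opened on one page and closed on the next would not be recognised by this check, so a real implementation would need more context than the last few characters of a page.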

Xover added a subscriber: Xover.Jan 4 2019, 6:45 PM

Any chance of this change working on cross-page references?

<ref name="note">hyphen-</ref><ref follow="note">ated</ref> currently outputs as hyphen- ated