Parsoid <pages> support does not apply $wgProofreadPagePageJoiner logic
Open, Medium · Public · BUG REPORT

Description

Steps to replicate the issue (include links if applicable):
This issue mainly concerns the Wikisource project

  1. open a page containing a word hyphenated across a page break, in different browsers

e.g. https://pl.wikisource.org/wiki/Dziewczyna_bezimienna/Cz%C4%99%C5%9B%C4%87_druga/XVIII
OR

  1. download a page as epub (ws-export)

e.g. https://ws-export.wmcloud.org/?format=epub&lang=pl&page=Dziewczyna_bezimienna%2FCz%C4%99%C5%9B%C4%87_druga%2FXVIII

What happens?:

Where a word is hyphenated at a page break, the literal hyphen (followed by a space) remains in the text,
e.g. here: niepokoju, doprawdy, że cza- sami żałuję naszego dawnego życia

What should have happened instead?:

When a page ends in a hyphen, the hyphen should be removed and no space should be inserted.
The visible text here should be:

niepokoju, doprawdy, że czasami żałuję naszego dawnego życia

Software version (on Special:Version page; skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.):

  1. e.g. Opera, Windows
  1. everywhere

Event Timeline

Restricted Application added subscribers: jhsoby-WMNO, Aklapper.
Pppery renamed this task from During transclusiona a page breake with a hyphen is not treated correctly to During transclusion a a page breakwith a hyphen is not treated correctly. Dec 6 2025, 4:52 PM
Draco_flavus renamed this task from During transclusion a a page breakwith a hyphen is not treated correctly to During transclusion a page break with a hyphen is not treated correctly. Dec 7 2025, 10:25 AM
aaron triaged this task as Medium priority. Dec 16 2025, 1:49 AM

Maybe there is some difference in parser options or hooks for the REST endpoint that makes the ProofreadPage tag hooks act a bit differently?

OK, so I can reproduce this issue with action=parse by using parser=parsoid, e.g. https://pl.wikisource.org/w/api.php?action=parse&format=json&page=Dziewczyna%20bezimienna%2FCz%C4%99%C5%9B%C4%87%20druga%2FXVIII&formatversion=2&parser=parsoid . The Action API's action=parse defaults to the wikitext parser, while the REST API uses the Parsoid parser.
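
To compare the two parsers side by side, the request above can be built with and without parser=parsoid. The following is only a sketch: the parameter names and values are the ones from the URL in this comment, while the helper name `build_parse_url` is made up for illustration. It only constructs the URLs and does not send any request.

```python
from urllib.parse import urlencode

# Endpoint and page title taken from the URL quoted in the comment above.
API_ENDPOINT = "https://pl.wikisource.org/w/api.php"
PAGE_TITLE = "Dziewczyna bezimienna/Część druga/XVIII"


def build_parse_url(page: str, use_parsoid: bool) -> str:
    """Hypothetical helper: build an action=parse URL for the given page.

    Without the `parser` parameter the Action API uses its default
    (wikitext) parser; with parser=parsoid it renders through Parsoid.
    """
    params = {
        "action": "parse",
        "format": "json",
        "page": page,
        "formatversion": "2",
    }
    if use_parsoid:
        params["parser"] = "parsoid"
    return API_ENDPOINT + "?" + urlencode(params)


legacy_url = build_parse_url(PAGE_TITLE, use_parsoid=False)
parsoid_url = build_parse_url(PAGE_TITLE, use_parsoid=True)

print(legacy_url)
print(parsoid_url)
```

Fetching both URLs and searching the returned HTML for "cza-" should show whether the hyphen survives in one output but not the other.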

Source page: https://pl.wikisource.org/w/index.php?title=Strona:PL_Dziewczyna_bezimienna_by_Ch_Mérouvel_from_Kurjer_Poranny_Y1893_No152_part07.jpg&action=edit

I believe ProofreadPage has a little-known feature where the hyphen is removed when a page ending in a hyphen is concatenated with the next page: https://wikisource.org/wiki/Wikisource:ProofreadPage#The_%3Cpages/%3E_tag
https://www.mediawiki.org/wiki/Extension:Proofread_Page#Join_hyphenated_words_across_pages

PS. There's also a hidden alternate separator in the text, which is https://www.mediawiki.org/wiki/Extension:Proofread_Page#Page_separator
You can see this in the Parsoid output because it has <span typeof="mw:Entity" id="mwiw"> </span> after the cza-, which is correct.
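
To make the expected behaviour concrete, here is a rough sketch of the joining rule as described in the two documentation sections linked above. It is not the extension's actual code (ProofreadPage is PHP and, judging by the patch title, works with placeholders). The function name `join_pages` is made up, and the defaults used here (a single space as page separator, "-" as joiner) are assumptions standing in for $wgProofreadPagePageSeparator and $wgProofreadPagePageJoiner.

```python
def join_pages(pages, separator=" ", joiner="-"):
    """Illustrative only: concatenate transcluded page texts.

    Assumed rule: if a page (other than the last) ends with the joiner,
    drop the joiner and do NOT insert the separator; otherwise insert
    the separator between the two pages.
    """
    result = ""
    for index, text in enumerate(pages):
        is_last = index == len(pages) - 1
        if is_last:
            result += text
        elif joiner and text.endswith(joiner):
            # Word hyphenated across the page break: glue the halves.
            result += text[: -len(joiner)]
        else:
            result += text + separator
    return result


pages = ["niepokoju, doprawdy, że cza-", "sami żałuję naszego dawnego życia"]

# Expected behaviour (joiner logic applied).
expected = join_pages(pages)

# What this task reports: the separator is inserted but the hyphen stays.
reported = " ".join(pages)
```

With the example from the task description, `expected` contains "czasami" while `reported` contains "cza- sami", which matches the difference between the two renderings.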

TheDJ renamed this task from During transclusion a page break with a hyphen is not treated correctly to Parsoid <pages> support does not apply $wgProofreadPagePageJoiner logic. Dec 16 2025, 7:33 PM
TheDJ edited projects, added Parsoid; removed MediaWiki-REST-API.

Maybe they are not well known beyond the Wikisource community. However for us these features are crucial. We use them almost everywhere.

Yeah, but it's not a standard wikitext feature. It's part of a specific parser tag, and it's easy to overlook if you are not deeply familiar with Proofread Page.

Change #1224259 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/extensions/ProofreadPage@master] Fix placeholder replacement in Parsoid renders

https://gerrit.wikimedia.org/r/1224259

ABreault-WMF subscribed.

Seems to have been overlooked in T278481. There's test coverage for the feature but it was falsely passing.

Change #1224259 merged by jenkins-bot:

[mediawiki/extensions/ProofreadPage@master] Fix placeholder replacement in Parsoid renders

https://gerrit.wikimedia.org/r/1224259

Confirmed fixed, in both the Parsoid view and the EPUB export.

Thanks. Unfortunately, the Parsoid native rendering was disabled in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1225613

The patch in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ProofreadPage/+/1224259 should have fixed the issue, but we won't be able to confirm until it's re-enabled.