
Wikisource: Make footnote links manually added to be internal links
Open, Needs Triage | Public | 5 Estimated Story Points

Description

This is a continuation of T270372 and a bug fix. As noted by @Samwilson: All <ref> reference links should be fixed, but some works use manual links to create footnotes, and these have not been fixed yet. We don't have a way to identify these, because they're effectively just normal wiki links (as you've pasted above). They used to be identifiable by the fact that they were same-document links, e.g. href="#foo", but Parsoid adds the page name to these, e.g. href="Lorem#foo". We'll add handling for these soon (perhaps in a separate ticket? I'm not sure).

As noted by @Denis_Gagne52: As mentioned earlier, any link built this way, [[Les pères du système taoïste/Tao-Tei-King#CHAP1]], will work, but not one with only #CHAP1 inside the brackets. There are many links built that way. I think the conversion was taken care of in this function of BookCleanerEpub.php, but the first part does not trap #mylink any more:

	/**
	 * change the internal links
	 */
	protected function setLinks( DOMDocument $dom ) {
		$list = $dom->getElementsByTagName( 'a' );
		/** @var DOMElement $node */
		foreach ( $list as $node ) {
			$href = $node->getAttribute( 'href' );
			$title = Util::encodeString( $node->getAttribute( 'title' ) ) . '.xhtml';
			if ( substr( $href, 0, 1 ) === '#' ) {

Examples:

Event Timeline

ifried renamed this task from Wikisource: Fix external footnote links manually added to Wikisource: Make footnote links manually added to be internal links. Feb 24 2021, 3:14 PM
ifried updated the task description.
ifried added a subscriber: Denis_Gagne52.

Yes, you are correct. Thanks for adding the tag, @Aklapper.

ARamirez_WMF set the point value for this task to 5. Feb 25 2021, 6:57 PM
ARamirez_WMF moved this task from Needs Discussion to Up Next (June 3-21) on the Community-Tech board.

This behaviour has changed since the switch to Parsoid HTML because links within the same document are now prefixed with the document name, which they previously were not. This makes it more complicated to determine which links are same-document ones. They used to be e.g. href="#foo", but Parsoid adds the page name, e.g. href="Lorem#foo". Is this intentional by Parsoid? It means that the document can't be used under any other name (e.g. downloading it to Lorem.html breaks the links).
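To make the distinction above concrete, here is a minimal sketch of how a same-document link might be recognised in both forms (href="#foo" and href="Lorem#foo"). The function name sameDocumentFragment is hypothetical and is not part of BookCleanerEpub.php; the handling of a leading "./" and of percent-encoding and spaces in the page name are assumptions about what the exported HTML may contain, not behaviour confirmed in this task.

```php
<?php
/**
 * Hypothetical helper, for illustration only.
 *
 * Returns the fragment (without the leading '#') when $href points into the
 * page named $pageName, or null when it points elsewhere or has no fragment.
 */
function sameDocumentFragment( string $href, string $pageName ): ?string {
	$pos = strpos( $href, '#' );
	if ( $pos === false ) {
		// No fragment at all, so this is not an anchor link.
		return null;
	}

	$target = substr( $href, 0, $pos );
	$fragment = substr( $href, $pos + 1 );

	if ( $target === '' ) {
		// Old-style same-document link: href="#foo".
		return $fragment;
	}

	// Assumption: the href may carry a leading "./" before the page name.
	if ( strpos( $target, './' ) === 0 ) {
		$target = substr( $target, 2 );
	}

	// Assumption: compare names after percent-decoding, with spaces
	// written as underscores.
	$normalize = static function ( string $s ): string {
		return str_replace( ' ', '_', rawurldecode( $s ) );
	};

	return $normalize( $target ) === $normalize( $pageName ) ? $fragment : null;
}
```

With a check like this, a prefixed link such as href="Lorem#foo" on the page Lorem could be treated the same way as href="#foo", while links to other pages fall through to the existing handling.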

If the page name is added to the link, I think the first elseif would catch it and build a new link as expected. Could you investigate whether the problem is with the node? I don't see any title attribute near href="#foo", and one is needed:

$title = Util::encodeString( $node->getAttribute( 'title' ) ) . '.xhtml';

Assuming the link is prefixed with the document name, why can't we link to that document, whether it is the same or not?

Something like this may work:

	/**
	 * change the internal links
	 */
	protected function setLinks( DOMDocument $dom ) {
		$list = $dom->getElementsByTagName( 'a' );
		/** @var DOMElement $node */
		foreach ( $list as $node ) {
			$href = $node->getAttribute( 'href' );
			$title = Util::encodeString( $node->getAttribute( 'title' ) ) . '.xhtml';
			$pos = strpos( $href, '#' );
			// '.xhtml' alone is 6 characters, so a length below 7 means the title attribute was empty
			if ( strlen( $title ) < 7 && $pos !== false ) {
				// fall back to the part of the href before the '#'
				$title = Util::encodeString( substr( $href, 0, $pos ) ) . '.xhtml';
			}
			if ( substr( $href, 0, 1 ) === '#' ) {
				continue;   //there was no array_search to get the transformed id ??
			} elseif ( in_array( $title, $this->linksList ) ) {
				//$pos = strpos( $href, '#' );
				if ( $pos !== false ) {
					$anchor = substr( $href, $pos + 1 );
					$title .= '#' . array_search( $anchor, PageParser::getIds() );
				}

Arlolra subscribed.

Is this intentional by Parsoid?

I would say yes. There are some hints at it in T69486, but links in Parsoid are rendered relative to the <base href="//en.wikipedia.org/wiki/"/> element in the head of the document.

CommTech is not planning on working on this.