Ws export: failed to get subpages for work with colon in the name
Closed, Resolved · Public · BUG REPORT

Description

Steps to Reproduce:

Actual Results:

  • There is no sub-page content

Expected Results:

  • Subpages included

I'm not sure if this is because there is a colon in the name.

EPUB that I received:

Event Timeline

Restricted Application added a subscriber: Aklapper.

@Inductiveload When I export the book, it now includes the subpages.

I see you added ws-summary to the TOC, which might have fixed the issue (see T253282). I see you raised this bug after you had done that, but I think the old (incorrect) export would still have been cached.

Could you try again?

@dom_walden it is indeed working now.

I had used the "bypass caching" link (https://ws-export.wmcloud.org/?lang=en&page=The+Golden+Bowl+%28New+York%3A+Charles+Scribner%27s+Sons%2C+1909%29&format=epub-3&fonts=&nocache=1), but the problem still persisted. Now that I have made the same change to v2, the same thing is happening there.

How does nocache=1 work?

Although ws-summary fixes this problem, it shouldn't be required when the ToC is on the top level page (i.e. the one that's exported, in this case The Golden Bowl (New York: Charles Scribner's Sons, 1909)/Volume 1).

In PageParser, the following is supposed to find subpage links if the ws-summary fails:

	$chapters = $this->getChaptersList( $pageList, $namespaces );
	if ( empty( $chapters ) ) {
		$list = $this->xPath->query( '//a[contains(@href,"' . Util::wfUrlencode( $title ) . '") and
			not(
				contains(@class,"new") or
				contains(@class,"extiw") or
				@rel = "mw:WikiLink/Interwiki" or
				contains(@class,"external") or
				contains(@href,"#") or
				contains(@class,"internal") or
				contains(@href,"action=edit") or
				contains(@title,"/Texte entier") or
				contains(@class,"image")
			)]' );
		/*…*/
	}

but this isn't matching the links, which are like:

<a rel="mw:WikiLink" href="./The_Golden_Bowl_(New_York:_Charles_Scribner's_Sons,_1909)/Volume_1/Book_1/Chapter_5" title="The Golden Bowl (New York: Charles Scribner's Sons, 1909)/Volume 1/Book 1/Chapter 5" id="mwPA">Chapter 5</a>

I think it's because wfUrlencode encodes apostrophes, but Parsoid doesn't.
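A minimal sketch of the kind of mismatch described here, for illustration only. It uses PHP's built-in rawurlencode() as a stand-in, because the exact behaviour of Util::wfUrlencode is not shown in this thread, and it assumes spaces in the title are replaced with underscores before encoding. The href is copied from the link quoted above.

```php
<?php
// Title of the exported work, as given in this task.
$title = "The Golden Bowl (New York: Charles Scribner's Sons, 1909)";

// href of a subpage link, copied from the Parsoid HTML quoted above.
$href = "./The_Golden_Bowl_(New_York:_Charles_Scribner's_Sons,_1909)/Volume_1/Book_1/Chapter_5";

// Assumption: spaces become underscores before the title is encoded.
$dbKey = str_replace(' ', '_', $title);

// Stand-in for Util::wfUrlencode (an assumption, not the real helper):
// rawurlencode() percent-encodes "(", ")", ":", "'" and ",".
$encoded = rawurlencode($dbKey);

// The percent-encoded title is not a substring of the href...
$matchesEncoded = strpos($href, $encoded) !== false;

// ...whereas the unencoded title is.
$matchesPlain = strpos($href, $dbKey) !== false;

var_dump($encoded, $matchesEncoded, $matchesPlain);
```

With this stand-in encoder the contains() check has nothing to match, because the href keeps the punctuation literally.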

@Samwilson as a complication, this was actually encountered when trying to export the top-level page (not just Volume 1).

In that case, the addition of ws-summary *is* required on the subpages (T253282 will tackle fixing that).

And when I said apostrophes above, I of course meant colons.

[edit] pl ws users let me know that the problem also occurs for many pages without a comma in the name, but all of them have diacritics in the page name [/edit]

@Samwilson it looks like a comma or a diacritic in the page name also causes subpages not to be included in the EPUB.

Is there a chance to resolve this problem? It seems that we (pl ws) have hundreds of thousands of ''erroneous export'' pages, e.g.:
https://pl.wikisource.org/wiki/Znachor_(Do%C5%82%C4%99ga-Mostowicz,_1938)
https://pl.wikisource.org/wiki/Z%C5%82ote_jab%C5%82ko_(Kraszewski,_1873)
https://pl.wikisource.org/wiki/Nad_przepa%C5%9Bci%C4%85_(Kraszewski,_1887)_ws
https://pl.wikisource.org/wiki/Nad_przepa%C5%9Bci%C4%85_Kraszewski,_1887_ws
https://pl.wikisource.org/wiki/Pan_s%C4%99dzia
https://pl.wikisource.org/wiki/W_sieci_paj%C4%99czej
...

We have checked - this problem did not exist before the "Ebook Export Improvement" project.
I get more and more "complaints" about such problems from pl ws users.

Sounds like a more general URL encoding issue. I'll have a look now.

Yes, it looks like the problem is affecting all "non-English" wikis, e.g. T275967. The new version of the exporter has stopped including subpages if the page title contains non-basic-Latin letters (and the page does not contain the <summary> tag). Many wikis may not be aware of this problem.
I have checked the examples given in T275967, and others on bn and hz ws, using the older version of ws-exporter that we have on our local server, and I got the full books exported with subpages (subchapters). The problem is therefore global and serious.

Please, if possible, give it a high priority.

@Tpt, maybe it would be possible to temporarily restore the previous subpage search module?

My current idea is that this is due to us trying to find subpages with the XPath a[contains(@href,"' . Util::wfUrlencode( $title ) . '"), which doesn't work because Parsoid's hrefs are not fully URL-encoded like that. It would also find other links containing the same title (although I can't find an example of that in the wild). Anyway, how about we switch to using the title attribute instead, and check for whether it begins with [title]/? I've had a go in this PR: https://github.com/wikimedia/ws-export/pull/360 (ready for review).
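A rough sketch of the title-attribute idea, under stated assumptions: it is not the code from the PR, just an illustration using PHP's DOMDocument and DOMXPath with the XPath 1.0 starts-with() function. The first link is the one quoted earlier in this task; the second (an author link) is hypothetical and only there to show a non-subpage being skipped.

```php
<?php
$title = "The Golden Bowl (New York: Charles Scribner's Sons, 1909)";

// Sample HTML: one real subpage link from this task, one hypothetical non-subpage link.
$html = <<<HTML
<div>
<a rel="mw:WikiLink" href="./The_Golden_Bowl_(New_York:_Charles_Scribner's_Sons,_1909)/Volume_1/Book_1/Chapter_5" title="The Golden Bowl (New York: Charles Scribner's Sons, 1909)/Volume 1/Book 1/Chapter 5">Chapter 5</a>
<a rel="mw:WikiLink" href="./Author:Henry_James" title="Author:Henry James">Henry James</a>
</div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Match links whose title attribute begins with "<title>/".
// Limitation of this sketch: the XPath string literal is double-quoted,
// so a title that itself contains a double quote would break the query.
$query = '//a[starts-with(@title, "' . $title . '/")]';

$subpageTitles = [];
foreach ($xpath->query($query) as $link) {
    $subpageTitles[] = $link->getAttribute('title');
}

print_r($subpageTitles);
```

Because the title attribute holds the plain page title, no URL encoding is involved in the comparison.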

temporarily restore the previous subpage search module?

This wouldn't work because it's not the searching that's broken, but the source HTML that's changed.

WS Export 2.6.0 is released now with this fix. @Zdzislaw could you please check and see if things are improved?

... Anyway, how about we switch to using the title attribute instead, and check for whether it begins with [title]/?

I think we might be missing subpages if they are like [title]: foo or [title] (bar).

For example, https://nl.wikisource.org/wiki/Wetboek_van_Strafrecht_Suriname, https://es.wikisource.org/wiki/El_Tratado_de_la_Pintura

Also, the first three links here: https://nl.wikisource.org/wiki/Alleen_op_de_wereld
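To make the concern concrete, here is a small sketch of a "begins with [title]/" check. The function name and the sample titles are hypothetical and are not taken from WS Export or from the wikis linked above.

```php
<?php
// Hypothetical helper: true only when $candidate starts with "$parent/".
function isSlashSubpage(string $parent, string $candidate): bool
{
    $prefix = $parent . '/';
    return strncmp($candidate, $prefix, strlen($prefix)) === 0;
}

// Hypothetical titles illustrating the three naming patterns discussed.
$parent = 'Some Work';
$results = [
    'Some Work/Chapter 1'   => isSlashSubpage($parent, 'Some Work/Chapter 1'),
    'Some Work: Chapter 1'  => isSlashSubpage($parent, 'Some Work: Chapter 1'),
    'Some Work (Chapter 1)' => isSlashSubpage($parent, 'Some Work (Chapter 1)'),
];

var_dump($results);
```

Only the slash-separated title passes, so pages named like [title]: foo or [title] (bar) would need some other signal to be picked up.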

@Samwilson It looks like the problem has been resolved. Thank you very much for your quick reaction and implementation of the solution (many people involved in promoting pl ws breathed a sigh of relief). We appreciate it.
I also checked the cases reported here - looks like zh is ok as well.
But the problem reported in T275967 seems to be due to a parsing problem with the Auxiliary_Table_of_Contents template rather than a wsexport problem (see: rest_v1/page/html/শকুন্তলা_(সিগনেট_প্রেস_সংস্করণ) and api/rest_v1/page/html/টেমপ্লেট:Auxiliary_Table_of_Contents).

@dom_walden "subpages" like [title]: foo or [title] (bar) always required the use of the "summary" class if they were to be identified as subchapters.

Thanks for checking @Zdzislaw, that is good to hear.

I have exported a number of ebooks in different languages (including plwikisource and zhwikisource) on ws-export.wmcloud.org and compared them to a local version of WS Export before the Parsoid changes (T264788). I compared the number of documents included in the epub, hoping to find any differences in subpages included.

Test Environment: ws-export.wmcloud.org version 2.6.2.

ifried added a subscriber: ifried.

Thanks to everyone who reported this issue or helped fix it!

As this issue has been resolved, I'm marking it as Done.