
MediawikiApi::newFromPage() fails on non-XML HTML
Closed, Resolved, Public

Description

When fetching with $api = MediawikiApi::newFromPage('https://sq.wikinews.org/'); I get the following error:

"String could not be parsed as XML" at /vendor/addwiki/mediawiki-api-base/src/MediawikiApi.php line 80
Warning: SimpleXMLElement::__construct(): Entity: line 419: parser error: Entity 'reg' not defined at vendor/addwiki/mediawiki-api-base/src/MediawikiApi.php:80

I guess newFromPage() should be changed so that it parses the page as HTML rather than XML, which means not using SimpleXML for that step.
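
To illustrate the failure mode described above, here is a minimal sketch, not the library's actual code. The HTML string is a made-up stand-in for a wiki page; it assumes only that the page contains an HTML entity such as &reg;, which XML does not define.

```php
<?php
// Hypothetical page content: valid HTML, but not well-formed XML,
// because &reg; is an HTML entity that XML does not define.
$html = '<html><head><title>Example &reg;</title></head>'
    . '<body><p>Hello</p></body></html>';

// Collect libxml errors internally so the demo prints no warnings.
$previous = libxml_use_internal_errors(true);

// SimpleXML parses the input as XML and throws on the undefined entity.
$simpleXmlFailed = false;
try {
    new SimpleXMLElement($html);
} catch (Exception $e) {
    $simpleXmlFailed = true;
}
libxml_clear_errors();

// DOMDocument::loadHTML() uses libxml's HTML parser, which knows HTML entities.
$doc = new DOMDocument();
$domLoaded = $doc->loadHTML($html);

libxml_clear_errors();
libxml_use_internal_errors($previous); // restore the previous setting
```

Under these assumptions the SimpleXML constructor throws, while loadHTML() accepts the same input.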

Event Timeline

I noticed this too, and it affects T162754. Currently you can't select a number of projects, including es.wikipedia.org.

Samwilson triaged this task as Medium priority. Apr 24 2017, 5:50 AM
Samwilson moved this task from Incoming to In Progress on the Addwiki board.
Samwilson edited projects, added Community-Tech-Sprint; removed Community-Tech.

Here's a fix: https://github.com/addwiki/mediawiki-api-base/pull/34

The parsing of the RSD XML can still be done with SimpleXml, but perhaps this should be changed too, to be more uniform. What do you think? It's just a few extra lines of code, is all.
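
For comparison, a rough sketch of what reading the RSD document with DOMDocument instead of SimpleXML might look like. The RSD string below is a hand-written example modelled on the RSD format (an api element carrying an apiLink attribute); the example.org URL is an assumption for illustration and this is not the code in the PR.

```php
<?php
// Hypothetical RSD document, modelled on what a MediaWiki site is
// expected to serve; the URL is made up for this example.
$rsd = '<?xml version="1.0"?>'
    . '<rsd version="1.0" xmlns="http://archipelago.phrasewise.com/rsd">'
    . '<service><apis>'
    . '<api name="MediaWiki" preferred="true"'
    . ' apiLink="https://example.org/w/api.php" blogID="" />'
    . '</apis></service></rsd>';

// RSD is real XML, so it is loaded with loadXML(), not loadHTML().
$doc = new DOMDocument();
$doc->loadXML($rsd);

// Pick the apiLink attribute of the API entry named "MediaWiki".
$apiLink = null;
foreach ($doc->getElementsByTagName('api') as $api) {
    if ($api->getAttribute('name') === 'MediaWiki') {
        $apiLink = $api->getAttribute('apiLink');
        break;
    }
}
```

This keeps both steps on the DOM extension, at the cost of a few more lines than the SimpleXML version.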

kaldari claimed this task.

Now we just need to wait for Addshore to cut a new version.


2.3.0 now released :)

Maybe I'm doing something wrong, but on the latest mediawiki-api and mediawiki-api-core, with MediawikiApi::newFromPage('https://sq.wikinews.org/'); I now get:

Warning: DOMDocument::loadHTML(): ID n-Dhuroni already defined in Entity, line: 370

The backtrace points to https://github.com/addwiki/mediawiki-api-base/blob/master/src/MediawikiApi.php#L87

Anyone else have this issue?

Duplicate element IDs make DOMDocument::loadHTML() emit warnings (in the case above, https://sq.wikinews.org/wiki/MediaWiki:Sidebar contains sitesupport-url|Dhuroni twice). This can be fixed by telling libxml to collect errors internally instead of raising them: this particular error doesn't stop the XPath query from finding the RSD URL, so it can safely be ignored.
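
A minimal sketch of that workaround, assuming a page that advertises its RSD URL through a link element with rel="EditURI". The HTML, the example.org URL, and the XPath expression are illustrative assumptions, not necessarily what the PR uses.

```php
<?php
// Hypothetical page: the duplicated id mirrors the sidebar problem above.
$html = '<html><head>'
    . '<link rel="EditURI" type="application/rsd+xml"'
    . ' href="https://example.org/w/api.php?action=rsd" />'
    . '</head><body><ul>'
    . '<li id="n-Dhuroni">first</li><li id="n-Dhuroni">second</li>'
    . '</ul></body></html>';

// Collect libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// The collected errors could be logged here; this sketch just discards them.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous); // restore the caller's setting

// The document is still usable, so the RSD link can be located with XPath.
$xpath = new DOMXPath($doc);
$hrefs = $xpath->query('//link[@rel="EditURI"]/@href');
$rsdUrl = $hrefs->length > 0 ? $hrefs->item(0)->nodeValue : null;
```

Restoring the previous libxml_use_internal_errors() value keeps the change local, so calling code that relies on warnings is not affected.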

I'll send another PR.

PR merged. Original issue is moot since API URL has been moved to configs.