Allow HTML parsing with DOMDocument
Closed, Resolved, Public

Description

Add html mode to {{#get_web_data:}}.

If used with | use xpath, address DOM nodes with XPath; otherwise, address them with CSS/jQuery-style selectors. Extend the syntax of both with an optional .attr() tail: e.g., h2 a.attr('href') or h2 a.attr (href) will both effectively give //h2//a/@href.
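A minimal sketch of how the optional .attr() tail could be split off before the remaining selector is evaluated. It is written in Python only for illustration and is not the extension's PHP code; the function names split_attr_tail and xpath_with_attr are hypothetical, and the CSS-to-XPath conversion itself (the job of symfony/css-selector) is not shown.

```python
import re

# Matches an optional ".attr(...)" tail, with or without quotes around the
# attribute name and with optional whitespace before the parenthesis.
_ATTR_TAIL = re.compile(
    r"""^(?P<selector>.*?)\.attr\s*\(\s*(?P<quote>['"]?)(?P<name>[^'")]+)(?P=quote)\s*\)\s*$"""
)


def split_attr_tail(expr):
    """Return (selector, attribute); attribute is None when there is no tail."""
    expr = expr.strip()
    match = _ATTR_TAIL.match(expr)
    if match is None:
        return expr, None
    return match.group("selector").strip(), match.group("name").strip()


def xpath_with_attr(base_xpath, attr):
    """Append an attribute step to an XPath that already selects the elements."""
    return base_xpath if attr is None else f"{base_xpath}/@{attr}"


if __name__ == "__main__":
    for expr in ("h2 a.attr('href')", "h2 a.attr (href)", "h2 a"):
        print(split_attr_tail(expr))
    # Assumes the selector "h2 a" has already been converted to "//h2//a".
    print(xpath_with_attr("//h2//a", "href"))
```

Splitting the tail first keeps the selector part valid for whichever engine handles it, so the attribute step can be added in the same way in both XPath and CSS modes.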

In both cases, parse HTML with the standard DOMDocument and DOMXPath classes.

CSS mode requires a new dependency: symfony/css-selector (already required by Semantic MediaWiki).

Below are two links to a site where the proposed features have been implemented:

Event Timeline

That's interesting. What's a use case for parsing HTML with External Data?

Page scraping, when there's no API. Regular expressions cannot always parse HTML.

See two examples at the end of the task description: in both, a list of articles written by a certain author is extracted from a publishing site.

Okay, that makes sense. And this is very cool - I like it. (Though of course, web scraping should only be done if there are no other options!) I'm looking forward to the patch.