
Allow HTML parsing with DOMDocument
Closed, Resolved · Public

Description

Add html mode to {{#get_web_data:}}.

If used with | use xpath, address DOM nodes with XPath; otherwise, use CSS/jQuery-style selectors. Extend the selector syntax with an optional .attr() tail that selects an attribute: e.g., h2 a.attr('href') and h2 a.attr (href) both effectively yield //h2//a/@href.
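The .attr() tail described above can be sketched as follows. This is only an illustration in Python, not the extension's PHP code: the names split_attr_tail and naive_xpath are hypothetical, and the XPath conversion here handles nothing beyond bare tag names separated by spaces (the task delegates real CSS conversion to symfony/css-selector).

```python
import re

# Matches an optional trailing ".attr('name')", ".attr("name")" or ".attr (name)".
_ATTR_TAIL = re.compile(r"""\.attr\s*\(\s*(?P<q>['"]?)(?P<name>[\w:-]+)(?P=q)\s*\)\s*$""")


def split_attr_tail(selector):
    """Split a selector into (css_part, attribute_name or None)."""
    match = _ATTR_TAIL.search(selector)
    if match is None:
        return selector.strip(), None
    return selector[:match.start()].strip(), match.group("name")


def naive_xpath(selector):
    """Toy conversion: only descendant combinators between bare tag names."""
    css, attr = split_attr_tail(selector)
    path = "".join("//" + part for part in css.split())
    return path + ("/@" + attr if attr else "")


if __name__ == "__main__":
    for sel in ("h2 a.attr('href')", "h2 a.attr (href)", "h2 a"):
        print(sel, "->", naive_xpath(sel))
```

Both spellings of the tail give the same attribute name, so the quoted and unquoted forms mentioned in the description end up as the same XPath.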

In both cases, parse HTML with the standard PHP classes DOMDocument and DOMXPath.
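As a rough illustration of evaluating such a path against a parsed document, here is a sketch that uses Python's xml.etree.ElementTree as a stand-in for DOMDocument and DOMXPath. It rests on several assumptions: the sample markup is invented and well-formed (ElementTree, unlike DOMDocument, does not tolerate broken HTML), select_attr is a hypothetical helper, and because ElementTree supports only a subset of XPath without attribute nodes, the attribute is read from each matched element instead of via /@href.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample page; real-world HTML usually needs a lenient parser.
SAMPLE = """<html><body>
<h2><a href="/articles/1">First</a></h2>
<h2><span><a href="/articles/2">Second</a></span></h2>
<p><a href="/about">About</a></p>
</body></html>"""


def select_attr(markup, path, attr):
    """Return the given attribute of every element matching the path."""
    root = ET.fromstring(markup)
    return [el.get(attr) for el in root.findall(path) if el.get(attr) is not None]


if __name__ == "__main__":
    # Rough equivalent of //h2//a/@href: links inside headings only.
    print(select_attr(SAMPLE, ".//h2//a", "href"))
```

The link inside the paragraph is skipped because it has no h2 ancestor, which is the behaviour expected from //h2//a/@href.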

CSS mode requires a new dependency: symfony/css-selector (already required by Semantic MediaWiki).

Below are two links to a site where the proposed features have been implemented:

Event Timeline

Restricted Application added subscribers: Liuxinyu970226, Aklapper. Jun 9 2020, 2:12 PM
alex-mashin updated the task description. Jun 9 2020, 2:13 PM
alex-mashin updated the task description.
alex-mashin updated the task description. Jun 9 2020, 2:15 PM
alex-mashin updated the task description. Jun 9 2020, 2:30 PM

That's interesting. What's a use case for parsing HTML with External Data?

alex-mashin added a comment. Edited · Jun 9 2020, 2:56 PM

"That's interesting. What's a use case for parsing HTML with External Data?"

Page scraping, when there's no API. Regular expressions cannot always parse HTML.

See two examples at the end of the task description: in both, a list of articles written by a certain author is extracted from a publishing site.

Okay, that makes sense. And this is very cool - I like it. (Though of course, web scraping should only be done if there are no other options!) I'm looking forward to the patch.

Change 604057 had a related patch set uploaded (by Alex Mashin; owner: mashin):
[mediawiki/extensions/ExternalData@master] Add html mode to {{#get_web_data:}}

https://gerrit.wikimedia.org/r/604057

Change 604057 merged by jenkins-bot:
[mediawiki/extensions/ExternalData@master] Add html mode to {{#get_web_data:}}

https://gerrit.wikimedia.org/r/604057

alex-mashin closed this task as Resolved. Jun 11 2020, 10:48 PM