Allow HTML parsing with DOMDocument
Closed, Resolved · Public

Assigned To

Authored By

	alex-mashin
	Jun 9 2020, 2:12 PM

Description

Add html mode to {{#get_web_data:}}.

With | use xpath, DOM nodes are addressed with XPath; otherwise, with CSS/jQuery-style selectors. The selector syntax is extended with an optional .attr() tail: e.g., h2 a.attr('href') or h2 a.attr (href) will both effectively give //h2//a/@href.
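As an illustration of the .attr() tail described above, here is a minimal sketch of how such a tail might be split off a selector and turned into an attribute step. The helper names splitAttrTail and appendAttr are hypothetical and are not taken from the patch; the CSS-to-XPath conversion of the remaining selector is not shown and is assumed to be done by a separate converter.

```php
<?php
// Hypothetical helper: split an optional .attr() tail off a CSS-style selector.
// Returns [selector without tail, attribute name or null].
function splitAttrTail(string $selector): array {
    // Accepts .attr('href'), .attr("href") and .attr (href) at the end of the selector.
    $pattern = '/^(.*?)\.attr\s*\(\s*[\'"]?([^\'")\s]+)[\'"]?\s*\)\s*$/s';
    if (preg_match($pattern, $selector, $m)) {
        return [trim($m[1]), $m[2]];
    }
    return [trim($selector), null];
}

// Hypothetical helper: append an attribute step to an element XPath.
function appendAttr(string $elementXPath, ?string $attr): string {
    return $attr === null ? $elementXPath : $elementXPath . '/@' . $attr;
}

// Both spellings from the description yield the same parts.
[$css, $attr] = splitAttrTail("h2 a.attr('href')"); // ['h2 a', 'href']
// Assuming the converter turned 'h2 a' into '//h2//a':
$xpath = appendAttr('//h2//a', $attr);              // '//h2//a/@href'
```

Selectors without a tail pass through unchanged, so the same code path could serve both plain element queries and attribute queries.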

In both cases, parse HTML with the standard DOMDocument and DOMXPath classes.
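For readers unfamiliar with these classes, the following is a rough sketch of parsing HTML with DOMDocument and querying it with DOMXPath. The function name extractByXPath and the sample markup are made up for illustration; this is not the code from the patch.

```php
<?php
// Illustrative markup only; a real call would fetch the page from a URL.
$html = '<html><body>'
    . '<h2><a href="/a1">First</a></h2>'
    . '<h2><a href="/a2">Second</a></h2>'
    . '<p><a href="/x">Other</a></p>'
    . '</body></html>';

// Hypothetical helper: return the text values of all nodes matching an XPath.
function extractByXPath(string $html, string $xpath): array {
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect libxml warnings instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $nodes = (new DOMXPath($doc))->query($xpath);
    if ($nodes === false) {
        return []; // invalid XPath expression
    }
    $values = [];
    foreach ($nodes as $node) {
        $values[] = trim($node->nodeValue);
    }
    return $values;
}

$hrefs = extractByXPath($html, '//h2//a/@href'); // ['/a1', '/a2']
```

Attribute nodes and element nodes are handled the same way here, since nodeValue gives the attribute value for the former and the text content for the latter.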

CSS mode requires a new dependency: symfony/css-selector (already required by Semantic MediaWiki).

Below are two links to a site where the proposed features have been implemented:

Details

	Subject	Repo	Branch	Lines +/-
	Add html mode to {{#get_web_data:}}	mediawiki/extensions/ExternalData	master	+188 -177


Event Timeline

alex-mashin created this task. Jun 9 2020, 2:12 PM

Restricted Application added subscribers: Liuxinyu970226, Aklapper. Jun 9 2020, 2:12 PM

alex-mashin updated the task description. Jun 9 2020, 2:13 PM

alex-mashin updated the task description.

alex-mashin updated the task description. Jun 9 2020, 2:15 PM

alex-mashin updated the task description. Jun 9 2020, 2:30 PM

That's interesting. What's a use case for parsing HTML with External Data?

In T254887#6206246, @Yaron_Koren wrote:

That's interesting. What's a use case for parsing HTML with External Data?

Page scraping, when there's no API. Regular expressions cannot always parse HTML.

See two examples at the end of the task description: in both, a list of articles written by a certain author is extracted from a publishing site.

Okay, that makes sense. And this is very cool - I like it. (Though of course, web scraping should only be done if there are no other options!) I'm looking forward to the patch.

https://gerrit.wikimedia.org/r/604057

gerritbot added a project: Patch-For-Review. Jun 9 2020, 3:51 PM

https://gerrit.wikimedia.org/r/604057

alex-mashin closed this task as Resolved. Jun 11 2020, 10:48 PM

Maintenance_bot removed a project: Patch-For-Review. Jun 11 2020, 11:10 PM
