Page MenuHomePhabricator

Expose basic article identifiers (title, variant, revision...) to HTML scrapers
Open, Needs TriagePublic

Description

There doesn't seem any easy way for a tool that takes an URL and processes the wiki article at that URL (spider, browser extension etc.) to identify the article sufficiently to interact with the API - the title and variant is contained in the URL but it might be in the path or the query, the path format might depend on wiki configuration, the URL might be in non-canonical encoding etc. MediaWiki scripts rely on page variables instead, but those use mw.config so they are not in a parsable format. The most important variables identifying the content (title, variant, revision, maybe page ID) should be embedded in the HTML in a machine-readable format.

Event Timeline

Tgr created this task.Apr 25 2018, 10:33 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptApr 25 2018, 10:33 AM
Tgr added a comment.Apr 25 2018, 10:35 AM

(The use case where this came up is a browser extension for sending the current article to action=readinglists&command=createentry.)

MediaWiki scripts rely on page variables instead, but those use mw.config so they are not in a parsable format.

Why not? /"wgPageName"\s*:\s*"([^"]+)"/, /"wgRelevantArticleId"\s*:\s*(\d+)/, or something similar depending on what you actually want, will extract the relevant data from the HTML.

Tgr added a comment.Apr 26 2018, 9:22 AM

The advice about parsing HTML with regex probably applies here.

We could just add the variables as a bunch of meta keywords, or something like JSON-LD, to get well-defined, machine-readable syntax.

Vvjjkkii renamed this task from Expose basic article identifiers (title, variant, revision...) to HTML scrapers to 79daaaaaaa.Jul 1 2018, 1:14 AM
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed a subscriber: Aklapper.
Community_Tech_bot renamed this task from 79daaaaaaa to Expose basic article identifiers (title, variant, revision...) to HTML scrapers.
Community_Tech_bot added a subscriber: Aklapper.
CommunityTechBot raised the priority of this task from High to Needs Triage.Jul 3 2018, 1:40 AM