References stored in page props are not parsed
Closed, Declined (Public)

Description

References are often created via templates e.g.

{{cite web|last1=Hicklin|first1=Aaron|title=The Gospel According to Benedict|url=http://www.out.com/entertainment/movies/2014/10/14/sherlock-star-benedict-cumberbatch-poised-make-alan-turing-his-own-imitation-game|website=Out Magazine|date=14 October 2014|accessdate=24 April 2015}}

Currently, the references returned via Cite::getStoredReferences are the raw wikitext of the references.
This reduces the usefulness of serving them via the API for rendering purposes.

Expected:
These should also return the parsed HTML when an html option is passed.
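
For illustration, a hedged sketch of the two shapes in PHP (the key and field names here are assumptions for this sketch, not the actual Cite storage schema):

$current = [
    'hicklin2014' => [
        'text' => '{{cite web|last1=Hicklin|first1=Aaron|title=The Gospel According to Benedict|...}}',
    ],
];

$expected = [
    'hicklin2014' => [
        'text' => '{{cite web|last1=Hicklin|first1=Aaron|title=The Gospel According to Benedict|...}}',
        // Only present when the html option is passed.
        'html' => '<cite class="citation web">Hicklin, Aaron. "The Gospel According to Benedict". Out Magazine. ...</cite>',
    ],
];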

Open questions:

  • How can this be done in a performant way?

Demonstration:
Before:

Screen Shot 2016-02-18 at 10.19.29 AM.png (545×404 px, 72 KB)

After:
Screen Shot 2016-02-18 at 10.19.52 AM.png (543×400 px, 86 KB)

Event Timeline

Jdlrobson raised the priority of this task from to Medium.
Jdlrobson updated the task description.
Jdlrobson added a project: Cite.
Jdlrobson added subscribers: Jdlrobson, Luke081515, MGChecker and 9 others.

Parsed refs would take much more storage space, and I think that should be avoided.

I presume that MobileFrontend will hook into the Cite extension to abort the return of <references> tags and instead return some placeholder. This will require splitting the parser cache (mobile vs. non-mobile), or maybe MF now has its own parser cache? (From T124356 it didn't appear to have one.)

The parsing of references could be done on the special page used to display references to non-JS users, which would have its own cache, and the JS version would fetch parsed references from there.

The primary goal is not a special page but a JavaScript-based rendering of references (on the existing mobile web site we show references inside a panel at the bottom of the page). So we need to work out some way/place to parse this.

Change 271678 had a related patch set uploaded (by Jdlrobson):
Surface parsed references in API response

https://gerrit.wikimedia.org/r/271678

The above suggests doing this on parse, but I'm sure there's a better way. Any thoughts most welcome! :)

I'm not sure that all API callers would like the references in parsed form. You could call the parse API on the raw refs.
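
For example, a minimal sketch of that round trip (action=parse with the text and contentmodel parameters is the standard parse API; the template here is shortened):

$url = 'https://en.wikipedia.org/w/api.php?action=parse&format=json'
    . '&contentmodel=wikitext'
    . '&text=' . urlencode( '{{cite web|title=Example|url=https://example.org}}' );
$response = json_decode( file_get_contents( $url ), true );
// The rendered HTML of the reference.
echo $response['parse']['text']['*'];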

@Cenarium the parsing is definitely something we can do in the short term, but it would mean every click on a reference is an API lookup, and the first reference would be two. Definitely a short-term solution, however.

We could add an API option to add an html field to the reference value. I have added a few reviewers and used separate patchsets so that we can keep our options open!
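
A rough sketch of what that option could look like inside the module (the parameter name and the parsing call are assumptions for this sketch, not the code from the patch):

if ( $params['parse'] ) {
    foreach ( $refs as $name => &$ref ) {
        // Turn the stored wikitext into HTML on demand.
        $output = $parser->parse( $ref['text'], $title, ParserOptions::newFromAnon() );
        $ref['html'] = $output->getText();
    }
}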

This reduces the usefulness of serving them via the API

For your use case. There are other use cases where wikitext is needed and parsed HTML would be absolutely useless, such as AnomieBOT's OrphanReferenceFixer.

This is why options are generally good.

Yup, so an API option then. See https://phabricator.wikimedia.org/T127263#2042753.
So yeah, options are fine. I'm still not sure how we go about doing that in a performant way, though.

For performance, you'd probably want to define a maximum like "100 refs parsed". Then if at least one ref has been parsed already and the total number of refs already parsed plus the number of refs in the current page is > 100, you stop there and return continuation. It's not perfect since someone could throw 10000 refs on a page, but it'll probably do.

Or you could get more complicated with the continuation and stop in the middle of outputting a page as soon as the 100 is hit, if you wanted to.
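
A rough sketch of that batching rule (the variable names and the parseRef() helper are illustrative; setContinueEnumParameter() is the standard ApiQueryBase continuation helper):

$maxParsedRefs = 100;
$parsedSoFar = 0;
foreach ( $pages as $pageId => $refs ) {
    // Stop before a page that would exceed the budget, as long as
    // at least one ref has already been parsed in this request.
    if ( $parsedSoFar > 0 && $parsedSoFar + count( $refs ) > $maxParsedRefs ) {
        $this->setContinueEnumParameter( 'continue', $pageId );
        break;
    }
    foreach ( $refs as $name => &$ref ) {
        $ref['html'] = $this->parseRef( $ref['text'] ); // hypothetical helper
        $parsedSoFar++;
    }
}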

As I understand it, references would no longer be displayed at the bottom of the page in mobile view but individually fetched for each click on a ref.
Then using the API as implemented in https://gerrit.wikimedia.org/r/#/c/271422/, parsed or not, supplemented by the parse API in the latter case, would still require all references (unparsed in the latter case) to be sent in a single request, yet they can take a lot of KBs. If the intent is to make it easier to access WP on slow connections, then it might be worth building an API that retrieves and parses a single reference from a page (identified by its key) for this purpose. Since the list of references is cached, it shouldn't be an issue for the server.
(The API from https://gerrit.wikimedia.org/r/#/c/271422/ would still have other use cases.)
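
For illustration, such a single-reference request might look something like this (the prop and refkey parameter names are hypothetical names for this sketch; no such module exists):

api.php?action=query&prop=references&titles=Alan_Turing&refkey=hicklin2014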

Change 278703 had a related patch set uploaded (by Cenarium):
[WIP] Use custom self-regenerating parser cache for references

https://gerrit.wikimedia.org/r/278703

Change 271678 abandoned by Jdlrobson:
Surface parsed references in API response

https://gerrit.wikimedia.org/r/271678

This feature will be removed (see T222373)

Change 278703 abandoned by Thiemo Kreuz (WMDE):
Add API module to retrieve parsed references

Reason:
There is a REST API for this: https://en.wikipedia.org/api/rest_v1/#/Page content/getContent-references

https://gerrit.wikimedia.org/r/278703
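
For reference, a call to that endpoint looked something like this (example title; see the linked documentation for the exact signature):

GET https://en.wikipedia.org/api/rest_v1/page/references/Alan_Turing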