References stored in page props are not parsed
Closed, Declined (Public)

Description

References are often created via templates e.g.

{{cite web|last1=Hicklin|first1=Aaron|title=The Gospel According to Benedict|url=http://www.out.com/entertainment/movies/2014/10/14/sherlock-star-benedict-cumberbatch-poised-make-alan-turing-his-own-imitation-game|website=Out Magazine|date=14 October 2014|accessdate=24 April 2015}}

Currently, the references returned via Cite::getStoredReferences are the raw wikitext of the references.
This reduces the usefulness of serving them via the API for rendering purposes.

Expected:
These should also return the parsed HTML when an html option is passed.
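
For illustration, a hedged sketch of the two shapes in PHP (the key and field names here are assumptions for this sketch, not the actual Cite storage schema):

$current = [
    'hicklin2014' => [
        'text' => '{{cite web|last1=Hicklin|first1=Aaron|title=The Gospel According to Benedict|...}}',
    ],
];

$expected = [
    'hicklin2014' => [
        'text' => '{{cite web|last1=Hicklin|first1=Aaron|title=The Gospel According to Benedict|...}}',
        // Only present when the html option is passed.
        'html' => '<cite class="citation web">Hicklin, Aaron. "The Gospel According to Benedict". Out Magazine. ...</cite>',
    ],
];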

Open questions:

  • How can this be done in a performant way?

Demonstration:
Before:

Screen Shot 2016-02-18 at 10.19.29 AM.png (545×404 px, 72 KB)

After:
Screen Shot 2016-02-18 at 10.19.52 AM.png (543×400 px, 86 KB)

Event Timeline

Jdlrobson raised the priority of this task from to Medium.
Jdlrobson updated the task description.
Jdlrobson added a project: Cite.
Jdlrobson added subscribers: Jdlrobson, Luke081515, MGChecker and 9 others.

Parsed refs would take much more storage space, and I think that should be avoided.

I presume that MobileFrontend will hook into the Cite extension to abort the return of <references> tags and instead return some placeholder. This will require splitting the parser cache (mobile vs. non-mobile), or maybe MF now has its own parser cache? (From T124356 it didn't appear to have one.)

The parsing of references could be done on the special page used to display references to non-JS users, which would have its own cache, and the JS version would fetch parsed references from there.

The primary goal is not a special page but a JavaScript-based rendering of references (on the existing mobile web site we show references inside a panel at the bottom of the page). So we need to work out some way/place to parse this.

Change 271678 had a related patch set uploaded (by Jdlrobson):
Surface parsed references in API response

https://gerrit.wikimedia.org/r/271678

The above suggests doing this on parse, but I'm sure there's a better way. Any thoughts most welcome! :)

I'm not sure that all API callers would like the references in parsed form. You could call the parse API on the raw refs.
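
For example, a minimal sketch of that round trip (action=parse with the text and contentmodel parameters is the standard parse API; the template here is shortened):

$url = 'https://en.wikipedia.org/w/api.php?action=parse&format=json'
    . '&contentmodel=wikitext'
    . '&text=' . urlencode( '{{cite web|title=Example|url=https://example.org}}' );
$response = json_decode( file_get_contents( $url ), true );
// The rendered HTML of the reference.
echo $response['parse']['text']['*'];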

@Cenarium the parsing is definitely something we can do in the short term, but it would mean every click on a reference is an API lookup, and the first reference would be two. Definitely a short-term solution, however.

We could add an API option to add an html field to the reference value. I have added a few reviewers and used separate patchsets so that we can keep our options open!
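
A rough sketch of what that option could look like inside the module (the parameter name and the parsing call are assumptions for this sketch, not the code from the patch):

if ( $params['parse'] ) {
    foreach ( $refs as $name => &$ref ) {
        // Turn the stored wikitext into HTML on demand.
        $output = $parser->parse( $ref['text'], $title, ParserOptions::newFromAnon() );
        $ref['html'] = $output->getText();
    }
}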

This reduces the usefulness of serving them via the API

For your use case. There are other use cases where wikitext is needed and parsed HTML would be absolutely useless, such as AnomieBOT's OrphanReferenceFixer.

This is why options are generally good.

Yup, so an API option then. See https://phabricator.wikimedia.org/T127263#2042753.
So yeah, options are fine. I'm still not sure how we go about doing that in a performant way, though.

For performance, you'd probably want to define a maximum like "100 refs parsed". Then if at least one ref has been parsed already and the total number of refs already parsed plus the number of refs in the current page is > 100, you stop there and return continuation. It's not perfect since someone could throw 10000 refs on a page, but it'll probably do.

Or you could get more complicated with the continuation and stop in the middle of outputting a page as soon as the 100 is hit, if you wanted to.
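
A rough sketch of that batching rule (the variable names and the parseRef() helper are illustrative; setContinueEnumParameter() is the standard ApiQueryBase continuation helper):

$maxParsedRefs = 100;
$parsedSoFar = 0;
foreach ( $pages as $pageId => $refs ) {
    // Stop before a page that would exceed the budget, as long as
    // at least one ref has already been parsed in this request.
    if ( $parsedSoFar > 0 && $parsedSoFar + count( $refs ) > $maxParsedRefs ) {
        $this->setContinueEnumParameter( 'continue', $pageId );
        break;
    }
    foreach ( $refs as $name => &$ref ) {
        $ref['html'] = $this->parseRef( $ref['text'] ); // hypothetical helper
        $parsedSoFar++;
    }
}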

As I understand it, references would no longer be displayed at the bottom of the page in mobile view but individually fetched for each click on a ref.
Then using the API as implemented in https://gerrit.wikimedia.org/r/#/c/271422/, parsed or not, supplemented by the parse API in the latter case, would still require all references (unparsed in the latter case) to be sent in a single request, yet they can take a lot of KBs. If the intent is to make it easier to access WP on slow connections, then it might be worth building an API that retrieves and parses a single reference from a page (identified by its key) for this purpose. Since the list of references is cached, it shouldn't be an issue for the server.
(The API from https://gerrit.wikimedia.org/r/#/c/271422/ would still have other use cases.)
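
For illustration, such a single-reference request might look something like this (the prop and refkey parameter names are hypothetical names for this sketch; no such module exists):

api.php?action=query&prop=references&titles=Alan_Turing&refkey=hicklin2014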

Change 278703 had a related patch set uploaded (by Cenarium):
[WIP] Use custom self-regenerating parser cache for references

https://gerrit.wikimedia.org/r/278703

Change 271678 abandoned by Jdlrobson:
Surface parsed references in API response

https://gerrit.wikimedia.org/r/271678

This feature will be removed (see T222373)

Change 278703 abandoned by Thiemo Kreuz (WMDE):
Add API module to retrieve parsed references

Reason:
There is a REST API for this: https://en.wikipedia.org/api/rest_v1/#/Page content/getContent-references

https://gerrit.wikimedia.org/r/278703
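
For reference, a call to that endpoint looked something like this (example title; see the linked documentation for the exact signature):

GET https://en.wikipedia.org/api/rest_v1/page/references/Alan_Turing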