Should be possible to access references and notes for a given page via API
Closed, Resolved · Public · 2 Estimated Story Points

Description

On mobile devices, references and notes can account for 50% of the HTML of an article (see https://www.mediawiki.org/wiki/Reading/Web/Projects/A_frontend_powered_by_Parsoid/HTML_content_research#HTML_size_report). Mobile intends to scrub references from the initial output and lazy-load them (with suitable non-JS fallbacks).

Given that an article's references are not needed straight away, it should be possible to obtain them via an API separately from the rest of the content and render them via JavaScript.

The references extension builds an intermediate representation (IR) while the page is being parsed. If the parser encounters a <references /> tag (or the parser finishes parsing the page and it didn't encounter a <references /> tag), then the IR is used to build the output HTML. However, the IR isn't stored anywhere.

Building an API to surface the IR would require additional storage. In the worst-case scenario references account for around 50% of the HTML, but the IR is likely to be a lot smaller.

When evaluating serialisation methods, bear in mind that we'd prefer to avoid making the user agent do any more work than necessary. For example, MobileFrontend relies on querying the DOM for the note every time the user taps a reference, which could be eliminated if we were to deliver references as a map of reference ID to note.
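
A rough sketch of the two lookup strategies, for illustration only. The names `sampleNotesById`, `getReferenceFromDom` and `getReferenceFromMap`, and the sample ID and note text, are made up here and are not MobileFrontend's actual code.

```javascript
// Hypothetical payload: reference ID -> note HTML, delivered once by an API.
const sampleNotesById = {
  'cite_note-1': 'Example note text'
};

// DOM-based lookup: requires the note to be present in the page HTML and
// queries the DOM on every tap. Browser-only; shown for contrast.
function getReferenceFromDom(id) {
  const node = document.getElementById(id);
  return node ? node.innerHTML : null;
}

// Map-based lookup: a plain property access, no DOM query per tap.
function getReferenceFromMap(id, notesById) {
  return Object.prototype.hasOwnProperty.call(notesById, id)
    ? notesById[id]
    : null;
}
```

With a map, the notes do not need to be in the initial HTML at all, which is what makes lazy loading possible.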

Acceptance criteria:

References intermediate representation saved via setExtensionData
API endpoint for surfacing intermediate representation in Cite extension
No changes in MobileFrontend but someone needs to confirm that API result is compatible with what is happening in mobile.references ResourceLoader module in references.js getReference function

Details

Subject	Repo	Branch	Lines +/-
Surface references via api query property	mediawiki/extensions/Cite	master	+107 -0
Toy benchmark for unserializing serialized IR	mediawiki/extensions/Cite	master	+234 -0
Add script to dump the intermediate representation	mediawiki/extensions/Cite	master	+56 -0


Related Objects

Status	Subtype	Assigned	Task
Open	Release	None	T84936 Release VisualEditor-MediaWiki as "1.0"
Open		None	T50429 [Epic] Support editing parts of a page in VisualEditor-MediaWiki
Open		None	T54365 Explore performance gains from progressive (JIT?) de-alienation in VisualEditor
Open		None	T174303 Copy-pasting linked ISBN numbers from view mode HTML into VisualEditor inserts wikitext links to Special:BookSources (it should turn them into magic links?)
Open	Feature	None	T54091 The read HTML should have hinting to allow full DOM copying (as opposed to just rich copying) from read mode into VE surfaces
Open		None	T55784 [EPIC] Use Parsoid HTML for all page views
Resolved		dr0ptp4kt	T114542 Next Generation Content Loading and Routing, in Practice
Duplicate		• Jhernandez	T104432 [EPIC]: Improve mobile site performance
Duplicate		dr0ptp4kt	T120341 [GOAL] Make Wikipedia more accessible to all connections with new fast API-driven web experience in mobile web beta
Declined		None	T125920 [EPIC] Future exciting reading web performance endeavours
Resolved		Jdlrobson	T113066 [GOAL] Make Wikipedia more accessible to 2G connections
Declined		None	T123328 [GOAL] Lazy load references in mobile skin
Resolved		Jdlrobson	T125896 Feature flagged lazy loaded references
Resolved		phuedx	T123290 Should be possible to access references and notes for a given page via API
Resolved		phuedx	T125134 Spike: Can we store a JSON blob of references data in ParserOutput
Declined		None	T126802 What is impact of storing references
Declined		Cenarium	T125329 Save references in page_props and cache
Declined		None	T127263 References stored in page props are not parsed

Event Timeline


Jdlrobson renamed this task from Should be possible to access references for a given page via API to Should be possible to access references and notes for a given page via API. Jan 11 2016, 11:37 PM

Jdlrobson updated the task description.

Jdlrobson removed a parent task: T113066: [GOAL] Make Wikipedia more accessible to 2G connections. Jan 12 2016, 12:21 AM

Jdlrobson added a subtask: T123328: [GOAL] Lazy load references in mobile skin.

@tstarling we would appreciate your thoughts on this as we plan for this quarter, in particular where you would recommend storing a data structure for references. Perhaps in ParserOutput via setExtensionData?

Jdforrester-WMF subscribed. Jan 12 2016, 11:56 PM

Anomie moved this task from Unsorted to Non-core-API stuff on the MediaWiki-Action-API board. Jan 13 2016, 2:56 PM

Change 264300 had a related patch set uploaded (by Phuedx):
Add script to dump the intermediate representation

https://gerrit.wikimedia.org/r/264300

Change 264300 abandoned by Phuedx:
Add script to dump the intermediate representation

Reason:
I only want to archive the tool for posterity. I do not wish it to be merged.

https://gerrit.wikimedia.org/r/264300

phuedx added a project: Reading-Web-Sprint-64-Five Sprints at Freddy's. Jan 18 2016, 5:22 PM

phuedx moved this task from Needs Analysis to To Do on the Reading-Web-Sprint-64-Five Sprints at Freddy's board. Jan 18 2016, 5:40 PM

@phuedx @Jdlrobson Is this a spike? An epic?

I expected to see clearer AC since it is in TODO in the current sprint. Should we rename it? Add subtasks with spikes or more concrete tasks?

phuedx moved this task from To Do to Needs Analysis on the Reading-Web-Sprint-64-Five Sprints at Freddy's board. Jan 19 2016, 10:22 AM

I used the tool I added in 264300 to determine the size of the serialised IR in KB for a small set of sample articles:

Title	Size (KB)
Nike, Inc.	23.34
Star Wars: The Force Awakens	117.27
Barack Obama	147.79
Doctor Who	57.56
Syrian Civil War	236.86
Oakland, California	43.46
Campus Honeymoon	0.75
Brazil	109.73

Storing the serialised IR for every page would require storing at least 2 bytes per page (serialize(null); returns "N;").

Sorry @Jhernandez, I added this to the sprint to remind myself to follow up on 264300.

Change 265241 had a related patch set uploaded (by Phuedx):
Toy benchmark for unserializing serialized IR

https://gerrit.wikimedia.org/r/265241

Change 265241 abandoned by Phuedx:
Toy benchmark for unserializing serialized IR

Reason:
I only want to archive the benchmark for posterity. I do not wish it to be merged.

https://gerrit.wikimedia.org/r/265241

phuedx moved this task from Needs Analysis to To Do on the Reading-Web-Sprint-64-Five Sprints at Freddy's board. Jan 20 2016, 6:13 PM

phuedx moved this task from To Do to Needs Analysis on the Reading-Web-Sprint-64-Five Sprints at Freddy's board.

Cenarium subscribed. Jan 25 2016, 8:34 AM

Change 266249 had a related patch set uploaded (by Phuedx):
[WIP] Load references only when necessary

https://gerrit.wikimedia.org/r/266249

Jdlrobson edited projects, added Reading-Web-Sprint-65-Game of Phones; removed Reading-Web-Sprint-64-Five Sprints at Freddy's. Jan 25 2016, 5:12 PM

phuedx updated the task description. Jan 25 2016, 5:31 PM

While the discussion seems to be focused on exporting reference metadata, the description of this task also brings up the more general question of lazy-loading of notes and other page components like navboxes or infoboxes.

We are very interested in providing APIs for selective component retrieval in RESTBase, expanding on the existing section retrieval API. The main thing needed to make that happen is the identification of interesting content elements in Parsoid, possibly using templatedata to categorize templates across languages.

Parsoid HTML also contains very detailed and reliable structured information about references. It might be easier to use this information, rather than adding another custom code path in the Cite extension.

Cenarium mentioned this in T7984: Edit preview doesn't let you preview cite.php footnotes. Jan 25 2016, 6:38 PM

As I just noticed, page_props couldn't be used even for wikitext, since some pages have more than 64 KB of reference wikitext and pp_value is a BLOB. Otherwise the column type would have to be changed.
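
To make the size constraint concrete, here is a minimal check, assuming the column is a MySQL BLOB (maximum 65,535 bytes). The helper names `utf8ByteLength` and `fitsInBlob` are made up for this sketch.

```javascript
// Maximum size of a MySQL BLOB column in bytes (2^16 - 1).
const BLOB_MAX_BYTES = 65535;

// Size of a string in bytes once encoded as UTF-8.
function utf8ByteLength(value) {
  return new TextEncoder().encode(value).length;
}

// True if the value would fit in a single BLOB column.
function fitsInBlob(value) {
  return utf8ByteLength(value) <= BLOB_MAX_BYTES;
}

fitsInBlob('x'.repeat(1000));  // true
fitsInBlob('x'.repeat(70000)); // false
```

Note that the limit is in bytes, not characters, so non-ASCII reference text reaches it sooner.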

phuedx removed a project: Reading-Web-Sprint-65-Game of Phones. Jan 26 2016, 11:16 AM

phuedx claimed this task. Jan 26 2016, 11:41 AM

By your talk of "IR" I suppose you want to deliver each reference separately? Why not bundle the whole references section and deliver it to the user in one chunk? Isn't it fair to assume that if the user wants to see one reference, they will want to see others shortly afterwards?

@tstarling: 266249 does exactly that. The entirety of $parser->extCite->mRefs is processed, cached, and delivered to the client only when the user taps a reference. The "processing" step generates a map of reference key to reference text, as that's all that MobileFrontend requires, i.e.

array(
  'Foo' => array(
    'text' => 'Bar baz',
    'count' => 0,
    'number' => 1
  ),
)

is converted to

array(
  'cite_note_Foo-0' => 'Bar baz'
)
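
The conversion above could be sketched roughly like this. The key format (`cite_note_` + name + `-` + count) is inferred from the single example above and is an assumption, as are the names `sampleRefs` and `toReferenceMap`; the real logic in 266249 may differ.

```javascript
// Sample IR entry, mirroring the example above.
const sampleRefs = {
  Foo: { text: 'Bar baz', count: 0, number: 1 }
};

// Flattens the IR into a map of reference key -> reference text.
// Assumption: key = "cite_note_" + name + "-" + count.
function toReferenceMap(refs) {
  const map = {};
  for (const [name, ref] of Object.entries(refs)) {
    map[`cite_note_${name}-${ref.count}`] = ref.text;
  }
  return map;
}

const sampleReferenceMap = toReferenceMap(sampleRefs);
```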

@phuedx
Couldn't the caching be done directly in the Cite extension so that it can be used for other purposes, e.g. T124840?
(This would need an indefinite cache duration, renewal on reparse, and purging on move or deletion.)
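
As a rough illustration of that cache policy (no expiry, renewal on reparse, purge on move or deletion), here is an in-memory stand-in. The class `ReferenceCache` and its method names are hypothetical, and a real implementation would sit on MediaWiki's caching layer rather than a `Map`.

```javascript
// In-memory stand-in for a references cache keyed by page title.
// Entries never expire on their own (indefinite duration).
class ReferenceCache {
  constructor() {
    this.entries = new Map();
  }

  // Returns the cached references, or null on a cache miss.
  get(title) {
    return this.entries.has(title) ? this.entries.get(title) : null;
  }

  // Renewal on reparse: overwrite whatever was stored before.
  onReparse(title, references) {
    this.entries.set(title, references);
  }

  // Purge when the page is moved or deleted.
  onMoveOrDelete(title) {
    this.entries.delete(title);
  }
}
```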

@Cenarium: Yes. Absolutely. 266249 is just me experimenting with the idea, hence the WIP tag. The reason I chose to cache the processed references was that it made the mfreferences API incredibly simple to implement.

@Cenarium: … However, then we're treating a cache more like a store, which is where my "Where should we store this structure?" question comes from.

phuedx mentioned this in T123328: [GOAL] Lazy load references in mobile skin. Jan 27 2016, 11:31 AM

Jdlrobson removed a subtask: T123328: [GOAL] Lazy load references in mobile skin. Jan 27 2016, 4:36 PM

Jdlrobson added a parent task: T123328: [GOAL] Lazy load references in mobile skin.

@phuedx: Yes, caches shouldn't be considered reliable for long-term storage, but in the case of T124840 a cache miss isn't too bad. So if MobileFrontend doesn't need a long-term storage medium, this shouldn't be an issue.

Tbayer unsubscribed. Jan 29 2016, 12:39 AM

Jdlrobson reassigned this task from phuedx to Cenarium. Jan 29 2016, 9:11 PM

Jdlrobson updated the task description.

Jdlrobson mentioned this in T125134: Spike: Can we store a JSON blob of references data in ParserOutput.

Jdlrobson closed subtask T125134: Spike: Can we store a JSON blob of references data in ParserOutput as Resolved.

Jdlrobson edited projects, added Reading-Web-Sprint-65-Game of Phones; removed Patch-For-Review.

Change 267514 had a related patch set uploaded (by Cenarium):
Store parsed references in cache and page_props

https://gerrit.wikimedia.org/r/267514

gerritbot added a project: Patch-For-Review. Jan 30 2016, 10:50 PM

Cenarium added a parent task: T124840: Section edit preview doesn't let you preview references defined outside the section being previewed. Jan 30 2016, 11:08 PM

Cenarium removed a parent task: T124840: Section edit preview doesn't let you preview references defined outside the section being previewed. Jan 31 2016, 8:14 AM

Cenarium mentioned this in T125329: Save references in page_props and cache. Jan 31 2016, 8:19 AM

Cenarium added a subtask: T125329: Save references in page_props and cache. Jan 31 2016, 8:22 AM

Okay, I've done the internal logic for Cite, for which I've created a specific task: T125329. But the API still needs to be done, so I'm thinking of reassigning this task to phuedx. The API should call the getStoredReferences function of Cite.

phuedx moved this task from Needs Analysis to Code Review on the Reading-Web-Sprint-65-Game of Phones board. Feb 1 2016, 10:15 AM

Tobi_WMDE_SW subscribed. Feb 1 2016, 2:29 PM

phuedx removed a project: Reading-Web-Sprint-65-Game of Phones. Feb 1 2016, 5:48 PM

Jdlrobson added a project: Reading-Web-Sprint-65-Game of Phones. Feb 1 2016, 5:50 PM

Jdlrobson edited a custom field.

phuedx removed a project: Reading-Web-Sprint-65-Game of Phones. Feb 1 2016, 5:54 PM

Waiting on T125329

Jdlrobson edited projects, added Reading-Web-Sprint-66-Harry is Tired; removed Tracking-Neverending. Feb 4 2016, 9:32 PM

Jdlrobson edited a custom field.

MGChecker subscribed. Feb 8 2016, 10:34 AM

Restricted Application added a subscriber: Luke081515. Feb 8 2016, 10:34 AM

phuedx mentioned this in T125896: Feature flagged lazy loaded references. Feb 8 2016, 5:47 PM

Jdlrobson changed the task status from Stalled to Open. Feb 16 2016, 5:42 PM

Jdlrobson edited a custom field. Feb 16 2016, 5:47 PM

Change 271422 had a related patch set uploaded (by Jdlrobson):
Surface references via api query property

https://gerrit.wikimedia.org/r/271422

Jdlrobson added a parent task: T125896: Feature flagged lazy loaded references. Feb 18 2016, 12:52 AM

Jdlrobson moved this task from Needs Analysis to Code Review on the Reading-Web-Sprint-66-Harry is Tired board. Feb 18 2016, 10:32 PM

Izno moved this task from Unsorted backlog to Doing on the Cite board. Feb 20 2016, 3:03 AM

Jdlrobson moved this task from Code Review to -1 (Needs More Work) on the Reading-Web-Sprint-66-Harry is Tired board. Feb 22 2016, 9:42 PM

Jdlrobson moved this task from -1 (Needs More Work) to Code Review on the Reading-Web-Sprint-66-Harry is Tired board. Feb 24 2016, 7:32 PM

Change 271422 merged by jenkins-bot:
Surface references via api query property

https://gerrit.wikimedia.org/r/271422

Jdlrobson moved this task from Code Review to Ready for Signoff on the Reading-Web-Sprint-66-Harry is Tired board. Feb 26 2016, 7:28 PM

ReleaseTaggerBot added a project: MW-1.27-release (WMF-deploy-2016-03-01_(1.27.0-wmf.15)). Feb 26 2016, 8:00 PM

Jdlrobson updated the task description. Feb 27 2016, 12:11 AM

This is now done but feature flagged.
Verified here:
http://reading-web-staging.wmflabs.org/w/index.php?title=Special:ApiSandbox&mobileaction=toggle_view_desktop#action=query&format=json&prop=references&titles=Doctor_Who

Jdlrobson moved this task from Ready for Signoff to Done on the Reading-Web-Sprint-66-Harry is Tired board. Feb 27 2016, 1:18 AM

Niedzielski subscribed. Mar 7 2016, 10:43 PM

This no longer works. See: https://en.wikipedia.org/w/api.php?action=query&prop=references&titles=Albert%20Einstein
I'm wondering whether this is a regression or whether it was disabled on purpose. I would like to access the list of references from the API.

@Fako85 The error message is "Cite extension reference storage is not enabled."
To my knowledge this was never enabled in production. You may want to open a new task against Cite extension to investigate that.
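
For anyone scripting against this, a minimal sketch of building the `prop=references` query and spotting the error case. The helper names `buildReferencesUrl` and `getApiError` are made up, and the `{ error: { code, info } }` shape assumes the Action API's default error format.

```javascript
// Builds an Action API URL for prop=references, mirroring the URLs
// quoted in this task.
function buildReferencesUrl(apiBase, title) {
  const params = new URLSearchParams({
    action: 'query',
    format: 'json',
    prop: 'references',
    titles: title
  });
  return `${apiBase}?${params.toString()}`;
}

// Returns the error text if the parsed response is an API error,
// otherwise null.
// Assumption: errors arrive as { error: { code, info } }.
function getApiError(response) {
  return response && response.error ? response.error.info : null;
}
```

`URLSearchParams` encodes the space in a title such as "Albert Einstein" as `+` rather than `%20`.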

In mobile web we've been using the following API to get at references:
https://en.m.wikipedia.org/w/api.php?action=mobileview&format=json&page=Barack%20Obama&sections=references&prop=text&revision=810395720

Thanks a lot @Jdlrobson. I'll try to parse from there what I need.

Jdlrobson closed subtask T126802: What is impact of storing references as Declined. Jul 17 2019, 4:14 PM

Jdlrobson closed subtask T125329: Save references in page_props and cache as Declined.

Anomie mentioned this in T238195: Check ApiQueryReferences compatibility with extended references. Nov 13 2019, 2:44 PM