
Should be possible to access references and notes for a given page via API
Closed, Resolved · Public · 2 Story Points

Description

On mobile devices, references and notes can account for 50% of the HTML of an article (see https://www.mediawiki.org/wiki/Reading/Web/Projects/A_frontend_powered_by_Parsoid/HTML_content_research#HTML_size_report). The mobile web team intends to strip references from the initial output and lazy-load them (with suitable non-JS fallbacks).

Given that an article's references are not needed straight away, it should be possible to obtain them via an API separately from the rest of the content and render them via JavaScript.

The references extension builds an intermediate representation (IR) while the page is being parsed. If the parser encounters a <references /> tag (or the parser finishes parsing the page and it didn't encounter a <references /> tag), then the IR is used to build the output HTML. However, the IR isn't stored anywhere.

Building an API to surface the IR would require additional storage. In the worst-case scenario references account for around 50% of the HTML, but the IR is likely to be a lot smaller.

When evaluating serialisation methods, bear in mind that we'd prefer to avoid making the user agent do any more work than necessary. For example, MobileFrontend currently queries the DOM for the note every time the user taps a reference, which could be eliminated if we were to deliver references as a map of reference ID to note.
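As a sketch, such a map might look like the following (the IDs and note text are purely illustrative, not Cite's actual output):

```json
{
  "cite_note-Foo-0": "Bar baz",
  "cite_note-Qux-1": "Some other note text"
}
```

With a payload shaped like this, the client can look up a note by the ID of the reference the user tapped, without walking the DOM.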

Acceptance criteria:

  • References intermediate representation saved via setExtensionData
  • API endpoint for surfacing intermediate representation in Cite extension
  • No changes in MobileFrontend, but someone needs to confirm that the API result is compatible with the getReference function in references.js in the mobile.references ResourceLoader module

Related Objects

Status | Assigned
Open | None
Open | None
Open | None
Open | None
Open | None
Open | None
Open | None
Declined | None
Open | None
Resolved | dr0ptp4kt
Duplicate | Jhernandez
Duplicate | dr0ptp4kt
Open | None
Resolved | Jdlrobson
Declined | None
Resolved | Jdlrobson
Resolved | phuedx
Resolved | phuedx
Declined | None
Declined | Cenarium
Declined | None

Event Timeline

Jdlrobson renamed this task from "Should be possible to access references via API" to "Should be possible to access references for a given page via API". Jan 11 2016, 11:37 PM
Jdlrobson renamed this task from "Should be possible to access references for a given page via API" to "Should be possible to access references and notes for a given page via API".
Jdlrobson updated the task description. (Show Details)
Jdlrobson updated the task description. (Show Details) Jan 12 2016, 11:51 PM

@tstarling, we would appreciate your thoughts on this as we plan for this quarter. In particular, where would you recommend storing a data structure for references? Perhaps ParserOutput via setExtensionData?

Change 264300 had a related patch set uploaded (by Phuedx):
Add script to dump the intermediate representation

https://gerrit.wikimedia.org/r/264300

Change 264300 abandoned by Phuedx:
Add script to dump the intermediate representation

Reason:
I only want to archive the tool for posterity. I do not wish it to be merged.

https://gerrit.wikimedia.org/r/264300

@phuedx @Jdlrobson Is this a spike? An epic?

I expected to see clearer AC since it is in TODO in the current sprint. Should we rename it? Add subtasks with spikes or more concrete tasks?

I used the tool I added in 264300 to determine the size of the serialised IR in KB for a small set of sample articles:

Title | Size (KB)
Nike, Inc. | 23.34
Star Wars: The Force Awakens | 117.27
Barack Obama | 147.79
Doctor Who | 57.56
Syrian Civil War | 236.86
Oakland, California | 43.46
Campus Honeymoon | 0.75
Brazil | 109.73

Storing the serialised IR for every page would require storing at least 2 bytes per page (serialize(null); returns "N;").
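For intuition about how these numbers are produced, here is a minimal sketch of the kind of measurement the dump tool performs. It is in Python rather than PHP, the IR below is invented for illustration, and JSON is used in place of PHP's serialize() format, so the absolute numbers will differ from the table above:

```python
import json

# Hypothetical intermediate representation for one page, shaped like
# Cite's internal map of reference name => data (illustrative only).
ir = {
    "Foo": {"text": "Bar baz", "count": 0, "number": 1},
    "Qux": {"text": "Longer note text " * 10, "count": 2, "number": 2},
}

def serialised_size_kb(data):
    """Return the size of the JSON-serialised IR in kilobytes."""
    return len(json.dumps(data).encode("utf-8")) / 1024

print(round(serialised_size_kb(ir), 2))
```

The real tool would run this over the IR captured from parsing actual articles rather than a toy dictionary.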

Sorry @Jhernandez, I added this to the sprint to remind myself to follow up on 264300.

Change 265241 had a related patch set uploaded (by Phuedx):
Toy benchmark for unserializing serialized IR

https://gerrit.wikimedia.org/r/265241

Change 265241 abandoned by Phuedx:
Toy benchmark for unserializing serialized IR

Reason:
I only want to archive the benchmark for posterity. I do not wish it to be merged.

https://gerrit.wikimedia.org/r/265241

Change 266249 had a related patch set uploaded (by Phuedx):
[WIP] Load references only when necessary

https://gerrit.wikimedia.org/r/266249

phuedx updated the task description. (Show Details) Jan 25 2016, 5:31 PM
GWicke added a comment. (Edited) Jan 25 2016, 6:27 PM

While the discussion seems to be focused on exporting reference metadata, the description of this task also brings up the more general question of lazy-loading of notes and other page components like navboxes or infoboxes.

We are very interested in providing APIs for selective component retrieval in RESTBase, expanding on the existing section retrieval API. The main thing needed to make that happen is the identification of interesting content elements in Parsoid, possibly using templatedata to categorize templates across languages.

Parsoid HTML also contains very detailed and reliable structured information about references. It might be easier to use this information, rather than adding another custom code path in the Cite extension.

As I just noticed, page_props couldn't be used even for wikitext, since some pages have more than 64 KB of wikitext of references and pp_value is a BLOB (limited to 64 KB); the column type would have to be changed.

phuedx claimed this task. Jan 26 2016, 11:41 AM

By your talk of "IR" I suppose you want to deliver each reference separately? Why not bundle the whole references section and deliver it to the user in one chunk? Isn't it fair to assume that if the user wants to see one reference, they will want to see others shortly afterwards?

@tstarling: 266249 does exactly that. The entirety of $parser->extCite->mRefs is processed, cached, and only delivered to the client when the user taps a reference. The "processing" step generates a map of reference key to reference text, as that's all MobileFrontend requires, i.e.

array(
  'Foo' => array(
    'text' => 'Bar baz',
    'count' => 0,
    'number' => 1
  ),
)

is converted to

array(
  'cite_note_Foo-0' => 'Bar baz'
)
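As a sketch, the conversion described above could be written like this. It is in Python rather than PHP, and both the references_to_map name and the exact ID scheme are assumptions based on the example, not Cite's actual code:

```python
def references_to_map(refs):
    """Convert a Cite-style map of reference name => data into a flat
    map of note element ID => note text, as used by MobileFrontend.

    The 'cite_note_{name}-{count}' ID scheme mirrors the example
    above; the real Cite ID format may differ.
    """
    return {
        "cite_note_%s-%d" % (name, data["count"]): data["text"]
        for name, data in refs.items()
    }

refs = {"Foo": {"text": "Bar baz", "count": 0, "number": 1}}
print(references_to_map(refs))  # {'cite_note_Foo-0': 'Bar baz'}
```

Flattening to ID => text discards the count and number fields, which is fine here because the client only needs to display the note body.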
Cenarium added a comment. (Edited) Jan 27 2016, 7:47 AM

@phuedx: Couldn't the caching be done directly in the Cite extension so that it can be used for other purposes, e.g. T124840?
(This would need an indefinite cache duration, renewal on reparse, and purging on move or deletion.)

@Cenarium: Yes. Absolutely. 266249 is just me experimenting with the idea, hence the WIP tag. The reason I chose to cache the processed references was that it made the mfreferences API incredibly simple to implement.

@Cenarium: … However, then we're treating a cache more like a store, which is where my "Where should we store this structure?" comes from.

@phuedx: Yes, caches shouldn't be considered reliable for long term storage, but in case of T124840 this isn't too bad if there's a cache miss. So if MobileFrontend doesn't need a long term storage medium, this shouldn't be an issue.

Change 267514 had a related patch set uploaded (by Cenarium):
Store parsed references in cache and page_props

https://gerrit.wikimedia.org/r/267514

Cenarium reassigned this task from Cenarium to phuedx. Jan 31 2016, 8:32 AM

Okay, I've done the internal logic for Cite, for which I've created a specific task: T125329. But the API still needs to be done, so I'm reassigning this task to phuedx. The API should call the getStoredReferences function of Cite.

Jdlrobson changed the task status from Open to Stalled. Feb 2 2016, 1:19 AM

Waiting on T125329

Jdlrobson changed the task status from Stalled to Open. Feb 16 2016, 5:42 PM
Jdlrobson edited a custom field. Feb 16 2016, 5:47 PM

Change 271422 had a related patch set uploaded (by Jdlrobson):
Surface references via api query property

https://gerrit.wikimedia.org/r/271422

Izno moved this task from Unsorted backlog to Doing on the Cite board. Feb 20 2016, 3:03 AM

Change 271422 merged by jenkins-bot:
Surface references via api query property

https://gerrit.wikimedia.org/r/271422

Jdlrobson updated the task description. (Show Details) Feb 27 2016, 12:11 AM
Fako85 added a subscriber: Fako85. Nov 16 2017, 2:57 PM

This no longer works. See: https://en.wikipedia.org/w/api.php?action=query&prop=references&titles=Albert%20Einstein
I'm wondering whether this is a regression or whether it was disabled on purpose. I would like to access the list of references from the API.

@Fako85: The error message is "Cite extension reference storage is not enabled."
To my knowledge this was never enabled in production. You may want to open a new task against the Cite extension to investigate that.

On the mobile web we've been using the following API to get at references:
https://en.m.wikipedia.org/w/api.php?action=mobileview&format=json&page=Barack%20Obama&sections=references&prop=text&revision=810395720

Thanks a lot @Jdlrobson. I'll try to parse from there what I need.