
Spike: Can we store a JSON blob of references data in ParserOutput
Closed, Resolved (Public)

Description

To lazy load references we need to keep references separate from the text of an article.
Although we could use REST APIs for this, it seems like it would be a useful feature of the Cite extension itself, with apps and mobile web being the obvious first consumers.

We thus need to store them somewhere, potentially the ParserOutput.

Outcome:

  • A yes/no answer to the question in the title "Can we store a JSON blob of references data in ParserOutput"
  • If no, a proposed alternative.

Duration: 4hrs

Event Timeline

Jdlrobson assigned this task to phuedx.
Jdlrobson raised the priority of this task from to Normal.
Jdlrobson updated the task description.

Yes. The successive mRefs arrays can be saved as extension data during parse. Then, the array can be json_encoded and compressed during LinksUpdate, stored in page_props, in two parts if the size is excessive, and saved in a cache.

That's how I plan on handling T124840, and I think it would work for MobileFrontend as well.
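
A minimal sketch of the encode, compress and split step described above, in plain PHP so it runs without MediaWiki. The function names, the `references-N` key scheme and the 60000-byte budget are all illustrative assumptions, not something taken from Cite; in practice each chunk would be written to page_props during LinksUpdate.

```php
<?php
// Hypothetical helpers: names, key scheme and size budget are assumptions.

/**
 * JSON-encode and gzip the references, then split the blob into chunks
 * small enough to be stored as separate page_props values.
 *
 * @param array $refs References collected during the parse
 * @param int $maxBytes Assumed size budget per stored value
 * @return string[] Map of property name => binary chunk
 */
function packReferences( array $refs, int $maxBytes = 60000 ): array {
	$blob = gzencode( json_encode( $refs ) );
	$parts = [];
	foreach ( str_split( $blob, $maxBytes ) as $i => $chunk ) {
		// Chunks are numbered from 1: references-1, references-2, ...
		$parts[ 'references-' . ( $i + 1 ) ] = $chunk;
	}
	return $parts;
}

/**
 * Reassemble the chunks in numeric order and decode back to an array.
 *
 * @param string[] $parts Map produced by packReferences()
 * @return array
 */
function unpackReferences( array $parts ): array {
	// Natural sort so that references-10 comes after references-2
	ksort( $parts, SORT_NATURAL );
	return json_decode( gzdecode( implode( '', $parts ) ), true );
}
```

With a small budget the blob is spread over several keys, and `unpackReferences( packReferences( $refs ) )` returns the original array.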

Anomie added a subscriber: Anomie. Jan 29 2016, 3:21 AM

I suppose that proposing serializing to JSON is to save a few bytes over just putting the data structure into ParserOutput's extension data and allowing it to be serialized along with the rest of the object?

Note that, if the data is only needed when you have a ParserOutput, extension data would seem likely to serve the purpose without the extra size limitations that using page_props introduces.

> I suppose that proposing serializing to JSON is to save a few bytes over just putting the data structure into ParserOutput's extension data and allowing it to be serialized along with the rest of the object?

I'm not sure why JSON was chosen here – perhaps it was a simple miscommunication. I think that serialising the data structure once to then serialise it again (it'd be serialized when it's stored with ParserOutput#setExtensionData, right?) would be wasteful.

> Yes. The successive mRefs arrays can be saved as extension data during parse. Then, the array can be json_encoded and compressed during LinksUpdate, stored in page_props, in two parts if the size is excessive, and saved in a cache.
> That's how I plan on handling T124840, and I think it would work for MobileFrontend as well.

That'd be ideal. When are you planning on starting this work? Is this something that we could collaborate on or, if not, something that I can help out on in any way? I'd be more than happy to do the MobileFrontend side of this.

Anomie set Security to None.

> I suppose that proposing serializing to JSON is to save a few bytes over just putting the data structure into ParserOutput's extension data and allowing it to be serialized along with the rest of the object?

> I'm not sure why JSON was chosen here – perhaps it was a simple miscommunication. I think that serialising the data structure once to then serialise it again (it'd be serialized when it's stored with ParserOutput#setExtensionData, right?) would be wasteful.

Well, serialize( json_encode( $foo ) ) does tend to be a bit smaller than serialize( $foo ), because PHP's format is more verbose except when it comes to long strings containing lots of double-quotes and backslashes. That might have been a concern that outweighs the added complexity of having to encode and decode the JSON blob.
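
A quick way to check the size claim, as a sketch: the sample data below is made up for illustration and real Cite structures will differ. It compares both encodings for a structured array and for a string made almost entirely of double quotes and backslashes, which JSON has to escape.

```php
<?php
// Illustrative data only; not the real shape of Cite's references.
$refs = [];
for ( $i = 1; $i <= 50; $i++ ) {
	$refs[] = [ 'key' => "cite_note-$i", 'text' => "Author $i, Some Title, 2016" ];
}

// Structured data: PHP's format spells out a type and length for every
// key and value, JSON does not.
$native = strlen( serialize( $refs ) );
$viaJson = strlen( serialize( json_encode( $refs ) ) );
printf( "array:  serialize=%d  serialize(json_encode)=%d\n", $native, $viaJson );

// Counter-case: every double quote and backslash gains an escape
// character in JSON, while serialize() stores the string byte for byte.
$quoted = str_repeat( '"\\', 500 );
$nativeQuoted = strlen( serialize( $quoted ) );
$viaJsonQuoted = strlen( serialize( json_encode( $quoted ) ) );
printf( "string: serialize=%d  serialize(json_encode)=%d\n", $nativeQuoted, $viaJsonQuoted );
```

For the array the JSON route comes out smaller; for the quote-heavy string it comes out larger, which matches the exception mentioned above.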

> That'd be ideal. When are you planning on starting this work? Is this something that we could collaborate on or, if not, something that I can help out on in any way? I'd be more than happy to do the MobileFrontend side of this.

I'll probably do this over the weekend. I think I can do the Cite extension side of this; of course, I'd like your feedback on it.
I'll write a function to return the refs that first tries the cache, then the db if there's a cache miss. Is it okay if it returns the data in array form?
For the cache, there's no need to json_encode it, and as you mentioned above, if it's not needed we should avoid it. (The db needs it, but that would be handled internally by Cite.)
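
The cache-then-db lookup could look roughly like this. It is only a sketch: a plain array stands in for the cache, a callable stands in for the database read, and `getReferences` and the key format are hypothetical names, not part of Cite or MediaWiki.

```php
<?php
/**
 * Return the references for a page as an array, trying the cache first
 * and falling back to the database on a miss.
 *
 * @param int $pageId
 * @param array &$cache Stand-in for a real cache, keyed by string
 * @param callable $loadFromDb Returns the stored JSON string, or null if none
 * @return array
 */
function getReferences( int $pageId, array &$cache, callable $loadFromDb ): array {
	$key = "cite-refs:$pageId";
	if ( isset( $cache[$key] ) ) {
		// Cache hit: the value is already an array, no JSON involved
		return $cache[$key];
	}

	// Cache miss: the db holds a JSON blob, so decode it here, once
	$json = $loadFromDb( $pageId );
	$refs = $json === null ? [] : ( json_decode( $json, true ) ?? [] );

	// Store the decoded array so later calls skip both the db and the decode
	$cache[$key] = $refs;
	return $refs;
}
```

Callers only ever see arrays; the JSON encoding stays an internal detail of the storage layer.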

Thanks for everyone's answers!
@Cenarium I'll assign you to T123290 and we'll help as best we can next week!

Jdlrobson closed this task as Resolved. Jan 29 2016, 9:12 PM

The answer seems to be YES.