gzipped page props
Open, Low, Public

Description

The page props table allows us to store information associated with an article. It is limited in size.

The Graph and Cite extensions store gzipped content in the references-1 and graph_specs fields, which must be decoded via gzdecode before it is useful.
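For readers working outside PHP, here is a minimal sketch of that decode step in Python. It assumes the stored value is a standard gzip stream wrapping JSON, which is the format PHP's gzdecode reads; the sample payload and the helper name `decode_page_prop` are made up for illustration and are not taken from either extension.

```python
import gzip
import json


def decode_page_prop(raw: bytes):
    """Decompress a gzip stream and parse the JSON inside it.

    Assumption: the property holds gzip-compressed JSON.
    """
    return json.loads(gzip.decompress(raw).decode("utf-8"))


# Stand-in for a stored pp_value; the real payload layout is not shown in this task.
sample = {"refs": ["Test reference"]}
stored = gzip.compress(json.dumps(sample).encode("utf-8"))

decoded = decode_page_prop(stored)
```

The point is only that a client needs both the raw bytes and the knowledge that they are gzipped before the value means anything.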

As a side effect of storing these in page props they are accessible via the API.
Surfacing these in the API is a little confusing and arguably not useful.

For example:
http://localhost:8888/w/index.php/Special:ApiSandbox?useformat=desktop#action=query&format=json&prop=pageprops&titles=Selenium_References_test_page&ppprop=references-1
on my local instance gives
\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffdV*JM+V\ufffd\ufffd\ufffdV\ufffd\ufffd\ufffd\ufffdy%JV\ufffd\ufffd:J%\ufffd\ufffd@\ufffdRHjq\ufffd\ufffdPQjQj^r\ufffd\ufffd\ufffdRvj\ufffd\ufffd\ufffdamlm\ufffd\ufffdRYjQqf~\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd,gJ\ufffd\ufffd\ufffd"

This doesn't seem ideal. A few ideas:

  • Allow page props to carry hints that explain how the client can make use of them
  • Allow handlers in the API that run before the props are returned, so that the API response returns the page property as JSON, for example
  • Allow certain page props to be hidden so they cannot be surfaced via the API
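To make the second and third ideas concrete, here is a rough sketch in Python rather than MediaWiki code. Everything in it is hypothetical: the `PROP_HANDLERS` registry, the `HIDDEN_PROPS` set, the `format_props` function, and the choice of which property is decoded and which is hidden are all assumptions made for illustration.

```python
import gzip
import json
from typing import Callable, Dict

# Hypothetical registry: property name -> function that turns the stored
# bytes into something JSON-serialisable. Nothing like this exists in core.
PROP_HANDLERS: Dict[str, Callable[[bytes], object]] = {}

# Hypothetical set of properties that should never reach API output.
HIDDEN_PROPS = {"graph_specs"}


def gzipped_json(raw: bytes) -> object:
    """Handler for values assumed to be gzip-compressed JSON."""
    return json.loads(gzip.decompress(raw).decode("utf-8"))


PROP_HANDLERS["references-1"] = gzipped_json


def format_props(props: Dict[str, bytes]) -> Dict[str, object]:
    """Apply handlers and drop hidden props before building a response."""
    out: Dict[str, object] = {}
    for name, raw in props.items():
        if name in HIDDEN_PROPS:
            continue  # idea 3: hidden props are skipped entirely
        handler = PROP_HANDLERS.get(name)
        # idea 2: a registered handler decodes the value; plain props pass through as text
        out[name] = handler(raw) if handler else raw.decode("utf-8")
    return out


stored = {
    "references-1": gzip.compress(json.dumps({"refs": []}).encode("utf-8")),
    "graph_specs": gzip.compress(b"{}"),
    "displaytitle": b"Example",
}
result = format_props(stored)
```

With this shape, a client asking for `references-1` would receive structured data instead of escaped binary, and `graph_specs` would not appear at all.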
Jdlrobson updated the task description.
Jdlrobson raised the priority of this task from to Needs Triage.
Jdlrobson added projects: Graphs, Cite, MediaWiki-API.
Jdlrobson added a subscriber: Jdlrobson.
Restricted Application added subscribers: StudiesWorld, Aklapper. Feb 17 2016, 8:05 PM

This would be cool. The issue affects TemplateData too.

It would be nice if it worked with Special:PagesWithProp too. Right now it just doesn't display binary or large values (see SpecialPagesWithProp::formatResult).

Restricted Application added a subscriber: Luke081515. Feb 17 2016, 8:11 PM

Fortunately we now have a "PageProps" class that manages access to the database table, thanks to @cicalese. The logic for handling gzipped pp_value fields (and maybe also values that are so huge that they span multiple rows) should be added there.

Although I note that the PageProps class doesn't cover SpecialPagesWithProp and ApiQueryPagesWithProp yet.

Yurik added a subscriber: Yurik. Edited Feb 17 2016, 9:23 PM

The Graph extension uses its own API call to get pp_value unzipped. Also note that for format=json it returns the value directly in the body of the response, not as a string. I am sure @Jdlrobson and others would appreciate JS clients being able to use the value directly instead of doing double unserializing plus a double error check. Please merge 258196 (T120380) to allow this in core.

https://www.mediawiki.org/w/api.php?action=graph&formatversion=2&title=Extension%3AGraph%2FDemo&hash=af18beb8c055ab584a5cb3b59bf86a510914bbc2

Hmm, interesting, but I think we can do this one step at a time. I imagine sending JSON-as-string (with all the escaped ") is still going to be better for performance than sending gzipped-JSON-as-string (with all the escaped binary data), assuming that the regular HTTP gzip compression kicks in as usual.

Yurik added a comment. Feb 17 2016, 9:37 PM

@matmarex in the Graph case it is actually slower, because after being read from the DB and unzipped, the data has to be parsed as JSON, the specific graph's data retrieved (there could be multiple graphs per page, while the user may only need one), and attached directly to the result's output.
If it were sent as a string, it would have to be re-encoded as JSON, which means more CPU on the server, a bigger payload, and more CPU on the client to double-decode it.
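The difference between the two response shapes can be illustrated with a small Python sketch. The graph spec and the `graph` key are invented for the example; only the contrast between embedding the value and embedding it as a string reflects the discussion above.

```python
import json

# Made-up graph spec standing in for one graph extracted from the page prop.
spec = {"width": 400, "marks": [{"type": "rect", "name": "bars"}]}

# Shape 1: the value is attached directly to the response body.
direct = json.dumps({"graph": spec})

# Shape 2: the value is serialised to a string first, then the response
# is serialised again, so every inner quote gets escaped.
as_string = json.dumps({"graph": json.dumps(spec)})

# Client side: one decode for shape 1, two decodes for shape 2.
from_direct = json.loads(direct)["graph"]
from_string = json.loads(json.loads(as_string)["graph"])

extra_bytes = len(as_string) - len(direct)
```

Both shapes recover the same data, but the string form is larger and requires a second parse (and a second error check) on the client.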

Let's leave aside the completely unrelated (and -2ed) T120380 here. It's not at all on-topic for this task, which is about the fact that trying to return binary data just doesn't work at all since Unicode normalization completely mangles it.
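The mangling is easy to reproduce in isolation. The Python sketch below is not MediaWiki's normalization code; it only assumes that invalid UTF-8 sequences in the binary value end up replaced with U+FFFD, which matches the `\ufffd` runs in the example output in the task description. The sample payload is made up.

```python
import gzip
import zlib

original = b'{"refs": ["Test reference"]}'
compressed = gzip.compress(original)

# Treat the binary blob as UTF-8 text, replacing invalid sequences with U+FFFD.
as_text = compressed.decode("utf-8", errors="replace")

# Going back to bytes does not restore the original gzip stream.
round_tripped = as_text.encode("utf-8")

try:
    gzip.decompress(round_tripped)
    recoverable = True
except (OSError, EOFError, zlib.error):
    # The gzip magic bytes were destroyed, so decompression fails.
    recoverable = False
```

Because the second byte of every gzip stream (0x8b) is not valid UTF-8 on its own, the header is damaged on the first pass and the client has no way to reconstruct the original bytes.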

@Anomie, I disagree that it is unrelated. My understanding is that most use cases for binary pageprops are actually to store large JSON, and having a standard mechanism to get these pageprops would be good. Also, I think we should introduce a pageprops2 that is not limited to 64KB, thus solving the related problem.

Yurik moved this task from Backlog to Tracking on the Graphs board. Feb 22 2016, 4:56 AM
Jdforrester-WMF triaged this task as Low priority.