Page MenuHomePhabricator

gzip-encoded page properties can't be exported from the API
Open, NormalPublic

Description

Event Timeline

cscott created this task.Nov 27 2018, 9:27 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptNov 27 2018, 9:27 PM

As an example, PHP var_export says the raw value is:

'kartographer' => ' � ' . "\0" . '' . "\0" . '' . "\0" . '' . "\0" . '' . "\0" . '' . "\0" . ' E�� �0 E�e�HJKm� \\�7��1 "P2� ���e�nϹ��pW�h!�uZ��y�E�}� 3��ʣc�83!� ���}�GC �S �@ 3&�g�R�>̼���' . "\0" . '' . "\0" . '' . "\0" . '',

but the value exported by the API for this is:

"kartographer": "\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffdE\ufffd\ufffd\ufffd\ufffd0\ufffdE\ufffde\ufffdHJKm\ufffd\ufffd\\\ufffd7\ufffd\ufffd1\ufffd\"P2\ufffd\ufffd\ufffd\ufffd\ufffde\ufffdn\u03f9\ufffd\ufffdpW\ufffdh!\ufffduZ\ufffd\ufffdy\ufffdE\ufffd}\ufffd\ufffd3\ufffd\ufffd\u02a3\u436e\ufffdU\ufffd\ufffdv\ufffd\u06c2\ufffd\ufffd\ufffdM|\ufffdB\ufffd\ufffd\ufffd\ufffdF\u06a0\ufffd\ufffdk\ufffd\u6615\ufffd\ufffd\ufffd0\ufffd\ufffdk\ufffd\ufffdR\ufffdJ\ufffd\ufffdY\ufffd\ufffdn\ufffd\ufffdR\ufffd\ufffd\ufffd\ufffd)\ufffd\ufffd`\ufffd\ufffd\ufffd\ufffd\ufffdn\ufffd\rc\ufffd83!\ufffd\ufffd\ufffd\ufffd}\ufffdGC\ufffd\ufffdS\ufffd\ufffd@\ufffd3&\ufffdg\ufffdR\ufffd>\u033c\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd",

Note that all the \0 have become \ufffd.

Presumably this also is broken for TemplateData when that page prop is gzipped?

Anomie added a subscriber: Anomie.

The API isn't really designed to return binary data. I'm not sure whether client libraries would handle it properly or if they too would somehow blow up when trying to process a non-UTF8 "string" as UTF8.

For a specific result property, I'd recommend encoding binary data before adding it to ApiResult. For example, ApiQueryCategories uses bin2hex() when returning category sortkeys. base-32 and base-64 encodings would also be decent options. For compressed data, I'd recommend just uncompressing it.

But it sounds like here you're talking about generic page properties, where we can't easily know whether each one is text, compressed text, or binary data. On the other hand, anything storing compressed or binary data into page_props is already broken on PostgreSQL and (I'm pretty sure) MSSQL, for much the same reason.

Probably the most straightforward solution would be to deprecate the "properties" value in ApiParse.php and replace it with a "binProperties" value which is the bin2hex'ed value of each property. That would ensure that there was some means of exporting even binary page properties, such as those used by Kartographer, TemplateData, Graph, etc.

OTOH, that would be a negative for the majority of properties that aren't shoving binary data into page_props.

Yes, true. But we need some way to disambiguate between returning \xFF meaning the literal byte 255 and \u00FF meaning the UTF-8 byte sequence denoting codepoint 255.

As a hack, I suppose we could look for the magic bytes denoting gzip compression, and switch to binhex if those are found. That should be a narrow fix and wouldn't change the output for properties which are already being exported properly.

IMO the thing to do for the API is probably twofold:

  1. Have ApiResult::validateValue() refuse strings that are not valid UTF-8. We can detect that by seeing if the normalized string has more � in it than the input string.
  1. Start returning page properties that are not valid UTF-8 encoded somehow, e.g.
"pageprops": {
    "noeditsection": "",
    "displaytitle": "<b>Title!</b>",
    "binaryprop": {
        "encoding": "base64",
        "data": "WW91J2xsIHN0aWxsIGhhdmUgdG8gZml4IHRoaW5ncyBmb3IgUG9zdGdyZVNRTCB0aG91Z2gu"
    }
}

That avoids ambiguity over whether a string "666F6F" is supposed to mean "666F6F" or "foo", no matter how unlikely the former might be. The encoding should be done in a way that can be easily reused in case anything else needs the same treatment.

I'll probably put code for that in Gerrit tomorrow if there are no objections.

Mholloway added a subscriber: Mholloway.
Anomie claimed this task.Dec 6 2018, 5:44 PM
Anomie moved this task from Needs details or plan to Needs Review on the MediaWiki-API board.

I'll probably put code for that in Gerrit tomorrow if there are no objections.

I had the code all ready, and forgot to ever push it to Gerrit. :(

Change 478035 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] API: Better handle binary page props

https://gerrit.wikimedia.org/r/478035

We have had to deal with this in EventBus as well, thes page properties are not JSON-encodable. We ended up with a hack like this.

A question I would like to raise is whether we need to hack around binary page props? I have only seen that issue in Kartographer, and we have an event that JSON serializes all the page properties changes. Do we know if it's widely used? Could we just fix Kartographer, deprecate and then prohibit binary page props and remove all the hacks?

Yurik added a subscriber: Yurik.Dec 7 2018, 10:02 PM

I suspect the same issue is in Graphoid - it pulls the same bin data blobs via api

A question I would like to raise is whether we need to hack around binary page props? I have only seen that issue in Kartographer, and we have an event that JSON serializes all the page properties changes. Do we know if it's widely used? Could we just fix Kartographer, deprecate and then prohibit binary page props and remove all the hacks?

Yeah, I'm starting to come around to prohibiting them as well. Ultimately, we need to decide whether MediaWiki intends to support binary data in page_props. I think the current expectation is that it doesn't, and people have just made it work in certain places as needed. My recollection from T53740 (I believe the first user of binary page_props) was that was done in a bit of a rush and I'm not sure whether it's been re-evaluated since then. My understanding from the MCR RfC is that the long-term plan is T56140: Move TemplateData to its own JSON-content namespace and associate with Template-namespace, or to its own TemplateData content model and revision slot, moving TemplateData out of page_props and into its own slot.

So that leaves Graph and Kartographer which need some migration plan. Anything else?

I think @Anomie's patch adding support to the API is fine for now, as long as its explicitly a temporary thing that we only plan on supporting until current users can be migrated away (following deprecation practices, etc.).

Also, I think T127225: gzipped page props might be a duplicate of this one.

Yurik added a comment.Dec 20 2018, 9:12 PM

I agree with @Legoktm -- storing data blobs in page props was a hack. But to my knowledge, there is no good alternative storage for the parser-generated blobs. Essentially any system that needs independent access to those blobs would require something like this, essentially solving T119043

daniel added a subscriber: daniel.Apr 1 2019, 11:35 AM

Patch has comments. Now what?

Anomie added a comment.Apr 2 2019, 4:03 PM

I need to find time to look back at the patch.

daniel added a subscriber: Catrope.Tue, Aug 20, 1:33 PM

@Catrope just pushed two changes that may remove the use cases that trigger this problem (though they don't fix the problem in principle):
https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Graph/+/530767
https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Kartographer/+/531159