Page MenuHomePhabricator

gzip-encoded page properties can't be exported from the API
Open, MediumPublic

Description

Event Timeline

As an example, PHP var_export says the raw value is:

'kartographer' => ' � ' . "\0" . '' . "\0" . '' . "\0" . '' . "\0" . '' . "\0" . '' . "\0" . ' E�� �0 E�e�HJKm� \\�7��1 "P2� ���e�nϹ��pW�h!�uZ��y�E�}� 3��ʣc�83!� ���}�GC �S �@ 3&�g�R�>̼���' . "\0" . '' . "\0" . '' . "\0" . '',

but the value exported by the API for this is:

"kartographer": "\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffdE\ufffd\ufffd\ufffd\ufffd0\ufffdE\ufffde\ufffdHJKm\ufffd\ufffd\\\ufffd7\ufffd\ufffd1\ufffd\"P2\ufffd\ufffd\ufffd\ufffd\ufffde\ufffdn\u03f9\ufffd\ufffdpW\ufffdh!\ufffduZ\ufffd\ufffdy\ufffdE\ufffd}\ufffd\ufffd3\ufffd\ufffd\u02a3\u436e\ufffdU\ufffd\ufffdv\ufffd\u06c2\ufffd\ufffd\ufffdM|\ufffdB\ufffd\ufffd\ufffd\ufffdF\u06a0\ufffd\ufffdk\ufffd\u6615\ufffd\ufffd\ufffd0\ufffd\ufffdk\ufffd\ufffdR\ufffdJ\ufffd\ufffdY\ufffd\ufffdn\ufffd\ufffdR\ufffd\ufffd\ufffd\ufffd)\ufffd\ufffd`\ufffd\ufffd\ufffd\ufffd\ufffdn\ufffd\rc\ufffd83!\ufffd\ufffd\ufffd\ufffd}\ufffdGC\ufffd\ufffdS\ufffd\ufffd@\ufffd3&\ufffdg\ufffdR\ufffd>\u033c\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd",

Note that all the \0 have become \ufffd.

Presumably this also is broken for TemplateData when that page prop is gzipped?

Anomie subscribed.

The API isn't really designed to return binary data. I'm not sure whether client libraries would handle it properly or if they too would somehow blow up when trying to process a non-UTF8 "string" as UTF8.

For a specific result property, I'd recommend encoding binary data before adding it to ApiResult. For example, ApiQueryCategories uses bin2hex() when returning category sortkeys. base-32 and base-64 encodings would also be decent options. For compressed data, I'd recommend just uncompressing it.

But it sounds like here you're talking about generic page properties, where we can't easily know whether each one is text, compressed text, or binary data. On the other hand, anything storing compressed or binary data into page_props is already broken on PostgreSQL and (I'm pretty sure) MSSQL, for much the same reason.

Probably the most straightforward solution would be to deprecate the "properties" value in ApiParse.php and replace it with a "binProperties" value which is the bin2hex'ed value of each property. That would ensure that there was some means of exporting even binary page properties, such as those used by Kartographer, TemplateData, Graph, etc.

OTOH, that would be a negative for the majority of properties that aren't shoving binary data into page_props.

Yes, true. But we need some way to disambiguate between returning \xFF meaning the literal byte 255 and \u00FF meaning the UTF-8 byte sequence denoting codepoint 255.

As a hack, I suppose we could look for the magic bytes denoting gzip compression, and switch to binhex if those are found. That should be a narrow fix and wouldn't change the output for properties which are already being exported properly.

IMO the thing to do for the API is probably twofold:

  1. Have ApiResult::validateValue() refuse strings that are not valid UTF-8. We can detect that by seeing if the normalized string has more � in it than the input string.
  1. Start returning page properties that are not valid UTF-8 encoded somehow, e.g.
"pageprops": {
    "noeditsection": "",
    "displaytitle": "<b>Title!</b>",
    "binaryprop": {
        "encoding": "base64",
        "data": "WW91J2xsIHN0aWxsIGhhdmUgdG8gZml4IHRoaW5ncyBmb3IgUG9zdGdyZVNRTCB0aG91Z2gu"
    }
}

That avoids ambiguity over whether a string "666F6F" is supposed to mean "666F6F" or "foo", no matter how unlikely the former might be. The encoding should be done in a way that can be easily reused in case anything else needs the same treatment.

I'll probably put code for that in Gerrit tomorrow if there are no objections.

Anomie moved this task from Needs details or plan to Needs Review on the MediaWiki-Action-API board.

I'll probably put code for that in Gerrit tomorrow if there are no objections.

I had the code all ready, and forgot to ever push it to Gerrit. :(

Change 478035 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] API: Better handle binary page props

https://gerrit.wikimedia.org/r/478035

We have had to deal with this in EventBus as well, thes page properties are not JSON-encodable. We ended up with a hack like this.

A question I would like to raise is whether we need to hack around binary page props? I have only seen that issue in Kartographer, and we have an event that JSON serializes all the page properties changes. Do we know if it's widely used? Could we just fix Kartographer, deprecate and then prohibit binary page props and remove all the hacks?

I suspect the same issue is in Graphoid - it pulls the same bin data blobs via api

A question I would like to raise is whether we need to hack around binary page props? I have only seen that issue in Kartographer, and we have an event that JSON serializes all the page properties changes. Do we know if it's widely used? Could we just fix Kartographer, deprecate and then prohibit binary page props and remove all the hacks?

Yeah, I'm starting to come around to prohibiting them as well. Ultimately, we need to decide whether MediaWiki intends to support binary data in page_props. I think the current expectation is that it doesn't, and people have just made it work in certain places as needed. My recollection from T53740 (I believe the first user of binary page_props) was that was done in a bit of a rush and I'm not sure whether it's been re-evaluated since then. My understanding from the MCR RfC is that the long-term plan is T56140: Move TemplateData to its own JSON-content namespace and associate with Template-namespace, or to its own TemplateData content model and revision slot, moving TemplateData out of page_props and into its own slot.

So that leaves Graph and Kartographer which need some migration plan. Anything else?

I think @Anomie's patch adding support to the API is fine for now, as long as its explicitly a temporary thing that we only plan on supporting until current users can be migrated away (following deprecation practices, etc.).

Also, I think T127225: gzipped page props might be a duplicate of this one.

I agree with @Legoktm -- storing data blobs in page props was a hack. But to my knowledge, there is no good alternative storage for the parser-generated blobs. Essentially any system that needs independent access to those blobs would require something like this, essentially solving T119043

I need to find time to look back at the patch.

@Catrope just pushed two changes that may remove the use cases that trigger this problem (though they don't fix the problem in principle):
https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Graph/+/530767
https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Kartographer/+/531159

TheDJ subscribed.

I've removed Kartograper as a project, since it is no longer affected. Keeping this open per "don't fix the problem in principle" comment.

AMooney removed a subscriber: Anomie.
Aklapper changed the task status from Stalled to Open.Oct 19 2020, 4:33 PM

The previous comments don't explain who or what (task?) exactly this task is stalled on ("If a report is waiting for further input (e.g. from its reporter or a third party) and can currently not be acted on"). Hence resetting task status.

(Smallprint, as general orientation for task management: If you wanted to express that nobody is currently working on this task, then the assignee should be removed and/or priority could be lowered instead. If work on this task is blocked by another task, then that other task should be added via Edit Related Tasks...Edit Subtasks. If this task is stalled on an upstream project, then the Upstream tag should be added. If this task requires info from the task reporter, then there should be instructions which info is needed. If this task needs retesting, then the TestMe tag should be added. If this task is either out of scope and nobody should ever work on this, or nobody else managed to reproduce the problem described in this task, then this task should have the "Declined" status. If the task is valid but should not appear on some team's workboard, then the team project tag should be removed while the task has another active project tag.)

Change 478035 abandoned by Umherirrender:

[mediawiki/core@master] API: Better handle binary page props

Reason:

Old outdated patch set

https://gerrit.wikimedia.org/r/478035