[Task] EntityIdValues should be serialized as strings, not type/number structures.
Closed, Resolved · Public

Description

Currently, EntityIdValues are represented in JSON as pairs of entity types and numeric IDs. There is no easy way for a client to turn this into a prefixed ID or URL. Instead, we should always use the prefixed ID form (e.g. "Q1234") to represent an entity ID externally (and probably also in the internal serialization).

Of course, we must keep accepting the old form as input. Otherwise, we would be unable to process serializations from existing revisions.
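
A minimal sketch of what "keep accepting the old form" could look like on the reading side, written in Python for illustration only. The function name `to_prefixed_id` and the `PREFIXES` table are hypothetical and not part of Wikibase; the "Q" and "P" prefixes are the ones used for items and properties on wikidata.org.

```python
# Hypothetical mapping from entity type to ID prefix (an assumption of this
# sketch; a real repository may know more entity types).
PREFIXES = {"item": "Q", "property": "P"}


def to_prefixed_id(value):
    """Return the prefixed ID for an EntityIdValue in either the old or the new form."""
    if isinstance(value, str):
        # New form: the value already is the prefixed ID, e.g. "Q184".
        return value
    # Old form: {"entity-type": "item", "numeric-id": 184}
    entity_type = value["entity-type"]
    if entity_type not in PREFIXES:
        raise ValueError(f"unknown entity type: {entity_type!r}")
    return f"{PREFIXES[entity_type]}{int(value['numeric-id'])}"
```

With such a reader, serializations from existing revisions and newly written ones can be processed by the same code path.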

From T78294:
"unserializing the EntityIdValue (e.g. from memcached) from the entity type + numeric id format to an EntityId object has significant impact on performance and memory usage.

The unserialization uses the PropertyId::newFromNumber and ItemId::newFromNumber methods, which are memory intensive, with use of strtr (#5 in https://github.com/filbertkm/wb-profiling/blob/master/memory_own-itempurge-1.25wmf1.txt).

If we can somehow move away from entity type + numeric id format in EntityIdValue, I think that would be better."

https://github.com/wmde/WikibaseDataModel/issues/248

Details

Reference: bz54085

	Subject	Repo	Branch	Lines +/-
	Minimize EntityIdValue footprint when storing to the database	mediawiki/extensions/Wikibase	master	+179 -3


Related Objects

Status	Subtype	Assigned	Task
Declined		dchen	T118706 Conduct heuristic evaluation of image upload and insert flow in VisualEditor
Open		None	T115858 Design improvements for mw.ForeignStructuredUpload.BookletLayout
Open		None	T115865 Insert image in content immediately after it's uploaded, skipping the "General settings" step
Duplicate		None	T115864 Figure out if the description of the image can be used as the caption on-wiki
Open	Feature	None	T53032 When inserting an image, set its caption by default to be the Commons image description
Open	Feature	None	T39534 Wikimedia Commons should support searching by color
Duplicate		None	T39535 Wikimedia Commons should support filtering by color
Resolved		None	T19503 Provide metadata support on Wikimedia Commons
Resolved		None	T51662 VisualEditor: Use Multimedia/Wikidata's proposed rich structured meta-data in the image insertion dialog
Resolved		None	T68108 [Epic] Store media information for files on Wikimedia Commons as structured data
Duplicate		None	T66288 basic support for structured data on mediawiki files
Invalid		Lydia_Pintscher	T76012 make use of new entity type for multimedia / structured data of media files
Open		None	T109579 [Epic] Give more sister projects access to Wikidata
Open		None	T187900 There is no way to reference a specific quote on Wikiquote
Stalled		None	T71753 [Story] Wikibase / Wikidata support on Wikiquote
Open		None	T88728 Improve Wikimedia dumping infrastructure
Open		None	T88991 improve Wikidata dumps [tracking]
Open		None	T67626 [Epic] Support for queries on-wiki (automated list generation)
Resolved		Addshore	T76019 [Story] Support new types of Entities in Wikibase Client
Resolved		thiemowmde	T135650 [Task] Migrate PropertySuggester away from assuming all entities are numeric
Resolved		Addshore	T75496 [Epic] Support new types of Entities in Wikibase Repository
Invalid		None	T109969 Wikidata breaking API changes (tracking)
Resolved		thiemowmde	T56085 [Task] EntityIdValues should be serialized as strings, not type/number structures.
Resolved		Lydia_Pintscher	T92962 Find further breaking (API) changes that could be rolled out together with T56085
Resolved		• adrianheine	T93172 [Task] Adapt JS Serialization to new EntityIdValue serialization
Resolved		thiemowmde	T132592 Array representation of EntityIdValue should be based on ID serialization.

Event Timeline

• bzimport raised the priority of this task from to High.Nov 22 2014, 2:13 AM

• bzimport added projects: MediaWiki-extensions-WikibaseRepository, good first task.

• bzimport set Reference to bz54085.

• bzimport added a subscriber: Unknown Object (MLST).

daniel created this task.Sep 12 2013, 6:01 PM

Lydia_Pintscher added a project: Wikidata.Dec 1 2014, 2:33 PM

Lydia_Pintscher removed a subscriber: Unknown Object (MLST).

daniel added a parent task: T88991: improve Wikidata dumps [tracking].Feb 9 2015, 5:13 PM

daniel merged a task: T78294: Move away from entity type + numeric id format in EntityIdValue.Feb 25 2015, 12:25 PM

daniel updated the task description. (Show Details)

daniel set Security to None.

daniel added subscribers: Aklapper, Lucie.

Tobi_WMDE_SW added a subtask: T92961: [Story] Versioning in JSON output.Mar 17 2015, 2:45 PM

Qgil mentioned this in T77925: Wikidata PageBanner extension.Mar 19 2015, 9:36 AM

Does the serialization change need to be in EntitySerializer.php?
Also what could possibly be a new serialization process?
I'd like to work on fixing this, I'm familiar with mediawiki code, but not much with wikidata in particular, a little hint would get me started.

Sorry for taking so long to reply to you Sumit. I don't know why this was tagged easy. This is a rather invasive change needing a lot of coordination in other areas. So probably not a good one to start with. Sorry! Have you looked at some other tickets already?

Lydia_Pintscher removed a project: good first task.Apr 10 2015, 2:59 PM

daniel moved this task from incoming to ready to go on the Wikidata board.Apr 17 2015, 1:33 PM

JanZerebecki lowered the priority of this task from High to Medium.May 16 2015, 5:32 PM

• Jonas renamed this task from EntityIdValues should be serialized as strings, not type/number structures. to [Task] EntityIdValues should be serialized as strings, not type/number structures..Aug 15 2015, 12:44 PM

• Jonas updated the task description. (Show Details)

Addshore updated the task description. (Show Details)Aug 21 2015, 12:43 PM

Ricordisamoa subscribed.Aug 22 2015, 2:01 PM

Addshore added a parent task: T109969: Wikidata breaking API changes (tracking).Aug 23 2015, 9:33 AM

Addshore closed subtask T92962: Find further breaking (API) changes that could be rolled out together with T56085 as Resolved.

Jimkont subscribed.Sep 11 2015, 6:30 AM

daniel added a parent task: T75496: [Epic] Support new types of Entities in Wikibase Repository.Feb 5 2016, 12:36 PM

hoo subscribed.Feb 21 2016, 8:00 PM

daniel added a project: Wikidata-Sprint-2016-03-01.Feb 23 2016, 11:35 AM

We discussed this in today's story time. We decided this should be a blocker for …

…, erm, why are there so many tickets with the same intent? Anyway, we decided this must be considered to not break stuff twice. Why? Because instead of introducing a new property type (or even a new value type, which is what T101752 is about) it was suggested to make the existing EntityIdValue type more generic.

What is this about?

Look at https://www.wikidata.org/wiki/Special:EntityData/Q21632765.json and search for "numeric-id". That's what we want to get rid of. Here is the relevant JSON snippet, representing an EntityIdValue:

"value":{"entity-type":"item","numeric-id":184}

Original proposal: Turn into string

"value":"Q184"

Issues:

The moment we want it to support items from an external repository (e.g. Commons should use the wikidata.org items), this must change again into something like "value":"wikidata:Q184" or even "value":"http://www.wikidata.org/entity/Q184".
All ids must have a prefix then, even all internal ones, wasting massive amounts of bytes. Having ids with and without prefixes would be a bad hack, since this means : is not allowed in internal ids any more.
A string like wikidata:Q184 combines two facts in one field, violating a basic database design principle.

Today's proposal: Keep object

Note: I'm suggesting the new keys "repo" and "id" here. This can also be something else, e.g. "serialization".

"value":{"repo":"wikidata","id":"Q184"}

A format like this allows for a much more convenient migration path including a deprecation phase, which could look like this:

"value":{"repo":"wikidata","id":"Q184","entity-type":"item","numeric-id":184}

Advantages:

Makes it possible to add keys later, including full URIs and URLs (derived values, see T118860).
Migration is possible.
The repo key can be omitted for internal links, saving bytes.

Disadvantages:

Wastes bytes, compared to "value":"wikidata:Q184".
Users can mistakenly ignore the repo part (no matter if it's always there or omitted for internal links) and think this is a link to https://www.wikidata.org/wiki/Q184. This can't happen with URIs like wikidata:Q184.

Both proposals combined

"value":{"id":"wikidata:Q184"}

Issues:

Wastes bytes, compared to "value":"wikidata:Q184".
Again, it's not possible to omit the prefix.
We cannot start with "id":"Q184" and add the prefix later, since this would be another breaking change.

Tpt subscribed.Apr 8 2016, 12:44 PM

So, if I'm right, we need a task breakdown for this next, correct @thiemowmde?

We should have a decision on the format first. Personally I think a short discussion with @daniel, @adrianheine and me could work out.

+1 to proposal

Apparently there is an advantage to not fixing "something obviously bad" for over 3 years :)

@JeroenDeDauw, do you have a favorite of the three different proposals I collected?

Tobi_WMDE_SW added a project: Wikidata-Sprint-2016-04-26.Apr 12 2016, 9:06 AM

In T56085#2198084, @JeroenDeDauw wrote:

Apparently there is an advantage to not fixing "something obviously bad" for over 3 years :)

Yea, that way, we have had more time to think about how to make this change backwards compatible -- which we wouldn't need if we had gotten it right in the first place.

Hah, I did not realize there are 3 serious proposals here, and thought only "Today's proposal: Keep object" was being proposed. That's the one that got the +1 from me. The combined one also seems reasonable.

I also support the "keep object" proposal, for compatibility reasons. It's ugly though, and wouldn't be needed if we hadn't started with a serialization that exposes internals. We are stuck with using an object here I guess, but we can move away from exposing the internals. I guess it's the best option we have right now.

What about formatversion=2?

@daniel, even with that we have many options. Use URIs like "wikidata:Q1", or full URLs, or split this into two elements? Which array keys to use for that? Personally I strongly suggest to keep the serialized JSON as short as possible because we have millions of these links in the database. My ideal solution (after a migration phase): "value":{"id":"Q184"}.

Quick summary of a discussion between Adrian, Thiemo, and me today:

Agreement:

Keep the object structure for entityid values
Add a new "id" key for the serialized entity id
{ "id":"Q184", "entity-type":"item", "numeric-id":184 }
For internal storage, we can drop the old fields right away, and go to { "id":"Q184" }
{ "id":"Q184" } should also become the default for API output at some point. We may continue to support "entity-type" and "numeric-id" as an option.
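
The agreement above can be sketched as a small writer, shown here in Python for illustration only. The names `serialize_entity_id_value`, `include_legacy`, `TYPES` and `ID_PATTERN` are hypothetical; the sketch assumes only item ("Q") and property ("P") IDs with numeric parts and rejects anything else.

```python
import re

# Assumed prefix-to-type mapping, for illustration only.
TYPES = {"Q": "item", "P": "property"}
ID_PATTERN = re.compile(r"^([QP])([1-9][0-9]*)$")


def serialize_entity_id_value(entity_id, include_legacy=True):
    """Build the value object for a prefixed ID such as "Q184".

    include_legacy=True adds the old "entity-type" and "numeric-id" keys next
    to "id"; include_legacy=False gives the compact form for internal storage.
    """
    value = {"id": entity_id}
    if include_legacy:
        match = ID_PATTERN.match(entity_id)
        if match is None:
            # Choice made by this sketch: refuse IDs whose legacy fields
            # cannot be derived.
            raise ValueError(f"cannot derive legacy fields from {entity_id!r}")
        value["entity-type"] = TYPES[match.group(1)]
        value["numeric-id"] = int(match.group(2))
    return value
```

Calling it with `include_legacy=False` yields the `{ "id":"Q184" }` form, while the default keeps the old keys available during a deprecation phase.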

Note: we also want to support optional "url" and "uri" keys, but they should probably be treated as derived values, and not be part of the entity id value itself.

There are some questions that remain open regarding the support of external entity ids, for a federated setup. The changes outlined above can however be made without deciding the questions below.

The main question regarding external identifiers is: should a prefix that encodes the home repo of an entity be included in the id field? There are basically two options:

1. { "id":"foo:Q184" } optionally expanded to { "id":"foo:Q184", "repo":"foo" }. Local IDs would have the form "Q184" without a prefix.
2. { "id":"Q184", "repo":"foo" } optionally augmented with { "id":"Q184", "repo":"foo", "qname":"foo:Q184" }. The qname for a local ID would have the form mywiki:Q184, using a configurable prefix. An empty prefix could be allowed here, which would lead to the form :Q184 for local entities.

Arguments for option (1):

The "id" field in API output has the same form as the expected input for API parameters and URLs.
We want to use prefixed IDs internally (nearly) everywhere where we currently have non-prefixed IDs. The notion of entity ID would be extended to include entities in other repos.
Clients that do not expect to see references to external entities would fail early, since they would not be able to parse prefixed IDs.
On a repo that is not federated with other repos, nothing changes, except that IDs become available as strings.

Arguments against (1):

The parser needs to detect whether the ID has a prefix or not.
IDs may not contain a colon (or a colon would need encoding or escaping).

Arguments for (2):

The "id" field never has a prefix, the "qname" field always has a prefix.
Clear distinction between "IDs" and "references".

Arguments against (2):

Clients that do not know about external entities may read the "id" field as being local, even if they are not.
IDs with the local prefix need to be accepted as input everywhere, including as parts of the URL.
What's the canonical form of the ID: with or without prefix? Should the canonical URI also contain the prefix?

daniel created subtask T132592: Array representation of EntityIdValue should be based on ID serialization..Apr 13 2016, 3:30 PM

@daniel's summary misses an option:

1. { "id": "foo:Q184" }
   A. A local ID would still be { "id": "Q1" }. Main advantage: Repos like wikidata.org would not change at all when the software starts supporting such prefixed external IDs everywhere (not only in the data value we discuss here).
   B. Same, but local IDs would also be prefixed, e.g. { "id": "d:Q1" }. Advantage: The fact that IDs can be prefixed is explicit and obvious and cannot be ignored. That's also a disadvantage: User code will break even when the repo the user cares about does not have external IDs.
2. { "id": "Q184", "repo": "foo" }. Main disadvantage: This changes the meaning of the existing "id" field.

I tend to agree that option 1.A is the best.

We continued discussing edge cases that involve colons in entity IDs. Solution: Prefix local IDs with a colon when they contain colons, e.g. { "id": ":Q1:2" }. We can and I think we should have support for this right away when we start working on this.
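
The colon rule can be sketched as a pair of helpers, shown here in Python for illustration only. The names `split_entity_id` and `join_entity_id` are hypothetical, and splitting a foreign ID at its first colon is an assumption of this sketch, not something decided in this discussion.

```python
def split_entity_id(serialization):
    """Split a serialized ID into (repo, local_id); repo is "" for local IDs."""
    if serialization.startswith(":"):
        # A leading colon marks a local ID that itself contains colons.
        return "", serialization[1:]
    repo, separator, local_id = serialization.partition(":")
    if not separator:
        # No colon at all: a plain local ID such as "Q184".
        return "", serialization
    # Assumption: everything before the first colon is the repo prefix.
    return repo, local_id


def join_entity_id(repo, local_id):
    """Inverse of split_entity_id."""
    if repo:
        return f"{repo}:{local_id}"
    if ":" in local_id:
        # Local ID with colons: add the leading colon so it is not read as a prefix.
        return f":{local_id}"
    return local_id
```

With these helpers, "Q184", "foo:Q184" and ":Q1:2" each round-trip through `split_entity_id` and `join_entity_id` unchanged.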

daniel mentioned this in T133381: Add support for foreign entities to EntityId.Apr 22 2016, 12:10 PM

Tobi_WMDE_SW removed a subtask: T92961: [Story] Versioning in JSON output.Apr 26 2016, 1:09 PM

thiemowmde closed subtask T132592: Array representation of EntityIdValue should be based on ID serialization. as Resolved.Jul 20 2016, 7:58 AM

Change 299963 had a related patch set uploaded (by Thiemo Mättig (WMDE)):
Minimize EntityIdValue footprint when storing to the database

https://gerrit.wikimedia.org/r/299963

gerritbot added a project: Patch-For-Review.Jul 20 2016, 10:06 AM

• Jonas closed subtask T93172: [Task] Adapt JS Serialization to new EntityIdValue serialization as Resolved.Aug 3 2016, 10:59 AM

The issue this ticket describes was actually solved with https://github.com/wmde/WikibaseDataModel/pull/671. There are a few other patches like the linked https://gerrit.wikimedia.org/r/299963, but these are optional optimizations and cleanups.

thiemowmde mentioned this in T202676: Customize Item Identifier prefix (currently: Q).Aug 24 2018, 8:34 AM

Change 299963 abandoned by Thiemo Kreuz (WMDE):
Minimize EntityIdValue footprint when storing to the database