[Story] Ensure feature parity of serialization based on WikibaseDataModelSerialization with what we do with WikibaseLib
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	Tobi_WMDE_SW
	Sep 23 2014, 7:59 AM

Description

(Description updated after a discussion between Jan, Katie, Thiemo, and Daniel 2015-05-11)

Wikibase API modules currently use the old external serialization code in WikibaseLib, together with some code in ResultBuilder, to represent Entities and associated information in API results. If we want to ditch the old serialization code, we need to make sure we can cover the old functionality based on the new WikibaseDataModelSerialization module.

Requirements for feature parity are:

include the property data type in Property(Value)Snaks. The main issue here is avoiding circular dependencies between the serializer and the service interfaces. See T98850
grouped vs ungrouped output of Statements, Qualifiers, and Reference Snaks. Or decide we want to ditch the ungrouped representation, see T78653.
forcing (some) maps to lists (T98857)
- we may no longer need this for the XML output, if "indexTagName" now overrides explicit array keys in ApiResult
- currently, there is an option to force maps to objects - that should probably always be done, to avoid issues with the representation of empty maps in JSON (T98860)
injecting "indexTagName" and other meta-info for the MediaWiki API serializer. Hard coded knowledge about these markers could be kept out of the serializer by using some sort of callback logic. See T78652
maintain ordered maps: since JSON does not specify the order of entries in a map, we need to make that order explicit, since it should at least be stable. (T98861)
inject revision info (revision id, timestamp, page title). This is mediawiki specific, the serializer shouldn't know the meaning of such meta-data fields. (T98862)
deletion markers for deleted elements in API responses. We may want to implement this by representing deletions in the DataModel somehow. T98863

Additional features we want to support in the future, but should already consider when thinking abotu a solution for the issues above:

Derived values in Snaks (normalized quantities, expanded external IDs, etc; see T89005). Could be represented in the model in a way similar to TermFallback: ExtendedPropertyValueSnak would extend PropertyValueSnak (and perhaps implement a marker interface, to safeguard against serialization into the database). Note: derived values should not be supported as input to EditEntity, etc.
Filtered representations (terms filtered by language, sitelinks filtered by family or language, etc). Filtering can be done before serialization, but should be marked in the data model and in the serialization, to avoid data loss during round trips. (T73512)
Represent incoming redirects as "secondary ids" in the serialized entity data structure (T98039)

For implementing the features above, there are several options:

Add knowledge to the serializer
Wrap another set of serializers around the basic serializers
Post-process the resulting array structure to apply the necessary changes
Represent them in the data model explicitly
Layer another representation model on top of the basic data model

For practical reasons, things that only change the way the data is represented, should probably be implemented inside the serializer (Option 1 or 2, possibly 3 if there is a very good reason). Features that add or remove information (such a filtering or language fallback) should be represented in the model (options 4 or 5). Care must be taken to avoid serializing such derivative data into the database, or accepting it as input of edits.

Details

Reference: bz71170

Related Objects
Search...

View Standalone Graph

This task is connected to more than 200 other tasks. Only direct parents and subtasks are shown here. Use View Standalone Graph to show more of the graph.

Status	Assigned	Task
		· · ·
Resolved	Addshore	T64188 Replace old serialization code in lib with datamodel serialization
Resolved	Lydia_Pintscher	T73170 [Story] Ensure feature parity of serialization based on WikibaseDataModelSerialization with what we do with WikibaseLib
Resolved	JanZerebecki	T74183 New serialization code needs to support language fallback
Open	None	T73512 [Task] DataModelSerialization: create filter by language that can be applied to objects before passing to serialization
Declined	Addshore	T78653 DataModelSerialization: optionally (?!) group statements and snaks.
Resolved	Addshore	T78651 Add integration tests that ensure compatibility
Resolved	Lydia_Pintscher	T78652 DataModelSerialization: Serializers to be used with the MediaWiki API need to inject meta-info like _indexTagNames, to allow XML output from API modules.
Open	None	T98039 Include incoming redirects in JSON representation of entities (tracking)
Invalid	None	T98850 [Task] DataModelSerialization: include the property data type in Property*Snaks.
Invalid	None	T98857 [Task] DataModelSerialization: option to force maps to lists, if needed
Invalid	None	T98860 [Task] DataModelSerialization: always force maps to be represented as stdclass objects when generating JSON output.
Invalid	None	T98861 [Task] DataModelSerialization: represent map order in JSON output
Invalid	None	T98862 [Task] DataModelSerialization: allow injection of meta-data like the revision id
Declined	Bene	T98863 DataModelSerialization: The serializer should allow marking specific parts of the data as "deleted".
Resolved	Bene	T104380 Release Wikibase DataModel Serialization 1.5
Open	None	T89005 [Story] JSON should (optionally) contain expanded/normalized values.
		· · ·

Event Timeline

• bzimport raised the priority of this task from to High.Nov 22 2014, 3:48 AM

• bzimport added a project: MediaWiki-extensions-WikibaseRepository.

• bzimport set Reference to bz71170.

• bzimport added a subscriber: Unknown Object (MLST).

Tobi_WMDE_SW created this task.Sep 23 2014, 7:59 AM

It seems it is also missing the ability to represent removed Terms: search for removed in serializeMultilingualValues() in Wikibase.git/lib/includes/serializers/MultilingualSerializer.php .

Things that the old serializer supported (or should have supported), and need to be possible using the new model and serializer too:

Filtering (terms by language, sitelinks by site, statements by rank)
Apply language fallback (and put info about the fallback in the terms)
optionally group claims by property
optionally use lists instead of maps that use IDs as keys (to cater to the quirks of ApiFormatXml).
deletion markers used in responses of API modules like RemoveClaim

In some cases, we want to (optionally) inject extra information into the serialization (or presentational model), e.g.:

the DataType of PropertyValueSnaks
(future) full URIs for external identifiers
(future) quantity values converted to base units

We should try to avoid putting such "derivative" versions of our data back into the database, as this would constitute data loss and/or create confusion (especially in the case of automatic transliteration).

Another question is if and how "derivative" entity information can and should be represented by our data model. We should have a spec that makes a clear distinction between the "core" data model, and "representational" or "informational" derivatives.

PS: We also need a way to represent order explicitly when using id based maps instead of lists (for statements, qualifiers in a claim, references in a statement, and snaks in a reference). This is part of the core model, but was not addressed by the old serializer either.

After some discussion with Jan and Adrian, some points became clear:

We want to implement language fallback and filtering on the model level, not in the serializer
Other things however, like grouping of statements, or associative vs. indexed arrays, have to be implemented in the serializer (a flag to the serializer factory could do the trick)
Presentation-layer concepts can be represented in the datamodel using subclassing (e.g. TermWithLanguageFallback extends Term).
It would be good to include a version number in serializer output
Entities "tainted" by fallback, filtering, etc are not a big issue in practice, because EditEntity only adds/replaces entries per default.
EditEntity still needs to fail on terms with fallback info, to avoid writing automatic transliterations to the database.

In general, it became clear that the serialization we use in the database will often not be exactly the same as the one we use in the database. For instance, the serialization in the database does not contain the data-type in snaks, nor should it in the future contain things like expanded URIs of external IDs or converted quantity values.

The question whether we can always group statements by property, or whether we want to retain the option to ouput flat lists of Statements, remained open. It would be nicer if we could always group.

Another general consideration: we want our output format to stay relatively stable, and it should be easy to use directly, without the need for specialized data model libraries. While it would be nice to have libraries for serialization and representing our data model in multiple languages, we currently do not supply those. As long as we don't, we have to assume people operate on the raw data structures.

» we use in the database will often not be exactly the same as the one we use in the database.«?

I agree that we have to consider users which don't have a data model implementation. I just want the data model implementations we have to be able to provide the things we provide in the serialization.

Of course i meant "the serialization we use in the database will often not be exactly the same as the one we use in the API".

I agree with Adrian that our data model should generally be able to represent everything our serializer puts into the output.

Another aspect that came up is the conceptual separation of the "core" data model (representing the knowledge stored in the wiki directly) vs. the "presentational" model (with fallback, filtering, and expansion). A marker interface may work, but that's not very nice (think LSP). The cleanest solution would be to duplicate the entire data model class hierarchy, but that's massive overhead, hard to maintain, and potentially confusing. The best we can do for now seems to be to clearly document which classes and fields are part of the core model, and which are presentational.

Lydia_Pintscher added a project: Wikidata.Dec 1 2014, 2:29 PM

Lydia_Pintscher removed a subscriber: Unknown Object (MLST).

Lydia_Pintscher removed a subscriber: Unknown Object (MLST).Dec 1 2014, 2:33 PM

adrianheine closed subtask T74183: New serialization code needs to support language fallback as Resolved.Jan 29 2015, 8:25 AM

daniel updated the task description. (Show Details)May 11 2015, 1:02 PM

daniel set Security to None.

daniel updated the task description. (Show Details)

daniel updated the task description. (Show Details)May 11 2015, 1:09 PM

daniel updated the task description. (Show Details)May 11 2015, 1:12 PM

daniel updated the task description. (Show Details)

thiemowmde added subtasks: T78653: DataModelSerialization: optionally (?!) group statements and snaks., T78651: Add integration tests that ensure compatibility, T78652: DataModelSerialization: Serializers to be used with the MediaWiki API need to inject meta-info like _indexTagNames, to allow XML output from API modules..May 11 2015, 4:50 PM

daniel updated the task description. (Show Details)May 12 2015, 1:08 PM

Bene subscribed.May 12 2015, 1:12 PM

daniel updated the task description. (Show Details)May 12 2015, 1:21 PM

daniel added a subtask: T98039: Include incoming redirects in JSON representation of entities (tracking).May 12 2015, 1:29 PM

daniel updated the task description. (Show Details)May 12 2015, 1:36 PM

daniel updated the task description. (Show Details)

daniel updated the task description. (Show Details)May 12 2015, 1:47 PM

daniel updated the task description. (Show Details)May 12 2015, 4:04 PM

daniel updated the task description. (Show Details)

daniel updated the task description. (Show Details)May 12 2015, 4:07 PM

daniel added a subtask: T89005: [Story] JSON should (optionally) contain expanded/normalized values..May 12 2015, 4:11 PM

daniel added a subtask: T98863: DataModelSerialization: The serializer should allow marking specific parts of the data as "deleted"..May 12 2015, 4:13 PM

I must say I'm a bit confused by this tracking ticket. It says it's about "feature parity" and blocks T64188: Replace old serialization code in lib with datamodel serialization, but contains proposals for what sound like actual breaking changes to me.

Did the old serializer "include the property data type in Property(Value)Snaks" and how did it do that when the new serializer can't?
"ditch the ungrouped representation" means a breaking change, right?
"we need to make that order explicit" means introducing a new feature that currently does not exist in the serialization, right?
Is "inject revision info" something the old serializer does and how does it do that?
"deletion markers" are another non-existing feature, right? How is this relevant for this ticket?

As discussed I'm arguing that the serializers are creating a specific serialization anyway and adding some knowledge about markers (as strings, not as an actual MediaWiki dependency) is ok in my opinion.

Wrapping a second "datamodel" around the current one sounds fascinating, but I can't think of a concrete way to implement this.

In T73170#1282773, @thiemowmde wrote:

Did the old serializer "include the property data type in Property(Value)Snaks" and how did it do that when the new serializer can't?

Yes, SnakSerializer gets a PropertyDataTypeLookup injected.

"ditch the ungrouped representation" means a breaking change, right?

Ditching this option is mentioned as an alternative to maintaining full party, namely, offering an api parameter to switch between the grouped and ungrouped mode.

"we need to make that order explicit" means introducing a new feature that currently does not exist in the serialization, right?

I think we had that at one point, but I'm not quite sure. It's an aspect we should keep in mind though.

Is "inject revision info" something the old serializer does and how does it do that?

ResultBuilder injects this into the ApiResult. It would be nice to integrate this better with serialization, but it's not technically a blocker.

"deletion markers" are another non-existing feature, right? How is this relevant for this ticket?

We do this at least for removed sitelinks, see ResultBuilder::addSiteLinks, using EntitySerializer::OPT_PARTS, => "sitelinks/removed". I seem to recall we did this in more places.

We also already do normalized/expanded data, with EntitySerializer::OPT_PARTS => "sitelinks/urls"

As discussed I'm arguing that the serializers are creating a specific serialization anyway and adding some knowledge about markers (as strings, not as an actual MediaWiki dependency) is ok in my opinion.

Yes, incorporating some "magical" knowledge may not be the cleanest, but i nthe end the sanest way to go about this. Though I'm still considering post-processing for injecting the indextTagName stuff.

JeroenDeDauw subscribed.May 27 2015, 4:30 PM

Keep in mind that serializers take the thing to serialize and give you the serialization of that thing. And nothing more. You could construct an "entity serializer" that takes an entity id and then fetches the entity. This fetching is an added and misplaced responsibility. The same goes for adding additional information to whatever is passed in based on queries to services. This does not mean you cannot create a serializer with such an output, it just means the fetching of additional stuff should be done before using the serializer, not during.

Related to this: Wikibase DataModel Serialization is for serialization of DataModel objects. Things such as MediaWiki revision IDs are thus out of scope. Again, this does not mean such things cannot be serialized. It just means Wikibase DataModel Serialization is the wrong place to put such code.

It seems like a bunch of blockers where added that have little to do with feature parity between the current code in Lib and the Wikibase DataModel Serialization code. I recommend to keep moving away from the legacy code and implementing new functionality separate.

@JeroenDeDauw In theory, you are right: the serializer should just serialize whetever they are given. In practice, this means adding a second, rather complex, data model on top of the core data model, or complicating the core data model. That's a lot of work, and the modeling isn't easy to get right either. We explicitly discussed this option, but decided that it's not worth the pain to keep the serializers "clean".

@JeroenDeDauw You can also see it this way: feature parity is for model+serializer. If we move all the options and modes into the model, we need something to build the "output model" from the "core model", and that converter thing would need to do all the things described in the blockers.

Going this way, feature parity would be blocked on creating the "output model" and the corresponding converter, which would in turn be blocked on all the features mentioned here. So you'd add at least two more blockers, and the need for a few additional data structures. This does not strike me as a good thing.

Sure, code that relies on modes and switches is annoying. And serializers should be dumb. But if upholding these principles 100% causes a lot of overhead, we should probably go for a balance point closer to the energetic minimum...

I have edited the description to remove the implication that *all* these features have to be implemented *inside* WikibaseDataModelSerialization.

@daniel: I do not see how we need to be "adding a second, rather complex, data model on top of the core data model". For the example provided, having a serialization of a property value snak with added data type, all you need is a simple typed snak object. One such value object is neither complex nor a duplication of the whole model, and much less icky than introducing interfaces for non-serialization services in a serialization component.

Sure, code that relies on modes and switches is annoying. And serializers should be dumb. But if upholding these principles 100% causes a lot of overhead, we should probably go for a balance point closer to the energetic minimum...

I'm not sure if that is meant to be a reply to anything I said. Just in case: I agree switches are not nice though that the most pragmatic trade-of will probably involve some. In fact, there already is one: https://github.com/wmde/WikibaseDataModelSerialization/blob/master/src/SerializerFactory.php#L50

We'd need at least:

Term with fallback (already exists)
Snak with DataType
Snak with extra DataValues (normalized/expanded)
Grouped StatementList and SnakList
Link, Term, Statement, etc with a "deleted" flag
LinkList, TermList, StatementList, etc with a "filtered" flag
Entities with secondary IDs (redirects)

All of them should be acceptable for deserialization, but should not be acceptable input for EditEntity and friends, since the information is derived or imposed by filtering/sorting parameters. We can go this route, shifting some of the required changes from the serializer component to the datamodel component. But the serializers still need to support the new types, and we are still blocked on implementing all this before we can ditch the old serializer.

On top of the above, we still need the serializer itself, or the ResultBuilder, to support:

empty maps (not lists) as objects
indexedTagName markers (I think post-processing is ok for this)
injection of version info
maintaining order of map entries

Bene added a subtask: T104380: Release Wikibase DataModel Serialization 1.5.Jun 30 2015, 8:04 PM

Bene closed subtask T104380: Release Wikibase DataModel Serialization 1.5 as Resolved.Jul 1 2015, 7:12 PM

Ricordisamoa subscribed.Jul 16 2015, 3:58 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJul 16 2015, 3:58 PM

Bene closed subtask T98863: DataModelSerialization: The serializer should allow marking specific parts of the data as "deleted". as Declined.Jul 28 2015, 4:37 PM

Addshore closed subtask T78653: DataModelSerialization: optionally (?!) group statements and snaks. as Declined.Jul 29 2015, 9:35 AM

Addshore closed subtask T78651: Add integration tests that ensure compatibility as Resolved.Aug 5 2015, 9:11 AM

Lydia_Pintscher closed subtask T78652: DataModelSerialization: Serializers to be used with the MediaWiki API need to inject meta-info like _indexTagNames, to allow XML output from API modules. as Resolved.Aug 7 2015, 4:00 PM

• Jonas renamed this task from Ensure feature parity of serialization based on WikibaseDataModelSerialization with what we do with WikibaseLib to [Story] Ensure feature parity of serialization based on WikibaseDataModelSerialization with what we do with WikibaseLib.Aug 15 2015, 12:27 PM

JanZerebecki removed a subtask: T89005: [Story] JSON should (optionally) contain expanded/normalized values..Aug 16 2015, 9:20 AM

JanZerebecki added a subtask: T89005: [Story] JSON should (optionally) contain expanded/normalized values..

JanZerebecki mentioned this in T89005: [Story] JSON should (optionally) contain expanded/normalized values..Aug 16 2015, 9:27 AM

JanZerebecki moved this task from incoming to needs discussion or investigation on the Wikidata board.Sep 10 2015, 10:24 PM

JanZerebecki added a project: Story.Sep 25 2015, 8:28 PM

@daniel Can this be closed? I assume this is no longer relevant by now.

Lydia_Pintscher closed subtask T98850: [Task] DataModelSerialization: include the property data type in Property*Snaks. as Invalid.Apr 23 2017, 4:50 PM

Lydia_Pintscher closed subtask T98860: [Task] DataModelSerialization: always force maps to be represented as stdclass objects when generating JSON output. as Invalid.

Lydia_Pintscher closed subtask T98857: [Task] DataModelSerialization: option to force maps to lists, if needed as Invalid.

Lydia_Pintscher closed subtask T98861: [Task] DataModelSerialization: represent map order in JSON output as Invalid.Apr 23 2017, 4:53 PM

Lydia_Pintscher closed subtask T98862: [Task] DataModelSerialization: allow injection of meta-data like the revision id as Invalid.