Page MenuHomePhabricator

[Story] Ensure feature parity of serialization based on WikibaseDataModelSerialization with what we do with WikibaseLib
Closed, ResolvedPublic

Description

(Description updated after a discussion between Jan, Katie, Thiemo, and Daniel 2015-05-11)

Wikibase API modules currently use the old external serialization code in WikibaseLib, together with some code in ResultBuilder, to represent Entities and associated information in API results. If we want to ditch the old serialization code, we need to make sure we can cover the old functionality based on the new WikibaseDataModelSerialization module.

Requirements for feature parity are:

  • include the property data type in Property(Value)Snaks. The main issue here is avoiding circular dependencies between the serializer and the service interfaces. See T98850
  • grouped vs ungrouped output of Statements, Qualifiers, and Reference Snaks. Or decide we want to ditch the ungrouped representation, see T78653.
  • forcing (some) maps to lists (T98857)
    • we may no longer need this for the XML output, if "indexTagName" now overrides explicit array keys in ApiResult
    • currently, there is an option to force maps to objects - that should probably always be done, to avoid issues with the representation of empty maps in JSON (T98860)
  • injecting "indexTagName" and other meta-info for the MediaWiki API serializer. Hard coded knowledge about these markers could be kept out of the serializer by using some sort of callback logic. See T78652
  • maintain ordered maps: since JSON does not specify the order of entries in a map, we need to make that order explicit, since it should at least be stable. (T98861)
  • inject revision info (revision id, timestamp, page title). This is mediawiki specific, the serializer shouldn't know the meaning of such meta-data fields. (T98862)
  • deletion markers for deleted elements in API responses. We may want to implement this by representing deletions in the DataModel somehow. T98863

Additional features we want to support in the future, but should already consider when thinking abotu a solution for the issues above:

  • Derived values in Snaks (normalized quantities, expanded external IDs, etc; see T89005). Could be represented in the model in a way similar to TermFallback: ExtendedPropertyValueSnak would extend PropertyValueSnak (and perhaps implement a marker interface, to safeguard against serialization into the database). Note: derived values should not be supported as input to EditEntity, etc.
  • Filtered representations (terms filtered by language, sitelinks filtered by family or language, etc). Filtering can be done before serialization, but should be marked in the data model and in the serialization, to avoid data loss during round trips. (T73512)
  • Represent incoming redirects as "secondary ids" in the serialized entity data structure (T98039)

For implementing the features above, there are several options:

  1. Add knowledge to the serializer
  2. Wrap another set of serializers around the basic serializers
  3. Post-process the resulting array structure to apply the necessary changes
  4. Represent them in the data model explicitly
  5. Layer another representation model on top of the basic data model

For practical reasons, things that only change the way the data is represented, should probably be implemented inside the serializer (Option 1 or 2, possibly 3 if there is a very good reason). Features that add or remove information (such a filtering or language fallback) should be represented in the model (options 4 or 5). Care must be taken to avoid serializing such derivative data into the database, or accepting it as input of edits.

Details

Reference
bz71170

Related Objects

View Standalone Graph
This task is connected to more than 200 other tasks. Only direct parents and subtasks are shown here. Use View Standalone Graph to show more of the graph.
StatusSubtypeAssignedTask
ResolvedAddshore
ResolvedLydia_Pintscher
ResolvedJanZerebecki
OpenNone
DeclinedAddshore
ResolvedAddshore
ResolvedLydia_Pintscher
OpenNone
InvalidNone
InvalidNone
InvalidNone
InvalidNone
InvalidNone
DeclinedBene
ResolvedBene
OpenNone

Event Timeline

bzimport raised the priority of this task from to High.Nov 22 2014, 3:48 AM
bzimport set Reference to bz71170.
bzimport added a subscriber: Unknown Object (MLST).

It seems it is also missing the ability to represent removed Terms: search for removed in serializeMultilingualValues() in Wikibase.git/lib/includes/serializers/MultilingualSerializer.php .

Things that the old serializer supported (or should have supported), and need to be possible using the new model and serializer too:

  • Filtering (terms by language, sitelinks by site, statements by rank)
  • Apply language fallback (and put info about the fallback in the terms)
  • optionally group claims by property
  • optionally use lists instead of maps that use IDs as keys (to cater to the quirks of ApiFormatXml).
  • deletion markers used in responses of API modules like RemoveClaim

In some cases, we want to (optionally) inject extra information into the serialization (or presentational model), e.g.:

  • the DataType of PropertyValueSnaks
  • (future) full URIs for external identifiers
  • (future) quantity values converted to base units

We should try to avoid putting such "derivative" versions of our data back into the database, as this would constitute data loss and/or create confusion (especially in the case of automatic transliteration).

Another question is if and how "derivative" entity information can and should be represented by our data model. We should have a spec that makes a clear distinction between the "core" data model, and "representational" or "informational" derivatives.

PS: We also need a way to represent order explicitly when using id based maps instead of lists (for statements, qualifiers in a claim, references in a statement, and snaks in a reference). This is part of the core model, but was not addressed by the old serializer either.

After some discussion with Jan and Adrian, some points became clear:

  • We want to implement language fallback and filtering on the model level, not in the serializer
  • Other things however, like grouping of statements, or associative vs. indexed arrays, have to be implemented in the serializer (a flag to the serializer factory could do the trick)
  • Presentation-layer concepts can be represented in the datamodel using subclassing (e.g. TermWithLanguageFallback extends Term).
  • It would be good to include a version number in serializer output
  • Entities "tainted" by fallback, filtering, etc are not a big issue in practice, because EditEntity only adds/replaces entries per default.
  • EditEntity still needs to fail on terms with fallback info, to avoid writing automatic transliterations to the database.

In general, it became clear that the serialization we use in the database will often not be exactly the same as the one we use in the database. For instance, the serialization in the database does not contain the data-type in snaks, nor should it in the future contain things like expanded URIs of external IDs or converted quantity values.

The question whether we can always group statements by property, or whether we want to retain the option to ouput flat lists of Statements, remained open. It would be nicer if we could always group.

Another general consideration: we want our output format to stay relatively stable, and it should be easy to use directly, without the need for specialized data model libraries. While it would be nice to have libraries for serialization and representing our data model in multiple languages, we currently do not supply those. As long as we don't, we have to assume people operate on the raw data structures.

» we use in the database will often not be exactly the same as the one we use in the database.«?

I agree that we have to consider users which don't have a data model implementation. I just want the data model implementations we have to be able to provide the things we provide in the serialization.

Of course i meant "the serialization we use in the database will often not be exactly the same as the one we use in the API".

I agree with Adrian that our data model should generally be able to represent everything our serializer puts into the output.

Another aspect that came up is the conceptual separation of the "core" data model (representing the knowledge stored in the wiki directly) vs. the "presentational" model (with fallback, filtering, and expansion). A marker interface may work, but that's not very nice (think LSP). The cleanest solution would be to duplicate the entire data model class hierarchy, but that's massive overhead, hard to maintain, and potentially confusing. The best we can do for now seems to be to clearly document which classes and fields are part of the core model, and which are presentational.

Lydia_Pintscher removed a subscriber: Unknown Object (MLST).
Lydia_Pintscher removed a subscriber: Unknown Object (MLST).Dec 1 2014, 2:33 PM
daniel set Security to None.
daniel updated the task description. (Show Details)
daniel updated the task description. (Show Details)
daniel updated the task description. (Show Details)
daniel updated the task description. (Show Details)
daniel updated the task description. (Show Details)
daniel updated the task description. (Show Details)

I must say I'm a bit confused by this tracking ticket. It says it's about "feature parity" and blocks T64188: Replace old serialization code in lib with datamodel serialization, but contains proposals for what sound like actual breaking changes to me.

  • Did the old serializer "include the property data type in Property(Value)Snaks" and how did it do that when the new serializer can't?
  • "ditch the ungrouped representation" means a breaking change, right?
  • "we need to make that order explicit" means introducing a new feature that currently does not exist in the serialization, right?
  • Is "inject revision info" something the old serializer does and how does it do that?
  • "deletion markers" are another non-existing feature, right? How is this relevant for this ticket?

As discussed I'm arguing that the serializers are creating a specific serialization anyway and adding some knowledge about markers (as strings, not as an actual MediaWiki dependency) is ok in my opinion.

Wrapping a second "datamodel" around the current one sounds fascinating, but I can't think of a concrete way to implement this.

  • Did the old serializer "include the property data type in Property(Value)Snaks" and how did it do that when the new serializer can't?

Yes, SnakSerializer gets a PropertyDataTypeLookup injected.

  • "ditch the ungrouped representation" means a breaking change, right?

Ditching this option is mentioned as an alternative to maintaining full party, namely, offering an api parameter to switch between the grouped and ungrouped mode.

  • "we need to make that order explicit" means introducing a new feature that currently does not exist in the serialization, right?

I think we had that at one point, but I'm not quite sure. It's an aspect we should keep in mind though.

  • Is "inject revision info" something the old serializer does and how does it do that?

ResultBuilder injects this into the ApiResult. It would be nice to integrate this better with serialization, but it's not technically a blocker.

  • "deletion markers" are another non-existing feature, right? How is this relevant for this ticket?

We do this at least for removed sitelinks, see ResultBuilder::addSiteLinks, using EntitySerializer::OPT_PARTS, => "sitelinks/removed". I seem to recall we did this in more places.

We also already do normalized/expanded data, with EntitySerializer::OPT_PARTS => "sitelinks/urls"

As discussed I'm arguing that the serializers are creating a specific serialization anyway and adding some knowledge about markers (as strings, not as an actual MediaWiki dependency) is ok in my opinion.

Yes, incorporating some "magical" knowledge may not be the cleanest, but i nthe end the sanest way to go about this. Though I'm still considering post-processing for injecting the indextTagName stuff.

Keep in mind that serializers take the thing to serialize and give you the serialization of that thing. And nothing more. You could construct an "entity serializer" that takes an entity id and then fetches the entity. This fetching is an added and misplaced responsibility. The same goes for adding additional information to whatever is passed in based on queries to services. This does not mean you cannot create a serializer with such an output, it just means the fetching of additional stuff should be done before using the serializer, not during.

Related to this: Wikibase DataModel Serialization is for serialization of DataModel objects. Things such as MediaWiki revision IDs are thus out of scope. Again, this does not mean such things cannot be serialized. It just means Wikibase DataModel Serialization is the wrong place to put such code.

It seems like a bunch of blockers where added that have little to do with feature parity between the current code in Lib and the Wikibase DataModel Serialization code. I recommend to keep moving away from the legacy code and implementing new functionality separate.

@JeroenDeDauw In theory, you are right: the serializer should just serialize whetever they are given. In practice, this means adding a second, rather complex, data model on top of the core data model, or complicating the core data model. That's a lot of work, and the modeling isn't easy to get right either. We explicitly discussed this option, but decided that it's not worth the pain to keep the serializers "clean".

@JeroenDeDauw You can also see it this way: feature parity is for model+serializer. If we move all the options and modes into the model, we need something to build the "output model" from the "core model", and that converter thing would need to do all the things described in the blockers.

Going this way, feature parity would be blocked on creating the "output model" and the corresponding converter, which would in turn be blocked on all the features mentioned here. So you'd add at least two more blockers, and the need for a few additional data structures. This does not strike me as a good thing.

Sure, code that relies on modes and switches is annoying. And serializers should be dumb. But if upholding these principles 100% causes a lot of overhead, we should probably go for a balance point closer to the energetic minimum...

daniel renamed this task from Ensure WikibaseDataModelSerialization feature parity with WikibaseLib to Ensure feature parity of serialization based on WikibaseDataModelSerialization with what we do with WikibaseLib.May 28 2015, 11:04 AM
daniel updated the task description. (Show Details)

I have edited the description to remove the implication that *all* these features have to be implemented *inside* WikibaseDataModelSerialization.

@daniel: I do not see how we need to be "adding a second, rather complex, data model on top of the core data model". For the example provided, having a serialization of a property value snak with added data type, all you need is a simple typed snak object. One such value object is neither complex nor a duplication of the whole model, and much less icky than introducing interfaces for non-serialization services in a serialization component.

Sure, code that relies on modes and switches is annoying. And serializers should be dumb. But if upholding these principles 100% causes a lot of overhead, we should probably go for a balance point closer to the energetic minimum...

I'm not sure if that is meant to be a reply to anything I said. Just in case: I agree switches are not nice though that the most pragmatic trade-of will probably involve some. In fact, there already is one: https://github.com/wmde/WikibaseDataModelSerialization/blob/master/src/SerializerFactory.php#L50

We'd need at least:

  • Term with fallback (already exists)
  • Snak with DataType
  • Snak with extra DataValues (normalized/expanded)
  • Grouped StatementList and SnakList
  • Link, Term, Statement, etc with a "deleted" flag
  • LinkList, TermList, StatementList, etc with a "filtered" flag
  • Entities with secondary IDs (redirects)

All of them should be acceptable for deserialization, but should not be acceptable input for EditEntity and friends, since the information is derived or imposed by filtering/sorting parameters. We can go this route, shifting some of the required changes from the serializer component to the datamodel component. But the serializers still need to support the new types, and we are still blocked on implementing all this before we can ditch the old serializer.

On top of the above, we still need the serializer itself, or the ResultBuilder, to support:

  • empty maps (not lists) as objects
  • indexedTagName markers (I think post-processing is ok for this)
  • injection of version info
  • maintaining order of map entries
Jonas renamed this task from Ensure feature parity of serialization based on WikibaseDataModelSerialization with what we do with WikibaseLib to [Story] Ensure feature parity of serialization based on WikibaseDataModelSerialization with what we do with WikibaseLib.Aug 15 2015, 12:27 PM

@daniel Can this be closed? I assume this is no longer relevant by now.

Lydia_Pintscher claimed this task.

Closing based on Marius' comment