[RFC] How to represent derived values in the data model, and allow for deferred deserialization
Closed, Resolved (Public)

Description

The goal of this RFC is to decide how two new requirements for the data model could be implemented: derived values (and other secondary information), and deferred deserialization. These two issues are unrelated, but both impact the architecture of the data model, so they should both be considered when deciding on the future architecture of the model.

Some context for each of these issues:


Some use cases for derived values:

  • Terms with language fallback information (derived term list)
  • SiteLink with derived URL
  • ID snak with derived URI and URL values
  • Quantity snak with derived value converted to the base unit
  • Time snak with derived Gregorian ISO value
  • CommonsMedia snak with derived image URI, description page URL, thumbnail URL(s)
  • maybe even EntityId snaks with derived entity URI (useful for external entities / federation)
  • Serializer modes: include extra, strip extra, fail on extra (we want extra data in api output and dumps, but never in the database).
  • Unserializer modes: construct "simple" or "extended" data model objects? Ignore extra info, or fail on extra info?

T90707: [Task] Implement deferred deserialization of Entity calls for deferred deserialization, especially of lists and containers such as:

  • terms
  • statements
  • sitelinks

In T112547: [RFC] Decide on a mechanism for supporting derived values during serialization it was decided that we should represent derived values in the data model in order to include them in the JSON output. Several options were identified for doing so:

a. Use subclassing to create "extended" versions of the relevant classes in the model. This is already done for some things (Terms with language fallback, for instance), but was identified as the least desirable option in the previous RFC discussion.
b. Create a completely separate model hierarchy, wrapping the base model, and attaching additional information. This was seen as clean but tedious in the previous RFC discussion.
c. Create specialized models for each use case, independent of the base model, with the appropriate transformation services. This was seen as rather hard to maintain and tricky to document and communicate to third parties.
d. Define interfaces for the core model, but support additional (derived) data in the implementation classes (or have separate interfaces and/or implementations that support additional information). This would break all code that currently instantiates model objects directly.
e. Provide mappings from original values to derived values parallel to, but separate from, the data model (e.g. SnakValueMapper::getNormalizedValue( $snak ) ); In JSON, the derived values would be provided on the top level of the entity, as maps from original value (hashes) to derived values.

Results of Meeting on 2015-11-05:

Present: Bene, Thiemo, Jan, Jeroen, Jonas, Daniel. Adrian provided input beforehand, suggesting the new option (e).

We made some general observations that helped us narrow down the choices:

  • Option (b) (wrapper model) seems pointless if the wrapper model implements the same interfaces as the original model - this would be equivalent to (a). If it does not, it is equivalent to (c) (specialized data models).
  • Option (c) (lean specialized models) is a good fit when only a subset of the original data is needed for a particular use case. It's more useful for removing/hiding information than for adding extra data. It does not help us with our use case of generating JSON and RDF, which needs all original data plus some derived information. Using a specialized lean data model as input for the HTML generation may still make sense.
  • Option (d) (interfaces with separate implementations) is not a separate option, but rather a flavor of (a) and (b). For (b) interfaces would probably be necessary, for (a) they are rather pointless.
  • Derived data is an issue for "leaf" objects of the data model (Snaks, Terms, SiteLinks), while deferred deserialization is an issue for container classes (SiteLinkList, StatementList, TermList, etc). To avoid data loss when round-tripping, containers should however indicate whether they contain filtered information, in which case they would be refused as input for edits. This can be done with a flag or possibly a marker interface.
  • We want interfaces for container classes, so we can provide deferred implementations for them. The deferred implementations would live in the serialization component.
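A minimal sketch of what a deferred container behind an interface could look like, assuming PHP 8. The names TermListLike and DeferredTermList, and the shape of the raw serialization, are hypothetical and only serve to illustrate the idea; they are not part of the existing data model.

```php
<?php
declare(strict_types=1);

// Hypothetical container interface; the RFC had not yet defined the real ones.
interface TermListLike extends Countable {
    /** @return array<string,string> language code => text */
    public function toTextArray(): array;
}

// Deferred implementation: keeps the raw serialization and decodes it on first access.
final class DeferredTermList implements TermListLike {
    /** @var array<string,string>|null decoded terms, null until first access */
    private ?array $terms = null;

    /** Counts how often decoding actually ran (for demonstration only). */
    public int $decodeCount = 0;

    public function __construct(private array $serialization) {}

    private function load(): array {
        if ($this->terms === null) {
            $this->decodeCount++;
            $this->terms = [];
            foreach ($this->serialization as $term) {
                $this->terms[$term['language']] = $term['value'];
            }
        }
        return $this->terms;
    }

    public function toTextArray(): array {
        return $this->load();
    }

    public function count(): int {
        return count($this->load());
    }
}

// Nothing is decoded at construction time.
$labels = new DeferredTermList([
    'en' => ['language' => 'en', 'value' => 'Berlin'],
    'de' => ['language' => 'de', 'value' => 'Berlin'],
]);
```

Code that type-hints against the interface would not need to know whether it received an eager or a deferred list; decoding happens once, on the first call that needs the data.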

Some additional use cases were suggested for consideration:

  • inclusion of property data types in snaks
  • including labels and URLs (or URIs) of referenced entities in JSON output, for convenience.
  • including thumbnail URLs for sitelink badges in JSON output.

The above leaves option (a) (subclassing) and option (e) (mapper services) for representing derived information (terms, sitelinks, snaks). These two options on the PHP side correspond to (but are not bound to, in the actual implementation) different representations in JSON:

Subclassing:

$snak->getDataType();

if ( $snak instanceof SlottyPropertyValueSnak ) {
  $snak->getValueSlot( 'uri' );
}

"mainsnak":{
  "snaktype":"value",
  "property":"P854",
  "datavalue":{
    "value":"85312226",
    "type":"string"
  },
  "datavalue-uri":{
    "value":"https://viaf.org/viaf/85312226/",
    "type":"string"
  },
  "datatype":"ID"
}
  • Convenient/intuitive
  • Good information locality
  • Straight-forward implementation
  • JSON is easy to process by naive clients
  • Denormalized, potentially very redundant
  • Inflexible, data model needs to be aware of all kinds of derived "extra" data.
    • E.g. no good way to have only some extra data in a Sitelink object in PHP

Mapping:

$this->dataTypeLookup->getDataType( $snak->getPropertyId() );

$this->snakValueNormalizer->getNormalizedSnakValue( $snak );

"mainsnak":{
  "snaktype":"value",
  "property":"P854",
  "datavalue":{
    "value":"85312226",
    "type":"string"
  }
},
...
"property-datatype-mapping": {
  "P854":"ID"
},
"value-uri-mapping": {
  "c55656909a7e6f225d97e589a12c30a36cd05a12": {
    "value":"https://viaf.org/viaf/85312226/",
    "type":"string"
  }
}
  • Flexible: derived values can be extracted from JSON, or computed, or loaded from an API
  • Data model does not know about derived "extra" data.
  • Normalized, little redundancy
  • Mapping services need to be injected
  • Mapping services need to be implemented in all target languages
  • JSON processing requires hash-based lookups. Calculating these hashes is not trivial.
  • When processing a dump, we don't want to keep all mappings for everything in memory at once. The mapping info would be per entity, and would need to be attached to the Entity object.
  • When using mapping objects to generate "inline" JSON, serializers need to have the appropriate mapping services injected. Multiple serializers need to be applied on various levels of the output structure, to provide serialization for different (derived) aspects of a data object.
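To make the client-side cost of the mapping layout concrete, here is a rough sketch of how a consumer might resolve the data type and the derived URI from the top-level maps. It assumes the snak carries its own "hash" key, which the example above does not show; without it the client would have to compute the hash itself, which is the non-trivial part noted in the list.

```php
<?php
declare(strict_types=1);

// Entity JSON in the "mapping" layout. The "hash" key on the snak is an
// assumption made for this sketch.
$json = <<<'JSON'
{
  "mainsnak": {
    "snaktype": "value",
    "property": "P854",
    "hash": "c55656909a7e6f225d97e589a12c30a36cd05a12",
    "datavalue": { "value": "85312226", "type": "string" }
  },
  "property-datatype-mapping": { "P854": "ID" },
  "value-uri-mapping": {
    "c55656909a7e6f225d97e589a12c30a36cd05a12": {
      "value": "https://viaf.org/viaf/85312226/",
      "type": "string"
    }
  }
}
JSON;

$entity = json_decode($json, true, 512, JSON_THROW_ON_ERROR);
$snak = $entity['mainsnak'];

// Data type: looked up by property ID.
$dataType = $entity['property-datatype-mapping'][$snak['property']] ?? null;

// Derived URI: looked up by snak hash; null if no derived value is present.
$derivedUri = $entity['value-uri-mapping'][$snak['hash']]['value'] ?? null;
```

With the inline layout, both values would instead be read directly from the snak, with no second lookup.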

  • From the above, it seems that an "inline" approach is clearly better for JSON than the mapping approach: it's easier to use for "naive" clients, and provides better readability and discoverability.
  • The inline/subclassing approach does not, however, translate well to PHP, since we can't easily model optional fields: code that needs a specific kind of extra data from the model cannot easily use type hints to get just that data. This is particularly annoying for classes that take a container object as a parameter.
    • For example, what relationship should TypedSnak have to PropertyValueSnak? Do we need a TypedPropertyValueSnak, a TypedNoValueSnak, etc? Do we then need a TypedSnakList in addition to SnakList? A TypedSnakStatement in addition to Statement? And how do we combine two kinds of "extra" data, e.g. data types in snaks, and derived values in snaks? A separate implementation for every possible permutation is not a good solution. A single implementation that does everything is possible, but not very nice.
  • For representation in PHP, a compromise would be possible:
    • Attach extra data to the plain model objects "somehow", but access that information only transparently via mapping services (this is "(e) on top of (a)")
    • Make a specialized "read model" for each "extra" aspect, and implement it on top of a mapping service ("(c) on top of (e)")

Some more findings after a brief investigation of how we currently place some secondary data in the serialization output:

  • TypedSnak exists as a wrapper around Snak, but doesn't implement Snak. It's not used except in tests, and it cannot be used in places where a Snak is expected. This makes it rather impractical for the desired purpose of representing extra data in the serialization output of full entities.
  • DerivedPropertyValueSnak exists but is not yet used outside tests. It extends PropertyValueSnak, so it could be used as a drop-in replacement.

SerializationModifier is used to inject information into a serialization array structure. Following the visitor pattern, it applies a callback to nodes matching a specific pattern, allowing the nodes to be modified. This is used extensively by ResultBuilder, but also by JsonDumpGenerator and ClientEntitySerializer. In particular, this mechanism is used to:

  • Inject the datatype of Snaks
  • Inject the target URL for SiteLinks
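The general technique (apply a callback to every node matching a path pattern) can be sketched as follows. This is an illustration only, not the SerializationModifier API: the function modifyMatching, its path syntax with '*' as a wildcard, and the URL scheme are all made up for this example.

```php
<?php
declare(strict_types=1);

/**
 * Applies $callback to every array node matching $path and returns a modified copy.
 * A '*' path segment matches any key. Hypothetical helper for illustration.
 */
function modifyMatching(array $data, array $path, callable $callback): array {
    if ($path === []) {
        // Reached a matching node: let the callback rewrite it.
        return $callback($data);
    }
    $key = array_shift($path);
    foreach ($data as $k => $child) {
        if (($key === '*' || $key === (string)$k) && is_array($child)) {
            $data[$k] = modifyMatching($child, $path, $callback);
        }
    }
    return $data;
}

$serialization = [
    'sitelinks' => [
        'enwiki' => ['site' => 'enwiki', 'title' => 'Berlin'],
        'dewiki' => ['site' => 'dewiki', 'title' => 'Berlin'],
    ],
];

// Inject a URL into every sitelink node; the URL scheme is a placeholder.
$withUrls = modifyMatching(
    $serialization,
    ['sitelinks', '*'],
    function (array $link): array {
        $link['url'] = 'https://' . $link['site'] . '.example/' . rawurlencode($link['title']);
        return $link;
    }
);
```

Because arrays are passed by value, the original serialization stays untouched and the derived data only exists in the output copy.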

Summary of the results from T112893: [Task] Investigate how and where data model objects are instanciated in our code base:

  • AliasGroup
    • Instantiations in production: AliasGroupList, AliasGroupListDeserializer, Fingerprint, LegacyFingerprintDeserializer
    • Usage in tests: 122
  • PropertyNoValueSnak
    • Instantiations in production: LegacySnakDeserializer, SnakDeserializer, SnakFactory
    • Usage in tests: 432
  • PropertySomeValueSnak
    • Instantiations in production: LegacySnakDeserializer, SnakDeserializer, SnakFactory
    • Usage in tests: 110
  • PropertyValueSnak
    • Instantiations in production: DerivedPropertyValueSnak, LegacySnakDeserializer, SnakDeserializer, SnakFactory
    • Usage in tests: 264
  • SiteLink
    • Instantiations in production: DeletePageNoticeCreator, InfoActionHookHandler, LegacySiteLinkListDeserializer, LinkTitles, MovePageNotice, OtherProjectsSidebarGenerator, Runner, SiteLinkDeserializer, SiteLinkList, SiteLinkTable, SiteLinkUniquenessValidator, UpdateRepo, UpdateRepoOnMoveJob, WikibaseLuaBindings
    • Usage in tests: 155
  • Statement
    • Instantiations in production: LegacyStatementDeserializer, StatementDeserializer, StatementGroupListView, StatementList
    • Usage in tests: 212

Numbers for Term, TermList, and TermFallback appear to be missing from the analysis.

If we decide to change the model in a way that would require changes to how data model objects are instantiated, the above places in the codebase would need to be changed to accommodate the new approach.


After some research considering the above, it seems that the Role Pattern best fits our needs. It offers a compromise between a clean base model and a powerful extension mechanism, while maintaining information locality and discoverability. A detailed proposal will follow.
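For readers unfamiliar with the pattern, a generic sketch of role objects attached to a model object might look like this, assuming PHP 8. All class and method names here (SnakWithRoles, DerivedUri, addRole, getRole) are hypothetical; the actual design is left to the detailed proposal.

```php
<?php
declare(strict_types=1);

// Hypothetical names throughout; not the proposed design.
final class SnakWithRoles {
    /** @var array<string,object> role name => role object */
    private array $roles = [];

    public function __construct(
        private string $propertyId,
        private string $value
    ) {}

    public function getPropertyId(): string {
        return $this->propertyId;
    }

    public function getValue(): string {
        return $this->value;
    }

    // Extra (derived) data is attached as a named role, not as a subclass field.
    public function addRole(string $name, object $role): void {
        $this->roles[$name] = $role;
    }

    public function hasRole(string $name): bool {
        return isset($this->roles[$name]);
    }

    public function getRole(string $name): object {
        if (!isset($this->roles[$name])) {
            throw new OutOfBoundsException("No such role: $name");
        }
        return $this->roles[$name];
    }
}

// A role object carrying one kind of derived data.
final class DerivedUri {
    public function __construct(public string $uri) {}
}

$snak = new SnakWithRoles('P854', '85312226');
$snak->addRole('uri', new DerivedUri('https://viaf.org/viaf/' . $snak->getValue() . '/'));
```

The base object stays the same class regardless of which roles are attached, so several kinds of extra data can be combined without a subclass per permutation.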

Event Timeline

daniel claimed this task.
daniel raised the priority of this task from to Medium.
daniel raised the priority of this task from Medium to High.
daniel updated the task description.

but was identified as the least desirable option in the previous RFC discussion.

Was it really? Inheritance by itself isn't a bad thing and I think in some places it is completely valid to use that feature to extend our data model. I wouldn't throw that option away from the beginning.

Where is the list of "derived values" we need to support again?

My design intuition suggests approach b, which is backed up by my theoretical knowledge. We can create such a model without modifying the existing one or impacting users of the existing one. This makes it easy to try it out and see how it goes. The other approaches are more disruptive.

committed for sprint Wikidata-Sprint-2015-09-29: @daniel will broaden the scope of the RfC to include deferred deserialization and then have a discussion with @daniel @thiemowmde @Bene and @JeroenDeDauw (optional: @Jonas, @JanZerebecki).

daniel renamed this task from [RFC] Decide how to represent derived values in the data model to [RFC] Represent deferred values in the data model, and allow for deferred deserialization. Sep 29 2015, 5:28 PM
daniel updated the task description.

changed the title from "[RFC] Decide how to represent derived values in the data model" to "[RFC] Represent deferred values in the data model, and allow for deferred deserialization".

Deferred instead of derived? Is that a mistake? The description certainly did not see a matching change...

Deferred instead of derived? Is that a mistake? The description certainly did not see a matching change...

It's not a mistake. We have two new requirements to meet:

  • accommodate derived values
  • accommodate deferred (lazy) deserialization

These two were discussed separately before, but for the next generation of the data model, we need to consider both. This is important: for example, I can see subclassing working for derived values, but not really for deferred deserialization.

I added a reference to T90707 to the description when I changed the title. See there for details.

It's clear to me we have those two requirements. One is not the subtype of the other though, so having a single one in the description is misleading. Please change to match what you intend, which at this point is unclear to me.

@daniel Shouldn't it be "[RFC] Represent derived values in the data model, and allow for deferred deserialization"?

Orr, now I see my mistake! Sorry for being blind :(

daniel renamed this task from [RFC] Represent deferred values in the data model, and allow for deferred deserialization to [RFC] Represent derived values in the data model, and allow for deferred deserialization. Oct 1 2015, 6:34 PM

Still waiting for an answer to: Where is the list of "derived values" we need to support again?

@JeroenDeDauw I updated the description to include a list of derived values we want to support.

but was identified as the least desirable option in the previous RFC discussion.

Was it really? Inheritance by itself isn't a bad thing and I think in some places it is completely valid to use that feature to extend our data model. I wouldn't throw that option away from the beginning.

Subclassing seems like a viable option for derived values, but not really for deferred deserialisation. We should probably discuss it again.

A lot of these things are about DataValues, right? I think those can be tackled without talking about the Wikibase data model.

@thiemowmde and I just talked about sitelinks, and we came to the following conclusion:

  • The functionality of getting the URL for a site link should not be in a data model object (There should not be a SiteLink::getUrl method)
  • The functionality of getting the URL for a site link should be provided by a service (There should be a SiteLinkUrlGenerator interface)
  • When a serialization with linked sitelinks is requested:
    1. Standard data model objects are serialized by standard data model serializers
    2. A special serializer generates a special serialization from standard data model SiteLinks using a supplied SiteLinkUrlGenerator
    3. Standard and special serialization are merged by deep merge ({ entity: { sitelinks: { 'enwiki': { url: 'https://…' } } } }) or just by appending ({ entity: {/* … */}, sitelink_urls: { 'enwiki': 'https://…' } })
  • When a serialization with linked sitelinks is deserialized:
    1. A standard deserializer deserializes into standard data model objects
    2. A special deserializer is used to construct a SiteLinkUrlGenerator
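The serialization steps above can be sketched roughly as follows, assuming PHP 8. Only the interface name SiteLinkUrlGenerator comes from the comment; its method signature, the TemplateSiteLinkUrlGenerator class, and the URL template are assumptions made for illustration. The deep merge uses PHP's built-in array_replace_recursive.

```php
<?php
declare(strict_types=1);

// Interface name from the comment above; the method signature is an assumption.
interface SiteLinkUrlGenerator {
    public function getUrl(string $siteId, string $pageName): string;
}

// Toy implementation using a made-up URL template per site.
final class TemplateSiteLinkUrlGenerator implements SiteLinkUrlGenerator {
    /** @param array<string,string> $templates site ID => URL template containing $1 */
    public function __construct(private array $templates) {}

    public function getUrl(string $siteId, string $pageName): string {
        return str_replace('$1', rawurlencode($pageName), $this->templates[$siteId]);
    }
}

// Step 1: the standard serialization, as a standard serializer would produce it.
$standard = [
    'entity' => [
        'sitelinks' => [
            'enwiki' => ['site' => 'enwiki', 'title' => 'Berlin'],
        ],
    ],
];

// Step 2: a special serialization containing only the derived URLs.
$generator = new TemplateSiteLinkUrlGenerator(['enwiki' => 'https://en.wiki.example/wiki/$1']);
$special = ['entity' => ['sitelinks' => []]];
foreach ($standard['entity']['sitelinks'] as $siteId => $link) {
    $special['entity']['sitelinks'][$siteId] = [
        'url' => $generator->getUrl($siteId, $link['title']),
    ];
}

// Step 3: deep merge of standard and special serialization.
$merged = array_replace_recursive($standard, $special);
```

The SiteLink objects themselves never learn about URLs; only the merged output carries them.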
daniel updated the task description.
daniel updated the task description.
daniel updated the task description.
daniel renamed this task from [RFC] Represent derived values in the data model, and allow for deferred deserialization to [RFC] How to represent derived values in the data model, and allow for deferred deserialization. Nov 17 2015, 3:11 PM

Closing now with a decision to go for the role object pattern. Details of that should be discussed at T118860: [RFC] Use Role Object Pattern to represent derived data in the data model.