The goal of this RFC is to decide how two new requirements for the data model could be implemented: derived values (and other secondary information), and deferred deserialization. These two issues are unrelated, but both affect the architecture of the data model, so both should be considered when deciding on the model's future architecture.

Some context for each of these issues:

------

Some use cases for derived values:

* Terms with language fallback information (derived term list)
* SiteLink with derived URL
* ID snak with derived URI and URL values
* Quantity snak with derived value converted to the base unit
* Time snak with derived Gregorian ISO value
* CommonsMedia snak with derived image URI, description page URL, thumbnail URL(s)
* maybe even EntityId snaks with derived entity URI (useful for external entities / federation)
* Serializer modes: include extra, strip extra, fail on extra (we want extra data in API output and dumps, but never in the database).
* Unserializer modes: construct "simple" or "extended" data model objects? Ignore extra info, or fail on extra info?

------

{T90707} calls for deferred deserialization, especially of lists / containers like:

* terms
* statements
* sitelinks

------

In {T112547} it was decided that we should represent derived values in the data model in order to include them in the JSON output. Several options were identified for doing so:

a. Use subclassing to create "extended" versions of the relevant classes in the model. This is already done for some things (Terms with language fallback, for instance), but was identified as the least desirable option in the previous RFC discussion.
b. Create a completely separate model hierarchy, wrapping the base model, and attaching additional information. This was seen as clean but tedious in the previous RFC discussion.
c. Create specialized models for each use case, independent of the base model, with the appropriate transformation services. This was seen as rather hard to maintain and tricky to document and communicate to third parties.
d. Define interfaces for the core model, but support additional (derived) data in the implementation classes (or have separate interfaces and/or implementations that support additional information). This would break all code that currently instantiates model objects directly.
e. Provide mappings from original values to derived values parallel to, but separate from, the data model (e.g. `SnakValueMapper::getNormalizedValue( $snak )`). In JSON, the derived values would be provided on the top level of the entity, as maps from original value (hashes) to derived values.

Results of Meeting on 2015-11-05:
==============================

Present: Bene, Thiemo, Jan, Jeroen, Jonas, Daniel. Adrian provided input beforehand, suggesting the new option (e).

We made some general observations that helped us narrow down the choices:

* Option (b) (wrapper model) seems pointless if the wrapper model implements the same interfaces as the original model - this would be equivalent to (a). If it does not, it is equivalent to (c) (specialized data models).
* Option (c) (lean specialized models) is a good fit when only a subset of the original data is needed for a particular use case. It's more useful for removing/hiding information than for adding extra data. It does not help us with our use case of generating JSON and RDF, which needs all original data plus some derived information. Using a specialized lean data model as input for the HTML generation may still make sense.
* Option (d) (interfaces with separate implementations) is not a separate option, but rather a flavor of (a) and (b). For (b) interfaces would probably be necessary, for (a) they are rather pointless.
* Derived data is an issue for "leaf" objects of the data model (Snaks, Terms, SiteLinks), while deferred deserialization is an issue for container classes (SiteLinkList, StatementList, TermList, etc). To avoid data loss when round-tripping, containers should, however, indicate whether they contain filtered information, in which case they would be refused as input for edits. This can be done with a flag or possibly a marker interface.
* We want interfaces for container classes, so we can provide deferred implementations for them. The deferred implementations would live in the serialization component.

Some additional use cases were suggested for consideration:

* including property data types in snaks.
* including labels and URLs (or URIs) of referenced entities in JSON output, for convenience.
* including thumbnail URLs for sitelink badges in JSON output.

The above leaves option (a) (subclassing) and option (e) (mapper services) for representing derived information (terms, sitelinks, snaks). These two options on the PHP side correspond to (but are not bound to, in the actual implementation) different representations in JSON:

**Subclassing**:

```lang=php
$snak->getDataType();
if ( $snak instanceof SlottyPropertyValueSnak ) {
	$snak->getValueSlot( 'uri' );
}
```

```lang=js
"mainsnak":{
    "snaktype":"value",
    "property":"P854",
    "datavalue":{
        "value":"85312226",
        "type":"string"
    },
    "datavalue-uri":{
        "value":"https://viaf.org/viaf/85312226/",
        "type":"string"
    },
    "datatype":"ID"
}
```

* Convenient/intuitive
* Good information locality
* Straightforward implementation
* JSON is easy to process by naive clients
* Denormalized, potentially very redundant
* Inflexible, data model needs to be aware of //all// kinds of derived "extra" data.
** E.g. no good way to have only //some// extra data in a SiteLink object in PHP

**Mapping**:

```lang=php
$this->dataTypeLookup->getDataType( $snak->getPropertyId() );
$this->snakValueNormalizer->getNormalizedSnakValue( $snak );
```

```lang=js
"mainsnak":{
    "snaktype":"value",
    "property":"P854",
    "datavalue":{
        "value":"85312226",
        "type":"string"
    }
},
...
"property-datatype-mapping": {
    "P854":"ID"
},
"value-uri-mapping": {
    "c55656909a7e6f225d97e589a12c30a36cd05a12": {
        "value":"https://viaf.org/viaf/85312226/",
        "type":"string"
    }
}
```

* Flexible: derived values can be extracted from JSON, or computed, or loaded from an API
* Data model does not know about derived "extra" data.
* Normalized, little redundancy
* Mapping services need to be injected
* Mapping services need to be implemented in all target languages
* JSON processing requires hash-based lookups. Calculating these hashes is not trivial.
* When processing a dump, we don't want to keep all mappings for everything in memory at once. The mapping info would be per entity, and would need to be attached to the Entity object.
* When using mapping objects to generate "inline" JSON, serializers need to have the appropriate mapping services injected. Multiple serializers need to be applied on various levels of the output structure, to provide serialization for different (derived) aspects of a data object.

-----

* From the above, it seems that an "inline" approach is clearly better for JSON than the mapping approach: it's easier to use for "naive" clients, and provides better readability and discoverability.
* The inline/subclassing approach does not, however, translate to PHP so well, since we can't easily model optional fields. Code that needs a specific kind of extra data from the model cannot easily use type hints to get just that data. This is particularly annoying for classes that take a container object as a parameter.
* For representation in PHP, a compromise would be possible:
** Attach extra data to the plain model objects "somehow", but access that information only transparently via mapping services (this is "e on top of a")
** Make a specialized "read model" for each "extra" aspect, and implement it on top of a mapping service ("c on top of e")
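
To make the combination of mapping services in PHP with "inline" JSON more concrete, here is a minimal sketch of a serializer that has its mapping services injected. It is not existing Wikibase code: `SimpleSnak`, `ArrayDataTypeLookup`, `PatternSnakUriMapper`, `InlineSnakSerializer` and the `$includeDerived` flag are hypothetical names made up for illustration, and the `$1` URL pattern is an assumption about how a URI could be derived. The flag only covers the "include extra" and "strip extra" serializer modes.

```lang=php
<?php
// Sketch only: all names below are hypothetical, not Wikibase classes.

/** Plain model object: it has no knowledge of derived data. */
final class SimpleSnak {
	private $propertyId;
	private $value;

	public function __construct( string $propertyId, string $value ) {
		$this->propertyId = $propertyId;
		$this->value = $value;
	}

	public function getPropertyId(): string {
		return $this->propertyId;
	}

	public function getValue(): string {
		return $this->value;
	}
}

/** Mapping service: property ID to data type. */
final class ArrayDataTypeLookup {
	private $types;

	public function __construct( array $types ) {
		$this->types = $types;
	}

	public function getDataType( string $propertyId ): string {
		if ( !isset( $this->types[$propertyId] ) ) {
			throw new OutOfBoundsException( "Unknown property: $propertyId" );
		}
		return $this->types[$propertyId];
	}
}

/** Mapping service: snak to derived URI, or null when none can be derived. */
final class PatternSnakUriMapper {
	private $patterns;

	public function __construct( array $patterns ) {
		$this->patterns = $patterns;
	}

	public function getUri( SimpleSnak $snak ): ?string {
		$id = $snak->getPropertyId();
		if ( !isset( $this->patterns[$id] ) ) {
			return null;
		}
		// '$1' is a placeholder for the snak's value in the URL pattern.
		return str_replace( '$1', $snak->getValue(), $this->patterns[$id] );
	}
}

/** Serializer that inlines derived values obtained from injected services. */
final class InlineSnakSerializer {
	private $dataTypeLookup;
	private $uriMapper;
	private $includeDerived;

	public function __construct(
		ArrayDataTypeLookup $dataTypeLookup,
		PatternSnakUriMapper $uriMapper,
		bool $includeDerived
	) {
		$this->dataTypeLookup = $dataTypeLookup;
		$this->uriMapper = $uriMapper;
		$this->includeDerived = $includeDerived;
	}

	public function serialize( SimpleSnak $snak ): array {
		$data = [
			'snaktype' => 'value',
			'property' => $snak->getPropertyId(),
			'datavalue' => [ 'value' => $snak->getValue(), 'type' => 'string' ],
		];
		if ( $this->includeDerived ) {
			$uri = $this->uriMapper->getUri( $snak );
			if ( $uri !== null ) {
				$data['datavalue-uri'] = [ 'value' => $uri, 'type' => 'string' ];
			}
			$data['datatype'] = $this->dataTypeLookup->getDataType( $snak->getPropertyId() );
		}
		return $data;
	}
}

$serializer = new InlineSnakSerializer(
	new ArrayDataTypeLookup( [ 'P854' => 'ID' ] ),
	new PatternSnakUriMapper( [ 'P854' => 'https://viaf.org/viaf/$1/' ] ),
	true
);
$json = $serializer->serialize( new SimpleSnak( 'P854', '85312226' ) );
echo json_encode( $json, JSON_UNESCAPED_SLASHES ), "\n";
```

With `$includeDerived` set to `true`, the array has the same shape as the inline JSON shown under **Subclassing**; with `false`, only the original data is emitted, which is what we would want for the database. In this sketch the model object never carries derived data, so a "fail on extra" mode has nothing to check. In real code the two services would more likely be interfaces, so that the derived values could be computed, loaded from an API, or read back from JSON.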