The goal of this RFC is to decide how two new requirements for the data model could be implemented: derived values (and other secondary information), and deferred deserialization. These two issues are unrelated, but both impact the architecture of the data model, so they should both be considered when deciding on the future architecture of the model.
Some context for each of these issues:
------
Some use cases for derived values:
* Terms with language fallback information (derived term list)
* SiteLink with derived URL
* ID snak with derived URI and URL values
* Quantity snak with derived value converted to the base unit
* Time snak with derived Gregorian ISO value
* CommonsMedia snak with derived image URI, description page URL, thumbnail URL(s)
* maybe even EntityId snaks with derived entity URI (useful for external entities / federation)
------
{T90707} calls for deferred deserialization especially of lists / containers like:
* terms
* statements
* sitelinks
------
In {T112547} it was decided that we should represent derived values in the data model in order to include them in the JSON output. Several options were identified for doing so:
a. Use subclassing to create "extended" versions of the relevant classes in the model. This is already done for some things (Terms with language fallback, for instance), but was identified as the least desirable option in the previous RFC discussion.
b. Create a completely separate model hierarchy, wrapping the base model, and attaching additional information. This was seen as clean but tedious in the previous RFC discussion.
c. Create specialized models for each use case, independent of the base model, with the appropriate transformation services. This was seen as rather hard to maintain and tricky to document and communicate to third parties.
d. Define interfaces for the core model, but support additional (derived) data in the implementation classes (or have separate interfaces and/or implementations that support additional information). This would break all code that currently instantiates model objects directly.
e. Provide mappings from original values to derived values parallel to, but separate from, the data model (e.g. SnakValueMapper::getNormalizedValue( $snak ) ). In JSON, the derived values would be provided on the top level of the entity, as maps from original value (hashes) to derived values.
Results of Meeting on 2015-11-05:
==============================
Present: Bene, Thiemo, Jan, Jeroen, Jonas, Daniel. Adrian provided input beforehand, suggesting the new option (e).
We made some general observations that helped us narrow down the choices:
* Option (b) (wrapper model) seems pointless if the wrapper model implements the same interfaces as the original model - this would be equivalent to (a). If it does not, it is equivalent to (c) (specialized data models).
* Option (c) (lean specialized models) is a good fit when only a subset of the original data is needed for a particular use case. It's more useful for removing/hiding information than for adding extra data. It does not help us with our use case of generating JSON and RDF, which needs all original data plus some derived information. Using a specialized lean data model as input for the HTML generation may still make sense.
* Option (d) (interfaces with separate implementations) is not a separate option, but rather a flavor of (a) and (b). For (b), interfaces would probably be necessary; for (a), they are rather pointless.
* Derived data is an issue for "leaf" objects of the data model (Snaks, Terms, SiteLinks), while deferred deserialization is an issue for container classes (SiteLinkList, StatementList, TermList, etc). To avoid data loss when round-tripping, containers should however indicate whether they contain filtered information, in which case they would be refused as input for edits. This can be done with a flag or possibly a marker interface.
* We want interfaces for container classes, so we can provide deferred implementations for them. The deferred implementations would live in the serialization component.
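The last two points can be sketched as follows. This is only an illustration, not an agreed design: `TermListInterface`, `DeferredTermList`, `isFiltered()` and `assertUsableForEdit()` are hypothetical names, and terms are reduced to plain strings keyed by language code to keep the example short.

```lang=php
// Hypothetical container interface; the name and methods are assumptions.
interface TermListInterface extends Countable, IteratorAggregate {
	// True if the list holds only a filtered subset of the original data.
	public function isFiltered();
}

// Hypothetical deferred implementation: keeps the raw serialization and
// only deserializes the individual terms on first iteration.
class DeferredTermList implements TermListInterface {
	private $serialization;
	private $deserializer;
	private $filtered;
	private $terms = null;

	public function __construct( array $serialization, callable $deserializer, $filtered = false ) {
		$this->serialization = $serialization;
		$this->deserializer = $deserializer;
		$this->filtered = $filtered;
	}

	private function load() {
		if ( $this->terms === null ) {
			// array_map() preserves keys when given a single array.
			$this->terms = array_map( $this->deserializer, $this->serialization );
		}
		return $this->terms;
	}

	// Counting does not require deserializing the terms.
	public function count(): int {
		return count( $this->serialization );
	}

	public function getIterator(): Traversable {
		return new ArrayIterator( $this->load() );
	}

	public function isFiltered() {
		return $this->filtered;
	}
}

// Hypothetical guard for edit code paths: refuse filtered containers.
function assertUsableForEdit( TermListInterface $terms ) {
	if ( $terms->isFiltered() ) {
		throw new InvalidArgumentException( 'Filtered containers cannot be used as input for edits' );
	}
}
```

A marker interface could replace the `isFiltered()` flag; the flag is used here only because it keeps the sketch shorter.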
Some additional use cases were suggested for consideration:
* inclusion of property data types in snaks
* including labels and URLs (or URIs) of referenced entities in JSON output, for convenience.
* including thumbnail URLs for sitelink badges in JSON output.
The above leaves option (a) (subclassing) and option (e) (mapper services) for representing derived information (terms, sitelinks, snaks). These two options on the PHP side correspond to (but are not bound to, in the actual implementation) different representations in JSON:
**Subclassing**:
```lang=php
$snak->getDataType();
if ( $snak instanceof SlottyPropertyValueSnak ) {
	$snak->getValueSlot( 'uri' );
}
```
```lang=js
"mainsnak":{
"snaktype":"value",
"property":"P854",
"datavalue":{
"value":"85312226",
"type":"string"
},
"datavalue-uri":{
"value":"https://viaf.org/viaf/85312226/",
"type":"string"
},
"datatype":"ID"
}
```
* Convenient/intuitive
* Good information locality
* Straightforward implementation
* JSON is easy to process by naive clients
* Denormalized, potentially very redundant
* Inflexible, data model needs to be aware of all kinds of derived "extra" data.
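A minimal sketch of how the subclassing option could produce the JSON shown above. `SlottyPropertyValueSnak` and `getValueSlot()` are taken from the snippet above; `SimpleValueSnak` is a simplified stand-in for the real snak class, and `serializeSnak()` is a hypothetical helper, not the actual serializer.

```lang=php
// Simplified stand-in for the base snak class; only what the sketch needs.
class SimpleValueSnak {
	private $propertyId;
	private $value;

	public function __construct( $propertyId, $value ) {
		$this->propertyId = $propertyId;
		$this->value = $value;
	}

	public function getPropertyId() {
		return $this->propertyId;
	}

	public function getValue() {
		return $this->value;
	}
}

// "Extended" snak that carries derived values in named slots.
class SlottyPropertyValueSnak extends SimpleValueSnak {
	private $slots;

	public function __construct( $propertyId, $value, array $slots = [] ) {
		parent::__construct( $propertyId, $value );
		$this->slots = $slots;
	}

	// Returns the derived value in the given slot, or null if there is none.
	public function getValueSlot( $name ) {
		return isset( $this->slots[$name] ) ? $this->slots[$name] : null;
	}
}

// Hypothetical serializer: emits "datavalue-uri" next to "datavalue"
// when the snak carries a derived URI.
function serializeSnak( SimpleValueSnak $snak ) {
	$json = [
		'snaktype' => 'value',
		'property' => $snak->getPropertyId(),
		'datavalue' => [ 'value' => $snak->getValue(), 'type' => 'string' ],
	];
	if ( $snak instanceof SlottyPropertyValueSnak ) {
		$uri = $snak->getValueSlot( 'uri' );
		if ( $uri !== null ) {
			$json['datavalue-uri'] = [ 'value' => $uri, 'type' => 'string' ];
		}
	}
	return $json;
}
```

Every consumer that wants the derived value needs an `instanceof` check like the one in `serializeSnak()`, and each new kind of derived data means another slot the model has to know about.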
**Mapping**:
```lang=php
$this->dataTypeLookup->getDataType( $snak->getPropertyId() );
$this->snakValueNormalizer->getNormalizedSnakValue( $snak );
```
```lang=js
"mainsnak":{
"snaktype":"value",
"property":"P854",
"datavalue":{
"value":"85312226",
"type":"string"
}
},
...
"property-datatype-mapping": {
"P854":"ID"
},
"value-uri-mapping": {
"c55656909a7e6f225d97e589a12c30a36cd05a12": {
"value":"https://viaf.org/viaf/85312226/",
"type":"string"
}
}
```
* Flexible: derived values can be extracted from JSON, or computed, or loaded from an API
* Data model does not know about derived "extra" data.
* Normalized, little redundancy
* Mapping services need to be injected
* Mapping services need to be implemented in all target languages
* JSON processing requires hash-based lookups
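To illustrate the hash-based lookup, here is a rough sketch of a mapper backed by the top-level `value-uri-mapping` of an entity's JSON. `JsonBackedValueUriMapper` is a hypothetical name, and the hash function (SHA-1 over the JSON-encoded datavalue) is purely an assumption; how values are hashed is not specified here, so this function will not reproduce the hash shown in the example above.

```lang=php
// Hypothetical mapper that reads derived URIs from the entity's JSON.
class JsonBackedValueUriMapper {
	private $mapping;
	private $hasher;

	public function __construct( array $entityJson, callable $hasher ) {
		$this->mapping = isset( $entityJson['value-uri-mapping'] )
			? $entityJson['value-uri-mapping']
			: [];
		$this->hasher = $hasher;
	}

	// Returns the derived URI for the snak's value, or null if none is mapped.
	public function getUri( array $snakJson ) {
		$hash = call_user_func( $this->hasher, $snakJson['datavalue'] );
		return isset( $this->mapping[$hash] ) ? $this->mapping[$hash]['value'] : null;
	}
}

// ASSUMPTION: the hashing scheme is made up for this sketch.
$hasher = function ( array $dataValue ) {
	return sha1( json_encode( $dataValue ) );
};
```

Because the mapper sits behind a small interface-like surface, the same call could equally be served by computing the URI locally or by loading it from an API, which is the flexibility listed above.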