Page MenuHomePhabricator

[RFC] Design of provider/holder interfaces in Wikibase's data model
Closed, ResolvedPublic

Description

The current interfaces to access terms (LabelsProvider, DescriptionsProvider and AliasesProvider) all state in their contract: "It is not guaranteed that this method returns the original object." This is a problem because with that sentence we cannot rely on the fact that changes made to the TermList returned by the method are visible to the provider.

We have three options so solve this issue:

  • We can either create a second list of interfaces called LabelsHolder, DescriptionsHolder and AliasesHolder. This is how we solved the issue for Fingerprint and StatementList. The disadvantages are that we get the overhead of three additional types and that we have to call the setter everytime we want to change someting. Code will always be like $var = $obj->getFoo(); $var->setBar( 'baz' ); $obj->setFoo( $var );.
  • Another option is to change the contract of the interfaces so that they always return a reference to the original object in the provider. We can then be sure that all changes are visible to the provider. However, there is no option to replace the entire object in case we want to clear the list (for example).
  • Third option would be to merge all the "provider" and "holder" interfaces. We than don't have the choice to provide immutable providers but we don't have the overhead of three (+two) additional interfaces.

It would be great to get some opinions on this question so that we can make a decision on the architecture of our data model.

Related Objects

StatusSubtypeAssignedTask
OpenNone
OpenNone
DuplicateNone
OpenFeatureNone
DuplicateNone
OpenFeatureNone
ResolvedNone
ResolvedNone
ResolvedNone
DuplicateNone
InvalidLydia_Pintscher
Resolveddaniel
ResolvedTobi_WMDE_SW
ResolvedNone
ResolvedLuke081515
ResolvedBene
ResolvedBene
ResolvedBene
Resolved adrianheine
Resolved adrianheine
ResolvedBene
ResolvedBene
Resolvedthiemowmde
DuplicateNone
Declined adrianheine
Resolved adrianheine
Resolveddaniel
Resolved adrianheine
Resolved adrianheine
Resolved adrianheine
DeclinedNone
Resolved adrianheine
Resolveddaniel
Resolveddaniel
Resolveddaniel
Resolveddaniel
Resolveddaniel
ResolvedLydia_Pintscher
OpenNone
OpenNone
StalledNone
OpenNone
ResolvedAddshore
Resolvedthiemowmde
ResolvedAddshore
Resolvedthiemowmde
ResolvedJeroenDeDauw
Resolvedthiemowmde
Declined adrianheine
ResolvedBene
ResolvedBene
ResolvedBene
ResolvedNone

Event Timeline

Bene raised the priority of this task from to Needs Triage.
Bene updated the task description. (Show Details)
Bene added subscribers: Bene, thiemowmde, JeroenDeDauw, daniel.

It's generally good to design interfaces from the perspective of their users. This is done by looking at each user and determining what would be best for them, as opposed to trying to come up with an interface that suits all users (including those that do not exist yet). The existing provider interfaces serve their users well, and have optimal interface segregation (their users do not end up depending on stuff they do not need). If there is a new class of users, namely those where immutability is not desired, then creating a new interface is typically better than making the interface less ideal for both users and implementers.

I think that looking at the number of interfaces one introduces is a very bad way of gauging complexity and maintenance cost. These come from the new functionality (the mutability) and the coupling caused by it.

I'm a bit worried about the contract of the existing method though. It says no guarantee of mutability is provided, which means no guarantee of immutability is provided either. Users might use mutability they should not rely upon without this being noticed. After all, tests will pass, and the violation is not obvious, esp if the programmers are thinking about a specific concrete implementation where mutability is guaranteed. They have have to notice the docs in the interface.

One way around this later concern is to have a dedicated interface for the case where users do not need mutability and implementers might not be able to provide it. In this interface the contract would be guaranteed lack of mutability. Users that would need mutability would use a different interface that does not, and this is key, derive from the other one. This interface would guarantee mutability.

I'm not aware of which users you are trying to cater to, or what kind of implementations you have in mind. It depends on these what the best solutions are.

It is not guaranteed that this method returns the original object.

Weird contract, as there presumably is no guarantee there is such a thing as an original object. (the return value can be created on the fly)

I'm a bit worried about the contract of the existing method though.

I agree that this is not optimal for the reasons you pointed out. It would indeed be nicer if immutability could be provided. However, to achieve that we either have to make TermList, StatementList etc. immutable or we have to do a full object clone everytime we access the object which sounds quite expensive.

I'm not aware of which users you are trying to cater to, or what kind of implementations you have in mind. It depends on these what the best solutions are.

The concrete users of these interfaces are the ones that currently use FingerprintHolder. We cannot longer bind terms to a fingerprint because not all entities have the combination a fingerprint provides (eg. the media type will only have labels and descriptions). Therefore, we need a way to tell a LabelsProvider that some label should change.

Regarding mutability: most of our data model objects are mutable. The idea is that we want to be able to build or modify domain objects simply and efficiently. For the same reason, the semantics for mutable objects returned from getters should always be that any changes to that object will be reflected in the "parent object".

The "mutable copy" semantics is something I would generally like to avoid. It's inefficient and confusing. Either our model objects are mutable, or they aren't. I see no use case for modifying a term list or sitelinks list or whatever in which we would not want that change to be reflected in the entity. Well, ok, maybe for filtering during output, though that can also be done on the fly. But if we want a copy, we should explicitly create a copy, not rely on the getter to always implicitly provide a copy, even if none is needed because no modifications are done.

JEumerus set Security to None.

Where do we currently need mutability of our data model objects?

Tobi_WMDE_SW added a project: Proposal.

@adrianheine we use the mutable interface to build the entity object during deserialization, but more importantly, we use it when applying ChangeOps to perform edits. I'm not sure we use it for filtering - I think we construct new instances in that case. But I could be wrong.

I suppose ChangeOps could easily work with immutable objects by having setters return new objects. For deserialization, this would probably produce a lot of objects, though. Can't we deserialize bottom-up to avoid this?

@adrianheine If you want to set a sitelink, the setter should return a new Item? That sounds rather awkward and inefficient.

We could re-design the data model to be immutable. It would mean rewriting a lot of code. I don't see that happening.

Generally: Manipulating complex data structures efficiently in a purely functional style means using "zippers" of some kind, which as far as I understand need good reference handling, and good memoization. Both would be a pain to do in PHP.

Why would that be awkward? It's another item. Also, you don't need the old one afterwards. I think in ChangeOps it would not look worse. Also, what would be inefficient about that? We are neither doing a lot of changes in ChangeOps nor are we working a lot with the entities after having changed them, so both immediate and lazy manipulations should be efficient enough.

If I get this ticket right we'll probably have to rewrite a lot of code anyway, and we can think upfront about what it should look like afterwards.

Why would that be awkward?

For one thing, because the SiteLinkList (and ReferenceList, etc) would have to know how to construct an entity, and which concrete type of entity to construct, and how. I see no good way to do that.

Or, if you spread that over multiple levels, you end up with something like

$statements = $entity->getStatementList();
$statement = $statements->getStatement( $hash );
$references = $statement->setReferenceList();
$entity = $entity->setStatementList(
  $statements->updateStatement(
    $hash,
    $statement->setReferenceList(
      $references->addReference( $reference ) ) ) );

Reads like lisp. I'd call it awkward compared to

$entity->getStatementList()
  ->getStatement( $hash )
    ->getReferenceList()
      ->addReference( $reference )

It's another item. Also, you don't need the old one afterwards.

If you don't need the old one afterwards, why create a new one instead of recycling the old one?

I think in ChangeOps it would not look worse. Also, what would be inefficient about that? We are neither doing a lot of changes in ChangeOps nor are we working a lot with the entities after having changed them, so both immediate and lazy manipulations should be efficient enough.

Duplicating the entire data structure for every change of any part of it seems rather wasteful. But you are right that it would be acceptable for edits. It wouldn't be acceptable for filtering or augmenting a data structure for output.

If I get this ticket right we'll probably have to rewrite a lot of code anyway, and we can think upfront about what it should look like afterwards.

I don't see that, actually. We are adding interfaces that will mostly just declare methods that already exist in the implementations. If we change contracts, we'll have to check where that may cause problems, but this is a far cry from changing the paradigm from "manipulate a data structure using methods" to "construct derived data structure using pure functions".

Why would that be awkward?

For one thing, because the SiteLinkList (and ReferenceList, etc) would have to know how to construct an entity, and which concrete type of entity to construct, and how. I see no good way to do that.

Yeah, well, that would be awkward.

Or, if you spread that over multiple levels, you end up with something like

$statements = $entity->getStatementList();
$statement = $statements->getStatement( $hash );
$references = $statement->setReferenceList();
$entity = $entity->setStatementList(
  $statements->updateStatement(
    $hash,
    $statement->setReferenceList(
      $references->addReference( $reference ) ) ) );

That's basically what I would suggest. We could add some delegation methods (Item::getStatement( $hash )), but it boils down to this.

It's another item. Also, you don't need the old one afterwards.

If you don't need the old one afterwards, why create a new one instead of recycling the old one?

Because it's much easier to reason about immutable data structures, and because immutable data structures can usually be implemented more efficiently.

I think in ChangeOps it would not look worse. Also, what would be inefficient about that? We are neither doing a lot of changes in ChangeOps nor are we working a lot with the entities after having changed them, so both immediate and lazy manipulations should be efficient enough.

Duplicating the entire data structure for every change of any part of it seems rather wasteful. But you are right that it would be acceptable for edits. It wouldn't be acceptable for filtering or augmenting a data structure for output.

That's where lazy manipulations could be used (FilteredEntity(Entity, EntityFilter) implements Entity). Quite difficult without generics.

If I get this ticket right we'll probably have to rewrite a lot of code anyway, and we can think upfront about what it should look like afterwards.

I don't see that, actually. We are adding interfaces that will mostly just declare methods that already exist in the implementations. If we change contracts, we'll have to check where that may cause problems, but this is a far cry from changing the paradigm from "manipulate a data structure using methods" to "construct derived data structure using pure functions".

We will have to switch stuff from fingerprint to individual term manipulation. It's probably not that much, though.

@daniel I think you're not getting what Adrian means.

class Item {

    public function withSiteLink( SiteLink $link ) {
        return new Item( /* stuff with new link added */ );
    }

}

You'd not get dependencies from the smaller objects onto the bigger ones.

Making all model objects immutable is not what this issue is about though. This is however something that we've been "oh that would be nice" at since the start of the project (long before we had a DataModel component), though something that while nice in theory, does not seem practical in PHP.

@JeroenDeDauw I did get that, but that only works for things that exist on the Entity level. Or you need to have setters for *every* part of the model on the Entity level. Item::addBadgeToSiteLink( $siteId, $badge ), Item::addReferenceToStatement( $hash, $reference ), etc. Which in turn means that everything that modifies badges would have to operate on Items, even though it shouldn't have to know about entities at all.

Yes, possible, but awkward.

@adrianheine I basically agree with you that formally, that would be really nice. But practically, with PHP and the existing code base, it's pretty much out of the question.

I vote for mutability. All provider interfaces (including the existing FingerprintProvider and possibly more) should guarantee they do nothing but returning what they have. If an implementation can not simply return an Object as it is but has to construct it whenever the getter is called, it should not use this interface. It should not be a requirement that a getter must do expensive cloning before returning anything. This should be a separate thing, which we just recently discussed, see https://github.com/wmde/WikibaseDataModel/pull/626. This also caused trouble in Wikibase-DataModel-JavaScript, see https://github.com/wmde/WikibaseDataModelJavaScript/pull/47 and much more patches there.

From my point of view this was one of the top reasons why we introduced "trivial" array wrappers like StatementList and so on. You can do …

$fingerprint = $item->getFingerprint();
$fingerprint->setLabel(  );

… and be fine. This is not possible with arrays. A user calling a setter like in the example above should not be enforced to call an other setter. This feels like calling setForReal(). I find this confusing and can tell from experience that this is error-prone.

  • I consider calling setFingerprint a mistake in all cases. We do have some calls to this method in our code base because we had somewhat "broken" versions of Wikibase-DataModel that returned a new Fingerprint in getFingerprint. All these setters should be removed.
  • I consider type hints against a …Holder interface a mistake and code smell when no setter is called.
  • We currently "misuse" setStatements( new StatementsList() ) to remove all statements from an entity. This can easily be solved by introducing StatementList::clear.

I basically agree with what @thiemowmde wrote.

Does that include we should get rid of all ...Holder interfaces and instead introduce clear methods on StatementList and TermList? We can still have setStatements in Item and Property for tests.

In my opinion deprecating the interfaces and not using them (thus finally removing them) is the cleanest solution and avoids that they are misused as described above.

If there aren't any objections against this, I think we can close this RFC, deprecate the ...Holder interfaces in the next release of Wikibase-DataModel and not use them in Wikibase.