
[RFC] Use Role Object Pattern to represent derived data in the data model
Closed, Invalid · Public

Description

This tracking ticket governs the implementation of derived (secondary) information in the Wikibase data model. Examples of derived data:

  • a Snak's datatype
  • a SiteLink's URL
  • a Badge's thumbnail URL
  • an external ID's URI
  • a quantity's normalized value
  • a Statement's constraint violations

In addition to the above, the data model should be able to support deferred deserialization. The mechanism for secondary information should at least not obstruct the implementation of deferred deserialization.

Design

We are looking for a mechanism that allows arbitrary information to be "glued onto" Wikibase data model objects, for use by specialized code when rendering, serializing, or exporting (parts of) the data model. After some investigation (see T112550 and T112547) we have already ruled out some options. In particular, we decided that we want to represent the extra information inside the data model, for symmetry with client-side code, for usability of the code base, and to avoid the complexity the alternative solutions would require for some of the edge cases. For representation inside the data model, we found that specialized read models for every purpose cannot really be combined, and mean a lot of duplicate code to maintain. Putting knowledge about all possible extra info into the core model makes the model bloated and inflexible.

Some research turned up the Role Object Pattern as a good solution to our problem: alternative "views" of a component can be attached dynamically simply by adding an object that implements the desired "role". This pattern has been described several times in the literature.

The definitions and implementations differ slightly, but the general idea is always to allow a component (object) to offer different interfaces for different tasks by somehow attaching other objects that implement these interfaces at runtime. For Wikibase, the role object pattern could be used as follows:

// Implemented by data model objects that can have roles attached.
interface RoleEnabled {
  function getRoleManager();
}

// Manages the role objects attached to a single data model object.
interface RoleManager {
  function hasRole( $name );
  function getRoleObject( $name, $type = null );
  function addRoleObject( $name, $object );
}

// Example roles:

interface LinkTargetProvider {
  function getLinkTargetUrl();
}

interface DataTypeProvider {
  function getDataType();
}

interface DerivedValueProvider {
  function hasDerivedValue( $name );
  function getDerivedValue( $name );
}

The relevant classes in the data model would then each need to implement RoleEnabled.
For performance reasons, the RoleManager instance should be created on demand. The code
for maintaining the RoleManager instance would then be duplicated in each relevant data
model class:

class PropertyValueSnak implements RoleEnabled {
  private $roleManager = null;
  
  public function getRoleManager() {
    if ( !$this->roleManager ) {
      $this->roleManager = new SimpleRoleManager();
    }
    
    return $this->roleManager;
  }
}

class Term implements RoleEnabled {
  ...
}

class Statement implements RoleEnabled {
  ...
}

Note that roles can freely be added over the lifetime of an object. Not all roles will be known
when the object is instantiated.

The design outlined above provides quite a bit of flexibility, without adding any need for code to know about the new "role" concept. All support for roles is completely optional:

  • When deserializing from the database, no role objects (and not even RoleManager objects) get instantiated.
  • When deserializing API input, extra information can easily be ignored or rejected.
  • When deserializing API output on the client, extra information can be handled by a "role deserializer" facility, and attached to the appropriate model objects.
  • Roles may be defined outside Wikibase: a ConstraintViolationProvider role may be added by the WikibaseQuality extension, and ConstraintViolationProvider role objects may be added to the model when and where desired.
  • Formatters and other export mappings can make use of optional information in a type-safe way (see the sketch after this list).
  • The model can easily be extended to accept callbacks for instantiating role objects on demand.
  • Role objects may implement RoleEnabled themselves, and may maintain a "backlink" to the original object (making the question of which object is the "main" one rather academic).
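
For illustration, here is how a formatter might consume such optional information. This is only a sketch: SiteLinkHtmlFormatter, the 'linkTarget' role name, and the getPageName() call are made up, and SiteLink is assumed to implement RoleEnabled as proposed above.

class SiteLinkHtmlFormatter {
  public function format( SiteLink $siteLink ) {
    $roles = $siteLink->getRoleManager();
    
    // Only render a link if a LinkTargetProvider role was attached;
    // the $type argument guards against a role object of the wrong class.
    if ( $roles->hasRole( 'linkTarget' ) ) {
      $provider = $roles->getRoleObject( 'linkTarget', 'LinkTargetProvider' );
      return '<a href="' . htmlspecialchars( $provider->getLinkTargetUrl() ) . '">'
        . htmlspecialchars( $siteLink->getPageName() ) . '</a>';
    }
    
    // Fall back to plain text when no URL is available.
    return htmlspecialchars( $siteLink->getPageName() );
  }
}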

One major concern here is performance: the infrastructure for managing role objects means allocating extra objects (the RoleManager and probably an array inside the RoleManager) in addition to the actual role objects. This can be avoided at the cost of some of the abstraction:

class DataModelObject implements RoleManager {
  public function hasRole( $name ) {
    return isset( $this->{"_role_$name"} );
  }
  
  public function getRoleObject( $name, $type = null ) {
    // Role objects live in dynamic properties, avoiding the allocation
    // of a separate RoleManager object and hash per model object.
    return $this->{"_role_$name"};
  }
  
  public function addRoleObject( $name, $object ) {
    $this->{"_role_$name"} = $object;
  }
}

Data model objects would then use DataModelObject as their base class:

class PropertyValueSnak extends DataModelObject {
  ...
}

class Term extends DataModelObject {
  ...
}

class Statement extends DataModelObject {
  ...
}

This is admittedly a bit hacky, but it avoids instantiating any extra objects,
and access to roles still happens via a well-defined interface.
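
For example, attaching and reading a link target could look like this (a sketch; FixedLinkTarget and the 'linkTarget' role name are made up, and SiteLink is assumed to extend DataModelObject):

// A trivial LinkTargetProvider implementation holding a precomputed URL.
class FixedLinkTarget implements LinkTargetProvider {
  private $url;
  
  public function __construct( $url ) {
    $this->url = $url;
  }
  
  public function getLinkTargetUrl() {
    return $this->url;
  }
}

// A deserializer or pre-processing step attaches the URL...
$siteLink->addRoleObject( 'linkTarget', new FixedLinkTarget( $url ) );

// ...and rendering code later checks for the role before using it.
if ( $siteLink->hasRole( 'linkTarget' ) ) {
  $url = $siteLink->getRoleObject( 'linkTarget', 'LinkTargetProvider' )->getLinkTargetUrl();
}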

Breakdown

TBD

Event Timeline

daniel raised the priority of this task from to Needs Triage.
daniel updated the task description. (Show Details)
daniel subscribed.

After reading about the role object pattern for a bit, I agree that this is a good approach. The idea above makes sense to me, I'm just a little worried about performance (we need to make sure that we don't create too many objects in the process, as that's painfully slow). We probably want to make sure we reuse objects for the same purpose as much as possible here.

That looks to me like a fancier and less understandable way of just having an additionalData hash on each data model object. Did I miss something?

interface AdditionalDataSupport {
  function getAdditionalData();
}

interface AdditionalData {
  function hasData( $name );
  function getData( $name, $type = null );
  function addData( $name, $object );
}

interface LinkTargetProvider {
  function getLinkTargetUrl();
}

interface DataTypeProvider {
  function getDataType();
}

interface DerivedValueProvider {
  function hasDerivedValue( $name );
  function getDerivedValue( $name );
}

The Role Object Pattern doesn't say anything about the names of roles or the types of role objects, as far as I can see. Just like a hash, it stores arbitrary things under arbitrary keys.

Structurally, this would work, but it seems like a very general solution with a lot of overhead. I'm not sure this pattern works well in PHP, where the cost of creating additional objects is huge. I also wonder whether it really is good to manage all those (very different!) types of "derived" information in a uniform way. The examples given belong to very different objects and are based on very different inputs (some requiring external data sources, some not). I find it a bit unmotivated to architecturally unify things that are conceptually and technically so very different. The motivation given for choosing this solution starts from the premise that one has to find a single solution that works for all cases, including some "edge cases". Without this assumption, one would be free to solve the different problems individually, using what is best for each, instead of being forced to go for some least common denominator.

To pick just one example, consider the (article) URL of a SiteLink. To create it, one needs to have access to the content of the sites table. In WDTK, we encapsulate the sites table in an object (called Sites). To find out the URL of a SiteLink, one has to call a method of the Sites object (called getArticleUrl() or something), which takes a SiteLink as an input. This design is simple and efficient, uses no additional memory for storing new objects or values, and clearly locates the responsibility for computing this information (the Sites table). In a single-site setting (like you have in PHP), there is only one sites table, and you can access it statically, so the caller does not even need to have a reference to a Sites object as in WDTK. I therefore don't see any benefit in creating a role object for this simple task. It's just more indirection, without any convenience for the software developer or any gain in performance.
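
In PHP terms, that design might look like the following sketch (approximating the WDTK approach described above; the interface and method names are illustrative):

interface Sites {
  // Computes the full article URL for the given SiteLink from the sites table.
  function getArticleUrl( SiteLink $siteLink );
}

class SiteLinkRenderer {
  private $sites;
  
  public function __construct( Sites $sites ) {
    $this->sites = $sites;
  }
  
  public function render( SiteLink $siteLink ) {
    // The URL is computed on demand; nothing extra is stored on the model object.
    return $this->sites->getArticleUrl( $siteLink );
  }
}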

The situation is similar in several of the other cases mentioned. In the end, this is a choice for PHP development, which won't affect my work, so I'll leave it to you, but it seems you are making your life harder than necessary by going for a complicated solution instead of several simple solutions. There is only going to be a small number of different kinds of "derived data" ever, and there is hardly any place where such data is used in a way that does not need to understand its meaning (mainly for serialization).

For JSON exports of such data, I don't think this approach would make sense (but I suppose this is not intended here). There, one would simply use optional keys for including additional data. Likewise for RDF, where one would not want to introduce additional "role" objects that create another layer of indirection to access derived values (of course, RDF has a tradition of dealing with derived values, and nobody expects special structures there). I suppose that this proposal has nothing to do with JSON or RDF, but just to be sure we are on the same page.

The term "data model" is a bit over-used in our context -- maybe it would make sense to indicate in this bug report that it is specific to the object model used in the PHP implementation, and has no implications for other implementations or export formats.

That looks to me like a fancier and less understandable way of just having an additionalData hash on each data model object. Did I miss something?

That's pretty much it, yes.

The Role Object Pattern doesn't say anything about the names of roles or the types of role objects, as far as I can see. Just like a hash, it stores arbitrary things under arbitrary keys.

The names would be arbitrary, but access can be made type-safe. My idea was to use getRoleObject( $name, $type = null ) to perform an optional check against $type.

In addition, the RoleManager doesn't have to be a hash. It could also instantiate objects on demand. With a plain hash, we would not have the freedom to do that.
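
A minimal sketch of what such a RoleManager implementation could look like, combining the optional type check with on-demand instantiation (the addRoleFactory() method is an illustrative extension, not part of the RoleManager interface above):

class SimpleRoleManager implements RoleManager {
  private $roles = array();
  private $factories = array();
  
  public function hasRole( $name ) {
    return isset( $this->roles[$name] ) || isset( $this->factories[$name] );
  }
  
  public function getRoleObject( $name, $type = null ) {
    if ( !isset( $this->roles[$name] ) && isset( $this->factories[$name] ) ) {
      // Instantiate the role object on demand.
      $this->roles[$name] = call_user_func( $this->factories[$name] );
    }
    
    if ( !isset( $this->roles[$name] ) ) {
      throw new RuntimeException( "Unknown role: $name" );
    }
    
    if ( $type !== null && !( $this->roles[$name] instanceof $type ) ) {
      // Optional type safety: fail early if the role object does not
      // implement the expected interface.
      throw new RuntimeException( "Role $name is not of type $type" );
    }
    
    return $this->roles[$name];
  }
  
  public function addRoleObject( $name, $object ) {
    $this->roles[$name] = $object;
  }
  
  public function addRoleFactory( $name, $factory ) {
    $this->factories[$name] = $factory;
  }
}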

This approach isn't ideal (it's a bit more "amorphous" than I would like), but considering our previous discussions, it's the best compromise between flexibility and information locality I could find. It's also well described as a solution for the kind of need we have (context-dependent perspectives and additions on some kind of basic data model).

Structurally, this would work, but it seems like a very general solution with a lot of overhead. Not sure that this pattern works well on PHP, where the cost of creating additional objects is huge. I also wonder whether it really is good to manage all those (very different!) types of "derived" information in a uniform way.

The need for this solution is driven by the fact that using separate implementation strategies for the different kinds of derived data worked rather badly for us in the past two iterations of the serialization code. We ended up with a lot of inconsistencies and duplicate code.

I do share the concern about performance. I think it is manageable, but we should be careful about this. I think deferred deserialization is going to be key to that.

To pick just one example, consider the (article) URL of a SiteLink. To create it, one needs to have access to the content of the sites table. In WDTK, we encapsulate the sites table in an object (called Sites). To find out the URL of a SiteLink, one has to call a method of the Sites object (called getArticleUrl() or something), which takes a SiteLink as an input. This design is simple and efficient, uses no additional memory for storing new objects or values, and clearly locates the responsibility for computing this information (the Sites table).

But it requires the serialization and formatting code to depend on the lookup services, and have these lookup services injected. We have found that this is a bad thing, both in theory (representing all data in a dumb model before generating output was suggested by two external reviews) and in practice (where the cross-dependencies have kept us from splitting serialization code into a separate component for a long time).

In a single-site setting (like you have in PHP), there is only one sites table, and you can access it statically, so the caller does not even need to have a reference to a Sites object as in WDTK. I therefore don't see any benefit in creating a role object for this simple task. It's just more indirection, without any convenience for the software developer or any gain in performance.

Relying on global state like that for convenience is what MediaWiki is doing all over the place, with horrible results for testability and modularity. Getting away from that tangle is one of my main priorities as a core architect.

There is only going to be a small number of different kinds of "derived data" ever, and there is hardly any place where such data is used in a way that does not need to understand its meaning (mainly for serialization).

You always need to understand the meaning in order to use it. That's why roles are defined by interfaces. We are not passing around random blobs with random names.

I suppose that this proposal has nothing to do with JSON or RDF, but just to be sure we are on the same page.

Yes, this has nothing to do with JSON and RDF. Depending on the type of information, it would be represented in structurally different ways in the exported data structure. We have separate tickets for that.

The term "data model" is a bit over-used in our context -- maybe it would make sense to indicate in this bug report that it is specific to the object model used in the PHP implementation, and has no implications for other implementations or export formats.

You are right, this refers specifically to the PHP wikibase/data-model component.

@daniel As long as it works for you, this is all fine by me, but in my experience with PHP this could cost a lot of memory, which could be a problem for the long item pages that already caused problems in the past.

But it requires the serialization and formatting code to depend on the lookup services

I know that it's always nice in software architecture to reduce the number of dependencies (for all kinds of reasons). However, I don't see a strong reason why code that formats a sitelink should not depend on the facility that provides the URL to link to. When things have a conceptual dependency, it is not bad design to have a code dependency there as well. I don't think there is any reason to have duplicate code in either solution -- that's just a matter of coordinating work within the team (not saying it is easy, but architecture alone will not solve this ...).

Relying on global state like that for convenience is what MediaWiki is doing all over the place, with horrible results for testability and modularity

The state would not be global (I was over-simplifying). Of course you would have a $db object that provides access to the sites table. Reading from a table is a stateless operation, so there is no state (global or local) involved and you can indeed use static code if you like. But of course you could also use a Sites object like in WDTK. Regarding testing, I guess you already have a mock db connection object anyway (otherwise, how would you test db read operations ...). I don't see a reason why this solution should be any less modular or testable than what you propose. There is also no need to have any duplicate code.

When things have a conceptual dependency, it is not bad design to have a code dependency there as well.

I agree, but that dependency should be as narrow as possible. That's why "Read Models" exist.

The code that creates an HTML link from a SiteLink should have a dependency on an interface that provides a URL for the SiteLink. It should not know about lookups, and definitely not about databases. More importantly, the code that tests the render code should know nothing about databases.

I don't think there is any reason to have duplicate code in either solution -- that's just a matter of coordinating work within the team (not saying it is easy, but architecture alone will not solve this ...).

We have discussed several options and approaches in detail, over months. This is the result. See T112550 and T112547 for some discussion of other approaches.

Regarding testing, I guess you already have a mock db connection object anyway (otherwise, how would you test db read operations ...). I don't see a reason why this solution should be any less modular or testable than what you propose. There is also no need to have any duplicate code.

Because you are proposing strong coupling between rendering/serialization and storage layer code. We do have a mock database for testing database level code. It should not be needed anywhere else.

When things have a conceptual dependency, it is not bad design to have a code dependency there as well.

As far as I understand that, it argues for tight coupling (i.e. Sites::getSingleton() where the URL is used). Arguing for loose coupling might read like this:

"When things have a conceptual dependency they should share an interface and the dependency should be injected. The origin of dependency, and the implementation should have a code dependency on the interface, not on each other."

Relying on global state like that for conveniance is what MediaWiki is doing all over the place, with horrible results for testability and modularity.

so there is no state (global or local) involved and you can indeed use static code if you like

State was here referring to names that are in scope. Static code is global; you cannot replace a call to Sites::getSingleton() with a call to a mock without changing the global named Sites. Thus using static code makes it harder to test units in isolation, where the dependencies are replaced with mocks.

I don't want to detail every bit here, but it should be clear that one can easily eliminate the dependency on $db in the formatter code. The Sites object I mentioned is an example. It is *not* static in our implementation. You can make it an interface. You can inject Sites (or a mocked version of it) for testing -- this is what we do. The only dependency you will retain is that the formatting code, or some code that calls the formatting code, must know where to get the URL from.

All of this will work in any "sane" architecture -- I don't see where you need a role manager for this. In particular, you can always pull out dependencies by injecting interface-based objects instead, and this has nothing to do with how you represent the "derived data" in memory using objects. The role-based approach discussed here simply seems to be a generalised version of this pattern, with a one-solution-fits-all interface ("Role") instead of task-specific interfaces ("Sites" etc.). The reason why I am not convinced by this here is that the tasks at hand are quite diverse and refer to different objects. So for any particular object (such as SiteLink) you might not have many possible roles available, probably just one, and in such a situation the complexity of the general solution might be avoidable.

Maybe it's clearer if I say it in terms of the simpler "hash map of additional data" approach that @adrianheine mentioned: it seems to me you are adding such hashmaps to all objects (using a somewhat complicated way to encode the hashmap), just to have one or two entries per object in the end. In such a case, rather than using a hashmap, you would be better off using a member variable that can be null if the additional data is not there.

@mkroetzsch You are raising valid points, and we have been over them a few times in the last weeks. I'll try to briefly explain why we decided against some approaches:

  • core model class knows about all optional information: it pollutes the main model, and forces additional dependencies on the data-model component. It's not extensible (e.g. WikidataQuality has no place to put constraint violation info on Statements).
  • use lookup services in the serializer and renderer: that was actually a strong contender, and would work well enough for output. But when loading and deserializing data that contains extra information, the deserializers have no place in the model to put that extra info. It would have to be returned to the caller via some side channel. That's confusing and unintuitive, and especially bad for the "consumer" side, which will often be a 3rd party.

The role object isn't a particularly elegant solution. But all things considered, it seems the best compromise. It gives us the flexibility and modularity we need while keeping the code straightforward and the information flow obvious.

Another advantage of putting the relevant data into the model in a pre-processing step before output is that it gives us an opportunity to do bulk queries for labels etc. We'd just fetch the data in bulk and inject the info in one go, instead of sneaking it in through the back door of some caching lookup service with prefetch functionality. That keeps the relevant code in one place.

I find this approach very odd, do not understand why it would be a good idea, and am generally quite worried about it.

My main concern is pretty much described by the first paragraph of @mkroetzsch's first post. I've read through the replies, though I found the reasoning in there to generally be either beside the point, or lacking actual depth, such as "I did something like this before and it did not work well".

As I've pointed out before, it makes no sense to me to start discussing this on the premise that we need one global solution. Looking at the individual problems we need to solve and what the best solution is for each seems like a much better approach. Similarly, the need for deferred deserialization has not been demonstrated to me. I can see how it can be helpful in theory, though I am missing the concrete places where this would apply, and why for each of those cases the deferred approach is best. "Trust me, I already thought about it, and came to this conclusion, no need to re-examine it" is not an acceptable starting point if you want me to reason up from it.

A second concern I have is a combination of this approach ticking several design warning boxes and me not having seen anything in this direction used or justified for a use case like this anywhere. That by itself does not mean the approach is bad, though it certainly adds to my worry.

That looks to me like a fancier and less understandable way of just having an additionalData hash on each data model object. Did I miss something?

The names would be arbitrary, but access can be made type-safe. My idea was to use getRoleObject( $name, $type = null ) to perform an optional check against $type.

This saves the caller from writing an if check, at the cost of passing a type as a string and having the role manager do the check? I think that's a bad trade-off, and certainly not enough of a benefit to warrant the additional complexity proposed here.

(representing all data in a dumb model before generating output was suggested by two external reviews)

Those reviewers talked about having a response model that you can give to a presenter. That is very different from what you are proposing here.

where the cross-dependencies have kept us from splitting serialization code into a separate component for a long time

Huh? Splitting off the serialization code was not very hard, and not delayed by needing services such as those Markus talks about.

When things have a conceptual dependency, it is not bad design to have a code dependency there as well.

I agree, but that dependency should be as narrow as possible. That's why "Read Models" exist.

I utterly do not understand your reply to Markus here. What do read models have to do with what he wrote?

WikidataQuality has no place to put constraint violation info on Statements

What makes you think it's a good idea for an extension that adds a new concept to glue it onto an object from a different context?

PS: I do not think it is a good idea to start a discussion on whether a particular pattern is suitable for one's use case by listing a bunch of big names that have used the pattern in the past. Appeal to authority.

As I've pointed out before, it makes no sense to me to start discussing this on the premise that we need one global solution. Looking at the individual problems we need to solve and what the best solution is for each seems like a much better approach.

During our discussions of this topic, we have found a need for an extensible model, where code that handles the data model does not need to know about all the data associated with individual model objects. That doesn't mean that all derived data has to be handled using that mechanism. It does mean though that we need such a mechanism. And if we have it, we should consider whether we still need anything else.

I admit that as a compromise, it leans towards convenience, at the expense of strictness. It makes the model somewhat amorphous, a bit like our FormatterOptions class or the Serializer interface. I don't see a huge problem with this though.

Similarly, the need for deferred deserialization has not been demonstrated to me.

I don't see how the advantage can be demonstrated without implementing it. But that's beside the point here - in the context of representing derived values, we should just make sure we don't do anything that would keep us from implementing deferred deserialization.

This saves the caller from writing an if check, at the cost of passing a type as a string and having the role manager do the check? I think that's a bad trade-off, and certainly not enough of a benefit to warrant the additional complexity proposed here.

If you want it pretty, you can always wrap the model, or the access to the model, in some kind of adapter. If we have widespread need for this kind of access, I'd recommend that. But I currently don't see that need. Access to "extra" data will be limited to specialized serializers (and perhaps formatters, which currently use services to get extra data).

The alternative is to either have a specialized model for every use case (lots of overhead for writing the code and instantiating the model), or keeping the data associated with, but separate from, the actual model (awkward mapping logic).

Both of these alternative approaches are rather problematic when you think of loading a model from JSON. Neither has a way to simply collect all relevant information directly in the model, and return it as a single structure. To me, that is a pretty important point.

Now, it's true that in the repo code, we don't currently need that (though we probably will when we start to implement federation between repos). On the client wiki side, we do have a need for a consolidated model (for Lua access).

We could remove the round-trip requirement from the model, but I think this would be a bad idea: our PHP code wouldn't be usable for 3rd party clients, and would not be usable as a reference for client library code in JS or Python. Removing this requirement is currently the one thing I see that could persuade me to look at the "inject lookups/mappings" approach again.

(representing all data in a dumb model before generating output was suggested by two external reviews)

Those reviewers talked about having a response model that you can give to a presenter. That is very different from what you are proposing here.

The key point I took away from that is that the serializer should know only about the data in the model. Dependencies on services should be avoided.

where the cross-dependencies have kept us from splitting serialization code into a separate component for a long time

Huh? Splitting off the serialization code was not very hard, and not delayed by needing services such as those Markus talks about.

Well yes, because these serializers ignored the need for derived data in the output. We now work around this by hacking that data into the serialized structure after the fact, using a rather brittle visitor/callback mechanism. That need kept us from switching to the new serializer for a rather long time (about a year, iirc).

I agree, but that dependency should be as narrow as possible. That's why "Read Models" exist.

I utterly do not understand your reply to Markus here. What do read models have to do with what he wrote?

As above: The key point is that the serializer should know only about the data in the model. Dependencies on services should be avoided. Also: having all data in one place makes the code easier to work with, and easier to understand, I think.

WikidataQuality has no place to put constraint violation info on Statements

What makes you think it's a good idea for an extension that adds a new concept to glue it onto an object from a different context?

Simply the fact that it allows us to pass all the relevant data around in one piece, instead of having to maintain awkward mappings and making sure we pass around the right associated data thingies in the right place. It also makes bulk loading easier and more intuitive, because the loaded data is in plain sight, and the mechanism by which it is passed around is obvious.

PS: I do not think it is a good idea to start a discussion on whether a particular pattern is suitable for one's use case by listing a bunch of big names that have used the pattern in the past. Appeal to authority.

Coming across some big names in this context is something I found curious and interesting. I'm not saying we should do this because some big shot liked the idea. And it's not the reason I think the model is a good fit. The fact that this model has been studied and discussed before is definitely encouraging, and looking at the papers reassured me that it's a good fit for our situation.

Anyway, if you think tailored solutions for the individual needs listed in the description would be better, please make concrete suggestions.

I've been asked to comment on use-cases. I do believe we should have good support for people who write a plugin for Drupal or Wordpress for example that makes use of Wikidata's data. I do expect a lot more websites like inventaire and reasonator to pop up in the not too distant future as well.

@Lydia_Pintscher I take that to mean that we would like such projects to use our PHP binding of the data model, and we want to make it simple and intuitive to use the PHP binding with the JSON such 3rd party clients would receive from our web API. Did I get that right?

3rd party usage in PHP needs to be easy and intuitive, yes.

I wonder if we could just use the decorator pattern (isn't that basically the "interfaces" approach?).

Another idea I found is the Extension Objects Pattern (http://st.inf.tu-dresden.de/Lehre/WS06-07/dpf/gamma96.pdf, https://msdn.microsoft.com/en-us/library/ms733816%28v=vs.110%29.aspx?f=255&MSPPError=-2147217396). That seems to be somehow related to the roles pattern (is it actually the same?).

I'm not sure if I understood the Role Object Pattern correctly. To me, it seems that it is created to support objects which occur in completely different roles (like the Character/Director example on enwiki), while in our use case the statement doesn't suddenly become a sitelink. A sitelink with a URL attached is still in the role of a sitelink, and a unit snak with some normalized unit is still a snak. Do I understand the intention of the Role Pattern correctly? Is that really (from an abstract point of view) what we need conceptually in our data model implementation?

I wonder if we could just use the decorator pattern (isn't that basically the "interfaces" approach?).

With decorators, you can't mix and match: you only have one perspective (role), and no way to get the other perspectives. You can give the decorators a way to access other decorators, of course - that's essentially what the Role Object Pattern is.

Another idea I found is the Extension Objects Pattern (http://st.inf.tu-dresden.de/Lehre/WS06-07/dpf/gamma96.pdf, https://msdn.microsoft.com/en-us/library/ms733816%28v=vs.110%29.aspx?f=255&MSPPError=-2147217396). That seems to be somehow related to the roles pattern (is it actually the same?).

It's at least very similar. It seems like some authors use "Extension" when the roles share a common base interface, and "Role" if they don't - but others use the terms differently.

I don't like the term "role" very much, btw. I would personally prefer "facet".

I'm not sure if I understood the Role Object Pattern correctly. To me, it seems that it is created to support objects which occur in completely different roles (like the Character/Director example on enwiki), while in our use case the statement doesn't suddenly become a sitelink. A sitelink with a URL attached is still in the role of a sitelink, and a unit snak with some normalized unit is still a snak.

And the Character and the Director are still people, not hydrants.

A Sitelink with a URL attached could be said to have a UrlProvider role (which defines a getUrl() method). The code that uses the URL may or may not also use the "main" object, or other roles.

A Snak with derived values could provide an AlternativeValues role (which defines a getValue($name) method).

A ConstraintViolationSubject role could be added to Statements, without the core model caring. It would be simple to include the constraint violation info in the output.

A snak with an EntityId value could have the target's label attached via a ThingWithLabel role. The label could be included in JSON output, for easy display, simply by registering a serializer for that role.
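
A sketch of that last example (ThingWithLabel, FixedLabel, and the 'label' role name are all illustrative):

interface ThingWithLabel {
  function getLabel( $languageCode );
}

class FixedLabel implements ThingWithLabel {
  private $labels;
  
  public function __construct( array $labels ) {
    $this->labels = $labels; // map of language code => label text
  }
  
  public function getLabel( $languageCode ) {
    return isset( $this->labels[$languageCode] ) ? $this->labels[$languageCode] : null;
  }
}

// After bulk-loading labels for all referenced entities, attach them:
$snak->getRoleManager()->addRoleObject( 'label', new FixedLabel( $labels ) );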

Do I understand the intention of the Role Pattern correctly? Is that really (from an abstract point of view) what we need conceptually in our data model implementation?

It seems to me the following requirements indicate the Role Object Pattern:

  1. we want the data model to be used for input: 3rd party clients should use it to represent the output of the web API, and the content of JSON dumps, including any extra info like derived values, sitelink URLs, or entity labels.
  2. we want the model to be extensible: it should be possible to associate additional data with parts of the model without modifying the model component (e.g. attach constraint violations).

Without requirement 1, we could go for the "injected lookup service" approach which we currently use in many formatters.

Without requirement 2, we could go for subclassing or decorators (though this gets a bit messy, since we can't mix & match freely).

If we want to cover both requirements, the ROP seems a nice fit, even though it's a bit amorphous for my taste. But I don't see any good alternative, really.

The question has been raised whether it's a good idea to look for a single approach that can cover all our "extra data" needs. Maybe the requirements are overly broad, leading to an overly complex solution.

But I think the essential requirement is actually always the same: provide a way to associate additional data with a specific bit of the data model, for use during rendering or serialization. Where that information comes from and how it is used for generating output doesn't really matter for the modeling requirements.

Thanks for the detailed explanation Daniel, I think I now understand the idea much better :-)

What I still wonder about, however, is whether we can implement this pattern without completely changing our current implementation. Although this is somewhat subjective, it feels a bit strange to me to add all the role stuff to our core model. Would it be possible to keep the current implementation as simple as it is and treat it only as one (default/main) role? This would also allow us to separate the plain data which is stored and created by the users from the auto-generated data we add as other roles. Serializers could then decide to only use the main/default role or to look for other roles attached to the objects.

@Bene: the classes in the core model would get a method to get a different "role" or "facet". That doesn't add much to the data model. If we use the "alternative implementation" I outlined, this doesn't even add any overhead in terms of allocated objects.

The easiest way to get role support to all data model objects is a base class. Not the nicest approach in terms of abstraction, but avoids overhead and redundant code. And since the methods defined in the base class are part of the public interface, not internal helpers, it should be ok. I also don't see any point in having alternative implementations of e.g. Statement or SiteLink, especially if we have support for attaching arbitrary roles.

Anyway, it's always possible to just look at the "main" part of the model, or only at specific roles. This is one of the nice things about the ROP: all the roles are completely independent, and code that deals with one doesn't have to know anything about the others (or even about the core model). Classes implementing individual roles don't have to know anything about the data model; any object can be used as a role.

It would be possible to use a generic RoleBundle object for getting roles, and have our current data model objects as one of the roles, but that would make navigation through the model (and across roles) rather awkward. You'd need a way to go from the data model object back to the RoleBundle - and then the data model is aware of roles again, and we are back where we started.


PS: We could go one step further by implementing a static utility function that does a "role cast". It could be used with any object; if the object didn't implement any support for roles, the utility would behave as if the object did support roles, but not the particular role you asked for. This would allow the core model to not even mention roles in its interface. Implementations may or may not support roles. Calling code would just use Role::cast( $obj, $role, $type ) to get a different "angle" on an object, as sketched below.
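
A sketch of such a utility, assuming the RoleEnabled and RoleManager interfaces from the task description:

class Role {
  // "Casts" $obj to the given role. Returns null if the object does not
  // support roles at all, or does not have the requested role.
  public static function cast( $obj, $role, $type = null ) {
    if ( !( $obj instanceof RoleEnabled ) ) {
      return null;
    }
    
    $manager = $obj->getRoleManager();
    
    if ( !$manager->hasRole( $role ) ) {
      return null;
    }
    
    return $manager->getRoleObject( $role, $type );
  }
}

// Works with any object, role-enabled or not:
$provider = Role::cast( $siteLink, 'linkTarget', 'LinkTargetProvider' );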

But that may be going a step too far, making this too generic.

I think we should get rid of the idea that people who want to use our data in a simple and straight-forward way are going to use our code at all. They won't even use composer for their hacked Wordpress plugin. If people indeed want to do stuff the right way, because they want to do non-trivial work with the data, they should be able to do it right, and from my point of view that's the separate services way.

daniel renamed this task from [Tracking] Represent derived data in the data model to [RFC] Use Role Object Pattern to represent derived data in the data model.Nov 25 2015, 10:15 AM
daniel moved this task from Proposed to Review on the Wikidata-Sprint-2015-11-17 board.

Summary of a brief discussion with Adrian and Jan:

With the Role Object approach, consumers of extra data cannot declare that they want to use that data. With lookup/mapping services, consumers can declare the dependency by requiring the respective services to be injected. This disadvantage of the Role Object approach is mitigated somewhat by the fact that the extra data is usually considered optional (even if retrieved via a lookup service).

This however does not answer the question how calling code should know when which extra roles need to be injected into the model. When a model is built from an API response, all the desired extra data would be in the JSON, and then also in the model, as role objects. However, if the original model is deserialized from the database representation that lacks any extra info, we need to somehow know which additional data should be inserted before passing the data model to e.g. a formatter/view component for rendering to HTML.

Summary of today's discussion with @thiemowmde:

  • Don't replicate code in the model; use a common base class (or trait), so facets are allowed in all data model nodes (see the sketch after this list).
  • Inject facet declarations (a FacetDefinition service or some such) into each object that supports facets. This allows facets to be checked against the declared type, as well as listing of supported facets.
  • Define all aspects of a given facet in a single place, similar to the declaration of data types (or like ContentHandler).
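
A sketch of the first point using a trait (FacetProvider and its method names are illustrative; the facet-declaration service from the second point is omitted for brevity):

trait FacetProvider {
  private $facets = array();
  
  public function hasFacet( $name ) {
    return isset( $this->facets[$name] );
  }
  
  public function getFacet( $name, $type = null ) {
    if ( !isset( $this->facets[$name] ) ) {
      throw new RuntimeException( "Unknown facet: $name" );
    }
    
    if ( $type !== null && !( $this->facets[$name] instanceof $type ) ) {
      throw new RuntimeException( "Facet $name is not of type $type" );
    }
    
    return $this->facets[$name];
  }
  
  public function addFacet( $name, $object ) {
    $this->facets[$name] = $object;
  }
}

class Statement {
  use FacetProvider;
  // ...
}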

I can't really debunk this idea without having any code that uses it. The PR in data model doesn't look too bad, just unnecessary, but I'm pretty sure I'll have something against code that would use it. Maybe you could try the RDF thing with this, so I and others can propose other solutions and we have something to compare?

@Tobi_WMDE_SW *sigh* no. I need to write down an updated plan, to summarize the current state of discussion. I'm pushing back the final decision on the details until we have a concrete plan to work on. Perhaps we can talk about that at the next story time?

@adrianheine it's always easier to assess an idea once there is code. But in order to get to that point, quite a bit of work has to be done. It will be hard to go back, then. I suppose I could write strawman code for the RDF mapping based on the proposed changes to the data model...

Interestingly, Adrian wants to see how facets are used, while Thiemo is concerned with how they are defined.

Not quite. I want to see facet usage to have more opportunities to explain to you why I think they are bad, since my abstract points didn't get through to you yet :)

Addshore subscribed.

The task breakdown is no longer valid, and we have closed the epic T112548