Page MenuHomePhabricator

Rewrite Wikibase data model implementation
Open, LowPublic

Description

Currently, data of claims, qualifiers and references are held in a single class, pywikibot.Claim. I propose to factor things common for all of them to a new class pywikibot.BaseClaim and introduce new classes pywikibot.Qualifier, pywikibot.Reference and pywikibot.ReferenceGroup. All but pywikibot.ReferenceGroup would inherit from the base class.

This will have many advantages:

  • fixing T186198 will be easier
  • input validation (eg. in site.py) will only require assert isinstance()
  • pywikibot.Qualifier(repo, 'P123') is simpler and less confusing than pywikibot.Claim(repo, 'P123', isQualifier=True)

Problem: backwards compatibility.

Current framework

class Claim(Property):

  '''Class variables'''
  TARGET_CONVERTER = {}
  SNAK_TYPES = ()

  def __init__():
    '''Attributes'''
    self.snak = snak
    self.hash = hash
    self.isReference = isReference
    self.isQualifier = isQualifier
    self.sources = []
    self.qualifiers = OrderedDict()
    self.target = None
    self.snaktype = 'value'
    self.rank = 'normal'
    self.on_item = None

  '''Methods'''
  def __repr__
  def fromJSON
  def referenceFromJSON
  def qualifierFromJSON
  def toJSON
  def setTarget
  def changeTarget
  def getTarget
  def getSnakType
  def setSnakType
  def getRank
  def setRank
  def changeRank
  def changeSnakType
  def getSources
  def addSource
  def addSources
  def removeSource
  def removeSources
  def addQualifier
  def removeQualifier
  def removeQualifiers
  def target_equals
  def has_qualifier
  def _formatValue
  def _formatDataValue

Proposed framework

class BaseClaim(Property):

  '''Class variables'''
  TARGET_CONVERTER = {}
  SNAK_TYPES = ()

  def __init__():
    '''Attributes'''
    self.target = None
    self.snaktype = 'value'

  '''Methods'''
  def __repr__(self): raise NotImplementedError
  def fromJSON(self): raise NotImplementedError
  def toJSON(self): raise NotImplementedError
  def changeTarget(self): raise NotImplementedError
  def changeSnakType(self): raise NotImplementedError
  def target_equals
  def setTarget
  def getTarget
  def getSnakType
  def setSnakType
  def _formatValue
  def _formatDataValue

class Claim(BaseClaim):

  def __init__():
    '''Attributes'''
    self.rank = 'normal'
    self.qualifiers = OrderedDict()
    self.references = []

  def getRank
  def setRank
  def changeRank
  def getSources
  def addSource
  def addSources
  def removeSource
  def removeSources
  def addQualifier
  def removeQualifier
  def removeQualifiers
  def has_qualifier

class Qualifier(BaseClaim):

  def __init__():
    '''Attributes'''
	self.parent = None

  def __repr__
  def fromJSON
  def toJSON
  def changeTarget
  def changeSnakType

class ReferenceGroup(object):

  def __init__():
    '''Attributes'''
	self.parent = None
	self.references = OrderedDict()

  def toJSON
  def fromJSON

class Reference(BaseClaim):

  def __init__():
    '''Attributes'''
	self.parent = None

  def __repr__
  def fromJSON
  def toJSON
  def changeTarget
  def changeSnakType

Related Objects

Event Timeline

Additionally, I propose to migrate from getters/setters to @property.

+1 for this idea. It will also make T76615: Claim equality operator easier to deal with.

One note though is that the the suggestion above does not match Wikibase steucture as described in https://phabricator.wikimedia.org/T76615#3464800. Importantly pywikibot.Claim here is a mixture of Claim (value +Qualifiers) and Statement (Claim+References).

If we are breaking things this would be a great time to drop the missleading target_equals (or rename it to clarify its not a true equals).
Might also be worth considering if self.qualifiers should be simple dict (since Wikibase does not say the order has significance).

I'm not familiar with Wikidata, but at least this seems to make the code cleaner, improve its readability, and reduce code-complexity on Codeclimate.

Just make sure you will solve some of the issues in Pywikibot-Wikidata belonging to this part as they would require to rewrite the code once more when being solved after the rewrite.

Expanding on my previous comment about the current and proposed pywikibot.Claim being a mixture of Wikibase's Claim and Statement. Reproducing that structure in pywikibot would be fairly easy using the below amendment to the proposal

class Claim(BaseClaim):

 def __init__():
  '''Attributes'''
  self.rank = 'normal'
  self.qualifiers = OrderedDict()

 def getRank
 def setRank
 def changeRank
 def addQualifier
 def removeQualifier
 def removeQualifiers
 def has_qualifier

class Statement(object):
 def __init__():
  '''Attributes'''
  self.claim = None
  self.references = []

 def getSources
 def addSource
 def addSources
 def removeSource
 def removeSources

BUT other than reducing some confusion and being able to make sensible Claim comparison (comparing should ignore sources) the benefits don't outweigh the benefits of having them in one object. I would however propose naming it Statement(BaseClaim) or StatementClaim(BaseClaim) to avoid confusion. This would also be a clean break alowing us to temporarily raise an Error if anyone calls Claim. I also think this is something we want to do for Lexeme support in the future where there are yet other types of claims.

I noticed that the snak, hash and on_item attributes have been dropped. Is this on purpose? Are they not currently needed?

For the equality issue I would propose adding something like the following method to the Statement/StatementClaim class

def same_as(target, ignore_rank=True, ignore_refs=True, ignore_quals=False, ignore_qual_order=False):
  # order of self.references is always ignored

def __eq__(target):
 return self.same_as(target, False, False)

It might also be useful to default to ignoring the order of ReferenceGroup.references for comparisons.

Lastly to reiterate that with the above caveat about renaming (and dumping targetEquals) I fully support the much needed refactoring.

I'm not familiar with Wikidata, but at least this seems to make the code cleaner, improve its readability, and reduce code-complexity on Codeclimate.

Just make sure you will solve some of the issues in Pywikibot-Wikidata belonging to this part as they would require to rewrite the code once more when being solved after the rewrite.

At a quick glance T170432: Function to add a claim with source in a single edit/T112577: Make it possible to add a qualifier together with a new claim using new_claim.addQualifier() are probably the most interesting ones to keep in mind. For those to be possible there would have to be some way of preparing references/qualifiers without the addSource/addQualifier calling the API. This refactoring doesnt solve that but I also don't believe it makes it harder. Possibly we want to add a function to BaseClaim which addSource/addQualifier/setTarget (etc.) extends so that the postponement mechanism can be added there.

T124139: Claim new method has_source/T76615: Claim equality operator would be fixed/easier to fix by the refactoring (possibly no longer needeed with the above same_as suggestion)

T107712: Use narrow API calls instead of wbeditentity in Pywikibot Would be helped by the refactor.

Per T108440: Pull out all Wikibase-related parts of pywikibot to pywikibot/wikibase and use it as a submodule it might be worth taking a look at https://github.com/wikimedia/pywikibot-wikibase to see how/if it tackled this. Or even revisit the question about moving all wikibase code into a submodule.

I think reworking this implementation would be very welcome because at the moment it is not pretty, to say it politely.
But I am not convinced by the alternative either. Why would Reference inherit from BaseClaim? A reference is not a claim. What would the getSnakType method mean when called on a Reference?

Maybe you could have a look at the Java library Wikidata-Toolkit: the data model is implemented very well there. No nonsensical inheritance relationships, clean and precise interfaces.

https://github.com/Wikidata/Wikidata-Toolkit/tree/master/wdtk-datamodel/src/main/java/org/wikidata/wdtk/datamodel/interfaces

Also, I think it would be very hard to clean up this implementation while maintaining compatibility with existing user code. Given that pywikibot has basically no releases and people are encouraged to use the master branch, I don't see how a breaking change like this could be carried out without creating a big mess. So in short, I don't see how that can be done at all in pywikibot's git repository. But I would be very happy to be proven wrong.

Xqt triaged this task as Low priority.Dec 18 2018, 9:43 AM

Also, I think it would be very hard to clean up this implementation while maintaining compatibility with existing user code. Given that pywikibot has basically no releases and people are encouraged to use the master branch, I don't see how a breaking change like this could be carried out without creating a big mess. So in short, I don't see how that can be done at all in pywikibot's git repository. But I would be very happy to be proven wrong.

This at least has been addressed with users now recomended to use release tagged with "stable" while pywikibot developers use the master branch well aware that it might break/be broken. While backwards compatibility is likely impossible we should strive to ensure any "old" calls raise suitable deprecation exceptions for a while.

This rework is probably also a good opportunity to pull wikibase related code out of __init__.py into e.g. wikibase.py. With T189321 and T223820 there will be even more wikibase related code coming in and __init__ is fairly bloated as is.

It's probably worth having another look at the suggested implementation (and related comments) with the way we now know WikibaseLexeme and WikibaseMediaInfo work to ensure its compatible/extensible.

Just today I've found another attempt of Python Wikidata UI except Pywikibot and Wikidata Integrator:

Daty, which uses Pywikibot in the backend

Another interesting project we might cooperate with when rewriting Wikibase model.

It's clear now the question is not whether but how.

The major problem is the backwards compatibility. @Lokal_Profil proposes splitting pywikibot.Claim since

pywikibot.Claim here is a mixture of Claim (value +Qualifiers) and Statement (Claim+References).

and refers to my citation of documentation in T76615#3464801. However, it isn't accurate:

So I believe this splitting would be conter-productive.

Another incompatibility of class API is the change from pywikibot.Claim(repo, 'P123', isQualifier=True) to pywikibot.Qualifier(repo, 'P123') or similar. There are three possibilities:

  1. renaming Claim to Statement and making pywikibot.Claim a deprecated factory function
  2. same as 1 but Claim is not deprecated for calls with neither isQualifier nor isReference and eventually becomes an alias of Statement
  3. implementing Claim.__new__ (will return Qualifier for isQualifier=True and Reference for isReference=True, otherwise just Claim)

The problem with the first two is that the class Claim also holds public instance methods (most notably Claim.fromJSON). I've never implemented Python's __new__ metamethods but I hope it works the way I mean.

It's clear now the question is not whether but how.

The major problem is the backwards compatibility

Probably we can introduce a new Python Namespace for a new model i.e the wikibase file or subfolder to implement a new clean stucture, keeping the old unchangend. Step by step the old implementation can be deprecated and replaced by the new.

It's clear now the question is not whether but how.

The major problem is the backwards compatibility

Probably we can introduce a new Python Namespace for a new model i.e the wikibase file or subfolder to implement a new clean stucture, keeping the old unchangend. Step by step the old implementation can be deprecated and replaced by the new.

+1 for this idea

+1. Very good idea (I believe the best one). page.py is getting a lot of new things in https://gerrit.wikimedia.org/r/c/pywikibot/core/+/526380 (so putting it on hold), this overhaul would add more and there is still a lot to be implemented... So there will be a new subtask.

T108440 proposes creating a submodule but just moving all the stuff to a separate file would work too, so that can be done later.