Page MenuHomePhabricator

Handle property datatype changing gracefully
Open, HighPublic

Description

Pywikibot will completely break whenever T95686: [Task] write a maintenance script to migrate properties from string to new identifier datatype is used, as it caches the property datatype for one hundred years, and has done that since April 2013 rPWBC6ea6d3330829: Split getting a property's datatype into its own request so it can be cached..

The simple workaround, e.g. for people using old versions of pywikibot, is to delete the apicache each time a property datatype changes, but that isnt a user-friendly solution.

Changes that could mean Pywikibot is less impacted by this:

Pywikibot could force a cache refresh more regularly.

Ideally we detect that a property has a new datatype, and clear the relevant cache entry.

Also, Pywikibot can eliminate the reliance on getPropertyType, as many code paths already have the properties datatype from when the entity was loaded, e.g. via wbgetentities. If getPropertyType is only used infrequently, the caching can be eliminated.

Event Timeline

jayvdb created this task.Oct 16 2015, 1:56 AM
jayvdb raised the priority of this task from to High.
jayvdb updated the task description. (Show Details)
jayvdb added subscribers: jayvdb, Legoktm.
Restricted Application added subscribers: pywikibot-bugs-list, Aklapper. · View Herald TranscriptOct 16 2015, 1:56 AM
jayvdb updated the task description. (Show Details)Oct 16 2015, 2:01 AM
jayvdb set Security to None.
Multichill added subscribers: Lydia_Pintscher, hoo.
Multichill added a subscriber: Multichill.

https://gerrit.wikimedia.org/r/137737 and https://gerrit.wikimedia.org/r/227252 basically mean that getPropertyType only gets called when Property is constructed manually, right?

https://gerrit.wikimedia.org/r/137737 and https://gerrit.wikimedia.org/r/227252 basically mean that getPropertyType only gets called when Property is constructed manually, right?

Yes. Property is a parent class of Claim, so the current exposure is quite large; roughly any time a Claim is instantiated directly in order to be used with addClaim or similar. i.e. reading will likely not break, at least not frequently and probably easily fixable, however writing to Wikibase will likely be very broken.

Do we expect this to happen again? A property's datatype is supposed to be immutable, and except for this one weird case, it is. Unless we can also vary on some other constant changing, I don't see any good cache invalidation strategy here.

If we expect this kind of change to happen again, I think we should lower the cache time to 1 day and accept the minor performance hit. If not, lets just tell users to rm -rf their apicache and continue on.

On cache invalidation, we can force a refresh of the getPropertyType property-datatype mapping whenever a new datatype is encountered. What would be nice is if Wikibase provided an API call to list all datatypes, so we can detect when the list of datatypes changes. i.e. expose https://test.wikidata.org/wiki/Special:ListDatatypes via the API.
However we dont need that, as we can get the list of datatypes from the paraminfo of wbformatvalue.datatype. However we need to be mindful of T55265, if that problem hasnt been fixed yet.

The harder, and much lower exposed problem, is if pywikibot users are using CachedRequest for entities, as the cached serialisation will have the old datatype in it, not the new datatype. IMO this problem is the callers responsibility, and our solution should be "delete your apicache".

hoo added a comment.Oct 17 2015, 2:17 PM

I'm not sure what's the best to do for pywikibot here, but the plan is to perform such a change only once, on a previously discussed (thus known in advance) list of properties. Because of that, I think having a one-off solution would be ok here.

If there's anything else you need from us in order to reflect the migration in pywikibot, please let us know and we'll try to figure out a way.

I would recommend lowering the caching. We really try to not change the datatype of a property but I can't promise this will be the only time.

XZise added a subscriber: XZise.Oct 17 2015, 3:36 PM

By the way the cache time doesn't matter that much as the cached item does not store the time of expiration but the time of creation. So we could reduce the time and anyone who uses a version after that won't have cached it for about 100 years.

What would be nice is if Wikibase provided an API call to list all datatypes, so we can detect when the list of datatypes changes. i.e. expose https://test.wikidata.org/wiki/Special:ListDatatypes via the API.

I'm not too sure how this relates to this bug, but submitted https://gerrit.wikimedia.org/r/247080 for this anyways.

daniel added a subscriber: daniel.Oct 25 2015, 6:55 PM

Does pywikibot differentiate between value type and data type? Does pywikibot even *use* the data type as such? The value type is not going to change, only the data type.

For example, for the "url" data type, the value type is "string". For the new ID data type, the value type will remain "string" when the data type is changed from "string" to "ID".

Also note that this kind of change is not going to happen often. It's a breaking change (to the knowledge base, not the software), and will be announced as such.

Pywikibot does use the data type, quite extensively. e.g.
https://github.com/wikimedia/pywikibot-core/blob/master/pywikibot/page.py#L3951

They are mapped to classes, and mapped to value-types.

With something like https://gerrit.wikimedia.org/r/247080 , we can easily create those mappings dynamically at startup, and provide fallback classes to unknown data types.

thiemowmde added a subscriber: thiemowmde.

I reworked https://gerrit.wikimedia.org/r/247080 completely. Please review.

I see the current patch puts this new data structure into meta=siteinfo. I am wondering why there, instead of meta=wikibase. Is meta=wikibase being deprecated?

IMO ideally this mapping should also be exposed on Special:ListDatatypes (e.g. mention that a 'URL' is stored using value type 'string') and then build an api query module to provide access all the information on ListDatatypes, including the help strings. That would mean this new api dataset is aligned with an existing special page dataset, and the two should naturally grow together consistently.

thiemowmde added a comment.EditedJan 27 2016, 8:50 AM

Have a look into the file ApiQuerySiteinfo.php and how it works. It's impossible to inject something like meta=wikibase siprop=wikibase. Why do you call this "deprecated"? This never existed, as far as I can see.

What we currently have in https://gerrit.wikimedia.org/r/247080 is good enough for now, in my opinion. If people prefer a separate API module, somebody should write a specification for that, or even better: submit actual code. Otherwise I suggest we merge what we have.

Have a look into the file ApiQuerySiteinfo.php and how it works. It's impossible to inject something like meta=wikibase. Why do you call this "deprecated"? This never existed, as far as I can see.

It isnt related to Siteinfo. It exists, but on the client (client/includes/api/ApiClientInfo.php)

https://www.wikidata.org/w/api.php?action=query&meta=wikibase

Sorry, I got confused. I fixed my comment above. So what you are proposing is a new ApiRepoInfo? Sure, "somebody" could do that, as said above.

Lydia_Pintscher moved this task from incoming to monitoring on the Wikidata board.Apr 3 2016, 12:53 PM