[Task] write a maintenance script to migrate properties from string to new identifier datatype
Closed, ResolvedPublic

Description

Existing properties with datatype string that are identifiers should be migrated in a one-time operation to using the new identifier datatype. We need a maintenance script to do this migration. Only the datatype is changed. The value type stays the same.

Before doing the migration we need to provide the editors with a list of all properties that would be migrated to make sure it is correct.

WARNING: We decided not to allow changing a data type via a special page (technically this would be possible if the value type does not change). Changing a data type can break semantics, exports, external usages and such in possibly bad ways.
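The dry-run listing and the datatype-only change described above can be sketched roughly as follows. This is a minimal illustration, not the actual maintenance script: the property IDs, the `identifier` datatype name, the `VALUE_TYPE_OF` table and both function names are assumptions made up for this example.

```python
# Hypothetical mapping from datatype ID to the value type it stores.
VALUE_TYPE_OF = {
    "string": "string",
    "identifier": "string",  # assumed name for the new datatype
    "quantity": "quantity",
}


def list_candidates(properties, identifier_ids):
    """Dry run: return string-typed properties that editors flagged as identifiers."""
    return sorted(
        pid
        for pid, prop in properties.items()
        if prop["datatype"] == "string" and pid in identifier_ids
    )


def change_datatype(properties, pid, new_type):
    """Change only the datatype; refuse if the value type would change."""
    old_type = properties[pid]["datatype"]
    if VALUE_TYPE_OF[old_type] != VALUE_TYPE_OF[new_type]:
        raise ValueError(
            f"{pid}: value type differs between {old_type!r} and {new_type!r}"
        )
    properties[pid]["datatype"] = new_type


# Hypothetical property store; real property IDs and storage differ.
properties = {
    "P1": {"datatype": "string"},
    "P2": {"datatype": "string"},
    "P3": {"datatype": "quantity"},
}

# Step 1: produce the list that editors review before anything is changed.
candidates = list_candidates(properties, identifier_ids={"P1", "P3"})

# Step 2: migrate only the reviewed candidates.
for pid in candidates:
    change_datatype(properties, pid, "identifier")
```

Keeping the listing step separate from the change step means the reviewed list is exactly what gets migrated, and the value type check rejects any property whose stored values would no longer fit.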
Lydia_Pintscher updated the task description.
Lydia_Pintscher raised the priority of this task from to Normal.
Lydia_Pintscher added a subscriber: Lydia_Pintscher.
Restricted Application added a subscriber: Aklapper. Apr 10 2015, 1:21 PM
Jonas renamed this task from write a maintenance script to migrate properties from string to new identifier datatype to [Task] write a maintenance script to migrate properties from string to new identifier datatype. Aug 15 2015, 12:53 PM
Jonas set Security to None.
thiemowmde updated the task description. Oct 6 2015, 12:22 PM
thiemowmde added a subscriber: daniel.
hoo claimed this task. Oct 15 2015, 7:03 PM
hoo moved this task from Backlog to Doing on the Wikidata-Sprint-2015-10-13 board.

Change 246781 had a related patch set uploaded (by Hoo man):
Add a changePropertyDataType maintenance script

https://gerrit.wikimedia.org/r/246781

hoo moved this task from Doing to Review on the Wikidata-Sprint-2015-10-13 board. Oct 15 2015, 9:02 PM
jayvdb added a subscriber: jayvdb.

> We decided to not have the possibility to change a data type via a special page (technically this would be possible if the value type does not change). This breaks semantics, exports, external usages and such in possibly bad ways.

Changing it via the backend will also break lots of things :P For example, many parts of Pywikibot will break, see T115679 (the full impact is not yet known, but it is big).

How will this impact JSON serialization? Will a new revision be recorded for each item? Will wbgetentities output and/or dumps format for old revisions change too?

> Will a new revision be recorded for each item?

No.

> Will wbgetentities output and/or dumps format for old revisions change too?

Yes.

This will change the property. Statements using this property will reflect the change wherever the API and UI add information from the property. Statements in dumps and in the wbgetentities output for items contain information that is obtained from the properties. The JSON serialization for items in the db will not change, but the JSON output of wbgetentities will change, as it contains additional information for convenience.
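To illustrate the distinction between the stored serialization and the output serialization, here is a minimal sketch. The data layout is a simplified stand-in, and `serialize_for_output`, the IDs and the `identifier` datatype name are assumptions for this example, not the actual Wikibase code.

```python
import copy

# Hypothetical property store: the datatype lives on the property only.
properties = {"P1": {"datatype": "string"}}

# Simplified stored form of an item: the statement holds no datatype.
stored_item = {
    "id": "Q1",
    "claims": {
        "P1": [
            {
                "mainsnak": {
                    "property": "P1",
                    "datavalue": {"type": "string", "value": "abc-123"},
                }
            }
        ]
    },
}


def serialize_for_output(item, properties):
    """Copy the stored item and add each property's current datatype."""
    out = copy.deepcopy(item)
    for statements in out["claims"].values():
        for statement in statements:
            snak = statement["mainsnak"]
            # Derived at output time from the property, not stored on the item.
            snak["datatype"] = properties[snak["property"]]["datatype"]
    return out


before = serialize_for_output(stored_item, properties)

# The migration touches only the property ("identifier" is an assumed name).
properties["P1"]["datatype"] = "identifier"

after = serialize_for_output(stored_item, properties)
```

In this sketch `stored_item` is never modified and no new item revision is needed, yet every output produced after the property change reports the new datatype, including output for old item revisions.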

> The JSON serialization for items in the db will not change

I suppose it will change for new edits, won't it?
And what about XML dumps with full history?

>> The JSON serialization for items in the db will not change

> I suppose it will change for new edits, won't it?

Besides the change the edit caused, I currently assume that any other change would be a bug.

> And what about XML dumps with full history?

The XML dumps are supposed to contain the reserialized JSON, similar to wbgetentities. I have never looked at those with history, but I assume it is the same for historic revisions. So this change retroactively changes the history of each Entity's representation of information derived from other Entities, i.e. it will propagate back to the beginning of time, except for the actual edit to the Property. (This is similar to how rendering an old revision of a MediaWiki page still uses the current revision of the templates it refers to.)

XML dumps with histories are indeed an issue: the script that builds them will take old revisions from old dump files, to avoid re-serializing the data. We would need to tell it not to do this, and to re-serialize everything. This is not hard, but it causes the dump process to take a looooong time, which makes it more likely to fail. And then we have to start over. A bit annoying.
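The trade-off between reusing old dump files and re-serializing everything can be sketched conceptually like this. It is a toy model only: `build_history_dump`, the `serialize` callback and the revision texts are assumptions for illustration, not the real dump tooling.

```python
def build_history_dump(revision_ids, serialize, previous_dump=None):
    """Return {revision_id: text}, reusing text from previous_dump when given."""
    dump = {}
    for rev_id in revision_ids:
        if previous_dump is not None and rev_id in previous_dump:
            # Fast path: copy the old serialization verbatim.
            dump[rev_id] = previous_dump[rev_id]
        else:
            # Slow path: serialize again with the current code and data.
            dump[rev_id] = serialize(rev_id)
    return dump


# Hypothetical state after the migration ("identifier" is an assumed name).
current_datatype = "identifier"


def serialize(rev_id):
    """Stand-in for the real serializer; output depends on the current datatype."""
    return f"rev {rev_id}: datatype={current_datatype}"


# An older dump written before the migration.
previous_dump = {1: "rev 1: datatype=string", 2: "rev 2: datatype=string"}

fast = build_history_dump([1, 2, 3], serialize, previous_dump)
full = build_history_dump([1, 2, 3], serialize)
```

With reuse, revisions 1 and 2 keep the pre-migration text while revision 3 gets the new one, so a single dump mixes both forms. Without reuse every revision is consistent, at the cost of serializing all of them again.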

I can only recommend against using data from XML dumps. We do not give *any* guarantees to the format of the content blobs you find in there. They may change without notice, and contain serializations in various forms.

> I can only recommend against using data from XML dumps. We do not give *any* guarantees to the format of the content blobs you find in there. They may change without notice, and contain serializations in various forms.

Where can I find reliable dumps of Wikidata's full history?

@Ricordisamoa There is currently no good way to get the full edit history using the proper JSON serialization. We never implemented this, since it seemed that people are either interested in the metadata of the edit history (XML), or the current content (JSON). What's the use case of having both? If you just need old snapshots, not the full history, you can use old JSON dumps.

> @Ricordisamoa There is currently no good way to get the full edit history using the proper JSON serialization. We never implemented this, since it seemed that people are either interested in the metadata of the edit history (XML), or the current content (JSON). What's the use case of having both?

Not putting excessive load on servers and at the same time not having to deal with different serializations? :)

To get there one would load the dump and write it out again with the code that is run on the WMF cluster, without telling it to reuse parts from the previous dump.

Change 246781 merged by jenkins-bot:
Add a changePropertyDataType maintenance script

https://gerrit.wikimedia.org/r/246781

aude closed this task as Resolved. Nov 3 2015, 11:05 AM
aude removed a project: Patch-For-Review.