
Identify steps involved in migrating a dataset
Closed, Invalid · Public · 4 Estimated Story Points

Description

While working on uploading the first datasets, put down an outline of all the steps involved, from raw data to live bot run. Such an outline can then be used as a framework for working with WLM datasets on Phabricator.

Have a look at T139335 for how this has been done for Batch Uploads.

Related: T156889.


DATA EXPLORATION

  • Set up a milestone on Phabricator under Connected-Open-Heritage-Wikidata-migration, using the name of the db table, e.g. se-arbetsl.
  • Set up page under https://www.wikidata.org/wiki/Wikidata:WikiProject_WLM/Mapping_tables.
    • Fill it out with sample data.
    • Note: As of now, these are all created and filled out thanks to this script. It only needs to be rerun if a new table is added to the WLM db.
  • Look at the unique identifier of each item. Does it correspond to an identifier in an external source?
    • If yes, find or request an appropriate property.
    • If no (i.e. the ID is just for internal WLM use), this might mean the dataset is not suitable for import. Without a real-world reference, we can't tell much about the completeness or selection criteria of the data.
  • Identify heritage status. Do all the items represent the same type of heritage protection (e.g. national monument in <country>)?
    • If not, how can the heritage status of each item be inferred?
    • Create or edit any necessary items, e.g. cultural monument of the Czech Republic (Q385405). Each should at least have country and subclass of (cultural property / national heritage site) assigned.
  • Identify P31
    • A default P31 for all the items -- something basic like building or ancient monument.
    • Sometimes there's a separate column for this, like type, whose value can replace the default one where a match can be found.
  • Create necessary lookup tables.
  • Identify and download any necessary offline data.
    • This is to avoid doing live queries while running the program, which takes a lot of time.
    • Usually stuff like placenames, administrative units.
    • Data that does not change often.
  • Identify areas that can benefit from community input.
    • Problematic due to language.
    • Problematic due to lack of factual knowledge.
  • Labels and descriptions
    • Can the name column be used as-is for label?
    • Descriptions can be made using the default P31/heritage and country/administrative location
    • Descriptions in extra languages, apart from the language of the dataset?
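
The lookup-table and description steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code actually used for the migration: the function names, the `type` column name, the Czech type values and the choice of Q-ids in the table are all hypothetical and would need to be checked against the real dataset.

```python
# Minimal sketch: resolve P31 from a lookup table, with a default fallback,
# and build a simple description. All names here are hypothetical.

DEFAULT_P31 = "Q41176"  # building; assumed default for this illustration

# Hypothetical lookup table: raw value of the dataset's type column -> Q-id.
TYPE_LOOKUP = {
    "kostel": "Q16970",  # church building
    "hrad": "Q23413",    # castle
}


def resolve_p31(row, lookup=TYPE_LOOKUP, default=DEFAULT_P31):
    """Return the P31 value for a row, falling back to the default.

    The raw value is normalised (trimmed, lower-cased) before the lookup,
    so unmatched or missing values end up with the default P31.
    """
    raw = (row.get("type") or "").strip().lower()
    return lookup.get(raw, default)


def make_description(type_label, place, country):
    """Build a description from a type/heritage label and a location."""
    location = ", ".join(part for part in (place, country) if part)
    return f"{type_label} in {location}" if location else type_label
```

Keeping the table as plain offline data means no live queries are needed while the program runs, and rows that fall back to the default are natural candidates to put in front of the community for review.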

CODING

UPLOADING

Event Timeline

Restricted Application added a subscriber: Aklapper. Feb 14 2017, 12:38 PM
Alicia_Fagerving_WMSE changed the point value for this task from 1.5 to 4.

It would also be good to explicitly add any points where contact with the community is needed (i.e. people who know about the mapping or who might like to be involved/asked).

Jopparn closed this task as Resolved. Feb 20 2017, 1:24 PM
Jopparn reopened this task as Open. Feb 20 2017, 2:31 PM

Should be added to a separate document when we feel that it is ready.

Jopparn closed this task as Resolved. Mar 6 2017, 1:49 PM
Jopparn reopened this task as Open.
Jopparn moved this task from ♾️ Watching to ☑️ Done on the User-Alicia_Fagerving_WMSE board.
Jopparn moved this task from ☑️ Done to 📆 This week on the User-Alicia_Fagerving_WMSE board.

Update with info on when to contact other interested parties.
Divide according to tasks on Phabricator to make it easier to copy-paste.

Update with stuff to do after the upload is done: communication, whom to notify?

> Update with stuff to do after the upload is done: communication, whom to notify?

Would be good.

Also for our own use it would be good to have the "get statistics" task added to uploading. Also something about the "unmatch data" file which is created.

I would also like a task early on with "identify community liaisons" (or similar) and then some tasks for interaction with them (to review test uploads, review mappings, help with mappings, help with sources).

> Divide according to tasks on Phabricator to make it easier to copy-paste.

Definitely ensure that each entry would work as a task and link it to an example task.

I would consider keeping the copy-paste bit separate from the list or at least have it as collapsible sections so that one can read the general overview without it.

Would be good to also add the statistics task at the end to the checklist (with a link to the template)

Restricted Application added a subscriber: PokestarFan. Jul 28 2017, 12:21 PM

It would be good to have this broken down into a few Phabricator tasks which are always created for a new dataset. That way we can ensure that things such as statistics/publishing reports don't get forgotten.

At the same time, we don't want to create loads of tasks straight away. An in-between solution is to create a few larger tasks (e.g. "post-upload") that specify which sub-tasks are to be created and handled once it is time to deal with that task.

Some thoughts on Phabricator tasks:

Restricted Application added a subscriber: Urbanecm. Jan 29 2019, 11:15 AM
Aklapper removed Alicia_Fagerving_WMSE as the assignee of this task. Jun 19 2020, 4:28 PM

This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly more realistic overview of plans. Please feel free to assign this task to yourself again if you still realistically work or plan to work on this task - it would be welcome!

For tips on how to manage individual work in Phabricator (noisy notifications, lists of tasks, etc.), see https://phabricator.wikimedia.org/T228575#6237124 for available options.
(For the record, two emails were sent to assignee addresses before resetting assignees. See T228575 for more info and for potential feedback. Thanks!)

Jopparn closed this task as Invalid. Aug 17 2020, 8:08 AM

Parts were finalized when the project was active. Not likely to be fully finalized at this point as our workflow has changed.