Change Details

While working on uploading the first datasets, put down an outline of all the steps involved, from raw data to live bot run. Such an outline can then be used as a framework for working with WLM datasets on Phabricator. Have a look at T139335 for how this has been done for Batch Uploads. Related: T156889.

-------

**DATA EXPLORATION**

* Set up a milestone on Phabricator under #Connected-Open-Heritage-Wikidata-migration, using the name of the db table, e.g. `se-arbetsl`.
* Set up a page under https://www.wikidata.org/wiki/Wikidata:WikiProject_WLM/Mapping_tables.
** Fill it out with sample data.
** Note: As of now, these are all created and filled out by [[ https://github.com/Vesihiisi/COH-tools/blob/master/create_mapping_tables.py | this script ]]. It only needs to be rerun if a new table is added to the WLM db.
* Look at the unique **identifier** of each item. Does it correspond to an identifier in an external source?
** If yes, find or request an appropriate property.
** If no (i.e. the ID is just for internal WLM use), this might mean the dataset is not suitable for import. Without a real-world reference, we can't tell much about the completeness or selection criteria of the data.
* Identify **heritage status**. Do all the items represent the same type of heritage protection (e.g. //national monument in <country>//)?
** If not, how can the heritage status of each item be inferred?
** Create or edit any necessary items, e.g. [[ https://www.wikidata.org/wiki/Q385405 | cultural monument of the Czech Republic (Q385405) ]]. Such an item should at least have a country assigned and be a subclass of cultural property / national heritage site.
* Identify **`P31`**.
** Choose a default `P31` for all the items -- something basic like //building// or //ancient monument//.
** Sometimes there's a separate column for this, like //type//, whose value can replace the default where possible.
* Create necessary lookup tables.
** Some fields have a limited range of distinct values, for example https://www.wikidata.org/wiki/Wikidata:WikiProject_WLM/Mapping_tables/se-fornmin_(sv)/types.
** In SQL, you can check this using `select distinct(columnname) from tablename;`
** The script for this is https://github.com/Vesihiisi/COH-tools/blob/master/create_distinct_lookup_table.py
** Focus on mapping the most common values first.
* Identify and download any necessary offline data.
** This is to avoid doing live queries while running the program, which takes a lot of time.
** Usually things like placenames and administrative units.
** Data that does not change often.
* Identify areas that can benefit from community input.
** Fields that are problematic due to language.
** Fields that are problematic due to lack of factual knowledge.
* Labels and descriptions
** Can the `name` column be used as-is for the label?
** Descriptions can be built from the default `P31`/heritage status and the country/administrative location.
** Are descriptions needed in extra languages, apart from the language of the dataset?

**CODING**

* Create a basic mapping file like https://github.com/Vesihiisi/COH-tools/blob/master/importer/mappings/se-arbetsl_(sv).json
** It contains data that apply to all the items.
** If possible, use a unique property (for the ID number) that will be used in addition to `monument_article` to see whether an item might already exist.
* Create statements for all relevant columns.
* All statements must have a source -- see T155241.

**UPLOADING**

* Create a page with a preview of the processed data.
** Example: https://www.wikidata.org/wiki/Wikidata:WikiProject_WLM/Mapping_tables/se-ship_(sv)/preview
* Request for permission
** Link to the preview.
** Describe how the data is processed.
** Describe how already existing items are detected.
* Test upload of ~10 items.
* Upload of the dataset.
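The lookup-table step above suggests mapping the most common values first. A minimal sketch of how those values could be ranked is shown here. It is not the linked `create_distinct_lookup_table.py` script; it uses an in-memory SQLite database as a stand-in for the WLM db, and the table name `monuments`, the column `type`, and the sample rows are all made up for illustration.

```python
import sqlite3


def distinct_value_counts(conn, table, column):
    """Return (value, count) pairs for one column, most common first."""
    # Table and column names cannot be bound as SQL parameters, so they are
    # interpolated as quoted identifiers. Only pass trusted names here.
    query = (
        'SELECT "{col}", COUNT(*) AS n FROM "{tab}" '
        'GROUP BY "{col}" ORDER BY n DESC, "{col}"'
    ).format(col=column, tab=table)
    return conn.execute(query).fetchall()


# Hypothetical sample data standing in for a WLM table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monuments (id INTEGER, type TEXT)")
conn.executemany(
    "INSERT INTO monuments VALUES (?, ?)",
    [(1, "church"), (2, "mill"), (3, "church"),
     (4, "bridge"), (5, "church"), (6, "mill")],
)

counts = distinct_value_counts(conn, "monuments", "type")
for value, n in counts:
    print(value, n)
```

Sorting by frequency puts the values that cover the most rows at the top of the lookup table, so mapping effort goes there first.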

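The outline asks for a description of how already existing items are detected, using a unique ID property in addition to `monument_article`. One way this could look is sketched below, assuming the known items have been downloaded beforehand into two offline dictionaries. The function name, the dictionary names, the row layout, the match order (ID first, then article), and the placeholder item IDs are all hypothetical and are not taken from the COH-tools code.

```python
def find_existing_item(row, items_by_id, items_by_article):
    """Return the item ID of an existing item matching this row, or None."""
    # First try the external identifier (assumed to be the stronger match).
    ident = row.get("id")
    if ident and ident in items_by_id:
        return items_by_id[ident]
    # Fall back to the linked article, if the row has one.
    article = row.get("monument_article")
    if article and article in items_by_article:
        return items_by_article[article]
    return None


# Hypothetical offline indexes, built once before the run instead of
# querying live for every row.
items_by_id = {"A-001": "Q_PLACEHOLDER_1"}
items_by_article = {"Example Mill": "Q_PLACEHOLDER_2"}

rows = [
    {"id": "A-001", "monument_article": ""},
    {"id": "A-002", "monument_article": "Example Mill"},
    {"id": "A-003", "monument_article": ""},
]

matches = [find_existing_item(r, items_by_id, items_by_article) for r in rows]
print(matches)
```

Rows that return `None` would be candidates for new items; the rest would be updates to existing ones.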