While working on uploading the first datasets, put down an outline of all the steps involved, from raw data to live bot run. Such an outline can then serve as a framework for working with WLM datasets on Phabricator.
Have a look at T139335 for how this has been done for Batch Uploads.
- Set up a milestone on Phabricator under Connected-Open-Heritage-Wikidata-migration, using the name of the db table -- e.g. se-arbetsl.
- Set up page under https://www.wikidata.org/wiki/Wikidata:WikiProject_WLM/Mapping_tables.
- Fill it out with sample data.
- Note: as of now, these have all been created and filled out by this script. It only needs to be rerun if a new table is added to the WLM database.
- Look at the unique identifier of each item. Does it correspond to an identifier in an external source?
- If yes, find or request an appropriate property.
- If no (i.e. the ID is only for internal WLM use), the dataset may not be suitable for import: without a real-world reference, we can't say much about the completeness or selection criteria of the data.
- Identify the heritage status. Do all the items represent the same type of heritage protection (e.g. national monument in <country>)?
- If not, how can the heritage status of each item be inferred?
- Create or edit any necessary items, e.g. cultural monument of the Czech Republic (Q385405). Such an item should at least have a country assigned and be a subclass of cultural property / national heritage site.
- Identify P31
- A default P31 for all the items -- something basic like building or ancient monument.
- Sometimes there's a separate column for this, such as type, whose values can be used instead of the default where possible (a rough sketch follows).
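- For illustration only, the default-plus-override logic could look like this (the column name and Q-ids are assumptions for this example, not taken from any particular dataset):
```python
# Minimal sketch: pick a P31 value per row, falling back to a dataset-wide
# default when the "type" column is empty or not yet mapped.
DEFAULT_P31 = "Q41176"  # building -- assumed default for this example

# Hypothetical mapping from raw "type" values to Wikidata items.
TYPE_TO_P31 = {
    "ruin": "Q109607",   # ruins
    "kyrka": "Q16970",   # church building
}

def get_p31(row):
    """Return the Q-id to use as P31 for one data row (a dict of column values)."""
    raw_type = (row.get("type") or "").strip().lower()
    return TYPE_TO_P31.get(raw_type, DEFAULT_P31)
```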
- Create necessary lookup tables.
- Some fields have a limited range of distinct values, for example https://www.wikidata.org/wiki/Wikidata:WikiProject_WLM/Mapping_tables/se-fornmin_(sv)/types.
- In SQL, you can check this with `SELECT DISTINCT(columnname) FROM tablename;`
- The script for this is https://github.com/Vesihiisi/COH-tools/blob/master/create_distinct_lookup_table.py
- Focus on mapping the most common values first (see the sketch below).
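- A minimal sketch of the idea (using Python's built-in sqlite3 for illustration; the actual script linked above works against the WLM database and its own table layout):
```python
import sqlite3

def distinct_values(db_path, table, column):
    """Return the distinct values of a column with their frequencies,
    most common first -- those are the ones worth mapping first."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        f"SELECT {column}, COUNT(*) FROM {table} GROUP BY {column} ORDER BY COUNT(*) DESC"
    ).fetchall()
    conn.close()
    return rows

def lookup_table_skeleton(values):
    """Build an empty value -> Q-id mapping, to be filled in on-wiki."""
    return {value: "" for value, _count in values}
```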
- Identify and download any necessary offline data.
- This is to avoid doing live queries while the program runs, which takes a lot of time.
- Usually things like placenames and administrative units -- data that does not change often (a sketch follows).
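- For example, the relevant administrative units could be fetched once from the Wikidata Query Service and kept in a local JSON file that the import script reads at runtime (the query, Q-id and file name below are placeholders):
```python
import json
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"
# Placeholder query: Swedish municipalities (Q127448) and their labels.
QUERY = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q127448 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "sv,en". }
}
"""

def download_municipalities(filename="municipalities.json"):
    """Download the data once and cache it locally for offline use."""
    response = requests.get(SPARQL_ENDPOINT, params={"query": QUERY, "format": "json"})
    response.raise_for_status()
    bindings = response.json()["results"]["bindings"]
    lookup = {
        b["itemLabel"]["value"]: b["item"]["value"].split("/")[-1]
        for b in bindings
    }
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(lookup, f, ensure_ascii=False, indent=2)
    return lookup
```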
- Identify areas that can benefit from community input.
- Problematic due to language.
- Problematic due to lack of factual knowledge.
- Labels and descriptions
- Can the name column be used as-is for the label?
- Descriptions can be generated from the default P31/heritage type and the country/administrative location.
- Should descriptions be added in extra languages, beyond the language of the dataset? (a sketch follows)
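- A minimal sketch of such description generation (the wording templates, column name and language codes are assumptions, not the importer's actual phrasing):
```python
# Hypothetical description builder: label of the default P31 + administrative
# location, with templates for more than one language.
DESCRIPTION_TEMPLATES = {
    "sv": "{type} i {municipality}, Sverige",
    "en": "{type} in {municipality}, Sweden",
}
TYPE_LABELS = {"sv": "byggnad", "en": "building"}  # label of the assumed default P31

def make_descriptions(row):
    """Build per-language descriptions for one data row (a dict of column values)."""
    return {
        lang: template.format(type=TYPE_LABELS[lang], municipality=row["municipality"])
        for lang, template in DESCRIPTION_TEMPLATES.items()
    }

# make_descriptions({"municipality": "Lund"})
# -> {"sv": "byggnad i Lund, Sverige", "en": "building in Lund, Sweden"}
```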
- Create a basic mapping file like https://github.com/Vesihiisi/COH-tools/blob/master/importer/mappings/se-arbetsl_(sv).json
- Contains data that applies to all the items.
- If possible, specify a unique property (for the ID number) that will be used, in addition to monument_article, to check whether an item might already exist (illustrated below).
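- One way to do that check is to query for items that already carry the dataset's ID under that property (the property ID in the docstring is a hypothetical placeholder):
```python
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

def find_existing_item(id_property, id_value):
    """Return the Q-id of an item that already has the given unique ID, or None.
    id_property is the property for the dataset's identifier (e.g. a hypothetical "P9999")."""
    query = (
        'SELECT ?item WHERE {{ ?item wdt:{prop} "{value}" . }} LIMIT 1'
        .format(prop=id_property, value=id_value)
    )
    response = requests.get(SPARQL_ENDPOINT, params={"query": query, "format": "json"})
    response.raise_for_status()
    bindings = response.json()["results"]["bindings"]
    return bindings[0]["item"]["value"].split("/")[-1] if bindings else None
```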
- Create statements for all relevant columns.
- All statements must have a source -- see T155241 (illustrated below).
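- For illustration, adding a sourced statement with Pywikibot could look roughly like this (the Q-ids and edit summaries are placeholders; the real importer code lives in the COH-tools repository linked above):
```python
import pywikibot

site = pywikibot.Site("wikidata", "wikidata")
repo = site.data_repository()

def add_sourced_p31(item_id, p31_value, source_item_id):
    """Add an instance-of claim and attach a 'stated in' (P248) reference to it."""
    item = pywikibot.ItemPage(repo, item_id)
    claim = pywikibot.Claim(repo, "P31")
    claim.setTarget(pywikibot.ItemPage(repo, p31_value))
    item.addClaim(claim, summary="Adding P31")
    source = pywikibot.Claim(repo, "P248")  # stated in
    source.setTarget(pywikibot.ItemPage(repo, source_item_id))
    claim.addSources([source], summary="Adding source")
```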
- Create page with preview of processed data.
- Request for permission
- Link to preview
- Describe how data is processed.
- Describe how already existing items are detected.
- Test upload of ~10 items.
- Upload of dataset.
- Publish report files (a possible format is sketched below).
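- For instance, a report could simply be a machine-readable log of what the bot did, written as it runs (the format and file name here are assumptions):
```python
import json

def write_report(created, problems, filename="upload_report.json"):
    """Write a simple report of the run: created items and rows that were skipped.
    'created' maps dataset IDs to new Q-ids; 'problems' lists skipped rows with reasons."""
    with open(filename, "w", encoding="utf-8") as f:
        json.dump({"created": created, "problems": problems}, f,
                  ensure_ascii=False, indent=2)
```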