Proposed in the 2016 Community Wishlist Survey. Received 38 support votes and ranked #36 out of 265 proposals. View full proposal with discussion and votes here.
Importing data with a one-time procedure is good, but we should think about what happens afterwards: the data has to be kept in sync. Many people import from external information sources (for example, a museum web page listing the birth and death dates of Sri Lankan singers, or a foreign affairs website listing Luxembourg's embassies) using a self-made combination of scripts, spreadsheets, and copy-pasting, then feed the results into QuickStatements or a similar tool. Then they forget about it, the scripts stay on their own computers and eventually get deleted, and the next person who wants to update the information from the same website has to start from scratch and work out which items have to be created, which have to be updated, and how.
Who would benefit
People who import specialized datasets into Wikidata
Let's have a platform that facilitates reuse and keeps the data in sync. Rationalize the process, make it less error-prone, more efficient, and more collaborative, by having a Git-backed webapp where people can easily:
- Propose a new import script (including metadata about copyright) via a pull request. An import script scrapes information from some website and generates a QuickStatements file.
- Run an existing import script, potentially with a preview screen to check that data has been correctly extracted before injecting it into Wikidata.
- Track when the data was last synchronized and when each data element was last updated, on both the external side and the Wikidata side.
- Record exceptions (for instance, cases where the external database is wrong).
All of these modules (except the import scripts themselves) would be the same for all databases, which would help a lot in sharing effort, avoiding common pitfalls, making sync efficient, and preventing contributors from endlessly overwriting each other's edits.
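To make the idea more concrete, here is a minimal sketch in Python of what the shared part of a sync run might look like once an import script has scraped its records. Everything in it is an assumption made for illustration: the `Record` shape, the `plan_sync` function, and the lookup tables passed to it are hypothetical, and only a single property (P569, date of birth) is handled. The output lines use the tab-separated QuickStatements V1 command format. Records listed as exceptions are skipped, and values that disagree with Wikidata are reported for human review rather than overwritten.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One row scraped from the external source (hypothetical shape)."""
    external_id: str
    label: str
    birth_date: str  # ISO date, e.g. "1950-03-14"


def plan_sync(records, known_items, wikidata_dates, exceptions):
    """Decide what to create, what to update, and what to flag.

    known_items:    external_id -> QID for records already matched on Wikidata
    wikidata_dates: QID -> birth date currently stored on Wikidata
    exceptions:     external_ids where the external source is known to be wrong

    Returns (lines, conflicts): QuickStatements V1 lines to preview, and a
    list of (QID, wikidata_value, external_value) tuples for manual review.
    """
    lines = []
    conflicts = []
    for rec in records:
        if rec.external_id in exceptions:
            continue  # recorded exception: never import this record
        # Day-precision date in QuickStatements V1 syntax.
        date_value = f"+{rec.birth_date}T00:00:00Z/11"
        qid = known_items.get(rec.external_id)
        if qid is None:
            # No matching item yet: create one with an English label.
            lines.append("CREATE")
            lines.append(f'LAST\tLen\t"{rec.label}"')
            lines.append(f"LAST\tP569\t{date_value}")
        elif qid not in wikidata_dates:
            # Item exists but has no birth date: add the statement.
            lines.append(f"{qid}\tP569\t{date_value}")
        elif wikidata_dates[qid] != rec.birth_date:
            # Both sides have a value and they disagree: do not overwrite.
            conflicts.append((qid, wikidata_dates[qid], rec.birth_date))
    return lines, conflicts
```

The returned lines could be shown on the preview screen before anything is sent to Wikidata, and the conflicts list is a natural place to decide whether a new exception should be recorded.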
Time, expertise and skills required
- e.g. hackathon, GSoC, Outreachy, etc.