
Keep Wikidata in sync with external databases
Open, Low, Public, Feature

Description

Proposed in Community-Wishlist-Survey-2016. Received 38 support votes, and ranked #36 out of 265 proposals. View full proposal with discussion and votes here.

Problem

Importing data with a one-time procedure is good, but we should think about what happens afterwards: the data has to be kept in sync. Many people import from external information sources (for example, a museum web page listing the births and deaths of Sri Lankan singers, or a foreign affairs website listing Luxembourg's embassies) using a self-made combination of scripts, spreadsheets, and copy-pasting, then feed the results into QuickStatements or a similar tool. Then they forget about it, the scripts stay on their own computers and eventually get deleted, and the next person who wants to update the information from the same website has to start from scratch and figure out which items have to be created, which have to be updated, and how.

Who would benefit

People who import specialized datasets into Wikidata

Proposed solution

Let's have a platform that facilitates reuse and keeps the data in sync. A Git-backed webapp would rationalize the process and make it less error-prone, more efficient, and more collaborative by letting people easily:

  • Propose a new import script (including metadata about copyright) via a pull request. An import script scrapes information from some website and generates a QuickStatements file.
  • Run an existing import script, potentially with a preview screen to check that the data has been correctly extracted before injecting it into Wikidata.
  • Keep track of when the data was last synchronized and when each data element was last updated, on both the external side and the Wikidata side.
  • Keep track of exceptions (cases where the external database is wrong, for instance).
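
To make the "import script" idea concrete, here is a minimal sketch of the last step of such a script: turning already-extracted records into QuickStatements V1 commands (tab-separated). The `Record` shape, the `to_quickstatements` name, and the example URL are assumptions made for illustration only; the scraping step is left out, and a real script would need one property mapping per dataset.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Record:
    """One row extracted from the external source (hypothetical shape)."""
    qid: Optional[str]  # matching Wikidata item, or None if it must be created
    label: str          # English label, used only when creating a new item
    birth_date: str     # ISO date, e.g. "1950-03-21"
    source_url: str     # page the value was taken from


def to_quickstatements(records: Iterable[Record]) -> str:
    """Render records as QuickStatements V1 commands, one command per line."""
    lines = []
    for r in records:
        # Time values carry a precision suffix; /11 means day precision.
        date_value = f"+{r.birth_date}T00:00:00Z/11"
        # S854 attaches a "reference URL" (P854) source to the statement.
        reference = f'S854\t"{r.source_url}"'
        if r.qid is None:
            # CREATE makes a new item; LAST then refers to that item.
            lines.append("CREATE")
            lines.append(f'LAST\tLen\t"{r.label}"')
            subject = "LAST"
        else:
            subject = r.qid
        # P569 is "date of birth".
        lines.append(f"{subject}\tP569\t{date_value}\t{reference}")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [
        Record("Q4115189", "Sandbox item", "1950-03-21", "https://example.org/singers"),
        Record(None, "New Singer", "1961-07-02", "https://example.org/singers"),
    ]
    print(to_quickstatements(sample))
```

The preview screen mentioned above could simply display this generated text (or a table built from the same records) before anything is submitted.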

All of these modules (except the import scripts themselves) would be shared across all databases, which would help a lot in pooling effort, avoiding pitfalls, making sync efficient, and preventing contributors from endlessly overwriting each other.
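
One possible way to use the sync metadata is a three-way comparison between the external value, the current Wikidata value, and the value recorded at the last sync. The sketch below is only an illustration of that idea, not a description of any existing tool; the `Action` names and the `decide` function are hypothetical.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    NOTHING = "nothing"                  # both sides already agree
    UPDATE_WIKIDATA = "update_wikidata"  # only the external side changed
    REVIEW = "review"                    # Wikidata was edited; ask a human
    SKIP_EXCEPTION = "skip_exception"    # known case where the source is wrong


def decide(
    external: Optional[str],
    wikidata: Optional[str],
    last_synced: Optional[str],
    is_exception: bool = False,
) -> Action:
    """Choose what to do for one data element (None means "no value")."""
    if is_exception:
        # A recorded exception always wins, so the bad value is never re-imported.
        return Action.SKIP_EXCEPTION
    if external == wikidata:
        return Action.NOTHING
    if wikidata == last_synced:
        # Wikidata is untouched since the last sync, so the difference
        # comes from the external side and can be applied.
        return Action.UPDATE_WIKIDATA
    # Someone edited Wikidata since the last sync: do not overwrite blindly.
    return Action.REVIEW
```

Routing every case where Wikidata changed since the last sync to review is a deliberately conservative choice: it is what keeps an automated re-run from overwriting a contributor's manual correction.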

Technical details

Time, expertise and skills required

  • e.g. 2-3 weeks, advanced contributor, javascript, css, etc

Suitable for

  • e.g. Hackathon, GSOC, Outreachy, etc.

Proposer

Syced

Related links

Event Timeline

This task was proposed in the Community-Wishlist-Survey-2016 and in its current state needs an owner. Wikimedia is participating in Google Summer of Code 2017 and Outreachy Round 14. To the subscribers: would this task or a portion of it be a good fit for either of these programs? If so, would you be willing to help mentor this project? Remember, each outreach project requires a minimum of one primary mentor and one co-mentor.
Aklapper changed the subtype of this task from "Task" to "Feature Request".

I think the groundwork for this has been done with the Mismatch Finder. Any additional work should probably be put into expanding it instead of building something new.