
Harvest Wikidata into the Monuments database
Closed, ResolvedPublic

Description

Definitely counter-intuitive, but likely the easiest way to support both Wikidata and structured lists as monument sources.

The alternative is to upgrade all of our tools to work with Wikidata in addition to the monuments database, which would be a lot of work.

Chatting with @Lokal_Profil, it sounds like this would be fairly easy.

  • Define a config format mapping properties to database fields (see the sketch below)
  • Write the importer
  • Off we go :)
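
A hypothetical sketch of what such a config entry could look like (the shape and field names are purely illustrative, not a proposal for the final format):

# One Wikidata-backed dataset: which items to include, and which
# Wikidata properties fill which monuments database fields.
dataset_config = {
    'country': 'xx',                        # dataset country code
    'lang': 'en',
    'criteria': ['wdt:P1435 wd:Q916333'],   # e.g. a heritage designation (P1435) value
    'fields': {
        'id': 'P359',          # catalogue/registry identifier
        'admin': 'P131',       # located in the administrative territorial entity
        'image': 'P18',
        'commonscat': 'P373',
        'address': 'P969',
        'coordinate': 'P625',
    },
}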

Related Objects

Event Timeline

Change 299891 had a related patch set uploaded (by Jean-Frédéric):
Harvest Wikidata item for Canada in English ca_(en)

https://gerrit.wikimedia.org/r/299891

@JeanFred I sent the following to you, but I have a sneaky feeling it was to an e-mail you don't actually read =). Anyhow, here it is:

I thought I'd sketch out my thoughts (and our offsite talk) about a strategy for supporting Wikidata harvesting into the monuments database (in parallel with list harvesting). If you think the overall schema looks ok, then I'll set up a Milestone and some overarching/tracking tasks and try to document this somewhere.

Let me know if it sounds ok. Otherwise this will be my starting point for the hackathon.


So the base plan is to support harvesting while keeping the majority of today's structure intact. Long-term changes (such as making heritage into a custom API on Wikidata without its own database) are out of scope for now, but should be kept in mind when implementing Wikidata harvesting.

The flow I envisage is the following:

  1. A config file, similar to those for list harvesting, specifying the criteria needed for inclusion in a dataset as well as which values/properties to harvest.
  2. A mechanism translating the config into a sparql query.
  3. A mechanism for running the sparql query and handling the results (including any formatting needs).
  4. A mechanism for storing these in a table.
  5. Ensuring that the mechanism for pulling dataset tables into monuments_all supports the table mentioned in 4.

One thing affecting 2, 4 and 5: I would like to limit the data that we store in the dataset table to only those values which are used in monuments_all. That way we should be able to rely on default values a lot of the time, and 5 becomes merely a formality.
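
As a rough illustration of how steps 1-3 could hang together (the function names and config shape here are assumptions, not a design decision), the translation in step 2 might be little more than string-building the inclusion criteria and OPTIONAL clauses:

# Sketch only: build a SPARQL query from a per-dataset config.
def build_query(criteria, fields, lang, limit=1000):
    """Turn inclusion criteria and a field -> property mapping into SPARQL."""
    selects = ' '.join('?{}'.format(name) for name in fields)
    unions = ' UNION '.join('{{ ?item {} }}'.format(c) for c in criteria)
    optionals = '\n'.join(
        '  OPTIONAL {{ ?item wdt:{} ?{} }} .'.format(prop, name)
        for name, prop in fields.items())
    return (
        'SELECT DISTINCT ?item ?itemLabel {selects} WHERE {{\n'
        '  {unions} .\n'
        '{optionals}\n'
        '  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{lang}" }}\n'
        '}} LIMIT {limit}'
    ).format(selects=selects, unions=unions, optionals=optionals,
             lang=lang, limit=limit)

# Roughly reproduces the draft query further down in this task.
query = build_query(
    criteria=['wdt:P359 []', 'wdt:P1435 wd:Q916333'],
    fields={'id': 'P359', 'admin': 'P131', 'image': 'P18',
            'commonscat': 'P373', 'address': 'P969', 'coordinate': 'P625'},
    lang='nl')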

This plan lends itself quite well to a separation of work: 1+2 can be developed somewhat separately from 3, which in turn can be separated from 4+5.

  1. Step 1 is in itself largely useless (unlike its equivalent for the lists, which is the first time we have the list info in a structured form), but for the sake of not having to maintain two separate mechanisms I believe it can be quite nice.

Many of these steps will invariably require some refactoring of the existing code and/or modifications to allow for new assumptions. E.g. in the API, do we link to the Wikidata object for the municipality, or should we try to resolve it to a Wikipedia link using the config language? How about multilingual lists (i.e. separate lists per wiki), which we can now combine into one config?

I'm interested in working on this too. Added the hackathon project.

Discussion notes:

In the monuments DB:

  • Design generic SPARQL query, with mapping of “fields we want” <--> “WD properties”
  • Have country-specific config with the bits needed to customise the above SPARQL query
  • Write the harvesting script using the SPARQL/config (likely mainly a pywikibot wrapper)
  • Layer to translate Wikidata bits to the output expected by the API
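
To make the last bullet a bit more concrete: the translation layer could flatten each WDQS result binding into the columns the monuments database expects. A rough sketch (names are illustrative; the coordinate parsing assumes the usual "Point(lon lat)" WKT literal returned for P625):

def binding_to_row(binding):
    """Flatten one WDQS JSON result binding into monuments-db style values."""
    def value(name):
        return binding[name]['value'] if name in binding else ''

    row = {
        'wd_item': value('item').rsplit('/', 1)[-1],   # entity URI -> Q-id
        'id': value('id'),
        'admin': value('adminLabel') or value('admin'),
        'image': value('image').rsplit('/', 1)[-1],    # file URI -> file name (may need URL-decoding)
        'commonscat': value('commonscat'),
        'lat': None,
        'lon': None,
    }
    coord = value('coordinate')
    if coord.startswith('Point('):
        lon, lat = coord[len('Point('):-1].split()     # WKT order is longitude, latitude
        row['lat'], row['lon'] = float(lat), float(lon)
    return row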

On Wikipedias:

  • Make tool to add Wikidata identifiers to Wikipedia lists (probably reuse the commonscat script thingy)

On Wikidata:

  • Enhance the Constraint reports on Wikidata about monuments

Bonus points:

  • Design a gating test to avoid replacing data with incomplete data (because of error during harvesting)
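
One possible shape for such a gating test (threshold and names purely illustrative): only swap in the freshly harvested data if it is not drastically smaller than what is already stored.

def safe_to_replace(new_count, old_count, min_ratio=0.9):
    """Refuse to overwrite an existing dataset with a suspiciously small harvest."""
    if old_count == 0:
        return new_count > 0
    return new_count >= min_ratio * old_count
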
SELECT DISTINCT ?item ?itemLabel ?id ?admin ?adminLabel ?image ?commonscat ?address ?coordinate WHERE {
  # Make it properties and filter out end time
  { ?item wdt:P359 [] } UNION
  { ?item wdt:P1435 wd:Q916333 } UNION
  { ?item wdt:P1435 wd:Q13423591 } UNION
  { ?item wdt:P1435 wd:Q17698911 } .
  OPTIONAL { ?item wdt:P359 ?id } .
  OPTIONAL { ?item wdt:P131 ?admin } .
  OPTIONAL { ?item wdt:P18 ?image } .
  OPTIONAL { ?item wdt:P373 ?commonscat } .
  OPTIONAL { ?item wdt:P969 ?address } .
  OPTIONAL { ?item wdt:P625 ?coordinate } .
  #OPTIONAL { ?adm3 wdt:P131 ?adm2 } .
  #OPTIONAL { ?adm2 wdt:P131 ?adm1 } .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "nl" }
} LIMIT 1000
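
The harvesting script itself will probably be a pywikibot wrapper as noted above; just to illustrate the plumbing, a minimal sketch that runs such a query against the Wikidata Query Service directly (the helper name is made up):

import requests

WDQS_ENDPOINT = 'https://query.wikidata.org/sparql'

def run_query(query):
    """Run a SPARQL query against WDQS and return the JSON result bindings."""
    response = requests.get(
        WDQS_ENDPOINT,
        params={'query': query, 'format': 'json'},
        headers={'User-Agent': 'heritage-wikidata-harvest (example sketch)'})
    response.raise_for_status()
    return response.json()['results']['bindings']
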
DROP TABLE IF EXISTS `monuments_{}_({})`;
CREATE TABLE IF NOT EXISTS `monuments_{}_({})` (
  `id` int(11) NOT NULL DEFAULT 0,
  `admin` varchar(255) NOT NULL DEFAULT '',
  `commonscat` varchar(255) NOT NULL DEFAULT '',
  `lat` double DEFAULT NULL,
  `lon` double DEFAULT NULL,
  `image` varchar(255) NOT NULL DEFAULT '',
  `source` varchar(510) NOT NULL DEFAULT '',
  `changed` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `monument_article` varchar(255) NOT NULL DEFAULT '',
  `registrant_url` varchar(255) NOT NULL DEFAULT '',
  `wd_item` varchar(255) NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  KEY `latitude` (`lat`),
  KEY `longitude` (`lon`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
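
Filling such a table (step 4 in the plan above) could then be a plain bulk insert of the flattened rows; a sketch assuming pymysql and the column layout above (the connection details and table name are placeholders):

import pymysql

def store_rows(connection, table, rows):
    """Bulk-insert harvested rows into a per-dataset monuments table (sketch)."""
    sql = ('INSERT INTO `{table}` '
           '(`id`, `admin`, `commonscat`, `lat`, `lon`, `image`, '
           '`source`, `monument_article`, `registrant_url`, `wd_item`) '
           'VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)').format(table=table)
    with connection.cursor() as cursor:
        # rows is a list of tuples in the same column order as the INSERT
        cursor.executemany(sql, rows)
    connection.commit()

# Usage sketch (credentials are placeholders):
# conn = pymysql.connect(host='...', user='...', password='...', database='...')
# store_rows(conn, 'monuments_ca_(en)', rows)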

@Multichill @JeanFred:

There is now a separate branch (wikidata) on heritage against which commits can be made (you do it from this dashboard). To use it, do:

git fetch --all
git checkout -b wikidata origin/wikidata
git checkout -b <random-new-topic-name>
<commit code>
git review -R wikidata

(possibly gerrit/wikidata depending on how you have set it up)

(Also ping @Sebastian_Berlin-WMSE because we talked about the same thing in the context of Wikispeech (Tool for suggesting missing lexicon entries))

With the Wikidata branch merged into master, I'll delete it from Gerrit unless someone objects.

Change 532658 had a related patch set uploaded (by Jean-Frédéric; owner: Jean-Frédéric):
[labs/tools/heritage@master] Run Wikidata harvesting as part of normal harvest

https://gerrit.wikimedia.org/r/532658

Change 532658 merged by jenkins-bot:
[labs/tools/heritage@master] Run Wikidata harvesting as part of normal harvest

https://gerrit.wikimedia.org/r/532658

Mentioned in SAL (#wikimedia-cloud) [2019-09-08T22:20:54Z] <JeanFred> Deploy latest from Git master: d36f393 (T138668)

Change 535292 had a related patch set uploaded (by Jean-Frédéric; owner: Jean-Frédéric):
[labs/tools/heritage@master] Do not skip Wikidata datasets when making statistics

https://gerrit.wikimedia.org/r/535292

Change 535292 merged by jenkins-bot:
[labs/tools/heritage@master] Do not skip Wikidata datasets when making statistics

https://gerrit.wikimedia.org/r/535292

Pppery subscribed.

All patches were merged. Can this be closed as resolved?

No response, closing.