
Harvest Wikidata into the Monuments database
Closed, ResolvedPublic

Description

Definitely counter-intuitive, but likely the easiest way to support both Wikidata and structured lists as monument sources.

The alternative is to upgrade all of our tools to work with Wikidata in addition to the monuments database, which would be a lot of work.

Chatting with @Lokal_Profil, it sounds like this would be fairly easy.

  • Define a config format mapping properties to database fields (see the sketch below)
  • Write the importer
  • Off we go :)
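
A hypothetical sketch of what such a config entry could look like (the shape and field names are purely illustrative, not a proposal for the final format):

# One Wikidata-backed dataset: which items to include, and which
# Wikidata properties fill which monuments database fields.
dataset_config = {
    'country': 'xx',                        # dataset country code
    'lang': 'en',
    'criteria': ['wdt:P1435 wd:Q916333'],   # e.g. a heritage designation (P1435) value
    'fields': {
        'id': 'P359',          # catalogue/registry identifier
        'admin': 'P131',       # located in the administrative territorial entity
        'image': 'P18',
        'commonscat': 'P373',
        'address': 'P969',
        'coordinate': 'P625',
    },
}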

Related Objects

Event Timeline

Change 299891 had a related patch set uploaded (by Jean-Frédéric):
Harvest Wikidata item for Canada in English ca_(en)

https://gerrit.wikimedia.org/r/299891

@JeanFred I sent the following to you, but I have a sneaky feeling it was to an e-mail you don't actually read =). Anyhow, here it is:

I thought I'd sketch out my thoughts (and our offsite talk) about a strategy for supporting Wikidata harvesting into the monuments database (in parallel with list harvesting). If you think the overall schema looks ok, then I'll set up a Milestone and some overarching/tracking tasks and try to document this somewhere.

Let me know if it sounds ok. Otherwise this will be my starting point for the hackathon.


So the base plan is to support harvesting while keeping the majority of today's structure intact. Long-term changes (such as making heritage into a custom API on Wikidata without its own database) are out of scope for now, but should be kept in mind when implementing Wikidata harvesting.

The flow I envisage is the following:

  1. A config file, similar to those for list harvesting, specifying the criteria needed for inclusion in a dataset as well as which values/properties to harvest.
  2. A mechanism translating the config into a sparql query.
  3. A mechanism for running the sparql query and handling the results (including any formatting needs).
  4. A mechanism for storing these in a table.
  5. Ensuring that the mechanism for pulling dataset tables into monuments_all supports the table mentioned in 4.

One thing affecting 2, 4 and 5: I would like to limit the data that we store in the dataset table to only those values which are used in monuments_all. That way we should be able to rely on default values a lot of the time, and 5 becomes merely a formality.
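
As a rough illustration of how steps 1-3 could hang together (the function names and config shape here are assumptions, not a design decision), the translation in step 2 might be little more than string-building the inclusion criteria and OPTIONAL clauses:

# Sketch only: build a SPARQL query from a per-dataset config.
def build_query(criteria, fields, lang, limit=1000):
    """Turn inclusion criteria and a field -> property mapping into SPARQL."""
    selects = ' '.join('?{}'.format(name) for name in fields)
    unions = ' UNION '.join('{{ ?item {} }}'.format(c) for c in criteria)
    optionals = '\n'.join(
        '  OPTIONAL {{ ?item wdt:{} ?{} }} .'.format(prop, name)
        for name, prop in fields.items())
    return (
        'SELECT DISTINCT ?item ?itemLabel {selects} WHERE {{\n'
        '  {unions} .\n'
        '{optionals}\n'
        '  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{lang}" }}\n'
        '}} LIMIT {limit}'
    ).format(selects=selects, unions=unions, optionals=optionals,
             lang=lang, limit=limit)

# Roughly reproduces the draft query further down in this task.
query = build_query(
    criteria=['wdt:P359 []', 'wdt:P1435 wd:Q916333'],
    fields={'id': 'P359', 'admin': 'P131', 'image': 'P18',
            'commonscat': 'P373', 'address': 'P969', 'coordinate': 'P625'},
    lang='nl')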

This plan lends itself quite well to a separation of work: 1+2 can be developed somewhat separately from 3, which in turn can be separated from 4+5.

  1. Step 1 is in itself largely useless (unlike its equivalent for the lists, which is the first time we have the list info in a structured form), but for the sake of not having to maintain two separate mechanisms I believe it can be quite nice.

Many of these steps will invariably require some refactoring of the existing code and/or modifications to allow for new assumptions. E.g. in the API, do we link to the Wikidata object for the municipality, or should we try to resolve it to a Wikipedia link using the config language? How about multilingual lists (i.e. separate lists per wiki), which we can now combine into one config?

I'm interested in working on this too. Added the hackathon project.

Discussion notes:

In the monuments DB:

  • Design generic SPARQL query, with mapping of “fields we want” <--> “WD properties”
  • Have country-specific config with the bits needed to customise the above SPARQL query
  • Write the harvesting script using the SPARQL/config (likely mainly a pywikibot wrapper)
  • Layer to translate Wikidata bits to the output expected by the API
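
To make the last bullet a bit more concrete: the translation layer could flatten each WDQS result binding into the columns the monuments database expects. A rough sketch (names are illustrative; the coordinate parsing assumes the usual "Point(lon lat)" WKT literal returned for P625):

def binding_to_row(binding):
    """Flatten one WDQS JSON result binding into monuments-db style values."""
    def value(name):
        return binding[name]['value'] if name in binding else ''

    row = {
        'wd_item': value('item').rsplit('/', 1)[-1],   # entity URI -> Q-id
        'id': value('id'),
        'admin': value('adminLabel') or value('admin'),
        'image': value('image').rsplit('/', 1)[-1],    # file URI -> file name (may need URL-decoding)
        'commonscat': value('commonscat'),
        'lat': None,
        'lon': None,
    }
    coord = value('coordinate')
    if coord.startswith('Point('):
        lon, lat = coord[len('Point('):-1].split()     # WKT order is longitude, latitude
        row['lat'], row['lon'] = float(lat), float(lon)
    return row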

On Wikipedias:

  • Make tool to add Wikidata identifiers to Wikipedia lists (probably reuse the commonscat script thingy)

On Wikidata:

  • Enhance the Constraint reports on Wikidata about monuments

Bonus points:

  • Design a gating test to avoid replacing data with incomplete data (because of error during harvesting)
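
One possible shape for such a gating test (threshold and names purely illustrative): only swap in the freshly harvested data if it is not drastically smaller than what is already stored.

def safe_to_replace(new_count, old_count, min_ratio=0.9):
    """Refuse to overwrite an existing dataset with a suspiciously small harvest."""
    if old_count == 0:
        return new_count > 0
    return new_count >= min_ratio * old_count
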
SELECT DISTINCT ?item ?itemLabel ?id ?admin ?adminLabel ?image ?commonscat ?address ?coordinate WHERE {
  # Make it properties and filter out end time
  { ?item wdt:P359 [] } UNION
  { ?item wdt:P1435 wd:Q916333 } UNION
  { ?item wdt:P1435 wd:Q13423591 } UNION
  { ?item wdt:P1435 wd:Q17698911 } .
  OPTIONAL { ?item wdt:P359 ?id } .
  OPTIONAL { ?item wdt:P131 ?admin } .
  OPTIONAL { ?item wdt:P18 ?image } .
  OPTIONAL { ?item wdt:P373 ?commonscat } .
  OPTIONAL { ?item wdt:P969 ?address } .
  OPTIONAL { ?item wdt:P625 ?coordinate } .
  #OPTIONAL { ?adm3 wdt:P131 ?adm2 } .
  #OPTIONAL { ?adm2 wdt:P131 ?adm1 } .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "nl" }
} LIMIT 1000
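
The harvesting script itself will probably be a pywikibot wrapper as noted above; just to illustrate the plumbing, a minimal sketch that runs such a query against the Wikidata Query Service directly (the helper name is made up):

import requests

WDQS_ENDPOINT = 'https://query.wikidata.org/sparql'

def run_query(query):
    """Run a SPARQL query against WDQS and return the JSON result bindings."""
    response = requests.get(
        WDQS_ENDPOINT,
        params={'query': query, 'format': 'json'},
        headers={'User-Agent': 'heritage-wikidata-harvest (example sketch)'})
    response.raise_for_status()
    return response.json()['results']['bindings']
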
DROP TABLE IF EXISTS `monuments_{}_({})`;
CREATE TABLE IF NOT EXISTS `monuments_{}_({})` (
  `id` int(11) NOT NULL DEFAULT 0,
  `admin` varchar(255) NOT NULL DEFAULT '',
  `commonscat` varchar(255) NOT NULL DEFAULT '',
  `lat` double DEFAULT NULL,
  `lon` double DEFAULT NULL,
  `image` varchar(255) NOT NULL DEFAULT '',
  `source` varchar(510) NOT NULL DEFAULT '',
  `changed` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `monument_article` varchar(255) NOT NULL DEFAULT '',
  `registrant_url` varchar(255) NOT NULL DEFAULT '',
  `wd_item` varchar(255) NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  KEY `latitude` (`lat`),
  KEY `longitude` (`lon`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
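
Filling such a table (step 4 in the plan above) could then be a plain bulk insert of the flattened rows; a sketch assuming pymysql and the column layout above (the connection details and table name are placeholders):

import pymysql

def store_rows(connection, table, rows):
    """Bulk-insert harvested rows into a per-dataset monuments table (sketch)."""
    sql = ('INSERT INTO `{table}` '
           '(`id`, `admin`, `commonscat`, `lat`, `lon`, `image`, '
           '`source`, `monument_article`, `registrant_url`, `wd_item`) '
           'VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)').format(table=table)
    with connection.cursor() as cursor:
        # rows is a list of tuples in the same column order as the INSERT
        cursor.executemany(sql, rows)
    connection.commit()

# Usage sketch (credentials are placeholders):
# conn = pymysql.connect(host='...', user='...', password='...', database='...')
# store_rows(conn, 'monuments_ca_(en)', rows)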

@Multichill @JeanFred:

There is now a separate branch (wikidata) on heritage against which commits can be made (you do it from this dashboard). To use it, do:

git fetch --all
git checkout -b wikidata origin/wikidata
git checkout -b <random-new-topic-name>
<commit code>
git review -R wikidata

(possibly gerrit/wikidata depending on how you have set it up)

(Also ping @Sebastian_Berlin-WMSE because we talked about the same thing in the context of Wikispeech (Tool for suggesting missing lexicon entries))

With the Wikidata branch merged into master, I'll delete it from Gerrit unless someone objects.

Change 532658 had a related patch set uploaded (by Jean-Frédéric; owner: Jean-Frédéric):
[labs/tools/heritage@master] Run Wikidata harvesting as part of normal harvest

https://gerrit.wikimedia.org/r/532658

Change 532658 merged by jenkins-bot:
[labs/tools/heritage@master] Run Wikidata harvesting as part of normal harvest

https://gerrit.wikimedia.org/r/532658

Mentioned in SAL (#wikimedia-cloud) [2019-09-08T22:20:54Z] <JeanFred> Deploy latest from Git master: d36f393 (T138668)

Change 535292 had a related patch set uploaded (by Jean-Frédéric; owner: Jean-Frédéric):
[labs/tools/heritage@master] Do not skip Wikidata datasets when making statistics

https://gerrit.wikimedia.org/r/535292

Change 535292 merged by jenkins-bot:
[labs/tools/heritage@master] Do not skip Wikidata datasets when making statistics

https://gerrit.wikimedia.org/r/535292

Pppery subscribed.

All patches were merged. Can this be closed as resolved?

No response, closing.