
Ensure wikidata harvest is possible for very large datasets
Closed, Resolved · Public

Description

To ensure Wikidata harvesting is possible even for the largest of our datasets we should set up a harvest of a very large dataset (e.g. se-fornmin_(sv)) via Wikidata.

There is potential for time-out issues both on our connection to the SPARQL endpoint and in the SPARQL endpoint itself.

Event Timeline

Harvesting se-fornmin gives the following error:

ERROR: Unknown error occurred when processing country se-fornmin-wd in lang sv
(1205, 'Lock wait timeout exceeded; try restarting transaction')

Note though that SPARQL finished happily but that the SQL connection seems to fizzle.

Sticking SET innodb_lock_wait_timeout=100 (or some large number) in at the top should deal with the timeout. We might also need to ensure each country is inserted before starting the next.
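
For reference, a minimal sketch of what "sticking it in at the top" could look like, assuming a pymysql/MySQLdb-style connection (the connection parameters are placeholders, not the tool's actual config):

import pymysql  # assumption: any MySQLdb/pymysql-style driver works the same way

conn = pymysql.connect(host='localhost', user='heritage', password='',
                       database='heritage', charset='utf8mb4')
cursor = conn.cursor()
# Raise the per-session lock wait timeout before the long-running harvest starts.
cursor.execute('SET SESSION innodb_lock_wait_timeout = 100')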

OK, so sticking that in front doesn't seem to help; you still get the same error. And after bumping it up to 600 s, the time before the error appears has nothing to do with innodb_lock_wait_timeout, so that setting likely didn't take effect at all.


That is because I'm silly. Trying it again. An alternative might be to batch the SQL update calls, since we now do all monuments in a dataset in one transaction rather than one transaction per monument.
(edit: no we don't)

Change 371744 had a related patch set uploaded (by Lokal Profil; owner: Lokal Profil):
[labs/tools/heritage@wikidata] Investigate harvest of large wikidata dataset

https://gerrit.wikimedia.org/r/371744

I hit the same issue locally, which means it’s not due to @Lokal_Profil’s local setup.

(1205, 'Lock wait timeout exceeded; try restarting transaction')
process_monument_wikidata: Error executing query
REPLACE INTO `monuments_se-fornmin-wd_(sv)`(`monument_article`, `name`, `source`, `admin`, `image`, `lon`, `wd_item`, `lat`, `id`) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)
[u'Hundhamra', u'Hundhamra', u'http://www.wikidata.org/entity/Q10527173', u'Botkyrka kommun', u'Norsborg 2007 1.jpg', u'59.25183', u'Q10527173', u'17.80438', 10000603880001]

I think there is something wrong with the dataset: not a single statement goes through.

For comparison, harvesting nl-wd goes at the speed of light:

Sparql endpoint being slow, giving it a moment...
Processing SPARQL results...
process_monument_wikidata 0
process_monument_wikidata 1
process_monument_wikidata 2
process_monument_wikidata 3
process_monument_wikidata 4
process_monument_wikidata 5
process_monument_wikidata 6
process_monument_wikidata 7
process_monument_wikidata 8
...

Processing se-fornmin:

Processing SPARQL results...
process_monument_wikidata 0
process_monument_wikidata: Error executing query
REPLACE INTO `monuments_se-fornmin-wd_(sv)`(`monument_article`, `name`, `source`, `admin`, `image`, `lon`, `wd_item`, `lat`, `commonscat`, `id`) VALUES (R%C3%B6kstenen, Rökstenen, http://www.wikidata.org/entity/Q472975, Ödeshögs kommun, Rökstenen 1.JPG, 58.295, Q472975, 14.775555555, Rökstenen, 10048200010001)
0
(1205, 'Lock wait timeout exceeded; try restarting transaction')

But now I tried a second time and got:

...
process_monument_wikidata 42141
process_monument_wikidata: Error executing query
REPLACE INTO `monuments_se-fornmin-wd_(sv)`(`name`, `source`, `admin`, `lon`, `wd_item`, `lat`, `id`) VALUES (Virestad 205:2, http://www.wikidata.org/entity/Q29461231, Älmhults kommun, 56.532622, Q29461231, 14.328568, 10076702050002)
42141
(2013, 'Lost connection to MySQL server during query')
process_monument_wikidata 42142
process_monument_wikidata: Error executing query
REPLACE INTO `monuments_se-fornmin-wd_(sv)`(`name`, `source`, `admin`, `lon`, `wd_item`, `lat`, `id`) VALUES (Virestad 8:1, http://www.wikidata.org/entity/Q29461146, Älmhults kommun, 56.703272, Q29461146, 14.350956, 10076700080001)
42142
(2006, 'MySQL server has gone away')
process_monument_wikidata 42143

Could it be some silly encoding/escaping issue? (Looking at you, Älmhult.)

Even if that is not the cause, there is definitely something funky going on with the encodings in

[...] VALUES (R%C3%B6kstenen, Rökstenen, http://www.wikidata.org/entity/Q472975, Ödeshögs kommun, Rökstenen 1.JPG, 58.295, Q472975, 14.775555555, Rökstenen, 10048200010001)
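
Not necessarily the fix here, but for illustration: passing the values as parameters (rather than formatting them into the SQL string) leaves quoting and encoding to the driver. A sketch with an open DB-API cursor, with the columns and values copied verbatim from the failing query above:

# Parameterised version of the failing REPLACE; the driver escapes and
# encodes each value itself.
query = ('REPLACE INTO `monuments_se-fornmin-wd_(sv)` '
         '(`monument_article`, `name`, `source`, `admin`, `image`, '
         '`lon`, `wd_item`, `lat`, `commonscat`, `id`) '
         'VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)')
values = (u'R%C3%B6kstenen', u'Rökstenen',
          u'http://www.wikidata.org/entity/Q472975', u'Ödeshögs kommun',
          u'Rökstenen 1.JPG', u'58.295', u'Q472975', u'14.775555555',
          u'Rökstenen', 10048200010001)
cursor.execute(query, values)  # the values never touch the SQL text directly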

Change 375529 had a related patch set uploaded (by Lokal Profil; owner: Lokal Profil):
[labs/tools/heritage@master] Investigate harvest of large wikidata dataset

https://gerrit.wikimedia.org/r/375529

Change 371744 abandoned by Lokal Profil:
Investigate harvest of large wikidata dataset

Reason:
Abandoned in favour of https://gerrit.wikimedia.org/r/#/c/375529/ (on master branch)

https://gerrit.wikimedia.org/r/371744

Abandoned the patch on the Wikidata branch in favour of a new identical one against master.

@Legoktm Could I possibly pick your brain on this one during the Wikimania Hackathon?

From @Legoktm:

Rebuild the merge logic (fill_table_monuments_all) so that the select happens in Python.

Batch update (instead of replace) into the temp table and then rename on successful finish (sketched below).
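
A minimal sketch of the rename-on-success part, assuming the merged rows have already been written to monuments_all_tmp (e.g. with the batched inserts sketched further down) and that cursor is an open DB-API cursor; table names follow the existing monuments_all naming:

# Swap the freshly filled table in atomically, so readers never see a
# half-filled monuments_all.
cursor.execute('DROP TABLE IF EXISTS `monuments_all_old`')
cursor.execute('RENAME TABLE `monuments_all` TO `monuments_all_old`, '
               '`monuments_all_tmp` TO `monuments_all`')
cursor.execute('DROP TABLE `monuments_all_old`')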

Looked a bit more into this.
This is actually crashing in the creation of the monuments_se-fornmin-wd_(sv) table (so in update_database.py).

As an aside:
monuments_all_tmp is empty at the start of fill_table_monuments_all, so the REPLACE INTO acts as an INSERT INTO, which means changing it to UPDATE won't work. Since each dataset should have entries with a unique id, and each dataset should have a unique country/lang combination, we should be able to just use INSERT instead.

Rebuilt the logic to use cursor.executemany(), batching the inserts with a conn.commit() at the end of each batch. This is successful in that it processes all of the SPARQL data without timing out (yay!)
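
Roughly along these lines; a sketch rather than the actual patch, where rows is assumed to already hold the value tuples built from the SPARQL results, conn/cursor are the open connection and cursor, and query/BATCH_SIZE are illustrative (INSERT rather than REPLACE, per the aside above):

BATCH_SIZE = 1000  # illustrative; small enough to keep each transaction short

query = ('INSERT INTO `monuments_all_tmp` '
         '(`name`, `source`, `admin`, `lon`, `wd_item`, `lat`, `id`) '
         'VALUES (%s, %s, %s, %s, %s, %s, %s)')

for start in range(0, len(rows), BATCH_SIZE):
    batch = rows[start:start + BATCH_SIZE]
    cursor.executemany(query, batch)
    conn.commit()  # one commit per batch instead of one per monument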

Some further detective work revealed that an incorrect assumption on the default type for the id field (int(11)) was wreaking havoc on the harvest of the Swedish dataset.
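
For context, my reading of the int(11) remark (not something spelled out above): int(11) is still a signed 32-bit INT, the 11 being only a display width, so the se-fornmin ids cannot fit in it.

# "int(11)" is still a signed 32-bit INT; the 11 is only a display width.
assert 2 ** 31 - 1 == 2147483647
# The se-fornmin ids from the logs above do not fit, e.g.:
assert 10000603880001 > 2147483647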

Will clean up a few other Wikidata harvest things that have broken due to development on pywikibot, then follow up with a new patch.

Change 446977 had a related patch set uploaded (by Lokal Profil; owner: Lokal Profil):
[labs/tools/heritage@master] Batch entry of sparql results into database

https://gerrit.wikimedia.org/r/446977

Many thanks for the help, @Legoktm! This unblocks adding at least Australia, Namibia, Tanzania and Bosnia and Herzegovina to the monuments database (thus making them accessible through the same tooling).

Change 446977 merged by jenkins-bot:
[labs/tools/heritage@master] Batch entry of sparql results into database

https://gerrit.wikimedia.org/r/446977

Lokal_Profil moved this task from Backlog to Done - in July on the Wiki-Loves-Monuments (2018) board.

The se-fornmin-wd_(sv) patch can be merged later for completeness, if we want to, e.g. as part of T200112.