
Ensure wikidata harvest is possible for very large datasets
Closed, Resolved · Public

Description

To ensure Wikidata harvesting is possible even for the largest of our datasets we should set up a harvest of a very large dataset (e.g. se-fornmin_(sv)) via Wikidata.

There is potential for time-out issues both on our connection to the SPARQL endpoint and in the SPARQL endpoint itself.

Event Timeline

Harvesting se-fornmin gives the following error:

ERROR: Unknown error occurred when processing country se-fornmin-wd in lang sv
(1205, 'Lock wait timeout exceeded; try restarting transaction')

Note though that SPARQL finished happily but that the SQL connection seems to fizzle.

Sticking SET innodb_lock_wait_timeout=100 (or some large number) in at the top should deal with the timeout. We might also need to ensure each country is inserted before starting the next.
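
For reference, a minimal sketch of what "sticking it in at the top" could look like, assuming a pymysql/MySQLdb-style connection (the connection parameters are placeholders, not the tool's actual config):

import pymysql  # assumption: any MySQLdb/pymysql-style driver works the same way

conn = pymysql.connect(host='localhost', user='heritage', password='',
                       database='heritage', charset='utf8mb4')
cursor = conn.cursor()
# Raise the per-session lock wait timeout before the long-running harvest starts.
cursor.execute('SET SESSION innodb_lock_wait_timeout = 100')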

OK, so sticking that in front doesn't seem to help; you still get the same error. And after bumping it up to 600 s, the time before the error appears has nothing to do with innodb_lock_wait_timeout, so that setting likely didn't take effect at all.


That is because I'm silly. Trying it again. An alternative might be to batch the SQL update calls, since we now do all monuments in a dataset in one transaction rather than one transaction per monument.
(edit: no we don't)

Change 371744 had a related patch set uploaded (by Lokal Profil; owner: Lokal Profil):
[labs/tools/heritage@wikidata] Investigate harvest of large wikidata dataset

https://gerrit.wikimedia.org/r/371744

I hit the same issue locally, which means it’s not due to @Lokal_Profil’s local setup.

(1205, 'Lock wait timeout exceeded; try restarting transaction')
process_monument_wikidata: Error executing query
REPLACE INTO `monuments_se-fornmin-wd_(sv)`(`monument_article`, `name`, `source`, `admin`, `image`, `lon`, `wd_item`, `lat`, `id`) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)
[u'Hundhamra', u'Hundhamra', u'http://www.wikidata.org/entity/Q10527173', u'Botkyrka kommun', u'Norsborg 2007 1.jpg', u'59.25183', u'Q10527173', u'17.80438', 10000603880001]

I think there is something wrong with the dataset: not a single statement goes through.

For comparison, harvesting nl-wd goes at the speed of light:

Sparql endpoint being slow, giving it a moment...
Processing SPARQL results...
process_monument_wikidata 0
process_monument_wikidata 1
process_monument_wikidata 2
process_monument_wikidata 3
process_monument_wikidata 4
process_monument_wikidata 5
process_monument_wikidata 6
process_monument_wikidata 7
process_monument_wikidata 8
...

Processing se-fornmin:

Processing SPARQL results...
process_monument_wikidata 0
process_monument_wikidata: Error executing query
REPLACE INTO `monuments_se-fornmin-wd_(sv)`(`monument_article`, `name`, `source`, `admin`, `image`, `lon`, `wd_item`, `lat`, `commonscat`, `id`) VALUES (R%C3%B6kstenen, Rökstenen, http://www.wikidata.org/entity/Q472975, Ödeshögs kommun, Rökstenen 1.JPG, 58.295, Q472975, 14.775555555, Rökstenen, 10048200010001)
0
(1205, 'Lock wait timeout exceeded; try restarting transaction')

But now I tried a second time and got:

...
process_monument_wikidata 42141
process_monument_wikidata: Error executing query
REPLACE INTO `monuments_se-fornmin-wd_(sv)`(`name`, `source`, `admin`, `lon`, `wd_item`, `lat`, `id`) VALUES (Virestad 205:2, http://www.wikidata.org/entity/Q29461231, Älmhults kommun, 56.532622, Q29461231, 14.328568, 10076702050002)
42141
(2013, 'Lost connection to MySQL server during query')
process_monument_wikidata 42142
process_monument_wikidata: Error executing query
REPLACE INTO `monuments_se-fornmin-wd_(sv)`(`name`, `source`, `admin`, `lon`, `wd_item`, `lat`, `id`) VALUES (Virestad 8:1, http://www.wikidata.org/entity/Q29461146, Älmhults kommun, 56.703272, Q29461146, 14.350956, 10076700080001)
42142
(2006, 'MySQL server has gone away')
process_monument_wikidata 42143

Could it be some silly encoding/escaping issue? (Looking at you, Älmhult.)

Even if that is not the cause, there is definitely something funky going on with the encodings in

[...] VALUES (R%C3%B6kstenen, Rökstenen, http://www.wikidata.org/entity/Q472975, Ödeshögs kommun, Rökstenen 1.JPG, 58.295, Q472975, 14.775555555, Rökstenen, 10048200010001)
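
Not necessarily the fix here, but for illustration: passing the values as parameters (rather than formatting them into the SQL string) leaves quoting and encoding to the driver. A sketch with an open DB-API cursor, with the columns and values copied verbatim from the failing query above:

# Parameterised version of the failing REPLACE; the driver escapes and
# encodes each value itself.
query = ('REPLACE INTO `monuments_se-fornmin-wd_(sv)` '
         '(`monument_article`, `name`, `source`, `admin`, `image`, '
         '`lon`, `wd_item`, `lat`, `commonscat`, `id`) '
         'VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)')
values = (u'R%C3%B6kstenen', u'Rökstenen',
          u'http://www.wikidata.org/entity/Q472975', u'Ödeshögs kommun',
          u'Rökstenen 1.JPG', u'58.295', u'Q472975', u'14.775555555',
          u'Rökstenen', 10048200010001)
cursor.execute(query, values)  # the values never touch the SQL text directly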

Change 375529 had a related patch set uploaded (by Lokal Profil; owner: Lokal Profil):
[labs/tools/heritage@master] Investigate harvest of large wikidata dataset

https://gerrit.wikimedia.org/r/375529

Change 371744 abandoned by Lokal Profil:
Investigate harvest of large wikidata dataset

Reason:
Abandoned in favour of https://gerrit.wikimedia.org/r/#/c/375529/ (on master branch)

https://gerrit.wikimedia.org/r/371744

Abandoned the patch on the Wikidata branch in favour of a new identical one against master.

@Legoktm Could I possibly pick your brain on this one during the Wikimania Hackathon?

From @Legoktm:

Rebuild the merge logic (fill_table_monuments_all) so that the select happens in Python.

Batch update (instead of replace) into the temp table and then rename on successful finish (sketched below).
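
A minimal sketch of the rename-on-success part, assuming the merged rows have already been written to monuments_all_tmp (e.g. with the batched inserts sketched further down) and that cursor is an open DB-API cursor; table names follow the existing monuments_all naming:

# Swap the freshly filled table in atomically, so readers never see a
# half-filled monuments_all.
cursor.execute('DROP TABLE IF EXISTS `monuments_all_old`')
cursor.execute('RENAME TABLE `monuments_all` TO `monuments_all_old`, '
               '`monuments_all_tmp` TO `monuments_all`')
cursor.execute('DROP TABLE `monuments_all_old`')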

Looked a bit more into this.
This is actually crashing in the creation of the monuments_se-fornmin-wd_(sv) table (so in update_database.py).

As an aside:
monuments_all_tmp is empty at the start of fill_table_monuments_all, so the REPLACE INTO acts as an INSERT INTO, which means changing it to UPDATE won't work. Since each dataset should have entries with a unique id, and each dataset should have a unique country/lang combination, we should be able to just use INSERT instead.

Rebuilt the logic to use cursor.executemany(), batching the inserts with a conn.commit() at the end of each batch. This is successful in that it processes all of the SPARQL data without timing out (yay!)
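
Roughly along these lines; a sketch rather than the actual patch, where rows is assumed to already hold the value tuples built from the SPARQL results, conn/cursor are the open connection and cursor, and query/BATCH_SIZE are illustrative (INSERT rather than REPLACE, per the aside above):

BATCH_SIZE = 1000  # illustrative; small enough to keep each transaction short

query = ('INSERT INTO `monuments_all_tmp` '
         '(`name`, `source`, `admin`, `lon`, `wd_item`, `lat`, `id`) '
         'VALUES (%s, %s, %s, %s, %s, %s, %s)')

for start in range(0, len(rows), BATCH_SIZE):
    batch = rows[start:start + BATCH_SIZE]
    cursor.executemany(query, batch)
    conn.commit()  # one commit per batch instead of one per monument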

Some further detective work revealed that an incorrect assumption on the default type for the id field (int(11)) was wreaking havoc on the harvest of the Swedish dataset.
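
For context, my reading of the int(11) remark (not something spelled out above): int(11) is still a signed 32-bit INT, the 11 being only a display width, so the se-fornmin ids cannot fit in it.

# "int(11)" is still a signed 32-bit INT; the 11 is only a display width.
assert 2 ** 31 - 1 == 2147483647
# The se-fornmin ids from the logs above do not fit, e.g.:
assert 10000603880001 > 2147483647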

Will clean up a few other Wikidata harvest things that have broken due to development on pywikibot, then follow up with a new patch.

Change 446977 had a related patch set uploaded (by Lokal Profil; owner: Lokal Profil):
[labs/tools/heritage@master] Batch entry of sparql results into database

https://gerrit.wikimedia.org/r/446977

Many thanks for the help, @Legoktm! This unblocks adding at least Australia, Namibia, Tanzania and Bosnia and Herzegovina to the monuments database (thus making them accessible through the same tooling).

Change 446977 merged by jenkins-bot:
[labs/tools/heritage@master] Batch entry of sparql results into database

https://gerrit.wikimedia.org/r/446977

Lokal_Profil moved this task from Backlog to Done - in July on the Wiki-Loves-Monuments (2018) board.

The se-fornmin-wd_(sv) patch can be merged later for completeness, if we want to, e.g. as part of T200112.