
Investigate gb-sct disappearing
Closed, Resolved, Public

Description

From the 2018-07-24 logs

ERROR: Unknown error occurred when processing country gb-sct in lang en
(2006, 'MySQL server has gone away')

We should check if it is still missing after the next harvest. If so, we should figure out why. If not, it might be worth considering a long-term fix for how to handle such hiccups.

Event Timeline

Not sure if this is relevant, but the tool seems to be picking up some monuments that are neither Wikidata P709 nor P718 (the only two that currently should be used).

All the data for the UK campaigns are on Wikidata. We have not used the Monuments database for some years.

Thanks for the response. With "the tool" are you referring to a UK-specific WLM tool, Monumental or some other tool?

Is the UK campaign primarily making use of the Wikidata items or the Wikipedia lists? If the former, then it would make sense for us to switch the Monuments database over to harvesting from Wikidata instead. Could be a candidate for something similar to T200112: Add wikidata-only datasets to MonumentsDatabase.

Lokal_Profil added a subscriber: Multichill.
In T203348, @Multichill wrote:

I seem to have added them back in 2012. According to https://commons.wikimedia.org/w/index.php?title=Commons:Monuments_database/Statistics&oldid=204208075 it worked some time ago, missing on https://commons.wikimedia.org/wiki/Commons:Monuments_database/Statistics (current version). Configuration is at https://github.com/wikimedia/labs-tools-heritage/blob/master/erfgoedbot/monuments_config/gb-sct_en.json . Some things to check:

Thanks @Multichill for finding the source for this.

Not finding the header doesn't stop the page from being harvested, but not finding the row template does. In both cases redirects are not resolved so those entries are skipped.

{{HB Scotland header}} and {{HB Scotland row}} are still referred to from the template documentation, so it's unclear if a search/replace is desired.

I'll update the config to use the redirect target "HS listed building header" at least.
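For reference, the skipping described above could be worked around by normalising template names through a small redirect map before matching them against the harvester config. A minimal sketch (the helper and the redirect map are hypothetical; erfgoedbot's actual matching code differs, and a real implementation would look up redirects on the wiki, e.g. via pywikibot, rather than hard-code them):

```python
# Hypothetical redirect map: old template name -> current target.
# "HS listed building header" is the target named above; the row entry
# is an assumed analogue.
TEMPLATE_REDIRECTS = {
    "HB Scotland header": "HS listed building header",
    "HB Scotland row": "HS listed building row",
}

def canonical_template(name):
    """Follow known redirects so lists still using old template names match."""
    name = name.strip()
    return TEMPLATE_REDIRECTS.get(name, name)
```

With this in place, a page using either the old or the new template name would resolve to the same config entry instead of being skipped.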

Change 457100 had a related patch set uploaded (by Lokal Profil; owner: Lokal Profil):
[labs/tools/heritage@master] Update gb-sct header template

https://gerrit.wikimedia.org/r/457100

Thanks for the response. With "the tool" are you referring to a UK-specific WLM tool, Monumental or some other tool?

Is the UK campaign primarily making use of the Wikidata items or the Wikipedia lists? If the former, then it would make sense for us to switch the Monuments database over to harvesting from Wikidata instead. Could be a candidate for something similar to T200112: Add wikidata-only datasets to MonumentsDatabase.

I'm referring to Monumental here. I'm not sure where Monumental gets its data from, but it should be reading Wikidata only, for all four UK campaigns. Since 2014 the UK tool (written by Magnus) has directly used datasets on Wikidata, uploaded from the original listing authority datasets. We never take anything from Wikipedia. The mapping we use is as follows:

wlm-gb-wls (Wales) maps to P1459
wlm-gb-nir (Northern Ireland) maps to P1460
wlm-gb-eng (England) maps to P1216
wlm-gb-sct (Scotland) maps to P709 (also P718 but this has few uses).
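As a sketch of what a Wikidata-based harvest could key on, the mapping above can be expressed as a lookup from campaign code to ID property and turned into a query (the helper and dictionary names are hypothetical; the property assignments follow the list above):

```python
# Campaign code -> Wikidata ID property, per the mapping listed above.
UK_CAMPAIGN_PROPERTIES = {
    "gb-wls": "P1459",  # Wales
    "gb-nir": "P1460",  # Northern Ireland
    "gb-eng": "P1216",  # England
    "gb-sct": "P709",   # Scotland (P718 also exists but has few uses)
}

def sparql_for_campaign(code):
    """Return a SPARQL query selecting all items carrying the campaign's ID."""
    prop = UK_CAMPAIGN_PROPERTIES[code]
    return "SELECT ?item ?id WHERE { ?item wdt:%s ?id . }" % prop
```

Running the resulting query against the Wikidata Query Service would yield the item set a Wikidata harvest for that campaign would cover.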

As the official datasets are updated, we try to reflect that by re-importing to Wikidata.

We'd prefer both the UK-specific and the Monumental tools to show exactly the same data, and that should be the Wikidata datasets mentioned above.

Can Monumental read directly from Wikidata? If not, could you please amend the working database to be a direct copy of the Wikidata datasets?

Paweł has been doing the recent setup work to get Monumental working for the UK campaigns, and there's recent email correspondence on the above points.

Thanks @Multichill for finding the source for this.

Not finding the header doesn't stop the page from being harvested, but not finding the row template does. In both cases redirects are not resolved so those entries are skipped.

{{HB Scotland header}} and {{HB Scotland row}} are still referred to from the template documentation, so it's unclear if a search/replace is desired.

I'll update the config to use the redirect target "HS listed building header" at least.

Stuff got merged, but nobody seems to have done the clean up run. Doing that now (example edit: https://en.wikipedia.org/w/index.php?title=List_of_listed_buildings_in_Ruthven,_Angus&diff=prev&oldid=857875184 ). Can you keep an eye on it to see what you get in tomorrow?

What I don't get is why https://en.wikipedia.org/wiki/List_of_Category_A_listed_buildings_in_Aberdeen wasn't harvested.

Change 457100 merged by jenkins-bot:
[labs/tools/heritage@master] Update gb-sct header template

https://gerrit.wikimedia.org/r/457100

Can Monumental read directly from Wikidata? If not, could you please amend the working database to be a direct copy of the Wikidata datasets?

For clarity: we (the WLM international team) support two independent data sources and associated tools.

  • Monumental is based on Wikidata. It does not make use of the Monuments database at all.
  • The Monuments Database (Wiki-Loves-Monuments-Database, and the associated tools) is based on the Wikipedia lists, and does not use Wikidata.
    • Although @Lokal_Profil and I are trying to change that to some degree (T200112) in order to ease the migration path to Wikidata.

So in summary:

  • if your data is fully hosted on Wikidata, you should not have to worry about the monuments database :)
  • If you are transitioning your data from the Wikipedia lists to Wikidata, then T200112 may be of interest if you are heavily relying on MonumentsDB-tooling (mostly ErfgoedBot).

This just happened again:

WARNING: 46 primkey(s) missing on List_of_listed_buildings_in_Glasgow/13 (monuments_gb-sct_(en))
ERROR: Unknown error occurred when processing country gb-sct in lang en
(2006, 'MySQL server has gone away')

I cannot reproduce locally:

$ docker-compose run --rm bot python erfgoedbot/update_database.py -countrycode:gb-sct -langcode:en -log
<snip>
WARNING: 46 primkey(s) missing on List_of_listed_buildings_in_Glasgow/13 (monuments_gb-sct_(en))
WARNING: 8 primkey(s) missing on List_of_listed_buildings_in_Kirkgunzeon,_Dumfries_and_Galloway (monuments_gb-sct_(en))
> Terminates in 4m42s

By the way, when a contestant photographs a monument that isn't yet on Wikidata, Monumental (or the underlying code) is updating Wikidata automatically with the new image, isn't it?

I don't believe it is, for the simple reason that it is hard to judge which images are representative. But @Yarl can give a definite answer.

  • If you are transitioning your data from the Wikipedia lists to Wikidata, then T200112 may be of interest if you are heavily relying on MonumentsDB-tooling (mostly ErfgoedBot).

To clarify @JeanFred's comment: tooling here primarily refers to

  • automatic categorisation of uploaded images
  • lists of candidate images for monuments that do not yet have an image (in the list)

@MichaelMaggs The lists of UK monuments that are on Wikipedia today, are they completely abandoned? I.e. do they receive none of the updates that are made to Wikidata? If so, I would recommend that the Wiki-Loves-Monuments-Database stop harvesting these. We can try switching over to a Wikidata harvest to get some of the above-mentioned tooling working anyway.

We noticed a lot of warnings for missing prim keys.

This is because the config looks for the hb parameter, but some/many/all pages actually use hbnum

We should search/replace the template parameters on all lists.
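A minimal sketch of such a search/replace on the row-template parameters (illustrative only; an actual clean-up run would edit pages via pywikibot, and would ideally parse the wikitext with mwparserfromhell rather than a bare regex):

```python
import re

def rename_hbnum(wikitext):
    """Replace the |hbnum= parameter with |hb=, preserving whitespace,
    so the lists match what the harvester config expects."""
    return re.sub(r"(\|\s*)hbnum(\s*=)", r"\1hb\2", wikitext)
```

For example, `rename_hbnum("{{HS listed building row\n|hbnum = 12345\n|name = Example}}")` turns the `|hbnum = 12345` line into `|hb = 12345` while leaving everything else untouched.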

We tried running harvesting just on gb-sct, and it crashed again (so it’s not due to some weird timing linked to full harvest).

It always crashes at the same place − the last error was List of listed buildings in Glasgow/13. The one after is List of Category A listed buildings in Aberdeenshire. Checking that one − it uses hb − it is the first list in thousands to do so.

While ErfgoedBot was going through all these pages with no primkey, it was not touching the database at all. When it reaches Aberdeenshire, it finally tries to insert something, but by then the connection to the database has timed out.
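A common guard against error 2006 is to ping or reconnect before the first write after a long idle stretch, or to retry once on that specific error. A sketch of the retry pattern, using a generic connection interface rather than erfgoedbot's actual database layer (with PyMySQL one would instead call `conn.ping(reconnect=True)` before the write):

```python
MYSQL_GONE_AWAY = 2006  # MySQL error code for "server has gone away"

class DatabaseGoneAway(Exception):
    """Stand-in for the driver's operational error carrying an error code."""
    def __init__(self, code):
        super().__init__(code)
        self.code = code

def execute_with_retry(conn, sql, args=(), retries=1):
    """Run a statement, reconnecting and retrying once if the idle
    connection was dropped by the server. `conn` is assumed to expose
    execute() and reconnect() methods."""
    for attempt in range(retries + 1):
        try:
            return conn.execute(sql, args)
        except DatabaseGoneAway as e:
            if e.code != MYSQL_GONE_AWAY or attempt == retries:
                raise
            conn.reconnect()
```

Any other error code, or a second consecutive failure, is re-raised so genuine problems still surface in the logs.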

I think I must be missing something fundamental here, as I really can't understand why Scottish monuments continue to be harvested from Wikipedia into the Monuments database even after I explained above that none of the UK campaigns (including Scotland) have used the Monuments database since 2014.

Since that date all UK effort has gone into updating Wikidata directly from the official government lists. We don't use the Monuments database, do not intend to, and do not want to encourage anyone to start uploading images based on the largely unchecked and unreliable Wikipedia lists - to say nothing of the fact that Wikipedia's lists omit virtually all the Grade II sites - around 500,000 or so - which we have on Wikidata.

Attempting to maintain partial and incorrect UK lists as part of the old Monuments database is, so far as I can understand, simply a waste of volunteer time.

Have I missed something?

@MichaelMaggs this task is partly about the Scottish dataset, partly about finding the underlying problem and ensuring it doesn't show up somewhere else.

If, as you say, the lists are unmaintained, then I too agree that the Monuments Database should switch over to harvesting these monuments from Wikidata. We've had some problems with such large datasets in the past, but it would be worth a try; it would need a separate task for setting up the necessary mapping though. I assume this applies to the Wales, England and NIR lists as well?

That said, unless you've set up separate tools for categorizing the images on Commons and suggesting new images which can be used on Wikidata/Wikipedia, you are still relying on the Monuments Database; it's just been running on broken and/or outdated data.

The lists all still provide an upload link which feeds into the WLM competition and don't note that they are out of date, so I wouldn't be surprised if some participants are still making use of them.

Have I missed something?

The monuments database is not only for Wiki Loves Monuments. It's a representation of what is available on Wikipedia. In my view Wikidata doesn't replace Wikipedia. These lists should probably get some attention from editors to make them better.

The automatically generated lists are just a starting point. These should be expanded and articles should be written (see also https://commons.wikimedia.org/wiki/Commons:Wiki_Loves_Monuments/Philosophy#Help_Wikipedia ). Look at lists like https://en.wikipedia.org/wiki/National_Register_of_Historic_Places_listings_in_Nassau_County,_Florida to see a possible end result.

@Multichill, @Lokal_Profil thanks for the feedback. I do think that the Monuments Database should switch over to harvesting UK monuments from Wikidata (Scotland, Wales, England and NIR lists), assuming you still think it useful to have UK sites in it at all. The WP lists of UK sites are unmaintained by us, and have no connection whatsoever with the WLM-UK campaigns. They also omit the vast majority of grade II buildings. Of course the WP lists may receive odd edits from individual editors but those aren't reflected back to Wikidata.

The primary tool we use to select new candidate images for Wikidata is here: https://tools.wmflabs.org/fist/file_candidates/#/candidates/?source=COMMONS&group=ON%20WIKIDATA&size=1000&commonscat=Images%20from%20Wiki%20Loves%20Monuments%202018%20in%20the%20United%20Kingdom.
That tool reads Commons categories, and adds images to Wikidata. It does not use anything from Wikipedia nor the Monuments database.

The WP lists of UK sites are unmaintained by us, and have no connection whatsoever with the WLM-UK campaigns. They also omit the vast majority of grade II buildings. Of course the WP lists may receive odd edits from individual editors but those aren't reflected back to Wikidata.

This is a good question in general − does the fact that WLM UK decided to run its campaign from another platform mean that these lists should be abandoned in general − potentially deleted from WP?


On the topic of Wikidata-reliance: I would need some more data to be definitive but some thoughts:

Upload

If I understood you correctly, WLM-UK is using Monumental as its primary upload tool?
However, it looks like direct uploads from Monumental account for less than 1%; good old Upload Campaigns still account for 95% of uploads. These may of course come from sources other than the monuments lists − or do I misunderstand, and the Wikidata-based upload tool sends contestants back to the upload campaigns?

Processing

The primary tool we use to select new candidate images for Wikidata is here: https://tools.wmflabs.org/fist/file_candidates/#/candidates/?source=COMMONS&group=ON%20WIKIDATA&size=1000&commonscat=Images%20from%20Wiki%20Loves%20Monuments%202018%20in%20the%20United%20Kingdom.
That tool reads Commons categories, and adds images to Wikidata. It does not use anything from Wikipedia nor the Monuments database.

This looks good; however, you are overlooking other maintenance processes :) Between September 9th and September 17th, ErfgoedBot, relying on data from the Monuments Database, categorized 334 images from the WLM-UK campaign (222 from England, 111 from Wales, and 1 from Northern Ireland; there is no data readily available on the categorisation done from Sep 1st to 9th). This is a modest share of the total uploads, but probably around 10%.

Anyhow, I actually find it great that WLM-UK runs off Wikidata − it is where all campaigns should eventually be heading, and someone had to start :) But I just wanted to point out that you are probably relying on Wikidata less than one might think :)

@JeanFred, just some more background in case it's useful.

The map that the UK campaigns use ( https://tools.wmflabs.org/fist/file_candidates/#/candidates/?source=COMMONS&group=ON%20WIKIDATA&size=1000&commonscat=Images%20from%20Wiki%20Loves%20Monuments%202018%20in%20the%20United%20Kingdom) draws its information entirely from Wikidata. When a contestant clicks a pin to upload, they are automatically taken to a pre-filled version of the relevant campaign upload wizard (England, Scotland, Wales or NI depending on the location selected). There, they can edit the filename and description before uploading. Most contestants come via that route. Users are allowed to upload via Monumental if they wish, but we de-emphasise that in our publicity as Monumental does not allow the user to choose a filename, nor to edit the description.

I'm not necessarily suggesting that you do anything different, but just wanted to let you have the background as it seemed that the team might be putting quite a lot of volunteer time and effort into attempting to get some Scottish monuments into the database when that's of perhaps rather marginal benefit.

Looks like that did the trick: 47673 gb-sct monuments popped up in the monuments database on the latest harvest :) commons.wikimedia.org/wiki/Special:Diff/324109067

Last part is to check the enwp template against our configuration and do some last tweaks (removing multiple options for id fields and that kind of thing). I hope to be doing that soon.

And the table should probably be renamed from gb-sct to gb-sct-lb because we also have scheduled monuments, see T207067

Am I correct that this ticket can be marked as 'done'?

Multichill claimed this task.

Investigation done. Mostly fixed