Create priority lists for ancient monuments with no photos
Closed, ResolvedPublic

Assigned To
Authored By
Ainali
Jul 25 2017, 9:26 AM
Referenced Files
F9199066: fmi-imageless.txt.bz2
Aug 29 2017, 7:04 AM
F9096588: compare.py
Aug 17 2017, 9:32 AM
F9096327: wlm-fmi-imageless.zip
Aug 17 2017, 9:32 AM
F9089079: fmi-imageless.txt.bz2
Aug 16 2017, 3:08 PM

Description

There are some ancient monuments that have images in neither FMIS nor Wikimedia Commons. They will be prioritized, with a sponsored special prize from RAÄ.

Create a list that makes it easier for the competitors to find these.

This list should be similar to the one in T167421 except it is for objects in FMIS instead of BeBR.

Event Timeline

Hi!

I generated the list of imageless protected buildings from BeBR for T167421. I used code adapted from a previous project to cache relevant objects from SOCH and then run a SPARQL query over the cache to identify buildings without images at any level of BeBR's object hierarchy. However, my code from the previous project had been written to work on small-ish sets of triples: it fetches and mungs the SOCH data quite happily, but is very slow at serialising the resulting RDF so it can be imported/queried. Like, really slow: at one point it took almost a week to serialise a set of 1¼ million triples. (The BeBR dataset on SOCH contains around 14 million triples in total, and serialisation time does not scale linearly with graph size.)

I intend to rewrite my caching code to obviate the serialisation step and thus speed things up, but I don't know when I'll have time to devote to that, or when it might be finished. I could run the same code as it is for the FMIS objects, but as outlined above, that will probably take a very long time to complete. However, the time-consuming caching/serialisation is really just a means to an end: getting SOCH data into a state that allows SPARQL queries to be run on it.

If I remember correctly, @Abbe98 maintains a local cache of SOCH data. Albin: Does your cache support SPARQL? Does it contain triples for the links in SOCH's UGC hub in addition to the main SOCH index? If so, could I perhaps provide you with a query to produce a CSV list of imageless monuments from FMIS? I think that would be the quickest solution in this case.

Question for @Ainali: Are you interested in all monuments in FMIS? Or only those with a particular status (antikvarisk bedömning) e.g. only protected ancient monuments, not other cultural remains or natural landscape features? Or only certain monument-types?

/Marcus

@Carwash: We are only interested in a small subset of everything in FMIS. The selection is (according to this discussion):

Lämningstyp
Begravningsplats
Begravningsplats, enstaka
Bildristning
Borg
Brunn/kallkälla
Fiskeläge
Fornborg
Fäbod
Fästning/skans
Grav markerad av sten/block
Grav övrig
Grav- och boplatsområde
Gravfält
Gravhägnad
Gravklot
Husgrund, förhistorisk/medeltida
Husgrund, historisk tid
Hällbild
Hällmålning
Hällristning
Hög
Järnåldersdös
Kloster
Kvarn
Kyrka/kapell
Kyrkstad
Lägenhetsbebyggelse
Minnesmärke
Röse
Ristning, medeltid/historisk tid
Runristning
Slott/herresäte
Spärranordning
Stadsbefästning
Stadsvall/stadsmur
Stenkammargrav
Stenkistgrav
Stenkrets
Stensättning
Stridsvärn

Also, if "Skadestatus" or "Undersökningsstatus" is any of "Övertäckt, Förstörd, Flyttad, Ej återfunnen" they can be removed.
(Did I get that right, @Lokal_Profil?)

Abbe98 moved this task from Inbox to WiP on the User-Abbe98 board.

A few questions,

  • Is this limited to items with a specific itemSuperType?
  • Is the list of values above from itemClassName, and if so, how do we deal with items without it (itemLabel?)?

An alternative approach might be to use the WLM database as the point of origin and then just check if the item has an image defined in Kulturarvsdata?

https://tools.wmflabs.org/heritage/api/api.php?action=search&srcountry=se-fornmin&srwithoutimages=1&userlang=en&format=xml

EDIT: just matching the output from the WLM query above with a thumbnailExists=n one?
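
For anyone scripting this, the query URL above can be assembled from its parameters. This is only a sketch: the endpoint and parameter values are copied from the URL in this comment, `build_imageless_query` is a hypothetical helper name, and fetching, paging through, and parsing the response are deliberately left out because the response format is not described in this thread.

```python
from urllib.parse import urlencode

# Endpoint copied from the URL quoted in the thread.
API_ENDPOINT = "https://tools.wmflabs.org/heritage/api/api.php"


def build_imageless_query(country="se-fornmin", fmt="xml"):
    """Build the search URL for monuments without images.

    Hypothetical helper: only the parameters visible in the thread's
    URL are used; no other API options are assumed.
    """
    params = {
        "action": "search",
        "srcountry": country,
        "srwithoutimages": 1,
        "userlang": "en",
        "format": fmt,
    }
    # urlencode keeps the dict's insertion order.
    return f"{API_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_imageless_query())
```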

The alternative approach is a good shortcut, because the list I provided above was the criteria for inclusion in the WLM database. So matching those two seems to me to be a good approach to creating the list.

140 000 items without images in the WLM database compared to 900 000 items without images in FMIS.

700 000 items without images from FMIS if only those with coordinates are counted. @Lokal_Profil can I assume that all se-fornmin coordinates originate from FMIS?

That seems strange. There are supposedly 145 237 ancient monument items in the WLM database for Sweden so the number without images should be less than that.

@Ainali my bad, the number is 142 672 items without images in the WLM database.

@Carwash: We are only interested in a small subset of everything in FMIS. The selection is (according to this discussion):

Lämningstyp

[...]

Also, if "Skadestatus" or "Undersökningsstatus" is any of "Övertäckt, Förstörd, Flyttad, Ej återfunnen" they can be removed.
(Did I get that right, @Lokal_Profil?)

Yes, I think that is correct. But additionally they were filtered on Antikvarisk bedömning = Fornlämning. And of course all of this was based on whatever those values were in 2012.

700 000 items without images from FMIS if only those with coordinates are counted. @Lokal_Profil can I assume that all se-fornmin coordinates originate from FMIS?

Of course not =) It's a wiki; anything could have happened. That said, I would expect only a handful to have been changed.

Hi! I've been away for the past week, so I've missed some of the discussion here. Sorry about that.

A few questions,

  • Is this limited to items with a specific itemSuperType?

I hadn't planned to filter based on itemSuperType, since we're only interested in objects from serviceName=fmi, but I think they should all be of type object.

  • Is the list of values above from itemClassName, and if so, how do we deal with items without it (itemLabel?)?

The list of values is from the monument-types thesaurus. They should in theory map directly to itemClassName in SOCH, but I will double-check to be sure. :)

An alternative approach might be to use the WLM database as the point of origin and then just check if the item has an image defined in Kulturarvsdata?

https://tools.wmflabs.org/heritage/api/api.php?action=search&srcountry=se-fornmin&srwithoutimages=1&userlang=en&format=xml

This is an excellent idea! I think I was still a bit hung up on having fetched the BeBR data. There it was necessary to take a copy of the whole graph, because the BeBR data model is hierarchical, so an object may be depicted in an image ostensibly of another object further up/down the hierarchy. But the monuments data is not like that: it should be sufficient just to check whether the object itself has image or isVisualizedBy links in either SOCH or the UGC hub (which includes links to Commons), and if the answer is "no" for all of those, add it to the list. So perhaps that's what I should do! ;)
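
A minimal sketch of that per-object check, under stated assumptions: the triples from SOCH and from the UGC hub have already been fetched into plain (subject, predicate, object) tuples, the predicates are written with the short names used in this comment rather than full URIs, and `find_imageless` is a hypothetical helper, not code from this task.

```python
# Short predicate names as mentioned in the comment above; real data
# would use full predicate URIs (an assumption of this sketch).
IMAGE_PREDICATES = {"image", "isVisualizedBy"}


def find_imageless(candidates, *triple_sources):
    """Return the candidate URIs that have no image link in any source.

    Each source is an iterable of (subject, predicate, object) tuples,
    e.g. one for the main SOCH index and one for the UGC hub.
    """
    depicted = set()
    for triples in triple_sources:
        for subject, predicate, _obj in triples:
            if predicate in IMAGE_PREDICATES:
                depicted.add(subject)
    # Keep the input order so the result is reproducible.
    return [uri for uri in candidates if uri not in depicted]
```

An object is kept only if the answer is "no" in every source, which is why a missing thumbnail alone is not enough to put it on the list.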

EDIT: just matching the output from the WLM query above with a thumbnailExists=n one?

That alone would not suffice. For example, DR 363 has no thumbnail, but has 13 images linked against it in the UGC hub.

Okay, with that realisation, this doesn't sound like it will be as time-consuming as I had feared. Unless @Abbe98 already has a solution in place, I'll try to get this done at the weekend. Is that okay?

@Carwash I did start on a solution, but for now it seems partly blocked by T172247 and its subtasks.

If the solution were an HTTP-based one, the process could be sped up by exploiting Kringla.

I attach here a first attempt at a priority list of ancient monuments with no photos.

Why a "first attempt" and not just a list according to the criteria described in this thread? Well, because I wasn't able to satisfy all of those criteria. :(

This list is of 142,676 URIs for objects from FMIS, exposed via SOCH, which are:

  • monuments (kulturlämningar), AND
  • classified as one of the monument-types listed above, AND
  • have no associated images (no triples with predicates soch:isVisualizedBy/soch:visualizes, soch:lowresSource, or soch:thumbnailSource) AND
  • have no such linked-image triples in SOCH's UGC-hub either (including existing links to images on Commons)

SOCH does not expose FMIS' fields Antikvarisk bedömning, Skadestatus, or Undersökningsstatus so for criteria based on those fields I ran additional filters using the data-exports from FMIS. However, I was only able to filter to exclude objects which:

  • have Antikvarisk bedömning ≠ "Fornlämning" OR
  • have Skadestatus = "Förstörd" OR = "Övertäckt"
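
The two exclusion rules above could be applied to rows of the FMIS data-exports roughly as follows. This is a sketch, not the code actually used: it assumes each row has been read into a dict keyed by the field names as written in this thread (the real export column names may differ), and `keep_record` is a hypothetical helper.

```python
# Values of Skadestatus that disqualify an object, per the list above.
EXCLUDED_SKADESTATUS = {"Förstörd", "Övertäckt"}


def keep_record(record):
    """Return True if a row from the FMIS export passes both filters.

    `record` is assumed to be a dict with the keys "Antikvarisk bedömning"
    and "Skadestatus"; a missing assessment is treated as not "Fornlämning".
    """
    is_fornlamning = record.get("Antikvarisk bedömning") == "Fornlämning"
    is_excluded = record.get("Skadestatus") in EXCLUDED_SKADESTATUS
    return is_fornlamning and not is_excluded
```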

This is because the other values you want to filter on do not appear to exist in those fields in FMIS:

Also, if "Skadestatus" or "Undersökningsstatus" is any of "Övertäckt, Förstörd, Flyttad, Ej återfunnen" they can be removed.

For reference, here is a complete list of values that occur for those fields in FMIS, according to the data-exports:

Skadestatus:

  • Ev. i beskrivning
  • Förstörd
  • Restaurerad
  • Skadad
  • Undersökt och borttagen
  • Uppgift saknas
  • Välbevarad
  • Övertäckt

Undersökningsstatus:

  • Delundersökt
  • Ej undersökt
  • Ev. i beskrivning
  • Undersökt och borttagen
  • Uppgift saknas

I also excluded all objects which completely lack any geometry (not even a point, no coordinates at all) as I figured that would make them tricky to photograph. ;)

142,676 is spookily close to the 142,672 imageless objects in WLM. It's weird, because I know for a fact that several hundred image links have been added to the UGC-hub in the past couple of months alone, many for objects which otherwise lack images.

Anyway, as I've explained, I wasn't able to fulfil all the criteria above. Please advise.

If you use the wlm-fmi-imageless.txt provided below as origin you can skip checking for Antikvarisk bedömning and Skadestatus.

@Abbe98 Thanks for the list! Actually @Ainali and I just tried something similar this morning, with an export of WLM URIs from Wikidata. It made no difference to the resulting list, so I think I'll stick with the FMIS data dumps for the time being: it takes almost no time at all to iterate over the tables, and the data is more up to date. It did however reveal that a roughly equal number of objects have been disqualified (i.e. they either have images when they did not before, or have changed one of the other attributes we filter on) as have been added (i.e. new objects, or existing imageless objects which have changed one of the other attributes we filter on), which would appear to explain why the numbers above are so eerily similar.

Thanks also for the handy compare script! I find that comm(1) is very useful for doing left/right joins and intersections on lists. :)
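
For readers without comm(1) to hand, the same three-way comparison can be sketched with Python sets. This is not the attached compare.py; `compare_lists` is a hypothetical helper, and it assumes each list holds one URI per entry.

```python
def compare_lists(left, right):
    """Split two lists of URIs into left-only, right-only and shared.

    Roughly mirrors comm(1): "only_left" corresponds to `comm -23`,
    "only_right" to `comm -13`, and "both" to `comm -12`.
    """
    left_set, right_set = set(left), set(right)
    return {
        "only_left": sorted(left_set - right_set),
        "only_right": sorted(right_set - left_set),
        "both": sorted(left_set & right_set),
    }
```

Unlike comm, this does not require the inputs to be sorted beforehand, at the cost of holding both lists in memory.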

Great! If T172248 gets done, the list might need updating; otherwise it should be valid.

compare.py was a reference for statements I never wrote :-) comm is my everyday choice, but you never know what OS the person at the other end is using :-)


I attach an updated list of monuments without photos. It is the same as the first list, but includes additional monuments of the types described as "Priority 1" in @Ainali's new priority list.

@Carwash, @Ainali, @Abbe98 Did this get used during 2017, or is it something which could/should be re-used in 2018? Asking to see whether the task should be closed or moved to WMSE-Wiki-Loves-2018.

As far as I know, @Ainali referred WLM-2017 participants to the list in order to encourage them to take photos of otherwise unphotographed monuments, so in that sense it was "used". However, I don't know how much it was actually "used" by the participants - did we get many such "new" photos?

As far as I know, @Ainali referred WLM-2017 participants to the list in order to encourage them to take photos of otherwise unphotographed monuments, so in that sense it was "used". However, I don't know how much it was actually "used" by the participants - did we get many such "new" photos?

Thanks. I've added T187779: Evaluate 2017 priority lists for ancient monuments to WMSE-Wiki-Loves-2018 (WMSE Wiki Loves Monuments 2018) to follow up on this.