Update Naturvårdsregistret-data in WikiData and upload shapefiles to Commons
Closed, Resolved (Public)

Event Timeline

Minimum statistics: number of pages/entries updated or created. Global metrics.

@kalle -- you can add any statistics to T246691 so that they're all in one place. That way there's one Global metrics entry for all the work we did with the data from Naturvårdsv.

And to add to the previous comment… If you're editing an existing naturminne on Wikidata it should not be counted, as it is included in the 1361 that are already in the Global metrics. But you do count the national parks/reserves since they haven't been touched within this project yet.

Looking at a test edit at https://www.wikidata.org/w/index.php?title=Q30180845&oldid=1148665467

We've been discussing whether to update items with new areas etc if the values in the current dataset differ from the ones we uploaded a couple years ago. In this particular case, the differences are quite small.

If we do that, it's necessary to use the point in time qualifier to make it clear that both values are correct, just at different points in time. Example. This means that the _old_ value has to be edited as well, to add the qualifier.

Now, whether we should do this in the first place depends, I think, on the reasons for the changes: did the actual area of the nature reserve change (the borders were moved), or was it just measured more carefully? We can't tell from the data.

Because of that I'm mildly leaning towards replacing the old value rather than keeping it. Also it's worth noting that since we already have several area values (total, forest, land, water), adding extra ones might make it confusing for the users.

Separate comment about the shape file.

The bot also has to create a Talk page for every file, since that's how pages in the Data namespace are categorized. Example.

There is a category for Sweden-related map data: https://commons.wikimedia.org/wiki/Category:Map_data_of_Sweden but separate categories for our content have to be created as well. E.g. Map data of protected areas of Sweden → Map data of national parks of Sweden.

Since I'm messing with that object, here is a screenshot of what Alicia is referring to:

Screenshot from 2020-04-02 17-24-02.png (742×943 px, 42 KB)

@Alicia_Fagerving_WMSE

https://commons.wikimedia.org/wiki/Data_talk:/Sweden/Nature_reserves/2020/Johannisberg/2001895.map

Did you mean something like that?

I.e. category Sweden; category protected areas in Sweden; and, specifically, the category for the subtype of protected areas in Sweden known as nature reserves.

I cleaned up the categories a little.

A rule to keep in mind is to avoid over-categorization. So if a page is in the category "nature reserves of Sweden", then it shouldn't also be in the category "Sweden".

I also added a sorting key to the page to make sure that when you open the category Map data of nature reserves of Sweden, the page shows under the letter J. When we have lots of pages in the category, that will make it easier to navigate.

The sorting key should also be added by the bot, though it's not crucial -- but again, it helps make the category easy to browse once it contains many pages.
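The two rules above (keep only the most specific category, and add a sort key) could be sketched roughly as follows. This is only an illustration: the function names and the `PARENTS` mapping are hypothetical, and the parent/child relations are assumed from the category names mentioned in this thread, not read from Commons.

```python
# Hypothetical category tree (child -> parent), assumed for illustration only.
PARENTS = {
    "Map data of nature reserves of Sweden": "Map data of protected areas of Sweden",
    "Map data of national parks of Sweden": "Map data of protected areas of Sweden",
    "Map data of protected areas of Sweden": "Map data of Sweden",
}


def most_specific(categories, parents):
    """Drop every category that is an ancestor of another category in the list."""
    def ancestors(cat):
        seen = set()
        # Walk up the tree; stop at the root or if a cycle is detected.
        while cat in parents and parents[cat] not in seen:
            cat = parents[cat]
            seen.add(cat)
        return seen

    redundant = set()
    for cat in categories:
        redundant |= ancestors(cat)
    return [cat for cat in categories if cat not in redundant]


def data_talk_wikitext(category, sort_key):
    """Wikitext for a Data talk page: one category link with a sort key."""
    return f"[[Category:{category}|{sort_key}]]\n"
```

For the Johannisberg page, `data_talk_wikitext("Map data of nature reserves of Sweden", "Johannisberg")` would yield a category link that sorts the page under J.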

Regarding areas, should I go ahead and replace them? If so, do you want me to add a point in time for the new items, or is it enough with the timestamp in the reference?

If point in time is to be added, perhaps it's a good idea to also add that to operator in the case where it has changed?

Regarding statistics, the bot keeps a progress state which allows it to pause and continue processing items, reprocess items that caused an error in the bot, etc. This means that I keep statistics for each and every processed item, which can be summed up in any way we want. I added a whole bunch of metrics just in case; it literally takes less than a minute to add a metric. Let me know if there is something more you'd like to see.

The per item data currently contains:

  • NVRID
  • Fatal error message that caused the bot to abort processing.
  • Wikidata identity
  • If the item was skipped.
  • Time processing item started.
  • Time processing item ended.
  • Whether or not the item was created at Wikidata.
  • Whether or not the item was updated at Wikidata.
  • Claims created in Wikidata, e.g. "area water", "operator", "coordinate".
  • Claims deleted in Wikidata.
  • Whether or not the item geoshape was created at Commons.
  • Whether or not the item geoshape was updated at Commons.

This is simply a JSON object { "nvrid1" : { ... }, "nvrid2": { ... } } that I load from disk when starting and save after each item processed. Quick and dirty but does the job.
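A minimal sketch of such a load-on-start, save-after-each-item state file is shown below. The field name `fatal_error` and the function names are hypothetical, not taken from the bot. The atomic write via a temporary file is an addition that the thread does not describe; it is one way to avoid a half-written file if the process is killed mid-save.

```python
import json
import os
import tempfile


def load_state(path):
    """Load the per-item state, or start empty if no state file exists yet."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def save_state(path, state):
    """Write the state to a temp file, then atomically swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f, ensure_ascii=False, indent=1)
    os.replace(tmp_path, path)


def pending(nvrids, state):
    """Items never processed, plus items whose last run ended in a fatal error."""
    return [n for n in nvrids if n not in state or state[n].get("fatal_error")]
```

On restart, the bot would call `load_state`, iterate over `pending(...)`, and call `save_state` after every item.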

I think it makes sense to replace the areas, yes. In this case I don't think "point in time" is necessary. As for the operator – normally if there's only one value, "point in time" is not used; otherwise you would see it a lot on all kinds of statements that theoretically could change. If the operator has changed, then both the old and the new one should have a qualifier, but adding it when there's only one value looks like noise to me.

There are indeed operator changes. What point in time should I add to the previous claim? Perhaps same as the creation date of the previous claim?

That makes sense. Ideally we would like to use "start time" / "end time", but since we don't actually know when the changes took place, "point in time" is the best option to indicate why we have two values there.
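The rule agreed on here can be sketched as below. The `Claim` class is a simplified stand-in, not the Wikibase data model or any library's API; P137 (operator) and P585 (point in time) are the Wikidata property IDs, while the operator values and dates in the usage note are made up.

```python
from dataclasses import dataclass, field
from typing import Optional

OPERATOR = "P137"       # Wikidata property: operator
POINT_IN_TIME = "P585"  # Wikidata property: point in time


@dataclass
class Claim:
    prop: str
    value: str
    created: str  # ISO date on which the claim was added
    qualifiers: dict = field(default_factory=dict)


def update_operator(existing: Optional[Claim], new_value: str, dataset_date: str):
    """Return the operator claims that the item should end up with."""
    if existing is None:
        # Single value: no qualifier, to avoid noise.
        return [Claim(OPERATOR, new_value, dataset_date)]
    if existing.value == new_value:
        # Unchanged: leave the claim as it is.
        return [existing]
    # Changed: keep both values, each with "point in time". The old claim
    # gets its own creation date unless it already carries the qualifier.
    old_qualifiers = {POINT_IN_TIME: existing.created, **existing.qualifiers}
    old = Claim(existing.prop, existing.value, existing.created, old_qualifiers)
    new = Claim(OPERATOR, new_value, dataset_date, {POINT_IN_TIME: dataset_date})
    return [old, new]
```

For example, `update_operator(Claim("P137", "old operator", "2018-01-15"), "new operator", "2020-04-01")` returns two claims, qualified with 2018-01-15 and 2020-04-01 respectively.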

@kalle if you have questions, I guess you can ask them at Wikipedia:Projekt naturreservat. There are also some questions for you there about what Karlwettinbot did tonight...

Yger's earlier comment, which has not been answered:

image.png (508×2 px, 169 KB)

Thanks @kalle for your fast action at Wikipedia:Projekt naturreservat

Another related thing: Kalle, do you have any thoughts on how we could use this external data source to detect vandalism in WD?

Kalle, as I guess you understand, Yger is always checking everything ;-) but I think it would be great to do this check with lists comparing an external source with the data in Wikidata. The data in Wikidata is free and open and can be changed by everyone, and as more and more Wikipedia articles use WD data, a small act of vandalism can show up everywhere.

I did a test in 2018 with the Nobel Prize data and SPARQL federation; see the Listeria list User:Salgo60/ListeriaNobelData3 and T200668: Set up Nobel Data as federated search with Wikidata

The problem with that test is that the Nobel Prize people have sadly stopped maintaining their SPARQL endpoint [T234811#5552880] ...
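Without relying on a federated SPARQL endpoint, the list comparison suggested above could be sketched as a plain diff between two snapshots keyed by NVRID. Everything here is hypothetical: the function names, the field names, and the idea that both sides have already been exported to dictionaries. It only flags differences; deciding whether a difference is vandalism or a legitimate update still needs a human.

```python
def differs(expected, actual, tolerance):
    """Compare two values, allowing a small tolerance for numbers."""
    if isinstance(expected, (int, float)) and isinstance(actual, (int, float)):
        return abs(expected - actual) > tolerance
    return expected != actual


def find_mismatches(external, wikidata, tolerance=0.01):
    """List (nvrid, field, external value, Wikidata value) for every difference.

    Both arguments map NVRID -> {field name: value}. A field missing on the
    Wikidata side is reported with None as its Wikidata value.
    """
    mismatches = []
    for nvrid in sorted(external):
        wd_fields = wikidata.get(nvrid, {})
        for name, expected in external[nvrid].items():
            actual = wd_fields.get(name)
            if actual is None or differs(expected, actual, tolerance):
                mismatches.append((nvrid, name, expected, actual))
    return mismatches
```

The resulting list could then be published as a wiki page for reviewers, similar in spirit to a Listeria list.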

cc: @Larske if you have any thoughts....

Stats from nature reserves:

Item processed 5146
Created Commons geoshape 5130
Created Wikidata item 88
Modified Wikidata claim geoshape 5142
Modified Wikidata claim area land 4680
Modified Wikidata claim area water 4667
Modified Wikidata claim inception date 4441
Modified Wikidata claim coordinate 1151
Modified Wikidata claim area forest 830
Modified Wikidata claim area 630
Modified Wikidata claim iucn category 306
Modified Wikidata claim country 88
Modified Wikidata claim operator 83
Failed to process 2

The number of geoshapes created at Commons is off because the bot was killed when I turned off my laptop, before it had logged to the state file. The real number is exactly the number of items processed.
Modified claims include both previously missing claims and previously existing claims that have changed.
Some area counters are off by 10-20 due to manual handling of bugs.

Stats from national parks:

Item processed 30
Created Commons geoshape 30
Modified Wikidata claim area water 30
Modified Wikidata claim geoshape 30
Modified Wikidata claim coordinate 28
Modified Wikidata claim area land 26
Modified Wikidata claim inception date 15
Modified Wikidata claim area forest 7
Modified Wikidata claim area 4
Modified Wikidata claim operator 4

Stats from natural monuments:

Item processed 1360
Created Wikidata item 8
Created Commons geoshape 1360
Modified Wikidata claim inception date 1360
Modified Wikidata claim geoshape 1360
Modified Wikidata claim area forest 301
Modified Wikidata claim area water 301
Modified Wikidata claim coordinate 173
Modified Wikidata claim area land 132
Modified Wikidata claim area 8
Modified Wikidata claim country 8
Failed to process 1

The one that failed has been manually handled.
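For reference, counters like the ones above could be derived from the per-item state with a tally along these lines. This is a sketch only: the record keys (`fatal_error`, `created_item`, `created_geoshape`, `claims_created`) are hypothetical names for the per-item fields listed earlier, and whether failed items are counted as processed is an assumption.

```python
from collections import Counter


def summarize(state):
    """Sum per-item records into counters labelled like the stats above."""
    totals = Counter()
    for record in state.values():
        totals["Item processed"] += 1  # assumption: failed items count too
        if record.get("fatal_error"):
            totals["Failed to process"] += 1
        if record.get("created_item"):
            totals["Created Wikidata item"] += 1
        if record.get("created_geoshape"):
            totals["Created Commons geoshape"] += 1
        # One counter per claim type, e.g. "area water", "operator".
        for claim in record.get("claims_created", []):
            totals[f"Modified Wikidata claim {claim}"] += 1
    return totals
```

Calling `summarize(state).most_common()` lists the counters from largest to smallest.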