Move WLM-IR data to Wikidata
Closed, Resolved (Public)

Description

We started the prep for transferring WLM-IR data to Wikidata at Wikimania. It would be great to finish the work for this transfer. I'm especially excited about this as Tehran's list of monuments is now almost fully geotagged, and I'd love it if maps.wikilovesmonuments.org could surface those monuments on the map for easy spotting.

Event Timeline

LilyOfTheWest created this task.

@Lokal_Profil @Alicia_Fagerving_WMSE can you let me know what I can do to help you make this happen? :)

I think that the only extra thing that might be needed for the IR dataset is related to T142897: Heritage Monuments database/API seems to be lacking Wiki page links, i.e. a simple mechanism for checking whether the name exists as a page (converted to a Wikidata entry). The same is needed for at least some of the South American datasets.

@LilyOfTheWest Are you guys fairly sure that the majority of the auto-generated links (on the name field) go to the right article? Wondering since we can either migrate the data with that assumption or ignore any such links.

@Lokal_Profil honestly, we need to do proper testing to answer this question with more confidence. In the absence of that test, I'd say:

  • There are many redlinks in the tables now.
  • In those that are blue links, I sometimes can spot an issue, but it's not too frequent. If I have to guess, I'd say 5%.

If the ~5% guesstimate is acceptable, I'd say let's move the data to Wikidata and we will fix issues as we spot them. Does this work?

We'll move them either way. The main decision is whether a new item is always created or whether, when there is a blue link, the statements can be added to that item instead of creating a new one. [If there is an item on Wikidata using the same id then that will be used either way]

If you believe 5% of the blue ones are wrong then I'd suggest always creating new ones (since merging is way easier than splitting). Can then maybe put together a list of manual merge candidates for volunteers to work through.
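
A minimal sketch of that decision rule, for illustration only. The names `Monument`, `plan_upload` and `items_by_id` are hypothetical and are not taken from the actual import code; the sketch assumes a lookup from monument id to an existing Wikidata item is available.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Monument:
    monument_id: str             # id from the WLM-IR list
    linked_item: Optional[str]   # Q-id behind a blue link on the name, if any


def plan_upload(monument, items_by_id):
    """Return (action, target_item, merge_candidate) for one monument."""
    existing = items_by_id.get(monument.monument_id)
    if existing is not None:
        # An item already uses this monument id: reuse it either way.
        return ("update", existing, None)
    # Otherwise always create a new item. A blue link is only recorded
    # as a candidate for a later manual merge, never written to directly.
    return ("create", None, monument.linked_item)
```

Collecting the third element of each result would give the list of manual merge candidates for volunteers to work through.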

@LilyOfTheWest Follow-up question on street addresses. Are these mainly real addresses or also "how to get there" descriptions? If there is a mixture, is there some pattern to isolate the real addresses (e.g. they always contain a number)?

They are not "how to get there" addresses, as in they don't have statements such as "make two lefts, three rights, etc.". However, whether they are real addresses or not depends on your definition of real. :) These addresses are rarely to the house number level. In highly populated cities such as Tehran, they are more specific, but in many cases, they are statements such as n km in highway x, for example. Are these considered real? I guess they are real enough to help people explore and find them, but a mailman would not take a letter there with that address. :)

> We'll move them either way. The main decision is whether a new item is always created or whether, when there is a blue link, the statements can be added to that item instead of creating a new one. [If there is an item on Wikidata using the same id then that will be used either way]
>
> If you believe 5% of the blue ones are wrong then I'd suggest always creating new ones (since merging is way easier than splitting). Can then maybe put together a list of manual merge candidates for volunteers to work through.

this sounds good. Let's go with your proposal then. :)

> @LilyOfTheWest Follow-up question on street addresses. Are these mainly real addresses or also "how to get there" descriptions? If there is a mixture, is there some pattern to isolate the real addresses (e.g. they always contain a number)?
>
> They are not "how to get there" addresses, as in they don't have statements such as "make two lefts, three rights, etc.". However, whether they are real addresses or not depends on your definition of real. :) These addresses are rarely to the house number level. In highly populated cities such as Tehran, they are more specific, but in many cases, they are statements such as n km in highway x, for example. Are these considered real? I guess they are real enough to help people explore and find them, but a mailman would not take a letter there with that address. :)

Then they're not real ;) In order to be used as the value of located at street address (P969), we'd need a consistent(-ish) pattern of "X street, Y number, (Z city)" (the last one doesn't have to be in the 'address' field in the db, if it can be fetched from another field). And since we have strings like "x km from..." I guess they contain digits, in which case they'd be hard to isolate from real street addresses, which also contain digits. So unless there's some reliable way to tell those apart, the safest thing would be to ignore the field altogether...
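
To illustrate why digits alone can't separate the two kinds of value, here is a small sketch of the naive "contains a number" test. The function name and the sample strings are made up (English stand-ins, not values from the Iranian list).

```python
import re

# For str patterns, \d matches any Unicode decimal digit,
# so Persian digits would be matched as well as ASCII ones.
HAS_DIGIT = re.compile(r"\d")


def looks_like_street_address(text):
    """Naive test: treat any value containing a digit as a street address."""
    return bool(HAS_DIGIT.search(text))


samples = [
    "Example Street, No. 12, Tehran",    # street-level address
    "12 km along the Example highway",   # distance description
    "Example Square",                    # no digits at all
]
results = [looks_like_street_address(s) for s in samples]
```

Both the street-level address and the distance description pass the test, so the heuristic would send "x km from..." strings to P969 as well.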

Here is my question: suppose we do what you suggest. Then how should the participants in the contest find the monuments at all? There is some information in the address field that can help a human get somewhere with that information, while if we don't expose that information at all, it becomes really hard for people to find these monuments. (given that we don't have geo-coordinates for the majority of the monuments).

This is a valid concern and I understand the problem. Loss of information is not something we want to happen during the migration.

The source of the problem is that with the way Wikidata is structured, there isn't a place for non-structured information to live.

I have found a property called Directions, which might be of use here. However, this introduces another problem -- how do we know, from looking at the content of the address field, which property it should be assigned to. I mean, obviously if you're human and you know the language, then you just know it :) But can it be done automatically?

@Alicia_Fagerving_WMSE @Lokal_Profil I reviewed our discussion about addresses and whether we should keep that field or drop it. I also reviewed the data more thoroughly. I do believe we should keep the address information in an appropriate field. These /are/ addresses, though they are not mailing addresses. A mailman won't deliver a letter to such an address in many countries but the addresses are specific enough, in many cases, to allow a human to get to the vicinity of the monument. You at least have the province and city name, and then you usually have street names, intersections, squares, etc. In some cases, you have the actual street number. I would go with address as a field and then allow the data to get improved over time when relevant. Do you think this can work?

The problem is that P969 is pretty strictly defined. This discusses some of the problems with it and [[ https://www.wikidata.org/wiki/Property_talk:P969#What_can_we_do_for_multi_language_of_this.3F | this ]] iterates the "what you should write on an envelope" definition.

P2795, on the other hand, seems to fulfill our need for anything which is meant to be used as a guide for finding the object. I would even suggest always using it as the fallback for P969 when we fail to identify a "real" address. @Alicia_Fagerving_WMSE @LilyOfTheWest: any objections to this?

It can work in this dataset, since we can tell that the directions are directions. But do you mean 'always' as in across datasets? Then I am not 100% sure, but then of course it depends on how loosely you define 'directions' :) I've seen plenty of values (in languages I'm able to read) that I'd describe as 'descriptions' but not 'directions', if that makes sense. It's a very subjective classification.

True, it might not work across the board, but it should be usable in many (most?) of the datasets.

I'm updating the Iran code with it, in any case.
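
A rough sketch of the fallback rule discussed above, not the actual Iran code. The function name `address_claim` and the `is_street_address` predicate are hypothetical; each dataset would have to supply its own predicate.

```python
P_STREET_ADDRESS = "P969"   # located at street address
P_DIRECTIONS = "P2795"      # directions


def address_claim(value, is_street_address):
    """Pick the property for an address-like value.

    P969 is used only when the dataset-specific predicate recognises a
    real street address; everything else falls back to P2795.
    """
    value = value.strip()
    if not value:
        return None  # nothing to add
    prop = P_STREET_ADDRESS if is_street_address(value) else P_DIRECTIONS
    return (prop, value)
```

For a dataset where no reliable street-address pattern exists, passing `lambda value: False` as the predicate sends every non-empty value to P2795.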

looks good to me @Alicia_Fagerving_WMSE . Thanks.

Just FYI: if we make the transition happen soon, we can still use it in this year's WLM. Tehran is almost fully geo-tagged and we can encourage people to use Monumental for Tehran. Please help us make it to that goal if possible! :)

@LilyOfTheWest Sorry for not getting back to you before. We had a sudden influx of things to do meaning we did not manage to get the Iran data up before the end of WLM.

The last (pre-bot request) step for Iran is taking a quick look at :d:Wikidata:WikiProject_WLM/Mapping_tables/ir_(fa)/matches. This gives the "instance of" value for any item which has been matched to a monument in the Iranian list. The purpose is to figure out if we need to filter out any of these. E.g. 6 items link to items that claim that they are religions, so likely we should add "religion" to our blacklist of types of matched items.

If you (or anyone else who is familiar with the data) wouldn't mind giving the list a quick glance and see if you can spot any other weird ones then we'll update the blacklist, merge the patch and file the bot request.
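
The filtering step could look roughly like this. It is a sketch under assumptions: the names are hypothetical, and plain labels stand in for the item ids the real blacklist would use.

```python
# Types that a matched item must not be an instance of.
# "religion" is the only entry mentioned so far; more may be added.
BLACKLIST = {"religion"}


def keep_match(instance_of, blacklist=BLACKLIST):
    """Return True if none of the item's 'instance of' values is blacklisted."""
    return not (set(instance_of) & set(blacklist))


matches = {
    "item A": ["mosque"],
    "item B": ["religion"],
    "item C": [],            # no 'instance of' statement at all
}
kept = sorted(name for name, types in matches.items() if keep_match(types))
```

Items without any "instance of" value are kept here; whether that is the right default is a separate decision.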

After a long hiatus I'm preparing a bot request and test upload for this.

Note that I spotted 411 items described as "bathrooms". I'm guessing these should actually have been tagged with Hamam instead. I pinged the user who imported them to correct this.

A ping about http://tinyurl.com/ycfl2x2j: some of the imported monuments seem to have coordinates outside of Iran. Likely this is due to typos in the list.
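
One cheap way to flag such entries is a bounding-box check. This is only a sketch: the box below is a rough, assumed approximation of Iran's extent, and the function name is hypothetical.

```python
# (lat_min, lat_max, lon_min, lon_max); rough values, assumed for illustration.
IRAN_BBOX = (24.0, 40.5, 43.5, 64.0)


def outside_bbox(lat, lon, bbox=IRAN_BBOX):
    """Return True if the point lies outside the bounding box."""
    lat_min, lat_max, lon_min, lon_max = bbox
    return not (lat_min <= lat <= lat_max and lon_min <= lon <= lon_max)


ok = outside_bbox(35.69, 51.39)        # roughly Tehran
swapped = outside_bbox(51.39, 35.69)   # same point with lat/lon swapped
```

A box also covers parts of neighbouring countries, so this only catches gross errors such as swapped or mistyped coordinates.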

@Lokal_Profil sorry for my long delay and happy to see that this is done. I will spend some time and review what amazing things have happened here while I was away. I'll do this in the next few days and ping with questions if they come up. Thank you SO much! :)