
Identify and release data on similar Wikidata items
Open, Stalled, High, Public


In order to solve the inter-language conflicts for the recommendation API (T207406) we need to identify similar Wikidata items.


  • Create a script that identifies similar Wikidata items. Here's proof-of-concept code that shows promising results.

Script to get similar Wikidata items

  • Release data for manual testing of the results

See T210433#4778507

Event Timeline

bmansurov triaged this task as High priority.Nov 26 2018, 6:55 PM
bmansurov created this task.
bmansurov updated the task description. (Show Details)Nov 27 2018, 5:04 PM

@leila @diego can you help manually verify whether these similar Wikidata items make sense?

For @leila, enwiki:

For @diego, eswiki:

For me, ruwiki:

The files contain the following columns:

  • source_title: article title that redirects to 'target_title'
  • source_id: Wikidata ID of source_title
  • source_label: Wikidata label of source_id
  • source_description: Wikidata description of source_id
  • target_title: Wikipedia article title that is the target for source_title
  • target_id: Wikidata ID of target_title
  • target_description: Wikidata description of target_id
  • cosine_similarity: Cosine similarity between source_description and target_description. Only pairs with a cosine similarity above 0.75 are included in the results. The higher the number, the more similar the source and target descriptions are.
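For readers checking the last column, here is a minimal sketch of how a cosine similarity and the 0.75 cutoff could be computed. It assumes each description has already been turned into a numeric vector; how the vectors are built is not part of this sketch, and the names `cosine_similarity`, `keep_pair`, and `THRESHOLD` are made up for illustration.

```python
import math

THRESHOLD = 0.75  # cutoff quoted in the column description above


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # An all-zero vector has no direction; treat it as "not similar".
        return 0.0
    return dot / (norm_u * norm_v)


def keep_pair(source_vec, target_vec, threshold=THRESHOLD):
    """True when the pair would pass the similarity cutoff."""
    return cosine_similarity(source_vec, target_vec) > threshold
```

Identical directions score 1.0 and orthogonal vectors score 0.0, so a higher value means the two description vectors point in closer directions.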

The task is to verify whether these similar items are indeed similar and point out obviously un-similar items. Thanks for helping.

bmansurov updated the task description. (Show Details)Nov 27 2018, 5:56 PM
diego added a comment.Nov 27 2018, 6:26 PM

Hey @bmansurov, the Spanish list has over 11K entries. Maybe you could create a stratified sample by cosine similarity. Going through 11K doesn't sound realistic for me.

Another thing: I see a lot of scientific names for plants, and I don't really know how similar they are.

Finally, are you looking for similar or identical items? For example, I see 'abuelo' (grandfather) and 'abuela' (grandmother), which are similar but not identical. I also see some identical ones like "Nueva York" and "New York".

bmansurov added a comment.EditedNov 27 2018, 9:19 PM

@diego how can I easily stratify this data? I was hoping that we could eyeball the results. I know that some of the results don't make sense and that's because we're only relying on item descriptions. Ideally we would rely on other features too. Also with time, as Wikidata gets more descriptions and labels, we'll be able to make better predictions.
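One possible way to stratify, sketched under assumptions: bucket the rows by cosine_similarity into fixed-width bins and draw the same number of rows from each bin. The function name, the bin width, and the per-bin count are all hypothetical; rows are assumed to be dicts keyed by the column names listed above.

```python
import random
from collections import defaultdict


def stratified_sample(rows, per_bin=50, bin_width=0.05, seed=0):
    """Draw up to `per_bin` rows from each cosine-similarity bin.

    rows: list of dicts that each have a 'cosine_similarity' value.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    bins = defaultdict(list)
    for row in rows:
        # Rows whose similarity falls in the same bin_width-wide interval
        # share an integer bin index.
        idx = int(float(row["cosine_similarity"]) / bin_width)
        bins[idx].append(row)

    sample = []
    for idx in sorted(bins):
        members = bins[idx]
        k = min(per_bin, len(members))  # small bins are taken whole
        sample.extend(rng.sample(members, k))
    return sample
```

With this shape, a bin holding thousands of near-identical plant descriptions contributes no more rows than a sparsely populated bin, which keeps the manual review small.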

We're looking for something like "Tumor" -> "Neoplasm": an article that redirects to another article on Wikipedia. "New York" redirecting to "Nueva York" is a perfect example, because when recommending articles for creation we don't want to recommend "New York" to eswiki users when it already exists as "Nueva York".

@bmansurov, from eyeballing I can say:

  • 92% of the cases are "especie de plantas (species of plant)". I don't have the knowledge to say whether they are duplicates, but my first guess would be no: they are similar species but not the same. For example:

Tillandsia favillosa Q3528457 Tillandsia favillosa especie de planta Tillandsia paleacea Q7802665
Ophrys flavicans Q15432339 Ophrys flavicans especie de planta Ophrys bertolonii Q828375
Alpinia schumanniana Q10929017 Alpinia schumanniana especie de planta Alpinia zerumbet Q2703227

I would suggest removing all of them.

  • In the remaining 8%, I see people, and all the ones I have checked correspond to different persons with the same (or a similar) name. My suggestion is that for all items with property P31:Q5 (humans) you compare the dates of birth (P569). If two persons have similar names and the same date of birth, they are very likely to be the same. In the cases that I've checked, they were not.

Michael mayer Q1928565 Michael Mayer actor alemán Michael Mayer Q70378
Michael Fox Q2555623 Michael Fox actor estadounidense Michael J. Fox Q395274

  • I've also found some art pieces like albums (Q482994); these were also different. For those cases I would check the performer (P175).

Walk With Me Q10392994 Walk with Me álbum de Dog Eat Dog Walk with Me Q4017797 álbum de Jamelia
Dark Side Of The Moon Q1763889 Dark Side of the Moon álbum de Dream Theater The Dark Side of the Moon Q150901 álbum de la banda Pink Floyd

The cases that I've found to be what you are looking for are usually translations, like "New York" and "Nueva York" or Q1140669 Croatia Croacia Q16110249, but I saw only a few cases like that.

My main suggestion would be to split the items by P31 and, for the main values that you find there, pick the most popular / significant properties (like P569 for Q5) and compare those values.
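A rough sketch of that idea, assuming each item's statements have already been loaded into a dict that maps property IDs to sets of values. The mapping `DISCRIMINATING_PROPERTY` and the function `likely_same` are hypothetical names; only the two type/property pairs mentioned above are included.

```python
# Hypothetical lookup: "instance of" (P31) value -> property used to tell
# two items of that type apart. Only the pairs named in the comment above.
DISCRIMINATING_PROPERTY = {
    "Q5": "P569",       # human -> date of birth
    "Q482994": "P175",  # album -> performer
}


def likely_same(source, target):
    """Compare two items on the property that matters for their shared type.

    source, target: dicts mapping property IDs to sets of values.
    Returns True or False when a type-specific check could be made,
    and None when no rule applies or the needed values are missing.
    """
    shared_types = source.get("P31", set()) & target.get("P31", set())
    for type_id in sorted(shared_types):
        prop = DISCRIMINATING_PROPERTY.get(type_id)
        if prop is None:
            continue
        a, b = source.get(prop, set()), target.get(prop, set())
        if a and b:
            # Any shared value counts as agreement on this property.
            return bool(a & b)
    return None
```

Returning None rather than False for uncovered types keeps the "no rule available" case separate from "checked and different", so pairs without a rule can still fall back to the description similarity.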

@diego thanks for the feedback. I agree with you that we can improve the results by adding more features.

I briefly talked with @leila about this and at this stage, because of time constraints, I think we're fine with decreasing false positives at the cost of increasing false negatives. At a later stage we should think this through so that we don't create hard-coded rules per item type and instead find a general solution that can help with all item types.

leila added a comment.Nov 28 2018, 8:38 PM

@bmansurov I did some eye-balling and here are my observations:

  • There are a lot of false positives in the set: items that are suggested to be the same while they're not. For the initial application, which is article creation, this can be fine, as there are many missing articles and not recommending some more of the missing ones can be okay. This, of course, can create bias, but again, as long as we document it clearly, and given that so many articles are missing /and/ the cost of recommending an article that already exists is high, we can perhaps accept this false-positive error.
  • The above being said, I wonder: how does the result change if you do an exact match (ignoring lower/upper case differences) between the fields in "Also known as" and the item name? For example, I see in Q1216998 that we can clearly pick up tumor by just looking at the "Also known as" field.

Overall, my recommendation is to check "Also known as". If that doesn't give you what you need, then implement improvements based on what you currently have in this task and update the API. In the latter case, please make sure we clearly state that we are removing more articles than we should, and that we will come back to it to improve the models later (this will be a continuous improvement case). With this level of false positives in the data, however, I don't recommend putting effort into releasing it until we clean it further.
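The "Also known as" check could look roughly like the sketch below. It assumes the label and aliases of the target item are already available as plain strings; `normalize` and `alias_match` are hypothetical helper names, and the alias list in the usage note is illustrative only.

```python
def normalize(text):
    """Lowercase aggressively and trim whitespace so case differences are ignored."""
    return text.strip().casefold()


def alias_match(source_title, target_label, target_aliases):
    """True when source_title equals the target's label or one of its aliases,
    ignoring upper/lower case."""
    names = {normalize(target_label)}
    names.update(normalize(alias) for alias in target_aliases)
    return normalize(source_title) in names
```

For the example from the comment, `alias_match("Tumor", "neoplasm", ["tumor"])` is True, while a spelling variant that is not listed as an alias would not match, since this is an exact comparison.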

@diego and @leila thanks for the feedback. This time around I looked at Wikidata aliases to identify similar items (as suggested by Leila). I think the results are better, but not good enough IMO. The main complication comes from the fact that when a description is short, it looks similar to another item's description. Word2vec works better when descriptions are longer. Attaching the results for enwiki and eswiki.

bmansurov changed the task status from Open to Stalled.Dec 5 2018, 10:00 PM

I've tried using doc2vec too, but the results aren't great. The task is on hold until we figure out another approach.

leila added a comment.Dec 5 2018, 10:17 PM

@bmansurov I'm assigning it to myself based on what we discussed yesterday and will get back to you when I have a clearer idea.

leila claimed this task.Dec 5 2018, 10:17 PM
leila moved this task from Staged to Time Sensitive on the Research board.
leila moved this task from Time Sensitive to Staged on the Research board.Jul 11 2019, 12:41 AM
leila edited projects, added Research-Backlog; removed Research.