[Curious Facts] take separators into account for single value constraints
Closed, Resolved, Public

Description

Problem:
Many of the cases where Curious Facts thinks there should be only a single value are in fact fine. This is because we don't seem to take into account the separator of the single value constraint. We should take it into account.

Example:
https://www.wikidata.org/wiki/Q15759459#P236 shows up as a curious fact because it should only have one ISSN. However two are ok if they are separated by a qualifier "distribution method". See the constraint definition here: https://www.wikidata.org/wiki/Property:P236#P236$a4e1524c-4191-7ca4-b275-b6096ec05cba

Also with PubMed ID. Example: Polymyositis: An overdiagnosed entity (Q58490803) has 2 values for PubMed ID (P698).

Acceptance criteria:

  • Separators are taken into account for single value constraint checks

Event Timeline

@Lydia_Pintscher Done. The following separator values (as defined here) were taken into account, and any item/property pair that makes use of them was filtered out of the result set:

  • P518
  • P580
  • P582
  • P407
  • P437
  • P123
  • P1810
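A minimal sketch of this filtering step, for illustration only. It is not the actual ETL code: the statement layout (a dict with `value` and `qualifiers`), the placeholder values, and the `Q_EXAMPLE`/`P_EXAMPLE` pair are assumptions made for the example.

```python
# The separator properties listed above, treated here as one global set.
GLOBAL_SEPARATORS = frozenset(
    {"P518", "P580", "P582", "P407", "P437", "P123", "P1810"}
)


def uses_separator(statements, separators=GLOBAL_SEPARATORS):
    """Return True if any statement carries at least one separator qualifier."""
    return any(separators & set(st["qualifiers"]) for st in statements)


def filter_candidates(candidates, separators=GLOBAL_SEPARATORS):
    """Keep only item/property pairs with several values and no separator in use.

    candidates: dict mapping (item, property) -> list of statements
    (hypothetical layout, not the real table schema).
    """
    return {
        pair: statements
        for pair, statements in candidates.items()
        if len(statements) > 1 and not uses_separator(statements, separators)
    }


candidates = {
    # Two ISSN values separated by a P437 qualifier (placeholder values).
    ("Q15759459", "P236"): [
        {"value": "issn-a", "qualifiers": {"P437": ["format-a"]}},
        {"value": "issn-b", "qualifiers": {"P437": ["format-b"]}},
    ],
    # Hypothetical pair with two unqualified values.
    ("Q_EXAMPLE", "P_EXAMPLE"): [
        {"value": "id-1", "qualifiers": {}},
        {"value": "id-2", "qualifiers": {}},
    ],
}

remaining = filter_candidates(candidates)
print(sorted(remaining))
```

Note that this drops a pair as soon as any of its statements uses any of the listed qualifiers, regardless of which property the statements belong to.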

However, I think that additional improvements are still possible.

Take for example Prague (Q1085): it has two LAU (P782) values differentiated by object has role (P3831) and subject has role (P2868). To me that makes some sense (please correct me if I am wrong); however, neither of the two qualifiers is recognized as a separator, nor is either of them mentioned in the definition provided in this ticket.

Also, Jota Mayúscula (Q5946641) has two Discogs artist ID (P1953) values - both using named as (P1810) with different values. Again, I might not be right but to me that also makes sense, while named as (P1810) is still not found in the definition of the filter that we are using here.

Thank you!
I am not sure but I think we have a bit of a misunderstanding. Let me try to clarify:

Does your system take this into account or only the separators from ISSN? Does this help clarify it? If not we can have a quick call.

In general if the constraint definitions are not complete then I would rather our system still shows them as errors so there is an incentive to improve the constraints definitions.

@Lydia_Pintscher

Does this help clarify it?

I think I understand completely what you are saying. Thank you.

Does your system take this into account or only the separators from ISSN?

Nope, I have obviously solved only a simplified problem, based on my incorrect assumption that there is a general set of constraints that applies to all properties.
I am on it as soon as some (small, not time-consuming) priorities are settled.
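To make the difference concrete, here is a rough sketch of a check that uses each property's own separators instead of one global set. It is an illustration under assumptions, not the Curious Facts implementation: the `SEPARATORS_BY_PROPERTY` mapping is a hypothetical excerpt (in practice it would be read from each property's constraint definition), the statement layout is invented for the example, and the rule that statements conflict only when they share the same separator qualifier values is my reading of the single value constraint.

```python
from collections import defaultdict

# Hypothetical excerpt: separators per property. A property that is
# absent from the mapping has no separators at all.
SEPARATORS_BY_PROPERTY = {
    "P236": {"P437"},
}


def single_value_violations(statements, separators):
    """Group statements by their separator qualifier values.

    Returns the groups of values that still collide, i.e. groups with
    more than one value for the same separator qualifier values.
    """
    groups = defaultdict(list)
    for st in statements:
        key = tuple(
            (sep, tuple(sorted(st["qualifiers"].get(sep, []))))
            for sep in sorted(separators)
        )
        groups[key].append(st["value"])
    return [values for values in groups.values() if len(values) > 1]


issn_statements = [
    {"value": "issn-a", "qualifiers": {"P437": ["format-a"]}},
    {"value": "issn-b", "qualifiers": {"P437": ["format-b"]}},
]

# With the property's own separators the two values do not collide.
with_separators = single_value_violations(
    issn_statements, SEPARATORS_BY_PROPERTY.get("P236", set())
)

# The same qualifier on a property that does not declare it as a
# separator does not excuse the second value.
without_separators = single_value_violations(issn_statements, set())

print(with_separators)
print(without_separators)
```

The second call shows why a global separator list is not enough: the same qualifier only counts as a separator where the property's constraint definition says so.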

Change 697686 had a related patch set uploaded (by GoranSMilovanovic; author: GoranSMilovanovic):

[analytics/wmde/WD/WikidataAnalytics@master] T277564

https://gerrit.wikimedia.org/r/697686

Change 697686 merged by GoranSMilovanovic:

[analytics/wmde/WD/WikidataAnalytics@master] T277564

https://gerrit.wikimedia.org/r/697686

I've tested it and it appears to be working fine.

@amy_rc Thank you. Shall we close this ticket then?

As far as I can tell, yes. However, I ran into a situation where the data was being retrieved incorrectly. This has happened a couple of times. For instance:
Qurious Facts: Silver-Russell syndrome (Q2142496) has 2 values for property: OMIM ID (P492). ( This fact was established on: 2021-06-01 22:59:24, and is based on the 2021-05-17 snapshot in hdfs of the Wikidata JSON dump. Edits made after that date are not taken into account.)

But, https://www.wikidata.org/wiki/Q2142496 has three OMIM ID and was last edited on 30 November 2020, at 02:52.

Can you please take a look at this?

@amy_rc That is rather strange. I am running a full system update now in relation to T277551; let's wait for the new update and then check out if the problem persists. I will perform the tests and let you know if the problem is systematic or (hopefully) not. Thank you for catching this!

@amy_rc

However, I ran into a situation where the data was being retrieved incorrectly. This has happened a couple of times. For instance: Qurious Facts: Silver-Russell syndrome (Q2142496) has 2 values for property: OMIM ID (P492). ( This fact was established on: 2021-06-01 22:59:24, and is based on the 2021-05-17 snapshot in hdfs of the Wikidata JSON dump. Edits made after that date are not taken into account.) But, https://www.wikidata.org/wiki/Q2142496 has three OMIM ID and was last edited on 30 November 2020, at 02:52.

We have a full system update of the Qurator Curious Facts system now. Please let me know if anything similar to what you have described in T277564#7138891 happens again. It should not; however, if it does: please let me know. Thank you.

I found one more case where Q21127479 has 4 values for Property OMIM ID and not two. This page was last edited on 24 January 2021, at 19:36.

With regard to my prior statement T277564#7157895, we observed that the tool only considers values containing qualifiers. Even though there are multiple values, properties like OMIM ID and Orphanet ID must have unique values. I hope this information is useful.

This is concerning the issue T277564#7138891 that I ran into again yesterday, and it explains what I saw in this scenario. Is this now clear? If not, we can make a quick call. Please let me know.

@amy_rc The part unclear to me is the following one:

... we observed that the tool only considers values containing qualifiers.

From the docs:

A qualifier can be defined as separator (P4155). This allows multiple values when using such qualifiers.

So of course Qurator Curious Facts considers only values containing qualifiers.

In that respect I did not understand how T277564#7158984 should affect T277564#7158984?

@GoranSMilovanovic Thank you for the quick call. As we discussed during our meeting, I will await your further comments :)

@amy_rc @Lydia_Pintscher

Could it be the case that mapping relation type is treated as a separator, which overrides the single value constraint, and the Curious Facts system then recognizes that only two out of the four present values are violations of the single value constraint?

Because Cole-Carpenter syndrome (the item in your example) really has four values for OMIM ID (as observed in T277564#7158984), but two of these four values are qualified by mapping relation type, and so 4 - 2 = 2 is reported (i.e., two values violate the single value constraint, while two are qualified by separators and thus do not represent single value constraint violations)?
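The arithmetic behind this hypothesis can be sketched as follows. This is a minimal illustration, not the actual ETL code: the statement layout, the placeholder values, and the `MAPPING_RELATION_TYPE` key (standing in for the qualifier's property ID, which is not given in this ticket) are all assumptions.

```python
def split_by_separator_use(statements, separators):
    """Split statements into those that carry a separator qualifier and the rest."""
    qualified = [st for st in statements if separators & set(st["qualifiers"])]
    unqualified = [st for st in statements if not separators & set(st["qualifiers"])]
    return qualified, unqualified


# Hypothetical stand-in for the "mapping relation type" qualifier.
separators = {"MAPPING_RELATION_TYPE"}

# Four OMIM ID values, two of them qualified (placeholder values).
omim_statements = [
    {"value": "omim-1", "qualifiers": {"MAPPING_RELATION_TYPE": ["relation-a"]}},
    {"value": "omim-2", "qualifiers": {"MAPPING_RELATION_TYPE": ["relation-b"]}},
    {"value": "omim-3", "qualifiers": {}},
    {"value": "omim-4", "qualifiers": {}},
]

qualified, unqualified = split_by_separator_use(omim_statements, separators)

total = len(omim_statements)
reported = total - len(qualified)
print(f"{total} values on the item, {reported} reported")
```

Under this reading the item page shows four values while the reported count only covers the statements that are not excused by a separator.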

@amy_rc I think I've found the cause of things like T277564#7157895. It definitely has to do with the following observation of yours:

... we observed that the tool only considers values containing qualifiers.

Now the problem that I need to solve is how to search separately, within a single ETL, through the cases where separators are used and the cases of real single value constraint violations... I will report back here as soon as I have this solved. This might turn out to be a bit tricky; however, I hope that it will not take me too much time to figure it out.

Change 701494 had a related patch set uploaded (by GoranSMilovanovic; author: GoranSMilovanovic):

[analytics/wmde/WD/WikidataAnalytics@master] T277564

https://gerrit.wikimedia.org/r/701494

Change 701494 merged by GoranSMilovanovic:

[analytics/wmde/WD/WikidataAnalytics@master] T277564

https://gerrit.wikimedia.org/r/701494

@amy_rc @Lydia_Pintscher

Could someone please take a look at this ticket and let me know if we can finally resolve it?
Thank you!

It's here: Qurator Curious Facts : )

Thank you for the reminder :) I tested and found a similar case. Hope this helps

  1. restrictive cardiomyopathy has 2 values for Property Orphanet ID but is generally expected to have only one.

  2. hypoparathyroidism, familial isolated has 4 values for Property OMIM ID but is generally expected to have only one.

@amy_rc I see. I have also tested myself and found more similar cases.

@Manuel @Tobi_WMDE_SW

After numerous attempts to solve this problem, I need to declare that all general approaches have failed.

This must be, I believe, a consequence of some quite complicated mapping between our JSON data model and the structure of its hdfs copy - the wmf.wikidata_entity table in Hadoop.

I will need to invest quite some time, perform thorough "manual" tests, and sit quietly in a pen-and-paper setting to figure this out.

It is strange, because problems of this type are really rare across the Curious Facts system, but they seem to be persistent. There is no other way to solve this but to look for the structure of these rare failures on a case-by-case basis, analyze them, and generalize a rule. Only then will I know how to fix the (already complex) ETL code so as to have a general solution.

So, for a while, I will abandon the top-down approach (study the general mapping from JSON datamodel -> hdfs dump -> Python/R structures, implement it as a whole) in favor of an empirical study.

Next steps:

  • interactive PySpark (Jupyter/Analytics Cluster) approach:
    • generate M3 (single value constraint violations) solutions from the hdfs dump;
    • write out a direct test against WDQS;
    • sample the "suspects": this typically happens with OMIM ID, Orphanet ID, and when the "mapping relation" qualifier is used;
    • perform case-by-case analysis and validate against the direct test results;
    • fix code, iterate.
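The validation step in the plan above could look roughly like the sketch below. It is only an outline under assumptions: `build_count_query` and `find_mismatches` are hypothetical helper names, the query relies on the `wd:` and `p:` prefixes that WDQS predefines, and the counts are the ones mentioned in this ticket, typed in by hand instead of being fetched.

```python
def build_count_query(item, prop):
    """Build a SPARQL query counting all statements of `prop` on `item`.

    Assumption: the query is run against WDQS, where the wd: and p:
    prefixes are predefined. It counts statements of every rank.
    """
    return (
        "SELECT (COUNT(?statement) AS ?n) WHERE { "
        f"wd:{item} p:{prop} ?statement . "
        "}"
    )


def find_mismatches(etl_counts, direct_counts):
    """Return pairs whose ETL count differs from the direct count.

    Both arguments map (item, property) -> number of values.
    """
    return {
        pair: (etl_counts[pair], direct_counts[pair])
        for pair in etl_counts
        if pair in direct_counts and etl_counts[pair] != direct_counts[pair]
    }


# Counts reported by the tool versus counts seen on the item pages,
# as described earlier in this ticket.
etl_counts = {("Q2142496", "P492"): 2, ("Q21127479", "P492"): 2}
direct_counts = {("Q2142496", "P492"): 3, ("Q21127479", "P492"): 4}

query = build_count_query("Q2142496", "P492")
suspects = find_mismatches(etl_counts, direct_counts)

print(query)
print(suspects)
```

Each mismatching pair is a "suspect" for the case-by-case analysis. A fuller test would also need to account for separators, since a lower ETL count is expected when some values are legitimately excused.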

Hello, @GoranSMilovanovic. Thank you for taking the time to explain everything. We believe that if the issue descriptions can be changed to reflect the new approach, this ticket can be closed for the time being.

Current - hypoparathyroidism familial isolated has 4 values for Property OMIM ID, but is generally expected to have only one.

Suggested - hypoparathyroidism familial isolated has multiple values for Property OMIM ID, but is generally expected to have only one.

The issue description has been changed. Tested.

@amy_rc Ok. Closing the ticket as resolved.