
Quantify additional information available via external identifiers
Open, Needs Triage · Public

Description

Wikidata has a lot of external identifiers. They link to other databases, projects, catalogs, etc. They make a lot of additional information available that isn't in Wikidata itself. We should try to find ways to quantify this information. We should track it over time and publish this information.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jul 10 2019, 6:21 PM

@Lydia_Pintscher

We should try to find ways to quantify this information.

Would you allow me to become creative in that respect and try to figure out what statistics we could offer publicly?

We should track it over time and publish this information.

When T209655 (Copy Wikidata dumps to HDFS) goes into production,

  • we will be able to produce a regular update on any external identifier statistics;
  • most of the ETL procedures that we need are (most probably) already implicit in the code developed for the WD Identifier Landscape project;
  • so, the task most probably boils down merely to the question of what useful statistics we should extract and visualize.

If you don't mind, I would be ready to claim this and take care of it as soon as possible.

@Lydia_Pintscher

We should try to find ways to quantify this information.

Would you allow me to become creative in that respect and try to figure out what statistics we could offer publicly?

Sure :D
@Denny also had some idea around counting triples where available and coming up with pseudo-triples where not.

We should track it over time and publish this information.

When T209655 (Copy Wikidata dumps to HDFS) goes into production,

  • we will be able to produce a regular update on any external identifier statistics;
  • most of the ETL procedures that we need are (most probably) already implicit in the code developed for the WD Identifier Landscape project;
  • so, the task most probably boils down merely to the question of what useful statistics we should extract and visualize.

I think it's a bit more complicated than that. Basically we have an Item about X with so and so many statements. That's "our" information. And then we have links/external identifiers to, say, 3 libraries that also have information about X. We want to somehow quantify the latter for all of Wikidata's entities.
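To make the distinction concrete, here is a minimal sketch of the first half of that comparison: splitting one Item's statements into "our" information and external identifiers. It assumes the standard Wikibase JSON layout (`claims` keyed by property, each statement carrying a `mainsnak` with a `datatype`); the function name and the trimmed example entity are hypothetical. It only counts the links themselves and says nothing about how much information sits behind each link, which is the part that still needs an answer.

```python
from typing import Any, Dict


def count_statements(entity: Dict[str, Any]) -> Dict[str, int]:
    """Split an entity's statements into external identifiers and all others.

    Assumes the Wikibase JSON layout: entity["claims"] maps a property ID to
    a list of statements, each with a "mainsnak" that has a "datatype".
    """
    counts = {"own_statements": 0, "external_ids": 0}
    for statements in entity.get("claims", {}).values():
        for statement in statements:
            datatype = statement.get("mainsnak", {}).get("datatype")
            if datatype == "external-id":
                counts["external_ids"] += 1
            else:
                counts["own_statements"] += 1
    return counts


# Hypothetical, heavily trimmed entity: placeholder IDs, not real Wikidata data.
example_entity = {
    "id": "Q_EXAMPLE",
    "claims": {
        "P_A": [{"mainsnak": {"datatype": "wikibase-item"}}],
        "P_B": [
            {"mainsnak": {"datatype": "time"}},
            {"mainsnak": {"datatype": "time"}},
        ],
        # Three links to external catalogs (the "3 libraries" case above).
        "P_C": [{"mainsnak": {"datatype": "external-id"}}],
        "P_D": [{"mainsnak": {"datatype": "external-id"}}],
        "P_E": [{"mainsnak": {"datatype": "external-id"}}],
    },
}

print(count_statements(example_entity))
```

Summing these counts over all entities in a dump would give the number of outgoing links, but not the amount of information on the other side of them.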

If you don't mind, I would be ready to claim this and take care of it as soon as possible.

Sounds good. Though the data quality overview is more important.

GoranSMilovanovic added a subscriber: Halfak. · Edited · Jul 15 2019, 11:41 AM

@Lydia_Pintscher

That's "our" information. And then we have links/external identifiers to, say, 3 libraries that also have information about X. We want to somehow quantify the latter for all of Wikidata's entities.

Q. Do I understand correctly: you would like to have some sort of comparison (a "ratio" of some form) between (a) knowledge on X in Wikidata and (b) knowledge on X in other (linked from Wikidata) databases?

Sounds good. Though the data quality overview is more important.

As for T195702, that does not depend entirely on me; I am still doing my best to help @Halfak get his utility that produces the scores running.

@Denny also had some idea around counting triples where available and coming up with pseudo-triples where not.

I am sure that @Denny has plenty of ideas. On the other hand, we don't have plenty of Data Scientists... But I will be happy to take a look at whatever he suggests.

@Lydia_Pintscher

That's "our" information. And then we have links/external identifiers to, say, 3 libraries that also have information about X. We want to somehow quantify the latter for all of Wikidata's entities.

Q. Do I understand correctly: you would like to have some sort of comparison (a "ratio" of some form) between (a) knowledge on X in Wikidata and (b) knowledge on X in other (linked from Wikidata) databases?

My main goal is to be able to see how much additional information we're making available through our external identifiers. With this we can tell the world "Wikidata directly gives you access to so and so much information and then you can access an additional so and so much information easily" and continue to highlight how much benefit external identifiers bring. Tracking this over time then hopefully shows that we're making more and more information available every month/quarter.

Sounds good. Though the data quality overview is more important.

As for T195702, that does not depend entirely on me; I am still doing my best to help @Halfak get his utility that produces the scores running.

Ok cool.

@Denny also had some idea around counting triples where available and coming up with pseudo-triples where not.

I am sure that @Denny has plenty of ideas. On the other hand, we don't have plenty of Data Scientists... But I will be happy to take a look at whatever he suggests.

Yeah, we just had a quick chat about it. Take it as input and that's fine :)