
track quality of all/top 10000 Wikidata items over time
Closed, Resolved · Public

Description

There is a lot of discussion about the quality of Wikidata's items and whether they are getting better or worse over time. We should have data so we can have a rational discussion. ORES has quality scores for all items in Wikidata. We need a graph of:

  • the average quality score of all items over time
  • the average quality of the top 10000 items (as measured by amount of usage on Wikimedia projects) over time

This should be calculated at least monthly.

Related: T166427

Event Timeline

Sounds like an interesting idea. It might be easier to take a static set and measure how that evolves. Usage can change a lot through changes in templates.

Not sure it will add much to the general discussions; these are usually mixed up with many marginally important factors. Occasionally we get the fallout from that at Wikidata, with people trying to delete anything that has an "A" in the reference section because in some Wikipedia discussion they came to the conclusion that references need to have a URL that includes "B".

Vvjjkkii renamed this task from track quality of all/top 10000 Wikidata items over time to 56baaaaaaa. Jul 1 2018, 1:07 AM
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from 56baaaaaaa to track quality of all/top 10000 Wikidata items over time. Jul 2 2018, 3:39 PM
CommunityTechBot raised the priority of this task from High to Needs Triage.
CommunityTechBot updated the task description. (Show Details)
CommunityTechBot added a subscriber: Aklapper.
Harej added a subscriber: Harej.

@Lydia_Pintscher Does this depend on work from Scoring Platform?

@Lydia_Pintscher @RazShuty @Halfak

Ok, here's what I've got:

     item  revision             timestamp   usage
1  Q36524 924799644 2019-04-26 06:29:25.0 6791020
2  Q54919 929383859 2019-04-30 21:14:14.0 4376000
3 Q423048 919180363 2019-04-19 18:57:17.0 4252235
4  Q36578 866859320 2019-02-25 08:49:00.0 3692702
5 Q193563 919018095 2019-04-19 15:07:51.0 3389081
6 Q131454 928584935 2019-04-30 04:31:54.0 3353011

The table contains: items (item), their latest revision IDs (revision), the timestamp of the latest revision (timestamp), and their WDCM usage statistic (usage; it measures the item's usage across the WMF projects).

That was done in Pyspark. What I need now are the latest ORES scores, tagged by revision IDs, for all WD items, so that I can join them to this table.
Making millions of ORES API calls is obviously not feasible.

@Halfak Any ideas? If you have such a dataset, could you please let me know where it lives? Many thanks.

@Ladsgroup @RazShuty @darthmon_wmde

Amir, right before our meeting, what we need here is simple:

  • take a look at the sample data set in T195702#5208632;
  • the usage column is the WDCM re-use statistic (mentioning this just for clarity);
  • we need an additional column with the latest ORES prediction for the item quality (in terms of the A, B, C, D, or E Wikidata data quality categories);
  • and that is it.

In other words, if you can produce a table (.csv, .tsv) where:

  • the first column is the item id,
  • the second column is the ORES prediction (A, B, C, D, or E),

then I can join the predictions to the existing data set with the item id as the key, and the problem is solved. A minimal sketch of such a join follows.
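(Just for illustration, assuming the existing data set is already a Spark DataFrame and the predictions arrive as a headerless two-column CSV; all file names and column names below are placeholders, not the actual pipeline.)

```python
# Minimal Pyspark sketch of joining ORES item-quality predictions to the
# usage table shown above. File names and column names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wd_quality_join").getOrCreate()

# Existing data set: item, revision, timestamp, usage (as in the sample).
usage_df = spark.read.parquet("wdcm_usage_sample.parquet")

# Predictions: headerless two-column CSV (item id, predicted class A-E).
predictions_df = (
    spark.read.csv("ores_item_quality.csv", header=False)
    .toDF("item", "ores_prediction")
)

# Join the predictions to the usage table on the item id.
joined_df = usage_df.join(predictions_df, on="item", how="left")

joined_df.write.csv("wd_usage_with_quality", header=True, mode="overwrite")
```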

For clarity, making millions of calls to ORES is totally feasible. We have a utility for doing just this: @GoranSMilovanovic has been using the ores score_revisions utility. If you create a JSON file with a field called "rev_id" containing the most recent rev_id for each item, the utility will be able to process that and produce your dataset.
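For reference, a minimal sketch of preparing that input from a table like the one above (file names are placeholders, and the exact command-line arguments of the utility should be taken from its own --help output):

```python
# Sketch: turn the per-item latest revision ids into the JSON-lines input
# that the ores score_revisions utility expects, i.e. one JSON object per
# line with a "rev_id" field. File names are placeholders.
import csv
import json

with open("wd_items_latest_revisions.csv", newline="") as src, \
     open("rev_ids.jsonl", "w") as dst:
    for row in csv.DictReader(src):  # expects at least the columns: item, revision
        dst.write(json.dumps({"rev_id": int(row["revision"])}) + "\n")

# The resulting file can then be fed to the utility, roughly along the lines
# of the call below (the exact arguments differ between versions of the ores
# package, so check `ores score_revisions --help`):
#
#   ores score_revisions https://ores.wikimedia.org wikidatawiki itemquality \
#       < rev_ids.jsonl > scores.jsonl
```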

Status:

  • working on analytics/visualizations now;
  • next steps: dashboard.
  • Analytics/visualizations - DONE.

@Lydia_Pintscher @RazShuty @WMDE-leszek Here's a glimpse of what we've found out thus far:

  1. For all Wikidata items that have received an ORES quality prediction:
ORES Quality Prediction    Num. Items    Percent
A                               10730     0.018323596524787
B                             1007744     1.72092213012817
C                            23997946    40.9812376447003
D                            19136231    32.6788980288096
E                            14405722    24.6006185998371
  2. For top 10,000 most used Wikidata items that have received an ORES quality prediction:
ORES Quality Prediction    Num. Items    Percent
A                                1253    12.4466077282209
B                                1353    13.4399523194596
C                                5527    54.902155557763
D                                1741    17.2941293334658
E                                 193     1.91715506109069
  3. Item quality (ORES predictions) and item re-use statistics (WDCM): boxplots with outliers included (the points above the boxes represent items that are outliers in the sense of being used more frequently than other items in the same quality class A, B, C, D, or E):

scoreUsage_BoxPlot_ggplot2.png (607×800 px, 33 KB)

Roughly, item quality is (luckily) correlated with item re-use across the Wikimedia projects (i.e. the lower the item quality class, going A --> E, the less the items get re-used). However, we see a lot of outliers.

  4. The same as (3) but with outliers removed category-wise (i.e. outliers were detected and removed separately for class A items, class B items, etc., relative to their own quality classes):

scoreUsage_BoxPlot_ggplot2_outliers_removed.png (607×800 px, 17 KB)

It should now be even more obvious that higher item quality goes hand in hand with more item re-use.
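For transparency, a minimal sketch of the category-wise outlier removal (the actual analysis was done in R/ggplot2; the file and column names here are assumptions):

```python
# Minimal sketch of category-wise outlier removal with the standard
# 1.5 * IQR boxplot rule, applied separately within each quality class.
# Illustration only; column and file names are assumptions.
import pandas as pd

def drop_outliers(group: pd.DataFrame) -> pd.DataFrame:
    q1, q3 = group["usage"].quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = group["usage"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return group[mask]

items = pd.read_csv("wd_usage_with_quality.csv")  # columns: item, usage, ores_prediction
trimmed = (
    items.groupby("ores_prediction", group_keys=False)
         .apply(drop_outliers)
)
```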

  5. This one is interesting, but it takes a bit more to explain. Each point represents many Wikidata items: (1) we focus on one class, A for example, and then (2) pick only the unique values of the item re-use statistic, so that any number on the y-axis represents a unique re-use value. The points receive a horizontal jitter in the plot so that they are easier to tell apart. If we look at the A class, we see that its items take many different re-use values. As we move along the quality axis (A --> E), there are fewer and fewer distinct re-use values; in categories D and E we find only three. Once again: each point represents many Wikidata items (all items, in fact, that share the same re-use statistic).

scoreUsage_BoxPlot_ggplot2_outliers_removed_unique_values.png (607×800 px, 58 KB)

Why is this interesting? Without making a bold claim: the way items are re-used in category A looks like human or mixed human/machine activity (a higher diversification of the re-use statistic), while the item re-use in categories D and E looks suspiciously like a consequence of pure machine activity. Remember that items in the A category (highest quality) are also rare in comparison to the other quality classes. If this hypothesis is correct, we are possibly facing a situation where bots make use of low-quality Wikidata items (classes D, E) across the projects. In case the hypothesis holds: should we have a policy about this?
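Only as an illustration of the procedure described in (5), not the actual ggplot2 code (column and file names are assumptions):

```python
# Sketch of the "unique re-use values per class" view: keep one point per
# distinct usage value within each quality class and add horizontal jitter so
# the points are easier to tell apart.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

items = pd.read_csv("wd_usage_with_quality.csv")
unique_usage = items[["ores_prediction", "usage"]].drop_duplicates()

classes = ["A", "B", "C", "D", "E"]
x = unique_usage["ores_prediction"].map({c: i for i, c in enumerate(classes)})
jitter = np.random.uniform(-0.2, 0.2, size=len(unique_usage))

plt.scatter(x + jitter, unique_usage["usage"], s=8, alpha=0.5)
plt.xticks(range(len(classes)), classes)
plt.xlabel("ORES quality class")
plt.ylabel("WDCM re-use statistic (unique values per class)")
plt.show()
```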

  6. The distribution of the WDCM item re-use statistic across the quality classes:

distribution_Quality_USage.png (607×800 px, 30 KB)

  7. I have compiled a list of the top 1,000 most used items from each quality category B, C, D, and E (excluding the top A quality class) that are recognized as outliers in general (meaning: they are outliers in terms of being used more relative to the whole set of items considered, not relative to the items from their own quality class). These items are critical and need immediate attention, simply because they leave room for improvement (i.e. they are not in the A class) while being used widely across the projects. The list will be available from the Wikidata Quality Dashboard (under development); a sketch of the selection logic follows below.
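(Sketch for illustration only; the real list was compiled as part of the R workflow, and the column and file names are assumptions.)

```python
# Sketch of the selection behind the list in (7): flag outliers relative to
# the whole set of items (1.5 * IQR above the global third quartile), then
# take the 1,000 most used outliers from each of the B, C, D and E classes.
import pandas as pd

items = pd.read_csv("wd_usage_with_quality.csv")

q1, q3 = items["usage"].quantile([0.25, 0.75])
global_outliers = items[items["usage"] > q3 + 1.5 * (q3 - q1)]

critical_items = (
    global_outliers[global_outliers["ores_prediction"].isin(["B", "C", "D", "E"])]
    .sort_values("usage", ascending=False)
    .groupby("ores_prediction", group_keys=False)
    .head(1000)
)
```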

NEXT STEPS

  • Put the analytics procedure on a regular update schedule;
  • Develop the dashboard.

@Lydia_Pintscher @RazShuty @WMDE-leszek

Here's a prototype of a Wikidata Quality Report.

NEXT STEPS:

  • Include a bit more info on (1) ORES and (2) the quality grading scheme in the report intro;
  • Put the analytics procedure on a regular update schedule.

Thanks for all the work! I have a question: what dimensions of data quality (completeness, accuracy, consistency...) are you considering when you speak of "quality" in this scope? The term "quality" is a buzzword used by people to name things that sometimes have no relationship to each other, so I'm not sure what it means here in practical terms; I don't know what factors are included in the equation (and which are excluded and should be measured separately).

Here is a new version of the report with the Grading Scheme for Wikidata items included:

@abian Thanks!

what dimensions of data quality (completeness, accuracy, consistency...) are you guys considering when you speak of "quality" in this scope?

The Grading Scheme explains the criteria used.

The term "quality" is a buzzword used by people to name things that sometimes have no relationship to each other

I agree with you up to .7 (I am a Bayesian, so consider .7 to be a subjective measure of degree of belief).

so I'm not sure what it means here in practical terms,

In practical terms, and within the scope of this Report, it signifies exactly what the Grading Scheme defines as item quality (in conjunction with the full definition of the mathematical model in the ORES system, of course). I know that this answer provides a purely operational, maybe not so useful definition (if not an outright ostensive one). However, faced on the one hand with the fact that data quality is an immensely complex concept and the subject of endless discussion, and on the other hand with the need to start reporting on Wikidata item quality, this is the best answer we can provide right now.

I imagine that the quality assessment system will be re-designed one day, following tons of philosophical, methodological, and (hopefully) practical discussions. Until then, this Report is what we have.

I don't know what factors are included in the equation (and which are excluded and should be measured separately)

This is a question for the ORES team: I am sure that @Halfak can provide additional information in that respect. I have only a conceptual understanding of ORES (i.e. what ML approach it takes), but the details of its feature engineering (and your question, @abian, seems to be pointing right there) are beyond my knowledge.

@abian At WikidataCon 2019 we will have a Data quality panel, as well as a Data quality meetup. I also hope to learn more there about possible ways of assessing Wikidata quality. See you in Berlin this October, maybe?

@Lydia_Pintscher Here is the final version of the Report, including the timeline of the latest revids made for A, B, C, D, and E class items:

Temporal tracking of data quality is now enabled, and we will start accumulating more data as soon as the ORES updates are put on a regular schedule. A monthly update is almost certainly achievable; whether we can do it weekly depends on how long it takes the ORES utilities to update their predictions.

Please let me know if anything else is needed here.

As for the technical side, what remains to be done is to:

  • place the ORES update procedure on a regular update schedule;
  • host this report on a CloudVPS instance (we can use the same one that we use for WDCM);
  • sync everything (the ORES update, the WDCM update, the wmf.mediawiki_history table snapshots).

@abian, ORES directly models a measure of "completeness". However, it turns out that accuracy and consistency strongly correlate with this measure of "completeness", so it is also a good and useful proxy for "consistency" and "accuracy". I'd like to know when and where that breaks down so that we can model it better, but in the meantime, I think it is good and useful for measuring "quality" as a general concept. Consuming any measurement comes with caveats, and I think this is a good one to highlight, so thank you for raising it.

For our work developing better strategies for tracking the type and frequency of mismatches between predictions and reality, see https://mediawiki.org/wiki/Jade

@Lydia_Pintscher A slightly adjusted version of the report:

  • no qualitative differences in the results/conclusions;
  • addition: taking care to eliminate all items that might have existed at some point but were deleted in the meantime.

Thank you both! :-)

I have several concerns about how users may use and understand this indicator; I'll list the main ones in case you find them helpful, of course without any intention of hindering your work or preventing us from having metrics that help us better understand how useful the data could be. My main concerns have to do with the specification and with its possibly unrealistic ambitions.

  • The specification is intentionally ambiguous ("most appropriate", "applicable", "some important", "significant and relevant", "solid", "high quality", "non-trivial"...) so that AI resolves these ambiguities, and, since it was created as part of a project involving AI, leaving AI aside was definitely not an option. Otherwise, it would have been possible to keep the natural ambiguities in the short descriptions (to let users understand them easily) while avoiding many ambiguities in the detailed wording. The fact that the specification is so ambiguous and must be disambiguated by AI makes the ambiguity-free specification a black box from the beginning: a model that is not explainable in terms of how the ambiguities are resolved, hard to test, and prone to training problems that can be difficult to detect and fix.
  • Some ambiguities simply won't ever be resolved, since AI cannot be provided with all the data it would need. Some ambiguities will inevitably be ignored or wrongly handled (and it won't be easy to detect which ones). In this respect the specification is too ambitious: it makes AI bite off more than it can chew, while not considering other aspects that are important for measuring completeness.
  • Gut feeling: I have the impression that the specification is complex enough to cause too much cognitive load on the users who make judgments to train the model. This means that the judgments used to train the model probably can't take all the required criteria into account at the same time.
  • The specification is not formally agreed or approved and still has the template {{Draft}}, added by the main author. If we started now to use the indicator derived from this specification to track the quality of Wikidata Items over time, we wouldn't be able to significantly improve any part of the specification and implementation stack, since every important change would prevent comparing historical data with current data.
  • If the resulting indicator, whose formula would not be explainable, were called "data quality" and published in a way that could be queried (T166427), users would indeed trust The Indicator as a synonym for "data quality". They would use it to sort and prioritize their work, perhaps ignoring the best-rated Items (which would have relevant problems not considered by The Indicator, such as vandalism, outdated data, structural inconsistencies, constraint violations, etc.) and focusing on the worst-rated ones (even if they were really good according to the criteria that were wrongly quantified, undervalued, or not considered, and even if these Items had no impact on the project). People would probably rely less on their personal reasoning and exploration criteria and start letting The Indicator guide their efforts as if it were an oracle. In my opinion, this could divert the efforts of some users down the wrong path and, given the deficiencies this indicator would have, be more counterproductive than positive for the project.

When the specification was being designed I kept in touch with its author and we talked about these problems in a more or less superficial way, but probably the constraints of the academic project didn't leave much room for action. Now we don't have those constraints anymore and, if we want to use the specification, I think we should improve it. I've checked some of the Items listed in the reports and I unfortunately think the resulting indicator is not better than Recoin when it comes to measuring relative completeness and it's definitely below property constraints when it comes to measuring consistency. If the decision to start using this indicator without changes has already been made, I would suggest calling it "ORES completeness" or similar as a workaround to try to avoid some of the effects of possible misuse.

I hope I'm not sounding like the troll that appears in my profile picture (it's actually an enemy from the first The Legend of Zelda) :-) and that these comments can help in some way.

@abian At WikidataCon 2019 we will have a Data quality panel, as well as a Data quality meetup. I also hope to learn more there about possible ways of assessing Wikidata quality. See you in Berlin this October, maybe?

Hopefully we'll meet there, yes!

@Lydia_Pintscher I guess this task is completed now.

However, we might need a new ticket in relation to this:

  • to re-factor most of the data engineering code to work in the analytics cluster (it is currently done in R, on a single server, by a process that eats up to 50 GB of RAM on stat1007 - Analytics Engineering keep killing it, and for a reason); everything should migrate to Pyspark and run in the cluster (a rough sketch follows below);
  • until that is done, no further updates of the Data Quality Report will be possible.
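For orientation, a rough Pyspark sketch of what the cluster version could look like (the mediawiki_history column names follow the public schema, but the snapshot value and the WDCM usage table name are placeholders, not the production setup):

```python
# Rough sketch: take the latest revision per Wikidata item from
# wmf.mediawiki_history and join it with a WDCM usage table.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("wd_quality_refactor").getOrCreate()

revisions = (
    spark.table("wmf.mediawiki_history")
    .where(
        (F.col("snapshot") == "2019-08")            # placeholder snapshot
        & (F.col("wiki_db") == "wikidatawiki")
        & (F.col("event_entity") == "revision")
        & (F.col("event_type") == "create")
        & (F.col("page_namespace") == 0)            # main namespace = items
    )
)

# Keep only the latest revision per item.
latest = Window.partitionBy("page_title").orderBy(F.col("event_timestamp").desc())
latest_revisions = (
    revisions.withColumn("rn", F.row_number().over(latest))
    .where(F.col("rn") == 1)
    .select(
        F.col("page_title").alias("item"),
        F.col("revision_id").alias("revision"),
        F.col("event_timestamp").alias("timestamp"),
    )
)

usage = spark.table("wdcm.item_usage")              # placeholder WDCM table
result = latest_revisions.join(usage, on="item", how="inner")
```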