Measure maturity/quality of Wikidata items in a way it can be queried
Open, Medium, Public

Description

We were talking about this on IRC. Items mature from an empty item to a featured item over time. Some tools work really well on baby items, but completely mess up mature items.
For example, using PetScan to add some initial statements to sort things out works really well, but if you do those edits on already filled items you'll probably mess things up.
So if you expose the maturity in a way it can be used to query and filter, tools can focus on the right items to work on.
So you could just set up a query like https://www.wikidata.org/wiki/User:Multichill/Empty_items_with_Dutch_label to find not very mature items in a certain field.
A possible technical implementation would be to measure it, say on a scale from 0 (empty) to 100 (most mature item), and store it in the page_props table. That would also expose it in SPARQL.

ORES could and should probably be used to do the actual scoring.

We already try to do this in different ways, for example with the paintings:

Event Timeline

Restricted Application added a subscriber: Aklapper.

Maybe someone who maintains the Wikidata-Query-Service can look into integrating the ORES predictions. @Smalyshev, what do you think? Here's what a single prediction looks like:

Predicting the quality of the most recent version of Q2451450: https://ores.wikimedia.org/v3/scores/wikidatawiki/492705837/item_quality

{
  "wikidatawiki": {
    "models": {
      "item_quality": {
        "version": "0.1.0"
      }
    },
    "scores": {
      "492705837": {
        "item_quality": {
          "score": {
            "prediction": "C",
            "probability": {
              "A": 0.0,
              "B": 0.4939343938092204,
              "C": 0.49814893952411293,
              "D": 0.007916666666666666,
              "E": 0.0
            }
          }
        }
      }
    }
  }
}

Here's how I'd turn it into a score between 0 and 100.

>>> VALUES = {'E': 0, "D": 1, "C": 2, "B": 3, "A": 5}
>>> def weighted_score(probas):
...     return sum(VALUES[k]*p for k, p in probas.items())
... 
>>> probas = {
...               "A": 0.0,
...               "B": 0.4939343938092204,
...               "C": 0.49814893952411293,
...               "D": 0.007916666666666666,
...               "E": 0.0
...             }
>>> weighted_score(probas)
2.4860177271425536
>>> weighted_score(probas) * 20
49.72035454285107

Note that weighted_score() returns a value between 0 and 5. By simply multiplying by 20, we scale it up to a range of 0 to 100.
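
Wrapped up end to end, a sketch of going from a revision ID to such a score could look like this (assuming the requests library; the API URL and weights are the ones shown above, and error handling is left out):

import requests

VALUES = {'E': 0, 'D': 1, 'C': 2, 'B': 3, 'A': 5}

def quality_score(rev_id):
    # Fetch the item_quality prediction for one revision from the ORES v3 API
    # (same endpoint as the example above) and fold the class probabilities
    # into a single 0-100 number.
    url = 'https://ores.wikimedia.org/v3/scores/wikidatawiki/%d/item_quality' % rev_id
    data = requests.get(url).json()
    probas = data['wikidatawiki']['scores'][str(rev_id)]['item_quality']['score']['probability']
    return sum(VALUES[k] * p for k, p in probas.items()) * 20

print(quality_score(492705837))  # ~49.7 for the example above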

If it's possible to put this in page props somehow, then it's easy to pick up. Though since we don't have page props modification notifications, the update should either be synchronous with edits, or we could have stale data there; see T145712.

Depending on the field you are trying to cover, I think wikibase:statements and wikibase:sitelinks can already do a lot (as in your sample).
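
For illustration, a sketch of such a query run via the public endpoint, using only those two predicates (the thresholds are arbitrary, and a real query would add a class/field restriction, as in the linked example, to stay focused and fast):

import requests

# A "not very mature items with a Dutch label" sketch using only predicates
# that already exist in the Wikidata RDF export.
QUERY = """
SELECT ?item ?itemLabel ?statements WHERE {
  ?item wikibase:statements ?statements ;
        wikibase:sitelinks ?sitelinks ;
        rdfs:label ?itemLabel .
  FILTER(LANG(?itemLabel) = "nl")
  FILTER(?statements < 3 && ?sitelinks = 0)
} LIMIT 100
"""

response = requests.get('https://query.wikidata.org/sparql',
                        params={'query': QUERY, 'format': 'json'})
for row in response.json()['results']['bindings']:
    print(row['item']['value'], row['itemLabel']['value'])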

For a step further: T150116

T175757: Store wp10 predictions in the MediaWiki database and this task are similar, but a proper decision has to be made. page_props is not the greatest way to implement it, though it's not super complex (just define a job that pushes the scores into the page_props table). Other options include the ores_classification table (but it will explode in size, and we need to implement it in a way that updates the old row instead of making a new one), Elasticsearch, etc. I think ores_classification is the cleanest one. Let me play with it a little.
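
To illustrate the "update the old row instead of making a new one" part, here is a minimal sketch of the upsert pattern keyed on page and model (the table and column names are made up for this example and are not the real extension schema; SQLite stands in for the real database):

import sqlite3

db = sqlite3.connect(':memory:')
db.execute("""CREATE TABLE ores_classification (
    oresc_page  INTEGER,
    oresc_model TEXT,
    oresc_score REAL,
    PRIMARY KEY (oresc_page, oresc_model)
)""")

def store_score(page_id, model, score):
    # INSERT OR REPLACE keeps exactly one row per (page, model) pair,
    # so the table does not grow with every new revision.
    db.execute(
        "INSERT OR REPLACE INTO ores_classification VALUES (?, ?, ?)",
        (page_id, model, score))

store_score(12345, 'item_quality', 49.7)
store_score(12345, 'item_quality', 62.1)  # a later edit updates the same row
print(db.execute("SELECT * FROM ores_classification").fetchall())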

Change 418877 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[mediawiki/extensions/ORES@master] Build a system that allows deleting old scores when new ones have arrived

https://gerrit.wikimedia.org/r/418877

Change 418877 merged by jenkins-bot:
[mediawiki/extensions/ORES@master] Build a system that allows deleting old scores when new ones have arrived

https://gerrit.wikimedia.org/r/418877

It'd be great if we could get an update on the current status here. Thanks! :)

(Putting my ORES hat on.) I'm working on it and hopefully it will reach your wiki really soon. Mostly blocked on T194297.

We can pick this up now I think. It might need some extension work but I think it's possible.

It looks like the item quality model isn't loaded yet; see https://quarry.wmflabs.org/query/28639. Are there any data storage/scaling concerns with having a quality prediction for every single item in Wikidata? It looks like that would be about 51 million rows.

Next Q. If we have the data in ores_classification, can Wikidata-Query-Service easily pull this in for querying? (ping @Smalyshev)

@Halfak the preferred way would be to get it into page_props. But if it's in a DB accessible to the RDF export code, it can probably pick it up, though I am not sure how technically complicated and efficient it would be to retrieve those. Probably not a huge deal anyway if it's in the database.
Once the RDF export has it, WDQS would pick it up automatically.

Ladsgroup unsubscribed.

As we try to split the graph as much as we can, I think the proper approach to this would be to store this data in a dedicated graph exposed through its own SPARQL endpoint and connected to WDQS through SPARQL federation. In general I would advise against using the Wikibase RDF dumps to expose such information; rather, create a separate dataset.
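
A minimal sketch of what that could look like from the query side, assuming a hypothetical dedicated endpoint and predicate for the quality data (neither exists yet; only the SPARQL 1.1 SERVICE federation mechanism itself is a given):

# Hypothetical: pull the quality score from a dedicated dataset via federation.
# The SERVICE URL and the ex:qualityScore predicate are placeholders.
QUERY = """
PREFIX ex: <https://example.org/ontology#>
SELECT ?item ?score WHERE {
  ?item wdt:P31 wd:Q3305213 .                     # paintings, as in the description
  SERVICE <https://example.org/quality/sparql> {  # placeholder endpoint
    ?item ex:qualityScore ?score .
  }
  FILTER(?score < 50)
} LIMIT 100
"""
# Once such an endpoint exists, this would run against WDQS like any other query, e.g.:
# requests.get('https://query.wikidata.org/sparql', params={'query': QUERY, 'format': 'json'})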