Measure maturity/quality of Wikidata items in a way it can be queried
Open, Medium, Public

Description

We were talking about this on IRC. Items mature from an empty item to a featured item over time. Some tools work really well on baby items, but completely mess up mature items.
For example, using PetScan to add some initial statements to sort things out works really well, but if you do those edits on already filled items you'll probably mess things up.
So if you expose the maturity in a way it can be used to query and filter, tools can focus on the right items to work on.
So you could just set up a query like https://www.wikidata.org/wiki/User:Multichill/Empty_items_with_Dutch_label to find not very mature items in a certain field.
A possible technical implementation would be to measure it, say on a scale from 0 (empty) to 100 (most mature item), and store it in the page_props table. That would also expose it in SPARQL.

ORES could and should probably be used to do the actual scoring.

We already try to do this in different ways, for example with the paintings:

Event Timeline

Restricted Application added a subscriber: Aklapper.

Maybe someone who maintains the Wikidata-Query-Service can look into integrating the ORES predictions. @Smalyshev, what do you think? Here's what a single prediction looks like:

Predicting the quality of the most recent version of Q2451450: https://ores.wikimedia.org/v3/scores/wikidatawiki/492705837/item_quality

{
  "wikidatawiki": {
    "models": {
      "item_quality": {
        "version": "0.1.0"
      }
    },
    "scores": {
      "492705837": {
        "item_quality": {
          "score": {
            "prediction": "C",
            "probability": {
              "A": 0.0,
              "B": 0.4939343938092204,
              "C": 0.49814893952411293,
              "D": 0.007916666666666666,
              "E": 0.0
            }
          }
        }
      }
    }
  }
}

Here's how I'd turn it into a score between 0 and 100.

>>> VALUES = {'E': 0, "D": 1, "C": 2, "B": 3, "A": 5}
>>> def weighted_score(probas):
...     return sum(VALUES[k]*p for k, p in probas.items())
... 
>>> probas = {
...               "A": 0.0,
...               "B": 0.4939343938092204,
...               "C": 0.49814893952411293,
...               "D": 0.007916666666666666,
...               "E": 0.0
...             }
>>> weighted_score(probas)
2.4860177271425536
>>> weighted_score(probas) * 20
49.72035454285107

Note that weighted_score() returns a value between 0 and 5. By simply multiplying by 20, we scale it up to a range of 0 to 100.
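
Wrapped up end to end, a sketch of going from a revision ID to such a score could look like this (assuming the requests library; the API URL and weights are the ones shown above, and error handling is left out):

import requests

VALUES = {'E': 0, 'D': 1, 'C': 2, 'B': 3, 'A': 5}

def quality_score(rev_id):
    # Fetch the item_quality prediction for one revision from the ORES v3 API
    # (same endpoint as the example above) and fold the class probabilities
    # into a single 0-100 number.
    url = 'https://ores.wikimedia.org/v3/scores/wikidatawiki/%d/item_quality' % rev_id
    data = requests.get(url).json()
    probas = data['wikidatawiki']['scores'][str(rev_id)]['item_quality']['score']['probability']
    return sum(VALUES[k] * p for k, p in probas.items()) * 20

print(quality_score(492705837))  # ~49.7 for the example above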

If it's possible to put this in page props somehow, then it's easy to pick up. Though since we don't have page props modification notifications, the update should either be synchronous with edits, or we could have stale data there; see T145712.

Depending on the field you are trying to cover, I think wikibase:statements and wikibase:sitelinks can already do a lot (as in your sample).
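
For illustration, a sketch of such a query run via the public endpoint, using only those two predicates (the thresholds are arbitrary, and a real query would add a class/field restriction, as in the linked example, to stay focused and fast):

import requests

# A "not very mature items with a Dutch label" sketch using only predicates
# that already exist in the Wikidata RDF export.
QUERY = """
SELECT ?item ?itemLabel ?statements WHERE {
  ?item wikibase:statements ?statements ;
        wikibase:sitelinks ?sitelinks ;
        rdfs:label ?itemLabel .
  FILTER(LANG(?itemLabel) = "nl")
  FILTER(?statements < 3 && ?sitelinks = 0)
} LIMIT 100
"""

response = requests.get('https://query.wikidata.org/sparql',
                        params={'query': QUERY, 'format': 'json'})
for row in response.json()['results']['bindings']:
    print(row['item']['value'], row['itemLabel']['value'])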

For a step further: T150116

T175757: Store wp10 predictions in the MediaWiki database and this task are similar, but a proper decision has to be made. page_props is not the greatest way to implement it, though it's not super complex (just define a job that pushes the scores into the page_props table). Other options include the ores_classification table (but it will explode in size, and we need to implement it in a way that updates the old row instead of making a new one), Elasticsearch, etc. I think ores_classification is the cleanest one. Let me play with it a little.
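
To illustrate the "update the old row instead of making a new one" part, here is a minimal sketch of the upsert pattern keyed on page and model (the table and column names are made up for this example and are not the real extension schema; SQLite stands in for the real database):

import sqlite3

db = sqlite3.connect(':memory:')
db.execute("""CREATE TABLE ores_classification (
    oresc_page  INTEGER,
    oresc_model TEXT,
    oresc_score REAL,
    PRIMARY KEY (oresc_page, oresc_model)
)""")

def store_score(page_id, model, score):
    # INSERT OR REPLACE keeps exactly one row per (page, model) pair,
    # so the table does not grow with every new revision.
    db.execute(
        "INSERT OR REPLACE INTO ores_classification VALUES (?, ?, ?)",
        (page_id, model, score))

store_score(12345, 'item_quality', 49.7)
store_score(12345, 'item_quality', 62.1)  # a later edit updates the same row
print(db.execute("SELECT * FROM ores_classification").fetchall())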

Change 418877 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[mediawiki/extensions/ORES@master] Build a system that allows deleting old scores when new ones have arrived

https://gerrit.wikimedia.org/r/418877

Change 418877 merged by jenkins-bot:
[mediawiki/extensions/ORES@master] Build a system that allows deleting old scores when new ones have arrived

https://gerrit.wikimedia.org/r/418877

It'd be great if we could get an update on the current status here. Thanks! :)

(Putting my ORES hat on.) I'm working on it and hopefully it will reach your wiki really soon. Mostly blocked on T194297.

We can pick this up now I think. It might need some extension work but I think it's possible.

It looks like the item quality model isn't loaded yet; see https://quarry.wmflabs.org/query/28639. Are there any data storage/scaling concerns with having a quality prediction for every single item in Wikidata? It looks like that would be about 51 million rows.

Next Q. If we have the data in ores_classification, can Wikidata-Query-Service easily pull this in for querying? (ping @Smalyshev)

@Halfak the preferred way would be to get it into page_props. But if it's in a DB accessible to the RDF export code, it can probably pick it up, though I am not sure how technically complicated and efficient it would be to retrieve those. Probably not a huge deal anyway if it's in the database.
Once the RDF export has it, WDQS would pick it up automatically.

Ladsgroup unsubscribed.

As we try to split the graph as much as we can, I think the proper approach to this would be to store this data in a dedicated graph exposed through its own SPARQL endpoint and connected to WDQS through SPARQL federation. In general I would advise against using the Wikibase RDF dumps to expose such information; rather, create a separate dataset.
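
A minimal sketch of what that could look like from the query side, assuming a hypothetical dedicated endpoint and predicate for the quality data (neither exists yet; only the SPARQL 1.1 SERVICE federation mechanism itself is a given):

# Hypothetical: pull the quality score from a dedicated dataset via federation.
# The SERVICE URL and the ex:qualityScore predicate are placeholders.
QUERY = """
PREFIX ex: <https://example.org/ontology#>
SELECT ?item ?score WHERE {
  ?item wdt:P31 wd:Q3305213 .                     # paintings, as in the description
  SERVICE <https://example.org/quality/sparql> {  # placeholder endpoint
    ?item ex:qualityScore ?score .
  }
  FILTER(?score < 50)
} LIMIT 100
"""
# Once such an endpoint exists, this would run against WDQS like any other query, e.g.:
# requests.get('https://query.wikidata.org/sparql', params={'query': QUERY, 'format': 'json'})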