Update Wikidata reference tracking
Closed, Resolved · Public · 8 Estimate Story Points

Description

As a Wikidata advocate, I want to have accurate statistics about the amount and quality of Wikidata’s references when discussing Wikidata with other projects.

Problem:
The wikidata-datamodel-references dashboard currently claims that only about 3.8% of Wikidata references are “Wikimedia” references. While this sounds awesome, I don’t think it can possibly be true, based on my own experience with Wikidata references. Another panel on the same dashboard, meanwhile, names P248 as the most common and P143 as the fourth most common property for references, and P143 is exclusive to Wikimedia sources nowadays.

A closer look at the code generating these statistics ([MetricProcessor in analytics/wmde/toolkit-analyzer](https://github.com/wikimedia/analytics-wmde-toolkit-analyzer/blob/master/analyzer/src/main/java/org/wikidata/analyzer/Processor/MetricProcessor.java)) reveals that it uses hard-coded lists of properties and items, which have not been updated for at least three years. This desperately needs to be reworked.
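To illustrate the kind of rework meant here, the following is a minimal sketch in which the set of "Wikimedia" reference properties is passed in at run time instead of being hard-coded. The class and method names are hypothetical and are not taken from the toolkit-analyzer code. How the set would be refreshed (for example from a query or a config file) is left open. Only the property IDs P143 and P248 come from this task.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of toolkit-analyzer.
public class ReferenceClassifier {
    // Property IDs that mark a reference as a "Wikimedia" reference.
    // Supplied by the caller so the list can be refreshed without a code change.
    private final Set<String> wikimediaProperties;

    public ReferenceClassifier(Set<String> wikimediaProperties) {
        // Defensive copy so later changes by the caller do not affect this instance.
        this.wikimediaProperties = new HashSet<>(wikimediaProperties);
    }

    // A reference is given as the list of property IDs used in its snaks.
    public boolean isWikimediaReference(List<String> referencePropertyIds) {
        for (String id : referencePropertyIds) {
            if (wikimediaProperties.contains(id)) {
                return true;
            }
        }
        return false;
    }

    // Fraction of references that count as "Wikimedia" references (0.0 for no input).
    public double wikimediaShare(List<List<String>> references) {
        if (references.isEmpty()) {
            return 0.0;
        }
        long hits = references.stream().filter(this::isWikimediaReference).count();
        return (double) hits / references.size();
    }
}
```

With P143 in the configured set, a reference that uses P143 is counted as a Wikimedia reference, while one that uses only P248 is not.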

As well as updating the code, we will also have to get a new version deployed.
To do this the build jar file needs to be updated in https://github.com/wikimedia/analytics-wmde-toolkit-analyzer-build

Acceptance criteria:

  • The dashboard’s data seems plausible to Wikidata people

Open questions:

Details

Related Gerrit Patches:
analytics/wmde/toolkit-analyzer-build : production · Add build for deployment
analytics/wmde/toolkit-analyzer-build : master · Add build for deployment
operations/puppet : production · Add proxy info to toolkit analyzer cron job
analytics/wmde/toolkit-analyzer : master · Update metric's items and properties automatically

Event Timeline

Restricted Application added a subscriber: Aklapper. · Nov 13 2018, 5:39 PM
Nikki added a subscriber: Nikki. · Nov 13 2018, 9:39 PM

Magnus has some stats at https://tools.wmflabs.org/wikidata-todo/stats.php?reverse

Cleaning up P143 is still ongoing, there's still a lot of non-Wikimedia stuff in there. I'm not sure how much since queries tend to time out.

Addshore triaged this task as Medium priority. · Nov 20 2018, 7:59 AM
Addshore set the point value for this task to 8. · Nov 20 2018, 2:46 PM
Addshore updated the task description. · Nov 20 2018, 3:10 PM
Addshore moved this task from incoming to in progress on the Wikidata board. · Nov 21 2018, 8:19 AM
Michael added a comment. · Edited · Nov 22 2018, 4:19 PM

NB: apparently there are regressions both in the current openjdk release and the current surefire release that cause the following error message during the testing step in mvn package:

Error: Could not find or load main class org.apache.maven.surefire.booter.ForkedBooter
Caused by: java.lang.ClassNotFoundException: org.apache.maven.surefire.booter.ForkedBooter

Workaround (2) from one Stack Overflow answer didn't work for me, but the one from a different Stack Overflow answer did work somewhat. Now I'm getting test errors from IllegalArgumentExceptions, but I consider it progress.

Update: scrapping that approach and downgrading to the last known-good version of openjdk-8 resolved the problem completely.

Restricted Application added a project: User-Michael. · Nov 27 2018, 5:12 PM

I think the gerrit bot might be confused by the blank line between the “Bug” and “Change-Id” lines, and expect all these lines to be in one block?
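If that guess is right, the “Bug” and “Change-Id” lines would need to sit together in the final paragraph of the commit message. The sketch below checks a message for that layout. It does not reproduce the bot's actual parsing, which this thread does not describe. The class name and the Change-Id value in the usage note are hypothetical.

```java
// Hypothetical check, illustrating the "one block" footer convention only.
public class CommitFooterCheck {
    // Returns the last paragraph of the message (paragraphs are separated by blank lines).
    static String lastParagraph(String message) {
        String[] paragraphs = message.trim().split("\\R\\s*\\R");
        return paragraphs[paragraphs.length - 1];
    }

    // True if some line in the block starts with "<key>: ".
    static boolean hasTrailer(String block, String key) {
        for (String line : block.split("\\R")) {
            if (line.startsWith(key + ": ")) {
                return true;
            }
        }
        return false;
    }

    // True only if both trailers appear together in the final paragraph.
    static boolean footersInOneBlock(String message) {
        String footer = lastParagraph(message);
        return hasTrailer(footer, "Bug") && hasTrailer(footer, "Change-Id");
    }
}
```

For example, a message ending in `Bug: T209399` immediately followed by `Change-Id: I0123abc` passes the check, while the same message with a blank line between the two lines does not.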

Change 475807 had a related patch set uploaded (by Michael Große; owner: Michael Große):
[analytics/wmde/toolkit-analyzer@master] Update metric's items and properties automatically

https://gerrit.wikimedia.org/r/475807

Change 475807 merged by jenkins-bot:
[analytics/wmde/toolkit-analyzer@master] Update metric's items and properties automatically

https://gerrit.wikimedia.org/r/475807

Do we need to do anything to get this change deployed or does it happen automatically?

Yes, we might need to build it and push it to another repo. The description says:

Description of T209399

As well as updating the code, we will also have to get a new version deployed.
To do this the build jar file needs to be updated in https://github.com/wikimedia/analytics-wmde-toolkit-analyzer-build

I could look into this next week?

Change 480036 had a related patch set uploaded (by Michael Große; owner: Michael Große):
[analytics/wmde/toolkit-analyzer-build@master] Add build for deployment

https://gerrit.wikimedia.org/r/480036

Change 480510 had a related patch set uploaded (by Michael Große; owner: Michael Große):
[operations/puppet@production] Add proxy info to toolkit analyzer cron job

https://gerrit.wikimedia.org/r/480510

Change 480510 merged by Ottomata:
[operations/puppet@production] Add proxy info to toolkit analyzer cron job

https://gerrit.wikimedia.org/r/480510

Change 480036 merged by jenkins-bot:
[analytics/wmde/toolkit-analyzer-build@master] Add build for deployment

https://gerrit.wikimedia.org/r/480036

Change 482668 had a related patch set uploaded (by Addshore; owner: Michael Große):
[analytics/wmde/toolkit-analyzer-build@production] Add build for deployment

https://gerrit.wikimedia.org/r/482668

Change 482668 merged by jenkins-bot:
[analytics/wmde/toolkit-analyzer-build@production] Add build for deployment

https://gerrit.wikimedia.org/r/482668

The merge of this onto the production branch (which triggers the deployment) was missed.
I just merged it, so we should see the data update after the next run, which will use the new jar.

Addshore closed this task as Resolved. · Jan 21 2019, 10:41 AM

Looks fixed to me

It would be great to one day replace this Java thing with a more structured analysis in Hadoop.