Create dashboard to show growth of structured data on Commons over time
Closed, Resolved. Public

Description

As a product manager, I need a dashboard where I can see the growth of structured data over time so that I can report on the success of the project and know what improvements to prioritize.

The success of Structured Data on Commons, along with information about its weaknesses and areas of biggest growth, will inform the SD team's decisions as we expand Structured Data from Commons to the Wikipedias. We want to ensure that we learn and improve as we dramatically expand the use of Structured Data across our projects. Additionally, the Image Recommendations project depends on the richness of Structured Data on Commons to provide high-quality image matches for articles, and this type of data will help us learn where we need to invest in improving SDC for that purpose.

The goal of this ticket is to create a regularly updated visual dashboard (in Superset or something like it?) where I can easily see the number of files with at least one structured data element, as well as the other data in the notebooks listed below, e.g.: media files containing structured fields in non-English languages; number of files with non-English captions; number of files with English captions; number of files that had captions added; how quickly after creation non-English captions get added; time to edit for SDC; and a comparison of the number of SDC and non-SDC edits.

This could be based on the analytics work that was done for the grant:

https://github.com/wikimedia-research/SDC-metrics-2019/blob/master/T231952-part-1.ipynb

https://github.com/wikimedia-research/SDC-metrics-2019/blob/master/T231952-part-2.ipynb

https://github.com/wikimedia-research/SDC-metrics-2019/blob/master/T231952-part-3.ipynb


(DRAFT) dashboard in Superset: https://superset.wikimedia.org/superset/dashboard/310/

The metrics include:

  • Overview
    • Number (%) of files with at least one structured data element
    • Median number of structured data elements per file
    • Number of files with license
    • Number of files with depicts
    • Number of files with captions (en vs. non-en vs. both)
  • Captions
    • Number of files with captions added monthly
    • Number of files with non-English/English captions added monthly
    • Number of files with both captions added monthly
    • How quickly after creation do non-English captions get added?
  • SDC edits
    • Number of SDC and non-SDC edits monthly
    • Time to edit for SDC

Event Timeline

LGoto triaged this task as Medium priority. May 12 2020, 5:05 PM
LGoto moved this task from Triage to Current Quarter on the Product-Analytics board.

@CBogen, @Ramsey-WMF and @Abit: heads up that this task isn't likely to go anywhere during the current quarter, given that MediaSearch and Computer-Aided-Tagging are priorities. Keeping it in the "Upcoming Quarter" column on the PA board for now, to be revisited in Q2.

The goal of this task is unclear, which makes it difficult to prioritize it. If we are to undertake this work, we need to know more about how building this dashboard will help inform the org or drive decisions. This is partly because there's already tracking of the number of files with at least one structured data element (T238878#5692516, productionized in T239565, see also T247101), which arguably tracks the growth of SDC. For now, I'll move this to the upcoming quarter for a future revisit.

The goal of this ticket is to have a regularly updated visual dashboard (in Superset or something like it?) where a PM or other interested party can easily see the number of files with at least one structured data element, as well as the other data in the notebooks listed in the description, e.g.: media files containing structured fields in non-English languages; number of files with non-English captions; number of files with English captions; number of files that had captions added; how quickly after creation non-English captions get added; time to edit for SDC; and a comparison of the number of SDC and non-SDC edits.

I'll update the description to add this.

CBogen updated the task description. (Show Details)

@CBogen and I discussed this in our sync meeting today, as reflected in the updated task description. I've added T258834 as a subtask of this one, because having the SDC data queryable in the Data Lake will help us answer key questions about the richness of it.

ldelench_wmf changed the task status from Open to Stalled. Jul 19 2021, 4:46 PM

The number of media files with M-entities has been measured monthly for the past year and a half, and you can get that from commonswiki_mediainfo_slots.tsv on analytics.wikimedia.org (the phab task for that was T239565). It looks like there's also a Grafana dashboard that tracks the number of pages on Commons with Wikidata entity usage (found through T238878). That number is considerably higher than the number of media files, as one would expect.

kzimmerman added subscribers: nettrom_WMF, kzimmerman.

Reassigning to Connie as she's now supporting Structured Data

The draft dashboard has the following monthly metrics available from Jun 2021 - Dec 2021:

Captions

  • Number of files with captions added monthly
  • Number of files with non-English/English captions added monthly
  • Number of files with both captions added monthly
  • How quickly after creation do non-English captions get added?

SDC edits

  • Number of SDC and non-SDC edits monthly
  • Time to edit for SDC

For caption-related metrics, we use the mediawiki_history table to calculate the captions added per month. For the total number of files with captions, we decided to calculate it once the entity table is created.
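As a rough illustration of the monthly caption counts, here is a minimal Python sketch. It assumes the caption-adding edits have already been extracted from mediawiki_history into (page_id, timestamp, language) tuples; that tuple shape and the function name `monthly_caption_counts` are hypothetical and not taken from the notebooks.

```python
from collections import defaultdict


def monthly_caption_counts(events):
    """Count files that had captions added, per month.

    `events` is an iterable of (page_id, timestamp, language_code) tuples,
    one per caption-adding edit. The timestamp is assumed to be an ISO-like
    string starting with 'YYYY-MM'. This input shape is an assumption.
    """
    # For each (month, file), remember which kinds of captions were added.
    kinds_by_month_file = defaultdict(set)
    for page_id, timestamp, lang in events:
        month = timestamp[:7]
        kind = "en" if lang == "en" else "non_en"
        kinds_by_month_file[(month, page_id)].add(kind)

    counts = defaultdict(lambda: {"any": 0, "en": 0, "non_en": 0, "both": 0})
    for (month, _page_id), kinds in kinds_by_month_file.items():
        row = counts[month]
        row["any"] += 1  # file had at least one caption added this month
        if "en" in kinds:
            row["en"] += 1
        if "non_en" in kinds:
            row["non_en"] += 1
        if len(kinds) == 2:
            row["both"] += 1  # both English and non-English captions added
    return dict(counts)
```

Each file is counted once per month and per category, so a file that received several German captions in one month still adds only one to `non_en`.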

For the SDC edits metrics, I chose to ignore whether these edits were reverted, as I've previously found that reverts of these are rare (as are removals). Wikitext edits, by contrast, must be neither a revert nor reverted (within the common timeframe of 48 hours), as such edits either reinstate an old version of the page or are unproductive.
In addition, these are all non-bot edits, and we only count edits to files that have not been deleted.
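The inclusion rules above can be sketched as a small Python predicate. This is only an illustration: each edit is assumed to be a dict whose keys are modeled on mediawiki_history column names (`event_user_is_bot_by`, `page_is_deleted`, `revision_is_identity_revert`, `revision_is_identity_reverted`, `revision_seconds_to_identity_revert`), which should be checked against the actual schema. The function name and the `is_sdc_edit` argument are hypothetical.

```python
REVERT_WINDOW_SECONDS = 48 * 60 * 60  # the 48-hour timeframe mentioned above


def is_counted_edit(edit, is_sdc_edit):
    """Return True if an edit should be counted in the SDC edits metrics.

    `edit` is a dict with keys modeled on mediawiki_history columns
    (an assumption); `is_sdc_edit` says whether the edit was already
    classified as a structured data edit.
    """
    # Only non-bot edits: a non-empty list means the user was flagged as a bot.
    if edit["event_user_is_bot_by"]:
        return False
    # Only edits to files that have not been deleted.
    if edit["page_is_deleted"]:
        return False
    # SDC edits are counted regardless of reverts.
    if is_sdc_edit:
        return True
    # Wikitext edits must not be a revert...
    if edit["revision_is_identity_revert"]:
        return False
    # ...nor be reverted within the 48-hour window.
    seconds = edit.get("revision_seconds_to_identity_revert")
    reverted_quickly = (
        edit["revision_is_identity_reverted"]
        and seconds is not None
        and seconds <= REVERT_WINDOW_SECONDS
    )
    return not reverted_quickly
```

Under this reading, a wikitext edit that was reverted only after the 48-hour window still counts; whether the original analysis treated late reverts the same way is not stated in this thread.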

cchen changed the task status from Stalled to Open. Mar 21 2022, 9:16 PM
cchen updated the task description. (Show Details)
cchen moved this task from Next 2 weeks to Doing on the Product-Analytics (Kanban) board.

The following metrics were added to the draft dashboard:

  • Number (%) of files with at least one structured data element
  • Number of files with license
  • Number of files with depicts
  • Number of files with captions (en vs. non-en vs. both)

The metrics are from the commons_entity table. We have the data back to Dec 2021.

Question: the "SDC edits" tab mentions "only look[ing] at non-bot edits to files".
MediaInfo submits all edits with a bot: 1 flag via the API (code for statements, code for captions) (the reason is simply "because that's how Wikidata does it")
Does that mean that those edits (via the regular SDC UI) are not included in this data, or does that param have no impact on how a "bot edit" is determined?

For SDC edits, we used edit comments in mediawiki_history to identify structured data edits (notebook). We flagged bots based on the event_user_is_bot_by field in mediawiki_history.
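To make the comment-based approach concrete, here is a hedged Python sketch. The exact patterns used in the notebook are not given in this thread; the action prefixes below (`wbsetlabel`, `wbsetclaim`, `wbcreateclaim`, `wbremoveclaims`) are assumptions based on the general shape of Wikibase auto-generated edit summaries such as `/* wbsetlabel-add:1|en */`. The helper names are hypothetical, and `event_user_is_bot_by` is assumed to be a list that is empty for non-bot users.

```python
import re

# Assumed shape of a Wikibase auto-summary: "/* <action>-<verb>:... */ text".
_AUTO_SUMMARY = re.compile(r"^/\*\s*(wb[a-z]+)-")

# Hypothetical mapping from API action to edit type; not taken from the notebook.
_CAPTION_ACTIONS = {"wbsetlabel"}
_STATEMENT_ACTIONS = {"wbsetclaim", "wbcreateclaim", "wbremoveclaims"}


def classify_edit(comment):
    """Classify an edit as 'caption', 'statement', 'sdc_other', or 'wikitext'."""
    match = _AUTO_SUMMARY.match(comment or "")
    if match is None:
        return "wikitext"  # no Wikibase auto-summary found
    action = match.group(1)
    if action in _CAPTION_ACTIONS:
        return "caption"
    if action in _STATEMENT_ACTIONS:
        return "statement"
    return "sdc_other"  # some other Wikibase action


def is_bot_edit(event_user_is_bot_by):
    """Treat any non-empty bot marker list as a bot edit (assumed semantics)."""
    return bool(event_user_is_bot_by)
```

In this sketch the bot decision depends only on the user-level field and never looks at the edit itself, which is why a per-edit flag would not change the result here.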

cchen updated the task description. (Show Details)

Reviewed the dashboard with the structured data team and added the median number of structured data elements per file to the dashboard.