
[REQUEST] SDC metrics
Closed, Resolved · Public

Description

What's requested:
For upload/editing/viewing features:

  1. Quarterly comparison of metadata on files with a common template, such as the information template and artwork template.
  2. Comparison of editing on file pages with and without structured data 60 days or more after the file is uploaded.
  3. Quarterly measurement of media containing structured fields in non-English languages.

For search features:

  1. Clickthrough rate
  2. Clicks or scrolls to more results

Why it's requested:
To evaluate how we're doing on these grant metrics:
https://commons.wikimedia.org/wiki/Commons:Structured_data/Annual_Plan_2018-19#CDP_Targets

When it's requested:
As early as possible in Q2, and then again in Jan 2020.

Other helpful information:
I feel like I made a task for this a few months ago, but now I can't find it. If there is another task with the same requests, please do merge :)


Event Timeline

kzimmerman triaged this task as Medium priority.

Discussed with Megan on 10/30 regarding:

  1. Clickthrough rate
  2. Clicks or scrolls to more results

Will refer to the following notes for understanding the requirements and pulling the required metrics.
https://phabricator.wikimedia.org/T213597 : use this to start
https://meta.wikimedia.org/wiki/Research:Baseline_Metrics_for_Structured_Data_on_Wikimedia_Commons
https://phabricator.wikimedia.org/T188421#4897244
https://github.com/MeganNeisler/SDoC-Baseline-Metrics-Redux/tree/master/T187827/Jan2019Report
https://phabricator.wikimedia.org/T174519

Discuss with Mikhail regarding the following metric:

  1. Quarterly comparison of metadata on files with a common template, such as the information template.
  2. Quarterly comparison of when metadata on file pages are edited.
  3. Quarterly measurement of media containing structured fields in non-English languages.

Seems like this may be affected by issues with counting the number of files with structured data; see T238878: Data about how many file pages on Commons contain at least one structured data element for details.

Discussed with Mikhail and Morten on the Quarterly comparison metrics. Meeting scheduled with Amanda next week 11/26 for a few clarifications on the metrics.

Meeting with Amanda and Ramsey helped clarify some outstanding questions we had. Meeting minutes are provided in this document.
Some highlights from the discussion:

  • For the quarterly comparison of metadata on files with a common template: Amanda and Ramsey are specifically looking at the Information and Artwork templates only.
  • For the quarterly comparison of when metadata on file pages is edited: did the addition of SD encourage users to visit the file page and update it after 60 days? Does that number vary between quarters? [The task description has been changed accordingly.]
  • The SDC team is interested in knowing whether the presence of structured data features makes people come back to old files: "Commonists add more and richer metadata to 1% of Commons' media files by the end of FY18-19."
  • Amanda is okay with going ahead with the grant proposal without these numbers, so we have a little more time beyond Dec 6.
  • Priority of the metrics: first the quarterly number-of-files metrics, then the search metrics.

I've now completed a preliminary analysis of question 3, the quarterly measurement of media containing structured fields in non-English languages. As discussed in our meeting last week, this translates to "files with captions in a non-English language". The code behind the analysis can be found in this notebook on GitHub.

From what I could tell, it's not straightforward to identify what articles contain captions in specific languages using the replicated MediaWiki databases. I would expect that the wbc_entity_usage table provides this information, but inspecting the entries there for a couple of pages reveals that it contains captions ("labels") that are not shown on the corresponding page on Commons. Potential follow-up work would be asking the developers what code logic is behind choosing whether or not to show a caption (e.g. if the caption doesn't actually exist, how is that determined?)

Instead, I reused the approach from T230581, which identifies a page getting a label added, changed, or deleted through edit comments and uses the mediawiki_history table in the Data Lake as the source of truth. I found that by far the majority of these operations are additions (about 1.5 million edits through November 2019); changes and deletions are two orders of magnitude fewer. Due to this huge difference, I chose to ignore all changes/deletions and accept that an estimate might be off by about 10,000 pages, since the number of pages is much larger than that.
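For reference, here's a minimal sketch of the kind of query that approach implies. The mediawiki_history field names, the snapshot, the "wbsetlabel-add:1|<lang>" comment convention, and the wmfdata helper are assumptions on my part; the linked notebook is the source of truth.

```python
# Hedged sketch: count Commons files that had a caption ("label") added,
# grouped by the language code parsed out of the edit comment.
import wmfdata  # analytics helper assumed to be available in the notebook environment

QUERY = """
SELECT
  regexp_extract(event_comment, 'wbsetlabel-add:1[|]([a-zA-Z-]+)', 1) AS lang,
  COUNT(DISTINCT page_id) AS files_with_caption_added
FROM wmf.mediawiki_history
WHERE snapshot = '2019-11'
  AND wiki_db = 'commonswiki'
  AND event_entity = 'revision'
  AND page_namespace = 6              -- File: namespace
  AND event_comment LIKE '%wbsetlabel-add%'
GROUP BY regexp_extract(event_comment, 'wbsetlabel-add:1[|]([a-zA-Z-]+)', 1)
"""

caption_adds = wmfdata.hive.run(QUERY)
print(caption_adds.sort_values("files_with_caption_added", ascending=False).head(10))
```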

Using this approach, I first calculated the number of files with and without captions per the end of November 2019:

  • Number of files with captions: 1,365,092
  • Number of files with non-English captions: 731,900
  • Number of files with English captions: 776,401
  • Number of files with both (sum of the latter two minus the first; see the quick check below): 143,209
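The "both" figure is just inclusion-exclusion over the two language groups; a quick check of the arithmetic:

```python
# Inclusion-exclusion check on the caption counts above.
files_with_captions = 1_365_092
files_non_english = 731_900
files_english = 776_401

# Files counted in both groups = sum of the two groups minus the total.
files_with_both = files_english + files_non_english - files_with_captions
assert files_with_both == 143_209
print(files_with_both)
```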

I then also estimated the number of files with non-English captions at the end of each of the first three quarters of 2019:

| Quarter | Number of files with non-English captions | Diff since previous quarter |
|---|---|---|
| Q1 | 165,905 | |
| Q2 | 386,269 | 220,364 |
| Q3 | 600,719 | 214,450 |

@Abit and @Ramsey-WMF : I hope these numbers answer Q3, and please let me know what questions you have about this analysis! :)

[Edited on 2019-12-05: added diffs since previous quarter to the table of files with non-English captions.]

I would expect that the wbc_entity_usage table provides this information, but inspecting the entries there for a couple of pages reveals that it contains captions ("labels") that are not shown on the corresponding page on Commons. Potential follow-up work would be asking the developers what code logic is behind choosing whether or not to show a caption (e.g. if the caption doesn't actually exist, how is that determined?)

@matthiasmullie or @Cparle , our caption developers, any idea what's going on here?

I hope these numbers answer Q3, and please let me know what questions you have about this analysis! :)

It does, thanks so much @nettrom_WMF !

From what I could tell, it's not straightforward to identify what articles contain captions in specific languages using the replicated MediaWiki databases. I would expect that the wbc_entity_usage table provides this information, but inspecting the entries there for a couple of pages reveals that it contains captions ("labels") that are not shown on the corresponding page on Commons. Potential follow-up work would be asking the developers what code logic is behind choosing whether or not to show a caption (e.g. if the caption doesn't actually exist, how is that determined?)

I'm not sure I fully understand the context, but let's try.

So: wbc_entity_usage does not hold information about all entities. There are (many) entities that do not appear in wbc_entity_usage.
wbc_entity_usage (much like categorylinks or templatelinks) tracks usage of entities via Lua. Files can have MediaInfo entities (with captions and/or statements), but they might not show up in wbc_entity_usage unless they're being fetched in some Lua script (and then only the aspects that are being used in Lua will be recorded in wbc_entity_usage - i.e. an entity can have both labels and statements, but if only a label is used via Lua, the statements won't show up in wbc_entity_usage).
I'm not sure how to interpret "articles contain captions" - does "contain" here mean "file *has* a caption" or "file caption is used via Lua"?
wbc_entity_usage can be used for the latter, not the former (and I can't really think of any convenient method for the former offhand - it'll likely require fetching & deserializing the mediainfo slot's contents).

As for entries in wbc_entity_usage showing caption (label) usage in languages where the caption doesn't appear to exist: (I would need more examples to say with 100% certainty, but) this is likely similar to T238484: there appears to be a bug where an English caption is carried over to a user's interface language via Lua, and I'm guessing that's what's causing it to register invalid languages in wbc_entity_usage (they *are* being used, but in reality they don't exist)
I'll be looking into fixing that bug. If you happen to run into more examples, please comment on T238484 - they might be helpful for further debugging!
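To make the aspect bookkeeping above concrete, here is a hedged sketch of how one could inspect which aspects of a file's MediaInfo entity are recorded as used. The replica host, the example entity ID, and the 'L.<lang>' / 'C…' aspect codes are assumptions about the standard Wikibase client setup, not something confirmed in this task.

```python
# Hedged sketch: list which usage aspects are recorded in wbc_entity_usage for a
# given MediaInfo entity on the Commons replica.
import os
import pymysql

conn = pymysql.connect(
    host="commonswiki.analytics.db.svc.wikimedia.cloud",  # assumed replica host
    database="commonswiki_p",
    read_default_file=os.path.expanduser("~/.my.cnf"),    # assumed credentials file
)

entity_id = "M12345"  # hypothetical MediaInfo ID; replace with a real file's M-id

with conn.cursor() as cur:
    cur.execute(
        "SELECT eu_aspect, eu_page_id FROM wbc_entity_usage WHERE eu_entity_id = %s",
        (entity_id,),
    )
    for aspect, page_id in cur.fetchall():
        # 'L.en' would mean the English label was used via Lua on page_id;
        # statement usage shows up as 'C'-style aspects instead.
        print(page_id, aspect)
```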

Here is the Github notebook that has numbers for files containing common templates like Information and Artwork.
I have yet to merge it into wikimedia-research/SDC-metrics-2019 as it is pending review from @nettrom_WMF.

Amanda has also requested that, in addition to the number of files, we compare the metadata in them. Here's the mapping of template vs. SD metadata...

Screen Shot 2019-12-05 at 4.23.46 PM.png (354×1 px, 136 KB)

We are not entirely sure if it is possible to get data about template parameters from the data sources (Data Lake, MariaDB replicas, etc.) we currently use. Will discuss with everyone next week to decide on this.

A few comments:

  • Comparing structured data to the {{Artwork}} template might be a bit pointless: in the ideal case, a file would only have the P6243 property linking it to the Wikidata item holding all the metadata, plus P7482 with qualifier P973 to store the source. With those two properties we can have an Artwork template with no fields.
  • Comparing structured data to the {{Information}} template makes more sense, but be aware that the Information template is now implemented in Lua and is already dipping into SDC. At the moment, if a file is missing the Description field but has an SDC caption, it will display the caption where the description is supposed to be. I assume that in the future we will do the same with date, author, source, etc.

@Abit and @Ramsey-WMF : thanks for your patience with me getting this ready; it turned out that I jumped the gun during our meeting and gave you numbers based on all edits, not just the ones made some number of days after upload. Anyway, we've got numbers, and while they might not be as impressive, I think they're still positive!

The analysis of question 2 is found in this notebook. In this analysis, we look at additions of information, either through SDC or wikitext edits. For SDC edits, I chose to ignore whether they were reverted, as I've previously found that reverts of these are rare (as are removals). Wikitext edits also have to not be a revert nor be reverted (within the common 48-hour timeframe), as those either reinstate an old version of the page or are unproductive edits.

In addition, these are all non-bot edits, and we only count edits to files that have not been deleted.

The first part of the analysis looks at how old the files were when these edits were made, aiming to find a reasonable threshold for determining when an edit happens to an "older" file. Here we find that SDC edits are made either at the time of upload (or shortly thereafter) or much later; there are definitely a lot of SDC additions happening to old files. For wikitext edits, there's some activity in the first day, but a lot of activity appears to happen to older files (this is to some degree a result of the buckets being much wider). Since our focus is on SDC, I also looked at how the number of SDC edits changes for each 30-day period from 30 to 150 days and found that there's not a lot of difference in the first 90 days. Based on the clear distinction for SDC edits, and because it also seems like a reasonable threshold for wikitext edits, I chose 30 days as the threshold.
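For reference, a hedged sketch of the overall shape of this comparison; the column names, the bot handling, and the edit-comment patterns used to separate SDC from wikitext edits are simplifications and assumptions (the revert filtering is omitted entirely), so treat the linked notebook as the authoritative version.

```python
# Hedged sketch: non-bot edits to surviving Commons files made at least 30 days
# after the file page was created, split into SDC edits (identified by
# Wikibase-style edit comments) and other wikitext edits.
import wmfdata

QUERY = """
SELECT
  CASE WHEN event_comment RLIKE 'wb(setlabel|setclaim|createclaim|editentity)'
       THEN 'sdc' ELSE 'wikitext' END AS edit_kind,
  COUNT(*) AS num_edits
FROM wmf.mediawiki_history
WHERE snapshot = '2019-11'
  AND wiki_db = 'commonswiki'
  AND event_entity = 'revision'
  AND page_namespace = 6
  AND NOT page_is_deleted
  AND (event_user_is_bot_by IS NULL OR SIZE(event_user_is_bot_by) = 0)
  AND datediff(event_timestamp, page_first_edit_timestamp) >= 30
GROUP BY CASE WHEN event_comment RLIKE 'wb(setlabel|setclaim|createclaim|editentity)'
              THEN 'sdc' ELSE 'wikitext' END
"""

print(wmfdata.hive.run(QUERY))
```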

[Edit 2019-12-22: I've removed the numbers as they are outdated. Instead, see T231952#5759736 for more correct numbers]

So: wbc_entity_usage does not hold information about all entities. There are (many) entities that do not appear in wbc_entity_usage.
wbc_entity_usage (much like categorylinks or templatelinks) tracks usage of entities via Lua.

Thanks for taking the time to explain this to me @matthiasmullie, I greatly appreciate it! This helped me understand what data is and isn't available through the wbc_entity_usage table, clearing up my incorrect assumptions about it.

For 1. Quarterly comparison of metadata on files with a common template, such as the information template and artwork template.
@nettrom_WMF looked into using the event_comment field to identify the files with structured data added to them, and we observed that properties (like P180, P160, P275, etc.) were not mentioned in all comments. Discussed with Amanda and Ramsey that we will need a data dump to get this metric.
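For the record, a hedged sketch of the kind of check behind that observation; the comment patterns and field names are my assumptions, and the real check lives in the notebook.

```python
# Hedged sketch: of the SDC claim edits on Commons files, how many mention an
# explicit property ID (P180, P275, ...) in their edit comment?
import wmfdata

QUERY = """
SELECT
  event_comment RLIKE 'P[0-9]+' AS mentions_property,
  COUNT(*) AS num_edits
FROM wmf.mediawiki_history
WHERE snapshot = '2019-11'
  AND wiki_db = 'commonswiki'
  AND event_entity = 'revision'
  AND page_namespace = 6
  AND event_comment RLIKE 'wb(setclaim|createclaim|editentity)'
GROUP BY event_comment RLIKE 'P[0-9]+'
"""

print(wmfdata.hive.run(QUERY))
```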
The following 2 actions have been finalized:

  1. Product Analytics will provide numbers from when the edit comments started including properties (i.e. Aug 1st 2019, when the UI went live) by 12/20 or before. This will be used in the grant proposal that is due on Jan 10, 2020.
  2. Product Analytics to use the Commons data dump to get numbers from Jan 2019 that are required for the grant proposal to the Sloan Foundation (due Feb 28th). @Abit will reach out to Ariel about making the dump available (for us to parse, as well as in the Query Service). We can also include template data in the dump for the metadata comparison.

PA deadline for metrics from the data dump: mid-February 2020. @Abit will open a separate Phab task for this.
The dump will be provided in the first week of January by the Ops team.

kzimmerman raised the priority of this task from Medium to High.Dec 12 2019, 11:41 PM

@Abit and @Ramsey-WMF : we discussed in our last meeting that the Information and Artwork templates had been updated to pull SDC data in through Lua, and that you were interested in understanding the impact of that. Given the limited scope of this, I decided to go ahead and dig around in the data to see if I could figure it out. The work has been documented in this notebook on GitHub.

From what I could find, the Information template currently only pulls in captions from SDC; it doesn't use any other data types. The Artwork template appears to pull in a lot of different information, mainly from Wikidata. It's not clear to me from looking at the template code which parts would be considered SDC and which would not. Given that the Information template only uses captions, and that the boundary between Wikidata and SDC is blurry for Artwork, I decided to only look for references to captions.

Based on what I can find, there isn't a lot of usage of SDC through Lua yet. 2,013 files with the Information template pull in a caption through Lua, and 17 files with the Artwork template do.
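For anyone curious, a hedged sketch of how a count like that could be produced on the MariaDB replicas by joining template transclusions with recorded label ('L.*') usage aspects; the replica host and the exact join conditions are assumptions, and the notebook linked above documents the actual method.

```python
# Hedged sketch: File pages transcluding Template:Information that also have at
# least one label ("caption") usage aspect in wbc_entity_usage, i.e. a caption
# pulled in via Lua. Swap "Information" for "Artwork" for the other count.
import os
import pymysql

conn = pymysql.connect(
    host="commonswiki.analytics.db.svc.wikimedia.cloud",  # assumed replica host
    database="commonswiki_p",
    read_default_file=os.path.expanduser("~/.my.cnf"),
)

QUERY = """
SELECT COUNT(DISTINCT tl.tl_from) AS files_using_caption_via_lua
FROM templatelinks tl
JOIN page p ON p.page_id = tl.tl_from AND p.page_namespace = 6
JOIN wbc_entity_usage eu
  ON eu.eu_page_id = tl.tl_from AND eu.eu_aspect LIKE 'L%%'
WHERE tl.tl_namespace = 10
  AND tl.tl_title = %s
"""

with conn.cursor() as cur:
    cur.execute(QUERY, ("Information",))
    print(cur.fetchone())
```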

@Abit and @Ramsey-WMF , as discussed in our meeting on 12-10-19, I have the number of files having the properties that you were interested in:
  • Caption (P2096)
  • Date of Creation (P571)
  • Date of Publication (P577)
  • Creator (P170)
  • License (P275)
  • Digital Representation of (P6243)
  • Depicts (P180)
It is available in this notebook. Please let me know your thoughts and whether any changes are required in the way I've analyzed them.
I am also looking into: what is the average number of properties (from the 7 properties given above) added to each Commons file?
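A hedged sketch of how that average could be computed once per-file property mentions have been extracted; the input frame below is a made-up example, and the real one would come from queries like the edit-comment sketches earlier in this task.

```python
# Hedged sketch: average number of the seven tracked properties per file, given
# a DataFrame with one row per (page_id, property) pair.
import pandas as pd

TRACKED = {"P2096", "P571", "P577", "P170", "P275", "P6243", "P180"}

# Hypothetical example input; the real frame would come from a Hive query.
file_properties = pd.DataFrame(
    {
        "page_id": [1, 1, 2, 3, 3, 3],
        "property": ["P180", "P275", "P2096", "P180", "P170", "P571"],
    }
)

per_file = (
    file_properties[file_properties["property"].isin(TRACKED)]
    .groupby("page_id")["property"]
    .nunique()
)
print("Average tracked properties per file:", per_file.mean())
```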

Completed the analysis of the average number of properties added to each Commons file and posted it to this notebook.
@Abit and @Ramsey-WMF , please let us know if anything further is needed for the numbers regarding the first 3 file metrics requested in this task.

Side note: it would be a good idea to close this ticket and open two new tickets - 1. for analysis of "Quarterly comparison of metadata on files with a common template, such as the information template and artwork template." on 2019 data that we are expected to do using the data dump we will get early next year; and 2. for the Search metrics that Amanda confirmed would be okay for us to provide in Jan 2020.

Thanks @Mayakp.wiki ! I will look at these before the break. Do you want me to create the two new tickets?

@Abit , yes it would be great if you could create the two tickets.
It would also be useful to link or mention the Operations task (for the data dump) in the task you create for Metric 1. Thanks! :)

It would also be useful to link or mention the Operations task (for the data dump) in the task you create for Metric 1.

Here it is: T221917: Create RDF dump of structured data on Commons

it would be a good idea to close this ticket and open two new tickets - 1. for analysis of "Quarterly comparison of metadata on files with a common template, such as the information template and artwork template." on 2019 data that we are expected to do using the data dump we will get early next year; and 2. for the Search metrics that Amanda confirmed would be okay for us to provide in Jan 2020.

I realized it would make sense to just create one ticket, for the metrics through Dec 31 2019, including the quarterly comparison of metadata and the search metrics. It is T241286: [REQUEST] SDC metrics through Dec 31 2019.

@Abit and @Ramsey-WMF : I've today updated the notebook used for the analysis of Question 2, as the analysis I did with @Mayakp.wiki on Question 1 identified a couple additional types of edit comments that should be included. This mainly affects the number of SDC edits for the last two quarters of this year, and shifts the proportions quite drastically upwards (above 10%) for those quarters.

Apart from the two edit comments, the analysis is the same: they're all non-bot edits, have to be made to a file that's at least 30 days old, and for non-SDC edits cannot be a revert nor be reverted.

Using a 30-day threshold, we get the following overview of number of "positive" SDC and wikitext edits to "older" files for each of the quarters and overall in 2019:

| Timeframe | Number of SDC edits | Number of wikitext edits | % of SDC edits of both |
|---|---|---|---|
| Q1 | 29,168 | 2,439,163 | 1.2% |
| Q2 | 65,887 | 2,098,875 | 3.0% |
| Q3 | 389,589 | 2,205,643 | 15.0% |
| Q4* | 250,527 | 1,594,509 | 13.6% |
| 2019* | 735,171 | 8,338,190 | 8.1% |

Note that Q4 and 2019 exclude the month of December, as we don't yet have data for that month.

Let me know what questions you have about this, cheers!

@Abit / @Ramsey-WMF , please let us know if you have any further questions on the analysis and the numbers provided. If everything looks good, we would like to go ahead and close this ticket and continue work on T241286.

Moving to 'Needs Review' since this task is completed and needs a final check from @Abit and @Ramsey-WMF before getting closed.

Hello! Thank you, Maya and Morten for doing this work. Looks good!

For my purposes, this data is good to go as is. Amanda will probably chime in this week about whether her needs for the final grant report are met.

One request: is the data for December available now? It would be great if that can be factored in for the sake of completion. 🙂

Hey @Ramsey-WMF , December 2019 data is available in the database now.
Would you like us to redo all the existing analysis and publish it with the new numbers (which will include the December numbers), or would you like to see only the December numbers for all the analyses?

I'd prefer to see the existing analysis with the December numbers included so we can say more definitively "this is what 2019 looked like"

If it's a lot of trouble though, I'm also okay with looking at December separately. I don't want to stress anybody out over this 😄

Hey @Ramsey-WMF , December 2019 data is available in the database now.
Would you like us to redo all the existing analysis and publish it with the new numbers (which will include the December numbers), or would you like to see only the December numbers for all the analyses?

Fundamentally it's the same effort for both approaches, but if we redo it with December alone, the queries will take much less time 🙂. That said, I understand that including it would help give you a better picture of 2019.

Please also let me know by when you want the updated numbers.

Please also let me know by when you want the updated numbers.

I'll leave that to Amanda (who is unfortunately out sick today) since she has more of a schedule need than I do.

Please also let me know by when you want the updated numbers.

I'll leave that to Amanda (who is unfortunately out sick today) since she has more of a schedule need than I do.

Confirmed with Amanda in today's review meeting that this is needed by end of January. Created subtask T242816 for this.

@Abit & @Ramsey-WMF : I'm getting back into the swing of things, and am wondering about the priority and deadline for the two search related measurements above. Are they still needed, and if so, is the end of January also the deadline for those?

wondering about the priority and deadline for the two search related measurements above. Are they still needed, and if so, is the end of January also the deadline for those?

@nettrom_WMF Yep, end of January is the target. We could extend the deadline a week, but then would need to set aside time for concerted synchronous work the second week of January, maybe a day, where we could get quick feedback on our interpretations and articulation of the results.

@nettrom_WMF Yep, end of January is the target. We could extend the deadline a week, but then would need to set aside time for concerted synchronous work the second week of January, maybe a day, where we could get quick feedback on our interpretations and articulation of the results.

Thanks for clarifying that! I'll get to work on that tomorrow in order to try to get it completed by that deadline.

Please also let me know by when you want the updated numbers.

I'll leave that to Amanda (who is unfortunately out sick today) since she has more of a schedule need than I do.

Confirmed with Amanda in today's review meeting that this is needed by end of January. Created subtask T242816 for this.

This has been completed. Once merged, I will move it to Needs Review and assign it to @Abit and @Ramsey-WMF.

@Abit and @Ramsey-WMF : I've completed a first pass of an update to the search analysis, and am concerned about some of the underlying data. Since the last update of this analysis a year ago, EventLogging data is only available in Hive, so I updated the queries to work there. However, some of the analysis, e.g. the graph of daily search activity, shows a lack of data starting on 2019-12-10. I'm not sure if there were any changes to the EventLogging code for search at that point? If you or someone on your team knows, that would help me understand what's going on here and whether to dig further into the data.

I've put the updated graphs in this folder on GitHub.
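For reference, a hedged sketch of how the daily event counts could be checked directly in Hive; the table name (event.searchsatisfaction), its partition fields, and the wiki filter are assumptions based on how EventLogging data generally lands there.

```python
# Hedged sketch: daily counts of search EventLogging events on Commons, to see
# whether the data really stops around 2019-12-10.
import wmfdata

QUERY = """
SELECT
  CONCAT(CAST(year AS STRING), '-',
         LPAD(CAST(month AS STRING), 2, '0'), '-',
         LPAD(CAST(day AS STRING), 2, '0')) AS dt,
  COUNT(*) AS events
FROM event.searchsatisfaction
WHERE year = 2019 AND month = 12
  AND wiki = 'commonswiki'
GROUP BY year, month, day
ORDER BY dt
"""

print(wmfdata.hive.run(QUERY))
```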