Page MenuHomePhabricator

[REQUEST] Baselines for structured data on Commons
Closed, ResolvedPublic

Description

What's requested:

  • How often people add metadata to Commons files within two months after upload
  • Redo Chelsey’s analysis of the search clickthrough rate and clicks/scrolls to more results. This was done already, but I’m not sure if it got redone after discovering the error in clickthrough tracking.

Why it's requested:
We need to show that we are achieving the outcomes outlined in the SDC grant. To do so, we need these baselines, so we can measure ourselves against them.

When it's requested:
Statements are estimated to roll out end of January, and my understanding is that, by nature, the baselines need to be measured before then.

Other helpful information:
https://phabricator.wikimedia.org/T174519

Event Timeline

Restricted Application changed the subtype of this task from "Deadline" to "Task". · View Herald TranscriptJan 12 2019, 12:59 AM
Abit renamed this task from [REQUEST] to [REQUEST] Baselines for structured data on Commons.Jan 12 2019, 1:00 AM

@Abit @Ramsey-WMF When do you need to see the data from our team? Would end of next week work for you?

@Abit @Ramsey-WMF When do you need to see the data from our team? Would end of next week work for you?

End of next week would be great :). Thanks!

@Ramsey-WMF @Abit: hi, I would like to clarify what "metadata" includes. Here's my initial list:

Or are you referring to the entire page as the metadata? i.e. the whole shebang:

Screen Shot 2019-01-16 at 10.12.32 AM.png (405×619 px, 50 KB)

And then any revisions that add bytes (including the newly released captions):

Screen Shot 2019-01-16 at 10.15.35 AM.png (430×1 px, 116 KB)

would make the file count towards the statistic? In that case, if a revision removes metadata and then another revision undoes it, does THAT count?

Furthermore, for clarification, are you specifically interested in:

  • when a file's metadata is augmented, which is to say when additional metadata is added to a file after it's uploaded and some metadata is there from the outset
  • OR in addition to metadata getting added in the first 2 months after upload, also when the initial upload includes metadata beyond the essential fields (description, date) that are required for the upload

Like, if someone is very thorough in their initial upload, does that file get included in the count? Or is it specifically revisions after the initial upload?

Also, I assume it does not matter who (or what) adds the metadata in the 2 months after the upload. Whether it's a bot adding a category or another person adding some other metadata, all that matters is that metadata is added.

And specifically added, not removed, right?

Responding to @mpopov as succinctly as possible:

Or are you referring to the entire page as the metadata? i.e. the whole shebang:

The whole page.

if someone is very thorough in their initial upload, does that file get included in the count? Or is it specifically revisions after the initial upload?

Specifically looking for revisions after the initial upload.

And specifically added, not removed, right?

Yes, added. We want to measure it against how often people add things to old files once statements are available.

Thanks!

Thanks for clarifying! Okay, one more question for @Abit & @Ramsey-WMF just so everyone is on the same page. The statistic you want is: the % of all uploaded files which have had additions to their pages in the first 2 months after upload.

No breakdown by file type or over time, just a count X and a total Y and the proportion X/Y, correct?

The statistic you want is: the % of all uploaded files which have had additions to their pages in the first 2 months after upload.

Indeedy weedy.

No breakdown by file type or over time, just a count X and a total Y and the proportion X/Y, correct?

If it's not too much trouble, a data set over time would be lovely (past 12 months maybe?). We don't have a need for breakdown by file type though.

Here's the query I used, which I would like someone in Product-Analytics (e.g. @chelsyx and @Neil_P._Quinn_WMF) to review:

Sure thing!

I noticed once big thing: it seems like your counts of file page edits (n_edits_total, n_additions_total, etc.) include the initial edit that creates the pages, so in the end you're getting the proportion of files which have metadata added in the first 2 months, including during the initial upload.

I tried excluding those initial creations (event_timestamp != page_creation_timestamp), and it looks like the proportion goes from 99% to 50%.

Query excluding intial creations

WITH summarized_revisions AS (
  SELECT
    page_id, TO_DATE(page_creation_timestamp) AS creation_date,
    COUNT(1) AS n_edits, -- not including reverts or reverted
    SUM(IF(event_timestamp != page_creation_timestamp, 1, 0)) as n_later_edits,
    SUM(IF(revision_text_bytes_diff > 0 AND DATEDIFF(event_timestamp, page_creation_timestamp) <= 60 AND event_timestamp != page_creation_timestamp, 1, 0)) AS n_additions_2mo
  FROM wmf.mediawiki_history
  WHERE snapshot = '2018-12'
    AND wiki_db = 'commonswiki'
    AND page_creation_timestamp between "2018-10-01" and "2018-10-08"
    AND event_entity = 'revision'
    AND page_namespace = 6
    AND NOT revision_is_identity_revert -- don't count edits that are reverts
    AND NOT revision_is_identity_reverted -- don't count edits that were reverted
    AND NOT revision_is_deleted -- don't counts edits moved to archive table
    AND page_id IS NOT NULL -- don't count deleted files
  GROUP BY page_id, TO_DATE(page_creation_timestamp)
)
SELECT
  creation_date,
  COUNT(1) AS n_uploaded, -- files uploaded
  SUM(IF(n_later_edits > 0, 1, 0)) AS n_later_edited, -- files whose pages were edited after upload
  SUM(IF(n_additions_2mo > 0, 1, 0)) AS n_added_to_2mo -- files that have had metadata added after creation and in first 2 months
  FROM summarized_revisions
GROUP BY creation_date;
creation_daten_uploadedn_later_editedn_added_to_2mo
2018-10-01233901330710248
2018-10-0218226113088947
2018-10-03227631680312142
2018-10-0417455128969088
2018-10-05173211139710261
2018-10-06201911245610558
2018-10-0721479115759853

Other comments

WITH summarized_revisions AS (
  SELECT
    page_id, TO_DATE(page_creation_timestamp) AS creation_date,
    COUNT(1) AS n_edits_total, -- not including reverts or reverted

I think this includes uploads of new file versions, not just metadata edits, but I don't think it would change the results much.

  SUM(IF(revision_text_bytes_diff > 0, 1, 0)) AS n_additions_total,
  SUM(IF(DATEDIFF(event_timestamp, page_creation_timestamp) <= 60, 1, 0)) AS n_edits_2mo,
  SUM(IF(revision_text_bytes_diff > 0 AND DATEDIFF(event_timestamp, page_creation_timestamp) <= 60, 1, 0)) AS n_additions_2mo
FROM wmf.mediawiki_history
WHERE snapshot = '2018-12'
  AND wiki_db = 'commonswiki'
  AND event_entity = 'revision'
  AND page_namespace = 6
  AND NOT revision_is_identity_revert -- don't count edits that are reverts
  AND NOT revision_is_identity_reverted -- don't count edits that were reverted
  AND NOT revision_is_deleted -- don't counts edits moved to archive table
  AND page_id IS NOT NULL -- don't count deleted files

I don't understand the point of this, since the NOT revision_is_deleted should have already removed deleted files. (Also the page_id isn't necessarily null for deleted pages; after all the MediaWiki archive table has ar_page_id.)

GROUP BY page_id, TO_DATE(page_creation_timestamp)

)
SELECT

creation_date,
COUNT(1) AS n_total, -- files uploaded
SUM(IF(n_edits_total > 0, 1, 0)) AS n_edited, -- files that have had metadata edited
SUM(IF(n_additions_total > 0, 1, 0)) AS n_added_to, -- files that have had metadata added
SUM(IF(n_edits_2mo > 0, 1, 0)) AS n_edited_2mo, -- files that have had metadata edited in first 2 months
SUM(IF(n_additions_2mo > 0, 1, 0)) AS n_added_to_2mo -- files that have had metadata added in first 2 months
FROM summarized_revisions

GROUP BY creation_date;

I redid the analysis of the desktop search metrics on Commons (clickthrough rate, zero results rate, and proportion of searches with clicks to see other pages of the search results) to determine changes following the bug fix deployed in September 2018. See summary of results posted to T188421#4897244. Please let me know if you have any questions or need any further details.

Codebase

In T213597#4893765, @Neil_P._Quinn_WMF wrote:

I noticed once big thing: it seems like your counts of file page edits (n_edits_total, n_additions_total, etc.) include the initial edit that creates the pages, so in the end you're getting the proportion of files which have metadata added in the first 2 months, including during the initial upload.

I tried excluding those initial creations (event_timestamp != page_creation_timestamp), and it looks like the proportion goes from 99% to 50%.

Thank you so much, @Neil_P._Quinn_WMF! Really appreciate you catching that and correcting. I had incorrectly assumed that initial metadata would not be included. I'm currently looking into your suggested method of filtering revisions and comparing it to using revision_parent_id > 0, which should theoretically yield the same result but is not the case in practice.

Correct numbers coming soon.

I don't understand the point of this, since the NOT revision_is_deleted should have already removed deleted files. (Also the page_id isn't necessarily null for deleted pages; after all the MediaWiki archive table has ar_page_id.)

https://commons.wikimedia.org/wiki/File:Box-Front.jpg is a deleted file with a null page_id and it gets included in summarized_revisions otherwise.

Okay, here are the numbers which were calculated with the following conditions:

  • Using the December 2018 snapshot of MediaWiki History in the Data Lake
  • Only files which have not been deleted are counted
  • Only revisions to the metadata which were not reverted AND which were not reverts AND which were not deleted
  • "Metadata augmented w/in 1st 2mo" means there was at least 1 byte-adding revision to the file's page within the first 60 days after creation

It looks like the baseline for % of files which have metadata added within the first 2 months is 45.6% overall:

Files since 2003Metadata augmented w/in 1st 2mo (60d)Proportion
52,640,74624,003,59345.599%

Here are the final numbers:

YearFiles uploaded that yearMetadata augmented w/in 1st 2mo (60d)Proportion
200417,6699,42353.331%
2005265,976108,44940.774%
2006648,025228,23035.219%
20071,205,884371,72930.826%
20081,403,480576,98741.111%
20091,927,836822,06142.642%
20102,333,372863,58837.010%
20113,884,6351,287,97233.156%
20123,490,9051,589,17345.523%
20134,591,2722,007,54743.725%
20144,715,3232,215,43746.984%
20155,683,9662,990,53552.614%
20166,312,0672,921,21446.280%
20178,182,2363,623,89744.290%
20187,978,0994,387,35154.992%
MonthFiles uploaded that monthMetadata augmented w/in 1st 2mo (60d)Proportion
January 2018652,863322,24649.359%
February 2018705,945399,70956.620%
March 2018784,484358,70345.725%
April 2018609,520276,23045.319%
May 2018714,875414,76558.019%
June 2018588,235363,86361.857%
July 2018650,022409,26162.961%
August 2018783,718515,03765.717%
September 2018817,719436,63253.396%
October 2018563,806296,13552.524%
November 2018573,655363,01763.281%
December 2018533,257231,75343.460%

Appendix

USE wmf;
WITH page_creation_timestamps AS (
  -- since page_creation_timestamp in mediawiki_history table is wrong:
  SELECT
    page_id,
    event_timestamp AS upload_timestamp
  FROM mediawiki_history
  WHERE snapshot = '${snapshot}'
    AND wiki_db = 'commonswiki'
    AND event_entity = 'revision'
    AND page_namespace = 6
    AND revision_parent_id = 0
    AND NOT revision_is_identity_revert -- don't count edits that are reverts
    AND NOT revision_is_identity_reverted -- don't count edits that were reverted
    AND NOT revision_is_deleted -- don't counts edits moved to archive table
    AND page_id IS NOT NULL -- don't count deleted files
), fixed_revision_history AS (
  SELECT
    page_creation_timestamps.page_id AS page_id,
    upload_timestamp,
    event_timestamp AS revision_timestamp,
    revision_parent_id,
    revision_text_bytes_diff
  FROM page_creation_timestamps
  LEFT JOIN mediawiki_history ON (
    page_creation_timestamps.page_id = mediawiki_history.page_id
    AND mediawiki_history.snapshot = '${snapshot}'
    AND mediawiki_history.wiki_db = 'commonswiki'
    AND NOT mediawiki_history.revision_is_identity_revert -- don't count edits that are reverts
    AND NOT mediawiki_history.revision_is_identity_reverted -- don't count edits that were reverted
    AND NOT mediawiki_history.revision_is_deleted -- don't counts edits moved to archive table
  )
), summarized_revisions AS (
  SELECT
    page_id, TO_DATE(upload_timestamp) AS creation_date,
    COUNT(1) AS n_edits,
    SUM(IF(revision_parent_id > 0, 1, 0)) as n_later_edits,
    SUM(IF(revision_text_bytes_diff > 0 AND DATEDIFF(revision_timestamp, upload_timestamp) <= 60 AND revision_parent_id > 0, 1, 0)) AS n_additions_2mo
  FROM fixed_revision_history
  GROUP BY page_id, TO_DATE(upload_timestamp)
)
SELECT
  creation_date,
  COUNT(1) AS n_uploaded, -- files uploaded
  SUM(IF(n_later_edits > 0, 1, 0)) AS n_later_edited, -- files whose pages were edited after upload
  SUM(IF(n_additions_2mo > 0, 1, 0)) AS n_added_to_2mo -- files that have had metadata added after creation and in first 2 months
FROM summarized_revisions
GROUP BY creation_date;

Thank you so much, @Neil_P._Quinn_WMF! Really appreciate you catching that and correcting. I had incorrectly assumed that initial metadata would not be included. I'm currently looking into your suggested method of filtering revisions and comparing it to using revision_parent_id > 0, which should theoretically yield the same result but is not the case in practice.

Glad to help! mediawiki_history is full of so many gotchas; I'll fallen into plenty myself!

In T213597#4893765, @Neil_P._Quinn_WMF wrote:

I don't understand the point of this, since the NOT revision_is_deleted should have already removed deleted files. (Also the page_id isn't necessarily null for deleted pages; after all the MediaWiki archive table has ar_page_id.)

https://commons.wikimedia.org/wiki/File:Box-Front.jpg is a deleted file with a null page_id and it gets included in summarized_revisions otherwise.

True, but its revisions do have revision_is_deleted set, so you've already filtered them out of your query.

In T213597#4900903, @Neil_P._Quinn_WMF wrote:

True, but its revisions do have revision_is_deleted set, so you've already filtered them out of your query.

Huh! Yeah, you're right! Haha, okay so I think what happened was I had checked the summarized_revisions table before I had the revision_is_deleted in the WHERE clause and then added both NOT revision_is_deleted AND page IS NOT NULL after seeing that example. Sorry for the confusion! You were right this whole time :)

@Abit @Ramsey-WMF in addition to T213597#4900741, here's the history of that metric with a 7-day rolling average to smooth the daily data a bit:

2019-01_checkin.png (600×1 px, 144 KB)

kzimmerman assigned this task to MNeisler.
kzimmerman moved this task from Triage to Doing on the Product-Analytics board.

Thanks guys! This should be all we need :)