Page MenuHomePhabricator

SQL definition for structure data in commons metrics
Closed, ResolvedPublic

Event Timeline

SNowick_WMF moved this task from Triage to Kanban on the Product-Analytics board.

Structured data on commons is defined as pieces of structured data attached to one particular item in commons stored in the MediaInfo database.

Metric definition. (Proposed)

  1. A X% increase of file pages on Commons contain structured data
  2. A X% increase in the percentage of file pages on Commons contain structured data

Currently, structured data on commons is not used in any other wikis except wiki commons. In the future, one more potential metric could be:

  1. A X% increase in structured data on common used (uptake) across wikis.

Computation
A X% increase of file pages on Commons contain structured data

  • SQL How many file pages on Commons contain at least one structured data element (statement or caption)? (ref: T239565)
SELECT COUNT(DISTINCT page_id)
FROM page 
# joining ON page_latest - we ONLY care about most recent
INNER JOIN slots ON slot_revision_id = page_latest 
# mediainfo slot must contain actual content
INNER JOIN content ON slot_content_id = content_id
	AND content_size > 122
INNER JOIN slot_roles ON role_id = slot_role_id
	AND role_name = 'mediainfo'
  • Growth Rate
datamediainfo_slot_countmonthly growth rate
2019-11-013033107NA
2019-12-0132324486.57%
2020-01-01460441342.44%
2020-02-01812735476.5%
2020-03-1087009357.06%

' A X% increase in the percentage of file pages on Commons contain structured data

  • SQL

Total content pages on wiki commons

USE wmf;

SELECT COUNT(DISTINCT page_id)
FROM mediawiki_page_history
WHERE snapshot = {YYYY-MM}
	AND wiki_db = 'commonswiki'
	AND NOT page_is_deleted
	AND page_namespace_is_content;
  • Growth Rate
datemediainfo_slot_countTotal content pages on commons% of pages on Commons contain structured dataMonthly Growth Rate
2019-11-013033107585309755.18 %NA
2019-12-013232448591588475.46 %5.44%
2020-01-014604413598289507.70 %40.8%
2020-02-0181273546045979113.44 %74.67%
2020-03-10 vs 03-0187009356127120914.20 %5.64%

@Nuria
Please review the initial calculation. Feel free to comment or discuss in our meeting.

Jennifer

SQL How many file pages on Commons contain at least one structured data element (statement or caption)? (ref: T239565)

This is the syntax if you run the queries on the analytics replica of mysql database. Probably it will be best to use this syntax which is the same query but in the scooped tables on the hadoop cluster (results are the same)

https://github.com/wikimedia/analytics-reportupdater-queries/blob/master/structured-data/commonswiki_mediainfo_slots

This data is automatically computed every month here (and we can augment this calculation to also calculate the "increase in the percentage of files on commons that contain structured data" let me know if you want to submit a code change for that).
https://analytics.wikimedia.org/published/datasets/periodic/reports/metrics/structured-data/

In this case i think it make sense to report the monthly growth rate on both files and monthly growth rate of structured data on files

Screen Shot 2020-03-11 at 1.32.38 PM.png (272×2 px, 173 KB)

If we want to have all files in commons to have structured data within three years I we should grow at a 3% rate every month for the percentage of files with SDC data

It's great that we have snapshot of the mediainfo slot table on hive! It's what I was looking for.
Agree with you that it's better to report the monthly growth rate on both files and monthly growth rate of structured data on files. Just want to tune the SQL and data a little bit to make it more accurate.

By cross checking the scooped table on hadoop with the table you provided me to refer, I understood that the date '2019-11-01' in this table did not mean the cutoff date, as I interpreted in the initial computation. Instead, it means the snapshot of '2019-11'. To match the correct month of mediainfo_slot and content page number, the adjusted table will be

Monthmediainfo_slot_countTotal content page% of content pages growth% of pages on Commons contain structured dataMonthly Growth Rate of % pages on commons contain SD
2019-113033107591588475.13%
2019-123232448598289501.13%5.40%5.38%
2020-014604413604597911.05%7.62%40.96%
2020-028127354612712091.34%13.26%74.17%

Also I want to exclude redirected page from counting, as we double count some content creation if we include redirected page. So the adjusted SQL for " How many file pages on Commons contain at least one structured data element (statement or caption)?" is

SELECT 
       COUNT(DISTINCT page_id) AS mediainfo_slot_count
       -- page excludes deleted pages (which are in archive)
FROM wmf_raw.mediawiki_page
           -- joining on page_latest - we only care about most recent
INNER JOIN wmf_raw.mediawiki_slots         ON slot_revision_id = page_latest
           -- mediainfo slot must contain actual content
INNER JOIN wmf_raw.mediawiki_content       ON slot_content_id = content_id
                                       AND content_size > 122
INNER JOIN wmf_raw.mediawiki_slot_roles    ON role_id = slot_role_id
                                       AND role_name = 'mediainfo'
WHERE mediawiki_page.wiki_db = 'commonswiki'
   AND mediawiki_slots.wiki_db = 'commonswiki'
   AND mediawiki_content.wiki_db = 'commonswiki'
   AND mediawiki_slot_roles.wiki_db = 'commonswiki'
   AND mediawiki_page.snapshot = '{MONTH}'
   AND mediawiki_slots.snapshot = '{MONTH}'
   AND mediawiki_content.snapshot = '{MONTH}'
   AND mediawiki_slot_roles.snapshot = '{MONTH}'
   AND NOT mediawiki_page.page_is_redirect

SQL for total content pages on wiki commons

SELECT COUNT(DISTINCT page_id) AS total_content_pages_commons
FROM wmf.mediawiki_page_history
WHERE snapshot = '{MONTH}'
    AND wiki_db = 'commonswiki'
    AND NOT page_is_deleted AND NOT page_is_redirect
    AND page_namespace_is_content;

The adjusted data shows as below. The growth rate excluding redirect page is similar to the trend in previous table

Monthmediainfo_slot_countTotal content page% of content pages growth% of pages on Commons contain structured data% pages on commons contain SD
2019-113033053576190255.3%
2019-123232383582744011.14%5.55%5.37%
2020-014601913588888911.05%7.81%40.88%
2020-028124835596834441.35%13.61%74.20%

Change 580136 had a related patch set uploaded (by Nuria; owner: Nuria):
[analytics/reportupdater-queries@master] Do not count pages that are just redirects

https://gerrit.wikimedia.org/r/580136

Also I want to exclude redirected page from counting,

Nice, thanks for the correction. Code patch submitted.

Change 580136 merged by Milimetric:
[analytics/reportupdater-queries@master] Do not count pages that are just redirects

https://gerrit.wikimedia.org/r/580136

Task is done and reviewed. Keep this ticket open until we reach the final decision.