| Subject | Repo | Branch | Lines +/- | |
|---|---|---|---|---|
| Do not count pages that are just redirects | analytics/reportupdater-queries | master | +1 -1 |
Details
| Status | Subtype | Assigned | Task | ||
|---|---|---|---|---|---|
| Resolved | • Nuria | T247100 Tech Tunning Session metrics | |||
| Resolved | jwang | T247099 SQL definition for wikidata metrics for tunning session | |||
| Resolved | jwang | T247101 SQL definition for structure data in commons metrics |
Event Timeline
Structured data on commons is defined as pieces of structured data attached to one particular item in commons stored in the MediaInfo database.
Metric definition. (Proposed)
- A X% increase of file pages on Commons contain structured data
- A X% increase in the percentage of file pages on Commons contain structured data
Currently, structured data on commons is not used in any other wikis except wiki commons. In the future, one more potential metric could be:
- A X% increase in structured data on common used (uptake) across wikis.
Computation
A X% increase of file pages on Commons contain structured data
- SQL How many file pages on Commons contain at least one structured data element (statement or caption)? (ref: T239565)
SELECT COUNT(DISTINCT page_id) FROM page # joining ON page_latest - we ONLY care about most recent INNER JOIN slots ON slot_revision_id = page_latest # mediainfo slot must contain actual content INNER JOIN content ON slot_content_id = content_id AND content_size > 122 INNER JOIN slot_roles ON role_id = slot_role_id AND role_name = 'mediainfo'
- Growth Rate
| data | mediainfo_slot_count | monthly growth rate |
|---|---|---|
| 2019-11-01 | 3033107 | NA |
| 2019-12-01 | 3232448 | 6.57% |
| 2020-01-01 | 4604413 | 42.44% |
| 2020-02-01 | 8127354 | 76.5% |
| 2020-03-10 | 8700935 | 7.06% |
' A X% increase in the percentage of file pages on Commons contain structured data
- SQL
Total content pages on wiki commons
USE wmf;
SELECT COUNT(DISTINCT page_id)
FROM mediawiki_page_history
WHERE snapshot = {YYYY-MM}
AND wiki_db = 'commonswiki'
AND NOT page_is_deleted
AND page_namespace_is_content;- Growth Rate
| date | mediainfo_slot_count | Total content pages on commons | % of pages on Commons contain structured data | Monthly Growth Rate |
|---|---|---|---|---|
| 2019-11-01 | 3033107 | 58530975 | 5.18 % | NA |
| 2019-12-01 | 3232448 | 59158847 | 5.46 % | 5.44% |
| 2020-01-01 | 4604413 | 59828950 | 7.70 % | 40.8% |
| 2020-02-01 | 8127354 | 60459791 | 13.44 % | 74.67% |
| 2020-03-10 vs 03-01 | 8700935 | 61271209 | 14.20 % | 5.64% |
@Nuria
Please review the initial calculation. Feel free to comment or discuss in our meeting.
Jennifer
SQL How many file pages on Commons contain at least one structured data element (statement or caption)? (ref: T239565)
This is the syntax if you run the queries on the analytics replica of mysql database. Probably it will be best to use this syntax which is the same query but in the scooped tables on the hadoop cluster (results are the same)
This data is automatically computed every month here (and we can augment this calculation to also calculate the "increase in the percentage of files on commons that contain structured data" let me know if you want to submit a code change for that).
https://analytics.wikimedia.org/published/datasets/periodic/reports/metrics/structured-data/
In this case i think it make sense to report the monthly growth rate on both files and monthly growth rate of structured data on files
If we want to have all files in commons to have structured data within three years I we should grow at a 3% rate every month for the percentage of files with SDC data
It's great that we have snapshot of the mediainfo slot table on hive! It's what I was looking for.
Agree with you that it's better to report the monthly growth rate on both files and monthly growth rate of structured data on files. Just want to tune the SQL and data a little bit to make it more accurate.
By cross checking the scooped table on hadoop with the table you provided me to refer, I understood that the date '2019-11-01' in this table did not mean the cutoff date, as I interpreted in the initial computation. Instead, it means the snapshot of '2019-11'. To match the correct month of mediainfo_slot and content page number, the adjusted table will be
| Month | mediainfo_slot_count | Total content page | % of content pages growth | % of pages on Commons contain structured data | Monthly Growth Rate of % pages on commons contain SD |
|---|---|---|---|---|---|
| 2019-11 | 3033107 | 59158847 | 5.13% | ||
| 2019-12 | 3232448 | 59828950 | 1.13% | 5.40% | 5.38% |
| 2020-01 | 4604413 | 60459791 | 1.05% | 7.62% | 40.96% |
| 2020-02 | 8127354 | 61271209 | 1.34% | 13.26% | 74.17% |
Also I want to exclude redirected page from counting, as we double count some content creation if we include redirected page. So the adjusted SQL for " How many file pages on Commons contain at least one structured data element (statement or caption)?" is
SELECT
COUNT(DISTINCT page_id) AS mediainfo_slot_count
-- page excludes deleted pages (which are in archive)
FROM wmf_raw.mediawiki_page
-- joining on page_latest - we only care about most recent
INNER JOIN wmf_raw.mediawiki_slots ON slot_revision_id = page_latest
-- mediainfo slot must contain actual content
INNER JOIN wmf_raw.mediawiki_content ON slot_content_id = content_id
AND content_size > 122
INNER JOIN wmf_raw.mediawiki_slot_roles ON role_id = slot_role_id
AND role_name = 'mediainfo'
WHERE mediawiki_page.wiki_db = 'commonswiki'
AND mediawiki_slots.wiki_db = 'commonswiki'
AND mediawiki_content.wiki_db = 'commonswiki'
AND mediawiki_slot_roles.wiki_db = 'commonswiki'
AND mediawiki_page.snapshot = '{MONTH}'
AND mediawiki_slots.snapshot = '{MONTH}'
AND mediawiki_content.snapshot = '{MONTH}'
AND mediawiki_slot_roles.snapshot = '{MONTH}'
AND NOT mediawiki_page.page_is_redirectSQL for total content pages on wiki commons
SELECT COUNT(DISTINCT page_id) AS total_content_pages_commons
FROM wmf.mediawiki_page_history
WHERE snapshot = '{MONTH}'
AND wiki_db = 'commonswiki'
AND NOT page_is_deleted AND NOT page_is_redirect
AND page_namespace_is_content;The adjusted data shows as below. The growth rate excluding redirect page is similar to the trend in previous table
| Month | mediainfo_slot_count | Total content page | % of content pages growth | % of pages on Commons contain structured data | % pages on commons contain SD |
|---|---|---|---|---|---|
| 2019-11 | 3033053 | 57619025 | 5.3% | ||
| 2019-12 | 3232383 | 58274401 | 1.14% | 5.55% | 5.37% |
| 2020-01 | 4601913 | 58888891 | 1.05% | 7.81% | 40.88% |
| 2020-02 | 8124835 | 59683444 | 1.35% | 13.61% | 74.20% |
Change 580136 had a related patch set uploaded (by Nuria; owner: Nuria):
[analytics/reportupdater-queries@master] Do not count pages that are just redirects
Also I want to exclude redirected page from counting,
Nice, thanks for the correction. Code patch submitted.
Change 580136 merged by Milimetric:
[analytics/reportupdater-queries@master] Do not count pages that are just redirects
