Subject | Repo | Branch | Lines +/- | |
---|---|---|---|---|
Usage of commons files for tech tunning session metrics | analytics/reportupdater-queries | master | +66 -0 |
Details
Status | Subtype | Assigned | Task | ||
---|---|---|---|---|---|
Resolved | • Nuria | T247100 Tech Tunning Session metrics | |||
Resolved | • Nuria | T247417 Definition of not text content metrics for tuning session (rich media, images, etc) | |||
Resolved | • Nuria | T249113 Import regularly via sqoop mediawiki_imagelinks table |
Event Timeline
For reference, the imagelinks table was sqooped in January to: wmf_raw.mediawiki_imagelinks (where snapshot='2019-12'). Docs at https://www.mediawiki.org/wiki/Manual:Imagelinks_table
Non-text contents are images, audio, video, documents (pdfs), and data (e.g. JSON, et. al). They are stored on wiki commons. To track how non-text content is used across wikis, it can be measured from reader's and editor's perspectives.
Metric definition. (Proposed)
- Regarding readership, the metric can be defined as how many non-text contents are viewed across wikis , and its growth rate
- Regarding edits, the metric can be defined as how many non-text contents are linked on content page across wikis, and its growth rate.
Computation
A X% increase in non-text contents viewed across wikis
SQL Total views of non-text contents across wikis
SELECT COUNT(DISTINCT base_name) AS num_wikicomm_view FROM wmf.mediarequest WHERE year ='{YEAR}' AND month = '{MONTH}' AND base_name like '%commons%' and (referer like '%wik%' OR referer LIKE '%internal%')
Result for previous 4 months.
Month | non-text content used viewed across wikis | Growth Rate |
---|---|---|
2019-11 | 48848907 | |
2019-12 | 48864420 | 0.0318% |
2020-01 | 50388634 | 3.1193% |
2020-02 | 50605063 | 0.4295% |
A X% increase in non-text contents used across wikis
Currently, we only have a snapshot as 2019-12 in our data base. We cannot calculate growth rate based on one month data yet. As for count, the total of using events (not distinct content) indicates how often using non-text content is happening. The total of used contents (distinct) indicates how many non-text contents are used. They are calculated as below.
SQL the total of non-text contents using events across wikis
SELECT COUNT(1) FROM wmf_raw.mediawiki_imagelinks AS m INNER JOIN wmf_raw.mediawiki_project_namespace_map AS ns ON ns.snapshot = '2020-02' AND ns.namespace_is_content=1 AND ns.dbname=m.wiki_db AND ns.namespace = m.il_from_namespace WHERE m.snapshot = '2019-12'
The number "2019-12" snapshot is : 649288559
SQL: The total of unique contents used
SELECT COUNT(DISTINCT il_to) FROM wmf_raw.mediawiki_imagelinks AS m INNER JOIN wmf_raw.mediawiki_project_namespace_map AS ns ON ns.snapshot = '2020-02' AND ns.namespace_is_content=1 AND ns.dbname=m.wiki_db AND ns.namespace = m.il_from_namespace WHERE m.snapshot = '2019-12'
The number of "2019-12" snapshot is : 30513920
We have added imagelinks to the list of tables that are imported every month to hadoop and the first import will be done at the end of this month. In the meantime I have scooped imagelinks to my home dir and run the two following queries
SELECT COUNT(1) FROM nuria.mediawiki_imagelinks AS m INNER JOIN wmf_raw.mediawiki_project_namespace_map AS ns ON ns.snapshot = '2020-02' AND ns.namespace_is_content=1 AND ns.dbname=m.wiki_db AND ns.namespace = m.il_from_namespace WHERE m.snapshot = '2020-03''
666297123
SELECT COUNT(DISTINCT il_to) FROM nuria.mediawiki_imagelinks AS m INNER JOIN wmf_raw.mediawiki_project_namespace_map AS ns ON ns.snapshot = '2020-02' AND ns.namespace_is_content=1 AND ns.dbname=m.wiki_db AND ns.namespace = m.il_from_namespace WHERE m.snapshot = '2020-03'
31220165
Change 606734 had a related patch set uploaded (by Nuria; owner: Nuria):
[analytics/reportupdater-queries@master] [WIP] Usage of commons files for tech tunning session metrics
Change 606734 merged by Nuria:
[analytics/reportupdater-queries@master] Usage of commons files for tech tunning session metrics
https://gerrit.wikimedia.org/r/c/analytics/reportupdater-queries/ /606734
data will be available at: https://analytics.wikimedia.org/published/datasets/periodic/reports/metrics/structured-data/