Page MenuHomePhabricator

Definition of not text content metrics for tuning session (rich media, images, etc)
Closed, ResolvedPublic

Event Timeline

Hi @Nuria, please add a description explaining what is being asked for here.

This is part of the tree of tasks we are working together with @jwang

Nuria renamed this task from Definition of not rich tech content metrics for tunning session to Definition of not text content metrics for tunning session.Mar 18 2020, 2:43 PM
Nuria renamed this task from Definition of not text content metrics for tunning session to Definition of not text content metrics for tunning session (rich media,: images and the linke).

For reference, the imagelinks table was sqooped in January to: wmf_raw.mediawiki_imagelinks (where snapshot='2019-12'). Docs at https://www.mediawiki.org/wiki/Manual:Imagelinks_table

Non-text contents are images, audio, video, documents (pdfs), and data (e.g. JSON, et. al). They are stored on wiki commons. To track how non-text content is used across wikis, it can be measured from reader's and editor's perspectives.

Metric definition. (Proposed)

  1. Regarding readership, the metric can be defined as how many non-text contents are viewed across wikis , and its growth rate
  2. Regarding edits, the metric can be defined as how many non-text contents are linked on content page across wikis, and its growth rate.

Computation

A X% increase in non-text contents viewed across wikis
SQL Total views of non-text contents across wikis

SELECT  COUNT(DISTINCT base_name) AS num_wikicomm_view
FROM wmf.mediarequest
WHERE year ='{YEAR}' AND month = '{MONTH}' AND base_name like '%commons%'
and (referer  like '%wik%'  OR referer  LIKE '%internal%')

Result for previous 4 months.

Monthnon-text content used viewed across wikisGrowth Rate
2019-1148848907
2019-12488644200.0318%
2020-01503886343.1193%
2020-02506050630.4295%

A X% increase in non-text contents used across wikis
Currently, we only have a snapshot as 2019-12 in our data base. We cannot calculate growth rate based on one month data yet. As for count, the total of using events (not distinct content) indicates how often using non-text content is happening. The total of used contents (distinct) indicates how many non-text contents are used. They are calculated as below.

SQL the total of non-text contents using events across wikis

SELECT COUNT(1)
FROM wmf_raw.mediawiki_imagelinks AS m
INNER JOIN wmf_raw.mediawiki_project_namespace_map AS ns ON ns.snapshot = '2020-02' AND ns.namespace_is_content=1 AND ns.dbname=m.wiki_db 
    AND ns.namespace = m.il_from_namespace
WHERE m.snapshot = '2019-12'

The number "2019-12" snapshot is : 649288559

SQL: The total of unique contents used

SELECT COUNT(DISTINCT il_to)
FROM wmf_raw.mediawiki_imagelinks AS m
INNER JOIN wmf_raw.mediawiki_project_namespace_map AS ns ON ns.snapshot = '2020-02' AND ns.namespace_is_content=1 AND ns.dbname=m.wiki_db 
    AND ns.namespace = m.il_from_namespace
WHERE m.snapshot = '2019-12'

The number of "2019-12" snapshot is : 30513920

We have added imagelinks to the list of tables that are imported every month to hadoop and the first import will be done at the end of this month. In the meantime I have scooped imagelinks to my home dir and run the two following queries

SELECT COUNT(1)
FROM nuria.mediawiki_imagelinks AS m
INNER JOIN wmf_raw.mediawiki_project_namespace_map AS ns ON ns.snapshot = '2020-02' AND ns.namespace_is_content=1 AND ns.dbname=m.wiki_db 
    AND ns.namespace = m.il_from_namespace
WHERE m.snapshot = '2020-03''

666297123

SELECT COUNT(DISTINCT il_to)
FROM nuria.mediawiki_imagelinks AS m
INNER JOIN wmf_raw.mediawiki_project_namespace_map AS ns ON ns.snapshot = '2020-02' AND ns.namespace_is_content=1 AND ns.dbname=m.wiki_db 
    AND ns.namespace = m.il_from_namespace
WHERE m.snapshot = '2020-03'

31220165

@Nuria to submit reportupdater change to get these metrics every quarter

Change 606734 had a related patch set uploaded (by Nuria; owner: Nuria):
[analytics/reportupdater-queries@master] [WIP] Usage of commons files for tech tunning session metrics

https://gerrit.wikimedia.org/r/606734

Milimetric renamed this task from Definition of not text content metrics for tunning session (rich media,: images and the linke) to Definition of not text content metrics for tuning session (rich media, images, etc).Jun 22 2020, 9:11 PM

Change 606734 merged by Nuria:
[analytics/reportupdater-queries@master] Usage of commons files for tech tunning session metrics

https://gerrit.wikimedia.org/r/c/analytics/reportupdater-queries/ /606734