Page MenuHomePhabricator

Definition of not text content metrics for tuning session (rich media, images, etc)
Closed, ResolvedPublic

Event Timeline

Nuria created this task.Mar 11 2020, 2:37 PM
Nuria added a subscriber: Bmueller.Mar 11 2020, 2:42 PM
mpopov added a subscriber: mpopov.EditedMar 17 2020, 2:45 PM

Hi @Nuria, please add a description explaining what is being asked for here.

Nuria added a subscriber: jwang.Mar 17 2020, 2:59 PM

This is part of the tree of tasks we are working together with @jwang

kzimmerman moved this task from Triage to Tracking on the Product-Analytics board.Mar 17 2020, 5:26 PM
Nuria renamed this task from Definition of not rich tech content metrics for tunning session to Definition of not text content metrics for tunning session.Mar 18 2020, 2:43 PM
Nuria renamed this task from Definition of not text content metrics for tunning session to Definition of not text content metrics for tunning session (rich media,: images and the linke).
Nuria moved this task from Next Up to In Progress on the Analytics-Kanban board.Mar 26 2020, 4:07 PM

For reference, the imagelinks table was sqooped in January to: wmf_raw.mediawiki_imagelinks (where snapshot='2019-12'). Docs at https://www.mediawiki.org/wiki/Manual:Imagelinks_table

mpopov removed a subscriber: mpopov.Mar 27 2020, 7:52 PM
fdans moved this task from Incoming to Radar on the Analytics board.Mar 30 2020, 4:26 PM
jwang added a comment.EditedApr 1 2020, 5:55 AM

Non-text contents are images, audio, video, documents (pdfs), and data (e.g. JSON, et. al). They are stored on wiki commons. To track how non-text content is used across wikis, it can be measured from reader's and editor's perspectives.

Metric definition. (Proposed)

  1. Regarding readership, the metric can be defined as how many non-text contents are viewed across wikis , and its growth rate
  2. Regarding edits, the metric can be defined as how many non-text contents are linked on content page across wikis, and its growth rate.

Computation

A X% increase in non-text contents viewed across wikis
SQL Total views of non-text contents across wikis

SELECT  COUNT(DISTINCT base_name) AS num_wikicomm_view
FROM wmf.mediarequest
WHERE year ='{YEAR}' AND month = '{MONTH}' AND base_name like '%commons%'
and (referer  like '%wik%'  OR referer  LIKE '%internal%')

Result for previous 4 months.

Monthnon-text content used viewed across wikisGrowth Rate
2019-1148848907
2019-12488644200.0318%
2020-01503886343.1193%
2020-02506050630.4295%

A X% increase in non-text contents used across wikis
Currently, we only have a snapshot as 2019-12 in our data base. We cannot calculate growth rate based on one month data yet. As for count, the total of using events (not distinct content) indicates how often using non-text content is happening. The total of used contents (distinct) indicates how many non-text contents are used. They are calculated as below.

SQL the total of non-text contents using events across wikis

SELECT COUNT(1)
FROM wmf_raw.mediawiki_imagelinks AS m
INNER JOIN wmf_raw.mediawiki_project_namespace_map AS ns ON ns.snapshot = '2020-02' AND ns.namespace_is_content=1 AND ns.dbname=m.wiki_db 
    AND ns.namespace = m.il_from_namespace
WHERE m.snapshot = '2019-12'

The number "2019-12" snapshot is : 649288559

SQL: The total of unique contents used

SELECT COUNT(DISTINCT il_to)
FROM wmf_raw.mediawiki_imagelinks AS m
INNER JOIN wmf_raw.mediawiki_project_namespace_map AS ns ON ns.snapshot = '2020-02' AND ns.namespace_is_content=1 AND ns.dbname=m.wiki_db 
    AND ns.namespace = m.il_from_namespace
WHERE m.snapshot = '2019-12'

The number of "2019-12" snapshot is : 30513920

Nuria added a comment.Apr 4 2020, 4:23 AM

We have added imagelinks to the list of tables that are imported every month to hadoop and the first import will be done at the end of this month. In the meantime I have scooped imagelinks to my home dir and run the two following queries

SELECT COUNT(1)
FROM nuria.mediawiki_imagelinks AS m
INNER JOIN wmf_raw.mediawiki_project_namespace_map AS ns ON ns.snapshot = '2020-02' AND ns.namespace_is_content=1 AND ns.dbname=m.wiki_db 
    AND ns.namespace = m.il_from_namespace
WHERE m.snapshot = '2020-03''

666297123

SELECT COUNT(DISTINCT il_to)
FROM nuria.mediawiki_imagelinks AS m
INNER JOIN wmf_raw.mediawiki_project_namespace_map AS ns ON ns.snapshot = '2020-02' AND ns.namespace_is_content=1 AND ns.dbname=m.wiki_db 
    AND ns.namespace = m.il_from_namespace
WHERE m.snapshot = '2020-03'

31220165

Nuria added a comment.Apr 29 2020, 3:23 AM

@Nuria to submit reportupdater change to get these metrics every quarter

Nuria moved this task from In Progress to Done on the Analytics-Kanban board.Jun 3 2020, 4:55 PM
Nuria moved this task from Done to In Progress on the Analytics-Kanban board.Jun 8 2020, 8:11 PM
Aklapper edited projects, added Analytics-Radar; removed Analytics.Jun 10 2020, 6:33 AM

Change 606734 had a related patch set uploaded (by Nuria; owner: Nuria):
[analytics/reportupdater-queries@master] [WIP] Usage of commons files for tech tunning session metrics

https://gerrit.wikimedia.org/r/606734

Milimetric renamed this task from Definition of not text content metrics for tunning session (rich media,: images and the linke) to Definition of not text content metrics for tuning session (rich media, images, etc).Jun 22 2020, 9:11 PM

Change 606734 merged by Nuria:
[analytics/reportupdater-queries@master] Usage of commons files for tech tunning session metrics

https://gerrit.wikimedia.org/r/c/analytics/reportupdater-queries/ /606734

Nuria moved this task from In Progress to Done on the Analytics-Kanban board.Jul 6 2020, 11:13 PM
Nuria closed this task as Resolved.Jul 23 2020, 4:38 AM