Page MenuHomePhabricator

Statistics on media usage across Wikipedias
Open, MediumPublic


From the parent task: Commons isn't the only place where WMF projects host multimedia files. Many of the Wikipedias host their own files too, generally for Fair Use purposes (English Wikipedia alone has almost 890,000 files). We'd love a data view that allows us to compare usage of those "off-Commons" files vs. on Commons, per wiki.

Event Timeline

I've previously discussed something similar with @jwang in relation to T247417. We can do this on a monthly basis by using the sqooped tables in wmf_raw in the Data Lake. We'll left join mediawiki_imagelinks twice, first with the mediawiki_page table to identify local files, second with mediawiki_page table to identify files used from Commons. If a file isn't found in either of those it should be redlink, and we can mark it as such.

nettrom_WMF triaged this task as Medium priority.Oct 19 2020, 4:23 PM
nettrom_WMF moved this task from Next 2 weeks to Doing on the Product-Analytics (Kanban) board.

I spoke too soon! I've written up a query following the above mentioned idea, but this turns out to not work in practice. The issue is that a wiki can use a file from Commons but also have a local file description page. Attendekall.jpg on Nynorsk Wikipedia is an example of that. The actual file is on Commons, but it has a local description page to categorize it into the local programming category. This means that the page table isn't an authoritative source for whether a file exists locally on the wiki.

Instead, the image table is the authoritative source. Special:FileList on Nynorsk Wikipedia lists 16 files, and these are all in nnwiki's image table. From what I can tell, this table is not sqooped into the Data Lake on a monthly basis, so I'll need to file a task to get Analytics to do that. Once that is there, we can join mediawiki_imagelinks with mediawiki_image to do this. A possible alternative, or something to also do so it can be queried in the Data Lake, is to sqoop globalimagelinks from Commons. That table maps files on Commons to wikis they're used on, i.e. it's the source for "File usage on other wikis" on Commons page (here's an example).

It's not worth the effort to write Python and SQL to iterate over all wikis and grab this information from the replicas, so I'll move this task to "Blocked" until we have the sqooped table(s) available.

Data is now available in Hive; @nettrom_WMF to verify data before re-starting this task