Statistics on media usage across Wikipedias
Closed, Resolved, Public

Description

From the parent task: Commons isn't the only place where WMF projects host multimedia files. Many of the Wikipedias host their own files too, generally for Fair Use purposes (English Wikipedia alone has almost 890,000 files). We'd love a data view that allows us to compare usage of those "off-Commons" files vs. on Commons, per wiki.

A proof of concept for how to do this on a monthly basis using the database snapshots is found in this notebook on GitHub.

Event Timeline

I've previously discussed something similar with @jwang in relation to T247417. We can do this on a monthly basis by using the sqooped tables in wmf_raw in the Data Lake. We'll left join mediawiki_imagelinks twice: first with mediawiki_page for the wiki itself to identify local files, then with mediawiki_page for Commons to identify files used from Commons. If a file isn't found in either of those, it should be a redlink, and we can mark it as such.
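As a rough illustration of the double left join described above, here is a minimal sketch using Python's sqlite3 module as a stand-in for Hive. The table names local_page and commons_page, the reduced column sets, and the sample rows are assumptions made for the example; the sqooped tables in wmf_raw carry more columns as well as wiki and snapshot partitions.

```python
import sqlite3

# Toy stand-ins for the sqooped tables (hypothetical names and rows).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE imagelinks (il_from INTEGER, il_to TEXT);
CREATE TABLE local_page (page_id INTEGER, page_namespace INTEGER, page_title TEXT);
CREATE TABLE commons_page (page_id INTEGER, page_namespace INTEGER, page_title TEXT);

INSERT INTO imagelinks VALUES (1, 'Local_logo.png'), (2, 'Sunset.jpg'), (3, 'Missing.png');
INSERT INTO local_page VALUES (10, 6, 'Local_logo.png');
INSERT INTO commons_page VALUES (20, 6, 'Sunset.jpg');
""")

# Left join the image links twice: once against the wiki's own pages and
# once against Commons pages. Namespace 6 is the File namespace.
PAGE_JOIN_QUERY = """
SELECT il.il_to AS file_name,
       CASE
           WHEN lp.page_id IS NOT NULL THEN 'local'
           WHEN cp.page_id IS NOT NULL THEN 'commons'
           ELSE 'redlink'
       END AS source
FROM imagelinks AS il
LEFT JOIN local_page AS lp
       ON lp.page_title = il.il_to AND lp.page_namespace = 6
LEFT JOIN commons_page AS cp
       ON cp.page_title = il.il_to AND cp.page_namespace = 6
ORDER BY il.il_to
"""

page_based = conn.execute(PAGE_JOIN_QUERY).fetchall()
for file_name, source in page_based:
    print(file_name, source)
```

The CASE expression checks the local match first, so a title that exists in both page tables is labelled as local.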

nettrom_WMF moved this task from Next 2 weeks to Doing on the Product-Analytics (Kanban) board.

I spoke too soon! I've written up a query following the above-mentioned idea, but it turns out not to work in practice. The issue is that a wiki can use a file from Commons but also have a local file description page. Attendekall.jpg on Nynorsk Wikipedia is an example of that. The actual file is on Commons, but it has a local description page to categorize it into the local programming category. This means that the page table isn't an authoritative source for whether a file exists locally on the wiki.

Instead, the image table is the authoritative source. Special:FileList on Nynorsk Wikipedia lists 16 files, and these are all in nnwiki's image table. From what I can tell, this table is not sqooped into the Data Lake on a monthly basis, so I'll need to file a task to get Analytics to do that. Once that is there, we can join mediawiki_imagelinks with mediawiki_image to do this. A possible alternative, or something to do in addition so it can be queried in the Data Lake, is to sqoop globalimagelinks from Commons. That table maps files on Commons to the wikis they're used on, i.e. it's the source for "File usage on other wikis" on a Commons file page (here's an example).
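To make the image-table approach concrete, here is a minimal sketch, again using sqlite3 in place of Hive. The table names local_image and commons_page, the reduced columns, and the sample rows are assumptions for illustration. Attendekall.jpg is deliberately absent from local_image to mirror the case above: a local description page can exist, but the file itself is hosted on Commons.

```python
import sqlite3

# Toy stand-ins for the sqooped tables (hypothetical names and rows).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE imagelinks (il_from INTEGER, il_to TEXT);
CREATE TABLE local_image (img_name TEXT PRIMARY KEY);
CREATE TABLE commons_page (page_id INTEGER, page_namespace INTEGER, page_title TEXT);

INSERT INTO imagelinks VALUES
    (1, 'Attendekall.jpg'), (2, 'Attendekall.jpg'),
    (3, 'Local_logo.png'), (4, 'Missing.png');
INSERT INTO local_image VALUES ('Local_logo.png');
INSERT INTO commons_page VALUES (20, 6, 'Attendekall.jpg');
""")

# The image table decides whether a file is hosted locally; anything else
# is looked up among Commons file pages (namespace 6), and the remainder
# is treated as a redlink.
USAGE_QUERY = """
SELECT CASE
           WHEN li.img_name IS NOT NULL THEN 'local'
           WHEN cp.page_id IS NOT NULL THEN 'commons'
           ELSE 'redlink'
       END AS source,
       COUNT(*) AS num_links
FROM imagelinks AS il
LEFT JOIN local_image AS li
       ON li.img_name = il.il_to
LEFT JOIN commons_page AS cp
       ON cp.page_title = il.il_to AND cp.page_namespace = 6
GROUP BY source
ORDER BY source
"""

usage_by_source = conn.execute(USAGE_QUERY).fetchall()
for source, num_links in usage_by_source:
    print(source, num_links)
```

Counting rows in imagelinks gives the number of file usages per source; counting distinct il_to values instead would give the number of distinct files used.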

It's not worth the effort to write Python and SQL to iterate over all wikis and grab this information from the replicas, so I'll move this task to "Blocked" until we have the sqooped table(s) available.

Data is now available in Hive; @nettrom_WMF to verify data before re-starting this task

Verified that data for several Wikipedias is available. This was also confirmed in T266077#6674242, with one exception: the image table from Commons is not present. I'm not sure this is an issue in this analysis because every image on Commons should have a corresponding file page, and the query I wrote uses the page table from Commons as its basis. Currently checking that assumption too.

@nettrom_WMF to update this task with a link to the notebook where this data is calculated, and then we'll close this task as resolved.

nettrom_WMF updated the task description.

I've updated the task description with a link to the notebook so it's easier to redo this analysis if related questions get asked and prioritized in the future. For now, closing as resolved.