
[EPIC] Learn about our databases and tools
Closed, Resolved · Public · 20 Estimated Story Points

Description

Data:

  • MediaWiki Database: A separate set of these tables exists for each individual wiki, e.g. for the English Wikipedia or for Wikimedia Commons. You can access them via MySQL.
  • EventLogging: Logs of how users interact with the UI. You can access them via MySQL or Hive.
  • Webrequest: The webrequest datasets contain logs of all hits to the WMF's servers; they are records of how users consume Wikimedia content. You can access them via Hive (see the sketch after this list).
  • Data dumps: A complete copy of all Wikimedia wikis, in the form of wikitext source and metadata embedded in XML. A number of raw database tables in SQL form are also available. These snapshots are provided at least monthly and usually twice a month. Dump files are also available in various locations on Labs and some production servers.
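For example, here is a minimal sketch of pulling a small aggregate out of the webrequest logs from R on one of the stat machines. It assumes the refined wmf.webrequest table and its webrequest_source/year/month/day partitions as documented on wikitech, and that the hive client is on your PATH; the specific query is illustrative only:

```
# Count text (wiki page) requests per host for one day -- illustrative only.
# Assumes the refined wmf.webrequest table and its partition layout
# (webrequest_source, year, month, day) as documented on wikitech.
query <- "
  SELECT uri_host, COUNT(1) AS requests
  FROM wmf.webrequest
  WHERE webrequest_source = 'text'
    AND year = 2018 AND month = 2 AND day = 1
  GROUP BY uri_host
  ORDER BY requests DESC
  LIMIT 10;
"

# Run the query through the hive CLI in silent mode and read the
# tab-separated output back into a data frame (no header row by default).
results <- read.delim(
  pipe(paste("hive -S -e", shQuote(query))),
  header = FALSE,
  col.names = c("uri_host", "requests"),
  stringsAsFactors = FALSE
)
```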

More details about data access: https://wikitech.wikimedia.org/wiki/Analytics/Data_access

Tools:

  • Hadoop cluster: Our storage system for large amounts of data. See also: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster. You can submit queries from the stat machines or via Hue.
  • MySQL: There are multiple ways to access the MySQL databases.
    • Once SSH'd into stat1005 or stat1006, open the MySQL command line interface with mysql -h analytics-store.eqiad.wmnet for the MediaWiki databases, or mysql -h analytics-slave.eqiad.wmnet for EventLogging.
    • In R, install our internal "wmf" package and use wmf::build_query() to execute queries and get the data into R. You can do this on stat1005, or on your local machine (example). Install wmf via devtools::install_git('https://gerrit.wikimedia.org/r/wikimedia/discovery/wmf'). A plain DBI-based alternative is sketched after this list.
    • Use a GUI tool like Sequel Pro.
  • Quarry: A public web interface for querying a set of SQL replicas of the production MediaWiki databases. Various private information is redacted from these replicas, and the machines running the queries have limited resources.
  • Pageview API: A public API allowing access to pageview data for all wikis in a variety of ways, e.g. daily view counts for an article like “Albert Einstein” (see the Pageview API sketch after this list). This data can also be queried using Hadoop.
  • Feel free to do the analysis with R/Python/whatever you are comfortable with. If you use R, see https://meta.wikimedia.org/wiki/Discovery/Analytics#Analysis_with_R for the internal packages we built.
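As a sketch of the MySQL route from R without the internal wmf package, the following uses plain DBI + RMySQL. The database name, credentials file path, and query are illustrative assumptions, not part of the SDoC analysis itself:

```
library(DBI)

# Connect to the analytics replica of the Commons MediaWiki database.
# Assumes you are on stat1005/stat1006 and that your MySQL credentials
# live in a config file (the path below is an assumption; adjust as needed).
con <- dbConnect(
  RMySQL::MySQL(),
  host = "analytics-store.eqiad.wmnet",
  dbname = "commonswiki",
  default.file = "~/.my.cnf"
)

# Illustrative query: count upload log entries from January 2018,
# using the standard MediaWiki `logging` table.
uploads <- dbGetQuery(con, "
  SELECT COUNT(*) AS n_uploads
  FROM logging
  WHERE log_type = 'upload'
    AND log_timestamp BETWEEN '20180101000000' AND '20180131235959'
")

dbDisconnect(con)
```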
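And a minimal sketch of the Pageview API route from R with jsonlite; the endpoint layout follows the public Wikimedia REST API, and the date range is an arbitrary example:

```
library(jsonlite)

# Daily view counts for the English Wikipedia article "Albert Einstein"
# over an example month, via the public Pageview REST API.
url <- paste(
  "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article",
  "en.wikipedia", "all-access", "all-agents", "Albert_Einstein",
  "daily", "20180101", "20180131",
  sep = "/"
)

response <- fromJSON(url)

# response$items is a data frame with one row per day.
daily_views <- response$items[, c("timestamp", "views")]
head(daily_views)
```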

You need to know what these data sources and tools are, how they are structured, how to operate within them, and how to get data out of them for analysis; thorough knowledge of them is essential.

To get familiar with these data and tools, we want you to replicate the baseline metrics we computed for Structured Data on Wikimedia Commons (SDoC). See the original task, results, and analysis codebase for more information. The presentation of the results does not have to be the same, as long as it answers the questions. This task will be split up into the following sub-tasks:

  • T185365: File contributions by bots vs users
  • T186575: File type and deletion metrics
  • T187827: Search metrics on Wikimedia Commons

Event Timeline

chelsyx set the point value for this task to 10.
chelsyx changed the point value for this task from 10 to 20.
chelsyx updated the task description.
chelsyx added a subscriber: mpopov.