Scraping data dumps
The scraper helps us gain insights into how references/citations are used in wikis. We're also building some of our success metrics on the aggregated data the scraper creates from the Enterprise data dumps (which have not been publicly available since March 25). The data is now only available via the Enterprise API.
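Since retrieval now goes through the Enterprise API, the first scraper step roughly looks like the sketch below. This is a minimal sketch: the endpoint paths, the `enwiki_namespace_0` identifier format, and the token handling are assumptions based on the public API docs, so verify them against the current reference before wiring this into the scraper.

```python
import os
import requests

# Endpoint paths assumed from the public Enterprise API docs; verify before use.
AUTH_URL = "https://auth.enterprise.wikimedia.com/v1/login"
SNAPSHOT_URL = "https://api.enterprise.wikimedia.com/v2/snapshots/{identifier}/download"

def download_snapshot(identifier: str, dest: str) -> None:
    """Authenticate and stream one Enterprise snapshot to disk."""
    # Credentials come from the environment; never hard-code them.
    resp = requests.post(AUTH_URL, json={
        "username": os.environ["WME_USERNAME"],
        "password": os.environ["WME_PASSWORD"],
    })
    resp.raise_for_status()
    token = resp.json()["access_token"]

    # Stream the (multi-GB) tarball instead of loading it into memory.
    with requests.get(
        SNAPSHOT_URL.format(identifier=identifier),
        headers={"Authorization": f"Bearer {token}"},
        stream=True,
    ) as dump:
        dump.raise_for_status()
        with open(dest, "wb") as fh:
            for chunk in dump.iter_content(chunk_size=1 << 20):
                fh.write(chunk)

if __name__ == "__main__":
    # Identifier format (project_namespace_N) assumed from the API docs.
    download_snapshot("enwiki_namespace_0", "enwiki_namespace_0.tar.gz")
```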
As a first step, we want to enhance the scraper to collect sub-ref-specific data. We will then run it on a regular cadence that allows us to monitor and learn fast. Since the Enterprise dumps are published on the 1st and 20th of each month, this schedule is currently sufficient. The Enterprise team has offered to discuss an alternative cadence, should we need one.
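To keep the cadence logic out of the scheduler config, the scraper's entry point could derive which dump to process from the publication schedule. A minimal sketch; the assumption that a run always targets the most recent 1st-or-20th dump is ours, not something the Enterprise team has specified:

```python
from datetime import date, timedelta

PUBLICATION_DAYS = (1, 20)  # Enterprise dumps are published on the 1st and 20th

def latest_expected_dump(today: date) -> date:
    """Return the most recent publication date on or before `today`."""
    day = today
    while day.day not in PUBLICATION_DAYS:
        day -= timedelta(days=1)
    return day

# Example: a run on 2025-07-05 would process the July 1st dump.
print(latest_expected_dump(date(2025, 7, 5)))  # 2025-07-01
```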
Links & resources
- Technical documentation for provisioning the scraper on Wikimedia Analytics servers
- Aggregated data from the old scraper (spreadsheet)
- Enterprise API documentation: https://enterprise.wikimedia.com/api/
Steps
- Get access to data: T396720: Scraper: Use Enterprise API to retrieve dumps
- Enhance the scraper: T396729: Scraper: Add new metrics for sub-ref data (see the first sketch after this list)
- Run the scraper: T396730: Scraper: Run the scraper with the new features regularly
- Analyze sub-ref contents: T397124: Try to categorize different subref usage types (see the second sketch after this list)
- Build dashboards: T396731: Scraper: Build a new dashboard for the updated scraper data
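For the metrics task (T396729), the core aggregation is counting how many refs on a page are sub-refs. The sketch below assumes sub-refs are marked with a `details` attribute on the `<ref>` tag, per the sub-referencing project's documentation (an earlier prototype used `extends`); the regexes are illustrative and would need to match whatever markup is actually deployed.

```python
import re

# Assumed markup: <ref name="x" details="p. 5" /> creates a sub-ref that
# extends the main ref named "x". Adjust if the deployed attribute differs
# (an earlier prototype of the feature used extends=).
REF_TAG = re.compile(r"<ref\b[^>/]*/?>", re.IGNORECASE)
SUB_REF = re.compile(r"\bdetails\s*=", re.IGNORECASE)

def ref_metrics(wikitext: str) -> dict:
    """Count all <ref> opening tags and the subset that look like sub-refs."""
    tags = REF_TAG.findall(wikitext)
    return {
        "refs_total": len(tags),
        "sub_refs": sum(1 for t in tags if SUB_REF.search(t)),
    }

sample = 'a<ref name="b">Source</ref> b<ref name="b" details="p. 5" />'
print(ref_metrics(sample))  # {'refs_total': 2, 'sub_refs': 1}
```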
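For the analysis task (T397124), a first pass could bucket the `details` values by shape. The categories below are invented for illustration only; they are not the taxonomy the task will settle on:

```python
import re

# Same assumption as above: sub-ref content lives in a details="..." attribute.
DETAILS_VALUE = re.compile(r'\bdetails\s*=\s*"([^"]*)"', re.IGNORECASE)

def categorize(details: str) -> str:
    """Bucket a sub-ref's details text into a rough usage type."""
    text = details.strip().lower()
    if re.search(r"\bpp?\.?\s*\d", text):
        return "page-number"
    if re.search(r"\b(chapter|section)\b|\bchap\.|§", text):
        return "section"
    if any(ch.isdigit() for ch in text):
        return "other-locator"
    return "free-text"

for tag in ('<ref name="b" details="pp. 12-14" />',
            '<ref name="b" details="chapter 3" />',
            '<ref name="b" details="quoted in the appendix" />'):
    value = DETAILS_VALUE.search(tag).group(1)
    print(f"{value!r} -> {categorize(value)}")
```

Once the real usage types emerge from the data, these buckets would feed directly into the dashboard task as dimensions for the aggregated counts.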