Background
We need better visibility into how different users consume Wikimedia dumps to inform our API strategy and improve the developer experience. This task proposes analyzing Apache access logs to understand dumps usage patterns.
Scope
Process Apache access log data (referenced in T119070: Track the number of wikidata dumps that are downloaded by type) to gather usage metrics. Estimated timeline: 1 week for 2 engineers.
Expected Outcomes
We aim to answer these key questions:
- What is user preference for full dumps vs incremental dumps?
- What is the utilization across dump types? (e.g. XML, old HTML, Enterprise HTML, backups, analytics)
- Are users typically downloading the latest version, or is there a preference for older versions?
- What is the utilization across Wikimedia projects? (e.g. language-specific utilization; Wikipedia vs Wiktionary)
- How are dumps being downloaded? (e.g. bot/automated process vs human)
- Which mirrors do users most often click through to?
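Several of the questions above reduce to extracting structured fields from raw access log lines. A minimal sketch of that step, assuming the standard Apache "combined" log format and a hypothetical `/<wiki>/<YYYYMMDD>/<filename>` path layout (the actual dumps server layout may differ):

```python
import re
from collections import Counter

# Apache "combined" log format; the exact field layout is an assumption
# about how the dumps servers are configured.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) "[^"]*" "(?P<agent>[^"]*)"'
)

# Hypothetical dump path layout: /<wiki>/<YYYYMMDD>/<filename>
PATH_RE = re.compile(r'^/(?P<wiki>[a-z]+wik\w+)/(?P<version>\d{8})/')

def classify(line: str):
    """Return (wiki, dump version) for a successful download, else None."""
    m = LOG_RE.match(line)
    if not m or m.group('status') != '200':
        return None
    p = PATH_RE.match(m.group('path'))
    return (p.group('wiki'), p.group('version')) if p else None

def tally(lines):
    """Count successful downloads per (wiki, version) pair."""
    return Counter(c for c in map(classify, lines) if c)
```

Counting per `(wiki, version)` pair would feed directly into the "latest vs older versions" and per-project utilization questions; the dump-type question would need an extra lookup on the filename.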
Additionally, we would like to learn about:
- User demographics
  - Identify the proportion of human vs bot downloaders
  - Distinguish between corporate and volunteer community users
- Usage patterns
  - Analyze preferences for monthly full dumps vs frequent updates
  - Determine which dump types are most valuable to different user groups
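The human-vs-bot split will likely rest on User-Agent heuristics, since logs carry no stronger signal. A rough sketch; the marker list is illustrative, not an exhaustive or official bot taxonomy:

```python
# Illustrative keyword list for automated clients; real analysis would
# need a curated list and should expect misclassification at the margins.
BOT_MARKERS = ('bot', 'crawler', 'spider', 'wget', 'curl', 'python-requests')

def is_automated(user_agent: str) -> bool:
    """Guess whether a download came from an automated client."""
    ua = user_agent.lower()
    # A missing User-Agent ("-") is treated as automated by default.
    return ua == '-' or any(marker in ua for marker in BOT_MARKERS)
```

This heuristic is one reason the Known Limitations risk below matters: log data alone cannot cleanly separate corporate from volunteer users.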
Known Limitations
Risk: Log data may not provide sufficient resolution to fully answer all questions about usage patterns and user segmentation.
Related Work
- T119070: Track the number of wikidata dumps that are downloaded by type: Original logs analysis discussion
- Builds on exploration from T349750: Instrument dump download links using Metrics Platform and analysis done in T382069: Undeploy and archive ActiveAbstract
Dependencies
This work supports Q3 request for Research & Decision Science teams to analyze content access patterns as part of WE 5.5: API Strategy.
By the end of January, we will be able to measure and monitor traffic to Wikimedia-hosted dumps using log data, which will provide clarity on how users are consuming the different dump options and access points. This will, in turn, improve our understanding of what users care about in terms of recency, data completeness, and structure, so that we can tailor the overall API strategy accordingly.
Next Steps
- Implement log processing pipeline
- Create visualization dashboard (in Superset?)
- Document findings and limitations
Deadline
January 31, 2025