Analytics Cluster Dataset Usage Discovery Task
Open, Medium, Public

Description

Objective

Conduct a comprehensive audit to identify and document all internal users, applications, and use cases of datasets within the analytics cluster through systematic repository and configuration analysis.

Analysis Scope

  • Scan all Gerrit and GitHub repositories containing data pipeline configurations
  • Analyze Hive query configurations, scripts, and stored procedures
  • Review Spark jobs, Airflow DAGs, and other ETL processes
  • Examine dashboard configurations and reporting tools
  • Search through data access logs where available
  • Review job scheduler configurations referencing analytics cluster datasets
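As a starting point for the repository scan, the sketch below walks a local checkout and yields files that are likely to contain dataset references. It is a minimal illustration, not a prescribed tool: the function name `find_candidate_files`, the suffix list, and the skipped directory names are all assumptions and should be adjusted to what the repositories actually contain.

```python
import os
from pathlib import Path
from typing import Iterator

# Assumed suffixes for queries, jobs, DAGs, and configs (hypothetical list).
CANDIDATE_SUFFIXES = {".hql", ".sql", ".py", ".scala", ".yaml", ".yml", ".json"}

# Directories that are unlikely to hold pipeline definitions (hypothetical list).
SKIP_DIRS = {".git", "node_modules", "__pycache__"}


def find_candidate_files(checkout_root: str) -> Iterator[Path]:
    """Yield files under checkout_root whose suffix suggests pipeline code or config."""
    for dirpath, dirnames, filenames in os.walk(checkout_root):
        # Prune skipped directories in place so os.walk does not descend into them.
        dirnames[:] = sorted(d for d in dirnames if d not in SKIP_DIRS)
        for name in sorted(filenames):
            path = Path(dirpath) / name
            if path.suffix.lower() in CANDIDATE_SUFFIXES:
                yield path
```

The generator only selects files; reading them and extracting dataset names is a separate step, which keeps the file-selection rules easy to review on their own.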

Methodology

  • Develop automated scanning tools that identify dataset references, and keep them reusable for ongoing audits and reviews
  • Analyze query patterns and access frequency metrics
  • [Nice to have] Create a dataset dependency graph showing relationships between data sources
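The scanning and dependency-graph ideas above could be prototyped roughly as follows. This is a sketch under stated assumptions: it treats every `database.table` name after `FROM` or `JOIN` as a read and every one after `INSERT INTO` or `INSERT OVERWRITE` as a write. The regular expressions, the function names, and the table names in the usage note are hypothetical. A regex scan like this will miss dynamically built queries and can produce false positives (for example, Python `from a.b import c` lines), so results would need filtering against a list of known databases and manual review.

```python
import re
from collections import defaultdict
from typing import Dict, Set, Tuple

# Heuristic patterns (assumptions): qualified names after FROM/JOIN are reads,
# qualified names after INSERT INTO / INSERT OVERWRITE [TABLE] are writes.
READ_RE = re.compile(r"\b(?:FROM|JOIN)\s+`?(\w+)`?\.`?(\w+)`?", re.IGNORECASE)
WRITE_RE = re.compile(
    r"\bINSERT\s+(?:INTO|OVERWRITE)\s+(?:TABLE\s+)?`?(\w+)`?\.`?(\w+)`?",
    re.IGNORECASE,
)


def extract_references(query_text: str) -> Tuple[Set[str], Set[str]]:
    """Return (reads, writes) as sets of lower-cased 'database.table' names."""
    reads = {f"{db}.{tbl}".lower() for db, tbl in READ_RE.findall(query_text)}
    writes = {f"{db}.{tbl}".lower() for db, tbl in WRITE_RE.findall(query_text)}
    return reads, writes


def build_dependency_graph(
    sources: Dict[str, str],
) -> Tuple[Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Build (lineage, consumers) from a mapping of job identifier -> query text.

    lineage maps an upstream dataset to the datasets derived from it;
    consumers maps a dataset to the jobs that read it.
    """
    lineage: Dict[str, Set[str]] = defaultdict(set)
    consumers: Dict[str, Set[str]] = defaultdict(set)
    for job, text in sources.items():
        reads, writes = extract_references(text)
        for upstream in reads:
            consumers[upstream].add(job)
            for downstream in writes:
                if downstream != upstream:
                    lineage[upstream].add(downstream)
    return dict(lineage), dict(consumers)
```

For example, passing `{"jobs/daily.hql": "INSERT OVERWRITE TABLE reports.daily_summary SELECT * FROM raw.page_events"}` (made-up names) would record `raw.page_events` as upstream of `reports.daily_summary` and list `jobs/daily.hql` as a consumer of `raw.page_events`. Keeping the consumer map separate from the lineage map lets the same scan feed both the consumer inventory and the dependency graph.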

Acceptance Criteria

  • Complete inventory of dataset usage across all repositories and systems
  • Mapping of data lineage showing how datasets flow through systems wherever possible
  • Usage frequency metrics and performance impact assessment
  • Identification of critical vs. non-critical dataset dependencies
  • Recommendations for dataset consolidation, optimization, or deprecation

Deliverables

  • Documentation of dataset consumers with contact information for responsible teams
  • Risk assessment for potential dataset changes or optimizations
  • Suggested prioritization for dataset maintenance and optimization efforts
  • [Nice to have] Interactive dataset usage dashboard showing connections and dependencies

Notes

  • Document methodology and assumptions during analysis
  • Flag datasets with unclear ownership or governance
  • Identify redundant or overlapping datasets
  • Note any access or permission requirements needed to complete the analysis
  • Provide recommendations for ongoing dataset usage monitoring

Follow-up work may include:

  • Conduct targeted interviews with key stakeholders from data-consuming teams
  • Produce detailed documentation of all identified use cases, categorized by team/service