Objective
Conduct a comprehensive audit to identify and document all internal users, applications, and use cases of datasets within the analytics cluster through systematic repository and configuration analysis.
Analysis Scope
- Scan all Gerrit and GitHub repositories containing data pipeline configurations
- Analyze Hive query configurations, scripts, and stored procedures
- Review Spark jobs, Airflow DAGs, and other ETL processes
- Examine dashboard configurations and reporting tools
- Search through data access logs where available
- Review job scheduler configurations referencing analytics cluster datasets
Methodology
- Develop automated scanning tools that identify dataset references, and keep them reusable for ongoing audits and reviews
- Analyze query patterns and access frequency metrics
- [Nice to have] Create a dataset dependency graph showing relationships between data sources
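
As a starting point for the scanning tool and the dependency graph, here is a minimal sketch in Python. It assumes datasets are referenced as `database.table` in SQL-like text (Hive queries, SQL embedded in Spark jobs or Airflow DAGs), and it uses simple regular expressions rather than a real SQL parser, so it will miss dynamically built table names and may pick up references inside comments. The function names, file suffixes, and dataset names are hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumption: datasets appear as `database.table` after FROM/JOIN (reads)
# or INSERT INTO / INSERT OVERWRITE [TABLE] (writes).
_NAME = r"([A-Za-z_]\w*\.[A-Za-z_]\w*)"
READ_RE = re.compile(r"\b(?:FROM|JOIN)\s+" + _NAME, re.IGNORECASE)
WRITE_RE = re.compile(
    r"\bINSERT\s+(?:INTO|OVERWRITE)\s+(?:TABLE\s+)?" + _NAME, re.IGNORECASE
)


def scan_text(text):
    """Return (datasets read, datasets written) found in one file's text."""
    reads = {name.lower() for name in READ_RE.findall(text)}
    writes = {name.lower() for name in WRITE_RE.findall(text)}
    return reads, writes


def load_files(root, suffixes=(".hql", ".sql", ".py")):
    """Read matching files under a local repository checkout into a dict."""
    return {
        str(path): path.read_text(errors="ignore")
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix in suffixes
    }


def build_dependency_graph(files):
    """Build a consumer inventory and source -> target edges.

    `files` maps a file path to its contents. An edge (src, dst) means
    some file reads `src` and writes `dst`.
    """
    consumers = defaultdict(set)
    edges = set()
    for path, text in files.items():
        reads, writes = scan_text(text)
        for dataset in reads | writes:
            consumers[dataset].add(path)
        for src in reads:
            for dst in writes:
                if src != dst:
                    edges.add((src, dst))
    return dict(consumers), edges


if __name__ == "__main__":
    # Hypothetical job file used only to demonstrate the output shape.
    sample = {
        "jobs/daily_agg.hql": (
            "INSERT OVERWRITE TABLE reports.daily_views "
            "SELECT * FROM raw.pageviews p JOIN raw.users u ON p.uid = u.id"
        )
    }
    consumers, edges = build_dependency_graph(sample)
    for src, dst in sorted(edges):
        print(f"{src} -> {dst}")
```

The `consumers` mapping feeds the usage inventory (which files reference which dataset), while `edges` can be exported to any graph tool for the dependency view. Results from this kind of pattern matching should be treated as candidates to verify, not as a complete lineage.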
Acceptance Criteria
- Complete inventory of dataset usage across all repositories and systems
- Data lineage mapping that shows, wherever possible, how datasets flow through systems
- Usage frequency metrics and performance impact assessment
- Identification of critical vs. non-critical dataset dependencies
- Recommendations for dataset consolidation, optimization, or deprecation
Deliverables
- Documentation of dataset consumers with contact information for responsible teams
- Risk assessment for potential dataset changes or optimizations
- Suggested prioritization for dataset maintenance and optimization efforts
- [Nice to have] Interactive dataset usage dashboard showing connections and dependencies
Notes
- Document methodology and assumptions during analysis
- Flag datasets with unclear ownership or governance
- Identify redundant or overlapping datasets
- Note any access or permission requirements needed to complete the analysis
- Provide recommendations for ongoing dataset usage monitoring
Follow-up work may include:
- Targeted interviews with key stakeholders from data-consuming teams
- Detailed documentation of all identified use cases, categorized by team/service