Create a dashboard from the fsImage Dataset extracted from the HDFS FsImage
Closed, Resolved · Public · 3 Estimated Story Points

Description

Primary Task
Create a Superset dashboard and a Jupyter notebook to extract information from the datasets.
The dashboard/notebook should clearly:
  • Identify inefficient file storage (directories holding a large number of small files)
  • Show trends in the storage footprint of our datasets
  • Monitor cluster capacity and corruption
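
As a rough illustration of the first point, the sketch below counts small files per directory. The record layout (`path`, size in bytes), the 1 MiB threshold, the sample paths, and the function name `small_file_report` are all assumptions made for this example; they are not the actual schema of the fsImage dataset.

```python
from collections import defaultdict
import posixpath

# Assumed threshold: files under 1 MiB count as "small" (hypothetical value).
SMALL_FILE_BYTES = 1024 * 1024


def small_file_report(files, threshold=SMALL_FILE_BYTES):
    """Return {directory: (small_file_count, total_file_count)}.

    `files` is an iterable of (path, size_in_bytes) tuples. This layout is
    an assumption for illustration, not the real fsImage dataset schema.
    """
    small = defaultdict(int)
    total = defaultdict(int)
    for path, size in files:
        directory = posixpath.dirname(path)
        total[directory] += 1
        if size < threshold:
            small[directory] += 1
    return {d: (small[d], total[d]) for d in total}


# Made-up sample rows, for illustration only.
sample = [
    ("/data/example/a/part-0000", 512),
    ("/data/example/a/part-0001", 2048),
    ("/data/example/b/part-0000", 300 * 1024 * 1024),
]
report = small_file_report(sample)
```

Directories where the small-file count is close to the total count would be the first candidates to review for compaction.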

Details

Event Timeline

EChetty set the point value for this task to 3. Oct 19 2022, 10:29 AM
EChetty moved this task from Ready to Next Up on the Data Pipelines (Sprint 03) board.
Antoine_Quhen renamed this task from Extract the analysis and make it available on superset. to Create a dashboard from the Dataset extracted from the HDFS FsImage dataset. Nov 2 2022, 5:13 PM
Antoine_Quhen renamed this task from Create a dashboard from the Dataset extracted from the HDFS FsImage dataset to Create a dashboard from the Dataset extracted from the HDFS FsImage.
EChetty renamed this task from Create a dashboard from the Dataset extracted from the HDFS FsImage to Create a dashboard from the fsImage Dataset extracted from the HDFS FsImage. Nov 2 2022, 5:23 PM

Change 853303 had a related patch set uploaded (by Aqu; author: Aqu):

[analytics/refinery@master] Declare HDFS fsimage dataset in hive metastore

https://gerrit.wikimedia.org/r/853303

The dashboard needs to be set up on production after the deployment of the dataset.

Draft is here: https://superset.wikimedia.org/r/2073

We could also create a dashboard showing where our data is stored mostly as Parquet, mostly as Avro, etc.

Change 853303 merged by Aqu:

[analytics/refinery@master] Declare the HDFS usage dataset in hive metastore

https://gerrit.wikimedia.org/r/853303

Here is a dashboard with two treemaps, one for aggregated file sizes and one for aggregated file counts: https://superset.wikimedia.org/superset/dashboard/409/
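
For readers who want to reproduce the two aggregations outside Superset, here is a minimal sketch of how sizes and counts could be rolled up per path prefix before being fed to a treemap. It is not how the dashboard itself is built; the (path, size) record layout, the `depth` parameter, the sample paths, and the function name `aggregate_by_prefix` are assumptions for illustration.

```python
from collections import defaultdict


def aggregate_by_prefix(files, depth=2):
    """Return {prefix: (total_bytes, file_count)} per directory prefix.

    `files` is an iterable of (path, size_in_bytes) tuples (assumed layout).
    Each file is attributed to the first `depth` levels of its directory.
    """
    totals = defaultdict(lambda: [0, 0])  # prefix -> [total_bytes, file_count]
    for path, size in files:
        parts = path.strip("/").split("/")
        # Drop the file name, then keep at most `depth` directory levels.
        prefix = "/" + "/".join(parts[:-1][:depth])
        totals[prefix][0] += size
        totals[prefix][1] += 1
    return {prefix: tuple(values) for prefix, values in totals.items()}


# Made-up sample rows, for illustration only.
sample = [
    ("/data/example/a/f1", 100),
    ("/data/example/b/f2", 200),
    ("/data/other/f3", 50),
]
treemap_input = aggregate_by_prefix(sample, depth=2)
```

The first element of each tuple would drive the size treemap and the second the count treemap.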

@Antoine_Quhen the dashboard doesn't show any results currently.