This section walks you through data profiling in LakeFusion. Profiling analyzes a dataset and generates metrics that describe its structure, measure its quality, and flag anomalies, so you can make informed data management decisions.
Navigate to the Data Profiling card (either from Home or from the left navigation pane) and initiate a new profile task with the “Create Profile Task” button.
Provide the following information. All fields are required except the schedule:
Profile Task Name
Description of the task
Resource type
Dataset
Schedule frequency (optional)
Submit the profiling task configuration with the “Create” button.
After creating a data profiling task, you have two options to access the profiling dashboard:
Databricks SQL: Open the profiling dashboard directly in Databricks SQL for advanced analysis and integration with other Databricks tools.
Within LakeFusion: Click the task name in the LakeFusion user interface to open the profiling dashboard. This option presents all the profiling metrics and visualizations in one place.
The Profiling Overview summarizes the dataset being profiled. It lists the dataset's key attributes so you can quickly understand its structure and identify the columns you need to profile.
The metrics and statistics displayed in the profiling screens depend on the data type of the column being analyzed. Here’s how the screens vary based on data type:
All Data Types
Uniqueness Analysis: Tracks the percentage of unique values and identifies duplicates, applicable to both numerical and string data.
Fill Rate Analysis: Measures data completeness by showing the percentage of non-missing and missing values, relevant for all data types.
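To make the uniqueness and completeness metrics above concrete, here is a minimal Python sketch of how they could be computed for a single column. It is not LakeFusion's implementation: the function names `fill_rate` and `uniqueness` are hypothetical, missing values are assumed to be represented as `None`, and the uniqueness percentage is assumed to be the number of distinct values divided by the number of non-missing values.

```python
from collections import Counter


def fill_rate(values):
    """Return (non_missing_pct, missing_pct) for a column.

    Assumption: a missing value is represented as None.
    """
    total = len(values)
    if total == 0:
        return 0.0, 0.0
    non_missing = sum(1 for v in values if v is not None)
    pct = 100.0 * non_missing / total
    return pct, 100.0 - pct


def uniqueness(values):
    """Return (unique_pct, duplicates) computed over non-missing values.

    Assumption: unique_pct = distinct values / non-missing values * 100.
    duplicates maps each value that occurs more than once to its count.
    """
    present = [v for v in values if v is not None]
    if not present:
        return 0.0, {}
    counts = Counter(present)
    unique_pct = 100.0 * len(counts) / len(present)
    duplicates = {v: n for v, n in counts.items() if n > 1}
    return unique_pct, duplicates
```

For example, calling both functions on the column `[1, 2, 2, None, 4]` reports that four of the five entries are populated and that the value 2 is duplicated.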
Numerical Data
Statistics Analysis: Provides Min, Max, Mean, Standard Deviation, and Median to help you understand the distribution and central tendency of numerical data and to identify outliers or anomalies. Range displays the spread between the minimum and maximum values, and Distinct Values shows the number of unique numerical values.
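The numerical metrics listed above can be sketched with Python's standard `statistics` module. This is an illustration only, not LakeFusion's implementation: the function name `numeric_profile` and the dictionary keys are hypothetical, missing values are assumed to be `None`, and the standard deviation is assumed to be the sample standard deviation (the product may use the population form instead).

```python
import statistics


def numeric_profile(values):
    """Summarize a numerical column, ignoring missing (None) values."""
    present = [v for v in values if v is not None]
    if not present:
        return {}
    lo, hi = min(present), max(present)
    return {
        "min": lo,
        "max": hi,
        "mean": statistics.fmean(present),
        "median": statistics.median(present),
        # Assumption: sample standard deviation; needs at least two values.
        "std_dev": statistics.stdev(present) if len(present) > 1 else 0.0,
        "range": hi - lo,
        "distinct_values": len(set(present)),
    }
```

A value far from the mean relative to the standard deviation is a candidate outlier, which is why these metrics are presented together.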
String Data
Length Statistics: Provides insights into the length of string values, including Minimum Length, Maximum Length, and Average Length.
Frequency Analysis: Evaluates how often specific string values appear in the dataset, helping you identify common or rare values.
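The string metrics above can be sketched in a few lines of Python. As before, this is a hedged illustration rather than LakeFusion's implementation: the function name `string_profile`, the `top_n` parameter, and the dictionary keys are hypothetical, and missing values are assumed to be `None`.

```python
from collections import Counter


def string_profile(values, top_n=3):
    """Summarize a string column, ignoring missing (None) values."""
    present = [v for v in values if v is not None]
    if not present:
        return {}
    lengths = [len(v) for v in present]
    return {
        "min_length": min(lengths),
        "max_length": max(lengths),
        "avg_length": sum(lengths) / len(lengths),
        # The top_n most frequent values, as (value, count) pairs.
        "most_common": Counter(present).most_common(top_n),
    }
```

Values that appear at the top of the frequency list are the common ones. Values that appear only once or twice are often worth checking for typos or inconsistent formatting.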