Data Profiling Configuration

This section walks you through data profiling in LakeFusion. Profiling analyzes a dataset and generates key metrics that reveal its structure, assess its quality, and identify anomalies, supporting informed decision-making and better data management.

Step 1: Profiling Task Creation

  1. Navigate to the Data Profiling card (from Home or from the left navigation pane) and click “Create Profile Task” to start a new profile task.

  2. Provide the following information (the schedule is optional):

  • Profile task name

  • Description of the task

  • Resource type

  • Dataset

  • Schedule frequency, if needed

  3. Click “Create” to submit the profiling task configuration.


Step 2: Profiling Dashboard

After creating a data profiling task, you have two options to access the profiling dashboard:

1. Databricks SQL: Open the profiling dashboard directly in Databricks SQL for advanced analysis and integration with other Databricks tools.

2. Within LakeFusion: Click the task name in the LakeFusion user interface to open the profiling dashboard. This option provides a streamlined experience with all the necessary metrics and visualizations in one place.

Step 3: Data Analysis

The Profiling Overview provides a summary of the dataset being profiled. It lists key attributes and helps users quickly understand the structure of the dataset and identify the columns they need to profile.

The metrics and statistics displayed in the profiling screens depend on the data type of the column being analyzed. Here’s how the screens vary based on data type:

  • All Data Types

Uniqueness Analysis: Tracks the percentage of unique values and identifies duplicates, applicable to both numerical and string data.

Fill Rate Analysis: Measures data completeness by showing the percentage of non-missing and missing values, relevant for all data types.
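
To make the two metrics above concrete, here is a minimal Python sketch of how completeness and uniqueness percentages could be computed for a single column. It is an illustration only, not LakeFusion's implementation: the function names are hypothetical, empty strings are assumed to count as missing, and uniqueness is assumed to be measured against non-missing values.

```python
from collections import Counter


def _present(values):
    # Assumption: None and empty strings both count as missing.
    return [v for v in values if v is not None and v != ""]


def completeness_pct(values):
    """Percentage of entries in the column that are not missing."""
    if not values:
        return 0.0
    return 100.0 * len(_present(values)) / len(values)


def uniqueness_pct(values):
    """Percentage of distinct values among non-missing entries,
    plus the sorted list of values that occur more than once."""
    present = _present(values)
    if not present:
        return 0.0, []
    counts = Counter(present)
    duplicates = sorted(v for v, n in counts.items() if n > 1)
    return 100.0 * len(counts) / len(present), duplicates


emails = ["a@x.com", "b@x.com", "a@x.com", None, ""]
print(completeness_pct(emails))  # 3 of 5 entries are present
print(uniqueness_pct(emails))    # 2 distinct values among 3 present entries
```

The missing-value percentage is simply 100 minus the completeness percentage.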

  • Numerical Data 

Statistics Analysis: Provides statistical insights that help you understand the distribution and central tendency of numerical data and identify outliers or anomalies. Metrics include Min, Max, Mean, Standard Deviation, and Median. Range shows the spread between the minimum and maximum values, and Distinct Values shows the number of unique numerical values.
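
The numerical metrics listed above can be sketched with Python's standard `statistics` module. This is an illustrative example under stated assumptions, not the way LakeFusion computes them: `numeric_profile` is a hypothetical helper, missing values are represented as `None`, and the sample standard deviation is used (the dashboard may use the population form instead).

```python
import statistics


def numeric_profile(values):
    """Summary statistics for a numeric column, ignoring None entries."""
    nums = [v for v in values if v is not None]
    if not nums:
        return {}
    lo, hi = min(nums), max(nums)
    return {
        "min": lo,
        "max": hi,
        "mean": statistics.fmean(nums),
        "median": statistics.median(nums),
        # Assumption: sample standard deviation; needs at least two values.
        "std_dev": statistics.stdev(nums) if len(nums) > 1 else 0.0,
        "range": hi - lo,
        "distinct_values": len(set(nums)),
    }


order_totals = [10, 12, 12, 15, 90, None]
for metric, value in numeric_profile(order_totals).items():
    print(metric, value)
```

In this sample, the value 90 pulls the mean well above the median, which is the kind of gap that hints at an outlier.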

  • String Data 

Length Statistics: Provides insights into the length of string values, including Minimum Length, Maximum Length, and Average Length.

Frequency Analysis: Evaluates how often specific string values appear in the dataset, helping you identify common or rare values.
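
As a rough illustration of the string metrics above, the following Python sketch computes length statistics and value frequencies for one column. The helper name `string_profile`, the `top_n` parameter, and the choice to treat values that appear exactly once as "rare" are assumptions made for this example, not LakeFusion behavior.

```python
from collections import Counter


def string_profile(values, top_n=3):
    """Length statistics and value frequencies for a string column,
    ignoring None entries."""
    strings = [v for v in values if v is not None]
    if not strings:
        return {}
    lengths = [len(s) for s in strings]
    counts = Counter(strings)
    return {
        "min_length": min(lengths),
        "max_length": max(lengths),
        "avg_length": sum(lengths) / len(lengths),
        # Most frequent values first, as (value, count) pairs.
        "most_common": counts.most_common(top_n),
        # Assumption: "rare" means the value appears exactly once.
        "rare_values": sorted(v for v, n in counts.items() if n == 1),
    }


cities = ["Paris", "Rome", "Paris", "Oslo", None, "Paris", "Rome"]
for metric, value in string_profile(cities, top_n=2).items():
    print(metric, value)
```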

Data profiling assesses the dataset’s structure and quality. The next step, Data Quality Configuration, ensures the data meets accuracy, consistency, and reliability standards through rule-based validation and cleansing.

