Data Profiling Configuration

This section walks you through the Data Profiling process in LakeFusion. Data Profiling analyzes datasets to generate key metrics that reveal data structure, assess quality, and surface anomalies, supporting informed decision-making and better data management.

Step 1: Profiling Task Creation
  1. Navigate to the Data Profiling card (either from Home or from the left navigation pane) and initiate a new profile task with the “Create Profile Task” button.

  2. Provide the following required information:

  • Profile Task Name

  • A comprehensive description of the task

  • The resource type

  • The dataset

  • A schedule frequency, if the task should run periodically

  3. Submit the profiling task configuration with the “Create” button.


Step 2: Profiling Dashboard

After creating a data profiling task, you have two options to access the profiling dashboard:

  1. Databricks SQL: Open the profiling dashboard directly in Databricks SQL for advanced analysis and integration with other Databricks tools.

  2. Within LakeFusion: Click directly on the task name in our user interface to access the profiling dashboard. This option provides a streamlined experience with all the necessary metrics and visualizations in one place.


Step 3: Data Analysis

The Profiling Overview provides a summary of the dataset being profiled. It lists key attributes and helps users quickly understand the structure of the dataset and identify the columns they need to profile.

The metrics and statistics displayed in the profiling screens depend on the data type of the column being analyzed. Here’s how the screens vary based on data type:

  • All Data Types

Uniqueness Analysis: Tracks the percentage of unique values and identifies duplicates, applicable to both numerical and string data.

Fill Rate Analysis: Measures data completeness by showing the percentage of non-missing and missing values, relevant for all data types.
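
As an illustration, the sketch below shows how these two metrics can be computed with PySpark on a Databricks table. The table and column names are hypothetical, and this is only a sketch of the underlying calculations, not LakeFusion’s actual implementation.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical table and column, for illustration only
    df = spark.table("main.sales.customers")
    col_name = "email"

    total_rows = df.count()

    agg = df.agg(
        F.countDistinct(col_name).alias("distinct_values"),
        F.count(col_name).alias("non_null_values"),  # count(col) skips NULLs
    ).first()

    uniqueness_pct = 100.0 * agg["distinct_values"] / total_rows
    fill_rate_pct = 100.0 * agg["non_null_values"] / total_rows
    missing_pct = 100.0 - fill_rate_pct

    print(f"unique: {uniqueness_pct:.1f}%  filled: {fill_rate_pct:.1f}%  missing: {missing_pct:.1f}%")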


  • Numerical Data 

Statistics Analysis: Provides detailed statistical insight into the distribution and central tendency of numerical data, with metrics such as Min, Max, Mean, Standard Deviation, and Median, helping you identify outliers or anomalies. Range displays the spread of values between the minimum and maximum, and Distinct Values shows the number of unique numerical values.
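
Continuing the sketch above, these numerical metrics map onto standard Spark aggregations (again, the column name is hypothetical):

    from pyspark.sql import functions as F

    num_col = "order_amount"  # hypothetical numerical column

    row = df.agg(
        F.min(num_col).alias("min"),
        F.max(num_col).alias("max"),
        F.mean(num_col).alias("mean"),
        F.stddev(num_col).alias("stddev"),
        F.expr(f"percentile_approx({num_col}, 0.5)").alias("median"),
        F.countDistinct(num_col).alias("distinct_values"),
    ).first()

    value_range = row["max"] - row["min"]  # "Range": spread between min and max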

  • String Data 

Length Statistics: Provides insights into the length of string values, including Minimum Length, Maximum Length, and Average Length.

Frequency Analysis: Evaluates how often specific string values appear in the dataset, helping you identify common or rare values.
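
The same caveat applies to this last sketch: the column name is hypothetical, and the code only illustrates how length statistics and value frequencies can be derived:

    from pyspark.sql import functions as F

    str_col = "city"  # hypothetical string column

    # Length Statistics: minimum, maximum, and average string length
    length_stats = df.select(F.length(str_col).alias("len")).agg(
        F.min("len").alias("min_length"),
        F.max("len").alias("max_length"),
        F.avg("len").alias("avg_length"),
    ).first()

    # Frequency Analysis: most common values first
    df.groupBy(str_col).count().orderBy(F.desc("count")).show(10)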

Data profiling assesses the dataset’s structure and quality; next, Data Quality Configuration ensures the data meets accuracy, consistency, and reliability standards through rule-based validation and cleansing.

Related Articles

  • Data Quality Notebook Configuration

  • Data Quality Diagramming Configuration

  • Data Flow in LakeFusion

  • Who is LakeFusion MDM for?

  • Integration Hub