Data Flow in LakeFusion

This section provides a structured overview of the LakeFusion Data Flow, outlining the key stages and enabling technologies that support seamless data ingestion, preprocessing, and Master Data Management (MDM). Each stage ensures data is unified, cleansed, governed, and prepared for use across analytics and operational systems.


1. Data Sources and Unity Catalog

1.1 Source Systems

The data lifecycle begins with ingesting raw information from a wide range of enterprise systems. LakeFusion supports integration with:

  • Transactional databases (RDBMS): Core business systems containing operational data

  • Structured and unstructured files: Formats such as CSV, JSON, XML from various departments

  • External APIs and real-time data feeds: Streaming data from third-party services

  • Legacy system exports: Historical data retained in older platforms

These data sources are onboarded and registered in Databricks Unity Catalog for centralized management.

1.2 Unity Catalog Integration

Unity Catalog acts as the governance layer for all incoming data. It provides:

  • Centralized metadata management across all data assets

  • Fine-grained access control to enforce data security policies

  • End-to-end data lineage tracking for auditability and transparency

  • Standardized data discovery with classification and tagging features

This foundational layer ensures that all ingested data is secure, discoverable, and governed from the start.
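
As a point of reference, table registration and governance in Unity Catalog use standard Databricks SQL, shown here from a Python notebook. The catalog, schema, table, path, and group names are illustrative placeholders, not values required by LakeFusion.

    # Minimal sketch, assuming a Databricks workspace with Unity Catalog enabled.
    # All names below (main, sales, customers_raw, data_stewards) are placeholders.
    spark.sql("CREATE CATALOG IF NOT EXISTS main")
    spark.sql("CREATE SCHEMA IF NOT EXISTS main.sales")

    # Register a source file as a managed table so its metadata is centrally tracked.
    (spark.read.format("csv")
        .option("header", "true")
        .load("/Volumes/main/sales/raw/customers.csv")  # illustrative volume path
        .write.saveAsTable("main.sales.customers_raw"))

    # Fine-grained access control: grant read access to a stewardship group.
    spark.sql("GRANT SELECT ON TABLE main.sales.customers_raw TO `data_stewards`")

    # Aid discovery with a table comment (tags and classification can be added similarly).
    spark.sql("COMMENT ON TABLE main.sales.customers_raw IS 'Raw customer records from CRM export'")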

2. Data Ingestion Layer

Once registered in Unity Catalog, source data is ingested into LakeFusion through the platform's Datasets interface. This ingestion process includes:

  • Creating structured datasets (Dataset 1, 2, 3, 4, etc.)

  • Preserving original metadata and schema fidelity

  • Establishing baselines for data quality profiling

  • Staging the data for downstream transformation and enrichment

This layer bridges raw source data and preprocessing workflows, ensuring a reliable starting point.
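
The general pattern behind this layer can be sketched in PySpark: read the governed source table, confirm its schema, capture a simple baseline, and stage the data for preprocessing. The table names and the customer_id column are assumptions for illustration, not LakeFusion internals.

    from pyspark.sql import functions as F

    # Read the table registered in Unity Catalog (name is illustrative).
    source_df = spark.table("main.sales.customers_raw")

    # Preserve original schema fidelity and record a simple quality baseline.
    source_df.printSchema()
    source_df.select(
        F.count("*").alias("row_count"),
        F.countDistinct("customer_id").alias("distinct_ids"),  # assumes a customer_id column
    ).show()

    # Stage the dataset for downstream transformation and enrichment.
    source_df.write.mode("overwrite").saveAsTable("main.staging.dataset_1")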

3. Data Preprocessing

3.1 Data Profiling

After ingestion, data flows through the Profiling Engine to analyze structure, quality, and consistency. Key outputs include:

  • Column-level statistics (e.g., null percentage, uniqueness)

  • Pattern recognition and frequency distributions

  • Outlier detection across numerical or categorical fields

Profiling provides a data health snapshot, helping teams understand readiness and design appropriate quality rules.
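
The profiling outputs listed above can be approximated with a few aggregate queries. A minimal PySpark sketch, assuming a staged dataset with a categorical country column and a numeric annual_revenue column (both hypothetical):

    from pyspark.sql import functions as F

    df = spark.table("main.staging.dataset_1")  # illustrative table name
    total = df.count()

    # Column-level statistics: null percentage and uniqueness per column.
    # (One pass per column; fine for a sketch, not optimized for wide tables.)
    for column in df.columns:
        nulls = df.filter(F.col(column).isNull()).count()
        distinct = df.select(column).distinct().count()
        print(f"{column}: null%={nulls / total:.1%}, uniqueness={distinct / total:.1%}")

    # Frequency distribution for a categorical field.
    df.groupBy("country").count().orderBy(F.desc("count")).show(10)

    # Simple outlier detection on a numeric field using the IQR rule.
    q1, q3 = df.approxQuantile("annual_revenue", [0.25, 0.75], 0.01)
    iqr = q3 - q1
    outliers = df.filter((F.col("annual_revenue") < q1 - 1.5 * iqr) |
                         (F.col("annual_revenue") > q3 + 1.5 * iqr))
    print("outlier rows:", outliers.count())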

3.2 Data Quality Enforcement

Based on profiling insights, datasets pass through automated data quality pipelines. These are executed using predefined and customizable notebooks.

Standard Quality Checks:

  • Data type and format validation

  • Standardization of naming conventions or units

  • Null value handling and default population

Custom Business Rules:

  • Domain-specific validations (e.g., revenue thresholds, age ranges)

  • Cross-field logic checks (e.g., start date must be before end date)

  • Compliance-driven rules tailored to industry or region

This step ensures that only clean, reliable records proceed to the MDM pipeline.
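
A quality notebook typically chains standard checks with business rules before handing data to MDM. The sketch below is a generic PySpark example, not LakeFusion's packaged notebook; the column names, formats, and thresholds are assumptions.

    from pyspark.sql import functions as F

    df = spark.table("main.staging.dataset_1")  # illustrative

    # Standard checks: type/format validation, standardization, and default population.
    cleaned = (df
        .withColumn("signup_date", F.to_date("signup_date", "yyyy-MM-dd"))
        .withColumn("country", F.upper(F.trim("country")))
        .fillna({"segment": "UNKNOWN"}))

    # Custom business rules: domain-specific and cross-field validations.
    valid = cleaned.filter(
        (F.col("annual_revenue") >= 0) &                  # revenue threshold
        (F.col("start_date") <= F.col("end_date")))       # start date must precede end date
    rejected = cleaned.subtract(valid)

    # Clean records continue to MDM; rejects land in a quarantine table for review.
    valid.write.mode("overwrite").saveAsTable("main.staging.dataset_1_clean")
    rejected.write.mode("overwrite").saveAsTable("main.staging.dataset_1_rejected")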

4. Master Data Management (MDM) Process

4.1 Entity Creation

An Entity in LakeFusion represents a consolidated business object (e.g., a customer, product, or patient). Entities are created by:

  • Defining attributes such as name, type, and description

  • Mapping datasets to corresponding entity attributes

  • Establishing logical relationships between data fields

This step consolidates disparate structures into a unified data model.
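
Conceptually, entity creation is an attribute mapping from source columns onto the unified model. A hypothetical sketch of such a mapping (the entity attributes and source column names are illustrative, not product defaults):

    from pyspark.sql import functions as F

    # Hypothetical mapping: entity attribute -> source dataset column.
    customer_attribute_mapping = {
        "customer_name":  "cust_nm",
        "email":          "email_addr",
        "country":        "country",
        "annual_revenue": "annual_revenue",
    }

    source = spark.table("main.staging.dataset_1_clean")  # illustrative

    # Project the cleansed dataset into the unified Customer entity shape.
    customer_entity = source.select(
        *[F.col(src).alias(attr) for attr, src in customer_attribute_mapping.items()])
    customer_entity.printSchema()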

4.2 Survivorship Rules

When multiple systems provide conflicting data, survivorship rules determine which value prevails (see the sketch after this list). These rules support:

  • Source prioritization (e.g., CRM takes precedence over ERP)

  • Strategy-based aggregation (e.g., most recent update wins)

  • Conditional logic based on business requirements
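
A minimal sketch of two common strategies combined, source prioritization with a most-recent-update tie-breaker, using a PySpark window function. The source systems, priority order, and column names are assumptions.

    from pyspark.sql import functions as F, Window

    records = spark.table("main.mdm.customer_candidates")  # illustrative

    # Source prioritization: CRM (1) takes precedence over ERP (2), then legacy (3).
    priority = (F.when(F.col("source") == "CRM", 1)
                 .when(F.col("source") == "ERP", 2)
                 .otherwise(3))

    # Within each matched cluster, keep the highest-priority record;
    # break ties with the most recent update (most recent wins).
    w = Window.partitionBy("cluster_id").orderBy(priority.asc(), F.col("updated_at").desc())
    survivors = (records
        .withColumn("rn", F.row_number().over(w))
        .filter(F.col("rn") == 1)
        .drop("rn"))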

4.3 Validation Rules

Entities are further safeguarded by validation functions that enforce attribute-level consistency (see the sketch after this list). Examples include:

  • Range validation (e.g., age must be > 0)

  • Text pattern matching (e.g., valid email format)

  • Referential integrity across related entities
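
A short PySpark sketch of the three example checks; the tables and columns are illustrative.

    from pyspark.sql import functions as F

    customers = spark.table("main.mdm.customer_entity")  # illustrative
    orders = spark.table("main.mdm.order_entity")        # illustrative

    # Range validation: age must be greater than zero.
    bad_age = customers.filter(F.col("age").isNull() | (F.col("age") <= 0))

    # Text pattern matching: basic email format check.
    bad_email = customers.filter(~F.col("email").rlike(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"))

    # Referential integrity: every order must reference an existing customer.
    orphan_orders = orders.join(customers, on="customer_id", how="left_anti")

    print(bad_age.count(), bad_email.count(), orphan_orders.count())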

5. Match Maven: Advanced Entity Matching

The Match Maven module powers intelligent match-merge operations using Large Language Models (LLMs) and vector search on Databricks.

5.1 Model Creation

  • LLM-Based Matching: Understands semantic similarities between records to identify duplicate or related entities

  • Vector Search: Embeds text data into vectors to facilitate high-precision similarity detection across datasets (illustrated in the sketch below)
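
The vector-search idea can be illustrated independently of the Databricks Vector Search service: records are embedded as numeric vectors and compared by cosine similarity. The embedding function below is a toy stand-in (hashed character trigrams), not the model Match Maven uses.

    import numpy as np

    def embed(text: str, dim: int = 256) -> np.ndarray:
        # Toy embedding from hashed character trigrams; a real pipeline would use
        # an LLM or sentence-embedding model.
        vec = np.zeros(dim)
        t = text.lower()
        for i in range(len(t) - 2):
            vec[hash(t[i:i + 3]) % dim] += 1.0
        return vec

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    # Records describing the same customer score higher than unrelated ones.
    same = cosine_similarity(embed("Acme Corporation, 12 Main St"),
                             embed("ACME Corp, 12 Main Street"))
    diff = cosine_similarity(embed("Acme Corporation, 12 Main St"),
                             embed("Globex Industries, 9 Oak Avenue"))
    print(f"same entity: {same:.2f}, different entity: {diff:.2f}")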

5.2 Experiment Execution

The match process involves iterative tuning through four stages:

  • Preparation: Define match parameters and data scopes

  • Execution: Run matching jobs using trained LLMs and embeddings

  • Analysis: Review matching accuracy and false positives

  • Approval: Publish the best-performing model into production

5.3 Threshold Configuration

Users can fine-tune thresholds for:

  • Match confidence

  • Non-match exclusion

  • Merge triggers

This ensures the right balance between precision (avoiding false matches) and recall (detecting all true matches).
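
As an illustration, these thresholds act as cutoffs on a pairwise similarity score. The values and category names below are arbitrary examples, not product defaults.

    # Hypothetical thresholds; tuning them trades precision against recall.
    MERGE_THRESHOLD = 0.92      # merge trigger: auto-merge above this score
    MATCH_THRESHOLD = 0.80      # match confidence: confident match, queued for steward review
    NON_MATCH_THRESHOLD = 0.50  # non-match exclusion: below this, pairs are discarded

    def classify_pair(score: float) -> str:
        if score >= MERGE_THRESHOLD:
            return "auto_merge"
        if score >= MATCH_THRESHOLD:
            return "review"
        if score >= NON_MATCH_THRESHOLD:
            return "possible_match"
        return "non_match"

    for score in (0.95, 0.85, 0.60, 0.30):
        print(score, "->", classify_pair(score))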

6. Entity Search and Stewardship

Following match execution, matched records are flagged for review and surfaced to business stewards:

  • Notifications alert users to critical or ambiguous matches

  • The interface allows for manual review, adjustment, or approval

  • Accuracy scores and comparison views support decision-making

This human-in-the-loop step ensures high confidence in match outcomes.

7. Golden Record Creation

Once matches are approved, LakeFusion generates Golden Records—a single, trusted version of each entity.

  • Combines best-quality fields from contributing sources

  • Reflects the most accurate and complete view of the entity

  • Stored in the Gold layer of the Medallion Architecture for downstream use

Golden records can be consumed by operational systems, reporting dashboards, and advanced analytics pipelines—driving data consistency across the enterprise.
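
A simplified sketch of assembling a golden record by coalescing the best-quality fields from two surviving sources and persisting the result to the Gold layer; table and column names are illustrative.

    from pyspark.sql import functions as F

    crm = spark.table("main.mdm.customer_crm_survivors")  # illustrative
    erp = spark.table("main.mdm.customer_erp_survivors")  # illustrative

    # Prefer CRM values and fall back to ERP where CRM is missing.
    golden = (crm.alias("c")
        .join(erp.alias("e"), on="master_id", how="full_outer")
        .select(
            "master_id",
            F.coalesce("c.customer_name", "e.customer_name").alias("customer_name"),
            F.coalesce("c.email", "e.email").alias("email"),
            F.coalesce("c.country", "e.country").alias("country")))

    # Persist to the Gold layer of the Medallion Architecture for downstream consumers.
    golden.write.mode("overwrite").saveAsTable("main.gold.customer_golden")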
