Data Flow in LakeFusion

This section provides a structured overview of the LakeFusion Data Flow, outlining the key stages and enabling technologies that support seamless data ingestion, preprocessing, and Master Data Management (MDM). Each stage ensures data is unified, cleansed, governed, and prepared for use across analytics and operational systems.


1. Data Sources and Unity Catalog

1.1 Source Systems

The data lifecycle begins with ingesting raw information from a wide range of enterprise systems. LakeFusion supports integration with:

  • Transactional databases (RDBMS): Core business systems containing operational data

  • Structured and unstructured files: Formats such as CSV, JSON, XML from various departments

  • External APIs and real-time data feeds: Streaming data from third-party services

  • Legacy system exports: Historical data retained in older platforms

These data sources are onboarded and registered in Databricks Unity Catalog for centralized management.

1.2 Unity Catalog Integration

Unity Catalog acts as the governance layer for all incoming data. It provides:

  • Centralized metadata management across all data assets

  • Fine-grained access control to enforce data security policies

  • End-to-end data lineage tracking for auditability and transparency

  • Standardized data discovery with classification and tagging features

This foundational layer ensures that all ingested data is secure, discoverable, and governed from the start.
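
As a point of reference, table registration and governance in Unity Catalog use standard Databricks SQL, shown here from a Python notebook. The catalog, schema, table, path, and group names are illustrative placeholders, not values required by LakeFusion.

    # Minimal sketch, assuming a Databricks workspace with Unity Catalog enabled.
    # All names below (main, sales, customers_raw, data_stewards) are placeholders.
    spark.sql("CREATE CATALOG IF NOT EXISTS main")
    spark.sql("CREATE SCHEMA IF NOT EXISTS main.sales")

    # Register a source file as a managed table so its metadata is centrally tracked.
    (spark.read.format("csv")
        .option("header", "true")
        .load("/Volumes/main/sales/raw/customers.csv")  # illustrative volume path
        .write.saveAsTable("main.sales.customers_raw"))

    # Fine-grained access control: grant read access to a stewardship group.
    spark.sql("GRANT SELECT ON TABLE main.sales.customers_raw TO `data_stewards`")

    # Aid discovery with a table comment (tags and classification can be added similarly).
    spark.sql("COMMENT ON TABLE main.sales.customers_raw IS 'Raw customer records from CRM export'")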

2. Data Ingestion Layer

Once registered in Unity Catalog, source data is ingested into LakeFusion through the platform's Datasets interface. This ingestion process includes:

  • Creating structured datasets (Dataset 1, 2, 3, 4, etc.)

  • Preserving original metadata and schema fidelity

  • Establishing baselines for data quality profiling

  • Staging the data for downstream transformation and enrichment

This layer bridges raw source data and preprocessing workflows, ensuring a reliable starting point.
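
The general pattern behind this layer can be sketched in PySpark: read the governed source table, confirm its schema, capture a simple baseline, and stage the data for preprocessing. The table names and the customer_id column are assumptions for illustration, not LakeFusion internals.

    from pyspark.sql import functions as F

    # Read the table registered in Unity Catalog (name is illustrative).
    source_df = spark.table("main.sales.customers_raw")

    # Preserve original schema fidelity and record a simple quality baseline.
    source_df.printSchema()
    source_df.select(
        F.count("*").alias("row_count"),
        F.countDistinct("customer_id").alias("distinct_ids"),  # assumes a customer_id column
    ).show()

    # Stage the dataset for downstream transformation and enrichment.
    source_df.write.mode("overwrite").saveAsTable("main.staging.dataset_1")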

3. Data Preprocessing

3.1 Data Profiling

After ingestion, data flows through the Profiling Engine to analyze structure, quality, and consistency. Key outputs include:

  • Column-level statistics (e.g., null percentage, uniqueness)

  • Pattern recognition and frequency distributions

  • Outlier detection across numerical or categorical fields

Profiling provides a data health snapshot, helping teams understand readiness and design appropriate quality rules.
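
The profiling outputs listed above can be approximated with a few aggregate queries. A minimal PySpark sketch, assuming a staged dataset with a categorical country column and a numeric annual_revenue column (both hypothetical):

    from pyspark.sql import functions as F

    df = spark.table("main.staging.dataset_1")  # illustrative table name
    total = df.count()

    # Column-level statistics: null percentage and uniqueness per column.
    # (One pass per column; fine for a sketch, not optimized for wide tables.)
    for column in df.columns:
        nulls = df.filter(F.col(column).isNull()).count()
        distinct = df.select(column).distinct().count()
        print(f"{column}: null%={nulls / total:.1%}, uniqueness={distinct / total:.1%}")

    # Frequency distribution for a categorical field.
    df.groupBy("country").count().orderBy(F.desc("count")).show(10)

    # Simple outlier detection on a numeric field using the IQR rule.
    q1, q3 = df.approxQuantile("annual_revenue", [0.25, 0.75], 0.01)
    iqr = q3 - q1
    outliers = df.filter((F.col("annual_revenue") < q1 - 1.5 * iqr) |
                         (F.col("annual_revenue") > q3 + 1.5 * iqr))
    print("outlier rows:", outliers.count())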

3.2 Data Quality Enforcement

Based on profiling insights, datasets pass through automated data quality pipelines. These are executed using predefined and customizable notebooks.

Standard Quality Checks:

  • Data type and format validation

  • Standardization of naming conventions or units

  • Null value handling and default population

Custom Business Rules:

  • Domain-specific validations (e.g., revenue thresholds, age ranges)

  • Cross-field logic checks (e.g., start date must be before end date)

  • Compliance-driven rules tailored to industry or region

This step ensures that only clean, reliable records proceed to the MDM pipeline.
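
A quality notebook typically chains standard checks with business rules before handing data to MDM. The sketch below is a generic PySpark example, not LakeFusion's packaged notebook; the column names, formats, and thresholds are assumptions.

    from pyspark.sql import functions as F

    df = spark.table("main.staging.dataset_1")  # illustrative

    # Standard checks: type/format validation, standardization, and default population.
    cleaned = (df
        .withColumn("signup_date", F.to_date("signup_date", "yyyy-MM-dd"))
        .withColumn("country", F.upper(F.trim("country")))
        .fillna({"segment": "UNKNOWN"}))

    # Custom business rules: domain-specific and cross-field validations.
    valid = cleaned.filter(
        (F.col("annual_revenue") >= 0) &                  # revenue threshold
        (F.col("start_date") <= F.col("end_date")))       # start date must precede end date
    rejected = cleaned.subtract(valid)

    # Clean records continue to MDM; rejects land in a quarantine table for review.
    valid.write.mode("overwrite").saveAsTable("main.staging.dataset_1_clean")
    rejected.write.mode("overwrite").saveAsTable("main.staging.dataset_1_rejected")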

4. Master Data Management (MDM) Process

4.1 Entity Creation

An Entity in LakeFusion represents a consolidated business object (e.g., a customer, product, or patient). Entities are created by:

  • Defining attributes such as name, type, and description

  • Mapping datasets to corresponding entity attributes

  • Establishing logical relationships between data fields

This step consolidates disparate structures into a unified data model.
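
Conceptually, entity creation is an attribute mapping from source columns onto the unified model. A hypothetical sketch of such a mapping (the entity attributes and source column names are illustrative, not product defaults):

    from pyspark.sql import functions as F

    # Hypothetical mapping: entity attribute -> source dataset column.
    customer_attribute_mapping = {
        "customer_name":  "cust_nm",
        "email":          "email_addr",
        "country":        "country",
        "annual_revenue": "annual_revenue",
    }

    source = spark.table("main.staging.dataset_1_clean")  # illustrative

    # Project the cleansed dataset into the unified Customer entity shape.
    customer_entity = source.select(
        *[F.col(src).alias(attr) for attr, src in customer_attribute_mapping.items()])
    customer_entity.printSchema()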

4.2 Survivorship Rules

When multiple systems provide conflicting data, survivorship rules determine which value prevails (see the sketch after this list). These rules support:

  • Source prioritization (e.g., CRM takes precedence over ERP)

  • Strategy-based aggregation (e.g., most recent update wins)

  • Conditional logic based on business requirements
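
A minimal sketch of two common strategies combined, source prioritization with a most-recent-update tie-breaker, using a PySpark window function. The source systems, priority order, and column names are assumptions.

    from pyspark.sql import functions as F, Window

    records = spark.table("main.mdm.customer_candidates")  # illustrative

    # Source prioritization: CRM (1) takes precedence over ERP (2), then legacy (3).
    priority = (F.when(F.col("source") == "CRM", 1)
                 .when(F.col("source") == "ERP", 2)
                 .otherwise(3))

    # Within each matched cluster, keep the highest-priority record;
    # break ties with the most recent update (most recent wins).
    w = Window.partitionBy("cluster_id").orderBy(priority.asc(), F.col("updated_at").desc())
    survivors = (records
        .withColumn("rn", F.row_number().over(w))
        .filter(F.col("rn") == 1)
        .drop("rn"))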

4.3 Validation Rules

Entities are further safeguarded by validation functions that enforce attribute-level consistency (see the sketch after this list). Examples include:

  • Range validation (e.g., age must be > 0)

  • Text pattern matching (e.g., valid email format)

  • Referential integrity across related entities
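
A short PySpark sketch of the three example checks; the tables and columns are illustrative.

    from pyspark.sql import functions as F

    customers = spark.table("main.mdm.customer_entity")  # illustrative
    orders = spark.table("main.mdm.order_entity")        # illustrative

    # Range validation: age must be greater than zero.
    bad_age = customers.filter(F.col("age").isNull() | (F.col("age") <= 0))

    # Text pattern matching: basic email format check.
    bad_email = customers.filter(~F.col("email").rlike(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"))

    # Referential integrity: every order must reference an existing customer.
    orphan_orders = orders.join(customers, on="customer_id", how="left_anti")

    print(bad_age.count(), bad_email.count(), orphan_orders.count())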

5. Match Maven: Advanced Entity Matching

The Match Maven module powers intelligent match-merge operations using Large Language Models (LLMs) and vector search on Databricks.

5.1 Model Creation

  • LLM-Based Matching: Understands semantic similarities between records to identify duplicate or related entities

  • Vector Search: Embeds text data into vectors to facilitate high-precision similarity detection across datasets (illustrated in the sketch below)
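
The vector-search idea can be illustrated independently of the Databricks Vector Search service: records are embedded as numeric vectors and compared by cosine similarity. The embedding function below is a toy stand-in (hashed character trigrams), not the model Match Maven uses.

    import numpy as np

    def embed(text: str, dim: int = 256) -> np.ndarray:
        # Toy embedding from hashed character trigrams; a real pipeline would use
        # an LLM or sentence-embedding model.
        vec = np.zeros(dim)
        t = text.lower()
        for i in range(len(t) - 2):
            vec[hash(t[i:i + 3]) % dim] += 1.0
        return vec

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    # Records describing the same customer score higher than unrelated ones.
    same = cosine_similarity(embed("Acme Corporation, 12 Main St"),
                             embed("ACME Corp, 12 Main Street"))
    diff = cosine_similarity(embed("Acme Corporation, 12 Main St"),
                             embed("Globex Industries, 9 Oak Avenue"))
    print(f"same entity: {same:.2f}, different entity: {diff:.2f}")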

5.2 Experiment Execution

The match process involves iterative tuning through four stages:

  • Preparation: Define match parameters and data scopes

  • Execution: Run matching jobs using trained LLMs and embeddings

  • Analysis: Review matching accuracy and false positives

  • Approval: Publish the best-performing model into production

5.3 Threshold Configuration

Users can fine-tune thresholds for:

  • Match confidence

  • Non-match exclusion

  • Merge triggers

This ensures the right balance between precision (avoiding false matches) and recall (detecting all true matches).
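
As an illustration, these thresholds act as cutoffs on a pairwise similarity score. The values and category names below are arbitrary examples, not product defaults.

    # Hypothetical thresholds; tuning them trades precision against recall.
    MERGE_THRESHOLD = 0.92      # merge trigger: auto-merge above this score
    MATCH_THRESHOLD = 0.80      # match confidence: confident match, queued for steward review
    NON_MATCH_THRESHOLD = 0.50  # non-match exclusion: below this, pairs are discarded

    def classify_pair(score: float) -> str:
        if score >= MERGE_THRESHOLD:
            return "auto_merge"
        if score >= MATCH_THRESHOLD:
            return "review"
        if score >= NON_MATCH_THRESHOLD:
            return "possible_match"
        return "non_match"

    for score in (0.95, 0.85, 0.60, 0.30):
        print(score, "->", classify_pair(score))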

6. Entity Search and Stewardship

Following match execution, matched records are flagged for review and surfaced to business stewards:

  • Notifications alert users to critical or ambiguous matches

  • The interface allows for manual review, adjustment, or approval

  • Accuracy scores and comparison views support decision-making

This human-in-the-loop step ensures high confidence in match outcomes.

7. Golden Record Creation

Once matches are approved, LakeFusion generates Golden Records—a single, trusted version of each entity.

  • Combines best-quality fields from contributing sources

  • Reflects the most accurate and complete view of the entity

  • Stored in the Gold layer of the Medallion Architecture for downstream use

Golden records can be consumed by operational systems, reporting dashboards, and advanced analytics pipelines—driving data consistency across the enterprise.
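
A simplified sketch of assembling a golden record by coalescing the best-quality fields from two surviving sources and persisting the result to the Gold layer; table and column names are illustrative.

    from pyspark.sql import functions as F

    crm = spark.table("main.mdm.customer_crm_survivors")  # illustrative
    erp = spark.table("main.mdm.customer_erp_survivors")  # illustrative

    # Prefer CRM values and fall back to ERP where CRM is missing.
    golden = (crm.alias("c")
        .join(erp.alias("e"), on="master_id", how="full_outer")
        .select(
            "master_id",
            F.coalesce("c.customer_name", "e.customer_name").alias("customer_name"),
            F.coalesce("c.email", "e.email").alias("email"),
            F.coalesce("c.country", "e.country").alias("country")))

    # Persist to the Gold layer of the Medallion Architecture for downstream consumers.
    golden.write.mode("overwrite").saveAsTable("main.gold.customer_golden")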
