What is Data Lineage?

Data is plentiful, and data-driven insights are the gold standard for business strategies and decisions. Yet for many organizations, which data sources feed their analytics, and what information is extracted from them, remains a mystery. This raises questions about integrity and accuracy: a lossy data conversion can remove details that, particularly in security, have larger repercussions for data analytics and key performance indicators (KPIs).

Tracing data lineage across the security and IT data lifecycle enables organizations to create consistent, accurate, and reliable security analytics that support strategic decision-making across security, IT, compliance, and operations functions. Without appropriate insight into data sources, organizations increase data storage and maintenance costs while reducing the value of their analytics.

What is data lineage? 

Data lineage is the process of tracking, and ideally documenting, the journey of data over time. This begins with its creation at the source, includes the various transformations it undergoes as it moves through data pipelines, workflow engines, and ETL/ELT processes, and ends at the final application. By tracking how data travels from upstream producers to downstream consumers, organizations can more effectively identify a dataset's source, understand how it has been manipulated, and verify its reliability for use in the analytics models that drive decision-making and reporting.

Data lineage vs. data provenance vs. data governance 

Although these three terms are interrelated, they apply to different processes: 

  • Data lineage: the dataset’s complete history 
  • Data provenance: the data’s origin source system and transformations 
  • Data governance: data management that facilitates appropriate access, storage, and handling  

Data governance relies on data lineage and data provenance. To facilitate appropriate data handling, organizations need visibility into where data originates and how it has been processed. For example, with security data, organizations need visibility into: 

  • Technology generating the telemetry, like Endpoint Detection and Response (EDR) or network monitoring tools 
  • Normalization of the data across the pipeline, like transforming data to an extended version of the Open Cybersecurity Schema Framework (OCSF)
  • Users accessing and using the data, like security, compliance, IT, and operations teams 
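Normalization is where lineage metadata is easiest to lose, so a normalization step should carry the source forward. A minimal sketch of mapping a hypothetical EDR alert onto an OCSF-style record; the input and output field names here are illustrative assumptions, not the official OCSF schema:

```python
# Minimal sketch: normalizing a vendor-specific EDR alert into an OCSF-style
# shape while preserving lineage metadata. Field names are illustrative
# assumptions, not the official OCSF schema.

def normalize_edr_event(raw: dict) -> dict:
    """Map vendor-specific EDR fields onto a common schema shape."""
    return {
        "class_name": "Process Activity",           # OCSF-style event class
        "time": raw["timestamp"],                   # original event time
        "device": {"hostname": raw["host"]},
        "process": {"name": raw["proc_name"], "pid": raw["pid"]},
        "metadata": {
            # Lineage breadcrumbs: which tool generated this telemetry,
            # and the original record it was derived from.
            "product": raw["sensor"],
            "original_event_id": raw["event_id"],
        },
    }

event = normalize_edr_event({
    "timestamp": "2024-01-01T00:00:00Z",
    "host": "workstation-7",
    "proc_name": "powershell.exe",
    "pid": 4242,
    "sensor": "ExampleEDR",        # hypothetical tool name
    "event_id": "abc-123",
})
print(event["metadata"]["product"])  # the source tool survives normalization
```

Keeping the generating product and original event ID in the normalized record is what later lets a compliance or operations team trace an analytic back to the tool that produced the telemetry.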

Why is data lineage important? 

As organizations transform data and it moves throughout the data infrastructure, tracing data lineage empowers: 

  • Trust and transparency: documenting data’s journey from source to consumption layer so data users save time when vetting data 
  • Data quality and reliability: identifying intentional or unintentional loss of data or precision that can create errors or bugs for downstream data consumers 
  • Data and application debugging: identifying data error root cause to fix errors at the source 
  • Regulatory compliance: implementing appropriate controls over protected information across the entire data lifecycle 
  • Data modeling: revealing unknown or accidentally bypassed relationships between data elements and gaining real-time context into data flows to update models or make them more precise 
  • Strategic decision-making: ensuring data remains updated and accurate so data users can trust the analytics used in decision-making 
  • Impact analysis: identifying upstream and downstream impacts that changes to tables, columns, or reports can have  
  • Self-service analytics: providing necessary context into upstream and downstream lineage for data analysts 
  • Data exploration: improving discovery capabilities for more accurate analytics 
  • Data modernization: identifying and documenting data elements that are critical for cloud migration 
  • Asset management: identifying least and most used and certified data assets 

Data lineage supports enterprise use of security telemetry for activities like: 
  • Continuous controls monitoring (CCM): identifying and normalizing the data that enterprise technologies and cybersecurity tools generate to drive accuracy 
  • Threat hunting: trusting analytics that enable the operationalization and automation of activities that proactively identify security incidents 
  • Security hygiene: correlating and analyzing data to discover assets and suggest missing owners 

How does data lineage work? 

When automating data lineage processes, tools typically collect metadata (data about data), such as: 

  • Type 
  • Format 
  • Structure 
  • Source 
  • Date created 
  • Date modified 
  • File size 
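For file-based datasets, much of this metadata can be gathered directly from the filesystem. A minimal sketch using only the Python standard library (the dictionary keys mirror the list above and are our own choice, not a particular tool's output):

```python
# Minimal sketch of the metadata a lineage tool might collect for a
# file-based dataset, using only the standard library.
import os
import datetime
import tempfile

def collect_metadata(path: str) -> dict:
    st = os.stat(path)
    return {
        "source": os.path.abspath(path),
        "format": os.path.splitext(path)[1].lstrip(".") or "unknown",
        "file_size": st.st_size,  # bytes
        # Note: st_ctime is creation time on some platforms but inode
        # change time on others (e.g. Linux).
        "date_created": datetime.datetime.fromtimestamp(st.st_ctime).isoformat(),
        "date_modified": datetime.datetime.fromtimestamp(st.st_mtime).isoformat(),
    }

# Example: describe a small throwaway CSV file.
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as f:
    f.write(b"a,b\n1,2\n")
    tmp = f.name

meta = collect_metadata(tmp)
print(meta["format"], meta["file_size"])
```

Real lineage tools go further, also capturing type and structure (e.g. a parsed schema), but the principle is the same: record descriptive metadata at each point the data is touched.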

After collecting this information, the tools organize it in a hierarchical structure using the following concepts: 

  • Process: the data transformation operation that the system supports 
  • Run: the execution of a process, often containing details like start time, end time, state, or other attributes 
  • Event: the moment in time when a transformation operation occurred, causing data to move between a source and a target entity 
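These concepts echo the object model used by open lineage standards such as OpenLineage (which speaks of jobs, runs, and events). The hierarchy above can be sketched as plain data structures; the class and field names below are our own illustration, not any specific tool's API:

```python
# Minimal sketch of the process / run / event hierarchy described above.
# Names and fields are illustrative, not a specific tool's API.
from dataclasses import dataclass

@dataclass
class Process:
    """A data transformation operation the system supports, e.g. an ETL job."""
    name: str

@dataclass
class Run:
    """One execution of a process, with start/end time and state."""
    process: Process
    start_time: str
    end_time: str = ""
    state: str = "RUNNING"

@dataclass
class Event:
    """A moment when data moved from a source to a target entity."""
    run: Run
    source: str
    target: str
    occurred_at: str

nightly = Process("normalize_edr_logs")           # hypothetical job name
run = Run(nightly, start_time="2024-01-01T02:00:00Z")
evt = Event(run, source="raw_edr_events", target="ocsf_events",
            occurred_at="2024-01-01T02:05:00Z")
print(f"{evt.source} -> {evt.target} via {evt.run.process.name}")
```

Because each event points to its run, and each run to its process, a chain of such events is enough to reconstruct a dataset's journey end to end.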

When managing security data across multiple tools, like security information and event management (SIEM) solutions, data lineage becomes complicated. For example, at the enterprise level, organizations struggle to trace data from its origin and throughout various transformations because they may: 

  • Maintain more than 100 different cybersecurity tools 
  • Incorporate multiple SIEMs or centralized log management solutions that all use different schemas 
  • Lose data quality as data cascades through the hierarchy 

Types of data lineage 

Organizations struggle with data lineage because they often find that they need to choose between business purpose and technical granularity when engaging in the process. 

Business versus technical lineage 

At a high level, the choice between these two means determining which is more important to the use case: 

  • Business: context into business purpose and daily use, like comments, data classifications, justifications, data consumer notes 
  • Technical: end-to-end visibility for data engineers and technical analysts into how data reached its destination 

When managing security data and analytics, data teams struggle to identify all internal data consumers across various teams, including IT, security, compliance, and operations. Given these teams' different needs, business lineage may be more important. Meanwhile, traditional data orchestration tools may not be able to extract technical lineage from the complex and diverse schemas that the IT and security stack uses. 

Table-level versus column-level lineage 

At this level, the choice between these two means determining whether location or granularity is more important: 

  • Table-level: ways tables map to each other by using the metadata from the relational database or data lake 
  • Column-level: ways that changing data in a table's columns impacts attributes like data type and precision, or how combining columns creates new columns
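The difference is easiest to see side by side: table-level lineage records only that one table feeds another, while column-level lineage records how each output column derives from source columns. A small sketch, with all table and column names invented for illustration:

```python
# Table-level lineage: which tables feed which (coarse-grained).
table_lineage = {
    "alerts_enriched": ["raw_alerts", "asset_inventory"],
}

# Column-level lineage: how each output column is derived (fine-grained).
# Type/precision changes and column combinations are captured here.
column_lineage = {
    "alerts_enriched.hostname": {
        "sources": ["raw_alerts.host"],
        "transform": "lowercased",
    },
    "alerts_enriched.owner": {
        "sources": ["asset_inventory.owner_email"],
        "transform": "joined on hostname",
    },
    "alerts_enriched.risk_label": {
        # Combining two columns creates a new column.
        "sources": ["raw_alerts.severity", "asset_inventory.criticality"],
        "transform": "concatenated",
    },
}

def upstream_columns(column: str) -> list:
    """Answer an impact-analysis question: what feeds this column?"""
    return column_lineage[column]["sources"]

print(upstream_columns("alerts_enriched.risk_label"))
```

A lookup like `upstream_columns` is the core of impact analysis: before altering `raw_alerts.severity`, a team can see exactly which downstream columns and reports depend on it.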

When managing security data analytics, teams often struggle because the data may be semi-structured, not easily lending itself to table mapping or column granularity.
