Data 101 series - part 4/8

In the last episode, we saw that successful data strategies require reviewing existing data platforms and analyzing how business users can take advantage of them. In addition, any data strategy requires the right tools and technologies to work as planned. Here, data architecture is pivotal in choosing the right tools and processes that allow you to design an adequate data platform for your organization.

In this article, I will introduce the concept of the data journey and show how raw data increases in value as it moves through its stages.

Overview

The data journey, or data value chain, describes the steps data goes through from its creation to its eventual disposal. In the context of data platforms, it consists of acquisition, storage, processing, and dissemination (sharing, visualization, monetization…). This process aims to ensure that the right information is available at the right time for decision-making purposes while protecting against unauthorized access or misuse.

A modern data architecture represents all stages of the data journey ©Semantix.

To create a performant and adequate data journey for business users, data architects focus on how technological tools enable your organization and, consequently, your people to be more data-driven (see Data 101 - part 2). Far from the common mindset of modernizing for the sake of modernizing, they rely on a few strict rules:

Relevance: Who will use the tech, and will it meet their needs? Technology should serve the business users and not the opposite. Also, ensure that each stage of the data journey has the right technology and processes to maintain data integrity and produce the most value. Instead of identifying a universal best-in-class approach, use a customized tool selection based on the data characteristics (see Data 101 - part 1), your team size, and the maturity level of your organization.
Accessibility: Data architects should consider tools and technologies that enable everyone across the organization to make data-driven decisions without obstacles when accessing data.
Performance: Powerful technologies on the market speed up the data transformation process. Consider tools that will enable business users to be proactive and not reactive.

Data journey stages

The data journey consists of many stages. The main ones are ingestion, storage, processing, and serving. Each stage has its own set of activities and considerations. The data journey also has a notion of “underlying” activities: critical activities that span the entire lifecycle. These include security, data management, DataOps, orchestration, and software engineering.

Understanding each stage of the data lifecycle will help you make better choices about the tools and processes you apply to your data assets, depending on what you expect from the data platform. Next, I will briefly go over the stages that form the data journey. Each stage will be covered in detail in upcoming blog posts.

1 - Data Ingestion

Data ingestion is the first stage of the data lifecycle. This is where data is collected from internal sources such as databases, CRMs, ERPs, and legacy systems, and from external ones such as surveys and third-party providers. It’s important to ensure the acquired data is accurate and up-to-date so that it can be used effectively in subsequent stages of the cycle.

In this stage, raw data is extracted from one or more data sources, replicated, then ingested into a landing storage area. You must consider the characteristics of the data you want to acquire to ensure the data ingestion stage has the right technology and processes to meet its goals.
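
To make the extract-then-land idea concrete, here is a minimal sketch in Python. It assumes the source rows are already available as dictionaries and that the landing area is a local folder of JSON Lines files; the function name ingest_to_landing, the sample CRM rows, and the envelope fields are all hypothetical and only serve the illustration.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def ingest_to_landing(records, source_name, landing_dir):
    """Copy raw records, unchanged, into a landing file (one JSON object per line)."""
    landing_dir = Path(landing_dir)
    landing_dir.mkdir(parents=True, exist_ok=True)
    ingested_at = datetime.now(timezone.utc).isoformat()
    target = landing_dir / f"{source_name}.jsonl"
    with target.open("a", encoding="utf-8") as fh:
        for record in records:
            # Wrap the raw payload with basic metadata; the payload itself is not modified.
            envelope = {
                "source": source_name,
                "ingested_at": ingested_at,
                "payload": record,
            }
            fh.write(json.dumps(envelope) + "\n")
    return target


# Hypothetical source: rows as they might come out of a CRM export.
crm_rows = [
    {"customer_id": "42", "signup_date": "2023-01-15"},
    {"customer_id": "43", "signup_date": "2023-02-01"},
]

landing_path = ingest_to_landing(crm_rows, "crm", tempfile.mkdtemp())
landed = [
    json.loads(line)
    for line in landing_path.read_text(encoding="utf-8").splitlines()
]
```

Keeping the payload untouched at this point is a deliberate choice in this sketch: cleaning and typing are left to the processing stage, so the landing area remains a faithful copy of the source.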

In the next post, I will deep-dive into data ingestion, its main activities, and its considerations. As I explained in Data 101 - part 1, data has a few innate characteristics. The most relevant ones for this stage are volume, variety, and velocity.

2 - Data Storage

Storage refers to how data is retained once it has been acquired. Data should be securely stored on a reliable platform with appropriate backup systems for disaster recovery. Access control measures must also be implemented to protect sensitive information from unauthorized users or malicious actors who may try to gain access illegally.

Selecting a storage solution is crucial for the success of the data lifecycle, but it can also be a complex process due to several factors. Data storage has a few characteristics, such as storage lifecycle (how data will evolve), storage options (how data can be stored efficiently), storage layers, storage formats (how data should be stored depending on data access frequency), and the storage technologies in which data are kept.

While storage is a distinct stage in the data journey, it runs across the entire journey and often occurs in multiple places in a data pipeline, with storage systems crossing over with source systems, ingestion, transformation, and serving. In many ways, how data is stored impacts how it is used in all the stages of the data journey.

3 - Data Processing

After you’ve ingested and stored data, you need to do something with it. The next stage of the data journey is processing, or transformation: data needs to be changed from its original form into something useful for downstream use cases.

Processing turns raw data into valuable insights, starting with a series of basic transformations. These play a vital role in the data processing pipeline, ensuring data is represented in the correct data types: string data, such as dates and numerical values, is converted to its proper type. Basic transformations also involve standardizing data records, eliminating erroneous entries, and ensuring the data meets required formatting standards.
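
The basic transformations described above can be sketched as follows. This is only an illustration using the Python standard library; the field names, the expected date format (YYYY-MM-DD), and the decision to drop unparseable rows rather than quarantine them are assumptions, not rules from this series.

```python
from datetime import datetime


def clean_record(raw):
    """Cast string fields to proper types; return None if the record cannot be parsed."""
    try:
        return {
            "customer_id": int(raw["customer_id"]),
            "signup_date": datetime.strptime(
                raw["signup_date"].strip(), "%Y-%m-%d"
            ).date(),
            "amount": float(raw["amount"]),
            # Standardize the country code: trim whitespace and upper-case it.
            "country": raw["country"].strip().upper(),
        }
    except (KeyError, ValueError, AttributeError, TypeError):
        # Missing field, wrong format, or a non-string value: treat as erroneous.
        return None


# Hypothetical raw rows: the second has a bad date format, the third a bad id.
raw_rows = [
    {"customer_id": "42", "signup_date": "2023-01-15", "amount": "19.90", "country": " fr "},
    {"customer_id": "43", "signup_date": "15/01/2023", "amount": "5.00", "country": "DE"},
    {"customer_id": "abc", "signup_date": "2023-02-01", "amount": "7.50", "country": "us"},
]

clean_rows = [row for row in map(clean_record, raw_rows) if row is not None]
```

In this sample, only the first row survives. In practice you would probably log or store the rejected rows instead of silently discarding them.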

As the data processing pipeline progresses, more advanced transformations may be necessary, such as transforming or normalizing the data schema. Further downstream in the pipeline, we can aggregate data at scale for reporting purposes or transform data into feature vectors for use in machine learning processes.
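
As a rough illustration of these downstream steps, the sketch below aggregates a few hypothetical order records for reporting and then derives one simple feature vector per customer. The sample data and the choice of features (order count, total amount, average amount) are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical cleaned order records.
orders = [
    {"customer_id": 1, "country": "FR", "amount": 20.0},
    {"customer_id": 1, "country": "FR", "amount": 30.0},
    {"customer_id": 2, "country": "DE", "amount": 10.0},
]

# Aggregation for reporting: total revenue per country.
revenue_by_country = defaultdict(float)
for order in orders:
    revenue_by_country[order["country"]] += order["amount"]


def customer_features(rows):
    """Build one numeric vector per customer: [order_count, total_amount, average_amount]."""
    amounts_by_customer = defaultdict(list)
    for row in rows:
        amounts_by_customer[row["customer_id"]].append(row["amount"])
    return {
        customer_id: [len(amounts), sum(amounts), sum(amounts) / len(amounts)]
        for customer_id, amounts in amounts_by_customer.items()
    }


features = customer_features(orders)
```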

The main challenges here are accuracy and efficiency since this stage requires significant computing power, which could become costly over time without proper optimization strategies.

4 - Data Serving

You’ve reached the last stage of the data journey. Now that the data has been ingested, stored, and processed into coherent and valuable structures, it’s time to get value from your data.

Data serving is the most exciting part of the data journey. This is where the magic happens. Here, BI engineers, ML engineers, and data scientists can apply the most advanced techniques to extract the added value of data. For example, analytics is one of the best-known data endeavors. It consists of interpreting and drawing insights from curated data to make informed decisions or predictions about future trends. Moreover, data visualization tools can be used here to present data more meaningfully.

Data Journey’s Underlying Activities

Data engineering has evolved beyond tools and technology and now includes a variety of practices and methodologies aimed at optimizing the entire data journey, such as data management, security, cost optimization, and newer practices like DataOps.

Security is a critical consideration in any data project. Data engineers must prioritize security in their work, as failure can have severe consequences. They must thoroughly understand both data security and access security, and adhere to the principle of least privilege.

The principle of least privilege is a fundamental security principle that gives each user or system only the data access needed to perform its specific functions. This approach limits the potential damage that could result from a security breach. By exercising the principle of least privilege, data engineers can help ensure that sensitive data is only accessible by authorized personnel and is protected from unauthorized access or misuse.
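
A deny-by-default check is one simple way to picture least privilege. The sketch below is illustrative only: the role names, dataset names, and the ROLE_GRANTS mapping are hypothetical, and a real platform would rely on the access controls of its storage and database systems rather than an in-memory dictionary.

```python
# Hypothetical role-to-permission mapping; anything not listed is denied.
ROLE_GRANTS = {
    "data_analyst": {("sales_mart", "read")},
    "data_engineer": {
        ("raw_landing", "read"),
        ("raw_landing", "write"),
        ("sales_mart", "write"),
    },
}


def is_allowed(role, dataset, action):
    """Deny by default: access is granted only if explicitly listed for the role."""
    return (dataset, action) in ROLE_GRANTS.get(role, set())


analyst_can_read_mart = is_allowed("data_analyst", "sales_mart", "read")
analyst_can_read_raw = is_allowed("data_analyst", "raw_landing", "read")
```

Here the analyst can read the curated mart but not the raw landing data, and a role that is not listed at all gets no access.
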
Cost optimization: Consider your data platform’s overall cost and return on investment (ROI). Cost can quickly become a blocker for any future data initiative if the platform costs too much for the value it delivers. Like security, cost is critical and should be analyzed and assessed before you decide which tools and technologies to choose.
Data management requires organizations to view data as a valuable asset in the same way they view other resources such as financial assets, real estate, or finished goods. Effective data management practices establish a cohesive framework that all stakeholders can follow to ensure the organization derives value from its data and handles it appropriately. This framework consists of multiple aspects, such as data governance, data discovery, data lineage, data accountability, data quality, master data management (MDM), data modeling and design, and data privacy.

Without a comprehensive framework for managing data, data engineers are isolated, focusing only on technical aspects without fully understanding the broader organizational context of data. Data engineers need a broader perspective of data’s utility across the organization, from the source systems to executive leadership and everywhere in between. By adopting a comprehensive data management framework, data engineers can help ensure that the organization leverages its data effectively, enabling better decision-making and improving overall operational efficiency.
DataOps is a methodology that incorporates the best practices of Agile methodology, DevOps, and statistical process control (SPC) to manage data effectively. While DevOps primarily focuses on enhancing the quality and release of software products, DataOps applies similar principles to data products.

By adopting DataOps, organizations can ensure that data products are delivered on time, meet quality standards, and are available to stakeholders as needed. This methodology emphasizes observability, collaboration, automation, and continuous improvement, ensuring that data processes are streamlined and efficient. In addition, by applying statistical process control principles, DataOps helps to identify and mitigate any issues or errors in the data pipeline.
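
To give a feel for statistical process control applied to a pipeline, here is a simplified three-sigma check on a pipeline metric. It is a sketch under stated assumptions: the metric (daily row count), the baseline values, and the three-sigma threshold are illustrative, and full SPC practice involves more than this single rule.

```python
from statistics import mean, stdev


def control_limits(baseline, sigmas=3.0):
    """Return (lower, upper) control limits computed from in-control observations."""
    center = mean(baseline)
    spread = stdev(baseline)  # sample standard deviation
    return center - sigmas * spread, center + sigmas * spread


def out_of_control(value, limits):
    """Flag a new observation that falls outside the control limits."""
    lower, upper = limits
    return value < lower or value > upper


# Hypothetical daily row counts from previous, healthy pipeline runs.
baseline_row_counts = [1000, 1020, 980, 1010, 990, 1005, 995]
limits = control_limits(baseline_row_counts)

normal_run_flagged = out_of_control(1003, limits)
broken_run_flagged = out_of_control(400, limits)
```

A run that loads 400 rows when about 1,000 are expected is flagged, while ordinary day-to-day variation is not.
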
Orchestration: As the complexity of data pipelines increases, a robust orchestration system becomes essential. In this context, orchestration refers to coordinating many jobs (responsible for data flows) to run as quickly and efficiently as possible on a scheduled cadence.
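
At its core, coordinating jobs means running them in an order that respects their dependencies. The sketch below uses graphlib.TopologicalSorter from the Python standard library (Python 3.9+) on a hypothetical set of jobs. It only covers ordering; scheduling on a cadence, retries, and parallel execution are left out.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each job maps to the set of jobs it depends on.
dependencies = {
    "ingest_crm": set(),
    "ingest_erp": set(),
    "clean_sales": {"ingest_crm", "ingest_erp"},
    "build_dashboard": {"clean_sales"},
    "train_model": {"clean_sales"},
}


def run_pipeline(deps, run_job):
    """Run every job after the jobs it depends on and return the order used."""
    executed = []
    for job in TopologicalSorter(deps).static_order():
        run_job(job)
        executed.append(job)
    return executed


# The callback is a stand-in for real work (for example, launching a task).
order = run_pipeline(dependencies, run_job=lambda job: None)
```

If the dependencies contain a cycle, TopologicalSorter raises graphlib.CycleError, which is a useful early signal that the pipeline definition is invalid.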

Data Chain of Value

Remember the data roles we introduced in Data 101 - part 2 of this series? Let’s place each role in the data journey. Everyone has predefined tasks, skills, and interface contracts with the others. Data value increases along the data journey thanks to this synergy. Throughout the process, from one end of the value chain to the other and back again, there should be constant feedback between producers and stakeholders.

Data roles in the Data chain of value

The platform (DataOps) engineer builds the infrastructure of the data platform for other involved data professionals to use.

The data engineer leverages the data platform infrastructure to build and deploy pipelines that extract data from external sources, clean it, guarantee its quality, and pass it on for further modeling. If more complex processing is needed, then the ownership of data engineers would shift downstream accordingly.
The data analyst communicates with business stakeholders to create an accurate and intuitively usable data model for subsequent analyses. Then, they utilize this modeled data to conduct Exploratory Data Analyses (EDA) and Descriptive Analyses to answer business-related questions with the help of dashboards and reports.

MLOps engineers build out the infrastructure of a Machine Learning platform that can be used by both ML engineers and data scientists alike.

Data scientists leverage datasets in the data platform to explore their predictive insights and build Machine Learning Models. In more mature organizations, you will find feature stores as well.
Machine Learning engineers leverage ML platform capabilities to deploy Machine Learning models built by data scientists while ensuring MLOps best practices. They also communicate with MLOps engineers to align on, help improve, and industrialize ML platform capabilities.

Where do all these data roles fit in with data science? There’s some debate, with some arguing that data engineering is a sub-discipline of data science. Honestly, I believe data engineering is separate from data science and analytics. They complement each other, but they are distinctly different. Data engineering sits upstream of data science, meaning data engineers provide the inputs used by data scientists (downstream from data engineering), who convert these inputs into something useful.

Caution: Being further upstream in the data value chain does not mean that your value is less. Quite the opposite - the impact of any upstream mistake is multiplied in the downstream applications, so the upstream roles have the most impact on the final value.

To support this reasoning, I consider the Data Science Hierarchy of Needs (figure below). In 2017, Monica Rogati published this hierarchy in an article showing where AI and machine learning (ML) sit relative to more “mundane” tasks such as data collection, transformation, and infrastructure.

Data science hierarchy of needs

Although many data scientists may be excited to develop and fine-tune machine learning models, the truth is that approximately 70% to 80% of their time is devoted to the lower three levels of the hierarchy - data collection, data cleaning, and data processing - with only a tiny fraction of their time allocated to analysis and machine learning. To address this, Rogati argues that companies must establish a strong data foundation at the bottom levels of the hierarchy before embarking on areas such as AI and ML.

Data scientists are typically not trained to build production-grade data systems, and they end up doing this work haphazardly because they lack the support and resources of a data engineer. Ideally, data scientists should devote more than 90% of their time to the upper levels of the hierarchy, including analytics, experimentation, and Machine Learning. When data engineers concentrate on the lower portions of the hierarchy, they establish a robust foundation for data scientists to excel.

Summary

In this article, we introduced the concept of the data journey (data value chain) and how one can use modern technological tools to enable organizations and people to be more data-driven. We then detailed the different stages of this landscape and observed that deep architectural analysis is required to choose the right technology for the right purpose.

Finally, we placed the data team roles in the data journey landscape. We observed that data value increases thanks to the interactions between the data teams. However, being upstream in the data value chain does not mean delivering less value. Instead, every stage in the data journey directly impacts the downstream steps and, thus, the final value.

References

Reis, J. and Housley, M. Fundamentals of Data Engineering: Plan and Build Robust Data Systems. O’Reilly Media (2022).
Patel, P., Wood, G. and Diaz, A. Data Lake Governance Best Practices. DZone.