Data 101, Part 1

“Data 101” is a series that introduces foundational skills and concepts to help people become data literate. Through this series, you will learn how to collect, store, process, analyze, visualize, and interpret data using various technological and statistical tools.

By the end of this series, you will have acquired basic analytical skills that are highly valuable for managers. Moreover, you will have mastered the technical foundations needed to build a complete data journey, from acquisition to decision-making. The critical managerial insight of this whole exercise is to instill in an organization the idea that decisions should be based on knowledge, not intuition.

Topics covered include basic statistical concepts such as descriptive measures, probability distributions, and hypothesis testing; linear regression models for prediction; and machine learning algorithms such as decision trees or neural networks for classification tasks. In addition, you will be introduced to data storage and processing technologies, using frameworks like Hadoop, MapReduce, and Spark to process large datasets efficiently on distributed computing clusters. By the end of this series, you should have acquired enough knowledge about these topics to apply them in your own projects with confidence!

In this first article, I will introduce the concept of data, the typology of data, its main characteristics, and its impacts on our modern way of life.

What is Data?

Many modern organizations consider data one of their most valuable assets because it provides insights into customer behavior, market trends, product performance, and more, all of which help inform decisions about how to allocate resources. As such, companies often invest in collecting data or purchasing it from third-party sources to gain a competitive advantage.

Data is any information that has been collected, organized, and structured to make it worthwhile for analysis. Data is collected every time you make a purchase, navigate a website, travel, make a phone call, or post on a social media site. Data can come from many sources, including sensors, surveys, experiments, observations, or existing records (historical data), such as financial transactions. Never before has so much data about so many different things been collected and stored every second of every day.

Data is everywhere!

Information theory (Shannon, 1948) pushed the concept of data much further. It is a field of study concerned with how information is quantified, encoded, and communicated, and from this perspective almost anything can be considered data, including physical objects and abstract concepts such as ideas or emotions. In this view, data is any set of symbols that conveys meaning when interpreted by a receiver. Therefore, anything that has some form of symbolic representation (e.g., DNA sequences, words, numbers) could be classified as data in this context.

Types of Data

Data can be classified in different ways depending on the chosen perspective (by value, velocity, structure, or sensitivity…). From a purely statistical point of view, data falls into two major categories according to its value:

Quantitative (numerical) data: any information that can be expressed, measured, and compared using numerical values, such as integers or real numbers. Examples include height, weight, length, temperature readings, population size, or countable items such as the number of students in a classroom. This type of data can be further divided into Continuous (measured values, possibly with decimals) or Discrete (whole-number counts).
- Continuous data: quantitative data that can be meaningfully divided into finer levels. It is measured on a scale or continuum and can take almost any numeric value within a finite or infinite range. Its values can lie on an interval scale, where differences between values are meaningful, or on a ratio scale, where a true zero also makes ratios meaningful. Examples include height, weight, temperature, speed, BMI, and time.
- Discrete data: finite, numeric, countable values that cannot be subdivided into parts. Examples include counts (e.g., the number of children in a household), the total number of products, or binary indicators (yes/no, true/false).
Qualitative (categorical) data: non-numerical information such as opinions, feelings, perceptions, and attitudes. This data can answer questions such as “How did it occur?” or “Why did this occur?”. Examples include gender, rankings, enumerations, etc. This kind of data can be divided into Nominal or Ordinal.
- Nominal data: categorical data that has no numerical value or order. It consists of names, labels, or categories that classify and organize information into distinct groups. Examples include gender (male/female), nationality (Moroccan/French), and colors (green/blue).
- Ordinal data: categorical data that has an order or ranking associated with it. Examples include rankings such as 1st, 2nd, and 3rd; grades like A+, B-, and C/D; and high-medium-low ratings.
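
To make these four types more concrete, here is a minimal Python sketch showing which summary is meaningful for each one. The records, the field names, and the order chosen for the ratings are all invented for illustration; treat it as one possible reading of the typology above, not as a standard recipe.

```python
from statistics import mean, median_low, mode

# Hypothetical survey records, invented for illustration only.
records = [
    {"height_cm": 172.5, "children": 2, "color": "green", "rating": "high"},
    {"height_cm": 160.0, "children": 0, "color": "blue", "rating": "low"},
    {"height_cm": 181.3, "children": 1, "color": "green", "rating": "medium"},
    {"height_cm": 168.2, "children": 2, "color": "green", "rating": "high"},
]

# Ordinal categories need an explicit order; nominal ones have none.
RATING_ORDER = ["low", "medium", "high"]


def column(name):
    """Collect the values of one field across all records."""
    return [record[name] for record in records]


# Continuous: an average is meaningful.
mean_height = mean(column("height_cm"))

# Discrete: counts can be added up, but each value stays a whole number.
total_children = sum(column("children"))

# Nominal: only frequencies make sense, so report the most common label.
most_common_color = mode(column("color"))

# Ordinal: categories can be ranked, so a median category is meaningful.
ranks = [RATING_ORDER.index(value) for value in column("rating")]
median_rating = RATING_ORDER[median_low(ranks)]

print(mean_height, total_children, most_common_color, median_rating)
```

Notice that no average is computed for the colors or the ratings: the labels can be counted or ranked, but they cannot be added together.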

Statistically, qualitative variables usually need to be encoded as numbers, typically as dummy variables, before they can be used in most analyses. A dummy variable is a 0/1 indicator for a single category. Simply assigning arbitrary numbers to categories can be misleading: if your categories are colors, we could give the number 1 to red and 2 to blue, but these numbers have no meaning in a mathematical sense. We would not conclude that blue is twice as much as red!
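
As a rough illustration of dummy variables, the sketch below turns a list of color labels into 0/1 indicator columns using only plain Python. The function name `dummy_encode` and the `is_` column prefix are my own choices for this example, not part of any library.

```python
def dummy_encode(values):
    """Return the sorted categories and one 0/1 indicator row per value."""
    categories = sorted(set(values))
    rows = [
        {f"is_{category}": int(value == category) for category in categories}
        for value in values
    ]
    return categories, rows


# Hypothetical sample of a nominal variable.
colors = ["red", "blue", "red", "green"]
categories, encoded = dummy_encode(colors)

for color, row in zip(colors, encoded):
    print(color, row)
```

Each row contains exactly one 1, so no color is treated as larger than another. In regression models, one of the indicator columns is usually dropped, because it can be deduced from the others.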

Impact of Data

Data… Information… Knowledge. What is the difference? The DIKW model (Rowley, 2007) answers this question and the underlying one: what is the ultimate purpose of data?

The DIKW model describes the relationship between Data, Information, Knowledge, and Wisdom. Data can be considered the raw material for wise decision-making because it provides an objective basis for drawing conclusions. By analyzing large amounts of data in various ways, such as through statistical analysis or machine learning algorithms, we can uncover patterns that may not have been obvious before. This information is then processed into meaningful insights, which form the basis of decision-making processes. Finally, wisdom comes in when these insights are applied with experience and judgment, so that an informed choice can be made about what action to take next.

DIKW pyramid

Thus, data adds value by providing insights and information that can be used to make informed decisions. Data helps organizations identify trends, measure performance, optimize processes, improve customer experience, and drive innovation. It also enables businesses to gain a competitive edge in the market through better decision-making capabilities based on data analysis.
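
The progression from data to wisdom can be sketched in a few lines of Python. Everything here is hypothetical: the sales figures are invented, and the staffing rule is only a stand-in for the experience and judgment that the model places at the top of the pyramid.

```python
from statistics import mean

# Data: raw facts without context (hypothetical units sold per day).
daily_sales = {"Mon": 120, "Tue": 135, "Wed": 90, "Thu": 140, "Fri": 210}

# Information: the data summarized and put in context.
average_sales = mean(list(daily_sales.values()))
best_day = max(daily_sales, key=daily_sales.get)

# Knowledge: a pattern extracted from the information.
busy_days = [day for day, units in daily_sales.items() if units > average_sales]


# Wisdom: a judgment about what to do next (an illustrative rule only).
def staffing_decision(day):
    """Suggest an action for a given day based on the observed pattern."""
    return "add staff" if day in busy_days else "keep current staff"


print(average_sales, best_day, busy_days, staffing_decision("Fri"))
```

The code only mimics the last step. In practice, the decision also depends on context that is not in the dataset.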

I’ve deeply appreciated the power and the impact that the data has. Coming to the table with concrete evidence in data, in comparable titles, really allows the team to feel holistically comfortable with how we’re forecasting the business.

Through the different parts of this series, you will embark on this journey and learn how data is transformed and enriched, from collected signals, measures, and facts into a decision-driven pattern of actions.

Characteristics of Data

In the early years of this century, data was only studied in terms of three characteristics, known as the three V’s of Data: Volume, Velocity, and Variety. Over time, two more V’s (value and veracity) have been added to help data scientists and managers more effectively articulate and communicate the essential characteristics of the data they work with.

The five main and innate characteristics of data are:

Volume: refers to the amount of data an organization generates and stores.
Velocity: refers to how quickly data is generated, how quickly it moves, and how quickly it can be processed into usable insights.
Variety: refers to the diversity of data. Organizations might gather data from multiple sources, which may vary in format. Collected data can be structured, semi-structured, or unstructured in nature.
Veracity: refers to the level of trust and reliability in the collected data; in simple words, its quality and accuracy. Collected data could have missing pieces, be inaccurate, or be unable to provide real value.
Value: refers to the benefit an organization can derive from the data, that is, what it can actually do with it. This characteristic is directly tied to the meaning and context that an organization gives to the collected data.
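
Veracity is probably the easiest of these characteristics to check in code. The following sketch counts missing and implausible values in a handful of invented sensor readings. The field names, the plausible temperature range, and the `veracity_report` helper are all assumptions made for this example.

```python
# Hypothetical sensor readings; None marks a missing value.
readings = [
    {"sensor": "A", "temperature_c": 21.4},
    {"sensor": "B", "temperature_c": None},
    {"sensor": "C", "temperature_c": 250.0},  # implausible for a room sensor
    {"sensor": "D", "temperature_c": 19.8},
]

# Assumed plausible range for this example, in degrees Celsius.
PLAUSIBLE_RANGE = (-50.0, 60.0)


def veracity_report(rows, field, valid_range):
    """Count missing and out-of-range values for one field."""
    low, high = valid_range
    missing = sum(1 for row in rows if row[field] is None)
    out_of_range = sum(
        1
        for row in rows
        if row[field] is not None and not (low <= row[field] <= high)
    )
    usable = len(rows) - missing - out_of_range
    return {
        "records": len(rows),
        "missing": missing,
        "out_of_range": out_of_range,
        "usable_share": usable / len(rows),
    }


report = veracity_report(readings, "temperature_c", PLAUSIBLE_RANGE)
print(report)
```

Here only half of the records are usable, which is the kind of signal that should lower our trust in any insight drawn from this dataset.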

In marketing, subject-matter experts (SMEs) started to use two additional characteristics that are not innate to data but can significantly impact the insights generated from it. These two characteristics are:

Variability: a measure of how much the values vary within each kind of data. This concept is related to the context of data and the meaning given to it. In an organization, the meaning can constantly change, significantly impacting data homogenization. This concept differs from variety: imagine a coffee shop that offers six different blends of coffee (this is variety); if you order the same blend every day and it tastes different every day, that is variability.
Visualization: critical in today’s world. Using charts and graphs to visualize large amounts of complex data is much more effective in conveying meaning than raw data in spreadsheets chock-full of numbers and formulas.
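
The coffee shop distinction between variety and variability can be expressed numerically. In the sketch below, variety is the number of blends on the menu, and variability is the spread of the tasting scores for a single blend. The menu is smaller than in the example above, and the blend names and scores are invented.

```python
from statistics import pstdev

# Hypothetical tasting scores (0 to 10) for the same blend on four days.
tastings = {
    "house blend": [7.0, 7.0, 7.0, 7.0],
    "dark roast": [4.0, 9.0, 6.0, 8.0],
}

# Variety: how many different blends are offered.
variety = len(tastings)

# Variability: how much one blend changes from day to day,
# measured here with the population standard deviation.
variability = {blend: pstdev(scores) for blend, scores in tastings.items()}

print(variety, variability)
```

The house blend has zero variability because it tastes the same every day, while the dark roast changes noticeably even though it is still a single blend.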

7 V’s of Data

Summary

Data is a collection of observations and facts, materialized by values or measurements and organized to make it worthwhile for analysis.
Data is everywhere and can be qualitative (descriptive) or quantitative (numerical).
Data has no meaning by itself; it must be put in context, interpreted, and analyzed to gain insights. It should also contain enough information so that meaningful conclusions can be drawn and actions can be taken.
Data has different characteristics. In the following articles, we will learn that each characteristic has an important impact on the data journey. Every action taken on data, whether collection, storage, processing, visualization, or interpretation, depends on the 7 V’s of Data.

References

Shannon, Claude E. “A Mathematical Theory of Communication.” The Bell System Technical Journal, Vol. 27, pp. 379–423, 623–656, July and October 1948.
Rowley, Jennifer E. “The wisdom hierarchy: representations of the DIKW hierarchy.” Journal of Information Science 33 (2007): 163–180.
Moore, Matt. “The 7 V’s of Big Data.”