In this post I try to understand "Data" and how it relates to "Information" and "Knowledge". I also explain the different types of data and the bases on which data is classified. A brief article cannot cover all of these concepts in depth, but I’ll try my best to explain the essentials. Let’s start from the beginning…
What is Data?
Data can be any set of values which is used to generate information. Data is usually raw and unorganized facts that need to be processed. Once organized, Data can be used to generate valuable information to drive insights.
When many words come together we have a sentence, and several sentences form a paragraph. Many paragraphs create a chapter, and the chapters collectively become a book. All of these components can be called data, both individually and collectively. By collecting, organizing and managing data we turn it into information, i.e. words become books.
When raw data is organized, structured and modeled, it generates knowledge and becomes Information. In other words, Information is just data organized in such a way that it generates knowledge. Data is found everywhere. All domains, fields, studies, and sciences constantly create and consume data.
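To make the step from data to information concrete, here is a minimal sketch. The temperature readings and the `summarize` helper are made up for illustration; the point is only that grouping and aggregating raw values produces something a reader can act on.

```python
from statistics import mean

# Hypothetical raw data: unorganized (day, temperature in Celsius) readings.
raw_readings = [("Mon", 31.0), ("Tue", 34.5), ("Mon", 33.0), ("Tue", 35.5)]


def summarize(readings):
    """Group readings by day and return the average temperature per day."""
    by_day = {}
    for day, temp in readings:
        by_day.setdefault(day, []).append(temp)
    return {day: mean(temps) for day, temps in by_day.items()}


# The organized result is the "information": one average per day.
summary = summarize(raw_readings)
print(summary)
```

The raw list on its own says little. The per-day averages answer a question ("which day was hotter?"), and that is the sense in which organized data becomes information.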
What are the different types of data?
Data can be classified on the basis of where it comes from, how it is stored, and what is being stored.
Based on data sources
Based on where the data comes from, it can be classified into primary and secondary data.
Primary data is new raw data that is extracted or collected for the first time. It is usually unorganized, without any order or structure. Example: taking a picture of students for the yearbook.
Secondary data is collected from existing sources. It is typically already organized and arranged, and may already be in the form of information. Example: clipping or cropping the pictures of students from the yearbook.
Based on structure
Another way to classify data is by how it is stored and organized.
Structured Data is data that is arranged or organized according to a structure and follows a predefined data model. This makes it easy to analyze and generate insights from. For example, tables with several columns (features) and rows (observations).
Structured data often has a data model which defines the relationships between tables and the flow of data between them. The best-known example of structured data is the data held in an RDBMS (Relational Database Management System), which is the basis for SQL databases.
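As a rough illustration of structured data, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `students` table, its columns, and the sample names are all hypothetical; any relational database would work similarly.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The schema is the predefined data model: every row must fit these columns.
conn.execute(
    "CREATE TABLE students ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, grade INTEGER NOT NULL)"
)

# Each tuple is one row (observation); name and grade are columns (features).
conn.executemany(
    "INSERT INTO students (name, grade) VALUES (?, ?)",
    [("Asha", 9), ("Ben", 10), ("Chen", 10)],
)

# Because the structure is known in advance, analysis is a short query.
rows = conn.execute(
    "SELECT grade, COUNT(*) FROM students GROUP BY grade ORDER BY grade"
).fetchall()
print(rows)

conn.close()
```

The query returns one `(grade, count)` pair per grade. Nothing had to be parsed or cleaned first, which is the practical benefit of a predefined structure.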
Unstructured Data is the most common type of data. By common estimates, roughly 80–90% of all data is unstructured. It is not built on a data model and does not have any predefined structure for storage. Unstructured data is stored in its original format and processed as and when required. Pictures, videos, emails, tweets, and documents are all unstructured data.
New unstructured data is being created every day at a rate that is hard to imagine (2.5 quintillion bytes per day, according to [this] article).
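To show what "processed as and when required" can look like, here is a small sketch that pulls a few structured fields out of a piece of raw text. The tweet text, the patterns, and the field names are assumptions made for this example, not a general-purpose parser.

```python
import re

# Hypothetical unstructured input: free text kept in its original form.
raw_tweet = "So hot today! 34.5C outside #heatwave #summer"

# Extract hashtags: a '#' followed by word characters.
hashtags = re.findall(r"#(\w+)", raw_tweet)

# Extract a temperature written as a number followed by 'C'.
match = re.search(r"(\d+(?:\.\d+)?)C\b", raw_tweet)
temperature = float(match.group(1)) if match else None

# The result is a structured record derived from the unstructured text.
record = {"hashtags": hashtags, "temperature_c": temperature}
print(record)
```

The structure here is imposed at read time rather than at storage time, so a differently worded tweet might need different patterns.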
Based on values
Finally, data can be classified based on the properties of the values being stored.
Qualitative Data cannot be measured or counted. Instead, it is categorical in nature: either Nominal (categories with no inherent order) or Ordinal (categories with a natural order).
Nominal: hot vs. cold
Ordinal: very cold, cold, warm, hot, very hot
Quantitative Data is measured, counted, and aggregated to generate insights. Any numerical data that is Discrete (counted) or Continuous (measured) is of this type.
Discrete: 5 people say it’s hot outside.
Continuous: 34.5 degrees Celsius outside.
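The value type decides which operations make sense. The sketch below walks through the four cases using the hot/cold example; the survey responses, the readings, and the `ORDINAL_SCALE` list are invented for illustration.

```python
from collections import Counter
from statistics import mean

# Assumed ordering for the ordinal scale, lowest to highest.
ORDINAL_SCALE = ["very cold", "cold", "warm", "hot", "very hot"]

# Hypothetical qualitative responses and quantitative readings.
responses = ["hot", "cold", "very hot", "hot", "warm"]
temperatures = [34.5, 33.0, 36.0]

# Nominal view: categories can only be counted, not ranked or averaged.
counts = Counter(responses)

# Ordinal view: categories can also be sorted by their position on the scale.
ranked = sorted(responses, key=ORDINAL_SCALE.index)

# Discrete: a whole-number count, e.g. how many people said "hot".
hot_count = counts["hot"]

# Continuous: measured values, where an average is meaningful.
avg_temperature = mean(temperatures)

print(ranked)
print(hot_count, avg_temperature)
```

Averaging the strings in `responses` would be meaningless, while averaging `temperatures` is perfectly fine. That difference is what the classification is meant to capture.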
By learning the different types of data and how data is structured we gain a foundational understanding of how it can be further processed.
This is very important in the data analysis process. When we know what type of data we are looking at and how we can arrange our observations, we can start generating insights from it much more easily. Understanding the nature of the data we collect is the first and biggest step toward proper analysis.