INTRODUCTION
DATA WAREHOUSE
ETL Process in Data Warehouse
• The ETL process (Extract, Transform, Load) is the core mechanism used to
integrate and prepare data for analysis in a data warehouse. It involves
extracting data from various sources, transforming it to fit the desired
format, and loading it into the data warehouse.
• 1. Extract
• In this phase, data is collected from multiple heterogeneous sources such as:
• Relational databases (e.g., MySQL, PostgreSQL)
• Flat files (e.g., CSV, Excel)
• ERP systems
• Web services/APIs
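As a minimal sketch of the extract step, the snippet below pulls data from a flat file and from a relational source with pandas; the file, database, and table names (sales.csv, operational.db, orders) are hypothetical placeholders, and SQLite stands in for whichever source system is actually used.

```python
# Sketch of the Extract phase: pull raw data from heterogeneous sources.
# File, database, and table names are hypothetical placeholders.
import sqlite3
import pandas as pd

# Flat-file source (CSV).
csv_df = pd.read_csv("sales.csv")

# Relational source (SQLite stands in for MySQL/PostgreSQL in this sketch).
conn = sqlite3.connect("operational.db")
db_df = pd.read_sql_query("SELECT * FROM orders", conn)
conn.close()
```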
• 2. Transform
• Once extracted, the data is processed and transformed into a suitable format for analysis.
Common transformation tasks include:
• Data cleaning: Removing duplicates, handling missing values, correcting errors.
• Data integration: Combining data from different sources.
• Data normalization/denormalization: Depending on the use case.
• Aggregation: Summarizing data (e.g., total sales by region).
• Data type conversion: Converting data types to ensure consistency.
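A minimal sketch of the transform step, assuming the extracted frame has hypothetical region and amount columns; it combines cleaning, type conversion, and aggregation in one function.

```python
import pandas as pd

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    """Clean and reshape a raw extract (column names are hypothetical)."""
    df = raw.drop_duplicates()            # data cleaning: remove duplicates
    df = df.dropna(subset=["region"])     # data cleaning: handle missing values
    # Data type conversion: force amounts to numeric, treating bad values as 0.
    df = df.assign(amount=pd.to_numeric(df["amount"], errors="coerce").fillna(0))
    # Aggregation: total sales by region, ready for the warehouse.
    return df.groupby("region", as_index=False)["amount"].sum()
```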
• 3. Load
• After transformation, the data is loaded into the data warehouse. Loading
can happen in two modes:
• Full load: Entire dataset is loaded.
• Incremental load: Only new or updated data is loaded.
The goal is to ensure that the data warehouse remains consistent and up-to-
date for reporting and analysis.
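The sketch below illustrates the two load modes against a SQLite file standing in for the warehouse; the sales_summary table and the load_date column are hypothetical.

```python
import sqlite3
import pandas as pd

def full_load(df: pd.DataFrame, warehouse: str = "warehouse.db") -> None:
    """Full load: rebuild the target table from the complete dataset."""
    with sqlite3.connect(warehouse) as conn:
        df.to_sql("sales_summary", conn, if_exists="replace", index=False)

def incremental_load(df: pd.DataFrame, last_loaded: str,
                     warehouse: str = "warehouse.db") -> None:
    """Incremental load: append only rows newer than the previous load."""
    new_rows = df[df["load_date"] > last_loaded]   # hypothetical timestamp column
    with sqlite3.connect(warehouse) as conn:
        new_rows.to_sql("sales_summary", conn, if_exists="append", index=False)
```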
Difference between DBMS and Data Warehouse
• Definition: DBMS (Database Management System) – Software for storing, managing, and retrieving data in databases. Data Warehouse – A system designed for analytical reporting and data analysis.
• Purpose: DBMS – Supports transactional operations (OLTP – Online Transaction Processing). Data Warehouse – Supports analytical operations (OLAP – Online Analytical Processing).
• Data Type: DBMS – Stores current, real-time transactional data. Data Warehouse – Stores historical data aggregated from multiple sources.
• Normalization: DBMS – Data is often normalized to reduce redundancy and ensure consistency. Data Warehouse – Data is typically denormalized to improve query performance.
• Processing Type: DBMS – Handles read and write operations (insert, update, delete). Data Warehouse – Primarily optimized for read operations (queries, reports).
• Data Model: DBMS – Uses relational models (tables with primary and foreign keys). Data Warehouse – Uses multidimensional models (star schema, snowflake schema).
• Users: DBMS – Used by operational staff for day-to-day transactions. Data Warehouse – Used by analysts, data scientists, and business executives.
• Query Complexity: DBMS – Simple queries for routine transactions. Data Warehouse – Complex queries for trends, patterns, and business insights.
• Data Integration: DBMS – Typically contains data from a single application or system. Data Warehouse – Combines data from multiple sources (ETL process).
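To make the OLTP/OLAP contrast concrete, the sketch below pairs a single-row transactional write with an analytical aggregation; SQLite stands in for both systems, and the database files, tables, and columns (operational.db, warehouse.db, orders, sales_summary) are hypothetical.

```python
import sqlite3

# OLTP: a small, frequent write against the operational database.
with sqlite3.connect("operational.db") as conn:
    conn.execute("INSERT INTO orders (customer_id, amount) VALUES (?, ?)", (42, 19.99))

# OLAP: a read-heavy aggregation over history in the warehouse.
with sqlite3.connect("warehouse.db") as conn:
    rows = conn.execute(
        "SELECT region, SUM(amount) AS total_sales "
        "FROM sales_summary GROUP BY region "
        "ORDER BY total_sales DESC"
    ).fetchall()
```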
Data Mining
• Data mining is the process of extracting knowledge or insights from large
amounts of data using various statistical and computational techniques.
• The data can be structured, semi-structured or unstructured, and can be stored in
various forms such as databases, data warehouses, and data lakes.
• The primary goal of data mining is to discover hidden patterns and relationships
in the data that can be used to make informed decisions or predictions. This
involves exploring the data using various techniques such as clustering,
classification, regression analysis, association rule mining, and anomaly detection.
• Data mining is the process of discovering patterns and relationships in large
datasets using techniques such as machine learning and statistical analysis.
The goal of data mining is to extract useful information from large datasets
and use it to make predictions or inform decision-making. Data mining is
important because it allows organizations to uncover insights and trends in
their data that would be difficult or impossible to discover manually.
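As one concrete illustration of pattern discovery, the sketch below clusters a tiny made-up set of customer records with scikit-learn's KMeans (assuming scikit-learn is available); any of the other techniques listed above could be slotted in the same way.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: [yearly spend, number of visits] for six hypothetical customers.
X = np.array([[200, 4], [220, 5], [180, 3],
              [900, 25], [950, 30], [870, 22]])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.labels_)           # cluster assignment for each customer
print(model.cluster_centers_)  # the "hidden pattern": two spending profiles
```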
Data Mining Architecture
• Data Sources: The origins of the data used in data mining, including structured and unstructured data from databases, files, sensors, and other systems. They supply the raw data, which is processed, cleaned, and transformed into a usable data set for analysis.
• Data Preprocessing: The step that prepares data for analysis. It typically involves cleaning and transforming the data to remove errors, inconsistencies, and irrelevant information, ensuring the data is of high quality and ready for analysis.
• Data Mining Algorithms: The algorithms and models used to perform the mining itself. These include supervised and unsupervised learning algorithms such as regression, classification, and clustering, as well as more specialized techniques for tasks such as association rule mining and anomaly detection. They are applied to the prepared data to extract useful insights and information.
• Data Visualization: The presentation of data and insights in a clear, effective form, typically using charts, graphs, and other visualizations, so that data miners can communicate their findings to others in a way that is easy to understand and interpret.
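A brief sketch of the preprocessing and visualization stages, assuming pandas and matplotlib are available; the customers.csv file and its segment and spend columns are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

raw = pd.read_csv("customers.csv")        # data source (hypothetical file)
clean = raw.drop_duplicates().dropna()    # data preprocessing: remove duplicates and gaps

# Data visualization: summarize a finding as a simple chart.
clean.groupby("segment")["spend"].sum().plot(kind="bar")
plt.title("Total spend by customer segment")
plt.tight_layout()
plt.savefig("spend_by_segment.png")
```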
Data Mining Algorithms
• Classification: A data analysis task, i.e. the process of finding a model that describes and distinguishes data classes and concepts. Classification is the problem of identifying to which of a set of categories (subpopulations) a new observation belongs, on the basis of a training set of data containing observations whose category membership is known.
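A minimal classification sketch, using scikit-learn's bundled Iris dataset as the training set with known category membership (assuming scikit-learn is installed); the decision tree is just one of many possible classifiers.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Training data with known class labels.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Build a model that distinguishes the classes, then assign new observations.
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(clf.predict(X_test[:5]))      # predicted categories for unseen observations
print(clf.score(X_test, y_test))    # accuracy on held-out data
```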