INTRODUCTION
DATA WAREHOUSE
ETL Process in Data Warehouse
• The ETL process (Extract, Transform, Load) is the core mechanism used to
integrate and prepare data for analysis in a data warehouse. It involves
extracting data from various sources, transforming it to fit the desired
format, and loading it into the data warehouse.
• 1. Extract
• In this phase, data is collected from multiple heterogeneous sources such as:
• Relational databases (e.g., MySQL, PostgreSQL)
• Flat files (e.g., CSV, Excel)
• ERP systems
• Web services/APIs
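As a minimal sketch of the extract step, the snippet below pulls data from a flat file and from a relational source with pandas; the file, database, and table names (sales.csv, operational.db, orders) are hypothetical placeholders, and SQLite stands in for whichever source system is actually used.

```python
# Sketch of the Extract phase: pull raw data from heterogeneous sources.
# File, database, and table names are hypothetical placeholders.
import sqlite3
import pandas as pd

# Flat-file source (CSV).
csv_df = pd.read_csv("sales.csv")

# Relational source (SQLite stands in for MySQL/PostgreSQL in this sketch).
conn = sqlite3.connect("operational.db")
db_df = pd.read_sql_query("SELECT * FROM orders", conn)
conn.close()
```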
• 2. Transform
• Once extracted, the data is processed and transformed into a suitable format for analysis.
Common transformation tasks include:
• Data cleaning: Removing duplicates, handling missing values, correcting errors.
• Data integration: Combining data from different sources.
• Data normalization/denormalization: Depending on the use case.
• Aggregation: Summarizing data (e.g., total sales by region).
• Data type conversion: Converting data types to ensure consistency.
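A minimal sketch of the transform step, assuming the extracted frame has hypothetical region and amount columns; it combines cleaning, type conversion, and aggregation in one function.

```python
import pandas as pd

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    """Clean and reshape a raw extract (column names are hypothetical)."""
    df = raw.drop_duplicates()            # data cleaning: remove duplicates
    df = df.dropna(subset=["region"])     # data cleaning: handle missing values
    # Data type conversion: force amounts to numeric, treating bad values as 0.
    df = df.assign(amount=pd.to_numeric(df["amount"], errors="coerce").fillna(0))
    # Aggregation: total sales by region, ready for the warehouse.
    return df.groupby("region", as_index=False)["amount"].sum()
```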
• 3. Load
• After transformation, the data is loaded into the data warehouse. Loading
can happen in two modes:
• Full load: Entire dataset is loaded.
• Incremental load: Only new or updated data is loaded.
The goal is to ensure that the data warehouse remains consistent and up-to-
date for reporting and analysis.
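The sketch below illustrates the two load modes against a SQLite file standing in for the warehouse; the sales_summary table and the load_date column are hypothetical.

```python
import sqlite3
import pandas as pd

def full_load(df: pd.DataFrame, warehouse: str = "warehouse.db") -> None:
    """Full load: rebuild the target table from the complete dataset."""
    with sqlite3.connect(warehouse) as conn:
        df.to_sql("sales_summary", conn, if_exists="replace", index=False)

def incremental_load(df: pd.DataFrame, last_loaded: str,
                     warehouse: str = "warehouse.db") -> None:
    """Incremental load: append only rows newer than the previous load."""
    new_rows = df[df["load_date"] > last_loaded]   # hypothetical timestamp column
    with sqlite3.connect(warehouse) as conn:
        new_rows.to_sql("sales_summary", conn, if_exists="append", index=False)
```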
Difference between DBMS and Data Warehouse
• Definition: DBMS (Database Management System) – Software for storing, managing, and retrieving data in databases. Data Warehouse – A system designed for analytical reporting and data analysis.
• Purpose: DBMS – Supports transactional operations (OLTP – Online Transaction Processing). Data Warehouse – Supports analytical operations (OLAP – Online Analytical Processing).
• Data Type: DBMS – Stores current, real-time transactional data. Data Warehouse – Stores historical data aggregated from multiple sources.
• Normalization: DBMS – Data is often normalized to reduce redundancy and ensure consistency. Data Warehouse – Data is typically denormalized to improve query performance.
• Processing Type: DBMS – Handles read and write operations (insert, update, delete). Data Warehouse – Primarily optimized for read operations (queries, reports).
• Data Model: DBMS – Uses relational models (tables with primary and foreign keys). Data Warehouse – Uses multidimensional models (star schema, snowflake schema).
• Users: DBMS – Used by operational staff for day-to-day transactions. Data Warehouse – Used by analysts, data scientists, and business executives.
• Query Complexity: DBMS – Simple queries for routine transactions. Data Warehouse – Complex queries for trends, patterns, and business insights.
• Data Integration: DBMS – Typically contains data from a single application or system. Data Warehouse – Combines data from multiple sources (ETL process).
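To make the OLTP/OLAP contrast concrete, the sketch below pairs a single-row transactional write with an analytical aggregation; SQLite stands in for both systems, and the database files, tables, and columns (operational.db, warehouse.db, orders, sales_summary) are hypothetical.

```python
import sqlite3

# OLTP: a small, frequent write against the operational database.
with sqlite3.connect("operational.db") as conn:
    conn.execute("INSERT INTO orders (customer_id, amount) VALUES (?, ?)", (42, 19.99))

# OLAP: a read-heavy aggregation over history in the warehouse.
with sqlite3.connect("warehouse.db") as conn:
    rows = conn.execute(
        "SELECT region, SUM(amount) AS total_sales "
        "FROM sales_summary GROUP BY region "
        "ORDER BY total_sales DESC"
    ).fetchall()
```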
Data Mining
• Data mining is the process of extracting knowledge or insights from large
amounts of data using various statistical and computational techniques.
• The data can be structured, semi-structured or unstructured, and can be stored in
various forms such as databases, data warehouses, and data lakes.
• The primary goal of data mining is to discover hidden patterns and relationships
in the data that can be used to make informed decisions or predictions. This
involves exploring the data using various techniques such as clustering,
classification, regression analysis, association rule mining, and anomaly detection.
• Data mining is the process of discovering patterns and relationships in large
datasets using techniques such as machine learning and statistical analysis.
The goal of data mining is to extract useful information from large datasets
and use it to make predictions or inform decision-making. Data mining is
important because it allows organizations to uncover insights and trends in
their data that would be difficult or impossible to discover manually.
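As one concrete illustration of pattern discovery, the sketch below clusters a tiny made-up set of customer records with scikit-learn's KMeans (assuming scikit-learn is available); any of the other techniques listed above could be slotted in the same way.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: [yearly spend, number of visits] for six hypothetical customers.
X = np.array([[200, 4], [220, 5], [180, 3],
              [900, 25], [950, 30], [870, 22]])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.labels_)           # cluster assignment for each customer
print(model.cluster_centers_)  # the "hidden pattern": two spending profiles
```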
Data Mining Architecture
• Data Sources: The origins of the data used in data mining, including structured and unstructured data from databases, files, sensors, and other systems. They supply the raw data, which is processed, cleaned, and transformed into a usable data set for analysis.
• Data Preprocessing: The step that prepares data for analysis. It typically involves cleaning and transforming the data to remove errors, inconsistencies, and irrelevant information, ensuring the data is of high quality and ready for analysis.
• Data Mining Algorithms: The algorithms and models used to perform the mining itself. These include supervised and unsupervised learning algorithms such as regression, classification, and clustering, as well as more specialized techniques for tasks such as association rule mining and anomaly detection. They are applied to the prepared data to extract useful insights and information.
• Data Visualization: The presentation of data and insights in a clear, effective form, typically using charts, graphs, and other visualizations, so that data miners can communicate their findings to others in a way that is easy to understand and interpret.
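A brief sketch of the preprocessing and visualization stages, assuming pandas and matplotlib are available; the customers.csv file and its segment and spend columns are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

raw = pd.read_csv("customers.csv")        # data source (hypothetical file)
clean = raw.drop_duplicates().dropna()    # data preprocessing: remove duplicates and gaps

# Data visualization: summarize a finding as a simple chart.
clean.groupby("segment")["spend"].sum().plot(kind="bar")
plt.title("Total spend by customer segment")
plt.tight_layout()
plt.savefig("spend_by_segment.png")
```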
Data Mining Algorithms
• Classification: A data analysis task, i.e. the process of finding a model that describes and distinguishes data classes and concepts. Classification is the problem of identifying to which of a set of categories (subpopulations) a new observation belongs, on the basis of a training set of data containing observations whose category membership is known.
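A minimal classification sketch, using scikit-learn's bundled Iris dataset as the training set with known category membership (assuming scikit-learn is installed); the decision tree is just one of many possible classifiers.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Training data with known class labels.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Build a model that distinguishes the classes, then assign new observations.
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(clf.predict(X_test[:5]))      # predicted categories for unseen observations
print(clf.score(X_test, y_test))    # accuracy on held-out data
```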