Query Processing Strategies in
Distributed Database
(Journal of Engineering, Computers & Applied Sciences (JEC&AS) ISSN No: 2319‐5606
Volume 2, No.7, July 2013)
Presented By:-
Shree Raj Khatiwada
Introduction
 Query: Instruction to DBMS to update or retrieve specific data to/from the
physically stored medium.
 Query Processor: There are three steps during the processing of that query:
1. Parsing and Translation: the human
readable form of the query is translated
into forms usable by the DBMS i.e.
Relational algebra expression, query
tree and query graph
SELECT Ename
FROM Employee
WHERE Salary > 5000;
This can be translated into following Relational Algebra
Expressions:
σ Salary > 5000 (π Ename (Employee))
OR
π Ename (σ Salary > 5000 (Employee)) Fig: Steps in Query Processing
Introduction contd…
2. Optimizing the Query:
 determines the efficient way to execute a query with different possible
query plans.
 Main aim is to minimize the cost function,
I/O Cost + CPU Cost + Communication Cost
 defines how an RDBMS can improve the performance of the query by re-
ordering the operations.
3. Evaluating the Query:
 The query-execution engine takes an (optimal) evaluation plan, executes
that plan, and returns the answers to the query.
Distributed Query Processing
 In a distributed database environment, data
is stored at different sites connected
through network.
 Distributed query processing contains four
stages, which are:
1. Query decomposition
 Calculus Query as an input
 Using global schema the calculus query is
decomposed to algebraic query
2. Data Localization
 Algebraic query as an input
 Uses fragment schema to generate localized
fragment query
 Fragment involvement is determined
Distributed Query Processing contd…
3. Global Optimization
 Fragment query is an input
 Uses fragment statistics to get optimized fragment query as an output.
 Finding best global schedule is done
4. Local Optimization
 Local schema is used to get optimized local query and then executed
 Output is returned to the site from where the query was generated.
Distributed Query Optimization
Distributed query optimization is defined as finding the efficient execution
strategy path in distributed network.
There are three components of distributed query optimization:
 Access Method: methods used to access data from distributed
environment
 Join Criteria: In distributed database join criteria is used to join the
different sites to get optimized result.
 Transmission Costs: Cost of transmitting the results from intermediate
steps needs to be considered .
There are many issues in distributed query optimization such as types of
optimizer, optimization granularity, network topologies and optimization
timing
Example
Site 1: COURSE, ENROLLMENT
Site 2: STUDENT
(Course: Physics and Student: Senior)
There are many ways to optimize this three-table join some of which are:
Option 1: Start with site 1, join C & E retrieving only physics course and move entire result set to
site 2 to be joined with S.
Option 2: Star with site 2, retrieve only senior student from S and move the entire result set to site
1 to be joined with C and E
Option 3: Move C & E to site 2 and proceed with the local 3-tables join
Option 4: Move S to site 1 and proceed with a local 3-tables join
Example Contd…
Which of these options will perform the best?
 The only correct answer is “It depends”.
The optimal choice will depends on:
 the size of the tables,
 the size of the result sets (the number of qualifying rows and their
length in bytes) and
 the efficiency of the network.
Optimal Distribution Strategies for
Simple QueriesA query optimization algorithm is an algorithm that derives a distribution
strategy for a given query.
Query optimization algorithms that derive optimal distribution strategies for a class of distributed
queries called simple queries.
There are various algorithms that are used for query optimization such as:
Algorithm PARALLEL:
 Algorithm PARALLEL was used to derive a minimal response time distribution strategy for any
given simple query.
 Algorithm PARALLEL searches for cost beneficial data transmissions by trying to join small
relations to large relations.
Algorithm SERIAL:
 Finds strategy with minimum total time
 consists of transmitting each relation, starting with Ri, to the next relation in a serial order.
 The strategy is represented by R1 -> R2 ->…….-> Rm->Rr, where Rr is the relation at the result
node.
Algorithm GENERAL:
• Algorithm GENERAL derives a query processing strategy for either
response time or total time minimization by using the procedures
RESPONSE, TOTAL, and COLLECTIVE.
• a relation can contain more than one joining attribute.
• It has three versions as :-
I. Response Time Version
II. Total Time Version
III. Handling Redundant Data Transmission
Conclusion
 Algorithm GENERAL to be an efficient algorithm of polynomial complexity that derives close to
optimal query processing strategies on distributed systems.
 Algorithm GENERAL is an extension of processing tactics found optimal for simple queries in
Algorithm PARALLEL and algorithm SERIAL
 There are two primary versions of Algorithm GENERAL
1. To minimize response time of a processing strategy, parallel data transmissions are emphasized
by the use of Algorithm PARALLEL and Procedure RESPONSE.
2. To minimize the total time of a processing strategy, serial time transmissions are emphasized by
the use of Algorithm SERIAL and Procedure TOTAL
Thank you.

Query processing strategies in distributed database

  • 1.
    Query Processing Strategiesin Distributed Database (Journal of Engineering, Computers & Applied Sciences (JEC&AS) ISSN No: 2319‐5606 Volume 2, No.7, July 2013) Presented By:- Shree Raj Khatiwada
  • 2.
    Introduction  Query: Instructionto DBMS to update or retrieve specific data to/from the physically stored medium.  Query Processor: There are three steps during the processing of that query: 1. Parsing and Translation: the human readable form of the query is translated into forms usable by the DBMS i.e. Relational algebra expression, query tree and query graph SELECT Ename FROM Employee WHERE Salary > 5000; This can be translated into following Relational Algebra Expressions: σ Salary > 5000 (π Ename (Employee)) OR π Ename (σ Salary > 5000 (Employee)) Fig: Steps in Query Processing
  • 3.
    Introduction contd… 2. Optimizingthe Query:  determines the efficient way to execute a query with different possible query plans.  Main aim is to minimize the cost function, I/O Cost + CPU Cost + Communication Cost  defines how an RDBMS can improve the performance of the query by re- ordering the operations. 3. Evaluating the Query:  The query-execution engine takes an (optimal) evaluation plan, executes that plan, and returns the answers to the query.
  • 4.
    Distributed Query Processing In a distributed database environment, data is stored at different sites connected through network.  Distributed query processing contains four stages, which are: 1. Query decomposition  Calculus Query as an input  Using global schema the calculus query is decomposed to algebraic query 2. Data Localization  Algebraic query as an input  Uses fragment schema to generate localized fragment query  Fragment involvement is determined
  • 5.
    Distributed Query Processingcontd… 3. Global Optimization  Fragment query is an input  Uses fragment statistics to get optimized fragment query as an output.  Finding best global schedule is done 4. Local Optimization  Local schema is used to get optimized local query and then executed  Output is returned to the site from where the query was generated.
  • 6.
    Distributed Query Optimization Distributedquery optimization is defined as finding the efficient execution strategy path in distributed network. There are three components of distributed query optimization:  Access Method: methods used to access data from distributed environment  Join Criteria: In distributed database join criteria is used to join the different sites to get optimized result.  Transmission Costs: Cost of transmitting the results from intermediate steps needs to be considered . There are many issues in distributed query optimization such as types of optimizer, optimization granularity, network topologies and optimization timing
  • 7.
    Example Site 1: COURSE,ENROLLMENT Site 2: STUDENT (Course: Physics and Student: Senior) There are many ways to optimize this three-table join some of which are: Option 1: Start with site 1, join C & E retrieving only physics course and move entire result set to site 2 to be joined with S. Option 2: Star with site 2, retrieve only senior student from S and move the entire result set to site 1 to be joined with C and E Option 3: Move C & E to site 2 and proceed with the local 3-tables join Option 4: Move S to site 1 and proceed with a local 3-tables join
  • 8.
    Example Contd… Which ofthese options will perform the best?  The only correct answer is “It depends”. The optimal choice will depends on:  the size of the tables,  the size of the result sets (the number of qualifying rows and their length in bytes) and  the efficiency of the network.
  • 9.
    Optimal Distribution Strategiesfor Simple QueriesA query optimization algorithm is an algorithm that derives a distribution strategy for a given query. Query optimization algorithms that derive optimal distribution strategies for a class of distributed queries called simple queries. There are various algorithms that are used for query optimization such as: Algorithm PARALLEL:  Algorithm PARALLEL was used to derive a minimal response time distribution strategy for any given simple query.  Algorithm PARALLEL searches for cost beneficial data transmissions by trying to join small relations to large relations.
  • 10.
    Algorithm SERIAL:  Findsstrategy with minimum total time  consists of transmitting each relation, starting with Ri, to the next relation in a serial order.  The strategy is represented by R1 -> R2 ->…….-> Rm->Rr, where Rr is the relation at the result node. Algorithm GENERAL: • Algorithm GENERAL derives a query processing strategy for either response time or total time minimization by using the procedures RESPONSE, TOTAL, and COLLECTIVE. • a relation can contain more than one joining attribute. • It has three versions as :- I. Response Time Version II. Total Time Version III. Handling Redundant Data Transmission
  • 11.
    Conclusion  Algorithm GENERALto be an efficient algorithm of polynomial complexity that derives close to optimal query processing strategies on distributed systems.  Algorithm GENERAL is an extension of processing tactics found optimal for simple queries in Algorithm PARALLEL and algorithm SERIAL  There are two primary versions of Algorithm GENERAL 1. To minimize response time of a processing strategy, parallel data transmissions are emphasized by the use of Algorithm PARALLEL and Procedure RESPONSE. 2. To minimize the total time of a processing strategy, serial time transmissions are emphasized by the use of Algorithm SERIAL and Procedure TOTAL
  • 12.