Query processing strategies in distributed database

Query Processing Strategies in
Distributed Database
(Journal of Engineering, Computers & Applied Sciences (JEC&AS) ISSN No: 2319‐5606
Volume 2, No.7, July 2013)
Presented By:-
Shree Raj Khatiwada

Introduction
 Query: Instruction to DBMS to update or retrieve specific data to/from the
physically stored medium.
 Query Processor: There are three steps during the processing of that query:
1. Parsing and Translation: the human
readable form of the query is translated
into forms usable by the DBMS i.e.
Relational algebra expression, query
tree and query graph
SELECT Ename
FROM Employee
WHERE Salary > 5000;
This can be translated into following Relational Algebra
Expressions:
σ Salary > 5000 (π Ename (Employee))
OR
π Ename (σ Salary > 5000 (Employee)) Fig: Steps in Query Processing

Introduction contd…
2. Optimizing the Query:
 determines the efficient way to execute a query with different possible
query plans.
 Main aim is to minimize the cost function,
I/O Cost + CPU Cost + Communication Cost
 defines how an RDBMS can improve the performance of the query by re-
ordering the operations.
3. Evaluating the Query:
 The query-execution engine takes an (optimal) evaluation plan, executes
that plan, and returns the answers to the query.

Distributed Query Processing
 In a distributed database environment, data
is stored at different sites connected
through network.
 Distributed query processing contains four
stages, which are:
1. Query decomposition
 Calculus Query as an input
 Using global schema the calculus query is
decomposed to algebraic query
2. Data Localization
 Algebraic query as an input
 Uses fragment schema to generate localized
fragment query
 Fragment involvement is determined

Distributed Query Processing contd…
3. Global Optimization
 Fragment query is an input
 Uses fragment statistics to get optimized fragment query as an output.
 Finding best global schedule is done
4. Local Optimization
 Local schema is used to get optimized local query and then executed
 Output is returned to the site from where the query was generated.

Distributed Query Optimization
Distributed query optimization is defined as finding the efficient execution
strategy path in distributed network.
There are three components of distributed query optimization:
 Access Method: methods used to access data from distributed
environment
 Join Criteria: In distributed database join criteria is used to join the
different sites to get optimized result.
 Transmission Costs: Cost of transmitting the results from intermediate
steps needs to be considered .
There are many issues in distributed query optimization such as types of
optimizer, optimization granularity, network topologies and optimization
timing

Example
Site 1: COURSE, ENROLLMENT
Site 2: STUDENT
(Course: Physics and Student: Senior)
There are many ways to optimize this three-table join some of which are:
Option 1: Start with site 1, join C & E retrieving only physics course and move entire result set to
site 2 to be joined with S.
Option 2: Star with site 2, retrieve only senior student from S and move the entire result set to site
1 to be joined with C and E
Option 3: Move C & E to site 2 and proceed with the local 3-tables join
Option 4: Move S to site 1 and proceed with a local 3-tables join

Example Contd…
Which of these options will perform the best?
 The only correct answer is “It depends”.
The optimal choice will depends on:
 the size of the tables,
 the size of the result sets (the number of qualifying rows and their
length in bytes) and
 the efficiency of the network.

Optimal Distribution Strategies for
Simple QueriesA query optimization algorithm is an algorithm that derives a distribution
strategy for a given query.
Query optimization algorithms that derive optimal distribution strategies for a class of distributed
queries called simple queries.
There are various algorithms that are used for query optimization such as:
Algorithm PARALLEL:
 Algorithm PARALLEL was used to derive a minimal response time distribution strategy for any
given simple query.
 Algorithm PARALLEL searches for cost beneficial data transmissions by trying to join small
relations to large relations.

Algorithm SERIAL:
 Finds strategy with minimum total time
 consists of transmitting each relation, starting with Ri, to the next relation in a serial order.
 The strategy is represented by R1 -> R2 ->…….-> Rm->Rr, where Rr is the relation at the result
node.
Algorithm GENERAL:
• Algorithm GENERAL derives a query processing strategy for either
response time or total time minimization by using the procedures
RESPONSE, TOTAL, and COLLECTIVE.
• a relation can contain more than one joining attribute.
• It has three versions as :-
I. Response Time Version
II. Total Time Version
III. Handling Redundant Data Transmission

Conclusion
 Algorithm GENERAL to be an efficient algorithm of polynomial complexity that derives close to
optimal query processing strategies on distributed systems.
 Algorithm GENERAL is an extension of processing tactics found optimal for simple queries in
Algorithm PARALLEL and algorithm SERIAL
 There are two primary versions of Algorithm GENERAL
1. To minimize response time of a processing strategy, parallel data transmissions are emphasized
by the use of Algorithm PARALLEL and Procedure RESPONSE.
2. To minimize the total time of a processing strategy, serial time transmissions are emphasized by
the use of Algorithm SERIAL and Procedure TOTAL

Query processing strategies in distributed database

More Related Content

What's hot

Similar to Query processing strategies in distributed database

More from ShreerajKhatiwada

Recently uploaded

In this document

Query processing strategies in distributed database