MODULE-II (BIOSTATISTICS)
Definition : Biostatistics is a branch of statistics that deals with the collection, analysis,
interpretation, presentation, and organization of data in the context of biological and
health sciences. It plays a crucial role in medical research, epidemiology, public health,
and various other fields.
Data : Collection of information or facts is known as Data.
Applications of Statistics in Pharmaceutical Sciences :
1. Clinical Trials Design and Analysis
• Randomization: Ensures unbiased assignment of subjects to treatment or
control groups.
• Sample Size Calculation: Determines the number of participants needed to
detect treatment effects with adequate power.
• Survival Analysis: Analyses time-to-event data, such as time until the
occurrence of a side effect or disease progression.
• Statistical Significance Testing: Uses tests like t-tests, ANOVA, or regression to
compare treatment groups.
• Confidence Intervals: Estimates the range within which true effects of
treatments are expected to fall.
2. Bioequivalence Studies
• Assess whether generic drugs are equivalent to brand-name counterparts in
terms of their bioavailability and pharmacokinetics using ANOVA, crossover
designs, and other statistical tools.
3. Pharmacokinetics and Pharmacodynamics (PK/PD) Modelling
• Nonlinear Regression: Fits models to drug concentration vs. time data to
understand absorption, distribution, metabolism, and excretion of drugs.
• Compartmental Models: Helps describe how drugs move within different body
compartments.
• Population Modeling: Investigates variations in drug response across
populations.
4. Quality Control and Assurance
• Statistical Process Control (SPC): Monitors manufacturing processes using
control charts and other tools to maintain consistency.
• Acceptance Sampling: Determines whether a batch of pharmaceutical
products meets predefined quality standards.
• Design of Experiments (DoE): Optimizes formulation and manufacturing
conditions by systematically varying input factors.
5. Preclinical Studies
• Dose-Response Relationship: Determines the relationship between drug dose
and its biological effect, often modelled using logistic regression or nonlinear
techniques.
• Toxicological Studies: Uses statistics to evaluate potential toxic effects of new
drug candidates.
6. Post-Marketing Surveillance
• Pharmacovigilance: Statistical tools are used to detect and assess the
incidence of adverse events after a drug has been released to the market.
• Signal Detection: Analyses databases of adverse drug reactions to identify
potential safety concerns.
7. Genomics and Personalized Medicine
• Multivariate Statistics: Used to analyze complex biological data like gene
expression and genetic variants to identify drug responses.
• Biomarker Discovery: Statistical techniques help in identifying biomarkers that
predict drug efficacy and toxicity in different patient subgroups.
8. Regulatory Submissions
• Data Summarization: Statistics is used to summarize clinical trial data and
prepare the evidence for regulatory authorities (e.g., FDA, EMA).
• Meta-Analysis: Pools results from multiple studies to provide a broader
understanding of drug effects.
Population :
A population is a group of items, units or objects which is under reference of study. A population may consist of homogeneous units. The number of units in the population is denoted by 'N'.
Population can be broadly classified into two types. They are
1. Finite population
2. Infinite population
Finite population: A population which consists of a finite (countable) number of elements or units is said to be a finite population.
Example 1: Set of all natural numbers between 100 and 500.
Example 2: Population of a city.
Infinite population: A population which consists of an infinite number of elements or units is said to be an infinite population.
Example 1: Set of rational numbers between 1 and 10.
Example 2: Set of real numbers from 0 to 1.
Parameter : A constant which is measured from the population is said to be a parameter.
Example : Population mean μ, population variance σ², population standard deviation σ, population proportion P.
Sample : A subset of the population is known as a sample.
Random sample : A sample which is collected from the population in a random manner is known as a random sample.
Statistic : A constant which is measured from the sample is said to be a statistic.
Example : Sample mean x̄, sample variance s², sample standard deviation s, sample proportion p.
Sample Size Determination :
Sample size determination (or estimation) is the act of choosing the number of observations or replicates to include in a statistical sample. The sample size is an important feature of any empirical study in which the goal is to make inferences about a population from a sample. The sample is the set of elements drawn from the population, and the number of elements in the sample is denoted by 'n'.
Sampling methods fall into two broad types:
• Probability sampling: simple random, stratified, systematic, and cluster sampling.
• Non-probability sampling: snowball, quota, and convenience sampling.
Practical considerations also influence the achievable sample size. External factors include access to the sample, resources, time, personnel and their competences and experience, technical support, and measurement procedures. Internal factors include researcher characteristics, the aim of the research, the intended degree of generalisation, the research methodology, the research paradigm, and the researcher's motivation, interest, skills, and experience.
Key Factors Influencing Sample Size Determination:
1. Study Objective and Design:
o Type of Study: The sample size can vary depending on whether the
study is a clinical trial, observational study, or bioequivalence study.
o Endpoints: The primary outcome variable (e.g., survival time, response
rate) influences how many subjects are needed.
2. Power of the Study (1 − β):
o Power is the probability of detecting a true effect (usually set at 80% or
90%).
o Type II Error (β): The risk of not detecting a true difference when one
exists (typically set at 0.1 or 0.2).
3. Significance Level (α):
o The level of significance is the probability of a Type I error (rejecting the
null hypothesis when it is true), usually set at 5% (α = 0.05).
o This means that there's a 5% risk of finding a difference when none
exists.
4. Effect Size:
o The effect size is the minimum clinically or scientifically relevant
difference between groups.
o It can be based on previous studies or clinical judgment. Smaller effect
sizes require larger sample sizes to detect.
5. Variability in Data:
o Higher variability or standard deviation in the outcome measures
increases the required sample size.
o For example, drug response data with high interpatient variability will
need more subjects to identify a true effect.
6. One-Tailed or Two-Tailed Test:
o A two-tailed test requires a larger sample size than a one-tailed test
because it tests for effects in both directions (e.g., a drug being better
or worse).
o In contrast, a one-tailed test is more focused but generally less
conservative.
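The way power, significance level, effect size, and variability jointly drive sample size can be made concrete with the standard normal-approximation formula for comparing two means, n = 2(z_{α/2} + z_β)²σ²/Δ² per group. The following is a minimal sketch, not part of the original notes; the effect size Δ = 0.5 and σ = 1 are hypothetical illustration values, and scipy is assumed to be available:

```python
from math import ceil
from scipy.stats import norm

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-sided,
    two-sample comparison of means: n = 2*(z_{a/2} + z_b)^2 * sigma^2 / delta^2."""
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for two-sided alpha
    z_beta = norm.ppf(power)           # z corresponding to the desired power
    return ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)

# Hypothetical example: detect a difference of half a standard deviation
# (delta = 0.5, sigma = 1) with 80% power at the 5% significance level.
n = n_per_group(delta=0.5, sigma=1.0)
print(n)  # 63 per group
```

Note how a two-tailed α enters through z_{α/2}; a one-tailed test would use z_α instead, giving a somewhat smaller n, in line with point 6 above.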
DETERMINATION OF SAMPLE SIZE BY COCHRAN FORMULA:
Cochran’s formula is considered especially appropriate in situations with large
populations. A sample of any given size provides more information about a smaller
population than a larger one, so there’s a ‘correction’ through which the number given
by Cochran’s formula can be reduced if the whole population is relatively small.
The Cochran formula is:

n = Z²pq / e²

where
• Z is the critical value from the standard normal distribution for the desired confidence level,
• p is the (estimated) proportion of the population which has the attribute in question,
• q = 1 − p is the probability of failure,
• e is the desired level of precision (i.e., the margin of error).
The Z-value is found in a Z table. Cochran's formula treats the population as effectively infinite, and is used for discrete (proportion) data when the population size is unknown or very large; when the population size N is known and finite, Yamane's formula (below) is used instead.
Ex : Suppose we are doing a study on the inhabitants of a large town, and want to find
out how many households serve breakfast in the mornings. We don’t have much
information on the subject to begin with, so we’re going to assume that half of the
families serve breakfast: this gives us maximum variability. So p = 0.5. Now let’s say we
want 95% confidence, and at least 5 percent—plus or minus—precision. A 95 %
confidence level gives us Z values of 1.96, per the normal tables, so we get
n = Z²pq / e² = (1.96)² × 0.5 × 0.5 / (0.05)² = 384.16 ≈ 385

So a random sample of 385 households in our target population should be enough to give us the confidence level we need.
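The calculation above can be reproduced in a few lines. This is a minimal sketch of Cochran's formula, rounding up to the next whole subject:

```python
from math import ceil

def cochran_n(z, p, e):
    """Cochran's sample size for estimating a proportion: n = z^2 * p * q / e^2."""
    q = 1 - p  # probability of failure
    return ceil(z ** 2 * p * q / e ** 2)

# Worked example from the text: 95% confidence (z = 1.96),
# maximum variability p = 0.5, and a 5% margin of error.
n = cochran_n(z=1.96, p=0.5, e=0.05)
print(n)  # 385
```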
Yamane’s Formula :
Yamane’s method is a simplified formula for calculating sample size, often used
when determining the sample size for surveys or studies in social sciences. It provides
a quick way to estimate the required sample size from a known population size. The
formula is particularly useful when you don't have access to advanced tools or when
you're conducting preliminary calculations.
n = N / (1 + Ne²)

Where, n = the required sample size
N = population size
e = margin of error
Steps for Using Yamane’s Method:
1. Determine the Population Size (N):
o This is the total number of people or units in the population you're
studying. N must be known and finite (countable).
2. Decide the Margin of Error (e):
o Common choices for ‘e’ are 0.05 (5%) or 0.10 (10%), depending on how
precise you want your estimate to be.
o A smaller ‘e’ means more precision, requiring a larger sample size, while
a larger ‘e’ reduces precision and requires a smaller sample.
3. Apply the Formula:
o Plug in the values for ‘N’ (population size) and ‘e’ (margin of error) into
Yamane’s formula to calculate the required sample size ‘n’.
Example of Yamane’s Method:
Let’s say you want to conduct a survey in a population of 5,000 people, and
you are willing to accept a margin of error of 5% (0.05):
n = N / (1 + Ne²) = 5000 / (1 + 5000 × (0.05)²) = 5000 / (1 + 12.5) = 5000 / 13.5 = 370.37

The required sample size by Yamane's method, at a 5% margin of error, from a population of size 5000 is therefore about 370.
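The same calculation expressed as a small function (a sketch; the final rounding convention is a choice, and the text reports 370):

```python
def yamane_n(N, e):
    """Yamane's simplified formula: n = N / (1 + N * e^2)."""
    return N / (1 + N * e ** 2)

# Worked example from the text: population of 5000, 5% margin of error.
n = yamane_n(N=5000, e=0.05)
print(round(n, 2))  # 370.37
```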
Importance of Sample Size : Sample size is a crucial aspect of research because it
significantly influences the reliability, validity, and generalizability of the study's
findings. Here are several key reasons why sample size is important:
1. Sample size and sampling error are inversely proportional: a larger sample
size reduces the sampling error of the estimates.
2. Sample size directly affects the statistical power of a study, which is the ability
to detect a true effect if one exists. Small sample sizes can lead to Type II errors
(false negatives), where the study fails to detect an effect that is actually present.
3. A large sample size enhances the representativeness of the sample, ensuring it
more closely reflects the diversity of the target population. This reduces
selection bias and improves the generalizability of the findings to the broader
population.
4. With a larger sample size, the impact of outliers and extreme values on the
overall results diminishes. This leads to a more stable and reliable outcome, as
the sample mean and other statistics are less influenced by unusual data points.
5. Many statistical tests assume that the data comes from a large enough sample
to approximate a normal distribution (Central Limit Theorem). Insufficient
sample sizes may lead to non-normal distributions, making statistical tests less
valid or requiring alternative, less powerful non-parametric tests.
6. A larger sample size gives researchers greater confidence in their findings. For
example, with smaller samples, even if results are statistically significant, there
may be concerns about whether the findings are repeatable or whether they
occurred by chance.
Dropout rate :
The dropout rate is an estimate of the number of subjects who may leave the study/clinical trial for various reasons. Normally, the sample size calculation gives the number of study subjects required to achieve the target statistical significance for a given hypothesis. However, in clinical practice we may need to enrol more subjects to compensate for these potential dropouts. The adjusted sample size is

N1 = n / (1 − d)

where n is the computed sample size and d is the dropout rate.
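A short sketch of the adjustment (the n = 370 and d = 10% figures are hypothetical illustration values):

```python
from math import ceil

def adjust_for_dropout(n, d):
    """Inflate a computed sample size n for an expected dropout rate d:
    N1 = n / (1 - d), rounded up to a whole subject."""
    return ceil(n / (1 - d))

# Hypothetical example: 370 subjects required, 10% expected dropout.
n1 = adjust_for_dropout(n=370, d=0.10)
print(n1)  # 412
```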
Statistical hypothesis :
The statement formulated on statistical population are called statistical hypothesis. It
means a statistical hypothesis reveals the probability distribution of the population or the
parameter involved in it.
Suppose we are interested in examining two fertilizers A and B according to their yields.
In this context we can formulate the following statements about the fertilizers A and B:
1. Fertilizer A provides more yield than fertilizer B.
2. Fertilizer B provides more yield than fertilizer A.
3. There is no significant difference between the two fertilizers A and B with
respect to their yields.
The above three statements are called statistical hypotheses. The third statement is an
unbiased statement, whereas the remaining two statements show bias towards one of
the fertilizers.
Null hypothesis :
The statistical hypothesis which is to be tested is called the null hypothesis. It is denoted by H₀.
The null hypothesis should be neutral regarding the outcome of the test: it should be completely impartial and should not allow any personal views to influence the decision.
For example, suppose we have two educational systems A and B and we have to examine which system is better. In this context the null hypothesis will be framed as
H₀ : There is no significant difference between the two educational systems.
Alternate or alternative hypothesis :
The decision which we accept when the null hypothesis is rejected is called the alternative hypothesis.
The statistical hypothesis or statement which is made exactly opposite to the null hypothesis is called the alternative hypothesis. It is denoted by H₁.
Example: If the null hypothesis in testing unbiasedness (of a coin, say) is H₀ : p = 1/2, then the alternative hypothesis is H₁ : p ≠ 1/2 (or) H₁ : p > 1/2 (or) H₁ : p < 1/2.
Simple and composite hypothesis :
Simple hypothesis :
A Hypothesis which can specify all the parameters of the population completely is
known as simple hypothesis.
Example : Suppose we have a sample of n observations x₁, x₂, x₃, …, xₙ drawn from a normal population with mean μ and variance σ². Then H₀ : μ = μ₀ and σ² = σ₀² is known as a simple hypothesis.
Level of significance :
During the testing of a hypothesis it is necessary to select a suitable level of significance. The confidence with which a researcher rejects or accepts the null hypothesis depends on the significance level adopted. The probability of rejecting the null hypothesis even when it is true is known as the level of significance. It is denoted by α and is generally taken as 5%.
composite hypothesis :
A statistical Hypothesis which does not specify all the parameters of the population
completely is known as composite hypothesis.
Example : H₁ : μ < μ₀ (or) H₁ : μ > μ₀
H₁ : σ² < σ₀² (or) H₁ : σ² > σ₀²
Two types of errors :
Error may happen while decisions are being made with the help of the sample instead
of considering the total population in testing of hypothesis. In such cases the conclusions
drawn (accepting or rejecting null hypothesis) on the basis of the sample data may not be
true in all the cases. So we may commit two types of errors which are exhibited as follows
                        H₀ is True       H₀ is False
Rejection of H₀         Type-I Error     No Error
Acceptance of H₀        No Error         Type-II Error
Type –I Error :
Probability of rejection of null hypothesis when it is true is called Type-I error.
(or)
Probability of accepting the alternative hypothesis when it is false is known as Type-I error.
Type-I error is also known as producer's risk and it is denoted by α.

α = P(Rejection of H₀ / H₀ is True) = P(Accepting H₁ / H₁ is False) = P(x ∈ w / H₀) = ∫_w L₀ dx

where w is the critical (rejection) region and L₀ is the likelihood of the sample under H₀.
Type-II Error :
Probability of accepting the null hypothesis when it is false is called Type-II error.
(or)
Probability of rejecting the alternative hypothesis when it is true is known as Type-II error.
Type-II error is also known as consumer's risk and it is denoted by β.

β = P(Accepting H₀ / H₀ is False) = P(Rejecting H₁ / H₁ is True) = P(x ∈ w̄ / H₁) = ∫_w̄ L₁ dx

where w̄ is the acceptance region and L₁ is the likelihood of the sample under H₁.
Critical Region :
The critical region is also known as the rejection region.
Let x₁, x₂, x₃, …, xₙ be the n sample observations, regarded as a point in the sample space S. The sample space S can be divided into two disjoint sets w and w̄, so that S = w ∪ w̄ and w ∩ w̄ = ∅.
The null hypothesis H₀ is rejected if the observed sample point falls in w; if it falls in w̄, then we reject H₁ and accept the null hypothesis H₀.
Thus the critical region w is the region of the sample space in which H₀ is rejected even though it may be true. In other words, the critical region is a region such that an observation falling in it leads to the rejection of the null hypothesis, with a certain probability (the level of significance) of doing so when H₀ is true.
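The Type-I and Type-II error probabilities can be checked empirically. The following Monte Carlo sketch (a hypothetical setup, not part of the original notes; numpy and scipy are assumed available) draws many samples and records how often a one-sample t-test rejects H₀ when it is true (estimating α) and accepts H₀ when it is false (estimating β):

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)
alpha, n, reps = 0.05, 30, 2000

# Under H0 (true mean = 0): the rejection rate estimates the Type-I error.
rej_h0 = sum(ttest_1samp(rng.normal(0.0, 1.0, n), 0.0).pvalue < alpha
             for _ in range(reps)) / reps

# Under H1 (true mean = 0.5): the acceptance rate estimates the Type-II error.
acc_h1 = sum(ttest_1samp(rng.normal(0.5, 1.0, n), 0.0).pvalue >= alpha
             for _ in range(reps)) / reps

print(rej_h0)  # close to alpha = 0.05
print(acc_h1)  # estimates beta; 1 - acc_h1 estimates the power of the test
```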
Tests of significance :
A test of significance is a formal procedure for comparing observed data with a claim (also called a hypothesis) whose truth is being assessed. The claim is a statement about a parameter, like the population proportion p or the population mean μ. We express the results of a significance test in terms of a probability. Testing of significance is also known as "hypothesis testing".
In testing of hypothesis, a hypothesis is framed about the population parameter and then we have to test whether the framed hypothesis is correct or not. After framing the null hypothesis, our business is to take a decision about the rejection or acceptance of the null hypothesis. The methods which are used for this purpose are known as tests of significance.
Example : Suppose the bulbs of a company are manufactured by two methods, an old method and a new method. Here we have to decide which method increases the lifetime of the bulbs. In this case we can frame three statements:
• The old method is better than the new method.
• The new method is better than the old method.
• There is no significant difference between the old method and the new
method.
We have to decide which of the above statements is true, to make a decision about the method of manufacturing. The systematic procedure used for taking a decision about such statements is known as a test of significance.
Power of the Test :
1 − β, the probability of rejecting the null hypothesis H₀ when it is false, is known as the power of the test.
Thus,

β = P(Accepting H₀ / H₀ is False) = P(Rejecting H₁ / H₁ is True) = ∫_w̄ L₁ dx

and hence

Power = 1 − β = 1 − ∫_w̄ L₁ dx = ∫_w L₁ dx
One tailed and two tailed test :
In any test the critical region is represented by a portion of the area under the probability curve of the sampling distribution of the test statistic.
A test of any hypothesis where the alternative hypothesis H₁ is right-tailed is said to be a right-tailed test. Similarly, a test of any hypothesis where the alternative hypothesis H₁ is left-tailed is said to be a left-tailed test. A right-tailed or left-tailed test is known as a one-tailed test.
Example : In testing the mean of a population, the null hypothesis H₀ : μ = μ₀ is tested against the alternative hypothesis H₁ : μ > μ₀ (or) H₁ : μ < μ₀. Here H₁ : μ < μ₀ gives a left-tailed test and H₁ : μ > μ₀ gives a right-tailed test.
In a right-tailed test the critical region lies entirely in the right tail of the sampling distribution of the test statistic. Similarly, in a left-tailed test the critical region lies entirely in the left tail of the sampling distribution of the test statistic.
A test of hypothesis where the alternative hypothesis is two-tailed, such as H₀ : μ = μ₀ against H₁ : μ ≠ μ₀, is known as a two-tailed test. In a two-tailed test the critical region lies in both tails of the sampling distribution of the test statistic.
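The one-tailed and two-tailed critical values used in the worked problems of this module can be reproduced with software instead of a t-table; a short sketch, assuming scipy is available:

```python
from scipy.stats import t

alpha = 0.05
two_tailed_9 = t.ppf(1 - alpha / 2, df=9)   # two-tailed cutoff, 9 d.f.
one_tailed_25 = t.ppf(1 - alpha, df=25)     # right-tailed cutoff, 25 d.f.

print(round(two_tailed_9, 3))   # 2.262
print(round(one_tailed_25, 3))  # 1.708
```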
SMALL SAMPLE TESTS
Student’s t - Test :
If x₁, x₂, x₃, …, xₙ is a random sample of size n from a normal population with mean μ and variance σ², then the Student's t-statistic is defined as

t = (x̄ − μ) / (S/√n) = (x̄ − μ) / √(S²/n)

where x̄ = Σx/n is the sample mean, and

S² = (1/(n − 1)) Σ(xᵢ − x̄)²

is an unbiased estimate of the population variance σ². It follows Student's t-distribution with ν = n − 1 degrees of freedom, with probability density function given by

f(t) = [1 / (√ν B(1/2, ν/2))] (1 + t²/ν)^−(ν+1)/2 ,  −∞ < t < ∞

Alternatively, Student's t can also be written as

t = (x̄ − μ) / (s/√(n − 1))

where s² = (1/n) Σ(xᵢ − x̄)² is the sample variance, and ns² = (n − 1)S².
Assumptions for Student's t-test :
The following assumptions are made in the Student's t-test:
(i) The parent population from which the sample is drawn is normal.
(ii) The population observations are independent i.e., the given sample is random.
(iii) The sample standard deviation is unknown.
Applications of t - distribution :
The t - distribution has a number of applications in statistics. Some of them are
(i) t - test for significance of single sample mean, population variance being unknown.
(ii) t - test for significance of difference between two sample means, the population variances
being equal but unknown.
(iii) t - test for significance of an observed sample correlation coefficient.
t - test for Single mean :
Let x₁, x₂, x₃, …, xₙ be a random sample of size n drawn from a normal population, with sample mean x̄. To test the significant difference between the sample and population means, the null hypothesis can be taken as
H₀ : There is no significant difference between the sample and population means, i.e. H₀ : x̄ = μ.
To test the above null hypothesis the test statistic is given by

t = (x̄ − μ) / (S/√n)

where x̄ = Σx/n and S² = (1/(n − 1)) Σ(xᵢ − x̄)², and t follows Student's t-distribution with (n − 1) degrees of freedom.
Compare the calculated value of t with the critical value for (n − 1) degrees of freedom at a certain level of significance.
If |t_Cal| < t_Tab, then we can't reject the null hypothesis and we can say that there is no significant difference between the sample and population means.
If |t_Cal| > t_Tab, then we can reject the null hypothesis and we can say that there is a significant difference between the sample and population means.
95 % confidence limits for μ are x̄ ± t₀.₀₅ S/√n
99 % confidence limits for μ are x̄ ± t₀.₀₁ S/√n
Problem 1 : A machine is designed to produce insulating washers for electrical devices of an average thickness of 0.025 cm. A random sample of 10 washers was found to have an average thickness of 0.024 cm with a standard deviation of 0.002 cm. Test the significance of the deviation.
Solution : Given,
Sample size n = 10
Average thickness of washers in the sample x̄ = 0.024 cm
Sample standard deviation s = 0.002 cm
Average thickness of washers in the population μ = 0.025 cm
H₀ : There is no significant difference between the average thickness of washers in the sample and in the population.
To test the above null hypothesis, the test statistic is

t = (x̄ − μ) / (s/√(n − 1)) = (0.024 − 0.025) / (0.002/√9) = (−0.001 × 3)/0.002 = −1.5

The tabulated value of t is t₀.₀₅ for n − 1 = 10 − 1 = 9 degrees of freedom, i.e. 2.262.
Here |t_Cal| = 1.5 < 2.262, so we can't reject the null hypothesis.
∴ There is no significant difference between the average thickness of washers in the sample and in the population.
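The calculation can be verified directly from the summary figures; a minimal sketch using the text's t = (x̄ − μ)/(s/√(n − 1)) form:

```python
from math import sqrt

n, xbar, s, mu = 10, 0.024, 0.002, 0.025  # figures from Problem 1
t_cal = (xbar - mu) / (s / sqrt(n - 1))
print(round(t_cal, 2))  # -1.5; |t| < 2.262, so H0 is not rejected
```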
Problem 2 : A soap manufacturing company was distributing a particular brand of soap through a large
number of retail shops. Before a heavy advertisement campaign, the mean sale per week, per shop
was 140 dozens. After the campaign, a sample of 26 shops was taken and the mean sale was found to
be 147 dozens with standard deviation 16. Can you consider the advertisement effective?
Solution :
Given, number of shops taken into the sample n = 26
Average sale of soaps after the campaign in the sample x̄ = 147 dozens
Standard deviation of the sample s = 16
Average sales before the advertisement campaign μ = 140 dozens
H₀ : There is no significant difference between the sample and population means.
H₁ : x̄ > μ (right-tailed test)
To test the above null hypothesis,

t = (x̄ − μ) / (s/√(n − 1)) = (147 − 140) / (16/√25) = (7 × 5)/16 = 2.19

The tabulated value of t at the 5% level of significance (right-tailed test) is t₀.₀₅ for n − 1 = 26 − 1 = 25 degrees of freedom, i.e. 1.708.
Here t_Cal = 2.19 > 1.708, so we can reject the null hypothesis.
∴ The advertisement campaign is effective in increasing the sale of soaps.
Problem 3 : Certain pesticide is packed into bags by a machine. A random sample of 10 bags is drawn
and their contents are found to weigh ( in kgs) as follows
50, 49, 52, 44, 45, 48, 46, 45, 49, 45
Test if the average packing can be taken as 50 kg.
Solution :

x   | 50   | 49   | 52   | 44   | 45   | 48   | 46   | 45   | 49   | 45   | Σx = 473
x²  | 2500 | 2401 | 2704 | 1936 | 2025 | 2304 | 2116 | 2025 | 2401 | 2025 | Σx² = 22437

Sample size n = 10
Sample mean x̄ = Σx/n = 473/10 = 47.3 kg
Mean weight of a bag in the population μ = 50 kg
Sample standard deviation

s = √(Σx²/n − (Σx/n)²) = √(22437/10 − (473/10)²) = √(2243.7 − 2237.29) = √6.41 = 2.5318

H₀ : The average weight of a bag is 50 kg.
To test the above null hypothesis,

t = (x̄ − μ) / (s/√(n − 1)) = (47.3 − 50) / (2.5318/√9) = (−2.7 × 3)/2.5318 = −3.1993

The tabulated value of t at the 5% level of significance is t₀.₀₅ for n − 1 = 10 − 1 = 9 degrees of freedom, i.e. 2.262.
Here |t_Cal| = 3.1993 > 2.262, so we can reject the null hypothesis.
∴ The average weight of a bag differs significantly from 50 kg.
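The same test can be run on the raw data with scipy. Note that ttest_1samp uses the unbiased estimate S (divisor n − 1), and since s/√(n − 1) = S/√n, it yields exactly the same t value; its two-sided p-value (about 0.011) is below 0.05:

```python
from scipy.stats import ttest_1samp

weights = [50, 49, 52, 44, 45, 48, 46, 45, 49, 45]  # data from Problem 3
res = ttest_1samp(weights, popmean=50)
print(round(res.statistic, 4))  # -3.1993, matching the hand calculation
```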
t - Test for difference of means :
Let x₁₁, x₁₂, x₁₃, …, x₁ₙ₁ and x₂₁, x₂₂, x₂₃, …, x₂ₙ₂ be two random samples of sizes n₁ and n₂ drawn from a normal population.
To test the significant difference between the two sample means, the null hypothesis can be framed as
H₀ : There is no significant difference between the two sample means, i.e. H₀ : x̄₁ = x̄₂.
To test the above null hypothesis, the test statistic can be taken as

t = (x̄₁ − x̄₂) / (S √(1/n₁ + 1/n₂))  ~  t with (n₁ + n₂ − 2) degrees of freedom

where

x̄₁ = (1/n₁) Σᵢ x₁ᵢ ,  x̄₂ = (1/n₂) Σⱼ x₂ⱼ ,
s₁² = (1/n₁) Σᵢ (x₁ᵢ − x̄₁)² ,  s₂² = (1/n₂) Σⱼ (x₂ⱼ − x̄₂)² ,

and the pooled variance is

S² = (n₁s₁² + n₂s₂²) / (n₁ + n₂ − 2)

The statistic follows Student's t-distribution with (n₁ + n₂ − 2) degrees of freedom.
Compare the calculated value of t with the critical value for (n₁ + n₂ − 2) degrees of freedom at a certain level of significance.
If |t_Cal| < t_Tab, then we can't reject the null hypothesis and we can say that there is no significant difference between the two sample means.
If |t_Cal| > t_Tab, then we can reject the null hypothesis and we can say that there is a significant difference between the two sample means.
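The test statistic can be wrapped as a small function working from summary statistics, using the text's pooling convention S² = (n₁s₁² + n₂s₂²)/(n₁ + n₂ − 2) with divisor-n sample standard deviations (a sketch; the example values come from the worked problem that follows):

```python
from math import sqrt

def pooled_t(xbar1, xbar2, s1, s2, n1, n2):
    """Two-sample t from summary statistics; s1, s2 are the divisor-n
    sample standard deviations, pooled as S^2 = (n1*s1^2 + n2*s2^2)/(n1+n2-2)."""
    S2 = (n1 * s1 ** 2 + n2 * s2 ** 2) / (n1 + n2 - 2)
    return (xbar1 - xbar2) / sqrt(S2 * (1 / n1 + 1 / n2))

t_val = pooled_t(xbar1=200, xbar2=250, s1=20, s2=25, n1=25, n2=25)
print(round(t_val, 4))  # -7.6509
```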
Problem 1 : The average number of articles produced by two machines per day are 200 and 250 with
Standard deviations 20 and 25 respectively on the base of records of 25 days production. Can you
regard both the machines equally efficient at 5% level of significance.
Solution : Given,
Average number of articles produced by the first machine per day over 25 days x̄₁ = 200
Average number of articles produced by the second machine per day over 25 days x̄₂ = 250
Size of the first sample n₁ = 25
Size of the second sample n₂ = 25
Standard deviation of the first sample s₁ = 20
Standard deviation of the second sample s₂ = 25
H₀ : There is no significant difference between the production capacities of the two machines.

S² = (n₁s₁² + n₂s₂²)/(n₁ + n₂ − 2) = (25 × 20² + 25 × 25²)/(25 + 25 − 2) = (25 × 400 + 25 × 625)/48 = 25625/48 = 533.8542

To test the above null hypothesis,

t = (x̄₁ − x̄₂)/(S √(1/n₁ + 1/n₂)) = (200 − 250)/√(533.8542 × (1/25 + 1/25)) = −50/√(533.8542 × 0.08) = −50/6.5352 = −7.6509

The tabulated value t₀.₀₅ for n₁ + n₂ − 2 = 25 + 25 − 2 = 48 degrees of freedom is approximately 1.96 (from the normal tables, for large degrees of freedom).
Here the calculated value of |t| is greater than the critical value of t at the 5% level of significance, i.e., we can reject the null hypothesis.
∴ There is a significant difference between the production capacities of the two machines.
Problem 2 : The means of two random samples of sizes 9 and 7 are 196.42 and 198.82 respectively.
The sum of the squares of the deviations from the mean are 26.94 and 18.73 respectively. Can the
samples be considered to have been drawn from the same normal population ?
Solution :
Size of the first sample n₁ = 9
Size of the second sample n₂ = 7
Mean of the first sample x̄₁ = 196.42
Mean of the second sample x̄₂ = 198.82
Sum of squares of deviations from the mean in the first sample Σ(x₁ᵢ − x̄₁)² = 26.94
Sum of squares of deviations from the mean in the second sample Σ(x₂ⱼ − x̄₂)² = 18.73
We know that Σ(x₁ᵢ − x̄₁)² = n₁s₁² and Σ(x₂ⱼ − x̄₂)² = n₂s₂², so

S² = (n₁s₁² + n₂s₂²)/(n₁ + n₂ − 2) = (26.94 + 18.73)/(9 + 7 − 2) = 45.67/14 = 3.2621

H₀ : There is no significant difference between the means of the two samples.
To test the above null hypothesis,

t = (x̄₁ − x̄₂)/(S √(1/n₁ + 1/n₂)) = (196.42 − 198.82)/√(3.2621 × (1/9 + 1/7)) = −2.40/√(3.2621 × 0.254) = −2.40/0.9103 = −2.6365

|t_Cal| = 2.6365
The tabulated value t₀.₀₅ for n₁ + n₂ − 2 = 9 + 7 − 2 = 14 degrees of freedom is 2.15.
Here the calculated value of |t| is greater than the critical value of t at the 5% level of significance, i.e., we can reject the null hypothesis.
∴ There is a significant difference between the two sample means, so the samples cannot be considered to have been drawn from the same normal population.
Problem 3 : two different types of drugs A and B were tried on certain patients for increasing weight,
5 persons were given drug A and 7 persons were given drug B. The increase in weight in pounds are
given below.
Drug A 8 12 13 9 3
Drug B 10 8 12 15 6 8 11
Do the two drugs differ significantly with regard to their effect in increasing weight.
Solution : No of patients were given drug A is 5
1 =
n
No of patients were given drug B is 7
2 =
n
Mean of the first sample x̄₁ = Σx₁/n₁ = (8 + 12 + 13 + 9 + 3)/5 = 45/5 = 9
Mean of the second sample x̄₂ = Σx₂/n₂ = (10 + 8 + 12 + 15 + 6 + 8 + 11)/7 = 70/7 = 10
x₁ | x₁ − x̄₁ | (x₁ − x̄₁)² | x₂ | x₂ − x̄₂ | (x₂ − x̄₂)²
8 -1 1 10 0 0
12 3 9 8 -2 4
13 4 16 12 2 4
9 0 0 15 5 25
3 -6 36 6 -4 16
8 -2 4
11 1 1
Total: Σx₁ = 45, Σ(x₁ − x̄₁)² = 62, Σx₂ = 70, Σ(x₂ − x̄₂)² = 54
S² = [Σ(x₁ − x̄₁)² + Σ(x₂ − x̄₂)²]/(n₁ + n₂ − 2) = (62 + 54)/(5 + 7 − 2) = 116/10 = 11.6
H₀ : There is no significant difference between the increase in weight of patients taking drug A and drug B.
To test the above null hypothesis, the test statistic is
t = (x̄₁ − x̄₂)/√[S²(1/n₁ + 1/n₂)] = (9 − 10)/√[11.6 × (1/5 + 1/7)] = −1/√(11.6 × 0.3429) = −1/1.9944 = −0.5014
|t_Cal| = 0.5014
Tabulated value of t₀.₀₅ for n₁ + n₂ − 2 = 5 + 7 − 2 = 10 degrees of freedom is 2.23.
Here the calculated value of t is less than the critical value of t at the 5% level of significance, i.e., we can't reject the null hypothesis.
There is no significant difference between the increase in weight of patients taking drug A and drug B.
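With the raw observations available, the same pooled t-statistic can be computed directly; a minimal Python sketch (the helper name `pooled_t` is ours):

```python
import math

def pooled_t(x, y):
    """Pooled two-sample t-statistic, matching the hand computation above."""
    n1, n2 = len(x), len(y)
    m1, m2 = sum(x) / n1, sum(y) / n2
    ss1 = sum((v - m1) ** 2 for v in x)   # sum of squared deviations = 62
    ss2 = sum((v - m2) ** 2 for v in y)   # sum of squared deviations = 54
    s2 = (ss1 + ss2) / (n1 + n2 - 2)      # pooled variance S^2 = 11.6
    return (m1 - m2) / math.sqrt(s2 * (1 / n1 + 1 / n2))

drug_a = [8, 12, 13, 9, 3]
drug_b = [10, 8, 12, 15, 6, 8, 11]
t = pooled_t(drug_a, drug_b)              # about -0.5014, well inside the 2.23 bound
```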
PAIRED t- TEST :
This test is designed to examine whether the differences between the corresponding values of a sample at two levels are significant or not. If n pairs are considered as a sample, we can test the null hypothesis that there is no significant difference between the n paired observations.
This test can be applied only when sample pairs are available. Some of the applications of this test are to check whether students are benefitted through a particular type of coaching method, to check whether two types of food stuffs increase the weight of chicks, etc.
Suppose n chicks are selected at random. Let the initial weights of these chicks be x₁, x₂, x₃, ..., xₙ. Let these n chicks be fed with a particular brand of food stuff, and let their weights after feeding be y₁, y₂, y₃, ..., yₙ.
Hence we get n paired observations (x₁, y₁), (x₂, y₂), (x₃, y₃), ..., (xₙ, yₙ).
The null hypothesis can be considered as H₀ : μ₁ = μ₂.
The following are the steps to test the null hypothesis.
Step 1 : Calculate the difference dᵢ for each sample pair, i.e., dᵢ = xᵢ − yᵢ.
Step 2 : Calculate d̄, i.e., d̄ = Σdᵢ/n.
Step 3 : Calculate S², i.e., S² = Σ(dᵢ − d̄)²/(n − 1) = [Σdᵢ² − (Σdᵢ)²/n]/(n − 1).
Step 4 : Compute the test statistic t = d̄/√(S²/n).
This test statistic follows the t-distribution with n − 1 degrees of freedom.
t is to be calculated and compared with the tabulated value at the desired level of significance. If the calculated value of t is less than the critical value, then we can't reject the null hypothesis; otherwise we can reject the null hypothesis.
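The four steps can be collected into a small Python function; this is a sketch only (the name `paired_t` and the illustrative before/after data are ours, not from the notes):

```python
import math

def paired_t(x, y):
    """Paired t-statistic: d_i = x_i - y_i, S^2 = sum((d_i - dbar)^2)/(n-1),
    t = dbar / sqrt(S^2 / n), with n - 1 degrees of freedom."""
    n = len(x)
    d = [xi - yi for xi, yi in zip(x, y)]              # Step 1: differences
    d_bar = sum(d) / n                                 # Step 2: mean difference
    s2 = sum((di - d_bar) ** 2 for di in d) / (n - 1)  # Step 3: variance of d
    t = d_bar / math.sqrt(s2 / n)                      # Step 4: test statistic
    return t, n - 1

# hypothetical before/after scores, for illustration only
t, df = paired_t([12, 15, 11, 14], [14, 16, 12, 17])
```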
Problem 1 : The sales data of an item in six shops before and after a special promotional campaign are as under:
Shops A B C D E F
Before Campaign 53 28 31 48 50 42
After Campaign 58 29 30 55 56 45
Can the campaign be judged a success?
Solution : H₀ : There is no significant change in the sales after the special promotional campaign.
Shop | xᵢ | yᵢ | dᵢ = xᵢ − yᵢ | dᵢ²
A 53 58 -5 25
B 28 29 -1 1
C 31 30 1 1
D 48 55 -7 49
E 50 56 -6 36
F 42 45 -3 9
Total | −21 | 121
d̄ = Σdᵢ/n = −21/6 = −3.5
S² = [Σdᵢ² − (Σdᵢ)²/n]/(n − 1) = (1/5)[121 − (−21)²/6] = (1/5)(285/6) = (1/5)(47.5) = 9.5
To test the above null hypothesis, the test statistic is
t = d̄/√(S²/n) = −3.5/√(9.5/6) = −3.5/1.2583 = −2.7815
|t_Cal| = 2.7815
The tabulated value of t₀.₀₅ for n − 1 = 6 − 1 = 5 degrees of freedom is 2.02.
Here |t_Cal| = 2.7815 > 2.02, so we can reject the null hypothesis.
There is a significant change in the sales after the special promotional campaign.
Problem 2 : The results of an IQ test are given below. Find out whether there is any change in IQ after the training programme.
Candidate 1 2 3 4 5 6 7
IQ Before Training 112 120 116 125 131 132 129
IQ After Training 120 124 118 129 136 136 125
Solution : H₀ : There is no significant change in IQ after the training programme.
Candidate | xᵢ | yᵢ | dᵢ = xᵢ − yᵢ | dᵢ²
1 112 120 -8 64
2 120 124 -4 16
3 116 118 -2 4
4 125 129 -4 16
5 131 136 -5 25
6 132 136 -4 16
7 129 125 4 16
Total | −23 | 157
To test the above null hypothesis:
d̄ = Σdᵢ/n = −23/7 = −3.2857
S² = [Σdᵢ² − (Σdᵢ)²/n]/(n − 1) = (1/6)[157 − (−23)²/7] = (1/6)(570/7) = 13.5714
t = d̄/√(S²/n) = −3.2857/√(13.5714/7) = −3.2857/1.3924 = −2.3597
|t_Cal| = 2.3597
The tabulated value of t₀.₀₅ for n − 1 = 7 − 1 = 6 degrees of freedom is 2.45.
Here |t_Cal| = 2.3597 < 2.45, so we can't reject the null hypothesis.
There is no significant change in IQ after the training programme.
Problem 3 : A drug is given to 10 patients and the increments in their blood pressure were recorded
to be 3 , 6, -2, 4, -3, 4, 6, 3, 2, 2. Test whether the drug has any effect on the change of the blood
pressure.
Solution :
H₀ : There is no significant difference in the blood pressure readings of the patients before and after the drug.
dᵢ | dᵢ²
3 9
6 36
-2 4
4 16
-3 9
4 16
6 36
3 9
2 4
2 4
Total | 25 | 143
To test the above null hypothesis:
d̄ = Σdᵢ/n = 25/10 = 2.5
S² = [Σdᵢ² − (Σdᵢ)²/n]/(n − 1) = (1/9)[143 − (25)²/10] = (1/9)(80.5) = 8.9444
t = d̄/√(S²/n) = 2.5/√(8.9444/10) = 2.5/0.9457 = 2.6435
|t_Cal| = 2.6435
The tabulated value of t₀.₀₅ for n − 1 = 10 − 1 = 9 degrees of freedom is 2.26.
Here |t_Cal| = 2.6435 > 2.26, so we can reject the null hypothesis.
There is a significant difference in the blood pressure readings of the patients before and after the drug.
F – TEST FOR EQUALITY OF POPULATION VARIANCES :
Consider two samples x₁₁, x₁₂, x₁₃, ..., x₁n₁ and x₂₁, x₂₂, x₂₃, ..., x₂n₂ drawn from two normal populations.
Let the means and variances of the two populations be μ₁, σ₁² and μ₂, σ₂² respectively.
The ratio of two independent χ² variates divided by their corresponding degrees of freedom is known as the F statistic.
For testing the equality of the two population variances, the null hypothesis can be framed as
H₀ : σ₁² = σ₂² (there is no significant difference between the two variances).
To test the above null hypothesis, the test statistic is given by
F = S₁²/S₂² = [n₁s₁²/(n₁ − 1)] / [n₂s₂²/(n₂ − 1)] ~ F(n₁ − 1, n₂ − 1)
where s₁² = Σx₁ᵢ²/n₁ − (Σx₁ᵢ/n₁)² , s₂² = Σx₂ᵢ²/n₂ − (Σx₂ᵢ/n₂)² , and S₁² = n₁s₁²/(n₁ − 1), S₂² = n₂s₂²/(n₂ − 1).
The above test statistic follows the F-distribution with (n₁ − 1, n₂ − 1) degrees of freedom.
Compare the calculated value of F with the critical value at the desired level of significance.
If the calculated value of F is less than the critical value, then we can't reject the null hypothesis; otherwise we can reject the null hypothesis at the desired level of significance.
Applications or Uses of F- Test :
1. F- test for testing the significance of an observed sample multiple correlation.
2. F – test for testing the significance of an observed sample correlation ratio.
3. F – test for testing the linearity of regression.
4. F- test for testing the equality of several population means.
5. F – test for testing the significance of the difference between the standard deviations of two samples.
Problem 1 : The time taken by workers in performing a job by method – I and method – II is given
below
Method – I 20 16 26 27 23 22
Method – II 27 33 42 35 32 34 38
Do the data show that the variances of the time distributions of the populations from which these samples are drawn differ significantly?
Solution : H₀ : There is no significant difference between the variances of the time distributions of the workers performing the job by method – I and method – II.
Method – I | Method – II
x₁ | x₁² | x₂ | x₂²
20 400 27 729
16 256 33 1089
26 676 42 1764
27 729 35 1225
23 529 32 1024
22 484 34 1156
38 1444
Total | 134 | 3074 | 241 | 8431
Sample variance using method – I is s₁² = Σx₁²/n₁ − (Σx₁/n₁)² = 3074/6 − (134/6)² = 13.5556
Sample variance using method – II is s₂² = Σx₂²/n₂ − (Σx₂/n₂)² = 8431/7 − (241/7)² = 19.1020
S₁² = n₁s₁²/(n₁ − 1) = (6 × 13.5556)/5 = 16.2667
S₂² = n₂s₂²/(n₂ − 1) = (7 × 19.1020)/6 = 22.2857
Here S₂² > S₁²
To test the null hypothesis, the test statistic is F = S₂²/S₁² = 22.2857/16.2667 = 1.37
Tabulated value of F(n₂ − 1, n₁ − 1) = F(7 − 1, 6 − 1) = F(6, 5) at the 5% level is 4.95.
Since the calculated value of F is less than the tabulated value, we can't reject the null hypothesis.
There is no significant difference between the variances of the time distributions of the workers performing the job by method – I and method – II.
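The variance-ratio computation above can be sketched in Python (the helper name `f_statistic` is ours; it places the larger unbiased variance in the numerator, as done in the solution):

```python
def f_statistic(x, y):
    """F-ratio of the unbiased sample variances S^2 = sum((v - mean)^2)/(n - 1),
    with the larger variance in the numerator."""
    def s2(v):
        m = sum(v) / len(v)
        return sum((vi - m) ** 2 for vi in v) / (len(v) - 1)
    v1, v2 = s2(x), s2(y)
    return max(v1, v2) / min(v1, v2)

method_1 = [20, 16, 26, 27, 23, 22]
method_2 = [27, 33, 42, 35, 32, 34, 38]
f = f_statistic(method_1, method_2)   # about 22.2857 / 16.2667 = 1.37
```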
Problem 2 : Two horses A and B were tested according to the time ( in seconds) to run a particular
track with the following results
Horse A : 28 30 32 33 33 29 34
Horse B : 29 30 30 24 27 29
Test whether the two horses have the same running capacity.
Solution : H₀ : The two horses A and B have the same running capacity.
Size of the first sample n₁ = 7
Size of the second sample n₂ = 6
Horse A | Horse B
x₁ | x₁² | x₂ | x₂²
28 784 29 841
30 900 30 900
32 1024 30 900
33 1089 24 576
33 1089 27 729
29 841 29 841
34 1156
Total | 219 | 6883 | 169 | 4787
Sample variance for horse A is s₁² = Σx₁²/n₁ − (Σx₁/n₁)² = 6883/7 − (219/7)² = 4.4898
Sample variance for horse B is s₂² = Σx₂²/n₂ − (Σx₂/n₂)² = 4787/6 − (169/6)² = 4.4722
S₁² = n₁s₁²/(n₁ − 1) = (7 × 4.4898)/6 = 5.2381
S₂² = n₂s₂²/(n₂ − 1) = (6 × 4.4722)/5 = 5.3666
Here S₂² > S₁²
To test the null hypothesis, the test statistic is F = S₂²/S₁² = 5.3666/5.2381 = 1.0245
Tabulated value of F(n₂ − 1, n₁ − 1) = F(6 − 1, 7 − 1) = F(5, 6) at the 5% level is 4.39.
Since the calculated value of F is less than the tabulated value, we can't reject the null hypothesis.
The two horses A and B have the same running capacity.
Problem 3 : In a sample of 8 observations, the sum of squares of deviations of items from their mean
was 94.5. In another sample of 10 observations, the value was found to be 101.7. Test whether the
difference is significant at 5 % level ?
Solution : H₀ : There is no significant difference between the two samples (i.e., between the two population variances).
Size of the first sample n₁ = 8
Size of the second sample n₂ = 10
Sum of squares of deviations from the mean in the first sample Σ(x₁ − x̄₁)² = 94.5
Sum of squares of deviations from the mean in the second sample Σ(x₂ − x̄₂)² = 101.7
S₁² = Σ(x₁ − x̄₁)²/(n₁ − 1) = 94.5/(8 − 1) = 13.5
S₂² = Σ(x₂ − x̄₂)²/(n₂ − 1) = 101.7/(10 − 1) = 11.3
Here S₁² > S₂²
To test the null hypothesis, the test statistic is F = S₁²/S₂² = 13.5/11.3 = 1.1947
Tabulated value of F(n₁ − 1, n₂ − 1) = F(8 − 1, 10 − 1) = F(7, 9) at the 5% level is 3.29.
Since the calculated value of F is less than the tabulated value, we can't reject the null hypothesis.
There is no significant difference between the two samples.
χ² - Distribution :
The χ² - distribution was discovered by Prof. Helmert in 1875 and was developed by Karl Pearson in 1900. Karl Pearson applied the χ² - distribution as a test of goodness of fit.
Definition of a χ² - variate :
The square of a standard normal variate is defined as a χ² - variate.
Let x ~ N(μ, σ²); then Z = (x − μ)/σ ~ N(0, 1), and z² = [(x − μ)/σ]² is a χ² - variate with one degree of freedom.
In general, if x₁, x₂, x₃, ..., xₙ are n independent normal variates with means μᵢ and variances σᵢ² (i = 1, 2, ..., n), then
χ² = Σᵢ₌₁ⁿ [(xᵢ − μᵢ)/σᵢ]² = Σᵢ₌₁ⁿ zᵢ²
is a χ² - variate with n degrees of freedom.
Chi- Square Test :
The t- and F-tests were based on the assumption that the samples were drawn from normal populations. However, there are many situations in which it is not possible to make any dependable assumption about the parent distribution from which the samples have been drawn. This led to the development of a group of alternative techniques known as non-parametric or distribution-free methods.
The chi-square test was first used by Karl Pearson in the year 1900. The χ² statistic describes the magnitude of the discrepancy between theory and observation.
Applications of Chi – Square distribution :
Chi- square distribution has a number of applications, some of which are enumerated
below
i. Chi- square test of goodness of fit.
ii. χ² - test for independence of attributes.
iii. χ² - test for a specified value of the population variance σ².
Chi – Square test for goodness of fit :
Suppose we are given a set of observed frequencies obtained under some experiment, and we want to test if the experimental results support a particular hypothesis or theory. Karl Pearson developed a test, called the test of goodness of fit, for testing the significance of the difference between the experimental values and the theoretical values.
Steps for computation of χ² and drawing the conclusions :
Step 1 : Compute the expected frequencies E₁, E₂, E₃, ..., Eₙ corresponding to the observed frequencies O₁, O₂, O₃, ..., Oₙ under some theory or hypothesis.
Step 2 : Compute the deviations (Oᵢ − Eᵢ) for each frequency and then square them to obtain (Oᵢ − Eᵢ)².
Step 3 : Divide the square of the deviations by the corresponding expected frequency to obtain (Oᵢ − Eᵢ)²/Eᵢ.
Step 4 : Add the values obtained in Step 3 to compute χ² = Σ (Oᵢ − Eᵢ)²/Eᵢ.
Step 5 : Look up the tabulated value of χ² for (n − 1) degrees of freedom at a certain level of significance, usually 5% or 1%, from the table of significant values of χ².
Step 6 : If the calculated value of χ² is less than the corresponding tabulated value, then it is said to be non-significant at the required level of significance and we may conclude that there is good correspondence between theory and experiment.
Step 7 : If the calculated value of χ² is greater than the tabulated value, it is said to be significant and we may conclude that the experiment does not support the theory.
Conditions for validity of the Chi – Square test :
The chi-square test statistic can be used only if the following conditions are satisfied.
1. N, the total frequency, should be large.
2. The sample observations should be independent.
3. The Total expected frequency must be equal to the total observed frequency.
4. No theoretical frequency should be small. If any theoretical frequency is less than 5, then we cannot apply the χ² test directly (such classes are pooled with adjacent classes).
Problem 1 : The number of automobile accidents per week in a certain community were as
follows 12, 8, 20, 2, 14, 10, 15, 6, 9, 4
Are these frequencies in agreement with the belief that accident conditions were the same during the 10 week period?
Solution : Given sample size n = 10
Total number of accidents in the 10 week period = 12 + 8 + 20 + 2 + 14 + 10 + 15 + 6 + 9 + 4 = 100
Average number of accidents per week in the given 10 week period = 100/10 = 10
H₀ : Accident conditions were the same during the 10 week period.
Week | Observed Frequency Oᵢ | Expected Frequency Eᵢ | Oᵢ − Eᵢ | (Oᵢ − Eᵢ)² | (Oᵢ − Eᵢ)²/Eᵢ
1 12 10 2 4 0.4
2 8 10 -2 4 0.4
3 20 10 10 100 10
4 2 10 -8 64 6.4
5 14 10 4 16 1.6
6 10 10 0 0 0
7 15 10 5 25 2.5
8 6 10 -4 16 1.6
9 9 10 -1 1 0.1
10 4 10 -6 36 3.6
Total 100 100 26.6
To test the above null hypothesis, the test statistic is
χ² = Σ (Oᵢ − Eᵢ)²/Eᵢ = 26.6
Degrees of freedom = 10 − 1 = 9
Tabulated value of χ²₀.₀₅ for 9 degrees of freedom is 16.919.
Here the calculated value of χ² is greater than the tabulated value of χ². Hence we can reject the null hypothesis.
Accident conditions were not the same in the given 10 week period.
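The goodness-of-fit statistic for this problem can be reproduced with a few lines of Python (a minimal sketch; the helper name `chi_square` is ours):

```python
def chi_square(observed, expected):
    """Goodness-of-fit statistic: chi2 = sum of (O_i - E_i)^2 / E_i."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

accidents = [12, 8, 20, 2, 14, 10, 15, 6, 9, 4]
expected = [sum(accidents) / len(accidents)] * len(accidents)  # 10 per week
chi2 = chi_square(accidents, expected)  # 26.6, against the 5% value 16.919 for 9 d.f.
```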
Problem 2 : In a Mendelian experiment on breeding, four types of plants are expected to occur in the proportion 9 : 3 : 3 : 1. The observed frequencies are 891 round and yellow, 316 wrinkled and yellow, 290 round and green and 119 wrinkled and green. Find the chi-square value and examine the correspondence between the theory and the experiment.
Solution :
H₀ : In a Mendelian experiment on breeding, there is no significant difference between the theoretical and observed frequencies.
Total number of observed plants = 891 + 316 + 290 + 119 = 1616
Given, the four types of plants are expected to occur in the proportion 9 : 3 : 3 : 1.
Expected frequency of round and yellow = [9/(9 + 3 + 3 + 1)] × 1616 = (9/16) × 1616 = 909
Expected frequency of wrinkled and yellow = (3/16) × 1616 = 303
Expected frequency of round and green = (3/16) × 1616 = 303
Expected frequency of wrinkled and green = (1/16) × 1616 = 101
Breed | Observed Frequency Oᵢ | Expected Frequency Eᵢ | Oᵢ − Eᵢ | (Oᵢ − Eᵢ)² | (Oᵢ − Eᵢ)²/Eᵢ
Round and yellow 891 909 -18 324 0.3565
Wrinkled and yellow 316 303 13 169 0.5578
Round and green 290 303 -13 169 0.5578
Wrinkled and green 119 101 18 324 3.2079
Total 1616 1616 4.6799
To test the above null hypothesis, the test statistic is
χ² = Σ (Oᵢ − Eᵢ)²/Eᵢ = 4.6799
Degrees of freedom = 4 − 1 = 3
Tabulated value of χ²₀.₀₅ for 3 degrees of freedom is 7.81.
Here the calculated value of χ² is less than the tabulated value of χ². Hence we cannot reject the null hypothesis.
In a Mendelian experiment on breeding, there is no significant difference between the theoretical and observed frequencies.
Fitting the Binomial distribution and testing the goodness of fit :
Let O₁, O₂, O₃, ..., Oₙ be the n observed frequencies for the random variable under consideration. By using these observed frequencies we fit a binomial distribution
P(X = x) = nCx pˣ qⁿ⁻ˣ ; x = 0, 1, 2, 3, ..., n
and hence calculate the expected frequencies E₁, E₂, E₃, ..., Eₙ.
Here the total expected frequency must be equal to the total observed frequency, i.e., ΣEᵢ = ΣOᵢ.
To test the significant difference between the observed and expected frequencies, the null hypothesis can be framed as
H₀ : There is no significant difference between the observed and expected frequencies, i.e., the binomial distribution is a good fit to the given (observed) frequencies.
To test the above null hypothesis, the test statistic is
χ² = Σ (Oᵢ − Eᵢ)²/Eᵢ
The above statistic follows the χ² - distribution with n − k − 1 degrees of freedom, where k is the number of parameters estimated from the data.
Compare the calculated value of χ² with the critical or tabulated value of χ².
If χ²_Cal > χ²_Tab, we can reject the null hypothesis and conclude that the binomial distribution is not a good fit to the given data; otherwise we cannot reject the null hypothesis and conclude that the binomial distribution holds good for the given data.
Problem 1 : Records taken of the number of male and female births in 800 families having
four children are given below.
No. of male births | No. of female births | No. of families (Frequency)
0 | 4 | 32
1 | 3 | 178
2 | 2 | 290
3 | 1 | 236
4 | 0 | 64
Test whether the data are consistent with the hypothesis that the binomial law holds and the chance of a male birth is equal to that of a female birth.
Solution :
H₀ : The probability of a male birth and that of a female birth are equal, and the binomial law holds.
From the data we consider that the probabilities of male and female births are equal, i.e., p = q.
Probability of a male birth p = 1/2
Probability of a female birth q = 1/2
The fitted binomial distribution for the given data is
P(X = x) = 4Cx (1/2)ˣ (1/2)⁴⁻ˣ = 4Cx (1/2)⁴ ; x = 0, 1, 2, 3, 4 → (1)
Put x = 0 in equation (1): P(X = 0) = 4C0 (1/2)⁴ = 1/16 = 0.0625
Put x = 1 in equation (1): P(X = 1) = 4C1 (1/2)⁴ = 4/16 = 0.25
Put x = 2 in equation (1): P(X = 2) = 4C2 (1/2)⁴ = 6/16 = 0.375
Put x = 3 in equation (1): P(X = 3) = 4C3 (1/2)⁴ = 4/16 = 0.25
Put x = 4 in equation (1): P(X = 4) = 4C4 (1/2)⁴ = 1/16 = 0.0625
Expected frequencies E(X = x) = N × P(X = x) :
E(X = 0) = 800 × 0.0625 = 50
E(X = 1) = 800 × 0.25 = 200
E(X = 2) = 800 × 0.375 = 300
E(X = 3) = 800 × 0.25 = 200
E(X = 4) = 800 × 0.0625 = 50
xᵢ | Oᵢ | Eᵢ | Oᵢ − Eᵢ | (Oᵢ − Eᵢ)² | (Oᵢ − Eᵢ)²/Eᵢ
0 32 50 -18 324 6.48
1 178 200 -22 484 2.42
2 290 300 -10 100 0.3333
3 236 200 36 1296 6.48
4 64 50 14 196 3.92
Total | 800 | 800 | | | 19.6333
To test the above null hypothesis, the test statistic is
χ² = Σ (Oᵢ − Eᵢ)²/Eᵢ = 19.6333
Degrees of freedom = 5 − 1 = 4, and the tabulated value of χ²₀.₀₅ for 4 degrees of freedom is 9.488.
Here χ²_Cal = 19.6333 > χ²_Tab = 9.488, so we can reject the null hypothesis.
The male and female births are not equally probable, and the binomial distribution is not a good fit for the given data.
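Fitting the binomial and computing the statistic can be sketched in Python (the helper name `binomial_expected` is our own):

```python
from math import comb

def binomial_expected(n, p, total):
    """Expected class frequencies N * P(X = x) for X ~ Bin(n, p)."""
    return [total * comb(n, x) * p ** x * (1 - p) ** (n - x)
            for x in range(n + 1)]

observed = [32, 178, 290, 236, 64]         # families with 0..4 male births
expected = binomial_expected(4, 0.5, 800)  # [50, 200, 300, 200, 50]
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
# chi2 is about 19.63 > 9.488, so the binomial law with p = 1/2 is rejected
```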
Fitting the Poisson distribution and testing the goodness of fit :
Let O₁, O₂, O₃, ..., Oₙ be the n observed frequencies for the random variable under consideration. By using these observed frequencies we fit a Poisson distribution
P(X = x) = e⁻λ λˣ/x! ; x = 0, 1, 2, 3, ...
and hence calculate the expected frequencies E₁, E₂, E₃, ..., Eₙ.
Here the total expected frequency must be equal to the total observed frequency, i.e., ΣEᵢ = ΣOᵢ.
To test the significant difference between the observed and expected frequencies, the null hypothesis can be framed as
H₀ : There is no significant difference between the observed and expected frequencies, i.e., the Poisson distribution is a good fit to the given (observed) frequencies.
To test the above null hypothesis, the test statistic is χ² = Σ (Oᵢ − Eᵢ)²/Eᵢ, which follows the χ² - distribution with n − k − 1 degrees of freedom.
Compare the calculated value of χ² with the critical or tabulated value of χ².
If χ²_Cal > χ²_Tab, we can reject the null hypothesis and conclude that the Poisson distribution is not a good fit to the given data; otherwise we cannot reject the null hypothesis and conclude that the Poisson distribution holds good for the given data.
Problem : The following mistakes per page were observed in a book.
No. of mistakes per page 0 1 2 3 4 Total
No. of pages 211 90 19 5 0 325
Fit a Poisson distribution and test the goodness of fit.
Solution :
H₀ : The Poisson distribution is a good fit to the given data.
Let x be a random variable denoting the number of mistakes per page.
x | f | fx
0 211 0
1 90 90
2 19 38
3 5 15
4 0 0
Total | 325 | 143
Average number of mistakes per page x̄ = Σfx/N = 143/325 = 0.44
In a Poisson distribution the mean is λ, so λ = 0.44.
The fitted Poisson distribution for the given data is
P(X = x) = e^(−0.44) (0.44)ˣ/x! ; x = 0, 1, 2, 3, 4
Put x = 0 : P(X = 0) = e^(−0.44) = 0.6440
Put x = 1 : P(X = 1) = e^(−0.44) (0.44) = 0.2834
Put x = 2 : P(X = 2) = e^(−0.44) (0.44)²/2! = 0.0623
Put x = 3 : P(X = 3) = e^(−0.44) (0.44)³/3! = 0.0091
Put x = 4 : P(X = 4) = e^(−0.44) (0.44)⁴/4! = 0.0010
Expected frequencies E(X = x) = N × P(X = x) :
E(X = 0) = 325 × 0.6440 = 209.3
E(X = 1) = 325 × 0.2834 = 92.1
E(X = 2) = 325 × 0.0623 = 20.3
E(X = 3) = 325 × 0.0091 = 3.0
E(X = 4) = 325 × 0.0010 = 0.3
Observed Frequency Oᵢ | Expected Frequency Eᵢ | Oᵢ − Eᵢ | (Oᵢ − Eᵢ)² | (Oᵢ − Eᵢ)²/Eᵢ
211 | 209.3 | 1.7 | 2.89 | 0.01381
90 | 92.1 | −2.1 | 4.41 | 0.04788
19 + 5 + 0 = 24 | 20.3 + 3.0 + 0.3 = 23.6 | 0.4 | 0.16 | 0.00678
Total 325 | 325 | | | 0.06847
(The last three classes are pooled because their expected frequencies are less than 5.)
To test the above null hypothesis, the test statistic is
χ² = Σ (Oᵢ − Eᵢ)²/Eᵢ = 0.06847
After pooling there are 3 classes and one parameter (λ) was estimated, so the degrees of freedom = 3 − 1 − 1 = 1.
Tabulated value of χ²₀.₀₅ for 1 degree of freedom is 3.841.
Here the calculated value of χ² is less than the tabulated value of χ². Hence we cannot reject the null hypothesis.
We can conclude that the Poisson distribution is a good fit to the given data.
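The whole fit, including the pooling of the small classes, can be sketched in Python; with unrounded probabilities the statistic comes out near 0.07 (the notes obtain 0.068 from rounded table values), and the conclusion is unchanged. The helper name `poisson_expected` is ours:

```python
import math

def poisson_expected(lam, n_classes, total):
    """Expected frequencies N * e^{-lam} * lam^x / x! for x = 0 .. n_classes-1."""
    return [total * math.exp(-lam) * lam ** x / math.factorial(x)
            for x in range(n_classes)]

observed = [211, 90, 19, 5, 0]
lam = sum(x * f for x, f in enumerate(observed)) / sum(observed)  # 143/325 = 0.44
expected = poisson_expected(lam, 5, sum(observed))

# pool the last three classes, whose expected frequencies fall below 5
obs_pooled = observed[:2] + [sum(observed[2:])]
exp_pooled = expected[:2] + [sum(expected[2:])]
chi2 = sum((o - e) ** 2 / e for o, e in zip(obs_pooled, exp_pooled))
# chi2 is about 0.07 < 3.841, so the Poisson fit is accepted
```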
χ² - Test for independence of attributes :
Let us suppose that the given population consisting of N items is divided into r mutually disjoint (exclusive) and exhaustive classes A₁, A₂, A₃, ..., Aᵣ with respect to the attribute A. Similarly, let us suppose that the same population is divided into s mutually disjoint and exhaustive classes B₁, B₂, B₃, ..., Bₛ with respect to the attribute B. The observed frequencies can be represented in the following r × s manifold contingency table.
      B₁      B₂      ……  Bⱼ      ……  Bₛ      Total
A₁    (A₁B₁)  (A₁B₂)  ……  (A₁Bⱼ)  ……  (A₁Bₛ)  (A₁)
A₂    (A₂B₁)  (A₂B₂)  ……  (A₂Bⱼ)  ……  (A₂Bₛ)  (A₂)
:
Aᵢ    (AᵢB₁)  (AᵢB₂)  ……  (AᵢBⱼ)  ……  (AᵢBₛ)  (Aᵢ)
:
Aᵣ    (AᵣB₁)  (AᵣB₂)  ……  (AᵣBⱼ)  ……  (AᵣBₛ)  (Aᵣ)
Total (B₁)    (B₂)    ……  (Bⱼ)    ……  (Bₛ)    N
Where (Aᵢ) is the frequency of the attribute Aᵢ, i.e., it is the number of persons possessing the attribute Aᵢ (i = 1, 2, 3, ..., r); (Bⱼ) is the number of persons possessing the attribute Bⱼ (j = 1, 2, 3, ..., s); and (AᵢBⱼ) is the number of persons possessing both the attributes Aᵢ and Bⱼ (i = 1, 2, ..., r ; j = 1, 2, ..., s).
Here, Σⱼ (AᵢBⱼ) = (Aᵢ), Σᵢ (AᵢBⱼ) = (Bⱼ), and Σᵢ Σⱼ (AᵢBⱼ) = N.
H₀ : The attributes A and B are independent.
Under H₀, the expected frequency of each cell is E(AᵢBⱼ) = (Aᵢ)(Bⱼ)/N.
The χ² - test statistic is given by
χ² = Σᵢ Σⱼ (Oᵢⱼ − Eᵢⱼ)²/Eᵢⱼ
Here the statistic follows the χ² - distribution with (r − 1)(s − 1) degrees of freedom.
Compare the calculated value of χ² with the critical or tabulated value of χ².
If χ²_Cal > χ²_Tab, we can reject the null hypothesis; otherwise we cannot reject the null hypothesis.
Problem : A certain drug was administered to 456 males out of a total 720 in a certain locality
to test its efficacy against typhoid. The incidence of typhoid is shown below. Find out the
effectiveness of the drug against the disease.
Infection No infection Total
Administering the drug 144 312 456
Without Administering the drug 192 72 264
Total 336 384 720
Solution :
H₀ : Incidence of typhoid and the administration of the drug are independent.
By the independence of attributes, the expected frequencies are
E(144) = (456 × 336)/720 = 212.8
E(192) = (264 × 336)/720 = 123.2
E(312) = (456 × 384)/720 = 243.2
E(72) = (264 × 384)/720 = 140.8
Observed Frequency Oᵢⱼ | Expected Frequency Eᵢⱼ | Oᵢⱼ − Eᵢⱼ | (Oᵢⱼ − Eᵢⱼ)² | (Oᵢⱼ − Eᵢⱼ)²/Eᵢⱼ
144 212.8 -68.8 4733.44 22.2436
192 123.2 68.8 4733.44 38.4208
312 243.2 68.8 4733.44 19.4632
72 140.8 -68.8 4733.44 33.6182
Total | | | | 113.7458
To test the above null hypothesis, the test statistic is
χ² = Σ (Oᵢⱼ − Eᵢⱼ)²/Eᵢⱼ = 113.7458
Tabulated value of χ²₀.₀₅ for (r − 1)(s − 1) = (2 − 1)(2 − 1) = 1 degree of freedom is 3.841.
Here the calculated value of χ² is greater than the tabulated value of χ². Hence we can reject the null hypothesis.
We can say that the incidence of typhoid and the administration of the drug are not independent, i.e., the drug is effective against the disease.
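The 2 × 2 computation generalises to any r × s table; a minimal Python sketch (the helper name `chi_square_independence` is ours):

```python
def chi_square_independence(table):
    """chi2 for an r x s contingency table,
    with E_ij = (row total * column total) / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected frequency
            chi2 += (o - e) ** 2 / e
    return chi2

# infection / no infection, with and without administering the drug
typhoid = [[144, 312],
           [192, 72]]
chi2 = chi_square_independence(typhoid)  # about 113.75 > 3.841, so dependent
```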
Problem 2 : Data on hair colour and eye colour are given in the table. Calculate the χ² value and determine the association between hair colour and eye colour.
Hair Colour
Total
Fair Brown Black
Eye
Colour
Blue 15 20 5 40
Gray 20 20 10 50
Brown 25 20 15 60
Total 60 60 30 150
Solution :
H₀ : Hair colour and eye colour are independent.
By the independence of attributes, the expected frequencies are
E(15) = (40 × 60)/150 = 16   E(20) = (40 × 60)/150 = 16   E(5) = (40 × 30)/150 = 8
E(20) = (50 × 60)/150 = 20   E(20) = (50 × 60)/150 = 20   E(10) = (50 × 30)/150 = 10
E(25) = (60 × 60)/150 = 24   E(20) = (60 × 60)/150 = 24   E(15) = (60 × 30)/150 = 12
Observed Frequency Oᵢⱼ | Expected Frequency Eᵢⱼ | Oᵢⱼ − Eᵢⱼ | (Oᵢⱼ − Eᵢⱼ)² | (Oᵢⱼ − Eᵢⱼ)²/Eᵢⱼ
15 16 -1 1 0.0625
20 16 4 16 1
5 8 -3 9 1.125
20 20 0 0 0
20 20 0 0 0
10 10 0 0 0
25 24 1 1 0.0417
20 24 -4 16 0.6667
15 12 3 9 0.75
Total | | | | 3.6459
To test the above null hypothesis, the test statistic is
χ² = Σ (Oᵢⱼ − Eᵢⱼ)²/Eᵢⱼ = 3.6459
Tabulated value of χ²₀.₀₅ for (r − 1)(s − 1) = (3 − 1)(3 − 1) = 4 degrees of freedom is 9.49.
Here the calculated value of χ² is less than the tabulated value of χ². We cannot reject the null hypothesis. We can say that eye colour and hair colour are independent.
Non-Parametric Test
Introduction:
Most of the statistical tests we have discussed are based on the following two features.
1. The form of the frequency function of the parent population from which the samples are drawn.
2. They are concerned with testing statistical hypotheses about the parameters of a frequency distribution.
For example, all the exact sampling tests of significance are based on the assumption that the parent population is normal, and deal with estimating the means and variances of these populations. Such tests, which deal with the parameters of a population, are known as parametric tests.
A non-parametric test is a test that does not depend on the particular form of the basic frequency function from which the samples are drawn.
In other words, a non-parametric test does not make any assumption regarding the form of the population. In a non-parametric test we have to make only the following mild assumptions:
i. Sample observations are independent.
ii. The variables are taken as continuous.
iii. Probability density function is also continuous.
Advantages and disadvantages of non-parametric tests over parametric tests:
Advantages:
➢ Non-parametric tests are very simple and easy to apply compared to parametric tests.
➢ There is no need to make an assumption about the form of the frequency function of the parent population from which the sample is drawn.
➢ Non-parametric tests are useful for dealing with data which contain ranks.
Disadvantages:
➢ Non-parametric tests can be used only if the measurements are ordinal. Even in such cases, if a parametric test exists, the non-parametric test does not give results as effective as the parametric test.
➢ Non-parametric tests do not exist for testing interaction effects in the analysis of variance (ANOVA).
➢ Non-parametric tests are designed to test statistical hypotheses only; they cannot be applied for estimating the parameters.
Difference between parametric and non-parametric tests:
Parametric Test | Non-Parametric Test
1. In a parametric test the test statistic specifies certain conditions about the parameters. | 1. In a non-parametric test the statistic does not specify any conditions about the parameters.
2. Parametric tests require measurements of the variables. | 2. Non-parametric tests do not require measurements of the variables.
3. In a parametric test the sample must be specified and classified exactly. | 3. In a non-parametric test the sample data may not be classified exactly.
4. In parametric tests the parent distribution of the population needs to be known. | 4. In non-parametric tests there is no need to know the parent distribution of the population.
5. Parametric tests cannot be used for data consisting only of signs. | 5. Non-parametric tests can be used for data which contain ranks with positive or negative signs.
Run test:
A run is a sequence of one or more identical symbols representing a common property of the data.
Let x₁, x₂, x₃, ..., xₙ be the values of a random sample drawn from a population whose distribution function is unknown.
To test the randomness of the sample from the population, the null hypothesis can be considered as
H₀ : The given sample observations are random,
against the alternative hypothesis
H₁ : The given sample observations are not independent (are not random).
To test the above null hypothesis, the test statistic is given by
Z = [r − (n + 2)/2] / √[n(n − 2)/(4(n − 1))] ~ N(0, 1), where r = number of runs and n = sample size.
The above test statistic gives the calculated value of Z. Compare the calculated value of Z with the tabulated or critical value.
If |Z_Cal| is greater than Z_Tab at the specified level of significance, then we can infer that the sample observations are not random; otherwise we can say that the sample observations are random.
Problem : The wins and losses of a cricket team in 30 matches are W W L L L W W L L L W W W W L W L L L W W L W W W L W L W W. Are the observations random or not?
Solution :
W W | L L L | W W | L L L | W W W W | L | W | L L L | W W | L | W W W | L | W | L | W W
Here, the number of runs r = 15 and the sample size n = 30.
H0 : The given sample observations are random.
To test the above null hypothesis, the test statistic is

Z = [ r - (n/2 + 1) ] / sqrt[ n(n - 2) / 4(n - 1) ]
  = [ 15 - (30/2 + 1) ] / sqrt[ 30(30 - 2) / 4(30 - 1) ]
  = (15 - 16) / sqrt( 840 / 116 )
  = -1 / 2.6910
  = -0.3716

|Z_cal| = 0.3716,  Z_tab = 1.96
Here |Z_cal| < Z_tab, so we cannot reject the null hypothesis.
The given sample observations are random.
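The run count and Z statistic of this example can be reproduced with a short Python sketch (the function name is illustrative):

```python
import math

def runs_test_z(sequence):
    """Z statistic for the one-sample runs test of randomness, using the
    simplified form from the text: E(r) = n/2 + 1, Var(r) = n(n-2)/(4(n-1))."""
    n = len(sequence)
    # A new run starts whenever the symbol changes.
    r = 1 + sum(1 for a, b in zip(sequence, sequence[1:]) if a != b)
    mean = n / 2 + 1
    var = n * (n - 2) / (4 * (n - 1))
    return r, (r - mean) / math.sqrt(var)

seq = list("WWLLLWWLLLWWWWLWLLLWWLWWWLWLWW")
r, z = runs_test_z(seq)
# r = 15, z is about -0.3716; since |z| < 1.96 we cannot reject randomness
```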
Sign Test:
The sign test gets its name from the fact that it is based on positive and negative signs. It is useful in research where comparative measurements matter more than the actual magnitudes. It is used in the case of two related samples.
Let x1, x2, x3, ......, xn and y1, y2, y3, ......, yn be two random samples of the same size drawn from two different populations.
Consider the sign of the difference of each pair of corresponding sample observations, i.e., di = xi - yi, where i = 1, 2, 3, .......
If the two populations are continuous and independent, then the probability of a positive sign equals the probability of a negative sign, namely 1/2.
To test the equality of the populations, take
H0 : The populations are equal.
The number of positive signs is a binomial variate with mean np = n/2 and variance n/4. Since the conditional probability of a + sign, given a non-zero difference, is 1/2, a t-test is effective in such situations; thus we can apply a t-test to the sign test.
When d = number of + signs,
E(d) = n/2 and V(d) = n/4
To test the above null hypothesis the test statistic is given by

t = (d - n/2) / sqrt(n/4)  ~  t with (n - 1) degrees of freedom
The above test statistic gives the calculated value of t. Compute the critical value of t for the specified level of significance with n - 1 degrees of freedom.
If the calculated value of t is less than the critical value of t, then we cannot reject the null hypothesis at the specified level of significance; otherwise we reject the null hypothesis.
Problem : Two samples are drawn from a large population and the sample points are

Sample I  : 5 7 8 4 3 6 4 8 9 4 5 6 6 8 5 4 7 3 8 9 11 6 7 8 10
Sample II : 4 8 6 3 5 7 2 5 7 5 6 5 5 7 6 5 4 8 3 7 10 8 6 5 11

Are the two samples drawn from the same population or not?
Solution : H0 : The given two samples are drawn from the same population.

Sample I   Sample II   di
5 4 1
7 8 -1
8 6 2
4 3 1
3 5 -2
6 7 -1
4 2 2
8 5 3
9 7 2
4 5 -1
5 6 -1
6 5 1
6 5 1
8 7 1
5 6 -1
4 5 -1
7 4 3
3 8 -5
8 3 5
9 7 2
11 10 1
6 8 -2
7 6 1
8 5 3
10 11 -1
Here, the number of positive signs d = 15 and n = 25.
To test the above null hypothesis the test statistic is given by

t = (d - n/2) / sqrt(n/4) = (15 - 25/2) / sqrt(25/4) = 2.5 / 2.5 = 1

t_cal = 1
Tabulated value: t_tab = t_(25-1) = t_24 = 2.064
Here t_cal < t_tab, so we cannot reject the null hypothesis.
The given two samples are drawn from the same population.
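As a cross-check of the arithmetic above, a minimal Python sketch of this sign-test statistic (the function name is illustrative):

```python
import math

def sign_test_t(sample1, sample2):
    """Paired sign test statistic as given in the text:
    t = (d - n/2) / sqrt(n/4), where d = number of positive differences."""
    n = len(sample1)
    d = sum(1 for x, y in zip(sample1, sample2) if x - y > 0)
    return d, (d - n / 2) / math.sqrt(n / 4)

s1 = [5, 7, 8, 4, 3, 6, 4, 8, 9, 4, 5, 6, 6, 8, 5, 4, 7, 3, 8, 9, 11, 6, 7, 8, 10]
s2 = [4, 8, 6, 3, 5, 7, 2, 5, 7, 5, 6, 5, 5, 7, 6, 5, 4, 8, 3, 7, 10, 8, 6, 5, 11]
d, t = sign_test_t(s1, s2)
# d = 15, t = 1.0; compare with t_tab = 2.064 at 24 degrees of freedom
```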
WILCOXON SIGNED RANK TEST:
CASE (I) (One sample): Let x1, x2, x3, ......, xn be a random sample drawn from a population with median 'M'.
Compute the difference of each observation from the median, (x1 - M), (x2 - M), (x3 - M), ....., (xn - M), note whether each difference is positive or negative, let the number of positive signs be r, and calculate

P = (1/2)^n Σ_{x=0}^{r} nCx
In the above test, the null hypothesis can be taken as
H0 : The given median of the population is correct (i.e., the population median is M).
Compare the calculated value of P with the critical value and draw the inference accordingly. If P is less than the tabulated value, then we accept the null hypothesis and say that the stated median of the population is supported; otherwise we reject the null hypothesis.
CASE (II):
Let x1, x2, x3, ......, xn and y1, y2, y3, ......, yn be two samples of equal size drawn from the same population.
Calculate the difference of each pair of corresponding observations of the two samples.
To test the equality of the two samples, the null hypothesis can be taken as
H0 : The given two samples are drawn from the same population.
To test the above null hypothesis the test statistic is given by

Z = (r - n/2) / sqrt(n/4)  ~  N(0, 1)
The above test statistic gives the calculated value of Z. Compute the critical value of Z for the specified level of significance.
If the calculated value of Z is greater than the tabulated value, then we can reject the
null hypothesis at the specified level of significance, otherwise we cannot reject the null
hypothesis.
Problem : A sample of 7 observations is drawn from a large population and the sample values are
67, 35, 34, 70, 65, 48, 60
Is the sample drawn from a population whose median is 50?
Solution :
Given, population median M = 50 and size of the sample n = 7.

x     M     x - M    Sign
67 50 17 +
35 50 -15 −
34 50 -16 −
70 50 20 +
65 50 15 +
48 50 -2 −
60 50 10 +
Number of positive signs r = 4
The test statistic is

P = (1/2)^n Σ_{x=0}^{r} nCx
  = (1/2)^7 Σ_{x=0}^{4} 7Cx
  = (1/128) (7C0 + 7C1 + 7C2 + 7C3 + 7C4)
  = (1/128) (1 + 7 + 21 + 35 + 35)
  = 99/128
  = 0.7734

Here, P_cal > 0.5.
We can reject the null hypothesis.
The given sample is not drawn from the population whose median is 50.
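The tail probability P above can be computed directly; a small Python sketch (the function name is illustrative):

```python
from math import comb

def sign_test_p(r, n):
    """Cumulative tail probability P = (1/2)^n * sum_{x=0}^{r} C(n, x),
    as used in the one-sample median test above."""
    return sum(comb(n, x) for x in range(r + 1)) / 2 ** n

p = sign_test_p(r=4, n=7)   # (1 + 7 + 21 + 35 + 35) / 128 = 99/128
```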
ANALYSIS OF VARIANCE
ANOVA
Introduction : Analysis of variance (ANOVA) is a powerful tool for tests of significance. Suppose we are interested in finding out whether the effects of several fertilizers on yields differ significantly. One procedure to answer this question is to conduct a t-test nC2 times, which is impracticable. The alternative is to apply the technique of ANOVA.
The main aim of analysis of variance is to test the homogeneity of several means or to test different treatment effects.
ANOVA was introduced by Prof. R.A. Fisher in the year 1920. Variance is inherent in nature: every experiment consists of a list of outcomes, and we cannot expect them to be without variation. Even when an experiment is conducted under similar conditions with totally homogeneous experimental units, variance exists in the results.
For example, a person's signature varies from sign to sign, even though it is his own signature.
The total variation in any set of numerical data is due to a number of causes, which may be classified as
1. Variation due to assignable causes
2. Variation due to chance causes
The variation due to assignable causes can be detected and measured, whereas the variation due to chance causes is beyond human control and cannot be traced separately.
Definition : According to R.A. Fisher, analysis of variance is the separation of variance
ascribable to one group of causes from the variance ascribable to another group .
Assumptions of ANOVA : For the validity of the F-test in analysis of variance, the following assumptions are made.
1. The sample observations are independent.
2. The parent population from which the observations are drawn is Normal.
3. Various effects are additive in nature.
One-Way ANOVA :
Let us suppose that 'N' observations x_ij ; i = 1, 2, 3, ....., k ; j = 1, 2, 3, ..., n_i of a random variable x are split into 'k' classes on some basis, with sizes n1, n2, n3, ...., nk respectively.
These values are exhibited in the following classification table.
Class   Observations                          Total T_i.   Mean x̄_i.
1       x11   x12   x13   ......   x1n1       T1.          x̄1.
2       x21   x22   x23   ......   x2n2       T2.          x̄2.
3       x31   x32   x33   ......   x3n3       T3.          x̄3.
:       :                                     :            :
i       xi1   xi2   xi3   ......   xin_i      Ti.          x̄i.
:       :                                     :            :
k       xk1   xk2   xk3   ......   xkn_k      Tk.          x̄k.

Grand total G = Σ_{i=1}^{k} Σ_{j=1}^{n_i} x_ij = Σ_{i=1}^{k} T_i.

The total variation in the observations x_ij can be split into the following components.
1. Variation between the classes
2. Variation within the classes
The first type of variation is due to assignable causes, which can be identified, and the second type of variation is due to chance causes.
Null hypothesis :
H0 : The means of the classes in the classification table are equal to the general mean.
i.e.,  H0 : μ1. = μ2. = μ3. = ....... = μk. = μ
(or)   H0 : α1 = α2 = α3 = ....... = αk = 0
Working procedure of One-Way ANOVA :
The following are the steps to carry out ANOVA one-way classification.
Step 1 : Calculate the grand total G = Σ_{i=1}^{k} Σ_{j=1}^{n_i} x_ij = Σ_{i=1}^{k} T_i.
Step 2 : Calculate the correction factor C.F. = G²/N
Step 3 : Calculate the raw sum of squares R.S.S. = Σ_{i=1}^{k} Σ_{j=1}^{n_i} x_ij²
Step 4 : Calculate the total sum of squares T.S.S. = R.S.S. - C.F.
Step 5 : Calculate the sum of squares due to classes S.S.C. = Σ_{i=1}^{k} (T_i.² / n_i) - C.F.
Step 6 : Calculate the sum of squares due to errors S.S.E. = T.S.S. - S.S.C.
Step 7 : Calculate the mean sums of squares
         Mean sum of squares due to classes M.S.S.C. = S.S.C. / (k - 1)
         Mean sum of squares due to errors M.S.S.E. = S.S.E. / (N - k)
Step 8 : Compute the calculated value of F, i.e., F_cal = M.S.S.C. / M.S.S.E.
Step 9 : Find the critical value of F from the F-table at (k - 1, N - k) degrees of freedom.
Step 10 : Compare the calculated value of F with the critical value of F. If the calculated value of F is less than the critical value of F, then we cannot reject the null hypothesis.
ANOVA Table :

Sources of       Degrees of   Sum of    Mean Sum      F-Ratio
Variation        Freedom      Squares   of Squares    F_cal                       F_tab
Due to classes   k - 1        S.S.C.    M.S.S.C.      F_cal = M.S.S.C./M.S.S.E.   F_tab = F(k-1, N-k)
Due to errors    N - k        S.S.E.    M.S.S.E.
Total            N - 1        T.S.S.

Compare the calculated value of F with the critical value of F. If the calculated value of F is less than the critical value of F, then we cannot reject the null hypothesis.
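The ten steps of the one-way procedure can be sketched in Python; `one_way_anova` is an illustrative name and the toy data below are hypothetical, chosen only to exercise the formulas:

```python
def one_way_anova(classes):
    """One-way ANOVA F ratio, following the step-by-step procedure above.
    `classes` is a list of lists, one inner list of observations per class."""
    k = len(classes)
    N = sum(len(c) for c in classes)
    G = sum(sum(c) for c in classes)                       # Step 1: grand total
    CF = G ** 2 / N                                        # Step 2: correction factor
    RSS = sum(x ** 2 for c in classes for x in c)          # Step 3: raw sum of squares
    TSS = RSS - CF                                         # Step 4: total SS
    SSC = sum(sum(c) ** 2 / len(c) for c in classes) - CF  # Step 5: SS due to classes
    SSE = TSS - SSC                                        # Step 6: SS due to errors
    MSSC = SSC / (k - 1)                                   # Step 7: mean squares
    MSSE = SSE / (N - k)
    return MSSC / MSSE                                     # Step 8: F at (k-1, N-k) d.f.

# Hypothetical data for three classes, chosen only to exercise the formulas:
F = one_way_anova([[8, 10, 12], [14, 16, 18], [20, 22, 24]])
```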
ANOVA Two-Way Classification :
Let us suppose that there are 'N' observations, arranged in 'h' groups with each group containing 'k' experimental units, i.e., N = h × k.
Let x_ij be the yield of the j-th variety which receives the i-th treatment; the observations can be arranged in the following bivariate table.
Class   1     2     3     ......  j     ......  k      Total T_i.   Mean x̄_i.
1       x11   x12   x13   ......  x1j   ......  x1k    T1.          x̄1.
2       x21   x22   x23   ......  x2j   ......  x2k    T2.          x̄2.
3       x31   x32   x33   ......  x3j   ......  x3k    T3.          x̄3.
:       :     :     :     ......  :     ......  :      :            :
i       xi1   xi2   xi3   ......  xij   ......  xik    Ti.          x̄i.
:       :     :     :     ......  :     ......  :      :            :
h       xh1   xh2   xh3   ......  xhj   ......  xhk    Th.          x̄h.
Total   T.1   T.2   T.3   ......  T.j   ......  T.k    G
Mean    x̄.1   x̄.2   x̄.3   ......  x̄.j   ......  x̄.k

From the above bivariate table, the grand total G = Σ_i Σ_j x_ij = Σ_{i=1}^{h} T_i. = Σ_{j=1}^{k} T._j
Mathematical model of Two-Way ANOVA :
The mathematical model of the two-way classification is

x_ij = μ + α_i + β_j + E_ij,  where i = 1, 2, 3, ......, h and j = 1, 2, 3, ..., k

Here x_ij = yield of the j-th variety which receives the i-th treatment
μ = general mean effect
α_i = effect of the i-th treatment (or effect of the i-th row)
β_j = effect of the j-th variety (or effect of the j-th column)
E_ij = random error.
Null Hypothesis :
We set up the null hypotheses that the treatments as well as the varieties are homogeneous.
H0(R) : μ1. = μ2. = μ3. = ....... = μh. = μ   (or)   H0(R) : α1 = α2 = ....... = αh = 0
H0(C) : μ.1 = μ.2 = μ.3 = ....... = μ.k = μ   (or)   H0(C) : β1 = β2 = ....... = βk = 0
Working procedure of Two-Way ANOVA :
The following are the steps to carry out ANOVA two-way classification.
Step 1 : Calculate the grand total G = Σ_i Σ_j x_ij = Σ T_i. = Σ T._j
Step 2 : Calculate the correction factor C.F. = G²/N
Step 3 : Calculate the raw sum of squares R.S.S. = Σ_{i=1}^{h} Σ_{j=1}^{k} x_ij²
Step 4 : Calculate the total sum of squares T.S.S. = R.S.S. - C.F.
Step 5 : Calculate the sum of squares due to rows S.S.R. = Σ_{i=1}^{h} (T_i.² / k) - C.F.
Step 6 : Calculate the sum of squares due to columns S.S.C. = Σ_{j=1}^{k} (T._j² / h) - C.F.
Step 7 : Calculate the sum of squares due to errors S.S.E. = T.S.S. - S.S.R. - S.S.C.
Step 8 : Calculate the mean sums of squares
         Mean sum of squares due to rows M.S.S.R. = S.S.R. / (h - 1)
         Mean sum of squares due to columns M.S.S.C. = S.S.C. / (k - 1)
         Mean sum of squares due to errors M.S.S.E. = S.S.E. / ((h - 1)(k - 1))
Step 9 : Compute the calculated values of F, i.e.,
         F_R(cal) = M.S.S.R. / M.S.S.E.   and   F_C(cal) = M.S.S.C. / M.S.S.E.
Step 10 : Find the critical values of F from the F-table at (h - 1, (h - 1)(k - 1)) and (k - 1, (h - 1)(k - 1)) degrees of freedom.
Step 11 : Compare the calculated values of F with the critical values of F. If the calculated value of F is less than the critical value of F, then we cannot reject the null hypothesis.
ANOVA Table :

Sources of       Degrees of       Sum of    Mean Sum      F-Ratio
Variation        Freedom          Squares   of Squares    F_cal                     F_tab
Due to rows      h - 1            S.S.R.    M.S.S.R.      F_R = M.S.S.R./M.S.S.E.   F_tab(R) = F(h-1, (h-1)(k-1))
Due to columns   k - 1            S.S.C.    M.S.S.C.      F_C = M.S.S.C./M.S.S.E.   F_tab(C) = F(k-1, (h-1)(k-1))
Due to errors    (h - 1)(k - 1)   S.S.E.    M.S.S.E.
Total            N - 1            T.S.S.

Compare the calculated values of F with the critical values of F. If the calculated value of F is less than the critical value of F, then we cannot reject the null hypothesis.
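The two-way procedure can likewise be sketched in Python; `two_way_anova` is an illustrative name and the 3×3 yield table below is hypothetical, chosen only to exercise the formulas:

```python
def two_way_anova(table):
    """Two-way ANOVA F ratios (one observation per cell), following the
    step-by-step procedure above. `table` has h rows (treatments) and
    k columns (varieties)."""
    h, k = len(table), len(table[0])
    N = h * k
    G = sum(sum(row) for row in table)                        # grand total
    CF = G ** 2 / N                                           # correction factor
    RSS = sum(x ** 2 for row in table for x in row)           # raw sum of squares
    TSS = RSS - CF                                            # total SS
    SSR = sum(sum(row) ** 2 for row in table) / k - CF        # SS due to rows
    SSC = sum(sum(col) ** 2 for col in zip(*table)) / h - CF  # SS due to columns
    SSE = TSS - SSR - SSC                                     # SS due to errors
    MSSR, MSSC = SSR / (h - 1), SSC / (k - 1)
    MSSE = SSE / ((h - 1) * (k - 1))
    return MSSR / MSSE, MSSC / MSSE  # F for rows, F for columns

# Hypothetical 3x3 yield table, chosen only to exercise the formulas:
FR, FC = two_way_anova([[10, 12, 14], [15, 18, 19], [20, 22, 25]])
```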
CORRELATION
Definition :
The combined relationship between two or more variables is known as "correlation".
Correlation is broadly classified into three types. They are
➢ Positive Correlation
➢ Negative Correlation
➢ Zero Correlation
Positive Correlation :
If the two variables x and y move in the same direction, then the correlation between them is known as "positive correlation" and the variables are said to be positively correlated.
In positive correlation both variables move in the same direction, i.e., if x increases then y also increases, and if x decreases then y also decreases.
Example : Demand and cost of an item are positively correlated variables.
Negative Correlation :
If the two variables x and y move in opposite directions, then the correlation between them is known as "negative correlation" and the variables are said to be negatively correlated.
In negative correlation the variables move in opposite directions, i.e., if x increases then y decreases, and if x decreases then y increases.
Example : Supply and demand of an item are negatively correlated variables.
Zero Correlation : If the two variables x and y do not depend upon each other (i.e., they are independent), then the correlation between them is known as "Zero Correlation" and the variables are said to be uncorrelated.
Example : Marks in Telugu and marks in Mathematics.
Measures of Correlation :
There are several methods to measure the correlation between two variables.
Some of them are
 Scatter Diagram Method
 Karl Pearson’s Coefficient of Correlation
 Spearman’s Rank Correlations Coefficient
Scatter Diagram Method :
Scatter diagrams are the easiest way to graphically represent the relationship
between two quantitative variables. They're just x-y plots, with the predictor
variable as the x and the response variable as the y.
In the scatter diagram method we obtain a measure of the relationship between two variables by plotting the values on a graph, taking the values of one variable (x) on the x-axis and the values of the other variable (y) on the y-axis.
Note :
1. If all the scattered sample points exactly lie on a straight line from bottom left
corner to top right corner, then the correlation between the two variables is said
to be perfect positive correlation.
2. If all the scattered sample points exactly lie on a straight line from top left corner
to bottom right corner, then the correlation between the two variables is said to
be perfect negative correlation.
3. If the scattered sample points cluster more around a straight line from bottom
left corner to top right corner then the correlation between the two variables is
said to be moderately positive correlation.
4. If the scattered sample points cluster more around a straight line from top left
corner to bottom right corner then the correlation between the two variables is
said to be moderately negative correlation.
Merits and demerits of Scatter diagram Method:
Merits : 1. Scatter diagram method is the simplest method of studying relationship
between two variables.
2. It takes less time to study the correlation between the two variables.
3. It is less expensive.
Demerits : 1. We cannot study the correlation between three or more variables using this method.
2. We cannot calculate the degree of correlation between the variables using this method; that is, this method does not give the correlation between the two variables numerically.
Karl Pearson’s Coefficient of Correlation :
This is an important method to calculate the correlation coefficient between
two variables numerically. This method was introduced by Prof. Karl Pearson.
Karl Pearson's coefficient of correlation is denoted by 'r'. It is mathematically defined as

r = Cov(x, y) / sqrt[ V(x) V(y) ]
  = [ E(xy) - E(x)E(y) ] / sqrt{ [E(x²) - (E(x))²] [E(y²) - (E(y))²] }
  = [ Σxy/n - (Σx/n)(Σy/n) ] / sqrt{ [Σx²/n - (Σx/n)²] [Σy²/n - (Σy/n)²] }
  = [ nΣxy - (Σx)(Σy) ] / sqrt{ [nΣx² - (Σx)²] [nΣy² - (Σy)²] }
Properties of Karl Pearson's Coefficient of Correlation :
1. The correlation coefficient between x and y is equal to the correlation coefficient between y and x, i.e., r_xy = r_yx.
2. The correlation coefficient is a pure number and does not carry units.
3. Karl Pearson's coefficient of correlation always lies between -1 and +1.
   • If r = 1 the correlation between x and y is perfect positive correlation.
   • If r = -1 the correlation between x and y is perfect negative correlation.
   • If r = 0 the correlation between x and y is zero correlation.
   • If r > 0 the correlation between x and y is positive.
   • If r < 0 the correlation between x and y is negative.
4. If two variables x and y are independent, then the correlation coefficient between x and y is equal to zero, i.e., r = 0.
   Note : If the correlation coefficient between x and y is zero, they need not be independent.
5. The correlation coefficient is independent of change of origin and scale, i.e., r_xy = r_uv.
Problem 1 : Find the correlation coefficient between x and y using Karl Pearson's coefficient of correlation from the following data.
x 24 26 32 28 40 36 38 42
y 26 28 24 32 30 40 38 36
Calculation :

x     y     x²     y²     xy
24    26    576    676    624
26    28    676    784    728
32    24    1024   576    768
28    32    784    1024   896
40    30    1600   900    1200
36    40    1296   1600   1440
38    38    1444   1444   1444
42    36    1764   1296   1512
266   254   9164   8300   8612
Here n = 8, Σx = 266, Σy = 254, Σx² = 9164, Σy² = 8300, Σxy = 8612

∴ Karl Pearson's coefficient of correlation

r = [nΣxy - (Σx)(Σy)] / sqrt{ [nΣx² - (Σx)²] [nΣy² - (Σy)²] }
  = [8(8612) - (266)(254)] / sqrt{ [8(9164) - (266)²] [8(8300) - (254)²] }
  = (68896 - 67564) / sqrt{ (73312 - 70756)(66400 - 64516) }
  = 1332 / sqrt{ (2556)(1884) }
  = 1332 / sqrt(4815504)
  = 1332 / 2194.426
  = 0.6070
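The computational formula used above lends itself to a direct Python check (the function name is illustrative):

```python
import math

def pearson_r(xs, ys):
    """Karl Pearson's r via the computational formula above:
    r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2))."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(v * v for v in xs)
    syy = sum(v * v for v in ys)
    sxy = sum(a * b for a, b in zip(xs, ys))
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

x = [24, 26, 32, 28, 40, 36, 38, 42]
y = [26, 28, 24, 32, 30, 40, 38, 36]
r = pearson_r(x, y)   # about 0.6070, matching Problem 1
```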
Problem 2 : Find the correlation coefficient between x and y using Karl Pearson's coefficient of correlation from the following data.
x 102 106 104 105 118 112 116 115 120
y 87 76 80 85 86 82 74 78 84
Calculation :

x      y     x²       y²      xy
102    87    10404    7569    8874
106    76    11236    5776    8056
104    80    10816    6400    8320
105    85    11025    7225    8925
118    86    13924    7396    10148
112    82    12544    6724    9184
116    74    13456    5476    8584
115    78    13225    6084    8970
120    84    14400    7056    10080
998    732   111030   59706   81141
Here n = 9, Σx = 998, Σy = 732, Σx² = 111030, Σy² = 59706, Σxy = 81141

∴ Karl Pearson's coefficient of correlation

r = [nΣxy - (Σx)(Σy)] / sqrt{ [nΣx² - (Σx)²] [nΣy² - (Σy)²] }
  = [9(81141) - (998)(732)] / sqrt{ [9(111030) - (998)²] [9(59706) - (732)²] }
  = (730269 - 730536) / sqrt{ (999270 - 996004)(537354 - 535824) }
  = -267 / sqrt{ (3266)(1530) }
  = -267 / sqrt(4996980)
  = -267 / 2235.3926
  = -0.1194
Spearman's Rank Correlation Coefficient :
Spearman's rank correlation coefficient is a measure to study the degree of relationship between non-measurable qualities such as intelligence, beauty, honesty, etc.
If the given data contain the values of two characteristics, then we allot ranks: first rank to the highest value, second rank to the second highest value, and so on, with the last rank to the least value.
Prof. C.E. Spearman introduced a formula to measure the relationship between two non-measurable quantities as follows:

ρ = 1 - [ 6 Σ di² ] / [ n(n² - 1) ]

Here di = difference between the corresponding ranks, i.e., di = R_X - R_Y, and n = number of paired observations.
This formula is used when no observation is repeated in the data.
In the case of repeated (equal) values, we allot the ranks serially and assign the average of those ranks to each of the repeated values.
In the case of repeated observations, Spearman's rank correlation coefficient is defined as

ρ = 1 - 6 [ Σ di² + m1(m1² - 1)/12 + m2(m2² - 1)/12 + m3(m3² - 1)/12 + ......... ] / [ n(n² - 1) ]

Here di = difference between the corresponding ranks, i.e., di = R_X - R_Y,
n = number of paired observations, and
m1, m2, m3, ......... are the numbers of tied ranks.
Properties of Spearman's Rank Correlation Coefficient :
➢ Spearman's rank correlation coefficient always lies between -1 and +1, i.e., -1 ≤ ρ ≤ +1.
➢ If the values of the x and y series take the same ranks, then ρ = 1 and we say there is perfect positive correlation.
➢ If the values of the x and y series take opposite ranks, then ρ = -1 and we say there is perfect negative correlation.
➢ If ρ = 0 then we say the two variables are uncorrelated.
➢ The rank correlation coefficient is not affected if we allot ranks to the existing ranks.
Merits and Demerits of Rank Correlation Coefficient :
Merits :
• It is easy to understand.
• It is easy to calculate.
• Sometimes it can be used as an approximate or quick estimate of the correlation coefficient.
• The rank correlation coefficient is not affected if we allot ranks to the existing ranks.
• If Karl Pearson's coefficient of correlation is equal to one (r = 1), then Spearman's rank correlation coefficient is also equal to one (ρ = 1), but the converse need not be true.
Demerits :
• In the case of measurable variables, the rank correlation coefficient ignores the actual values; hence we may not get the exact relationship between the two variables.
• The rank correlation coefficient does not produce regression lines.
• We cannot estimate the ranks of one variable with the help of the known ranks of the corresponding variable.
Problem 1 : Calculate Spearman's rank correlation coefficient between x and y using the following data.
x 25 28 34 30 36 38 40 42
y 34 30 36 32 38 35 33 37
Solution :

x     y     R_x   R_y   di = R_x - R_y   di²
25    34    8     5     3                9
28    30    7     8     -1               1
34    36    5     3     2                4
30    32    6     7     -1               1
36    38    4     1     3                9
38    35    3     4     -1               1
40    33    2     6     -4               16
42    37    1     2     -1               1
                                 Σdi² =  42

Here n = 8

ρ = 1 - [6 Σ di²] / [n(n² - 1)]
  = 1 - [6(42)] / [8(8² - 1)]
  = 1 - 252 / [8(63)]
  = 1 - (1/2)
ρ = 0.5
Problem 2 : Calculate Spearman's rank correlation coefficient between x and y using the following data.
x 23 24 26 24 28 25 29 24 28 30
y 32 30 36 34 32 33 36 38 40 38
Solution :

x     y     R_x   R_y   di = R_x - R_y   di²
23    32    10    8.5   1.5              2.25
24    30    8     10    -2               4
26    36    5     4.5   0.5              0.25
24    34    8     6     2                4
28    32    3.5   8.5   -5               25
25    33    6     7     -1               1
29    36    2     4.5   -2.5             6.25
24    38    8     2.5   5.5              30.25
28    40    3.5   1     2.5              6.25
30    38    1     2.5   -1.5             2.25
                                 Σdi² =  81.50

(Ties: in x, the value 24 occurs three times (m = 3) and 28 twice (m = 2); in y, the values 32, 36 and 38 each occur twice (m = 2).)

Σ di² + Σ m(m² - 1)/12
  = 81.5 + 2(2² - 1)/12 + 2(2² - 1)/12 + 3(3² - 1)/12 + 2(2² - 1)/12 + 2(2² - 1)/12
  = 81.5 + 0.5 + 0.5 + 2 + 0.5 + 0.5
  = 85.5

∴ Spearman's rank correlation coefficient

ρ = 1 - 6 [Σ di² + Σ m(m² - 1)/12] / [n(n² - 1)]
  = 1 - [6(85.5)] / [10(10² - 1)]
  = 1 - 513/990
  = 1 - 0.5182
ρ = 0.4818
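The tie-corrected computation of Problem 2 can be verified with a short Python sketch (both function names are illustrative):

```python
from collections import Counter

def ranks_desc(values):
    """Ranks with rank 1 for the highest value; ties share the average rank."""
    order = sorted(values, reverse=True)
    positions = {}
    for pos, v in enumerate(order, start=1):
        positions.setdefault(v, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]

def spearman_rho(xs, ys):
    """Spearman's rho with the tie correction sum of m(m^2 - 1)/12 used above."""
    n = len(xs)
    rx, ry = ranks_desc(xs), ranks_desc(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    tie = sum(m * (m * m - 1) / 12
              for series in (xs, ys)
              for m in Counter(series).values() if m > 1)
    return 1 - 6 * (d2 + tie) / (n * (n * n - 1))

x = [23, 24, 26, 24, 28, 25, 29, 24, 28, 30]
y = [32, 30, 36, 34, 32, 33, 36, 38, 40, 38]
rho = spearman_rho(x, y)   # about 0.4818, matching Problem 2
```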
Problem 3 : Calculate Spearman's rank correlation coefficient between x and y using the following data (the ranks are given).

R_x : 1, 3.5, 3.5, 3.5, 3.5, 6.5, 6.5, 8, 9, 10
R_y : 2, 2, 2, 4, 5, 6.5, 6.5, 8.5, 8.5, 10

Solution :

di  : -3, 4.5, 5, -5.5, -1.5, 6, -5, 8, -6.5, -2
di² : 9, 20.25, 25, 30.25, 2.25, 36, 25, 64, 42.25, 4
Σ di² = 258

(Ties: in R_x, the rank 3.5 is shared by four values (m = 4) and 6.5 by two (m = 2); in R_y, the rank 2 is shared by three values (m = 3) and 6.5 and 8.5 by two each (m = 2).)

Σ di² + Σ m(m² - 1)/12
  = 258 + 4(4² - 1)/12 + 2(2² - 1)/12 + 3(3² - 1)/12 + 2(2² - 1)/12 + 2(2² - 1)/12
  = 258 + 5 + 0.5 + 2 + 0.5 + 0.5
  = 266.5

∴ Spearman's rank correlation coefficient

ρ = 1 - 6 [Σ di² + Σ m(m² - 1)/12] / [n(n² - 1)]
  = 1 - [6(266.5)] / [10(10² - 1)]
  = 1 - 1599/990
  = 1 - 1.6152
ρ = -0.6152
5. Preclinical Studies
• Dose-Response Relationship: Determines the relationship between drug dose and its biological effect, often modelled using logistic regression or nonlinear techniques.
• Toxicological Studies: Uses statistics to evaluate potential toxic effects of new drug candidates.
6. Post-Marketing Surveillance
• Pharmacovigilance: Statistical tools are used to detect and assess the incidence of adverse events after a drug has been released to the market.
• Signal Detection: Analyses databases of adverse drug reactions to identify potential safety concerns.
7. Genomics and Personalized Medicine
• Multivariate Statistics: Used to analyze complex biological data like gene expression and genetic variants to identify drug responses.
• Biomarker Discovery: Statistical techniques help in identifying biomarkers that predict drug efficacy and toxicity in different patient subgroups.
8. Regulatory Submissions
• Data Summarization: Statistics is used to summarize clinical trial data and prepare the evidence for regulatory authorities (e.g., FDA, EMA).
• Meta-Analysis: Pools results from multiple studies to provide a broader understanding of drug effects.
Population : A population is a group of items, units or objects which is under reference of study. A population may consist of homogeneous units. The number of units in the population is denoted by 'N'.
  • 3.
    Population can bebroadly classified into two types. They are 1. Finite population 2. Infinite population Finite population: A population which consists finite number (countable number) of elements or units is said to be finite population. Example: 1. Set of all natural numbers between 100 and 500 Example: 2. Population of a city. Infinite population: A population which consists infinite number of elements or units is said to be infinite population. Example: 1. Set of rational numbers between 1 to 10 Example: 2. . Set of real numbers from 0 to 1 Parameter : A constant which is measured from the population is said to be parameter. Example : Population mean “  ” population variance “ 2  ” population standard deviation “  ” population proportion “ P”. Sample: subset of the population is known as sample. Random sample: A sample which is collected from the population in random manner is known as random sample. Statistic : A constant which is measured from the sample is said to be statistic. Example : sample mean “ x ”sample variance “ 2 s ” sample standard deviation “ s ” sample proportion “ p ”. Sample Size Determination : Key Factors Influencing Sample Size Determination: 1. Study Objective and Design: o Type of Study: The sample size can vary depending on whether the study is a clinical trial, observational study, or bioequivalence study. o Endpoints: The primary outcome variable (e.g., survival time, response rate) influences how many subjects are needed. 2. Power of the Study (1 - β): o Power is the probability of detecting a true effect (usually set at 80% or 90%). #Set of elements drawn from a population is called Sample. 
The number of elements in the sample is denoted by 'n'.
Sampling methods fall into two types: probability sampling (simple random sampling, stratified sampling, systematic sampling, cluster sampling) and non-probability sampling (snowball sampling, quota sampling, convenience sampling).
Sample size determination or estimation is the act of choosing the number of observations or replicates to include in a statistical sample. The sample size is an important feature of any empirical study in which the goal is to make inferences about a population from a sample.
External factors affecting sample size: access to the sample, resources, time, personnel and their competences and experience, technical support, measurement procedures, etc.
Internal factors: researcher factors, aim of the research, aim of generalisation, research methodologies, research paradigm, motivation, interest, skills, experience.
o Type II Error (β): The risk of not detecting a true difference when one exists (typically set at 0.1 or 0.2).
3. Significance Level (α):
o The level of significance is the probability of a Type I error (rejecting the null hypothesis when it is true), usually set at 5% (α = 0.05).
o This means there is a 5% risk of finding a difference when none exists.
4. Effect Size:
o The effect size is the minimum clinically or scientifically relevant difference between groups.
o It can be based on previous studies or clinical judgment. Smaller effect sizes require larger sample sizes to detect.
5. Variability in Data:
o Higher variability or standard deviation in the outcome measures increases the required sample size.
o For example, drug response data with high interpatient variability will need more subjects to identify a true effect.
6. One-Tailed or Two-Tailed Test:
o A two-tailed test requires a larger sample size than a one-tailed test because it tests for effects in both directions (e.g., a drug being better or worse).
o In contrast, a one-tailed test is more focused but generally less conservative.
DETERMINATION OF SAMPLE SIZE BY COCHRAN'S FORMULA :
Cochran's formula is considered especially appropriate in situations with large populations. A sample of any given size provides more information about a smaller population than about a larger one, so there is a 'correction' through which the number given by Cochran's formula can be reduced if the whole population is relatively small. The Cochran formula is:
n = Z²pq / e²
where
• Z is the statistical (critical) value for the chosen confidence level,
• p is the estimated probability of success,
• q = 1 − p is the probability of failure,
• e is the desired level of precision (i.e. the margin of error).
The formula assumes an infinite (very large) population, N = ∞.
Choice of formula:
1. Sample size for discrete data / infinite population: Cochran's method.
2. Sample size for a known (finite) population: Yamane's formula.
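Cochran's formula is straightforward to evaluate in code. A minimal Python sketch (the function name is my own; Z = 1.96 corresponds to 95% confidence):

```python
import math

def cochran_sample_size(z, p, e):
    """Cochran's formula n = Z^2 * p * q / e^2 for a large (effectively infinite) population."""
    q = 1 - p                       # probability of failure
    return (z ** 2) * p * q / (e ** 2)

# 95% confidence (Z = 1.96), maximum variability p = 0.5, 5% margin of error
n = cochran_sample_size(z=1.96, p=0.5, e=0.05)
print(math.ceil(n))  # 384.16 rounded up to 385
```

Since subjects are whole units, the fractional result is always rounded up.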
• p is the (estimated) proportion of the population which has the attribute in question,
• q is 1 − p.
The Z-value is found in a Z table.
Example : Suppose we are doing a study on the inhabitants of a large town, and want to find out how many households serve breakfast in the mornings. We don't have much information on the subject to begin with, so we assume that half of the families serve breakfast: this gives us maximum variability, so p = 0.5. Now suppose we want 95% confidence and at least 5 percent (plus or minus) precision. A 95% confidence level gives a Z value of 1.96, per the normal tables, so we get
n = Z²pq / e² = (1.96)² × 0.5 × 0.5 / (0.05)² = 384.16 ≈ 385.
So a random sample of 385 households in our target population should be enough to give us the confidence level we need.
Yamane's Formula : Yamane's method is a simplified formula for calculating sample size, often used when determining the sample size for surveys or studies in the social sciences. It provides a quick way to estimate the required sample size from a known population size, and is particularly useful when you don't have access to advanced tools or when you're conducting preliminary calculations.
n = N / (1 + Ne²)
where
n = the required sample size,
N = population size,
e = margin of error.
Steps for Using Yamane's Method:
1. Determine the Population Size (N):
o This is the total number of people or units in the population you are studying. N must be countable (finite).
2. Decide the Margin of Error (e):
o Common choices for e are 0.05 (5%) or 0.10 (10%), depending on how precise you want your estimate to be.
o A smaller e means more precision, requiring a larger sample size, while a larger e reduces precision and requires a smaller sample.
3. Apply the Formula:
o Plug the values of N (population size) and e (margin of error) into Yamane's formula to calculate the required sample size n.
Example of Yamane's Method : Suppose you want to conduct a survey in a population of 5,000 people and are willing to accept a margin of error of 5% (0.05):
n = N / (1 + Ne²) = 5000 / (1 + 5000 × 0.05²) = 5000 / (1 + 12.5) = 5000 / 13.5 = 370.37
The required sample size by Yamane's method at a 5% margin of error, from a population of size 5,000, is therefore 370.
Importance of Sample Size : Sample size is a crucial aspect of research because it significantly influences the reliability, validity, and generalizability of a study's findings. Here are several key reasons why sample size is important:
1. Sample size and sampling error are inversely proportional. A larger sample size increases the statistical power of a study, i.e., the ability to detect a true effect or difference when it exists. Small sample sizes can lead to Type II errors (false negatives), where the study fails to detect an effect that is actually present.
2. Sample size directly affects the statistical power of a study, which is the ability to detect a true effect if one exists.
3. A large sample size enhances the representativeness of the sample, ensuring it more closely reflects the diversity of the target population. This reduces selection bias and improves the generalizability of the findings to the broader population.
4. With a larger sample size, the impact of outliers and extreme values on the overall results diminishes. This leads to a more stable and reliable outcome, as the sample mean and other statistics are less influenced by unusual data points.
5. Many statistical tests assume that the data come from a large enough sample to approximate a normal distribution (Central Limit Theorem). Insufficient sample sizes may lead to non-normal distributions, making statistical tests less valid or requiring alternative, less powerful non-parametric tests.
6. A larger sample size gives researchers greater confidence in their findings. For example, with smaller samples, even if results are statistically significant, there may be concerns about whether the findings are repeatable or whether they occurred by chance.
Dropout rate : The dropout rate is an estimate of the number of subjects who may leave the study/clinical trial for various reasons. Normally, the sample size calculation gives the number of study subjects required to achieve the target statistical significance for a given hypothesis. However, in clinical practice we may need to enrol more subjects to compensate for these potential dropouts:
N1 = n / (1 − d)
where n = calculated sample size and d = dropout rate.
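Yamane's formula and the dropout adjustment can be sketched together in a few lines of Python (function names are my own; the 10% dropout rate below is purely illustrative):

```python
import math

def yamane_sample_size(N, e):
    """Yamane's formula n = N / (1 + N * e^2) for a known finite population of size N."""
    return N / (1 + N * e ** 2)

def adjust_for_dropout(n, d):
    """Inflate a calculated sample size n for an anticipated dropout rate d: N1 = n / (1 - d)."""
    return math.ceil(n / (1 - d))

n = yamane_sample_size(N=5000, e=0.05)
print(round(n, 2))                    # 370.37, i.e. about 370 subjects
print(adjust_for_dropout(370, 0.10))  # enrol 412 to allow for 10% dropout
```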
Statistical hypothesis : Statements formulated about a statistical population are called statistical hypotheses. A statistical hypothesis specifies the probability distribution of the population or the parameters involved in it. Suppose we are interested in examining two fertilizers A and B according to their yields. In this context we can formulate the following statements about the fertilizers A and B:
1. Fertilizer A provides more yield than fertilizer B.
2. Fertilizer B provides more yield than fertilizer A.
3. There is no significant difference between the two fertilizers A and B with respect to their yields.
The above three statements are called statistical hypotheses. The third statement is an unbiased statement, whereas the remaining two statements show bias towards one of the fertilizers.
Null hypothesis : The statistical hypothesis which is to be tested is called the null hypothesis. It is denoted by H0. The null hypothesis should be neutral regarding the outcome of the test: it should be completely impartial and should not allow any personal views to influence the decision. For example, suppose we have two educational systems A and B and we have to examine which system is better; in this context the null hypothesis will be framed as
H0 : There is no significant difference between the two educational systems.
Alternative hypothesis : The hypothesis which we accept when the null hypothesis is rejected is called the alternative hypothesis. In other words, the statistical hypothesis framed as the exact opposite of the null hypothesis is called the alternative hypothesis. It is denoted by H1.
Example : If, in testing the unbiasedness of a coin, the null hypothesis is H0 : p = 1/2, then the alternative hypothesis is H1 : p ≠ 1/2 (or) H1 : p > 1/2 (or) H1 : p < 1/2.
Simple and composite hypotheses :
Simple hypothesis : A hypothesis which specifies all the parameters of the population completely is known as a simple hypothesis.
Level of significance : During the testing of a hypothesis it is necessary to select a suitable level of significance. The confidence with which a researcher rejects or accepts the null hypothesis depends on the significance level adopted. The probability of rejecting the null hypothesis even when it is true is known as the level of significance. It is denoted by α and is generally taken as 5%.
Example : Suppose we have a sample of n observations with values x1, x2, x3, ..., xn drawn from a normal population with mean μ and variance σ². Then H0 : μ = μ0 and H0 : σ² = σ0² are simple hypotheses.
Composite hypothesis : A statistical hypothesis which does not specify all the parameters of the population completely is known as a composite hypothesis.
Example : H1 : μ < μ0 (or) H1 : μ > μ0 ; H1 : σ² < σ0² (or) H1 : σ² > σ0²
Two types of errors : Errors may happen when decisions are made with the help of a sample instead of the total population. The conclusions drawn (accepting or rejecting the null hypothesis) on the basis of sample data may not be true in all cases, so we may commit two types of errors, exhibited as follows:

                     H0 is True      H0 is False
Rejection of H0      Type-I Error    No Error
Acceptance of H0     No Error        Type-II Error

Type-I Error : The probability of rejecting the null hypothesis when it is true is called a Type-I error; equivalently, it is the probability of accepting the alternative hypothesis when it is false. A Type-I error is also known as producer's risk and is denoted by α.
α = P(Rejecting H0 / H0 is true) = P(Accepting H1 / H1 is false) = P(x ∈ w / H0) = ∫w L0 dx
Type-II Error : The probability of accepting the null hypothesis when it is false is called a Type-II error; equivalently, it is the probability of rejecting the alternative hypothesis when it is true. A Type-II error is also known as consumer's risk and is denoted by β.
β = P(Accepting H0 / H0 is false) = P(Rejecting H1 / H1 is true) = P(x ∈ w̄ / H1) = ∫w̄ L1 dx
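The meaning of α can be illustrated by simulation: when H0 is actually true and we reject whenever the test statistic falls in the critical region |z| > 1.96, a Type-I error should occur in roughly 5% of repeated experiments. A small seeded sketch (all numbers illustrative; it assumes a z-test with known σ = 1):

```python
import random
import statistics

rng = random.Random(42)
trials, n, rejections = 20000, 25, 0
for _ in range(trials):
    sample = [rng.gauss(0, 1) for _ in range(n)]  # H0 (mu = 0) is actually true
    z = statistics.mean(sample) * n ** 0.5        # z = xbar / (sigma / sqrt(n)), sigma = 1
    if abs(z) > 1.96:                             # two-tailed critical region at alpha = 0.05
        rejections += 1
print(rejections / trials)  # close to 0.05
```

The observed rejection rate approximates α; a similar simulation with a true effect present would estimate β and the power 1 − β.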
Critical Region : The critical region is also known as the rejection region. Let x1, x2, x3, ..., xn be the n sample observations, considered as a point in the sample space. The total sample space S can be divided into two disjoint sets, w and its complement w̄, so that S = w ∪ w̄ and w ∩ w̄ = φ. The null hypothesis H0 is rejected if the observed sample point falls in w; if it falls in w̄, then we reject H1 and accept the null hypothesis H0. The region w, in which H0 is rejected even though it is true, is called the critical region. In other words, the critical region is a region such that an observation falling in it leads to the rejection of the null hypothesis.
Tests of significance : In testing of hypothesis, a hypothesis is framed about a population parameter and we then test whether the framed hypothesis is correct or not. After framing the null hypothesis, our task is to decide on the rejection or acceptance of the null hypothesis. The methods used for this purpose are known as tests of significance.
Example : Suppose the bulbs of a company are manufactured by two methods, an old method and a new method. We have to decide which method increases the lifetime of the bulbs. In this case we can frame three statements:
• The old method is better than the new method.
• The new method is better than the old method.
• There is no significant difference between the old method and the new method.
A test of significance is a formal procedure for comparing observed data with a claim (also called a hypothesis) whose truth is being assessed. The claim is a statement about a parameter, such as the population proportion p or the population mean μ. We express the results of a significance test in terms of a probability. Testing of significance is also known as "hypothesis testing".
We have to decide which of the above statements is true in order to make a decision about the method of manufacturing. The systematic procedure used for taking a decision about such statements is known as a test of significance.
Power of the Test : 1 − β, the probability of rejecting the null hypothesis H0 when it is false, is known as the power of the test. Thus,
β = P(Accepting H0 / H0 is false) = P(x ∈ w̄ / H1) = ∫w̄ L1 dx
Power = 1 − β = 1 − ∫w̄ L1 dx
One-tailed and two-tailed tests : In any test, the critical region is represented by a portion of the area under the probability curve of the sampling distribution of the test statistic. A test of any hypothesis where the alternative hypothesis H1 is right-tailed is said to be a right-tailed test. Similarly, a test where the alternative hypothesis H1 is left-tailed is said to be a left-tailed test. A right-tailed or left-tailed test is known as a one-tailed test.
Example : In testing the mean of a population, the null hypothesis H0 : μ = μ0 is tested against the alternative hypothesis H1 : μ > μ0 (or) H1 : μ < μ0. Here H1 : μ < μ0 gives a left-tailed test and H1 : μ > μ0 gives a right-tailed test. In a right-tailed test the critical region lies entirely in the right tail of the sampling distribution of the test statistic; similarly, in a left-tailed test the critical region lies entirely in the left tail. A test of hypothesis where the alternative hypothesis is two-tailed, such as H0 : μ = μ0 against H1 : μ ≠ μ0, is known as a two-tailed test. In a two-tailed test the critical region lies in both tails of the sampling distribution of the test statistic.
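The one-tailed vs. two-tailed critical regions can be sketched as simple decision rules for a large-sample z-test at α = 0.05 (the 1.96 and 1.645 cut-offs are the standard normal critical values; function names are my own):

```python
def reject_two_tailed(z):
    """Two-tailed test at alpha = 0.05: critical region |z| > 1.96 (both tails)."""
    return abs(z) > 1.96

def reject_right_tailed(z):
    """Right-tailed test at alpha = 0.05: critical region z > 1.645 (right tail only)."""
    return z > 1.645

# z = 1.8 falls in the right-tailed critical region but not the two-tailed one
print(reject_two_tailed(1.8), reject_right_tailed(1.8))  # False True
```

This illustrates why a one-tailed test rejects more easily in its chosen direction, while a two-tailed test splits α between the two tails.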
SMALL SAMPLE TESTS
Student's t-Test : If x1, x2, x3, ..., xn is a random sample of size n from a normal population with mean μ and variance σ², then the Student's t-statistic is defined as
t = (x̄ − μ) / (S/√n)
where x̄ = Σx/n is the sample mean and S² = (1/(n−1)) Σ(xi − x̄)² is an unbiased estimate of the population variance σ². It follows Student's t-distribution with ν = n − 1 degrees of freedom, with probability density function
f(t) = [1 / (√ν B(1/2, ν/2))] · (1 + t²/ν)^(−(ν+1)/2) , −∞ < t < ∞
Alternatively, Student's t can also be written as
t = (x̄ − μ) / (s/√(n−1))
where s² = (1/n) Σ(xi − x̄)² is the sample variance, so that ns² = (n−1)S².
Assumptions for Student's t-test : The following assumptions are made in the Student's t-test:
(i) The parent population from which the sample is drawn is normal.
(ii) The population observations are independent, i.e., the given sample is random.
(iii) The population standard deviation is unknown.
Applications of the t-distribution : The t-distribution has a number of applications in statistics. Some of them are:
(i) t-test for the significance of a single sample mean, the population variance being unknown.
(ii) t-test for the significance of the difference between two sample means, the population variances being equal but unknown.
(iii) t-test for the significance of an observed sample correlation coefficient.
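The two equivalent forms of the t-statistic above can be computed directly from raw data. A minimal sketch (the function name and the illustrative weighings are my own choices):

```python
import math

def one_sample_t(data, mu):
    """t = (xbar - mu) / (S / sqrt(n)), with unbiased S^2 = sum((x - xbar)^2) / (n - 1).
    Algebraically identical to (xbar - mu) / (s / sqrt(n - 1)) with s using divisor n."""
    n = len(data)
    xbar = sum(data) / n
    S = math.sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))
    return (xbar - mu) / (S / math.sqrt(n))

# illustrative data: 10 weighings (in kg) tested against a claimed mean of 50
print(round(one_sample_t([50, 49, 52, 44, 45, 48, 46, 45, 49, 45], 50), 4))  # -3.1993
```

Note that S/√n = s/√(n−1), so both conventions used in the worked problems below give the same t value.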
t-test for Single Mean : Let x1, x2, x3, ..., xn be a random sample of size n drawn from a normal population, with sample mean x̄. To test the significant difference between the sample and population means, the null hypothesis can be taken as
H0 : There is no significant difference between the sample mean and the population mean, i.e., H0 : x̄ = μ.
To test the above null hypothesis the test statistic is
t = (x̄ − μ) / (S/√n)
where x̄ = Σx/n and S² = (1/(n−1)) Σ(xi − x̄)², which follows Student's t-distribution with (n − 1) degrees of freedom. Compare the calculated value of t with the critical value for (n − 1) degrees of freedom at the chosen level of significance.
If |tCal| < tTab, we cannot reject the null hypothesis, and we say there is no significant difference between the sample and population means.
If |tCal| > tTab, we reject the null hypothesis, and we say there is a significant difference between the sample and population means.
95% confidence limits for μ : x̄ ± t0.05 S/√n
99% confidence limits for μ : x̄ ± t0.01 S/√n
Problem 1 : A machine is designed to produce insulating washers for electrical devices with an average thickness of 0.025 cm. A random sample of 10 washers was found to have an average thickness of 0.024 cm with a standard deviation of 0.002 cm. Test the significance of the deviation.
Solution : Given, sample size n = 10; average thickness of washers in the sample x̄ = 0.024 cm; sample standard deviation s = 0.002 cm; average thickness of washers in the population μ = 0.025 cm.
H0 : There is no significant difference between the average thickness of washers in the sample and in the population.
To test the above null hypothesis, the test statistic is
t = (x̄ − μ) / (s/√(n−1)) = (0.024 − 0.025) / (0.002/√9) = −0.001 × 3 / 0.002 = −1.5
The tabulated value of t at the 5% level with n − 1 = 9 degrees of freedom is t0.05 = 2.262. Here |tCal| = 1.5 < 2.262, so we cannot reject the null hypothesis. There is no significant difference between the average thickness of washers in the sample and in the population.
Problem 2 : A soap manufacturing company was distributing a particular brand of soap through a large number of retail shops. Before a heavy advertisement campaign, the mean sale per week per shop was 140 dozens. After the campaign, a sample of 26 shops was taken and the mean sale was found to be 147 dozens with standard deviation 16. Can you consider the advertisement effective?
Solution : Given, number of shops in the sample n = 26; average sale of soaps after the campaign x̄ = 147 dozens; sample standard deviation s = 16; average sale before the campaign μ = 140 dozens.
H0 : There is no significant difference between the sample and population means.
H1 : μ > 140 (right-tailed test)
To test the above null hypothesis,
t = (x̄ − μ) / (s/√(n−1)) = (147 − 140) / (16/√25) = 7 × 5 / 16 = 2.19
The tabulated value of t at the 5% level (right-tailed test) with n − 1 = 25 degrees of freedom is 1.708. Here tCal = 2.19 > 1.708, so we reject the null hypothesis. The advertisement campaign is effective in increasing the sale of soaps.
Problem 3 : A certain pesticide is packed into bags by a machine. A random sample of 10 bags is drawn and their contents are found to weigh (in kg) as follows:
50, 49, 52, 44, 45, 48, 46, 45, 49, 45
Test if the average packing can be taken as 50 kg.
Solution :

x  : 50   49   52   44   45   48   46   45   49   45   ; Σx = 473
x² : 2500 2401 2704 1936 2025 2304 2116 2025 2401 2025 ; Σx² = 22437

Sample size n = 10
Sample mean x̄ = Σx/n = 473/10 = 47.3 kg
Mean weight of a bag in the population μ = 50 kg
Sample standard deviation s = √(Σx²/n − (Σx/n)²) = √(22437/10 − 47.3²) = √(2243.7 − 2237.29) = √6.41 = 2.5318
H0 : The average weight of a bag is 50 kg.
To test the above null hypothesis,
t = (x̄ − μ) / (s/√(n−1)) = (47.3 − 50) / (2.5318/3) = −2.7 × 3 / 2.5318 = −3.1993
The tabulated value of t at the 5% level with n − 1 = 9 degrees of freedom is 2.262. Here |tCal| = 3.1993 > 2.262, so we reject the null hypothesis. The average weight of a bag cannot be taken as 50 kg.
t-Test for difference of means : Let x11, x12, ..., x1n1 and x21, x22, ..., x2n2 be two random samples of sizes n1 and n2 drawn from normal populations. To test the significant difference between the two sample means, the null hypothesis can be framed as
H0 : There is no significant difference between the two sample means, i.e., H0 : x̄1 = x̄2.
To test the above null hypothesis, the test statistic can be taken as
t = (x̄1 − x̄2) / (S√(1/n1 + 1/n2)) ~ t(n1 + n2 − 2)
where x̄1 = (1/n1) Σ x1i , x̄2 = (1/n2) Σ x2j , s1² = (1/n1) Σ(x1i − x̄1)² , s2² = (1/n2) Σ(x2j − x̄2)² and
S² = (n1 s1² + n2 s2²) / (n1 + n2 − 2)
This statistic follows Student's t-distribution with n1 + n2 − 2 degrees of freedom. Compare the calculated value of t with the critical value for (n1 + n2 − 2) degrees of freedom at the chosen level of significance. If |tCal| < tTab, we cannot reject the null hypothesis, and we say there is no significant difference between the two sample means. If |tCal| > tTab, we reject the null hypothesis, and we say there is a significant difference between the two sample means.
Problem 1 : The average numbers of articles produced by two machines per day are 200 and 250, with standard deviations 20 and 25 respectively, on the basis of records of 25 days' production. Can you regard both machines as equally efficient at the 5% level of significance?
Solution : Given, average number of articles produced per day by the first machine x̄1 = 200; by the second machine x̄2 = 250; sample sizes n1 = 25, n2 = 25; standard deviations s1 = 20, s2 = 25.
H0 : There is no significant difference between the production capacities of the two machines.
S² = (n1 s1² + n2 s2²) / (n1 + n2 − 2) = (25 × 400 + 25 × 625) / 48 = 25625/48 = 533.8542
To test the above null hypothesis,
t = (x̄1 − x̄2) / (S√(1/n1 + 1/n2)) = (200 − 250) / √(533.8542 × (1/25 + 1/25)) = −50 / √(533.8542 × 0.08) = −50 / 6.5352 = −7.6509
The tabulated value t0.05 with n1 + n2 − 2 = 48 degrees of freedom is 1.96. Here the calculated |t| is greater than the critical value at the 5% level of significance, i.e., we reject the null hypothesis. There is a significant difference between the production capacities of the two machines.
Problem 2 : The means of two random samples of sizes 9 and 7 are 196.42 and 198.82 respectively. The sums of the squares of the deviations from the means are 26.94 and 18.73 respectively. Can the samples be considered to have been drawn from the same normal population?
Solution : Size of the first sample n1 = 9; size of the second sample n2 = 7; mean of the first sample x̄1 = 196.42; mean of the second sample x̄2 = 198.82.
Sum of squares of deviations from the mean in the first sample: Σ(x1i − x̄1)² = 26.94
Sum of squares of deviations from the mean in the second sample: Σ(x2j − x̄2)² = 18.73
Since Σ(x1i − x̄1)² = n1 s1² and Σ(x2j − x̄2)² = n2 s2²,
S² = (n1 s1² + n2 s2²) / (n1 + n2 − 2) = (26.94 + 18.73) / (9 + 7 − 2) = 45.67/14 = 3.2621
H0 : There is no significant difference between the means of the two samples.
To test the above null hypothesis,
t = (x̄1 − x̄2) / (S√(1/n1 + 1/n2)) = (196.42 − 198.82) / √(3.2621 × (1/9 + 1/7)) = −2.40 / 0.9103 = −2.6365
|tCal| = 2.6365. The tabulated value t0.05 with n1 + n2 − 2 = 14 degrees of freedom is 2.15. Here the calculated |t| is greater than the critical value at the 5% level of significance, i.e., we reject the null hypothesis. There is a significant difference between the two sample means, so the samples cannot be considered to have been drawn from the same normal population.
Problem 3 : Two different drugs A and B were tried on certain patients for increasing weight; 5 persons were given drug A and 7 persons were given drug B. The increases in weight (in pounds) are given below.
Drug A: 8 12 13 9 3
Drug B: 10 8 12 15 6 8 11
Do the two drugs differ significantly with regard to their effect on increasing weight?
Solution : Number of patients given drug A: n1 = 5; number of patients given drug B: n2 = 7.
Mean of the first sample x̄1 = (8 + 12 + 13 + 9 + 3)/5 = 45/5 = 9
Mean of the second sample x̄2 = (10 + 8 + 12 + 15 + 6 + 8 + 11)/7 = 70/7 = 10
x1   x1 − x̄1   (x1 − x̄1)²   x2   x2 − x̄2   (x2 − x̄2)²
8    −1        1            10   0         0
12   3         9            8    −2        4
13   4         16           12   2         4
9    0         0            15   5         25
3    −6        36           6    −4        16
                            8    −2        4
                            11   1         1
45             62           70             54

S² = (Σ(x1 − x̄1)² + Σ(x2 − x̄2)²) / (n1 + n2 − 2) = (62 + 54) / (5 + 7 − 2) = 116/10 = 11.6
H0 : There is no significant difference between the increases in weight of patients taking drug A and drug B.
To test the above null hypothesis,
t = (x̄1 − x̄2) / (S√(1/n1 + 1/n2)) = (9 − 10) / √(11.6 × (1/5 + 1/7)) = −1 / √(11.6 × 0.3429) = −1/1.9944 = −0.5014
|tCal| = 0.5014. The tabulated value t0.05 with n1 + n2 − 2 = 10 degrees of freedom is 2.23. Here the calculated |t| is less than the critical value at the 5% level of significance, i.e., we cannot reject the null hypothesis. There is no significant difference between the increases in weight of patients taking drug A and drug B.
PAIRED t-TEST : This test is designed to examine whether the difference between the corresponding values of a sample at two levels is significant or not. If n pairs are considered as a sample, we can test the null hypothesis that there is no significant difference between the n paired observations. This test can be applied only when sample pairs are available. Some applications of this test are to check whether students benefit from a particular type of coaching method, whether two types of food stuffs increase the weight of chicks, etc.
Suppose n chicks are selected at random. Let the initial weights of these chicks be x1, x2, x3, ..., xn. The chicks are fed a particular brand of food stuff, and their weights after feeding are y1, y2, y3, ..., yn. Hence we get n paired observations (x1, y1), (x2, y2), (x3, y3), ..., (xn, yn). The null hypothesis can be taken as H0 : μ1 = μ2. The following are the steps to test the null hypothesis.
Step 1 : Calculate the differences di for each sample pair, i.e., di = xi − yi.
Step 2 : Calculate d̄ = Σdi / n.
Step 3 : Calculate S² = (1/(n−1)) Σ(di − d̄)² = (1/(n−1)) (Σdi² − (Σdi)²/n).
Step 4 : Compute the test statistic t = d̄ / (S/√n) ~ t(n−1).
This test statistic follows the t-distribution with n − 1 degrees of freedom. The calculated t is compared with the tabulated value at the desired level of significance. If the calculated |t| is less than the critical value, we cannot reject the null hypothesis; otherwise we reject it.
Problem 1 : The sales data of an item in six shops before and after a special promotional campaign are as under:
Shops:           A  B  C  D  E  F
Before Campaign: 53 28 31 48 50 42
After Campaign:  58 29 30 55 56 45
Can the campaign be judged a success?
Solution :
H0 : There is no significant change in the sales after the special promotional campaign.
Shop   xi   yi   di = xi − yi   di²
A      53   58   −5             25
B      28   29   −1             1
C      31   30   1              1
D      48   55   −7             49
E      50   56   −6             36
F      42   45   −3             9
Total            −21            121

d̄ = Σdi / n = −21/6 = −3.5
S² = (1/(n−1)) (Σdi² − (Σdi)²/n) = (1/5)(121 − 441/6) = (1/5)(121 − 73.5) = 47.5/5 = 9.5
To test the above null hypothesis, the test statistic is
t = d̄ / (S/√n) = −3.5 / √(9.5/6) = −3.5/1.2583 = −2.7815
|tCal| = 2.7815. The tabulated value of t at the 5% level (one-tailed) with n − 1 = 5 degrees of freedom is 2.02. Here |tCal| = 2.7815 > 2.02, so we reject the null hypothesis. There is a significant change in the sales after the special promotional campaign.
Problem 2 : The results of an IQ test are given below. Find out whether there is any change in IQ after the training programme.
Candidate:          1   2   3   4   5   6   7
IQ Before Training: 112 120 116 125 131 132 129
IQ After Training:  120 124 118 129 136 136 125
Solution :
H0 : There is no significant change in IQ after the training programme.
Candidate   xi    yi    di = xi − yi   di²
1           112   120   −8             64
2           120   124   −4             16
3           116   118   −2             4
4           125   129   −4             16
5           131   136   −5             25
6           132   136   −4             16
7           129   125   4              16
Total                   −23            157

d̄ = Σdi / n = −23/7 = −3.2857
S² = (1/(n−1)) (Σdi² − (Σdi)²/n) = (1/6)(157 − 529/7) = (1/6)(570/7) = 13.5714
To test the above null hypothesis, the test statistic is
t = d̄ / (S/√n) = −3.2857 / √(13.5714/7) = −3.2857/1.3924 = −2.3597
|tCal| = 2.3597. The tabulated value of t at the 5% level with n − 1 = 6 degrees of freedom is 2.45. Here |tCal| = 2.3597 < 2.45, so we cannot reject the null hypothesis. There is no significant change in IQ after the training programme.
Problem 3 : A drug is given to 10 patients and the increments in their blood pressure were recorded to be 3, 6, −2, 4, −3, 4, 6, 3, 2, 2. Test whether the drug has any effect on the change of blood pressure.
Solution :
H0 : There is no significant difference in the blood pressure readings of the patients before and after the drug.
d     d²
 3     9
 6    36
-2     4
 4    16
-3     9
 4    16
 6    36
 3     9
 2     4
 2     4
Σd = 25,  Σd² = 143

d̄ = Σd / n = 25/10 = 2.5

S² = (1/(n-1)) [ Σd² - (Σd)²/n ] = (1/9) [ 143 - 625/10 ] = (1/9)(805/10) = 8.9444

To test the above null hypothesis, the test statistic is

    t = d̄ / (S/√n) = 2.5 / √(8.9444/10) = 2.5 / 0.9457 = 2.6435

The tabulated value of t is t(n-1) = t(10-1) = t(9) = 2.26

Here calculated t > tabulated t, so we can reject the null hypothesis. There is a significant difference in the blood pressure readings of the patients before and after the drug.
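The three paired-test computations above follow the same recipe. As a quick arithmetic check, here is a short plain-Python sketch (the variable names are ours, not part of the original notes) reproducing Problem 3:

```python
# Paired t-test arithmetic for Problem 3: blood-pressure increments d_i.
d = [3, 6, -2, 4, -3, 4, 6, 3, 2, 2]
n = len(d)

d_bar = sum(d) / n                                          # mean difference
S2 = (sum(x * x for x in d) - sum(d) ** 2 / n) / (n - 1)    # S² = [Σd² - (Σd)²/n] / (n-1)
t = d_bar / (S2 / n) ** 0.5                                 # compared with t(n-1)
print(round(d_bar, 2), round(S2, 4), round(t, 4))
```

The same few lines apply to Problems 1 and 2 after replacing `d` with the before-minus-after differences.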
F – TEST FOR EQUALITY OF POPULATION VARIANCES :
The F-test is used to test the significance of the difference between the standard deviations (variances) of two samples. Consider two samples drawn from normal populations, x11, x12, x13, ..., x1n1 and x21, x22, x23, ..., x2n2. Let the means and variances of the two populations be μ1, μ2 and σ1², σ2² respectively.

The ratio of two independent χ² variates, each divided by its corresponding degrees of freedom, is known as the F statistic.

For testing the equality of the two population variances, the null hypothesis can be framed as
H0 : σ1² = σ2², i.e., there is no significant difference between the two variances.

To test the above null hypothesis, the test statistic is given by

    F = S1² / S2²   with (n1 - 1, n2 - 1) degrees of freedom,

where

    s1² = (1/n1) Σ x1i² - (Σ x1i / n1)²,      s2² = (1/n2) Σ x2i² - (Σ x2i / n2)²,
    S1² = n1 s1² / (n1 - 1),                  S2² = n2 s2² / (n2 - 1),

and the larger of S1², S2² is taken in the numerator. The above test statistic follows the F-distribution with (n1 - 1, n2 - 1) degrees of freedom. Compare the calculated value of F with the critical value at the desired level of significance. If the calculated value of F is less than the critical value, then we cannot reject the null hypothesis; otherwise we reject the null hypothesis at the desired level of significance.

Applications or Uses of the F-Test :
1. F-test for testing the significance of an observed sample multiple correlation.
2. F-test for testing the significance of an observed sample correlation ratio.
3. F-test for testing the linearity of regression.
4. F-test for testing the equality of several population means.
Problem 1 : The time taken by workers in performing a job by Method I and Method II is given below.

Method I  : 20 16 26 27 23 22
Method II : 27 33 42 35 32 34 38

Do the data show that the variances of the time distributions of the populations from which these samples are drawn do not differ significantly?

Solution :
H0 : There is no significant difference between the variances of the time distributions of the workers performing the job by Method I and Method II.

Method I (x1)   x1²     Method II (x2)   x2²
20               400    27                729
16               256    33               1089
26               676    42               1764
27               729    35               1225
23               529    32               1024
22               484    34               1156
                        38               1444
Σx1 = 134   Σx1² = 3074     Σx2 = 241   Σx2² = 8431

Sample variance using Method I :
    s1² = (1/n1) Σx1² - (Σx1/n1)² = 3074/6 - (134/6)² = 13.5556
Sample variance using Method II :
    s2² = (1/n2) Σx2² - (Σx2/n2)² = 8431/7 - (241/7)² = 19.1020

    S1² = n1 s1² / (n1 - 1) = 6(13.5556)/5 = 16.2667
    S2² = n2 s2² / (n2 - 1) = 7(19.1020)/6 = 22.2857

Here S2² > S1². To test the null hypothesis, the test statistic is
    F = S2² / S1² = 22.2857 / 16.2667 = 1.37

The tabulated value of F is F(n2-1, n1-1) = F(7-1, 6-1) = F(6, 5) = 4.95

Since the calculated value of F is less than the tabulated value, we cannot reject the null hypothesis. There is no significant difference between the variances of the time distributions of the workers performing the job by Method I and Method II.

Problem 2 : Two horses A and B were tested according to the time (in seconds) taken to run a particular track, with the following results.

Horse A : 28 30 32 33 33 29 34
Horse B : 29 30 30 24 27 29

Test whether the two horses have the same running capacity.

Solution :
H0 : The two horses A and B have the same running capacity.
Size of the first sample n1 = 7; size of the second sample n2 = 6.

Horse A (x1)   x1²     Horse B (x2)   x2²
28              784    29              841
30              900    30              900
32             1024    30              900
33             1089    24              576
33             1089    27              729
29              841    29              841
34             1156
Σx1 = 219   Σx1² = 6883     Σx2 = 169   Σx2² = 4787

Sample variance for Horse A :
    s1² = (1/n1) Σx1² - (Σx1/n1)² = 6883/7 - (219/7)² = 4.4898
Sample variance for Horse B :
    s2² = (1/n2) Σx2² - (Σx2/n2)² = 4787/6 - (169/6)² = 4.4722
    S1² = n1 s1² / (n1 - 1) = 7(4.4898)/6 = 5.2381
    S2² = n2 s2² / (n2 - 1) = 6(4.4722)/5 = 5.3666

Here S2² > S1². To test the null hypothesis, the test statistic is

    F = S2² / S1² = 5.3666 / 5.2381 = 1.0245

The tabulated value of F is F(n2-1, n1-1) = F(6-1, 7-1) = F(5, 6) = 4.39

Since the calculated value of F is less than the tabulated value, we cannot reject the null hypothesis. The two horses A and B have the same running capacity.

Problem 3 : In a sample of 8 observations, the sum of squares of deviations of the items from their mean was 94.5. In another sample of 10 observations, the value was found to be 101.7. Test whether the difference is significant at the 5 % level.

Solution :
H0 : There is no significant difference between the two sample variances.
Size of the first sample n1 = 8; size of the second sample n2 = 10.
Sum of squares of deviations from the mean in the first sample : Σ(x1 - x̄1)² = 94.5
Sum of squares of deviations from the mean in the second sample : Σ(x2 - x̄2)² = 101.7

    S1² = Σ(x1 - x̄1)² / (n1 - 1) = 94.5/7 = 13.5
    S2² = Σ(x2 - x̄2)² / (n2 - 1) = 101.7/9 = 11.3

Here S1² > S2². To test the null hypothesis, the test statistic is

    F = S1² / S2² = 13.5 / 11.3 = 1.19

The tabulated value of F is F(n1-1, n2-1) = F(8-1, 10-1) = F(7, 9) = 3.29

Since the calculated value of F is less than the tabulated value,
we cannot reject the null hypothesis. There is no significant difference between the two samples.

χ² - Distribution :
The χ²-distribution was discovered by Prof. Helmert in 1875 and was developed by Karl Pearson in 1900. Karl Pearson applied the χ²-distribution as a test of goodness of fit.

Definition of a χ²-variate : The square of a standard normal variate is defined as a χ²-variate with one degree of freedom. Let X ~ N(μ, σ²); then Z = (X - μ)/σ ~ N(0, 1), and

    χ² = Z² = ( (X - μ)/σ )²

is a χ²-variate with one degree of freedom. In general, if x1, x2, x3, ..., xn are n independent normal variates with means μi and variances σi² (i = 1, 2, ..., n), then

    χ² = Σ ( (xi - μi)/σi )² = Σ zi²

follows the χ²-distribution with n degrees of freedom.

Chi-Square Test : The t, F and χ² tests discussed so far were based on the assumption that the samples were drawn from a normal population. However, there are many situations in which it is not possible to make any dependable assumption about the parent distribution from which the samples have been drawn. This led to the development of a group of alternative techniques known as non-parametric or distribution-free methods. The chi-square test was first used by Karl Pearson in the year 1900. The χ² statistic describes the magnitude of the discrepancy between theory and observation.
Applications of the Chi-Square distribution :
The chi-square distribution has a number of applications, some of which are enumerated below.
i. Chi-square test of goodness of fit.
ii. χ²-test for independence of attributes.
iii. To test whether the population variance has a specified value.

Chi-Square test for goodness of fit :
Suppose we are given a set of observed frequencies obtained under some experiment and we want to test whether the experimental results support a particular hypothesis or theory. Karl Pearson developed a test for testing the significance of the difference between the experimental values and the theoretical values, known as the χ²-test of goodness of fit.

Steps for the computation of χ² and drawing conclusions :
Step 1 : Compute the expected frequencies E1, E2, E3, ..., En corresponding to the observed frequencies O1, O2, O3, ..., On under some theory or hypothesis.
Step 2 : Compute the deviations (Oi - Ei) for each frequency and then square them to obtain (Oi - Ei)².
Step 3 : Divide the squares of the deviations by the corresponding expected frequencies to obtain (Oi - Ei)²/Ei.
Step 4 : Add the values obtained in Step 3 to compute

    χ² = Σ (Oi - Ei)² / Ei

Step 5 : Look up the tabulated value of χ² for (n - 1) degrees of freedom at a certain level of significance, usually 5 % or 1 %, from the table of significant values of χ².
Step 6 : If the calculated value of χ² is less than the corresponding tabulated value, it is said to be non-significant at the required level of significance, and we may conclude that there is good correspondence between theory and experiment.
Step 7 : If the calculated value of χ² is greater than the tabulated value, it is said to be significant, and we may conclude that the experiment does not support the theory.

Conditions for the validity of the Chi-Square test :
The chi-square test statistic can be used only if the following conditions are satisfied.
1. N, the total frequency, should be large.
2. The sample observations should be independent.
3. The total expected frequency must be equal to the total observed frequency.
4. No theoretical frequency should be small. If any theoretical frequency is less than 5, then we cannot apply the χ²-test directly (such small classes are pooled with adjacent classes).

Problem 1 : The numbers of automobile accidents per week in a certain community were as follows:
12, 8, 20, 2, 14, 10, 15, 6, 9, 4
Are these frequencies in agreement with the belief that accident conditions were the same during the 10-week period?

Solution : Given sample size n = 10.
Total number of accidents in the 10-week period = 12 + 8 + 20 + 2 + 14 + 10 + 15 + 6 + 9 + 4 = 100
Average number of accidents per week in the given 10-week period = 100/10 = 10

H0 : Accident conditions were the same during the 10-week period.

Week    Oi    Ei    Oi - Ei   (Oi - Ei)²   (Oi - Ei)²/Ei
1       12    10      2          4          0.4
2        8    10     -2          4          0.4
3       20    10     10        100         10
4        2    10     -8         64          6.4
5       14    10      4         16          1.6
6       10    10      0          0          0
7       15    10      5         25          2.5
8        6    10     -4         16          1.6
9        9    10     -1          1          0.1
10       4    10     -6         36          3.6
Total  100   100                           26.6
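A minimal plain-Python sketch (ours, for checking the arithmetic) that reproduces the chi-square column of the accident table:

```python
# Chi-square goodness of fit for the accident data: are all weeks alike?
observed = [12, 8, 20, 2, 14, 10, 15, 6, 9, 4]
expected = [sum(observed) / len(observed)] * len(observed)   # 10 per week under H0

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1
print(round(chi2, 1), df)
```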
To test the above null hypothesis, the test statistic is

    χ² = Σ (Oi - Ei)² / Ei = 26.6

Degrees of freedom = 10 - 1 = 9
Tabulated value : χ²(n-1) = χ²(9) = 16.919

Here the calculated value of χ² is greater than the tabulated value of χ². Hence we can reject the null hypothesis, i.e., accident conditions were not the same during the given 10-week period.

Problem 2 : In a Mendelian experiment on breeding, four types of plants are expected to occur in the proportion 9 : 3 : 3 : 1. The observed frequencies are 891 round and yellow, 316 wrinkled and yellow, 290 round and green and 119 wrinkled and green. Find the chi-square value and examine the correspondence between the theory and the experiment.

Solution :
H0 : In the Mendelian experiment on breeding, there is no significant difference between the theoretical and observed frequencies.

Total number of observed plants = 891 + 316 + 290 + 119 = 1616
Given that the four types of plants are expected to occur in the proportion 9 : 3 : 3 : 1, the expected frequencies are

Round and yellow    : (9/16) × 1616 = 909
Wrinkled and yellow : (3/16) × 1616 = 303
Round and green     : (3/16) × 1616 = 303
Wrinkled and green  : (1/16) × 1616 = 101
Breed                  Oi     Ei     Oi - Ei   (Oi - Ei)²   (Oi - Ei)²/Ei
Round and yellow       891    909    -18       324          0.3565
Wrinkled and yellow    316    303     13       169          0.5578
Round and green        290    303    -13       169          0.5578
Wrinkled and green     119    101     18       324          3.2079
Total                 1616   1616                           4.6799

To test the above null hypothesis, the test statistic is

    χ² = Σ (Oi - Ei)² / Ei = 4.6799

Degrees of freedom = 4 - 1 = 3
Tabulated value : χ²(n-1) = χ²(3) = 7.81

Here the calculated value of χ² is less than the tabulated value of χ². Hence we cannot reject the null hypothesis, i.e., in the Mendelian experiment on breeding there is no significant difference between the theoretical and observed frequencies.

Fitting the Binomial distribution and testing the goodness of fit :
Let O1, O2, O3, ..., On be the observed frequencies for the random variable under consideration. By using these observed frequencies we fit a binomial distribution

    P(X = x) = nCx p^x q^(n-x),   x = 0, 1, 2, 3, ..., n

and hence calculate the expected frequencies E1, E2, E3, ..., En. Here the total expected frequency must be equal to the total observed frequency, i.e., ΣEi = ΣOi.

To test the significance of the difference between the observed and expected frequencies, the null hypothesis can be framed as
H0 : There is no significant difference between the observed and expected frequencies,
i.e., the binomial distribution is a good fit to the given (observed) frequencies.

To test the above null hypothesis, the test statistic is

    χ² = Σ (Oi - Ei)² / Ei

The above statistic follows the χ²-distribution with n - k - 1 degrees of freedom, where k is the number of parameters estimated from the data. Compare the calculated value of χ² with the critical or tabulated value of χ². If χ²(Cal) < χ²(Tab), we cannot reject the null hypothesis and we may conclude that the binomial distribution holds good for the given data; otherwise we reject the null hypothesis and conclude that the binomial distribution is not a good fit to the given data.

Problem 1 : Records taken of the numbers of male and female births in 800 families having four children are given below.

No. of male births     0     1     2     3     4
No. of female births   4     3     2     1     0
Frequency             32   178   290   236    64

Test whether the data are consistent with the hypothesis that the binomial law holds and the chance of a male birth is equal to that of a female birth.

Solution :
H0 : The probability of a male birth is equal to that of a female birth, and the binomial law holds.

Under this hypothesis the probability of a male birth is p = 1/2 and the probability of a female birth is q = 1/2. The fitted binomial distribution for the given data is

    P(X = x) = nCx p^x q^(n-x)
    P(X = x) = 4Cx (1/2)^x (1/2)^(4-x) = 4Cx (1/2)^4,   x = 0, 1, 2, 3, 4      ... (1)

Put x = 0 in equation (1) : P(X = 0) = 4C0 (1/2)^4 = 1/16 = 0.0625
Put x = 1 in equation (1) : P(X = 1) = 4C1 (1/2)^4 = 4/16 = 0.25
Put x = 2 in equation (1) : P(X = 2) = 4C2 (1/2)^4 = 6/16 = 0.375
Put x = 3 in equation (1) : P(X = 3) = 4C3 (1/2)^4 = 4/16 = 0.25
Put x = 4 in equation (1) : P(X = 4) = 4C4 (1/2)^4 = 1/16 = 0.0625

Expected frequencies :
E(X = 0) = N P(X = 0) = 800 × 0.0625 = 50
E(X = 1) = N P(X = 1) = 800 × 0.25   = 200
E(X = 2) = N P(X = 2) = 800 × 0.375  = 300
E(X = 3) = N P(X = 3) = 800 × 0.25   = 200
E(X = 4) = N P(X = 4) = 800 × 0.0625 = 50

x      Oi     Ei    Oi - Ei   (Oi - Ei)²   (Oi - Ei)²/Ei
0       32     50   -18        324          6.48
1      178    200   -22        484          2.42
2      290    300   -10        100          0.3333
3      236    200    36       1296          6.48
4       64     50    14        196          3.92
Total  800    800                          19.6333
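The fitted probabilities, expected frequencies and chi-square value above can be recomputed in a few lines of plain Python (illustrative only, not part of the original notes):

```python
from math import comb

# Fit Binomial(4, 1/2) to the 800-family birth data and recompute chi-square.
observed = [32, 178, 290, 236, 64]          # families with 0..4 male children
N, n, p = sum(observed), 4, 0.5
expected = [N * comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print([round(e) for e in expected], round(chi2, 4))
```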
To test the above null hypothesis, the test statistic is

    χ² = Σ (Oi - Ei)² / Ei = 19.6333,   so χ²(Cal) = 19.6333

Tabulated value : χ²(Tab) = χ²(n-1) = χ²(5-1) = χ²(4) = 9.488

Here χ²(Cal) > χ²(Tab), so we can reject the null hypothesis. The male and female births are not equally probable, and the binomial distribution is not a good fit for the given data.

Fitting the Poisson distribution and testing the goodness of fit :
Let O1, O2, O3, ..., On be the observed frequencies for the random variable under consideration. By using these observed frequencies we fit a Poisson distribution

    P(X = x) = e^(-λ) λ^x / x!,   x = 0, 1, 2, 3, ...

and hence calculate the expected frequencies E1, E2, E3, ..., En. Here the total expected frequency must be equal to the total observed frequency, i.e., ΣEi = ΣOi.

To test the significance of the difference between the observed and expected frequencies, the null hypothesis can be framed as
H0 : There is no significant difference between the observed and expected frequencies, i.e., the Poisson distribution is a good fit to the given (observed) frequencies.

To test the above null hypothesis, the test statistic is

    χ² = Σ (Oi - Ei)² / Ei

The above statistic follows the χ²-distribution with n - k - 1 degrees of freedom, where k is the number of parameters estimated from the data. Compare the calculated value of χ² with the critical or tabulated value of χ². If χ²(Cal) < χ²(Tab), we cannot reject the null hypothesis and we may conclude that the Poisson distribution holds good for the given data; otherwise we reject the null hypothesis and conclude that the Poisson distribution is not a good fit to the given data.

Problem : The following numbers of mistakes per page were observed in a book.
No. of mistakes per page    0     1    2    3    4    Total
No. of pages              211    90   19    5    0    325

Fit a Poisson distribution and test the goodness of fit.

Solution :
H0 : The Poisson distribution is a good fit to the given data.
Let x be the random variable denoting the number of mistakes per page.

x       f     fx
0      211     0
1       90    90
2       19    38
3        5    15
4        0     0
Total  325   143

Average number of mistakes per page : x̄ = Σfx / N = 143/325 = 0.44
In a Poisson distribution the mean is λ, so λ = x̄ = 0.44.

The fitted Poisson distribution for the given data is

    P(X = x) = e^(-λ) λ^x / x! = e^(-0.44) (0.44)^x / x!,   x = 0, 1, 2, 3, 4

Put x = 0 : P(X = 0) = e^(-0.44) (0.44)^0 / 0! = 0.6440
Put x = 1 : P(X = 1) = e^(-0.44) (0.44)^1 / 1! = 0.2834
Put x = 2 : P(X = 2) = e^(-0.44) (0.44)^2 / 2! = 0.0623
Put x = 3 : P(X = 3) = e^(-0.44) (0.44)^3 / 3! = 0.0091
Put x = 4 : P(X = 4) = e^(-0.44) (0.44)^4 / 4! = 0.0010

Expected frequencies :
E(X = 0) = N P(X = 0) = 325 × 0.6440 = 209.3
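The remaining expected frequencies are obtained in exactly the same way; a short plain-Python sketch (ours, for checking the arithmetic) regenerates λ and all five expected counts:

```python
from math import exp, factorial

# Fit a Poisson distribution to the mistakes-per-page data (lambda = sample mean).
x = [0, 1, 2, 3, 4]
f = [211, 90, 19, 5, 0]
N = sum(f)
lam = sum(xi * fi for xi, fi in zip(x, f)) / N        # 143/325 = 0.44

expected = [N * exp(-lam) * lam**xi / factorial(xi) for xi in x]
print(round(lam, 2), [round(e, 1) for e in expected])
```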
E(X = 1) = N P(X = 1) = 325 × 0.2834 = 92.1
E(X = 2) = N P(X = 2) = 325 × 0.0623 = 20.3
E(X = 3) = N P(X = 3) = 325 × 0.0091 = 3.0
E(X = 4) = N P(X = 4) = 325 × 0.0010 = 0.3

Since the expected frequencies for x = 3 and x = 4 are less than 5, the last three classes are pooled.

Oi                Ei                    Oi - Ei   (Oi - Ei)²   (Oi - Ei)²/Ei
211               209.3                  1.7       2.89         0.01381
 90                92.1                 -2.1       4.41         0.04788
19 + 5 + 0 = 24   20.3 + 3.0 + 0.3
                  = 23.6                 0.4       0.16         0.00678
Total 325         325                                           0.06847

    χ² = Σ (Oi - Ei)² / Ei = 0.06847

After pooling there are 3 classes, and one parameter (λ) was estimated from the data, so the degrees of freedom are 3 - 1 - 1 = 1.
Tabulated value : χ²(1) = 3.841

Here the calculated value of χ² is less than the tabulated value of χ². Hence we cannot reject the null hypothesis, and we can conclude that the Poisson distribution is a good fit to the given data.

χ² - Test for independence of attributes :
Let us suppose that the given population, consisting of N items, is divided into r mutually disjoint (exclusive) and exhaustive classes A1, A2, A3, ..., Ar with respect to the attribute A. Similarly, let us suppose that the same population is divided into s mutually disjoint and exhaustive classes B1, B2, B3, ..., Bs with respect to the attribute B. The frequencies can be represented in the following r × s manifold contingency table.
          B1        B2       ...   Bj        ...   Bs        Total
A1       (A1B1)    (A1B2)   ...   (A1Bj)    ...   (A1Bs)    (A1)
A2       (A2B1)    (A2B2)   ...   (A2Bj)    ...   (A2Bs)    (A2)
 :
Ai       (AiB1)    (AiB2)   ...   (AiBj)    ...   (AiBs)    (Ai)
 :
Ar       (ArB1)    (ArB2)   ...   (ArBj)    ...   (ArBs)    (Ar)
Total    (B1)      (B2)     ...   (Bj)      ...   (Bs)       N

Here (Ai) is the frequency of the i-th class of attribute A, i.e., the number of persons possessing the attribute Ai (i = 1, 2, 3, ..., r); (Bj) is the number of persons possessing the attribute Bj (j = 1, 2, 3, ..., s); and (AiBj) is the number of persons possessing both the attributes Ai and Bj. Also,

    (Ai) = Σj (AiBj),   (Bj) = Σi (AiBj),   N = Σi Σj (AiBj)

H0 : The attributes A and B are independent.

Under the null hypothesis, the expected cell frequencies are

    (AiBj)E = (Ai)(Bj) / N

The χ²-test statistic is given by

    χ² = Σi Σj [ (AiBj)O - (AiBj)E ]² / (AiBj)E = Σi Σj (Oij - Eij)² / Eij

The statistic follows the χ²-distribution with (r - 1)(s - 1) degrees of freedom. Compare the calculated value of χ² with the critical or tabulated value of χ². If χ²(Cal) > χ²(Tab), we reject the null hypothesis; otherwise we cannot reject the null hypothesis.
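The rule (AiBj)E = (Ai)(Bj)/N and the chi-square sum can be sketched as follows (plain Python, an illustration of ours; the numbers are the 2 × 2 typhoid table of the problem that follows):

```python
# Expected cell counts and chi-square for an r x s contingency table.
table = [[144, 312],    # drug administered:  infection, no infection
         [192,  72]]    # no drug:            infection, no infection

row = [sum(r) for r in table]            # (Ai) row totals
col = [sum(c) for c in zip(*table)]      # (Bj) column totals
N = sum(row)

chi2 = 0.0
for i in range(len(row)):
    for j in range(len(col)):
        e = row[i] * col[j] / N          # E(AiBj) = (Ai)(Bj)/N
        chi2 += (table[i][j] - e) ** 2 / e
df = (len(row) - 1) * (len(col) - 1)
print(round(chi2, 4), df)
```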
Problem : A certain drug was administered to 456 males out of a total of 720 in a certain locality to test its efficacy against typhoid. The incidence of typhoid is shown below. Find out the effectiveness of the drug against the disease.

                                 Infection   No infection   Total
Administering the drug           144         312            456
Without administering the drug   192          72            264
Total                            336         384            720

Solution :
H0 : The incidence of typhoid and the administration of the drug are independent.

By the independence of attributes, the expected frequencies are

    E(144) = 456 × 336 / 720 = 212.8
    E(192) = 264 × 336 / 720 = 123.2
    E(312) = 456 × 384 / 720 = 243.2
    E(72)  = 264 × 384 / 720 = 140.8

Oij    Eij      Oij - Eij   (Oij - Eij)²   (Oij - Eij)²/Eij
144    212.8    -68.8       4733.44        22.2436
192    123.2     68.8       4733.44        38.4208
312    243.2     68.8       4733.44        19.4632
 72    140.8    -68.8       4733.44        33.6182
Total                                     113.7458

Test statistic :

    χ² = Σ (Oij - Eij)² / Eij = 113.7458

Tabulated value of χ² : χ²((r-1)(s-1)) = χ²((2-1)(2-1)) = χ²(1) = 3.841

Here the calculated value of χ² is greater than the tabulated value of χ².
Hence we can reject the null hypothesis and say that the incidence of typhoid and the administration of the drug are not independent.

Problem 2 : Data on hair colour and eye colour are given in the table. Calculate the χ² value and determine the association between hair colour and eye colour.

                          Hair Colour
               Fair   Brown   Black   Total
Eye    Blue     15     20       5      40
Colour Gray     20     20      10      50
       Brown    25     20      15      60
Total           60     60      30     150

Solution :
H0 : Hair colour and eye colour are independent.

By the independence of attributes, the expected frequencies are

    E(Blue, Fair)  = 40 × 60 / 150 = 16    E(Blue, Brown)  = 40 × 60 / 150 = 16    E(Blue, Black)  = 40 × 30 / 150 = 8
    E(Gray, Fair)  = 50 × 60 / 150 = 20    E(Gray, Brown)  = 50 × 60 / 150 = 20    E(Gray, Black)  = 50 × 30 / 150 = 10
    E(Brown, Fair) = 60 × 60 / 150 = 24    E(Brown, Brown) = 60 × 60 / 150 = 24    E(Brown, Black) = 60 × 30 / 150 = 12
Oij    Eij    Oij - Eij   (Oij - Eij)²   (Oij - Eij)²/Eij
15     16     -1           1              0.0625
20     16      4          16              1
 5      8     -3           9              1.125
20     20      0           0              0
20     20      0           0              0
10     10      0           0              0
25     24      1           1              0.0417
20     24     -4          16              0.6667
15     12      3           9              0.75
Total                                     3.6459

Test statistic :

    χ² = Σ (Oij - Eij)² / Eij = 3.6459

Tabulated value of χ² : χ²((r-1)(s-1)) = χ²((3-1)(3-1)) = χ²(4) = 9.49

Here the calculated value of χ² is less than the tabulated value of χ². We cannot reject the null hypothesis, and we can say that eye colour and hair colour are independent.

Non-Parametric Tests
Introduction : Most of the statistical tests we have discussed so far are based on the following two features.
1. The form of the frequency function of the parent population from which the samples are drawn is assumed to be known.
2. They are concerned with testing statistical hypotheses about the parameters of that frequency distribution.
For example, all the exact sampling tests of significance are based on the assumption that the parent population is normal, and they deal with estimating and testing the means and variances of these populations. Such tests, which deal with the parameters of a population, are known as parametric tests.

A non-parametric test is a test that does not depend on the particular form of the underlying frequency function from which the samples are drawn. In other words, a non-parametric test does not make any assumption regarding the form of the population. In non-parametric tests we still make some mild assumptions:
i. The sample observations are independent.
ii. The variable under study is continuous.
iii. The probability density function is continuous.
Advantages and disadvantages of non-parametric tests over parametric tests :
Advantages :
➢ Non-parametric tests are very simple and easy to apply compared with parametric tests.
➢ There is no need to make any assumption about the form of the frequency function of the parent population from which the sample is drawn.
➢ Non-parametric tests are useful for dealing with data which consist of ranks.
Disadvantages :
➢ Non-parametric tests can be used only if the measurements are at least ordinal. Even then, if a suitable parametric test exists, the non-parametric test is less efficient than the parametric test.
➢ Non-parametric tests do not exist for testing interaction effects in the analysis of variance (ANOVA).
➢ Non-parametric tests are designed to test statistical hypotheses only; they are not applicable for estimating parameters.

Difference between parametric and non-parametric tests :

Parametric Test                                          Non-Parametric Test
1. The test statistic specifies certain conditions       1. The test statistic does not specify any
   about the parameters.                                    conditions about the parameters.
2. Requires measurements of the variables.               2. Does not require measurements of the
                                                            variables; ranks suffice.
3. The sample must be specified and classified           3. The sample data need not be classified
   exactly.                                                 exactly.
4. The parent distribution of the population must        4. There is no need to know the parent
   be known.                                                distribution of the population.
5. Cannot be used for data consisting only of signs.     5. Can be used for data which contain ranks, or
                                                            positive and negative signs.
Run test :
A run is a sequence of one or more identical symbols representing a common property of the data. Let x1, x2, x3, ..., xn be the values of a random sample drawn from a population whose distribution function is unknown. To test the randomness of the sample, the null hypothesis can be taken as
H0 : The given sample observations are random,
against the alternative hypothesis
H1 : The given sample observations are not random (not independent).

To test the above null hypothesis, the test statistic is given by

    Z = [ r - (n/2 + 1) ] / √[ n(n - 2) / 4(n - 1) ]  ~  N(0, 1)

where r = number of runs and n = sample size. The above test statistic gives the calculated value of Z. Compare the calculated value of Z with the tabulated or critical value. If |Z(Cal)| is greater than Z(Tab) at the specified level of significance, then we infer that the sample observations are not random; otherwise we say that the sample observations are random.

Problem : The wins and losses of a cricket team in 30 matches are
W W L L L W W L L L W W W W L W L L L W W L W W W L W L W W.
Are the observations random or not?

Solution :
W W | L L L | W W | L L L | W W W W | L | W | L L L | W W | L | W W W | L | W | L | W W

Here the number of runs r = 15 and the sample size n = 30.

H0 : The given sample observations are random.

To test the above null hypothesis, the test statistic is

    Z = [ r - (n/2 + 1) ] / √[ n(n - 2) / 4(n - 1) ]
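Before plugging in the numbers, the run count and the Z statistic can be checked with a short plain-Python script (ours; the string transcribes the W/L sequence above):

```python
# Runs test for randomness: count runs in the W/L sequence and form Z.
seq = "WWLLLWWLLLWWWWLWLLLWWLWWWLWLWW"
n = len(seq)
r = 1 + sum(1 for a, b in zip(seq, seq[1:]) if a != b)   # number of runs

mean_r = n / 2 + 1                        # E(r)
var_r = n * (n - 2) / (4 * (n - 1))       # V(r)
Z = (r - mean_r) / var_r ** 0.5
print(r, round(Z, 4))
```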
    Z = (15 - 16) / √(30 × 28 / (4 × 29)) = -1 / 2.6910 = -0.3716,   so |Z(Cal)| = 0.3716

Z(Tab) = 1.96

Here |Z(Cal)| < Z(Tab), so we cannot reject the null hypothesis. The given sample observations are random.

Sign Test :
The sign test gets its name from the fact that it uses only the positive or negative signs of the differences. It is useful in research in which comparative measurements are the more important, and it is used in the case of two related samples. Let x1, x2, x3, ..., xn and y1, y2, y3, ..., yn be two random samples of the same size drawn from two populations. Consider the sign of the difference of the corresponding sample observations, i.e., di = xi - yi, where i = 1, 2, 3, .......

If the two populations are identical and continuous, then the probability of a positive sign equals the probability of a negative sign, each being 1/2. To test the equality of the populations, the null hypothesis is
H0 : The two populations are identical.

Under H0 the number of positive signs is a binomial variate with mean np = n/2 and variance npq = n/4. Thus, if d = number of positive signs, then E(d) = n/2 and V(d) = n/4, and for a reasonably large sample a t-type statistic is effective. To test the above null hypothesis, the test statistic is given by
    t = (d - n/2) / √(n/4)  ~  t(n-1)

The above test statistic gives the calculated value of t. Compute the critical value of t for the specified level of significance with n - 1 degrees of freedom. If the calculated value of t is less than the critical value of t, then we cannot reject the null hypothesis at the specified level of significance; otherwise we reject the null hypothesis.

Problem : Two samples are drawn from a large population and the sample points are

Sample I  : 5 7 8 4 3 6 4 8 9 4 5 6 6 8 5 4 7 3 8 9 11 6 7 8 10
Sample II : 4 8 6 3 5 7 2 5 7 5 6 5 5 7 6 5 4 8 3 7 10 8 6 5 11

Are the two samples drawn from the same population or not?

Solution :
H0 : The given two samples are drawn from the same population.

Sample I   Sample II   di = xi - yi
 5          4           1
 7          8          -1
 8          6           2
 4          3           1
 3          5          -2
 6          7          -1
 4          2           2
 8          5           3
 9          7           2
 4          5          -1
 5          6          -1
 6          5           1
 6          5           1
 8          7           1
 5          6          -1
 4          5          -1
 7          4           3
 3          8          -5
 8          3           5
 9          7           2
11         10           1
 6          8          -2
 7          6           1
 8          5           3
10         11          -1

Here, the number of positive signs d = 15.
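The sign count and the statistic can be verified mechanically (plain-Python sketch of ours; ties, if any, would simply be dropped, though this data set has none):

```python
# Sign test: count positive differences and form the approximate t statistic.
s1 = [5, 7, 8, 4, 3, 6, 4, 8, 9, 4, 5, 6, 6, 8, 5, 4, 7, 3, 8, 9, 11, 6, 7, 8, 10]
s2 = [4, 8, 6, 3, 5, 7, 2, 5, 7, 5, 6, 5, 5, 7, 6, 5, 4, 8, 3, 7, 10, 8, 6, 5, 11]

n = len(s1)
d = sum(1 for a, b in zip(s1, s2) if a - b > 0)   # number of + signs
t = (d - n / 2) / (n / 4) ** 0.5                   # compared with t(n-1)
print(d, t)
```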
To test the above null hypothesis, the test statistic is given by

    t = (d - n/2) / √(n/4) = (15 - 25/2) / √(25/4) = 2.5 / 2.5 = 1,   so t(Cal) = 1

The tabulated value of t is t(Tab) = t(n-1) = t(24) = 2.064

Here t(Cal) < t(Tab), so we cannot reject the null hypothesis. The given two samples may be regarded as drawn from the same population.

WILCOXON SIGNED RANK TEST :
CASE (I) (One sample) : Let x1, x2, x3, ..., xn be a random sample drawn from a population with median M. Compute the difference of each observation from the median, (x1 - M), (x2 - M), ..., (xn - M), note whether each difference is positive or negative, and let r be the number of positive signs. Calculate

    P = (1/2^n) Σ (x = 0 to r) nCx

The null hypothesis can be taken as
H0 : The population median is M.
Compare the calculated value of P with the chosen level of significance and draw the inference accordingly. If P is less than the level of significance, we reject the null hypothesis; otherwise we cannot reject it, and we may say that the stated median is consistent with the sample.

CASE (II) : Let x1, x2, x3, ..., xn and y1, y2, y3, ..., yn be two samples of equal size drawn from the same population. Calculate the difference of each pair of corresponding observations of the two samples. To test the equality of the two samples, the null hypothesis can be taken as
H0 : The given two samples are drawn from the same population.
To test the above null hypothesis, the test statistic is given by
    Z = (r - n/2) / √(n/4)  ~  N(0, 1)

The above test statistic gives the calculated value of Z. Compute the critical value of Z for the specified level of significance. If the calculated value of Z is greater than the tabulated value, then we reject the null hypothesis at the specified level of significance; otherwise we cannot reject the null hypothesis.

Problem : A sample of 7 observations is drawn from a large population, and the sample values are 67, 35, 34, 70, 65, 48, 60. Is the sample drawn from a population whose median is 50?

Solution : Given, the population median M = 50 and the sample size n = 7.

x     M     x - M   Sign
67    50     17      +
35    50    -15      -
34    50    -16      -
70    50     20      +
65    50     15      +
48    50     -2      -
60    50     10      +

Number of positive signs r = 4.

The test statistic is

    P = (1/2^n) Σ (x = 0 to r) nCx = (1/2^7) Σ (x = 0 to 4) 7Cx
      = (1/128) (7C0 + 7C1 + 7C2 + 7C3 + 7C4)
  • 48.
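The cumulative binomial probability above can be computed directly. A minimal sketch (the helper name `sign_test_p` is my own, not from the text):

```python
from math import comb

def sign_test_p(n, r):
    """P = (1/2^n) * sum_{x=0}^{r} nCx -- the probability of observing
    at most r positive signs among n differences under H0."""
    return sum(comb(n, x) for x in range(r + 1)) / 2 ** n

# Worked example from the text: n = 7 observations, r = 4 positive signs
p = sign_test_p(7, 4)
print(round(p, 4))  # 0.7734
```

The value is then compared with the critical value exactly as in the worked solution.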
      128 99 35 35 21 7 1 128 1 = + + + + = 7735 . 0 =  P Here, 5 . 0   Cal P Wecan reject the null hypothesis. The given sample is not drawn from the population whose median is 50. WILCOXON SIGNED RANK TEST: CASE(I) (One sample): If n x x x x ,...... , , 3 2 1 be the random sample drawn from the population with median ‘M’. Compute the difference between the each observation with the median ( ) ( ) ( ) ( ) M x M x M x M x n − − − − ,..... , , 3 2 1 and note that whether the difference is either positive or negative and assume that the number of positive signs are r and calculate P             =   = r x x n n C P 0 2 1 In the above test statistic, the null hypothesis can be taken as 0 H : The given median of the population is significant. Compare the calculated value of P with critical value and draw the inference accordingly. If the P is less than the tabulated value then we can accept the null hypothesis and we can say that the median of the population is significant, otherwise we can reject the null hypothesis. CASE(II): Let n x x x x ,...... , , 3 2 1 and n y y y y ,...... , , 3 2 1 be the two sample of equal size drawn from the same population. Calculate the difference of each corresponding observations of the two samples. To test the equality between the two samples the null hypothesis can be taken as. 0 H : The given two samples are drawn from the sample population To test the above null hypothesis the test statistic is given by ( ) 1 , 0 ~ 4 2 N n n r Z − =
  • 49.
    The above teststatistic gives the calculated value of Z . compute the critical value of Z for the specified level of significance. If the calculated value of Z is greater than the tabulated value, then we can reject the null hypothesis at the specified level of significance, otherwise we cannot reject the null hypothesis. Problem : Sample of 7 observations are drawn from the large population and the sample values are 67, 35, 34, 70, 65, 48, 60 Is the sample drawn from the population, whose median is 50. Solution : Given, Population median is 50 = M Size of the sample 7 = n x M M x − Sign 67 50 17 + 35 50 -15 − 34 50 -16 − 70 50 20 + 65 50 15 + 48 50 -2 − 60 50 10 + No of positive signs 4 = r The test statistic             =   = r x x n n C P 0 2 1             =  = 4 0 7 7 2 1 x x C   4 7 3 7 2 7 1 7 0 7 128 1 C C C C C + + + + =   128 99 35 35 21 7 1 128 1 = + + + + = 7735 . 0 =  P
  • 50.
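For larger samples the normal-approximation statistic Z = (r − n/2)/√(n/4) is used instead of the exact binomial sum. A small sketch (function name is my own), checked against the two-sample example with n = 25 pairs and 15 positive differences:

```python
from math import sqrt

def z_statistic(r, n):
    """Normal-approximation statistic for the number of positive signs r
    out of n differences: Z = (r - n/2) / sqrt(n/4)."""
    return (r - n / 2) / sqrt(n / 4)

# n = 25 paired differences, 15 of them positive
print(z_statistic(15, 25))  # 1.0
```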
ANALYSIS OF VARIANCE (ANOVA)

Introduction : Analysis of variance (ANOVA) is a powerful tool for tests of significance. Suppose we are interested in finding out whether the effects of different fertilizers on yields differ significantly or not. One procedure to answer this question is to conduct a t-test nC_2 times, once for each pair of treatments, which is tedious and impractical. The alternative is to apply the technique of ANOVA. The main aim of analysis of variance is to test the homogeneity of several means, i.e., to test whether different treatment effects differ significantly. ANOVA was introduced by Prof. R.A. Fisher in the year 1920.

Variation is inherent in nature. Every experiment yields a list of outcomes, and we should not expect them to be free of variation. Even when an experiment is conducted under similar conditions with totally homogeneous experimental units, variation exists in the results. For example, the signature of a person varies from sign to sign, even though it is his own signature. The total variation in any set of numerical data is due to a number of causes, which may be classified as

1. Variation due to assignable causes
2. Variation due to chance causes

The variation due to assignable causes can be detected and measured, whereas the variation due to chance causes is beyond the control of human hand and cannot be traced separately.

Definition : According to R.A. Fisher, analysis of variance is the "separation of variance ascribable to one group of causes from the variance ascribable to another group".

Assumptions of ANOVA : For the validity of the F-test in analysis of variance, the following assumptions are made.

1. The sample observations are independent.
2. The parent population from which the observations are drawn is Normal.
3. The various effects are additive in nature.

One-Way ANOVA : Let us suppose that N observations x_ij ; i = 1, 2, 3, ....., k ; j = 1, 2, 3, ....., n_i of a random variable x are split into k classes on some basis, with sizes n_1, n_2, n_3, ....., n_k respectively. These values are exhibited in the following classification table.

Class    Observations                       Total T_i.    Mean x̄_i.
1        x_11   x_12   x_13  ...  x_1n1      T_1.          x̄_1.
2        x_21   x_22   x_23  ...  x_2n2      T_2.          x̄_2.
3        x_31   x_32   x_33  ...  x_3n3      T_3.          x̄_3.
:        :                                   :             :
i        x_i1   x_i2   x_i3  ...  x_ini      T_i.          x̄_i.
:        :                                   :             :
k        x_k1   x_k2   x_k3  ...  x_knk      T_k.          x̄_k.

Grand total  G = Σ_i Σ_j x_ij = Σ_i T_i.

The total variation in the observations x_ij can be split into the following components.

1. Variation between the classes
2. Variation within the classes

The first type of variation is due to assignable causes, which can be identified, and the second type of variation is due to chance causes.

Null hypothesis :
H_0 : The class means of the classification table are equal to their general mean.
H_0 : μ_1. = μ_2. = μ_3. = ....... = μ_k. = μ
(or)  H_0 : α_1 = α_2 = α_3 = ....... = α_k = 0
Working procedure of One-Way ANOVA : The following are the steps to carry out ANOVA one-way classification.

Step 1 : Calculate the grand total  G = Σ_i Σ_j x_ij = Σ_i T_i.
Step 2 : Calculate the correction factor  C.F. = G²/N
Step 3 : Calculate the raw sum of squares  R.S.S. = Σ_i Σ_j x_ij²
Step 4 : Calculate the total sum of squares  T.S.S. = R.S.S. − C.F.
Step 5 : Calculate the sum of squares due to classes  S.S.C. = Σ_i (T_i.²/n_i) − C.F.
Step 6 : Calculate the sum of squares due to errors  S.S.E. = T.S.S. − S.S.C.
Step 7 : Calculate the mean sums of squares
         Mean sum of squares due to classes  M.S.S.C. = S.S.C./(k − 1)
         Mean sum of squares due to errors  M.S.S.E. = S.S.E./(N − k)
Step 8 : Compute the calculated value of F, i.e.,  F_Cal = M.S.S.C./M.S.S.E.
Step 9 : Find the critical value of F from the F-table at (k − 1, N − k) degrees of freedom.
Step 10 : Compare the calculated value of F with the critical value of F. If the calculated value of F is less than the critical value of F, then we cannot reject the null hypothesis.

ANOVA Table :

Sources of       Degrees of    Sum of     Mean Sum      F-Ratio
Variation        Freedom       Squares    of Squares    F_Cal = M.S.S.C./M.S.S.E.
Due to classes   k − 1         S.S.C.     M.S.S.C.      F_Tab = F_(k−1, N−k)
Due to errors    N − k         S.S.E.     M.S.S.E.
Total            N − 1         T.S.S.

Compare the calculated value of F with the critical value of F. If the calculated value of F is less than the critical value of F, then we cannot reject the null hypothesis.
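The steps above can be sketched as a short routine. This is only a cross-check of the working procedure; the function name and the small three-class data set are my own, not from the text:

```python
def one_way_anova(classes):
    """Carry Steps 1-8 through for a list of classes (lists of observations).
    Returns (S.S.C., S.S.E., M.S.S.C., M.S.S.E., F_Cal)."""
    k = len(classes)
    N = sum(len(c) for c in classes)
    G = sum(sum(c) for c in classes)                      # Step 1: grand total
    CF = G ** 2 / N                                       # Step 2: correction factor
    RSS = sum(x * x for c in classes for x in c)          # Step 3: raw sum of squares
    TSS = RSS - CF                                        # Step 4: total sum of squares
    SSC = sum(sum(c) ** 2 / len(c) for c in classes) - CF # Step 5: between classes
    SSE = TSS - SSC                                       # Step 6: within classes (error)
    MSSC = SSC / (k - 1)                                  # Step 7: mean squares
    MSSE = SSE / (N - k)
    return SSC, SSE, MSSC, MSSE, MSSC / MSSE              # Step 8: F-ratio

# Hypothetical layout: three classes of three observations each
ssc, sse, mssc, msse, f = one_way_anova([[5, 6, 7], [8, 9, 10], [11, 12, 13]])
print(f)  # 27.0
```

The F value is then compared with the tabulated F at (k − 1, N − k) degrees of freedom as in Steps 9 and 10.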
ANOVA Two-Way Classification : Let us suppose that N observations can be divided into h groups, each group containing k experimental units, i.e., N = h × k. Let x_ij be the yield of the j-th variety which receives the i-th treatment. The observations can be arranged in the following two-way table.

Treatment    Varieties (j = 1, 2, ....., k)     Total T_i.    Mean x̄_i.
1            x_11   x_12   x_13  ...  x_1k       T_1.          x̄_1.
2            x_21   x_22   x_23  ...  x_2k       T_2.          x̄_2.
3            x_31   x_32   x_33  ...  x_3k       T_3.          x̄_3.
:            :                                   :             :
i            x_i1   x_i2   x_i3  ...  x_ik       T_i.          x̄_i.
:            :                                   :             :
h            x_h1   x_h2   x_h3  ...  x_hk       T_h.          x̄_h.
Total        T_.1   T_.2   T_.3  ...  T_.k       G
Mean         x̄_.1   x̄_.2   x̄_.3  ...  x̄_.k

From the above table, the grand total  G = Σ_i Σ_j x_ij = Σ_i T_i. = Σ_j T_.j

Mathematical model of Two-Way ANOVA : The mathematical model of the two-way classification is

x_ij = μ + α_i + β_j + E_ij ,  where i = 1, 2, 3, ....., h and j = 1, 2, 3, ....., k
Here x_ij = yield of the j-th variety which receives the i-th treatment
μ = general mean effect
α_i = effect of the i-th treatment (or) effect of the i-th row
β_j = effect of the j-th variety (or) effect of the j-th column
E_ij = random error.

Null hypothesis : We set up the null hypothesis that the treatments as well as the varieties are homogeneous.

H_OT (or H_OR) : μ_1. = μ_2. = μ_3. = ....... = μ_h.  (or)  α_1 = α_2 = ....... = α_h = 0
H_OC : μ_.1 = μ_.2 = μ_.3 = ....... = μ_.k  (or)  β_1 = β_2 = ....... = β_k = 0

Working procedure of Two-Way ANOVA : The following are the steps to carry out ANOVA two-way classification.

Step 1 : Calculate the grand total  G = Σ_i Σ_j x_ij = Σ_i T_i. = Σ_j T_.j
Step 2 : Calculate the correction factor  C.F. = G²/N
Step 3 : Calculate the raw sum of squares  R.S.S. = Σ_i Σ_j x_ij²
Step 4 : Calculate the total sum of squares  T.S.S. = R.S.S. − C.F.
Step 5 : Calculate the sum of squares due to rows  S.S.R. = Σ_i (T_i.²/k) − C.F.
Step 6 : Calculate the sum of squares due to columns  S.S.C. = Σ_j (T_.j²/h) − C.F.
Step 7 : Calculate the sum of squares due to errors  S.S.E. = T.S.S. − S.S.R. − S.S.C.
Step 8 : Calculate the mean sums of squares
         Mean sum of squares due to rows  M.S.S.R. = S.S.R./(h − 1)
         Mean sum of squares due to columns  M.S.S.C. = S.S.C./(k − 1)
         Mean sum of squares due to errors  M.S.S.E. = S.S.E./((h − 1)(k − 1))
Step 9 : Compute the calculated values of F, i.e.,
         F_R(Cal) = M.S.S.R./M.S.S.E.   and   F_C(Cal) = M.S.S.C./M.S.S.E.
Step 10 : Find the critical values of F from the F-table at (h − 1, (h − 1)(k − 1)) and (k − 1, (h − 1)(k − 1)) degrees of freedom.
Step 11 : Compare the calculated values of F with the critical values of F. If a calculated value of F is less than the corresponding critical value of F, then we cannot reject the null hypothesis.

ANOVA Table :

Sources of       Degrees of         Sum of     Mean Sum      F-Ratio
Variation        Freedom            Squares    of Squares
Due to rows      h − 1              S.S.R.     M.S.S.R.      F_R = M.S.S.R./M.S.S.E. ;  F_Tab = F_(h−1, (h−1)(k−1))
Due to columns   k − 1              S.S.C.     M.S.S.C.      F_C = M.S.S.C./M.S.S.E. ;  F_Tab = F_(k−1, (h−1)(k−1))
Due to errors    (h − 1)(k − 1)     S.S.E.     M.S.S.E.
Total            N − 1              T.S.S.

Compare the calculated values of F with the critical values of F. If a calculated value of F is less than the corresponding critical value of F, then we cannot reject the null hypothesis.
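The two-way working procedure can likewise be sketched in a few lines. As before, the function name and the small 2 × 3 layout are hypothetical, chosen only so the arithmetic can be followed by hand:

```python
def two_way_anova(table):
    """table[i][j] holds the yield of the j-th variety under the i-th
    treatment. Returns (F for rows, F for columns) per the steps above."""
    h, k = len(table), len(table[0])
    N = h * k
    G = sum(sum(row) for row in table)                    # grand total
    CF = G ** 2 / N                                       # correction factor
    RSS = sum(x * x for row in table for x in row)        # raw sum of squares
    TSS = RSS - CF                                        # total sum of squares
    SSR = sum(sum(row) ** 2 for row in table) / k - CF    # due to rows
    col_totals = [sum(row[j] for row in table) for j in range(k)]
    SSC = sum(t * t for t in col_totals) / h - CF         # due to columns
    SSE = TSS - SSR - SSC                                 # due to errors
    MSSR, MSSC = SSR / (h - 1), SSC / (k - 1)
    MSSE = SSE / ((h - 1) * (k - 1))
    return MSSR / MSSE, MSSC / MSSE

# Hypothetical 2 treatments x 3 varieties layout
f_rows, f_cols = two_way_anova([[1, 2, 3], [2, 4, 6]])
print(f_rows, f_cols)  # 12.0 9.0
```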
CORRELATION

Definition : The combined relationship between two or more variables is known as "correlation".

Correlation is broadly classified into three types. They are
➢ Positive Correlation
➢ Negative Correlation
➢ Zero Correlation

Positive Correlation : If two variables x and y move in the same direction, then the correlation between them is known as "positive correlation" and the variables are said to be positively correlated. In positive correlation both variables move in the same direction, i.e., if x increases then y also increases, and if x decreases then y also decreases.
Example : Demand and cost of an item are positively correlated variables.

Negative Correlation : If two variables x and y move in opposite directions, then the correlation between them is known as "negative correlation" and the variables are said to be negatively correlated. In negative correlation the variables move in opposite directions, i.e., if x increases then y decreases, and if x decreases then y increases.
Example : Supply and demand of an item are negatively correlated variables.

Zero Correlation : If two variables x and y do not depend upon each other, i.e., they are independent, then the correlation between them is known as "zero correlation" and the variables are said to be uncorrelated.
Example : Marks in Telugu and marks in Mathematics.

Measures of Correlation : There are several methods to measure the correlation between two variables. Some of them are
➢ Scatter Diagram Method
➢ Karl Pearson's Coefficient of Correlation
➢ Spearman's Rank Correlation Coefficient

Scatter Diagram Method : Scatter diagrams are the easiest way to graphically represent the relationship between two quantitative variables. They are simply x-y plots, with the predictor variable on the x-axis and the response variable on the y-axis. In the scatter diagram method we obtain a measure of the relationship between two variables by plotting the values on a graph, taking the values of one variable (x) on the x-axis and the values of the other variable (y) on the y-axis.

Note :
1. If all the scattered sample points lie exactly on a straight line running from the bottom left corner to the top right corner, then the correlation between the two variables is said to be perfect positive correlation.
2. If all the scattered sample points lie exactly on a straight line running from the top left corner to the bottom right corner, then the correlation between the two variables is said to be perfect negative correlation.
3. If the scattered sample points cluster closely around a straight line running from the bottom left corner to the top right corner, then the correlation between the two variables is said to be moderately positive.
4. If the scattered sample points cluster closely around a straight line running from the top left corner to the bottom right corner, then the correlation between the two variables is said to be moderately negative.

Merits and demerits of the Scatter Diagram Method :

Merits :
1. The scatter diagram method is the simplest method of studying the relationship between two variables.
2. It takes less time to study the correlation between two variables.
3. It is less expensive.

Demerits :
1. We cannot study the correlation among three or more variables using this method.
2. We cannot calculate the degree of correlation between the variables numerically using this method, i.e., we cannot find the correlation between the two variables numerically.

Karl Pearson's Coefficient of Correlation : This is an important method to calculate the correlation coefficient between two variables numerically. This method was introduced by Prof. Karl Pearson. Karl Pearson's coefficient of correlation is denoted by r. It is mathematically defined as

r = Cov(x, y) / √(V(x) V(y)) = Cov(x, y) / (σ_x σ_y)

  = [E(xy) − E(x)E(y)] / √{[E(x²) − (E(x))²] [E(y²) − (E(y))²]}

  = [Σxy/n − (Σx/n)(Σy/n)] / √{[Σx²/n − (Σx/n)²] [Σy²/n − (Σy/n)²]}
  = [n Σxy − (Σx)(Σy)] / √{[n Σx² − (Σx)²] [n Σy² − (Σy)²]}

Properties of Karl Pearson's Coefficient of Correlation :
1. The correlation coefficient between x and y is equal to the correlation coefficient between y and x, i.e., r_xy = r_yx.
2. The correlation coefficient is a pure number and does not carry units.
3. Karl Pearson's coefficient of correlation always lies between −1 and +1.
   • If r = 1, the correlation between x and y is perfect positive.
   • If r = −1, the correlation between x and y is perfect negative.
   • If r = 0, there is zero correlation between x and y.
   • If r > 0, the correlation between x and y is positive.
   • If r < 0, the correlation between x and y is negative.
4. If two variables x and y are independent, then the correlation coefficient between x and y is equal to zero, i.e., r = 0.
   Note : If the correlation coefficient between x and y is zero, they need not be independent.
5. The correlation coefficient is independent of change of origin and scale, i.e., r_xy = r_uv.
Problem 1 : Find the correlation coefficient between x and y using Karl Pearson's coefficient of correlation from the following data.

x : 24  26  32  28  40  36  38  42
y : 26  28  24  32  30  40  38  36

Calculation :

 x     y     x²     y²     xy
24    26    576    676    624
26    28    676    784    728
32    24   1024    576    768
28    32    784   1024    896
40    30   1600    900   1200
36    40   1296   1600   1440
38    38   1444   1444   1444
42    36   1764   1296   1512
266   254   9164   8300   8612

Here n = 8, Σx = 266, Σy = 254, Σx² = 9164, Σy² = 8300 and Σxy = 8612.

∴ Karl Pearson's coefficient of correlation

r = [n Σxy − (Σx)(Σy)] / √{[n Σx² − (Σx)²] [n Σy² − (Σy)²]}
  = [8(8612) − (266)(254)] / √{[8(9164) − (266)²] [8(8300) − (254)²]}
  = (68896 − 67564) / √{(73312 − 70756)(66400 − 64516)}
  = 1332 / √{(2556)(1884)}
  = 1332 / √4815504
  = 1332 / 2194.15
  = 0.6071
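The last formula above is easy to code, and the routine also lets us check property 5 (invariance under change of origin and scale). A sketch using the data of Problem 1 (the function name is my own):

```python
from math import sqrt

def pearson_r(xs, ys):
    """r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2))"""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

x = [24, 26, 32, 28, 40, 36, 38, 42]
y = [26, 28, 24, 32, 30, 40, 38, 36]
r = pearson_r(x, y)
print(round(r, 3))  # 0.607

# Change of origin and scale leaves r unchanged (property 5):
u = [(xi - 30) / 2 for xi in x]
v = [(yi - 30) / 2 for yi in y]
assert abs(pearson_r(u, v) - r) < 1e-12
```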
Problem 2 : Find the correlation coefficient between x and y using Karl Pearson's coefficient of correlation from the following data.

x : 102  106  104  105  118  112  116  115  120
y :  87   76   80   85   86   82   74   78   84

Calculation :

  x     y      x²      y²      xy
102    87   10404    7569    8874
106    76   11236    5776    8056
104    80   10816    6400    8320
105    85   11025    7225    8925
118    86   13924    7396   10148
112    82   12544    6724    9184
116    74   13456    5476    8584
115    78   13225    6084    8970
120    84   14400    7056   10080
998   732  111030   59706   81141

Here n = 9, Σx = 998, Σy = 732, Σx² = 111030, Σy² = 59706 and Σxy = 81141.

∴ Karl Pearson's coefficient of correlation

r = [n Σxy − (Σx)(Σy)] / √{[n Σx² − (Σx)²] [n Σy² − (Σy)²]}
  = [9(81141) − (998)(732)] / √{[9(111030) − (998)²] [9(59706) − (732)²]}
  = (730269 − 730536) / √{(999270 − 996004)(537354 − 535824)}
  = −267 / √{(3266)(1530)}
  = −267 / √4996980
  = −267 / 2235.39
  = −0.1194
Spearman's Rank Correlation Coefficient : Spearman's rank correlation coefficient is a measure of the degree of relationship between quantities that are not directly measurable, such as intelligence, beauty, honesty, etc. If the given data contain the values of two characteristics, then we allot ranks: the first rank to the highest value, the second rank to the second highest value, and so on, with the last rank going to the least value. Prof. C.E. Spearman introduced the following formula to measure the relationship between two such characteristics:

ρ = 1 − [6 Σd_i²] / [n(n² − 1)]

Here d_i = difference between the corresponding ranks, i.e., d_i = R_X − R_Y, and n = number of paired observations. This formula is used when no observation is repeated in the data.

In case of repeated (equal) values, we allot the ranks serially and assign each tied value the average of the ranks it covers. With tied observations, Spearman's rank correlation coefficient is defined as

ρ = 1 − 6 [Σd_i² + m_1(m_1² − 1)/12 + m_2(m_2² − 1)/12 + m_3(m_3² − 1)/12 + .........] / [n(n² − 1)]

i.e.,  ρ = 1 − [6 Σd_i'²] / [n(n² − 1)] ,

where Σd_i'² = Σd_i² + m_1(m_1² − 1)/12 + m_2(m_2² − 1)/12 + .........

Here d_i = R_X − R_Y is the difference between the corresponding ranks, n = number of paired observations, and m_1, m_2, m_3, ......... are the sizes of the tied groups.
Properties of Spearman's Rank Correlation Coefficient :
➢ Spearman's rank correlation coefficient always lies between −1 and +1, i.e., −1 ≤ ρ ≤ +1.
➢ If the x and y series take identical ranks, then the Spearman's rank correlation coefficient between x and y is equal to one (ρ = 1), and we say there is perfect positive correlation.
➢ If the x and y series take exactly opposite ranks, then the Spearman's rank correlation coefficient between x and y is equal to −1 (ρ = −1), and we say there is perfect negative correlation.
➢ If ρ = 0, we say that the two variables are uncorrelated.
➢ The rank correlation coefficient is unaffected if we allot ranks to the existing ranks.

Merits and Demerits of the Rank Correlation Coefficient :

Merits :
• It is easy to understand.
• It is easy to calculate.
• Sometimes it can be used as an approximate or quick estimate of the correlation coefficient.
• The rank correlation coefficient is unaffected if we allot ranks to the existing ranks.
• If Karl Pearson's coefficient of correlation is equal to one (r = 1), then Spearman's rank correlation coefficient is also equal to one (ρ = 1), but the converse need not be true.

Demerits :
• In the case of measurable variables, the rank correlation coefficient ignores the actual values. Hence we may not get the exact relationship between the two variables.
• The rank correlation coefficient does not produce regression lines.
• We cannot estimate the ranks of one variable with the help of the known ranks of the corresponding variable.
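The ranking convention above (rank 1 for the highest value, average ranks for ties) and the tie-corrected formula can be sketched together. The function names are my own, not from the text:

```python
def avg_ranks(values):
    """Rank 1 for the highest value; tied values share the average rank."""
    order = sorted(values, reverse=True)
    # first position of v is order.index(v); it occupies order.count(v) slots
    return [order.index(v) + (order.count(v) + 1) / 2 for v in values]

def tie_correction(values):
    """Sum of m(m^2 - 1)/12 over each group of m tied values."""
    return sum(values.count(v) * (values.count(v) ** 2 - 1) / 12
               for v in set(values))

def spearman_rho(xs, ys):
    """rho = 1 - 6 * (sum d^2 + tie corrections) / (n (n^2 - 1))."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(avg_ranks(xs), avg_ranks(ys)))
    d2 += tie_correction(xs) + tie_correction(ys)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Data with ties (Problem 2 below): 24 occurs thrice in x, 28 twice, etc.
x = [23, 24, 26, 24, 28, 25, 29, 24, 28, 30]
y = [32, 30, 36, 34, 32, 33, 36, 38, 40, 38]
print(round(spearman_rho(x, y), 4))  # 0.4818
```

With no ties the corrections vanish and the formula reduces to the simple form ρ = 1 − 6Σd²/(n(n² − 1)).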
Problem 1 : Calculate Spearman's rank correlation coefficient between x and y using the following data.

x : 25  28  34  30  36  38  40  42
y : 34  30  36  32  38  35  33  37

Solution :

 x     y    R_x   R_y   d_i = R_x − R_y   d_i²
25    34     8     5          3             9
28    30     7     8         −1             1
34    36     5     3          2             4
30    32     6     7         −1             1
36    38     4     1          3             9
38    35     3     4         −1             1
40    33     2     6         −4            16
42    37     1     2         −1             1
                                           42

Here n = 8.

ρ = 1 − [6 Σd_i²] / [n(n² − 1)]
  = 1 − [6(42)] / [8(8² − 1)]
  = 1 − 252/504
  = 1 − 1/2
ρ = 0.5

Problem 2 : Calculate Spearman's rank correlation coefficient between x and y using the following data.

x : 23  24  26  24  28  25  29  24  28  30
y : 32  30  36  34  32  33  36  38  40  38

Solution :

 x     y    R_x    R_y    d_i = R_x − R_y    d_i²
23    32    10     8.5         1.5            2.25
24    30     8    10          −2              4
26    36     5     4.5         0.5            0.25
24    34     8     6           2              4
28    32     3.5   8.5        −5             25
25    33     6     7          −1              1
29    36     2     4.5        −2.5            6.25
24    38     8     2.5         5.5           30.25
28    40     3.5   1           2.5            6.25
30    38     1     2.5        −1.5            2.25
                                             81.50

Here n = 10. The tied groups are: in x, the value 24 occurs three times (m = 3) and 28 occurs twice (m = 2); in y, the values 32, 36 and 38 each occur twice (m = 2).

Σd_i'² = Σd_i² + m_1(m_1² − 1)/12 + m_2(m_2² − 1)/12 + .........
       = 81.5 + 3(3² − 1)/12 + 2(2² − 1)/12 + 2(2² − 1)/12 + 2(2² − 1)/12 + 2(2² − 1)/12
       = 81.5 + 24/12 + 6/12 + 6/12 + 6/12 + 6/12
       = 81.5 + 2 + 0.5 + 0.5 + 0.5 + 0.5
       = 85.5

∴ Spearman's rank correlation coefficient

ρ = 1 − [6 Σd_i'²] / [n(n² − 1)]
  = 1 − [6(85.5)] / [10(10² − 1)]
  = 1 − 513/990
  = 1 − 0.5182
ρ = 0.4818

Problem 3 : Calculate Spearman's rank correlation coefficient between x and y for ten paired observations in which both series contain tied values.

Solution : Ranking both series (allotting average ranks to the tied values) gives the following differences between the corresponding ranks:

d_i  :  −3    4.5    5    −5.5   −1.5    6    −5    8    −6.5   −2
d_i² :   9   20.25  25   30.25   2.25   36   25   64   42.25    4

Σd_i² = 258

Here n = 10. The x-ranks contain tied groups of sizes m = 4 and m = 2, and the y-ranks contain tied groups of sizes m = 3, m = 2 and m = 2.

Σd_i'² = Σd_i² + 4(4² − 1)/12 + 3(3² − 1)/12 + 2(2² − 1)/12 + 2(2² − 1)/12 + 2(2² − 1)/12
       = 258 + 60/12 + 24/12 + 6/12 + 6/12 + 6/12
       = 258 + 5 + 2 + 0.5 + 0.5 + 0.5
       = 266.5

∴ Spearman's rank correlation coefficient

ρ = 1 − [6 Σd_i'²] / [n(n² − 1)]
  = 1 − [6(266.5)] / [10(10² − 1)]
  = 1 − 1599/990
  = 1 − 1.6152
ρ = −0.6152