Sampling from a Data
Dr. Hrudaya Kumar Tripathy
What is Sampling?
• The sampling method involves taking a representative selection of the
population and using the data collected from it as research information.
• A sample is a “subgroup of a population”.
• Sampling is a way of obtaining a group of people or objects to study that is
representative of a larger population or universe of interest (Stacks &
Hocking, 1999).
Concept of Sampling
[Diagram labels: Population, Element, Sample, Subject]
Types of Sampling
• Probability Sampling:
A sampling process in which every individual element in the population
has an opportunity to be chosen for the sample.
• Nonprobability Sampling:
A sampling process in which some individual elements in the population
may not have an opportunity to be chosen for the sample.
Convenience sample: The researcher chooses a sample that is readily available
in some non-random way.
Example: A researcher polls people as they walk by on the street.
Why it's probably biased: The location and time of day and other factors may
produce a biased sample of people.
Voluntary response sample: The researcher puts out a request for members of
a population to join the sample, and people decide whether or not to be in the
sample.
Example: A TV show host asks his viewers to visit his website and respond to an
online poll.
Why it's probably biased: People who take the time to respond tend to have similarly
strong opinions compared to the rest of the population.
Bad ways to sample
Probability Sampling
• Simple random sampling
• Stratified sampling
• Systematic sampling
• Cluster sampling
• Multi-stage sampling
Good ways to sample
Simple Random Sampling
• Every element has an equal chance of being selected as part of the sample.
• It is used when we don’t have any prior information about the target
population.
• The sample is selected at random, without any other procedure or criteria.
For example: Random selection of 20 students from a class of 50
students. Each student has an equal chance of being selected. Here the
probability that a given student is selected is 20/50 = 0.4.
Why it's good: Random samples are usually fairly representative since they don't favor
certain members.
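The class example above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the student names, the `simple_random_sample` helper, and the seed are all hypothetical, and the draw relies on the standard library's `random.Random.sample`, which selects without replacement.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct elements; every element is equally likely to be included."""
    rng = random.Random(seed)          # seed is optional, used here for repeatability
    return rng.sample(population, n)   # sampling without replacement

# Hypothetical class of 50 students (names are made up for illustration).
students = [f"student_{i:02d}" for i in range(1, 51)]
sample = simple_random_sample(students, 20, seed=42)

print(len(sample))         # number of students drawn
print(sorted(sample)[:5])  # a peek at a few of the selected students
```

No student can appear twice in the sample, and no ordering or attribute of the list influences who is chosen.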
Stratified Sampling
• This technique divides the elements of the population into small subgroups
(strata) based on similarity, so that the elements within a subgroup are
homogeneous and the subgroups are heterogeneous with respect to one another.
• Elements are then randomly selected from each of these subgroups.
• We need to have prior information about the population to create
subgroups.
Example: A student council surveys 100
students by getting random samples of 25
freshmen, 25 sophomores, 25 juniors, and 25
seniors.
Why it's good: A stratified sample guarantees that members from each group will be
represented in the sample, so this sampling method is good when we want some members
from every group.
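The student council example can be sketched as follows. This is an illustrative sketch under stated assumptions: the roster of 400 students, the `stratified_sample` helper, and the equal allocation of 25 per class year are hypothetical choices made to mirror the example, not the only way to allocate a stratified sample.

```python
import random
from collections import defaultdict

def stratified_sample(population, stratum_of, per_stratum, seed=None):
    """Group elements into strata, then draw a simple random sample from each."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for element in population:
        strata[stratum_of(element)].append(element)
    sample = []
    for name in sorted(strata):  # sorted only to make runs repeatable
        sample.extend(rng.sample(strata[name], per_stratum))
    return sample

# Hypothetical roster: 400 students, 100 in each class year.
years = ["freshman", "sophomore", "junior", "senior"]
students = [(f"s{i}", years[i % 4]) for i in range(400)]

sample = stratified_sample(students, lambda s: s[1], per_stratum=25, seed=0)
print(len(sample))  # 25 from each of the 4 strata
```

Because the draw happens inside each stratum, every class year is guaranteed to be represented, which a simple random sample of 100 would not guarantee.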
Cluster Sampling
• A process that chooses the sample by sections (clusters).
• The entire population is divided into clusters or sections, and then
clusters are randomly selected.
• All the elements of the selected clusters are used for sampling.
• Clusters are identified using details such as age, sex, location, etc.
Cluster sampling can be done in the following ways:
• Single Stage Cluster Sampling
• Two Stage Cluster Sampling
Single Stage Cluster Sampling
Entire clusters are selected randomly, and every element of each selected
cluster is included in the sample.
Two Stage Cluster Sampling
Here we first randomly select clusters,
and then from those selected clusters we
randomly select elements for the sample.
Cluster Sampling (cont.)
Example: An airline company wants to survey its customers one day, so they
randomly select 55 flights that day and survey every passenger on those
flights.
Why it's good: A cluster sample gets every member from some of the
groups, so it's good when each group reflects the population as a whole.
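Both variants can be sketched in Python using the airline example. The data here is entirely hypothetical (200 flights with 10 passengers each), as are the helper names and the choice of 3 passengers per flight in the two-stage variant; only the figure of 55 flights comes from the example.

```python
import random

def single_stage_cluster_sample(clusters, n_clusters, seed=None):
    """Randomly pick whole clusters and keep every member of each one."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for c in chosen for member in clusters[c]]

def two_stage_cluster_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Randomly pick clusters, then randomly pick members within each."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for c in chosen:
        members = clusters[c]
        sample.extend(rng.sample(members, min(n_per_cluster, len(members))))
    return sample

# Hypothetical day of operations: 200 flights, 10 passengers on each.
flights = {f"flight_{f}": [f"f{f}_p{p}" for p in range(10)] for f in range(200)}

single = single_stage_cluster_sample(flights, 55, seed=1)   # every passenger on 55 flights
two = two_stage_cluster_sample(flights, 55, 3, seed=1)      # 3 passengers on each of 55 flights
print(len(single), len(two))
```

The single-stage version surveys fewer groups completely; the two-stage version trades completeness within a cluster for a smaller survey.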
Systematic Sampling
• Here the selection of elements is systematic rather than random, except for
the first element.
• Elements of the sample are chosen at regular intervals from the population.
• All the elements are first put together in a sequence in which each element
has an equal chance of being selected.
• Example: A principal takes an alphabetized list of student names and picks a
random starting point. Every 20th student is selected to take a survey.
Systematic Sampling (cont.)
For a sample of size n, we divide our population of size N into subgroups of k
elements, where k = N/n.
We select our first element randomly from the first subgroup of k elements.
To select the other elements of the sample, proceed as follows:
If our first element is n1, then the second element is n2 = n1 + k,
the third element is n3 = n2 + k, and so on.
Taking an example of N = 20, n = 5:
The number of elements in each subgroup is k = N/n = 20/5 = 4.
Now, randomly select the first element from the first subgroup.
If we select n1 = 3, then n2 = n1 + k = 3 + 4 = 7, n3 = n2 + k = 7 + 4 = 11,
n4 = 15, and n5 = 19.
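The procedure for systematic sampling can be sketched in Python as below. The `systematic_sample` helper and its `start` parameter are hypothetical names introduced for illustration; the sketch assumes n is no larger than N and uses integer division for k, so any leftover elements at the end of the list are never selected.

```python
import random

def systematic_sample(population, n, start=None, seed=None):
    """Pick every k-th element, beginning at a position in the first group.

    `start` is a 1-based position within the first k elements; if omitted,
    it is chosen at random, which is the only random step in the method.
    """
    k = len(population) // n  # interval between selected elements
    if start is None:
        start = random.Random(seed).randint(1, k)
    return [population[start - 1 + i * k] for i in range(n)]

population = list(range(1, 21))                      # N = 20 elements, numbered 1..20
fixed = systematic_sample(population, 5, start=3)    # reproduce the worked example
print(fixed)                                         # [3, 7, 11, 15, 19]

randomized = systematic_sample(population, 5, seed=7)
print(randomized)                                    # random start, same interval k = 4
```

Passing `start=3` reproduces the worked example; leaving it out lets the first element be drawn at random from the first subgroup.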
Multi-Stage Sampling
• It combines one or more of the methods
described earlier.
• The population is divided into multiple clusters, and
these clusters are further divided and
grouped into various subgroups based on
similarity.
• One or more clusters can be randomly selected
from each subgroup.
• This process continues until the clusters can’t be
divided any further.
• For example, a country can be divided into states,
cities, and urban and rural areas, and all the areas with
similar characteristics can be merged together to
form subgroups.
Area Sampling
A process that depends on geographical/prospective positions.
QUIZ
1. A restaurant leaves comment cards on all of its tables and encourages
customers to participate in a brief survey to learn about their overall
experience. What type of sampling is this?
A: Convenience sampling B: Voluntary response sampling
Answer: B: Voluntary response sampling
2. A quality control worker at a factory selects the first 10 items she sees
as her sample for the day. What type of sampling is this?
A: Convenience sampling B: Voluntary response sampling
Answer: A: Convenience sampling
3. Each student at a school has a student identification number.
Counselors have a computer generate 50 random identification numbers
and those students are asked to take a survey.
A: Simple random sampling B: Stratified random sampling
C: Cluster random sampling D: Systematic random sampling
Answer: A: Simple random sampling
4. A principal orders t-shirts and wants to check some of them to make
sure they were printed properly. She randomly selects 2 of the 10 boxes
of shirts and checks every shirt in those 2 boxes.
A: Simple random sampling B: Stratified random sampling
C: Cluster random sampling D: Systematic random sampling
Answer: C: Cluster random sampling
5. A school chooses 3 randomly selected athletes from each of its sports
teams to participate in a survey about athletics at the school.
A: Simple random sampling B: Stratified random sampling
C: Cluster random sampling D: Systematic random sampling
Answer: B: Stratified random sampling
6. While students are lined up for school pictures, a teacher passes out a
survey to every 10th student.
A: Simple random sampling B: Stratified random sampling
C: Cluster random sampling D: Systematic random sampling
Answer: D: Systematic random sampling
Nonprobability Sampling
• Convenience Sampling
• Purposive/Judgmental Sampling
• Quota Sampling
• Referral/Snowball Sampling: The process of building a sample stage by
stage, with each stage recruited through recommendations from the previous one.
Convenience Sampling
• Here the samples are selected based on availability.
• This method is used when sample elements are scarce or costly to obtain.
• So samples are selected based on convenience.
• A process of choosing a sample according to suitability.
For example: Researchers prefer this during the initial stages of survey
research, as it’s quick and easy to deliver results.
Purposive Sampling
• This is based on the intention or purpose of the study.
• Only those elements that best suit the purpose of our study are selected
from the population.
• A sample is chosen because it represents a certain purpose.
For example: If we want to understand the thought process of people who
are interested in pursuing a master’s degree, then the selection criterion would be
“Are you interested in a Master’s in ..?”
All the people who respond with a “No” will be excluded from our sample.
Quota Sampling
• This type of sampling depends on some pre-set standard.
• It selects a representative sample from the population.
• The proportion of each characteristic/trait in the sample should be the same as in the population.
• Elements are selected until the exact proportions of certain types of data are
obtained, or sufficient data in the different categories is collected.
For example: If our population has 45% females and 55% males then our
sample should reflect the same percentage of males and females.
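The "select until the proportions are met" idea can be sketched as follows. Everything concrete here is an assumption for illustration: the sample size of 100 (giving quotas of 45 and 55), the made-up list of candidates, and the `quota_sample` helper. Note that candidates are taken in the order they happen to arrive, not at random, which is what makes this a nonprobability method.

```python
def quota_sample(candidates, category_of, quotas):
    """Accept candidates in arrival order until every category's quota is full."""
    counts = {category: 0 for category in quotas}
    sample = []
    for person in candidates:
        category = category_of(person)
        if category in counts and counts[category] < quotas[category]:
            sample.append(person)
            counts[category] += 1
        if counts == quotas:  # all quotas filled, stop collecting
            break
    return sample

# Hypothetical stream of 300 available people, in the order they were reached.
candidates = [(f"p{i}", "female" if i % 3 == 0 else "male") for i in range(300)]

# Quotas for a sample of 100 that mirrors a 45% / 55% population split.
quotas = {"female": 45, "male": 55}
sample = quota_sample(candidates, lambda p: p[1], quotas)
print(len(sample))
```

Once a category is full, further candidates from that category are skipped even if they are available.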
Referral/Snowball Sampling
• This technique is used in situations
where the population is unknown
or rare.
• Therefore we take help from the
first element that we select from the
population and ask them to recommend
other elements who fit the
description of the sample needed.
• The referral process goes on,
increasing the size of the sample like a
snowball.
Sampling from a Data Stream
Data Sampling?
Data sampling is a statistical analysis technique used to select,
manipulate and analyze a representative subset of data points
in order to identify patterns and trends in the larger data set
being examined.
Stream Queries
• There are two ways that queries get asked about streams.
• Ad-hoc Queries: Normal queries asked one time about streams.
• Example: What is the maximum value seen so far in stream S?
• Standing Queries: These queries are, in a sense, permanently
executing, and produce outputs at appropriate times. They are,
in principle, asked about the stream at all times.
• Example: Report each maximum value ever seen in stream S.
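The standing-query example can be sketched as a Python generator. This is one interpretation, stated as an assumption: the query emits an output each time an arriving element exceeds every element seen before it. The `standing_max` name and the sample stream are hypothetical.

```python
def standing_max(stream):
    """Standing query: emit a value whenever a new maximum arrives on the stream."""
    current = None  # running maximum; also what an ad-hoc "max so far" query needs
    for x in stream:
        if current is None or x > current:
            current = x
            yield x

stream = [3, 1, 4, 1, 5, 9, 2, 6]  # hypothetical stream S
reports = list(standing_max(stream))
print(reports)  # [3, 4, 5, 9]
```

The query keeps only one value of state, the running maximum, so it can run permanently without storing the stream. An ad-hoc query for the maximum seen so far can be answered from the same stored value, but only if it was being maintained before the question was asked.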
Problems on Data Streams
• Types of queries one wants to answer on a stream:
– Filtering a data stream
• Select elements with property x from the stream
– Counting distinct elements
• Number of distinct elements in the last k elements of the stream
– Estimating moments
• Estimate avg./std. dev. of last k elements
– Finding frequent elements
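Several of these queries refer to the last k elements of the stream. A minimal sketch of that idea keeps an exact window of the most recent k elements and answers queries from it. The `SlidingWindow` class, the sample stream, and k = 4 are hypothetical; storing the whole window is the simplest approach and assumes k elements fit in memory, which will not hold for every stream.

```python
from collections import Counter, deque
from statistics import mean, pstdev

class SlidingWindow:
    """Keeps the last k stream elements and answers queries over them exactly."""

    def __init__(self, k):
        self.window = deque(maxlen=k)  # oldest element drops out automatically

    def add(self, x):
        self.window.append(x)

    def filtered(self, has_property):
        """Filtering: elements in the window that satisfy a property."""
        return [x for x in self.window if has_property(x)]

    def distinct_count(self):
        """Counting distinct elements among the last k."""
        return len(set(self.window))

    def average(self):
        return mean(self.window)

    def std_dev(self):
        return pstdev(self.window)  # population standard deviation of the window

    def most_frequent(self):
        """Finding frequent elements: the most common value and its count."""
        return Counter(self.window).most_common(1)[0]

w = SlidingWindow(k=4)
for value in [5, 3, 5, 8, 3, 9, 9, 2]:  # hypothetical stream
    w.add(value)

print(list(w.window))       # the last 4 elements: [3, 9, 9, 2]
print(w.distinct_count())   # 3
print(w.average())          # 5.75
print(w.most_frequent())    # (9, 2)
```

Each query above costs time proportional to k, which is acceptable for a sketch but may be too slow or too memory-hungry when k is very large.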
Applications – (1)
• Mining query streams
• Google wants to know what queries are more frequent today than
yesterday
• Mining click streams
• Yahoo wants to know which of its pages are getting an unusual
number of hits in the past hour
• Mining social network news feeds
• e.g., look for trending topics on Twitter, Facebook
Applications – (2)
• Sensor Networks
• Many sensors feeding into a central controller
• Telephone call records
• Data feeds into customer bills as well as settlements between
telephone companies
• IP packets monitored at a switch
• Gather information for optimal routing
• Detect denial-of-service attacks
