Sampling & Its Types
A complete visual framework for teaching sampling theory, methods, selection logic, bias, and application in health research, designed for Community Medicine faculty.
Before teaching any sampling method, students must master the foundational vocabulary. These are the building blocks upon which all sampling theory rests.
Example: All adults with hypertension in India
Often too large to study directly, so a sample is needed
Example: Adults with hypertension in Cuttack district
Must be clearly defined with inclusion/exclusion criteria
Example: Voter list, hospital register, school roll
If the frame is flawed, the sample is flawed
Example: Individual (simple), Household (cluster), School (two-stage)
Must match your research question
Statistic: Value calculated from the sample (known; our estimate)
Example: True prevalence of TB (parameter) vs. prevalence in our sample (statistic)
Sampling Bias: Systematic error; does NOT decrease with larger sample; caused by flawed method
A large biased sample is WORSE than a small unbiased one
| Term | Definition | Symbol | Example in Health Research | Common Confusion |
|---|---|---|---|---|
| Population (N) | All individuals about whom inference is to be made | N | All TB patients in India | Confused with study population (accessible group) |
| Sample (n) | Subset of population actually studied | n | 200 TB patients in Odisha | Students often confuse sample size with sample method |
| Sampling Fraction | n/N: proportion of population included | f = n/N | 200/20,000 = 1% of TB patients studied | Larger fraction ≠ better sample if method is biased |
| Representativeness | Degree to which sample reflects population characteristics | n/a | Sample has same age/sex distribution as source | Large sample ≠ representative sample |
| Precision | Narrowness of confidence interval; repeatability | 1/SE | Prevalence: 12% ± 2% vs 12% ± 8% | Precision ≠ accuracy (can be precise but biased) |
| Confidence Interval | Range likely to contain the true parameter | CI | 95% CI: 10%–14% for TB prevalence | NOT "95% chance the true value is in this range" |
| Design Effect (DEFF) | Ratio of variance with complex design vs SRS | DEFF | DEFF=2 means cluster sampling needs 2× sample size | Students forget to account for DEFF in cluster studies |
| Intraclass Correlation | Similarity of units within a cluster | ICC/ρ | Children in same school more similar in vaccination status | Higher ICC → bigger DEFF → need more clusters |
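
The Precision row contrasts 12% ± 2% with 12% ± 8%. As a rough illustration of where such margins come from, the sketch below computes a simple Wald interval for a proportion. The sample sizes (1000 and 62) are assumptions chosen only so that the margins land near the table's figures; they are not taken from any study.

```python
import math

def wald_ci(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float, float]:
    """Return (lower, upper, margin) of a Wald confidence interval for a proportion."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of the sample proportion
    margin = z * se                          # half-width of the interval
    return p_hat - margin, p_hat + margin, margin

# Same 12% estimate, two hypothetical sample sizes
for n in (1000, 62):
    lower, upper, margin = wald_ci(0.12, n)
    print(f"n={n}: 12% ± {margin:.1%} (95% CI {lower:.1%} to {upper:.1%})")
```

The estimate is identical in both cases; only the width of the interval changes, which is the point of separating precision from accuracy.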
The entire universe of sampling methods, organised hierarchically.
Every unit has a known, non-zero chance of selection
Selection based on judgement, availability, or convenience
Specialised designs for specific research contexts
Every unit has a known, non-zero probability of being selected. This is the gold standard for quantitative health research as it allows generalisability.
| Method | Mechanism | Sampling Frame Needed? | Best Used When | Advantages | Disadvantages | Real-World Example |
|---|---|---|---|---|---|---|
| Simple Random (SRS) | Lottery / random number table / computer-generated random numbers; each unit has equal probability = n/N | Yes (complete) | Small, homogeneous, well-listed populations | Unbiased; easy to understand; forms basis of statistical theory | Requires complete sampling frame; not practical for large dispersed populations; may miss minorities | Selecting 200 patients from hospital register of 2000 using random numbers |
| Systematic (every kth unit) | k = N/n; random start between 1 and k; select every kth unit thereafter | Yes (ordered list) | Large populations with sequential lists (registers, records) | Easy to execute; spread across list; no need to number all units if list exists | Periodicity bias if list has periodic pattern (e.g., every 7th patient is a Monday case) | Antenatal clinic: k=10; randomly start at 4, then select 4, 14, 24, 34... |
| Stratified (SRS within strata) | Divide population into homogeneous subgroups (strata); SRS within each stratum; proportional or disproportionate allocation | Yes (per stratum) | Heterogeneous populations; need subgroup estimates; want to ensure minority representation | Ensures representation; reduces variance vs SRS; permits subgroup analysis | Complex; must know stratum sizes; disproportionate allocation requires weighting | NFHS surveys: stratify by urban/rural, then state, then household type |
| Cluster (whole clusters selected) | Divide population into clusters; randomly select clusters; study ALL units in selected clusters | No (only cluster list) | Geographically dispersed; no individual sampling frame; field surveys | Practical; cost-efficient; no complete frame needed; feasible in field | Less precise than SRS (clustering effect); DEFF >1; need more subjects | District nutritional survey: randomly select 30 villages; study all children in selected villages |
| Multi-stage (nested random) | Two or more stages of random selection; each stage uses a sampling frame for that level | Yes (at each stage) | Large national studies; hierarchical populations (districts → blocks → villages → households) | Practical for national surveys; economical; flexible design | Complex; errors compound across stages; needs lists at each level | NFHS: State → District → PSU (village/ward) → Household → Individual |
| PPS Sampling (size-weighted) | Probability of selecting a cluster proportional to its size; when a fixed number of units is then sampled per cluster, each individual has an approximately equal overall probability of selection | Yes (with size data) | Clusters of unequal size; want equal individual probability | Equal probability for all individuals; self-weighting, so no weighting needed in analysis | Need size information for all clusters; complex to implement | EPI cluster sampling: villages selected proportional to their population size |
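
To make the first two mechanisms in the table concrete, here is a minimal sketch of simple random and systematic selection from a numbered register. The function names and the register of 2000 serial numbers are illustrative assumptions, not part of any standard package.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """SRS: every unit in the frame has the same chance (n/N) of selection."""
    rng = random.Random(seed)
    return rng.sample(frame, n)  # sampling without replacement

def systematic_sample(frame, n, seed=None):
    """Systematic: random start between 1 and k, then every kth unit."""
    rng = random.Random(seed)
    k = len(frame) // n                 # sampling interval k = N/n
    start = rng.randint(1, k)           # 1-based random start, inclusive of k
    positions = range(start, len(frame) + 1, k)
    return [frame[pos - 1] for pos in positions][:n]

# Hypothetical hospital register with serial numbers 1..2000
register = list(range(1, 2001))
print(simple_random_sample(register, 5, seed=1))
print(systematic_sample(register, 200, seed=1)[:4])  # first four selected serial numbers
```

With N = 2000 and n = 200 the interval is k = 10, so a random start of 4 would give 4, 14, 24, 34 and so on, matching the antenatal clinic example.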
Proportional Allocation: nᵢ = n × (Nᵢ / N) where Nᵢ = stratum size
Design Effect (DEFF) = 1 + (m − 1) × ICC · m = cluster size, ICC = intraclass correlation
Effective sample size = n ÷ DEFF · Required n (cluster) = SRS sample × DEFF
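
The three formulas above translate directly into code. The sketch below is illustrative only: the stratum sizes, cluster size (m = 21) and ICC (0.05) are made-up inputs, and the function names are hypothetical.

```python
import math

def proportional_allocation(n, stratum_sizes):
    """Allocate a total sample n across strata in proportion to stratum size N_i."""
    N = sum(stratum_sizes.values())
    # Rounding each stratum separately can leave the total one or two off n
    return {name: round(n * N_i / N) for name, N_i in stratum_sizes.items()}

def design_effect(m, icc):
    """DEFF = 1 + (m - 1) * ICC, with m the average cluster size."""
    return 1 + (m - 1) * icc

def effective_sample_size(n, deff):
    """Number of SRS-equivalent subjects that a clustered sample of n represents."""
    return n / deff

def required_cluster_n(n_srs, deff):
    """Inflate an SRS sample size by the design effect."""
    return math.ceil(n_srs * deff)

# Hypothetical inputs
print(proportional_allocation(200, {"urban": 6000, "rural": 14000}))
deff = design_effect(m=21, icc=0.05)
print(deff, effective_sample_size(400, deff))
```

With these inputs the allocation is 60 urban and 140 rural, the design effect is about 2, and 400 clustered subjects carry roughly the information of 200 subjects drawn by SRS.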
Selection is not based on random chance. Not all units have a known probability of selection. Used in qualitative research, exploratory studies, and when probability sampling is impossible.
| Method | Mechanism | Bias Risk | Best Used For | Strengths | Limitations | Health Research Example |
|---|---|---|---|---|---|---|
| Convenience (accidental) | Select whoever is readily available; patients in OPD, students in class, mall visitors | HIGH | Pilot studies; feasibility testing; quick surveys | Cheapest; fastest; easy to execute; good for hypothesis generation | Highly biased; not representative; cannot generalise; selection entirely determined by convenience | Exit interviews with OPD patients to pilot a questionnaire on patient satisfaction |
| Purposive (judgement) | Researcher deliberately selects "information-rich" cases based on specific characteristics | MODERATE | Qualitative research; case studies; key informant interviews | Focused on relevant cases; efficient for specific objectives; expert knowledge used | Researcher bias in selection; not generalisable; dependent on researcher's judgement | Selecting ASHA workers with ≥5 years experience for in-depth interviews on community health |
| Quota (non-random strata) | Set fixed quotas for subgroups (age, sex, etc.); fill quotas by convenience within each subgroup | MODERATE | Market research; large surveys where frame unavailable; needs subgroup balance | Ensures proportional representation of subgroups; faster than stratified random; no frame needed | Selection within quota is non-random; interviewer bias; cannot calculate sampling error | Community survey: quota of 50 males and 50 females in each age group, recruited at convenience |
| Snowball (chain referral) | Initial seeds recruited; each participant refers others; chain grows like a snowball | MODERATE | Hidden or hard-to-reach populations; stigmatised groups | Only practical method for some populations; builds trust networks; reaches hidden groups | Selection bias towards well-connected individuals; network clustering; non-representative | Studying risk behaviour in IV drug users; FSW health surveys; undocumented migrants |
| Volunteer (self-selection) | Individuals volunteer in response to advertisement or invitation | VERY HIGH | Clinical trials (with random allocation after recruitment); experimental studies | Motivated participants; good compliance; ethical (consent built in) | Healthy volunteer effect; volunteers are atypical; extreme self-selection bias | Vaccine trial: volunteers respond to ad; then randomised to vaccine vs placebo |
| Consecutive (sequential) | Every eligible patient presenting over a defined time period is recruited; no random selection | LOW–MOD | Hospital-based clinical studies; OPD-based research | Simple; minimises selection bias within available patients; complete capture of eligible cases | Limited to patients attending that facility; selection bias due to healthcare-seeking behaviour | All newly diagnosed diabetic patients in medicine OPD over 6 months included in study |
Stratified: Same subgroups; YES random selection within strata; unbiased within stratum; can calculate sampling error
They look similar in design but differ fundamentally in how units within groups are selected.
RDS: Advanced snowball with mathematical corrections for network effects; gives population estimates; used in HIV research with FSW, MSM
RDS can produce valid prevalence estimates where Snowball cannot.
The master comparison table: use this for teaching contrasts, exam preparation, and decision-making in research design.
| Criterion | SRS | Systematic | Stratified | Cluster | Multi-stage | Convenience | Purposive | Snowball |
|---|---|---|---|---|---|---|---|---|
| Type | Probability | Probability | Probability | Probability | Probability | Non-Prob | Non-Prob | Non-Prob |
| Sampling Frame Required | Yes (complete) | Yes (ordered) | Yes (per strata) | No (cluster list only) | Partial (at each level) | No | No | No |
| Representativeness | High | High (if no periodicity) | Very High | Moderate | Moderate–High | Low | Low | Low |
| Statistical Inference | Yes | Yes | Yes | Yes (with DEFF) | Yes (with weights) | No | No | No (RDS: limited) |
| Cost / Complexity | Low–Moderate | Low | Moderate | Low–Moderate | High | Very Low | Low | Low |
| Variance / Precision | Benchmark | Equal or better than SRS | Better than SRS | Worse than SRS (DEFF >1) | Variable | Cannot estimate | Cannot estimate | Cannot estimate |
| Bias Risk | Very Low | Low (periodicity risk) | Very Low | Moderate (homogeneity) | Low–Moderate | High | Moderate | Moderate–High |
| Best Research Type | Prevalence studies; RCTs | Hospital-based; sequential lists | Comparative studies; surveys | Field surveys; national studies | NFHS; DHS; large national surveys | Pilot; qualitative | Qualitative; key informants | Hidden populations |
| Indian Health Example | PHC patient study using OPD register | Every 5th antenatal visit to clinic | Urban/rural stratified TB survey | Village-based NCD survey | NFHS-5; DLHS; AHS | Questionnaire pilot in OPD | ASHA worker interviews | IDU risk behaviour survey |
Yes + Small population: Use SRS or Systematic Random
Yes + Heterogeneous population: Use Stratified Random
No complete frame + Field survey: Use Cluster or Multi-stage
National scale survey: Multi-stage with PPS (like NFHS)
In-depth qualitative: Purposive sampling
Need subgroup balance (no frame): Quota sampling
Hidden population (IDU, FSW): Snowball or RDS
Hospital-based clinical study: Consecutive sampling
One of the most commonly asked questions in research: "How many subjects do I need?" Sample size is determined by statistical requirements, not budget or convenience.
Analytical (Comparison): n = (Zα/2 + Zβ)² × 2pq / d² (each group), where d is the difference to be detected; or use Kelsey formula for OR/RR
Cluster adjustment: n_cluster = n_SRS × DEFF · DEFF = 1 + (m − 1) × ICC
Finite population correction: n_final = n / [1 + (n − 1)/N] (when sampling fraction >5%)
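
As a small worked illustration of the finite population correction, the sketch below applies it to an initial sample size of 246. The population sizes (2000 and 1,000,000) are assumptions picked to show a case where the sampling fraction exceeds 5% and one where it does not.

```python
import math

def fpc_adjust(n, N):
    """Finite population correction: n_final = n / [1 + (n - 1)/N], rounded up."""
    return math.ceil(n / (1 + (n - 1) / N))

# Hypothetical populations: a small one (fraction 246/2000 = 12.3%) and a very large one
print(fpc_adjust(246, 2000))       # 220
print(fpc_adjust(246, 1_000_000))  # 246
```

The correction matters only when the sample is a sizeable share of the population; for large N the adjusted figure is essentially unchanged.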
| Study Type | Formula | Key Inputs | Example Calculation | Software Tool |
|---|---|---|---|---|
| Cross-sectional (prevalence estimation) | n = Z²pq/d² | p = expected prevalence; d = allowable error; confidence level | p=0.20, d=0.05, 95% confidence: n = (1.96)² × 0.20 × 0.80/(0.05)² ≈ 246 | OpenEpi; EpiInfo; G*Power |
| Case-Control (odds ratio) | Kelsey / Schlesselman formula; based on OR and p₀ | Expected OR; exposure prevalence in controls; α; power (1−β) | OR=2.0, p₀=0.30, α=0.05, power=80%: n≈133 per group | OpenEpi; EpiInfo; PASS |
| Cohort / RCT (risk difference or RR) | n = (Zα/2 + Zβ)²(p₁q₁ + p₂q₂)/(p₁ − p₂)² | Incidence in exposed/unexposed; α; power; dropout rate | p₁=0.15, p₂=0.30, α=0.05, power=80%: n≈130/group (add 10–20% attrition) | OpenEpi; G*Power; Stata |
| Cluster Sampling (design effect adjustment) | n_cluster = n_SRS × DEFF | n_SRS; DEFF (assume 1.5–2.0 if ICC unknown); cluster size m | n_SRS=246, DEFF=1.5: n_cluster = 246 × 1.5 = 369; if 30 clusters: 369/30 ≈ 13/cluster (rounded up) | EPI cluster; WHO LQAS tables |
| Qualitative Research (saturation-based) | No formula; theoretical saturation | Research question complexity; homogeneity of group; data richness | Typical: 15–30 in-depth interviews; 3–5 focus group discussions of 6–10 participants | Not applicable; literature guidance |
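
The cross-sectional and cluster rows of the table can be reproduced with a few lines of arithmetic. This is a sketch using the table's own inputs (p = 0.20, d = 0.05, DEFF = 1.5, 30 clusters); the function names are hypothetical, and dedicated tools such as OpenEpi remain the practical choice.

```python
import math

def n_prevalence(p, d, z=1.96):
    """Cross-sectional sample size: n = Z^2 * p * q / d^2, rounded up."""
    return math.ceil(z ** 2 * p * (1 - p) / d ** 2)

def n_cluster_design(n_srs, deff, n_clusters):
    """Inflate by DEFF, then split across clusters (both rounded up)."""
    total = math.ceil(n_srs * deff)
    per_cluster = math.ceil(total / n_clusters)
    return total, per_cluster

n_srs = n_prevalence(p=0.20, d=0.05)
total, per_cluster = n_cluster_design(n_srs, deff=1.5, n_clusters=30)
print(n_srs, total, per_cluster)  # 246 369 13
```

Rounding up at each step is a conservative convention: 369/30 is 12.3, so 13 subjects per cluster gives a final sample of 390.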
| Factor | Change | Effect on Sample Size | Reason | Teaching Analogy |
|---|---|---|---|---|
| Prevalence (p) | p → 50% | ↑ Increases | Maximum variance at p=0.5 (pq is maximised) | If you don't know heads vs tails, you need more tosses |
| Confidence Level | 95% → 99% | ↑ Increases (Z: 1.96 → 2.576) | Higher confidence uses a larger Z, so more data are needed for the same margin of error | More certain = more evidence needed |
| Allowable Error (d) | 5% → 2% | ↑ Greatly Increases (×6.25) | d is squared in denominator; halving d quadruples n | Smaller target needs more shots to hit it reliably |
| Desired Power | 80% → 90% | ↑ Increases | Higher power reduces Type II error; needs more data | Better detection = larger radar screen |
| Dropout/Non-response | Add 10–20% | ↑ Adds buffer | Some subjects will drop out; need reserves | Order extra food in case guests bring friends |
| Cluster Effect (DEFF) | DEFF = 2 | ↑ Doubles n | Clustering reduces effective information per subject | Asking 30 people from one village is not the same as asking 30 people from 30 different villages |
| Population Size (N) | N ↑ greatly | ≈ Little effect above N=10,000 | Large populations: FPC correction negligible | A teaspoon from the ocean gives same info as from a pool |
| Effect Size | Small effect | ↑ Greatly Increases | Harder to detect smaller differences | Need more tests to find a faint signal in noise |
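
Several rows in this table are simple ratios, which students can verify for themselves. The sketch below computes the multiplying factors using Python's standard `statistics.NormalDist` for the Z values. The baseline of p = 0.20 for the prevalence comparison is an assumption borrowed from the earlier worked example, and the power factor assumes a two-group formula in which n is proportional to (Zα/2 + Zβ)².

```python
from statistics import NormalDist

def z_two_sided(confidence):
    """Z value for a two-sided confidence level, e.g. 0.95 gives about 1.96."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def z_power(power):
    """Z value for the desired power, e.g. 0.80 gives about 0.84."""
    return NormalDist().inv_cdf(power)

# Each factor is the ratio new n / old n, with all other inputs held fixed
error_factor = (0.05 / 0.02) ** 2                                  # d: 5% to 2%
confidence_factor = (z_two_sided(0.99) / z_two_sided(0.95)) ** 2   # 95% to 99%
power_factor = ((z_two_sided(0.95) + z_power(0.90))
                / (z_two_sided(0.95) + z_power(0.80))) ** 2        # 80% to 90% power
prevalence_factor = (0.50 * 0.50) / (0.20 * 0.80)                  # p: 20% to 50%

for name, value in [("error", error_factor), ("confidence", confidence_factor),
                    ("power", power_factor), ("prevalence", prevalence_factor)]:
    print(f"{name}: x{value:.2f}")
```

Tightening the allowable error dominates (×6.25), while moving from 95% to 99% confidence costs roughly ×1.7 and moving from 80% to 90% power roughly ×1.3.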
Proven strategies to make sampling genuinely understood, not just memorised. Based on active learning and conceptual contrast teaching.
| Pair | Similarity (Why Students Confuse Them) | Key Difference | Exam Tip |
|---|---|---|---|
| Stratified vs Quota | Both divide population into subgroups before sampling | Stratified: RANDOM selection within strata → probability method. Quota: CONVENIENCE selection within quotas → non-probability | If random selection within groups → Stratified. If researcher fills quotas by convenience → Quota |
| Cluster vs Stratified | Both use groups/subpopulations as part of design | Cluster: Randomly SELECT clusters, study ALL units within. Stratified: Create strata, randomly select INDIVIDUALS within each | Cluster = select whole groups then study everything inside. Stratified = study a random sample from each group |
| Sampling Error vs Sampling Bias | Both affect accuracy of estimates from samples | Error: Random; decreases with n; unavoidable. Bias: Systematic; does NOT decrease with n; caused by poor method | A biased large sample is worse than a small unbiased one. Bias can only be fixed by changing the method, not by adding subjects |
| Multi-stage vs Cluster | Both involve selecting groups (clusters) at some point | Cluster: One-stage; select clusters, study all. Multi-stage: Two or more stages of selection (e.g., districts → villages → households → individuals) | NFHS uses multi-stage (PSUs → households → individuals). A simple village survey using whole villages is cluster |
| Systematic vs SRS | Both are probability methods; both give unbiased samples from lists | SRS: Truly random each time. Systematic: Periodic, so vulnerable to periodicity bias if list has cyclical pattern | If OPD register has every Monday = highest severity, systematic sampling at interval=7 will always select Monday cases → biased |
| Snowball vs Purposive | Both are non-probability; both used in qualitative research | Snowball: Participants recruit others (chain referral); grows from initial seeds. Purposive: Researcher actively selects information-rich cases based on criteria | Snowball = participants drive recruitment. Purposive = researcher drives recruitment based on judgement |
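
The periodicity problem in the Systematic vs SRS row is easy to demonstrate in class. The sketch below makes a simplifying assumption: a hypothetical register with exactly one entry per calendar day over 10 weeks, starting on a Monday. Real OPD registers have many entries per day, so the effect would appear only if the interval happened to line up with a cycle in the list.

```python
from datetime import date, timedelta

# Hypothetical register: one entry per day for 70 days, starting Monday 1 January 2024
first_day = date(2024, 1, 1)
register = [first_day + timedelta(days=i) for i in range(70)]

def systematic_sample(frame, k, start):
    """Take every kth unit beginning at a 1-based start position."""
    return frame[start - 1::k]

weekly = systematic_sample(register, k=7, start=1)  # interval matches the weekly cycle
other = systematic_sample(register, k=5, start=1)   # interval does not match the cycle

# weekday() returns 0 for Monday through 6 for Sunday
print(sorted({d.weekday() for d in weekly}))  # a single weekday only
print(sorted({d.weekday() for d in other}))   # all seven weekdays
```

With k = 7 every selected entry falls on the same weekday whatever the random start, so any weekday-related pattern in severity carries straight into the sample. Changing the interval, or shuffling the list first, breaks the alignment.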
| Research Question | Recommended Method | Justification | Sampling Frame | Practical Challenge |
|---|---|---|---|---|
| Prevalence of anaemia among adolescent girls in Odisha | Multi-stage + Stratified | Large state; heterogeneous urban/rural; hierarchical population structure | School/PHC registers at each stage; voter rolls for households | Non-school-going girls missed; consent from parents |
| Immunisation coverage in a district (WHO survey) | EPI 30×7 Cluster (PPS) | WHO's validated method; no individual frame available; clusters selected by population size | Village list with population for PPS; no individual-level frame needed | Random walk method for household selection within cluster |
| Risk behaviour among truck drivers on national highway | Snowball / RDS | Hidden population; no sampling frame exists; trust-based recruitment needed | None available | Chain-referral bias; need multiple seeds at different highway stops |
| Comparing treatment outcomes: DOTS vs self-administered therapy in TB patients | Systematic from RNTCP register | Sequential patient list available; need unbiased allocation to study arms | District RNTCP patient register | Ensure register is complete; check for periodicity in registration patterns |
| Qualitative study on barriers to institutional delivery among tribal women | Purposive Sampling | Qualitative; need information-rich cases; tribal women with experience of home delivery | No frame; ASHA worker referrals for identification | Language barriers; trust building; purposive selection criteria must be explicit |
| Monitoring vaccine coverage in 30 PHC areas after campaign | LQAS (Lot Quality Assurance) | Need accept/reject decision for each PHC area; small sample per area; operational monitoring | PHC area population list; community health workers' records | Threshold and sample size based on LQAS tables; decision rule must be pre-specified |
| Blood pressure survey in a medical college OPD (pilot study) | Consecutive Sampling | Pilot only; all eligible patients in OPD over 2 weeks; simple and complete within available setting | OPD attendance register as guide; not a frame | Healthcare-seeking bias; generalisation limited to OPD-attending population only |
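
For the immunisation coverage scenario, selecting 30 clusters with probability proportional to size is commonly done with a cumulative population list and a fixed sampling interval. The sketch below shows that general technique under stated assumptions: the village names and populations are invented, and `pps_systematic` is a hypothetical helper, not a function from any survey package.

```python
import random
from bisect import bisect_left
from itertools import accumulate

def pps_systematic(cluster_sizes, n_clusters, seed=None):
    """Select clusters with probability proportional to size (systematic PPS)."""
    rng = random.Random(seed)
    names = list(cluster_sizes)
    cumulative = list(accumulate(cluster_sizes[name] for name in names))
    total = cumulative[-1]
    interval = total / n_clusters            # sampling interval on the cumulative scale
    start = rng.random() * interval          # random start within the first interval
    selected = []
    for i in range(n_clusters):
        point = start + i * interval
        # First cluster whose cumulative population reaches the selection point
        index = min(bisect_left(cumulative, point), len(names) - 1)
        selected.append(names[index])
    return selected

# Invented list of 60 villages with populations from 550 to 3500
village_populations = {f"Village_{i:02d}": 500 + 50 * i for i in range(1, 61)}
clusters = pps_systematic(village_populations, n_clusters=30, seed=42)
print(len(clusters), clusters[:5])
```

Larger villages span more of the cumulative list and are therefore more likely to contain a selection point. A village bigger than the sampling interval could be selected more than once, which field protocols usually handle by splitting it into segments.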