diff-diff

A Python library for Difference-in-Differences (DiD) causal inference analysis with an sklearn-like API and statsmodels-style outputs.

Installation

pip install diff-diff

Or install from source:

git clone https://github.com/igerber/diff-diff.git
cd diff-diff
pip install -e .

Quick Start

import pandas as pd
from diff_diff import DifferenceInDifferences

# Create sample data
data = pd.DataFrame({
    'outcome': [10, 11, 15, 18, 9, 10, 12, 13],
    'treated': [1, 1, 1, 1, 0, 0, 0, 0],
    'post': [0, 0, 1, 1, 0, 0, 1, 1]
})

# Fit the model
did = DifferenceInDifferences()
results = did.fit(data, outcome='outcome', treatment='treated', time='post')

# View results
print(results)  # DiDResults(ATT=3.0000, SE=1.7321, p=0.1583)
results.print_summary()

Output:

======================================================================
             Difference-in-Differences Estimation Results
======================================================================

Observations:                      8
Treated units:                     4
Control units:                     4
R-squared:                    0.9055

----------------------------------------------------------------------
Parameter           Estimate    Std. Err.     t-stat      P>|t|
----------------------------------------------------------------------
ATT                   3.0000       1.7321      1.732     0.1583
----------------------------------------------------------------------

95% Confidence Interval: [-1.8089, 7.8089]

Signif. codes: '***' 0.001, '**' 0.01, '*' 0.05, '.' 0.1
======================================================================
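
The estimate is literally the difference of two differences: the treated group's mean rises from 10.5 to 16.5 (+6.0) while the control group's rises from 9.5 to 12.5 (+3.0), so the ATT is 6.0 - 3.0 = 3.0. You can verify this by hand:

# Verify the ATT by hand (`data` is the DataFrame from the Quick Start)
means = data.groupby(['treated', 'post'])['outcome'].mean()
treated_change = means.loc[(1, 1)] - means.loc[(1, 0)]  # 16.5 - 10.5 = 6.0
control_change = means.loc[(0, 1)] - means.loc[(0, 0)]  # 12.5 - 9.5 = 3.0
print(treated_change - control_change)                  # 3.0, matching the ATT above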

Features

  • sklearn-like API: Familiar fit() interface with get_params() and set_params()
  • Pythonic results: Easy access to coefficients, standard errors, and confidence intervals
  • Multiple interfaces: Column names or R-style formulas
  • Robust inference: Heteroskedasticity-robust (HC1) and cluster-robust standard errors
  • Wild cluster bootstrap: Valid inference with few clusters (<50) using Rademacher, Webb, or Mammen weights
  • Panel data support: Two-way fixed effects estimator for panel designs
  • Multi-period analysis: Event-study style DiD with period-specific treatment effects
  • Staggered adoption: Callaway-Sant'Anna (2021) and Sun-Abraham (2021) estimators for heterogeneous treatment timing
  • Triple Difference (DDD): Ortiz-Villavicencio & Sant'Anna (2025) estimators with proper covariate handling
  • Synthetic DiD: Combined DiD with synthetic control for improved robustness
  • Event study plots: Publication-ready visualization of treatment effects
  • Parallel trends testing: Multiple methods including equivalence tests
  • Goodman-Bacon decomposition: Diagnose TWFE bias by decomposing into 2x2 comparisons
  • Placebo tests: Comprehensive diagnostics including fake timing, fake group, permutation, and leave-one-out tests
  • Honest DiD sensitivity analysis: Rambachan-Roth (2023) bounds and breakdown analysis for parallel trends violations
  • Pre-trends power analysis: Roth (2022) minimum detectable violation (MDV) and power curves for pre-trends tests
  • Power analysis: MDE, sample size, and power calculations for study design; simulation-based power for any estimator
  • Data prep utilities: Helper functions for common data preparation tasks
  • Validated against R: Benchmarked against did, synthdid, and fixest packages (see benchmarks)

Tutorials

We provide Jupyter notebook tutorials in docs/tutorials/:

  • 01_basic_did.ipynb: Basic 2x2 DiD, formula interface, covariates, fixed effects, cluster-robust SE, wild bootstrap
  • 02_staggered_did.ipynb: Staggered adoption with Callaway-Sant'Anna and Sun-Abraham, group-time effects, aggregation methods, Bacon decomposition
  • 03_synthetic_did.ipynb: Synthetic DiD, unit/time weights, inference methods, regularization
  • 04_parallel_trends.ipynb: Testing parallel trends, equivalence tests, placebo tests, diagnostics
  • 05_honest_did.ipynb: Honest DiD sensitivity analysis, bounds, breakdown values, visualization
  • 06_power_analysis.ipynb: Power analysis, MDE, sample size calculations, simulation-based power
  • 07_pretrends_power.ipynb: Pre-trends power analysis (Roth 2022), MDV, power curves
  • 08_triple_diff.ipynb: Triple Difference (DDD) estimation with proper covariate handling

Data Preparation

diff-diff provides utility functions to help prepare your data for DiD analysis. These functions handle common data transformation tasks like creating treatment indicators, reshaping panel data, and validating data formats.

Generate Sample Data

Create synthetic data with a known treatment effect for testing and learning:

from diff_diff import generate_did_data, DifferenceInDifferences

# Generate panel data with 100 units, 4 periods, and a treatment effect of 5
data = generate_did_data(
    n_units=100,
    n_periods=4,
    treatment_effect=5.0,
    treatment_fraction=0.5,  # 50% of units are treated
    treatment_period=2,       # Treatment starts at period 2
    seed=42
)

# Verify the estimator recovers the treatment effect
did = DifferenceInDifferences()
results = did.fit(data, outcome='outcome', treatment='treated', time='post')
print(f"Estimated ATT: {results.att:.2f} (true: 5.0)")

Create Treatment Indicators

Convert categorical variables or numeric thresholds to binary treatment indicators:

from diff_diff import make_treatment_indicator

# From categorical variable
df = make_treatment_indicator(
    data,
    column='state',
    treated_values=['CA', 'NY', 'TX']  # These states are treated
)

# From numeric threshold (e.g., firms above median size)
df = make_treatment_indicator(
    data,
    column='firm_size',
    threshold=data['firm_size'].median()
)

# Treat units below threshold
df = make_treatment_indicator(
    data,
    column='income',
    threshold=50000,
    above_threshold=False  # Units with income <= 50000 are treated
)

Create Post-Treatment Indicators

Convert time/date columns to binary post-treatment indicators:

from diff_diff import make_post_indicator

# From specific post-treatment periods
df = make_post_indicator(
    data,
    time_column='year',
    post_periods=[2020, 2021, 2022]
)

# From treatment start date
df = make_post_indicator(
    data,
    time_column='year',
    treatment_start=2020  # All years >= 2020 are post-treatment
)

# Works with datetime columns
df = make_post_indicator(
    data,
    time_column='date',
    treatment_start='2020-01-01'
)

Reshape Wide to Long Format

Convert wide-format data (one row per unit, multiple time columns) to long format:

from diff_diff import wide_to_long

# Wide format: columns like sales_2019, sales_2020, sales_2021
wide_df = pd.DataFrame({
    'firm_id': [1, 2, 3],
    'industry': ['tech', 'retail', 'tech'],
    'sales_2019': [100, 150, 200],
    'sales_2020': [110, 160, 210],
    'sales_2021': [120, 170, 220]
})

# Convert to long format for DiD
long_df = wide_to_long(
    wide_df,
    value_columns=['sales_2019', 'sales_2020', 'sales_2021'],
    id_column='firm_id',
    time_name='year',
    value_name='sales',
    time_values=[2019, 2020, 2021]
)
# Result: 9 rows (3 firms × 3 years), columns: firm_id, year, sales, industry

Balance Panel Data

Ensure all units have observations for all time periods:

from diff_diff import balance_panel

# Keep only units with complete data (drop incomplete units)
balanced = balance_panel(
    data,
    unit_column='firm_id',
    time_column='year',
    method='inner'
)

# Include all unit-period combinations (creates NaN for missing)
balanced = balance_panel(
    data,
    unit_column='firm_id',
    time_column='year',
    method='outer'
)

# Fill missing values
balanced = balance_panel(
    data,
    unit_column='firm_id',
    time_column='year',
    method='fill',
    fill_value=0  # Or None for forward/backward fill
)

Validate Data

Check that your data meets DiD requirements before fitting:

from diff_diff import validate_did_data

# Validate and get informative error messages
result = validate_did_data(
    data,
    outcome='sales',
    treatment='treated',
    time='post',
    unit='firm_id',      # Optional: for panel-specific validation
    raise_on_error=False  # Return dict instead of raising
)

if result['valid']:
    print("Data is ready for DiD analysis!")
    print(f"Summary: {result['summary']}")
else:
    print("Issues found:")
    for error in result['errors']:
        print(f"  - {error}")

for warning in result['warnings']:
    print(f"Warning: {warning}")

Summarize Data by Groups

Get summary statistics for each treatment-time cell:

from diff_diff import summarize_did_data

summary = summarize_did_data(
    data,
    outcome='sales',
    treatment='treated',
    time='post'
)
print(summary)

Output:

                        n      mean       std       min       max
Control - Pre        250  100.5000   15.2340   65.0000  145.0000
Control - Post       250  105.2000   16.1230   68.0000  152.0000
Treated - Pre        250  101.2000   14.8900   67.0000  143.0000
Treated - Post       250  115.8000   17.5600   72.0000  165.0000
DiD Estimate           -    9.9000         -         -         -

Create Event Time for Staggered Designs

For designs where treatment occurs at different times:

from diff_diff import create_event_time

# Add event-time column relative to treatment timing
df = create_event_time(
    data,
    time_column='year',
    treatment_time_column='treatment_year'
)
# Result: event_time = -2, -1, 0, 1, 2 relative to treatment
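
For numeric time columns, this is just the difference between calendar time and the unit's treatment time. A one-line sketch of the equivalent arithmetic (illustrative, not the utility's internals):

# Equivalent manual computation for numeric periods
data['event_time'] = data['year'] - data['treatment_year']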

Aggregate to Cohort Means

Aggregate unit-level data for visualization:

from diff_diff import aggregate_to_cohorts

cohort_data = aggregate_to_cohorts(
    data,
    unit_column='firm_id',
    time_column='year',
    treatment_column='treated',
    outcome='sales'
)
# Result: mean outcome by treatment group and period

Rank Control Units

Select the best control units for DiD or Synthetic DiD analysis by ranking them based on pre-treatment outcome similarity:

from diff_diff import rank_control_units, generate_did_data

# Generate sample data
data = generate_did_data(n_units=50, n_periods=6, seed=42)

# Rank control units by their similarity to treated units
ranking = rank_control_units(
    data,
    unit_column='unit',
    time_column='period',
    outcome_column='outcome',
    treatment_column='treated',
    n_top=10  # Return top 10 controls
)

print(ranking[['unit', 'quality_score', 'pre_trend_rmse']])

Output:

   unit  quality_score  pre_trend_rmse
0    35         1.0000          0.4521
1    42         0.9234          0.5123
2    28         0.8876          0.5892
...
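
The exact scoring internals aren't shown here, but a minimal sketch of the kind of comparison behind pre_trend_rmse, assuming periods 0-2 are pre-treatment (consistent with the SyntheticDiD example below), looks like this:

import numpy as np

# RMSE between one control unit's pre-treatment path and the mean treated path
pre = data[data['period'] < 3]
treated_path = pre[pre['treated'] == 1].groupby('period')['outcome'].mean()

def pre_trend_rmse(unit_id):
    path = pre[pre['unit'] == unit_id].set_index('period')['outcome']
    return np.sqrt(((path - treated_path) ** 2).mean())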

With covariates for matching:

# Add covariate-based matching
ranking = rank_control_units(
    data,
    unit_column='unit',
    time_column='period',
    outcome_column='outcome',
    treatment_column='treated',
    covariates=['size', 'age'],  # Match on these too
    outcome_weight=0.7,          # 70% weight on outcome trends
    covariate_weight=0.3         # 30% weight on covariate similarity
)

Filter data for SyntheticDiD using top controls:

from diff_diff import SyntheticDiD

# Get top control units
top_controls = ranking['unit'].tolist()

# Filter data to treated + top controls
filtered_data = data[
    (data['treated'] == 1) | (data['unit'].isin(top_controls))
]

# Fit SyntheticDiD with selected controls
sdid = SyntheticDiD()
results = sdid.fit(
    filtered_data,
    outcome='outcome',
    treatment='treated',
    unit='unit',
    time='period',
    post_periods=[3, 4, 5]
)

Usage

Basic DiD with Column Names

from diff_diff import DifferenceInDifferences

did = DifferenceInDifferences(robust=True, alpha=0.05)
results = did.fit(
    data,
    outcome='sales',
    treatment='treated',
    time='post_policy'
)

# Access results
print(f"ATT: {results.att:.4f}")
print(f"Standard Error: {results.se:.4f}")
print(f"P-value: {results.p_value:.4f}")
print(f"95% CI: {results.conf_int}")
print(f"Significant: {results.is_significant}")

Using Formula Interface

# R-style formula syntax
results = did.fit(data, formula='outcome ~ treated * post')

# Explicit interaction syntax
results = did.fit(data, formula='outcome ~ treated + post + treated:post')

# With covariates
results = did.fit(data, formula='outcome ~ treated * post + age + income')

Including Covariates

results = did.fit(
    data,
    outcome='outcome',
    treatment='treated',
    time='post',
    covariates=['age', 'income', 'education']
)

Fixed Effects

Use fixed_effects for low-dimensional categorical controls (creates dummy variables):

# State and industry fixed effects
results = did.fit(
    data,
    outcome='sales',
    treatment='treated',
    time='post',
    fixed_effects=['state', 'industry']
)

# Access fixed effect coefficients
state_coefs = {k: v for k, v in results.coefficients.items() if k.startswith('state_')}

Use absorb for high-dimensional fixed effects (more efficient, uses within-transformation):

# Absorb firm-level fixed effects (efficient for many firms)
results = did.fit(
    data,
    outcome='sales',
    treatment='treated',
    time='post',
    absorb=['firm_id']
)
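
Absorbing a fixed effect is equivalent to demeaning every variable within the absorbed groups before running OLS (the within-transformation). A minimal sketch of the idea, not the library's internals, where 'did_term' is a stand-in for the treated-by-post interaction:

# Demean each variable within firm; OLS on the demeaned data recovers
# the same slopes as a regression with one dummy per firm
demeaned = data.copy()
for col in ['sales', 'did_term']:
    demeaned[col] = data[col] - data.groupby('firm_id')[col].transform('mean')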

Combine covariates with fixed effects:

results = did.fit(
    data,
    outcome='sales',
    treatment='treated',
    time='post',
    covariates=['size', 'age'],           # Linear controls
    fixed_effects=['industry'],            # Low-dimensional FE (dummies)
    absorb=['firm_id']                     # High-dimensional FE (absorbed)
)

Cluster-Robust Standard Errors

did = DifferenceInDifferences(cluster='state')
results = did.fit(
    data,
    outcome='outcome',
    treatment='treated',
    time='post'
)

Wild Cluster Bootstrap

When you have few clusters (<50), standard cluster-robust SEs tend to be biased downward, producing over-rejection. The wild cluster bootstrap provides valid inference even with 5-10 clusters.

# Use wild bootstrap for inference
did = DifferenceInDifferences(
    cluster='state',
    inference='wild_bootstrap',
    n_bootstrap=999,
    bootstrap_weights='rademacher',  # or 'webb' for <10 clusters, 'mammen'
    seed=42
)
results = did.fit(data, outcome='y', treatment='treated', time='post')

# Results include bootstrap-based SE and p-value
print(f"ATT: {results.att:.3f} (SE: {results.se:.3f})")
print(f"P-value: {results.p_value:.4f}")
print(f"95% CI: {results.conf_int}")
print(f"Inference method: {results.inference_method}")
print(f"Number of clusters: {results.n_clusters}")

Weight types:

  • 'rademacher' - Default, ±1 with p=0.5, good for most cases
  • 'webb' - 6-point distribution, recommended for <10 clusters
  • 'mammen' - Two-point distribution, alternative to Rademacher

Works with DifferenceInDifferences and TwoWayFixedEffects estimators.
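
For intuition, the bootstrap resamples by multiplying cluster-level residuals by random weights rather than resampling observations. A self-contained sketch of how the three weight distributions are drawn (the estimator handles this internally):

import numpy as np

rng = np.random.default_rng(42)
n_clusters = 8

# Rademacher: +1 or -1 with equal probability
rademacher = rng.choice([-1.0, 1.0], size=n_clusters)

# Webb: six points, +/- sqrt(1/2), 1, sqrt(3/2), each with probability 1/6
webb_points = np.sqrt([0.5, 1.0, 1.5])
webb = rng.choice(np.concatenate([-webb_points, webb_points]), size=n_clusters)

# Mammen: two points chosen to match the first three moments
phi = (1 + np.sqrt(5)) / 2
mammen = rng.choice([1 - phi, phi], size=n_clusters,
                    p=[phi / np.sqrt(5), 1 - phi / np.sqrt(5)])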

Two-Way Fixed Effects (Panel Data)

from diff_diff import TwoWayFixedEffects

twfe = TwoWayFixedEffects()
results = twfe.fit(
    panel_data,
    outcome='outcome',
    treatment='treated',
    time='year',
    unit='firm_id'
)

Multi-Period DiD (Event Study)

For settings with multiple pre- and post-treatment periods:

from diff_diff import MultiPeriodDiD

# Fit with multiple time periods
did = MultiPeriodDiD()
results = did.fit(
    panel_data,
    outcome='sales',
    treatment='treated',
    time='period',
    post_periods=[3, 4, 5],      # Periods 3-5 are post-treatment
    reference_period=0           # Reference period for comparison
)

# View period-specific treatment effects
for period, effect in results.period_effects.items():
    print(f"Period {period}: {effect.effect:.3f} (SE: {effect.se:.3f})")

# View average treatment effect across post-periods
print(f"Average ATT: {results.avg_att:.3f}")
print(f"Average SE: {results.avg_se:.3f}")

# Full summary with all period effects
results.print_summary()

Output:

================================================================================
            Multi-Period Difference-in-Differences Estimation Results
================================================================================

Observations:                      600
Pre-treatment periods:             3
Post-treatment periods:            3

--------------------------------------------------------------------------------
Average Treatment Effect
--------------------------------------------------------------------------------
Average ATT       5.2000       0.8234      6.315      0.0000
--------------------------------------------------------------------------------
95% Confidence Interval: [3.5862, 6.8138]

Period-Specific Effects:
--------------------------------------------------------------------------------
Period            Effect     Std. Err.     t-stat      P>|t|
--------------------------------------------------------------------------------
3                 4.5000       0.9512      4.731      0.0000***
4                 5.2000       0.8876      5.858      0.0000***
5                 5.9000       0.9123      6.468      0.0000***
--------------------------------------------------------------------------------

Signif. codes: '***' 0.001, '**' 0.01, '*' 0.05, '.' 0.1
================================================================================

Staggered Difference-in-Differences (Callaway-Sant'Anna)

When treatment is adopted at different times by different units, traditional TWFE estimators can be biased because they implicitly use already-treated units as controls. The Callaway-Sant'Anna estimator avoids these comparisons and yields valid estimates under staggered adoption.
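
Its building block is the group-time effect ATT(g, t): the average effect in period t for the cohort first treated in period g. With a never-treated comparison group and no covariates, each ATT(g, t) is identified by a simple 2x2 contrast using period g-1 as the baseline:

    ATT(g, t) = E[Y_t - Y_(g-1) | first treated in g] - E[Y_t - Y_(g-1) | never treated]

These building blocks are then aggregated, for example by relative time to form an event study.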

from diff_diff import CallawaySantAnna

# Panel data with staggered treatment
# 'first_treat' = period when unit was first treated (0 if never treated)
cs = CallawaySantAnna()
results = cs.fit(
    panel_data,
    outcome='sales',
    unit='firm_id',
    time='year',
    first_treat='first_treat',  # 0 for never-treated, else first treatment year
    aggregate='event_study'      # Compute event study effects
)

# View results
results.print_summary()

# Access group-time effects ATT(g,t)
for (group, time), effect in results.group_time_effects.items():
    print(f"Cohort {group}, Period {time}: {effect['effect']:.3f}")

# Event study effects (averaged by relative time)
for rel_time, effect in results.event_study_effects.items():
    print(f"e={rel_time}: {effect['effect']:.3f} (SE: {effect['se']:.3f})")

# Convert to DataFrame
df = results.to_dataframe(level='event_study')

Output:

=====================================================================================
          Callaway-Sant'Anna Staggered Difference-in-Differences Results
=====================================================================================

Total observations:                     600
Treated units:                           35
Control units:                           15
Treatment cohorts:                        3
Time periods:                             8
Control group:                never_treated

-------------------------------------------------------------------------------------
                  Overall Average Treatment Effect on the Treated
-------------------------------------------------------------------------------------
Parameter         Estimate     Std. Err.     t-stat      P>|t|   Sig.
-------------------------------------------------------------------------------------
ATT                 2.5000       0.3521       7.101     0.0000   ***
-------------------------------------------------------------------------------------

95% Confidence Interval: [1.8099, 3.1901]

-------------------------------------------------------------------------------------
                          Event Study (Dynamic) Effects
-------------------------------------------------------------------------------------
Rel. Period       Estimate     Std. Err.     t-stat      P>|t|   Sig.
-------------------------------------------------------------------------------------
0                   2.1000       0.4521       4.645     0.0000   ***
1                   2.5000       0.4123       6.064     0.0000   ***
2                   2.8000       0.5234       5.349     0.0000   ***
-------------------------------------------------------------------------------------

Signif. codes: '***' 0.001, '**' 0.01, '*' 0.05, '.' 0.1
=====================================================================================

When to use Callaway-Sant'Anna vs TWFE:

  • All units treated at the same time: TWFE works well
  • Staggered adoption with homogeneous effects: either estimator works
  • Staggered adoption with heterogeneous effects: use Callaway-Sant'Anna
  • Event study with staggered timing: use Callaway-Sant'Anna
  • Fewer than ~20 treated units: depends on design

Parameters:

CallawaySantAnna(
    control_group='never_treated',  # or 'not_yet_treated'
    anticipation=0,                  # Periods before treatment with effects
    estimation_method='dr',          # 'dr', 'ipw', or 'reg'
    alpha=0.05,                      # Significance level
    cluster=None,                    # Column for cluster SEs
    n_bootstrap=0,                   # Bootstrap iterations (0 = analytical SEs)
    bootstrap_weight_type='rademacher',  # 'rademacher', 'mammen', or 'webb'
    seed=None                        # Random seed
)

Multiplier bootstrap for inference:

With few clusters or when analytical standard errors may be unreliable, use the multiplier bootstrap for valid inference. This implements the approach from Callaway & Sant'Anna (2021).

# Bootstrap inference with 999 iterations
cs = CallawaySantAnna(
    n_bootstrap=999,
    bootstrap_weight_type='rademacher',  # or 'mammen', 'webb'
    seed=42
)
results = cs.fit(
    data,
    outcome='sales',
    unit='firm_id',
    time='year',
    first_treat='first_treat',
    aggregate='event_study'
)

# Access bootstrap results
print(f"Overall ATT: {results.overall_att:.3f}")
print(f"Bootstrap SE: {results.bootstrap_results.overall_att_se:.3f}")
print(f"Bootstrap 95% CI: {results.bootstrap_results.overall_att_ci}")
print(f"Bootstrap p-value: {results.bootstrap_results.overall_att_p_value:.4f}")

# Event study bootstrap inference
for rel_time, se in results.bootstrap_results.event_study_ses.items():
    ci = results.bootstrap_results.event_study_cis[rel_time]
    print(f"e={rel_time}: SE={se:.3f}, 95% CI=[{ci[0]:.3f}, {ci[1]:.3f}]")

Bootstrap weight types:

  • 'rademacher' - Default, ±1 with p=0.5, good for most cases
  • 'mammen' - Two-point distribution matching first 3 moments
  • 'webb' - Six-point distribution, recommended for very few clusters (<10)

Covariate adjustment for conditional parallel trends:

When parallel trends only holds conditional on covariates, use the covariates parameter:

# Doubly robust estimation with covariates
cs = CallawaySantAnna(estimation_method='dr')  # 'dr', 'ipw', or 'reg'
results = cs.fit(
    data,
    outcome='sales',
    unit='firm_id',
    time='year',
    first_treat='first_treat',
    covariates=['size', 'age', 'industry'],  # Covariates for conditional PT
    aggregate='event_study'
)

Sun-Abraham Interaction-Weighted Estimator

The Sun-Abraham (2021) estimator provides an alternative to Callaway-Sant'Anna using an interaction-weighted (IW) regression approach. Running both estimators serves as a useful robustness check—when they agree, results are more credible.

from diff_diff import SunAbraham

# Basic usage
sa = SunAbraham()
results = sa.fit(
    panel_data,
    outcome='sales',
    unit='firm_id',
    time='year',
    first_treat='first_treat'  # 0 for never-treated, else first treatment year
)

# View results
results.print_summary()

# Event study effects (by relative time to treatment)
for rel_time, effect in results.event_study_effects.items():
    print(f"e={rel_time}: {effect['effect']:.3f} (SE: {effect['se']:.3f})")

# Overall ATT
print(f"Overall ATT: {results.overall_att:.3f} (SE: {results.overall_se:.3f})")

# Cohort weights (how each cohort contributes to each event-time estimate)
for rel_time, weights in results.cohort_weights.items():
    print(f"e={rel_time}: {weights}")

Parameters:

SunAbraham(
    control_group='never_treated',  # or 'not_yet_treated'
    anticipation=0,                  # Periods before treatment with effects
    alpha=0.05,                      # Significance level
    cluster=None,                    # Column for cluster SEs
    n_bootstrap=0,                   # Bootstrap iterations (0 = analytical SEs)
    bootstrap_weights='rademacher',  # 'rademacher', 'mammen', or 'webb'
    seed=None                        # Random seed
)

Bootstrap inference:

# Bootstrap inference with 999 iterations
sa = SunAbraham(
    n_bootstrap=999,
    bootstrap_weights='rademacher',
    seed=42
)
results = sa.fit(
    data,
    outcome='sales',
    unit='firm_id',
    time='year',
    first_treat='first_treat'
)

# Access bootstrap results
print(f"Overall ATT: {results.overall_att:.3f}")
print(f"Bootstrap SE: {results.bootstrap_results.overall_att_se:.3f}")
print(f"Bootstrap 95% CI: {results.bootstrap_results.overall_att_ci}")
print(f"Bootstrap p-value: {results.bootstrap_results.overall_att_p_value:.4f}")

When to use Sun-Abraham vs Callaway-Sant'Anna:

  • Approach: Sun-Abraham uses an interaction-weighted regression; Callaway-Sant'Anna aggregates 2x2 DiDs
  • Efficiency: Sun-Abraham is more efficient under homogeneous effects; Callaway-Sant'Anna is more robust to heterogeneity
  • Weighting: Sun-Abraham weights by cohort share at each relative time; Callaway-Sant'Anna weights by sample size
  • Use case: Sun-Abraham as a robustness check with regression-based inference; Callaway-Sant'Anna as the primary staggered DiD estimator

Both estimators should give similar results when:

  • Treatment effects are relatively homogeneous across cohorts
  • Parallel trends holds

Running both as robustness check:

from diff_diff import CallawaySantAnna, SunAbraham

# Callaway-Sant'Anna
cs = CallawaySantAnna()
cs_results = cs.fit(data, outcome='y', unit='unit', time='time', first_treat='first_treat')

# Sun-Abraham
sa = SunAbraham()
sa_results = sa.fit(data, outcome='y', unit='unit', time='time', first_treat='first_treat')

# Compare
print(f"Callaway-Sant'Anna ATT: {cs_results.overall_att:.3f}")
print(f"Sun-Abraham ATT: {sa_results.overall_att:.3f}")

# If results differ substantially, investigate heterogeneity

Triple Difference (DDD)

Triple Difference (DDD) is used when treatment requires satisfying two criteria: belonging to a treated group AND being in an eligible partition. The TripleDifference class implements the methodology from Ortiz-Villavicencio & Sant'Anna (2025), which correctly handles covariate adjustment (unlike naive implementations).

from diff_diff import TripleDifference, triple_difference

# Basic usage
ddd = TripleDifference(estimation_method='dr')  # doubly robust (recommended)
results = ddd.fit(
    data,
    outcome='wages',
    group='policy_state',       # 1=state enacted policy, 0=control state
    partition='female',         # 1=women (affected by policy), 0=men
    time='post'                 # 1=post-policy, 0=pre-policy
)

# View results
results.print_summary()
print(f"ATT: {results.att:.3f} (SE: {results.se:.3f})")

# With covariates (properly incorporated, unlike naive DDD)
results = ddd.fit(
    data,
    outcome='wages',
    group='policy_state',
    partition='female',
    time='post',
    covariates=['age', 'education', 'experience']
)

Estimation methods:

  • 'dr' (doubly robust): Recommended. Consistent if either the outcome model or the propensity model is correctly specified
  • 'reg' (regression adjustment): Simple outcome regression with full interactions
  • 'ipw' (inverse probability weighting): Use when the propensity score model is well-specified

# Compare estimation methods
for method in ['reg', 'ipw', 'dr']:
    est = TripleDifference(estimation_method=method)
    res = est.fit(data, outcome='y', group='g', partition='p', time='t')
    print(f"{method}: ATT={res.att:.3f} (SE={res.se:.3f})")

Convenience function:

# One-liner estimation
results = triple_difference(
    data,
    outcome='wages',
    group='policy_state',
    partition='female',
    time='post',
    covariates=['age', 'education'],
    estimation_method='dr'
)

Why use DDD instead of DiD?

DDD allows for violations of parallel trends that are:

  • Group-specific (e.g., economic shocks in treatment states)
  • Partition-specific (e.g., trends affecting women everywhere)

As long as these biases are additive, DDD differences them out. The key assumption is that the differential trend between eligible and ineligible units would be the same across groups.
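
Without covariates, the estimand is literally a difference of two DiDs, which you can compute from the eight cell means by hand (a sketch of the unadjusted estimator, using the columns from the example above):

# DDD = DiD among eligible units minus DiD among ineligible units
def post_minus_pre(df):
    m = df.groupby('post')['wages'].mean()
    return m[1] - m[0]

def did(df):
    return (post_minus_pre(df[df['policy_state'] == 1])
            - post_minus_pre(df[df['policy_state'] == 0]))

ddd = did(data[data['female'] == 1]) - did(data[data['female'] == 0])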

Event Study Visualization

Create publication-ready event study plots:

from diff_diff import plot_event_study, MultiPeriodDiD, CallawaySantAnna, SunAbraham

# From MultiPeriodDiD
did = MultiPeriodDiD()
results = did.fit(data, outcome='y', treatment='treated',
                  time='period', post_periods=[3, 4, 5])
plot_event_study(results, title="Treatment Effects Over Time")

# From CallawaySantAnna (with event study aggregation)
cs = CallawaySantAnna()
results = cs.fit(data, outcome='y', unit='unit', time='period',
                 first_treat='first_treat', aggregate='event_study')
plot_event_study(results, title="Staggered DiD Event Study (CS)")

# From SunAbraham
sa = SunAbraham()
results = sa.fit(data, outcome='y', unit='unit', time='period',
                 first_treat='first_treat')
plot_event_study(results, title="Staggered DiD Event Study (SA)")

# From a DataFrame
df = pd.DataFrame({
    'period': [-2, -1, 0, 1, 2],
    'effect': [0.1, 0.05, 0.0, 2.5, 2.8],
    'se': [0.3, 0.25, 0.0, 0.4, 0.45]
})
plot_event_study(df, reference_period=0)

# With customization
ax = plot_event_study(
    results,
    title="Dynamic Treatment Effects",
    xlabel="Years Relative to Treatment",
    ylabel="Effect on Sales ($1000s)",
    color="#2563eb",
    marker="o",
    shade_pre=True,           # Shade pre-treatment region
    show_zero_line=True,      # Horizontal line at y=0
    show_reference_line=True, # Vertical line at reference period
    figsize=(10, 6),
    show=False                # Don't call plt.show(), return axes
)

Synthetic Difference-in-Differences

Synthetic DiD combines the strengths of Difference-in-Differences and Synthetic Control methods by re-weighting control units to better match treated units' pre-treatment outcomes.
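
Conceptually, given fitted unit weights omega (over controls) and time weights lam (over pre-periods), the estimator is a weighted 2x2 DiD. A self-contained toy sketch of the estimand, using uniform weights for illustration (the library optimizes them):

import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(6, 5))                 # toy outcomes: 6 units x 5 periods
treated = np.array([False] * 5 + [True])    # last unit is treated
control = ~treated
pre, post = np.arange(3), np.arange(3, 5)   # periods 0-2 pre, 3-4 post
omega = np.full(control.sum(), 1 / control.sum())  # unit weights over controls
lam = np.full(len(pre), 1 / len(pre))              # time weights over pre-periods

att = (
    (Y[treated][:, post].mean() - lam @ Y[treated][:, pre].mean(axis=0))
    - (omega @ Y[control][:, post].mean(axis=1) - omega @ (Y[control][:, pre] @ lam))
)
# With uniform weights this reduces to plain DiD; SDiD's gains come from
# choosing omega and lam to match the treated unit's pre-treatment path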

from diff_diff import SyntheticDiD

# Fit Synthetic DiD model
sdid = SyntheticDiD()
results = sdid.fit(
    panel_data,
    outcome='gdp_growth',
    treatment='treated',
    unit='state',
    time='year',
    post_periods=[2015, 2016, 2017, 2018]
)

# View results
results.print_summary()
print(f"ATT: {results.att:.3f} (SE: {results.se:.3f})")

# Examine unit weights (which control units matter most)
weights_df = results.get_unit_weights_df()
print(weights_df.head(10))

# Examine time weights
time_weights_df = results.get_time_weights_df()
print(time_weights_df)

Output:

===========================================================================
         Synthetic Difference-in-Differences Estimation Results
===========================================================================

Observations:                      500
Treated units:                       1
Control units:                      49
Pre-treatment periods:               6
Post-treatment periods:              4
Regularization (lambda):        0.0000
Pre-treatment fit (RMSE):       0.1234

---------------------------------------------------------------------------
Parameter         Estimate     Std. Err.     t-stat      P>|t|
---------------------------------------------------------------------------
ATT                 2.5000       0.4521      5.530      0.0000
---------------------------------------------------------------------------

95% Confidence Interval: [1.6139, 3.3861]

---------------------------------------------------------------------------
                   Top Unit Weights (Synthetic Control)
---------------------------------------------------------------------------
  Unit state_12: 0.3521
  Unit state_5: 0.2156
  Unit state_23: 0.1834
  Unit state_8: 0.1245
  Unit state_31: 0.0892
  (8 units with weight > 0.001)

Signif. codes: '***' 0.001, '**' 0.01, '*' 0.05, '.' 0.1
===========================================================================

When to Use Synthetic DiD Over Vanilla DiD

Use Synthetic DiD instead of standard DiD when:

  1. Few treated units: When you have only one or a small number of treated units (e.g., a single state passed a policy), standard DiD averages across all controls equally. Synthetic DiD finds the optimal weighted combination of controls.

    # Example: California passed a policy, want to estimate its effect
    # Standard DiD would compare CA to the average of all other states
    # Synthetic DiD finds states that together best match CA's pre-treatment trend
  2. Parallel trends is questionable: When treated and control groups have different pre-treatment levels or trends, Synthetic DiD can construct a better counterfactual by matching the pre-treatment trajectory.

    # Example: A tech hub city vs rural areas
    # Rural areas may not be a good comparison on average
    # Synthetic DiD can weight urban/suburban controls more heavily
  3. Heterogeneous control units: When control units are very different from each other, equal weighting (as in standard DiD) is suboptimal.

    # Example: Comparing a treated developing country to other countries
    # Some control countries may be much more similar economically
    # Synthetic DiD upweights the most comparable controls
  4. You want transparency: Synthetic DiD provides explicit unit weights showing which controls contribute most to the comparison.

    # See exactly which units are driving the counterfactual
    print(results.get_unit_weights_df())

Key differences from standard DiD:

  • Control weighting: equal (1/N) in standard DiD; optimized to match pre-treatment outcomes in Synthetic DiD
  • Time weighting: equal across periods in standard DiD; Synthetic DiD can emphasize informative periods
  • Number of treated units: standard DiD can handle many; Synthetic DiD works with a single treated unit
  • Parallel trends: assumed in standard DiD; partially relaxed via matching in Synthetic DiD
  • Interpretability: simple average in standard DiD; explicit weights in Synthetic DiD

Parameters:

SyntheticDiD(
    lambda_reg=0.0,     # Regularization toward uniform weights (0 = no reg)
    zeta=1.0,           # Time weight regularization (higher = more uniform)
    alpha=0.05,         # Significance level
    n_bootstrap=200,    # Bootstrap iterations for SE (0 = placebo-based)
    seed=None           # Random seed for reproducibility
)

Working with Results

Export Results

# As dictionary
results.to_dict()
# {'att': 3.5, 'se': 1.26, 'p_value': 0.037, ...}

# As DataFrame
df = results.to_dataframe()

Check Significance

if results.is_significant:
    print(f"Effect is significant at {did.alpha} level")

# Get significance stars
print(f"ATT: {results.att}{results.significance_stars}")
# ATT: 3.5000*

Access Full Regression Output

# All coefficients
results.coefficients
# {'const': 9.5, 'treated': 1.0, 'post': 2.5, 'treated:post': 3.5}

# Variance-covariance matrix
results.vcov

# Residuals and fitted values
results.residuals
results.fitted_values

# R-squared
results.r_squared

Checking Assumptions

Parallel Trends

Simple slope-based test:

from diff_diff.utils import check_parallel_trends

trends = check_parallel_trends(
    data,
    outcome='outcome',
    time='period',
    treatment_group='treated'
)

print(f"Treated trend: {trends['treated_trend']:.4f}")
print(f"Control trend: {trends['control_trend']:.4f}")
print(f"Difference p-value: {trends['p_value']:.4f}")

Robust distributional test (Wasserstein distance):

from diff_diff.utils import check_parallel_trends_robust

results = check_parallel_trends_robust(
    data,
    outcome='outcome',
    time='period',
    treatment_group='treated',
    unit='firm_id',              # Unit identifier for panel data
    pre_periods=[2018, 2019],    # Pre-treatment periods
    n_permutations=1000          # Permutations for p-value
)

print(f"Wasserstein distance: {results['wasserstein_distance']:.4f}")
print(f"Wasserstein p-value: {results['wasserstein_p_value']:.4f}")
print(f"KS test p-value: {results['ks_p_value']:.4f}")
print(f"Parallel trends plausible: {results['parallel_trends_plausible']}")

The Wasserstein (Earth Mover's) distance compares the full distribution of outcome changes, not just means. This is more robust to:

  • Non-normal distributions
  • Heterogeneous effects across units
  • Outliers
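
A minimal sketch of the underlying distance computation using scipy (the utility adds permutation-based inference on top), reusing the columns from the example above:

from scipy.stats import wasserstein_distance

# Distributions of pre-treatment outcome changes, by group
df = data.sort_values(['firm_id', 'period']).copy()
df['change'] = df.groupby('firm_id')['outcome'].diff()
pre = df[df['period'] <= 2019].dropna(subset=['change'])

d = wasserstein_distance(
    pre.loc[pre['treated'] == 1, 'change'],
    pre.loc[pre['treated'] == 0, 'change'],
)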

Equivalence testing (TOST):

from diff_diff.utils import equivalence_test_trends

results = equivalence_test_trends(
    data,
    outcome='outcome',
    time='period',
    treatment_group='treated',
    unit='firm_id',
    equivalence_margin=0.5       # Define "practically equivalent"
)

print(f"Mean difference: {results['mean_difference']:.4f}")
print(f"TOST p-value: {results['tost_p_value']:.4f}")
print(f"Trends equivalent: {results['equivalent']}")

Honest DiD Sensitivity Analysis (Rambachan-Roth)

Pre-trends tests have low power and can exacerbate bias. Honest DiD (Rambachan & Roth 2023) provides sensitivity analysis showing how robust your results are to violations of parallel trends.

from diff_diff import HonestDiD, MultiPeriodDiD

# First, fit a standard event study
did = MultiPeriodDiD()
event_results = did.fit(
    data,
    outcome='outcome',
    treatment='treated',
    time='period',
    post_periods=[5, 6, 7, 8, 9]
)

# Compute honest bounds with relative magnitudes restriction
# M=1 means post-treatment violations can be up to 1x the worst pre-treatment violation
honest = HonestDiD(method='relative_magnitude', M=1.0)
honest_results = honest.fit(event_results)

print(honest_results.summary())
print(f"Original estimate: {honest_results.original_estimate:.4f}")
print(f"Robust 95% CI: [{honest_results.ci_lb:.4f}, {honest_results.ci_ub:.4f}]")
print(f"Effect robust to violations: {honest_results.is_significant}")

Sensitivity analysis over M values:

# How do results change as we allow larger violations?
sensitivity = honest.sensitivity_analysis(
    event_results,
    M_grid=[0, 0.5, 1.0, 1.5, 2.0]
)

print(sensitivity.summary())
print(f"Breakdown value: M = {sensitivity.breakdown_M}")
# Breakdown = smallest M where the robust CI includes zero

Breakdown value:

The breakdown value tells you how robust your conclusion is:

breakdown = honest.breakdown_value(event_results)
if breakdown >= 1.0:
    print("Result holds even if post-treatment violations are as bad as pre-treatment")
else:
    print(f"Result requires violations smaller than {breakdown:.1f}x pre-treatment")

Smoothness restriction (alternative approach):

# Bounds second differences of trend violations
# M=0 means linear extrapolation of pre-trends
honest_smooth = HonestDiD(method='smoothness', M=0.5)
smooth_results = honest_smooth.fit(event_results)

Visualization:

from diff_diff import plot_sensitivity, plot_honest_event_study

# Plot sensitivity analysis
plot_sensitivity(sensitivity, title="Sensitivity to Parallel Trends Violations")

# Event study with honest confidence intervals
plot_honest_event_study(event_results, honest_results)

Pre-Trends Power Analysis (Roth 2022)

A passing pre-trends test doesn't mean parallel trends holds—it may just mean the test has low power. Pre-Trends Power Analysis (Roth 2022) answers: "What violations could my pre-trends test have detected?"

from diff_diff import PreTrendsPower, MultiPeriodDiD

# First, fit an event study
did = MultiPeriodDiD()
event_results = did.fit(
    data,
    outcome='outcome',
    treatment='treated',
    time='period',
    post_periods=[5, 6, 7, 8, 9]
)

# Analyze pre-trends test power
pt = PreTrendsPower(alpha=0.05, power=0.80)
power_results = pt.fit(event_results)

print(power_results.summary())
print(f"Minimum Detectable Violation (MDV): {power_results.mdv:.4f}")
print(f"Power to detect violations of size MDV: {power_results.power:.1%}")

Key concepts:

  • Minimum Detectable Violation (MDV): Smallest violation magnitude that would be detected with your target power (e.g., 80%). Passing the pre-trends test does NOT rule out violations up to this size.
  • Power: Probability of detecting a violation of a given size if it exists (see the sketch after this list).
  • Violation types: Linear trend, constant violation, last-period only, or custom patterns.
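
For a single pre-trend coefficient with standard error se, power against a violation of size M follows from the usual normal approximation. A self-contained sketch (illustrative, not the library's internals, which handle joint tests across periods):

from scipy.stats import norm

def pretrends_power(M, se, alpha=0.05):
    # Power of a two-sided z-test to detect a pre-trend coefficient of size M
    z = norm.ppf(1 - alpha / 2)
    return norm.cdf(-z + M / se) + norm.cdf(-z - M / se)

print(pretrends_power(M=0.5, se=0.2))  # ~0.70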

Power curve visualization:

from diff_diff import plot_pretrends_power

# Generate power curve across violation magnitudes
curve = pt.power_curve(event_results)

# Plot the power curve
plot_pretrends_power(curve, title="Pre-Trends Test Power Curve")

# Or from the curve object directly
curve.plot()

Different violation patterns:

# Linear trend violations (default) - most common assumption
pt_linear = PreTrendsPower(violation_type='linear')

# Constant violation in all pre-periods
pt_constant = PreTrendsPower(violation_type='constant')

# Violation only in the last pre-period (sharp break)
pt_last = PreTrendsPower(violation_type='last_period')

# Custom violation pattern
import numpy as np

custom_weights = np.array([0.1, 0.3, 0.6])  # Increasing violations
pt_custom = PreTrendsPower(violation_type='custom', violation_weights=custom_weights)

Combining with HonestDiD:

Pre-trends power analysis and HonestDiD are complementary:

  1. Pre-trends power tells you what the test could have detected
  2. HonestDiD tells you how robust your results are to violations

from diff_diff import HonestDiD, PreTrendsPower

# If MDV is large relative to your estimated effect, be cautious
pt = PreTrendsPower()
power_results = pt.fit(event_results)
sensitivity = pt.sensitivity_to_honest_did(event_results)
print(sensitivity['interpretation'])

# Use HonestDiD for robust inference
honest = HonestDiD(method='relative_magnitude', M=1.0)
honest_results = honest.fit(event_results)

Placebo Tests

Placebo tests help validate the parallel trends assumption by checking whether effects appear where they shouldn't (before treatment or in untreated groups).

Fake timing test:

from diff_diff import run_placebo_test

# Test: Is there an effect before treatment actually occurred?
# Actual treatment is at period 3 (post_periods=[3, 4, 5])
# We test if a "fake" treatment at period 1 shows an effect
results = run_placebo_test(
    data,
    outcome='outcome',
    treatment='treated',
    time='period',
    test_type='fake_timing',
    fake_treatment_period=1,  # Pretend treatment was in period 1
    post_periods=[3, 4, 5]    # Actual post-treatment periods
)

print(results.summary())
# If parallel trends hold, placebo_effect should be ~0 and not significant
print(f"Placebo effect: {results.placebo_effect:.3f} (p={results.p_value:.3f})")
print(f"Is significant (bad): {results.is_significant}")

Fake group test:

# Test: Is there an effect among never-treated units?
# Get some control unit IDs to use as "fake treated"
control_units = data[data['treated'] == 0]['firm_id'].unique()[:5]

results = run_placebo_test(
    data,
    outcome='outcome',
    treatment='treated',
    time='period',
    unit='firm_id',
    test_type='fake_group',
    fake_treatment_group=list(control_units),  # List of control unit IDs
    post_periods=[3, 4, 5]
)

Permutation test:

# Randomly reassign treatment and compute distribution of effects
# Note: requires binary post indicator (use 'post' column, not 'period')
results = run_placebo_test(
    data,
    outcome='outcome',
    treatment='treated',
    time='post',           # Binary post-treatment indicator
    unit='firm_id',
    test_type='permutation',
    n_permutations=1000,
    seed=42
)

print(f"Original effect: {results.original_effect:.3f}")
print(f"Permutation p-value: {results.p_value:.4f}")
# Low p-value indicates the effect is unlikely to be due to chance

Leave-one-out sensitivity:

# Test sensitivity to individual treated units
# Note: requires binary post indicator (use 'post' column, not 'period')
results = run_placebo_test(
    data,
    outcome='outcome',
    treatment='treated',
    time='post',           # Binary post-treatment indicator
    unit='firm_id',
    test_type='leave_one_out'
)

# Check if any single unit drives the result
print(results.leave_one_out_effects)  # Effect when each unit is dropped

Run all placebo tests:

from diff_diff import run_all_placebo_tests

# Comprehensive diagnostic suite
# Note: This function runs fake_timing tests on pre-treatment periods.
# The permutation and leave_one_out tests require a binary post indicator,
# so they may return errors if the data uses multi-period time column.
all_results = run_all_placebo_tests(
    data,
    outcome='outcome',
    treatment='treated',
    time='period',
    unit='firm_id',
    pre_periods=[0, 1, 2],
    post_periods=[3, 4, 5],
    n_permutations=500,
    seed=42
)

for test_name, result in all_results.items():
    if hasattr(result, 'p_value'):
        print(f"{test_name}: p={result.p_value:.3f}, significant={result.is_significant}")
    elif isinstance(result, dict) and 'error' in result:
        print(f"{test_name}: Error - {result['error']}")

API Reference

DifferenceInDifferences

DifferenceInDifferences(
    robust=True,      # Use HC1 robust standard errors
    cluster=None,     # Column for cluster-robust SEs
    alpha=0.05        # Significance level for CIs
)

Methods:

  • fit(data, outcome, treatment, time, ...): Fit the DiD model
  • summary(): Get formatted summary string
  • print_summary(): Print summary to stdout
  • get_params(): Get estimator parameters (sklearn-compatible)
  • set_params(**params): Set estimator parameters (sklearn-compatible)

fit() Parameters:

  • data (DataFrame): Input data
  • outcome (str): Outcome variable column name
  • treatment (str): Treatment indicator column (0/1)
  • time (str): Post-treatment indicator column (0/1)
  • formula (str): R-style formula (alternative to column names)
  • covariates (list): Linear control variables
  • fixed_effects (list): Categorical FE columns (creates dummies)
  • absorb (list): High-dimensional FE (within-transformation)

DiDResults

Attributes:

  • att: Average Treatment effect on the Treated
  • se: Standard error of ATT
  • t_stat: T-statistic
  • p_value: P-value for H0: ATT = 0
  • conf_int: Tuple of (lower, upper) confidence bounds
  • n_obs: Number of observations
  • n_treated: Number of treated units
  • n_control: Number of control units
  • r_squared: R-squared of regression
  • coefficients: Dictionary of all coefficients
  • is_significant: Boolean for significance at alpha
  • significance_stars: String of significance stars

Methods:

  • summary(alpha): Get formatted summary string
  • print_summary(alpha): Print summary to stdout
  • to_dict(): Convert to dictionary
  • to_dataframe(): Convert to pandas DataFrame

MultiPeriodDiD

MultiPeriodDiD(
    robust=True,      # Use HC1 robust standard errors
    cluster=None,     # Column for cluster-robust SEs
    alpha=0.05        # Significance level for CIs
)

fit() Parameters:

  • data (DataFrame): Input data
  • outcome (str): Outcome variable column name
  • treatment (str): Treatment indicator column (0/1)
  • time (str): Time period column (multiple values)
  • post_periods (list): List of post-treatment period values
  • covariates (list): Linear control variables
  • fixed_effects (list): Categorical FE columns (creates dummies)
  • absorb (list): High-dimensional FE (within-transformation)
  • reference_period (any): Omitted period for time dummies

MultiPeriodDiDResults

Attributes:

  • period_effects: Dict mapping periods to PeriodEffect objects
  • avg_att: Average ATT across post-treatment periods
  • avg_se: Standard error of average ATT
  • avg_t_stat: T-statistic for average ATT
  • avg_p_value: P-value for average ATT
  • avg_conf_int: Confidence interval for average ATT
  • n_obs: Number of observations
  • pre_periods: List of pre-treatment periods
  • post_periods: List of post-treatment periods

Methods:

  • get_effect(period): Get PeriodEffect for specific period
  • summary(alpha): Get formatted summary string
  • print_summary(alpha): Print summary to stdout
  • to_dict(): Convert to dictionary
  • to_dataframe(): Convert to pandas DataFrame

PeriodEffect

Attributes:

  • period: Time period identifier
  • effect: Treatment effect estimate
  • se: Standard error
  • t_stat: T-statistic
  • p_value: P-value
  • conf_int: Confidence interval
  • is_significant: Boolean for significance at 0.05
  • significance_stars: String of significance stars

SyntheticDiD

SyntheticDiD(
    lambda_reg=0.0,     # L2 regularization for unit weights
    zeta=1.0,           # Regularization for time weights
    alpha=0.05,         # Significance level for CIs
    n_bootstrap=200,    # Bootstrap iterations for SE
    seed=None           # Random seed for reproducibility
)

fit() Parameters:

  • data (DataFrame): Panel data
  • outcome (str): Outcome variable column name
  • treatment (str): Treatment indicator column (0/1)
  • unit (str): Unit identifier column
  • time (str): Time period column
  • post_periods (list): List of post-treatment period values
  • covariates (list): Covariates to residualize out

SyntheticDiDResults

Attributes:

  • att: Average Treatment effect on the Treated
  • se: Standard error (bootstrap or placebo-based)
  • t_stat: T-statistic
  • p_value: P-value
  • conf_int: Confidence interval
  • n_obs: Number of observations
  • n_treated: Number of treated units
  • n_control: Number of control units
  • unit_weights: Dict mapping control unit IDs to weights
  • time_weights: Dict mapping pre-treatment periods to weights
  • pre_periods: List of pre-treatment periods
  • post_periods: List of post-treatment periods
  • pre_treatment_fit: RMSE of synthetic vs treated in pre-period
  • placebo_effects: Array of placebo effect estimates

Methods:

  • summary(alpha): Get formatted summary string
  • print_summary(alpha): Print summary to stdout
  • to_dict(): Convert to dictionary
  • to_dataframe(): Convert to pandas DataFrame
  • get_unit_weights_df(): Get unit weights as DataFrame
  • get_time_weights_df(): Get time weights as DataFrame

SunAbraham

SunAbraham(
    control_group='never_treated',  # or 'not_yet_treated'
    anticipation=0,           # Periods of anticipation effects
    alpha=0.05,               # Significance level for CIs
    cluster=None,             # Column for cluster-robust SEs
    n_bootstrap=0,            # Bootstrap iterations (0 = analytical SEs)
    bootstrap_weights='rademacher',  # 'rademacher', 'mammen', or 'webb'
    seed=None                 # Random seed
)

fit() Parameters:

  • data (DataFrame): Panel data
  • outcome (str): Outcome variable column name
  • unit (str): Unit identifier column
  • time (str): Time period column
  • first_treat (str): Column with first treatment period (0 for never-treated)
  • covariates (list): Covariate column names
  • min_pre_periods (int): Minimum pre-treatment periods to include
  • min_post_periods (int): Minimum post-treatment periods to include

SunAbrahamResults

Attributes:

  • event_study_effects: Dict mapping relative time to effect info
  • overall_att: Overall average treatment effect
  • overall_se: Standard error of overall ATT
  • overall_t_stat: T-statistic for overall ATT
  • overall_p_value: P-value for overall ATT
  • overall_conf_int: Confidence interval for overall ATT
  • cohort_weights: Dict mapping relative time to cohort weights
  • groups: List of treatment cohorts
  • time_periods: List of all time periods
  • n_obs: Total number of observations
  • n_treated_units: Number of ever-treated units
  • n_control_units: Number of never-treated units
  • is_significant: Boolean for significance at alpha
  • significance_stars: String of significance stars
  • bootstrap_results: SABootstrapResults (if bootstrap enabled)

Methods:

  • summary(alpha): Get formatted summary string
  • print_summary(alpha): Print summary to stdout
  • to_dataframe(level): Convert to DataFrame ('event_study' or 'cohort')

TripleDifference

TripleDifference(
    estimation_method='dr',   # 'dr' (doubly robust), 'reg', or 'ipw'
    robust=True,              # Use HC1 robust standard errors
    cluster=None,             # Column for cluster-robust SEs
    alpha=0.05,               # Significance level for CIs
    pscore_trim=0.01          # Propensity score trimming threshold
)

fit() Parameters:

  • data (DataFrame): Input data
  • outcome (str): Outcome variable column name
  • group (str): Group indicator column (0/1), 1 = treated group
  • partition (str): Partition/eligibility indicator column (0/1), 1 = eligible
  • time (str): Time indicator column (0/1), 1 = post-treatment
  • covariates (list): Covariate column names for adjustment

TripleDifferenceResults

Attributes:

  • att: Average Treatment effect on the Treated
  • se: Standard error of ATT
  • t_stat: T-statistic
  • p_value: P-value for H0: ATT = 0
  • conf_int: Tuple of (lower, upper) confidence bounds
  • n_obs: Total number of observations
  • n_treated_eligible: Obs in treated group & eligible partition
  • n_treated_ineligible: Obs in treated group & ineligible partition
  • n_control_eligible: Obs in control group & eligible partition
  • n_control_ineligible: Obs in control group & ineligible partition
  • estimation_method: Method used ('dr', 'reg', or 'ipw')
  • group_means: Dict of cell means for diagnostics
  • pscore_stats: Propensity score statistics (IPW/DR only)
  • is_significant: Boolean for significance at alpha
  • significance_stars: String of significance stars

Methods:

  • summary(alpha): Get formatted summary string
  • print_summary(alpha): Print summary to stdout
  • to_dict(): Convert to dictionary
  • to_dataframe(): Convert to pandas DataFrame

HonestDiD

HonestDiD(
    method='relative_magnitude',  # 'relative_magnitude' or 'smoothness'
    M=None,               # Restriction parameter (default: 1.0 for RM, 0.0 for SD)
    alpha=0.05,           # Significance level for CIs
    l_vec=None            # Linear combination vector for target parameter
)

fit() Parameters:

  • results (MultiPeriodDiDResults): Results from MultiPeriodDiD.fit()
  • M (float): Restriction parameter (overrides constructor value)

Methods:

  • fit(results, M): Compute bounds for given event study results
  • sensitivity_analysis(results, M_grid): Compute bounds over grid of M values
  • breakdown_value(results, tol): Find smallest M where CI includes zero

HonestDiDResults

Attributes:

  • original_estimate: Point estimate under parallel trends
  • lb: Lower bound of identified set
  • ub: Upper bound of identified set
  • ci_lb: Lower bound of robust confidence interval
  • ci_ub: Upper bound of robust confidence interval
  • ci_width: Width of robust CI
  • M: Restriction parameter used
  • method: Restriction method ('relative_magnitude' or 'smoothness')
  • alpha: Significance level
  • is_significant: True if robust CI excludes zero

Methods:

  • summary(): Get formatted summary string
  • to_dict(): Convert to dictionary
  • to_dataframe(): Convert to pandas DataFrame

SensitivityResults

Attributes:

  • M_grid: Array of M values analyzed
  • results: List of HonestDiDResults for each M
  • breakdown_M: Smallest M where CI includes zero (None if always significant)

Methods:

  • summary(): Get formatted summary string
  • plot(ax): Plot sensitivity analysis
  • to_dataframe(): Convert to pandas DataFrame

PreTrendsPower

PreTrendsPower(
    alpha=0.05,           # Significance level for pre-trends test
    power=0.80,           # Target power for MDV calculation
    violation_type='linear',  # 'linear', 'constant', 'last_period', 'custom'
    violation_weights=None    # Custom weights (required if violation_type='custom')
)

fit() Parameters:

  • results (MultiPeriodDiDResults): Results from event study
  • M (float): Specific violation magnitude to evaluate

Methods:

  • fit(results, M): Compute power analysis for given event study
  • power_at(results, M): Compute power for specific violation magnitude
  • power_curve(results, M_grid, n_points): Compute power across range of M values
  • sensitivity_to_honest_did(results): Compare with HonestDiD analysis

PreTrendsPowerResults

Attributes:

  • power: Power to detect the specified violation
  • mdv: Minimum detectable violation at target power
  • violation_magnitude: Violation magnitude (M) tested
  • violation_type: Type of violation pattern
  • alpha: Significance level
  • target_power: Target power level
  • n_pre_periods: Number of pre-treatment periods
  • test_statistic: Expected test statistic under violation
  • critical_value: Critical value for pre-trends test
  • noncentrality: Non-centrality parameter
  • is_informative: Heuristic check if test is informative
  • power_adequate: Whether power meets target

Methods:

  • summary(): Get formatted summary string
  • print_summary(): Print summary to stdout
  • to_dict(): Convert to dictionary
  • to_dataframe(): Convert to pandas DataFrame

PreTrendsPowerCurve

Attributes:

  • M_values: Array of violation magnitudes
  • powers: Array of power values
  • mdv: Minimum detectable violation
  • alpha: Significance level
  • target_power: Target power level
  • violation_type: Type of violation pattern

Methods:

| Method | Description |
|--------|-------------|
| plot(ax, show_mdv, show_target) | Plot the power curve |
| to_dataframe() | Convert to a DataFrame with M and power columns |
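
A sketch of tracing out a power curve, assuming power_curve() accepts M_grid as a keyword and returns a PreTrendsPowerCurve:

import numpy as np
from diff_diff import PreTrendsPower  # top-level import assumed

ptp = PreTrendsPower(violation_type='linear')
curve = ptp.power_curve(event_study, M_grid=np.linspace(0.0, 1.0, 50))

curve.plot(show_mdv=True, show_target=True)   # power vs. violation magnitude
print(curve.to_dataframe().head())            # M and power columns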

Data Preparation Functions

generate_did_data

generate_did_data(
    n_units=100,          # Number of units
    n_periods=4,          # Number of time periods
    treatment_effect=5.0, # True ATT
    treatment_fraction=0.5,  # Fraction treated
    treatment_period=2,   # First post-treatment period
    unit_fe_sd=2.0,       # Unit fixed effect std dev
    time_trend=0.5,       # Linear time trend
    noise_sd=1.0,         # Idiosyncratic noise std dev
    seed=None             # Random seed
)

Returns a DataFrame with columns unit, period, treated, post, outcome, and true_effect.
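
A quick sketch pairing the generator with the basic estimator from the Quick Start; the top-level import of generate_did_data is assumed:

from diff_diff import DifferenceInDifferences, generate_did_data

# Simulate a panel with a known ATT of 5.0
data = generate_did_data(n_units=200, n_periods=4, treatment_effect=5.0,
                         treatment_period=2, seed=42)

did = DifferenceInDifferences()
results = did.fit(data, outcome='outcome', treatment='treated', time='post')
print(results)   # estimate should be close to the true effect of 5.0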

make_treatment_indicator

make_treatment_indicator(
    data,                 # Input DataFrame
    column,               # Column to create treatment from
    treated_values=None,  # Value(s) indicating treatment
    threshold=None,       # Numeric threshold for treatment
    above_threshold=True, # If True, >= threshold is treated
    new_column='treated'  # Output column name
)

make_post_indicator

make_post_indicator(
    data,                  # Input DataFrame
    time_column,           # Time/period column
    post_periods=None,     # Specific post-treatment period(s)
    treatment_start=None,  # First post-treatment period
    new_column='post'      # Output column name
)
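
A sketch combining the two indicator helpers, assuming top-level imports and that each returns a copy of the DataFrame with the new 0/1 column added (column names and values here are hypothetical):

from diff_diff import make_treatment_indicator, make_post_indicator

# Mark two states as the treated group
df = make_treatment_indicator(df, column='state',
                              treated_values=['NJ', 'PA'])

# Everything from 2015 onward counts as the post period
df = make_post_indicator(df, time_column='year', treatment_start=2015)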

wide_to_long

wide_to_long(
    data,                  # Wide-format DataFrame
    value_columns,         # List of time-varying columns
    id_column,             # Unit identifier column
    time_name='period',    # Name for time column
    value_name='value',    # Name for value column
    time_values=None       # Values for time periods
)
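
A sketch, assuming wide has one row per unit with hypothetical columns id, y_2019, y_2020, y_2021 and that the function returns a long-format DataFrame:

from diff_diff import wide_to_long  # top-level import assumed

long_df = wide_to_long(wide,
                       value_columns=['y_2019', 'y_2020', 'y_2021'],
                       id_column='id',
                       time_name='year',
                       value_name='outcome',
                       time_values=[2019, 2020, 2021])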

balance_panel

balance_panel(
    data,                  # Panel DataFrame
    unit_column,           # Unit identifier column
    time_column,           # Time period column
    method='inner',        # 'inner', 'outer', or 'fill'
    fill_value=None        # Value for filling (if method='fill')
)
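
A sketch, assuming method='inner' keeps only units observed in every period:

from diff_diff import balance_panel  # top-level import assumed

balanced = balance_panel(panel, unit_column='unit', time_column='period',
                         method='inner')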

validate_did_data

validate_did_data(
    data,                  # DataFrame to validate
    outcome,               # Outcome column name
    treatment,             # Treatment column name
    time,                  # Time/post column name
    unit=None,             # Unit column (for panel validation)
    raise_on_error=True    # Raise ValueError or return dict
)

Returns a dict with keys valid, errors, warnings, and summary.
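
A sketch of non-raising validation using the documented return keys; the top-level import is assumed:

from diff_diff import validate_did_data

report = validate_did_data(data, outcome='outcome', treatment='treated',
                           time='post', unit='unit', raise_on_error=False)

if not report['valid']:
    for err in report['errors']:
        print('ERROR:', err)
for warn in report['warnings']:
    print('WARNING:', warn)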

summarize_did_data

summarize_did_data(
    data,                  # Input DataFrame
    outcome,               # Outcome column name
    treatment,             # Treatment column name
    time,                  # Time/post column name
    unit=None              # Unit column (optional)
)

Returns a DataFrame with summary statistics for each treatment-by-time cell.

create_event_time

create_event_time(
    data,                  # Panel DataFrame
    time_column,           # Calendar time column
    treatment_time_column, # Column with treatment timing
    new_column='event_time' # Output column name
)
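
A sketch, assuming event time is computed as calendar time minus treatment time (so negative values are pre-treatment periods); the column names are hypothetical:

from diff_diff import create_event_time  # top-level import assumed

panel = create_event_time(panel, time_column='year',
                          treatment_time_column='first_treated_year')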

aggregate_to_cohorts

aggregate_to_cohorts(
    data,                  # Unit-level panel data
    unit_column,           # Unit identifier column
    time_column,           # Time period column
    treatment_column,      # Treatment indicator column
    outcome,               # Outcome variable column
    covariates=None        # Additional columns to aggregate
)
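
A sketch, assuming the function returns cohort-level aggregates (the exact aggregation rule is an assumption here); the covariate name is hypothetical:

from diff_diff import aggregate_to_cohorts  # top-level import assumed

cohorts = aggregate_to_cohorts(panel,
                               unit_column='unit',
                               time_column='period',
                               treatment_column='treated',
                               outcome='outcome',
                               covariates=['population'])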

rank_control_units

rank_control_units(
    data,                          # Panel data in long format
    unit_column,                   # Unit identifier column
    time_column,                   # Time period column
    outcome_column,                # Outcome variable column
    treatment_column=None,         # Treatment indicator column (0/1)
    treated_units=None,            # Explicit list of treated unit IDs
    pre_periods=None,              # Pre-treatment periods (default: first half)
    covariates=None,               # Covariate columns for matching
    outcome_weight=0.7,            # Weight for outcome trend similarity (0-1)
    covariate_weight=0.3,          # Weight for covariate distance (0-1)
    exclude_units=None,            # Units to exclude from control pool
    require_units=None,            # Units that must appear in output
    n_top=None,                    # Return only top N controls
    suggest_treatment_candidates=False,  # Identify treatment candidates
    n_treatment_candidates=5,      # Number of treatment candidates
    lambda_reg=0.0                 # Regularization for synthetic weights
)

Returns a DataFrame with columns unit, quality_score, outcome_trend_score, covariate_score, synthetic_weight, pre_trend_rmse, and is_required.
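
A sketch using hypothetical covariate names; the output columns selected are the documented ones:

from diff_diff import rank_control_units  # top-level import assumed

ranked = rank_control_units(panel,
                            unit_column='unit',
                            time_column='period',
                            outcome_column='outcome',
                            treatment_column='treated',
                            covariates=['population', 'income'],
                            n_top=10)

print(ranked[['unit', 'quality_score', 'pre_trend_rmse']])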

Requirements

  • Python >= 3.9
  • numpy >= 1.20
  • pandas >= 1.3
  • scipy >= 1.7

Development

# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Format code
black diff_diff tests
ruff check diff_diff tests

References

This library implements methods from the following scholarly works:

Difference-in-Differences

  • Ashenfelter, O., & Card, D. (1985). "Using the Longitudinal Structure of Earnings to Estimate the Effect of Training Programs." The Review of Economics and Statistics, 67(4), 648-660. https://doi.org/10.2307/1924810

  • Card, D., & Krueger, A. B. (1994). "Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania." The American Economic Review, 84(4), 772-793. https://www.jstor.org/stable/2118030

  • Angrist, J. D., & Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press. Chapter 5: Differences-in-Differences.

Two-Way Fixed Effects

  • Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data (2nd ed.). MIT Press.

  • Imai, K., & Kim, I. S. (2021). "On the Use of Two-Way Fixed Effects Regression Models for Causal Inference with Panel Data." Political Analysis, 29(3), 405-415. https://doi.org/10.1017/pan.2020.33

Robust Standard Errors

  • White, H. (1980). "A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity." Econometrica, 48(4), 817-838. https://doi.org/10.2307/1912934

  • MacKinnon, J. G., & White, H. (1985). "Some Heteroskedasticity-Consistent Covariance Matrix Estimators with Improved Finite Sample Properties." Journal of Econometrics, 29(3), 305-325. https://doi.org/10.1016/0304-4076(85)90158-7

  • Cameron, A. C., Gelbach, J. B., & Miller, D. L. (2011). "Robust Inference With Multiway Clustering." Journal of Business & Economic Statistics, 29(2), 238-249. https://doi.org/10.1198/jbes.2010.07136

Wild Cluster Bootstrap

  • Cameron, A. C., Gelbach, J. B., & Miller, D. L. (2008). "Bootstrap-Based Improvements for Inference with Clustered Errors." The Review of Economics and Statistics, 90(3), 414-427. https://doi.org/10.1162/rest.90.3.414

Placebo Tests and DiD Diagnostics

  • Bertrand, M., Duflo, E., & Mullainathan, S. (2004). "How Much Should We Trust Differences-in-Differences Estimates?" The Quarterly Journal of Economics, 119(1), 249-275. https://doi.org/10.1162/003355304772839588

Synthetic Control Method

  • Abadie, A., & Gardeazabal, J. (2003). "The Economic Costs of Conflict: A Case Study of the Basque Country." The American Economic Review, 93(1), 113-132. https://doi.org/10.1257/000282803321455188

  • Abadie, A., Diamond, A., & Hainmueller, J. (2010). "Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California's Tobacco Control Program." Journal of the American Statistical Association, 105(490), 493-505. https://doi.org/10.1198/jasa.2009.ap08746

  • Abadie, A., Diamond, A., & Hainmueller, J. (2015). "Comparative Politics and the Synthetic Control Method." American Journal of Political Science, 59(2), 495-510. https://doi.org/10.1111/ajps.12116

Synthetic Difference-in-Differences

  • Arkhangelsky, D., Athey, S., Hirshberg, D. A., Imbens, G. W., & Wager, S. (2021). "Synthetic Difference-in-Differences." American Economic Review, 111(12), 4088-4118. https://doi.org/10.1257/aer.20190159

Triple Difference (DDD)

  • Ortiz-Villavicencio, M., & Sant'Anna, P. H. C. (2025). "Better Understanding Triple Differences Estimators." Working Paper. https://arxiv.org/abs/2505.09942

    This paper shows that common DDD implementations (taking the difference between two DiDs, or applying three-way fixed effects regressions) are generally invalid when identification requires conditioning on covariates. The TripleDifference class implements their regression adjustment, inverse probability weighting, and doubly robust estimators.

  • Gruber, J. (1994). "The Incidence of Mandated Maternity Benefits." American Economic Review, 84(3), 622-641. https://www.jstor.org/stable/2118071

    Classic paper introducing the Triple Difference design for policy evaluation.

  • Olden, A., & Møen, J. (2022). "The Triple Difference Estimator." The Econometrics Journal, 25(3), 531-553. https://doi.org/10.1093/ectj/utac010

Parallel Trends and Pre-Trend Testing

  • Roth, J. (2022). "Pretest with Caution: Event-Study Estimates after Testing for Parallel Trends." American Economic Review: Insights, 4(3), 305-322. https://doi.org/10.1257/aeri.20210236

  • Lakens, D. (2017). "Equivalence Tests: A Practical Primer for t Tests, Correlations, and Meta-Analyses." Social Psychological and Personality Science, 8(4), 355-362. https://doi.org/10.1177/1948550617697177

Honest DiD / Sensitivity Analysis

The HonestDiD module implements sensitivity analysis methods for relaxing the parallel trends assumption:

  • Rambachan, A., & Roth, J. (2023). "A More Credible Approach to Parallel Trends." The Review of Economic Studies, 90(5), 2555-2591. https://doi.org/10.1093/restud/rdad018

    This paper introduces the "Honest DiD" framework implemented in our HonestDiD class:

    • Relative Magnitudes (ΔRM): Bounds post-treatment violations by a multiple of observed pre-treatment violations
    • Smoothness (ΔSD): Bounds on second differences of trend violations, allowing for linear extrapolation of pre-trends
    • Breakdown Analysis: Finding the smallest violation magnitude that would overturn conclusions
    • Robust Confidence Intervals: Valid inference under partial identification

  • Roth, J., & Sant'Anna, P. H. C. (2023). "When Is Parallel Trends Sensitive to Functional Form?" Econometrica, 91(2), 737-747. https://doi.org/10.3982/ECTA19402

    Discusses functional form sensitivity in parallel trends assumptions, relevant to understanding when smoothness restrictions are appropriate.

Multi-Period and Staggered Adoption

  • Callaway, B., & Sant'Anna, P. H. C. (2021). "Difference-in-Differences with Multiple Time Periods." Journal of Econometrics, 225(2), 200-230. https://doi.org/10.1016/j.jeconom.2020.12.001

  • Sun, L., & Abraham, S. (2021). "Estimating Dynamic Treatment Effects in Event Studies with Heterogeneous Treatment Effects." Journal of Econometrics, 225(2), 175-199. https://doi.org/10.1016/j.jeconom.2020.09.006

  • Goodman-Bacon, A. (2021). "Difference-in-Differences with Variation in Treatment Timing." Journal of Econometrics, 225(2), 254-277. https://doi.org/10.1016/j.jeconom.2021.03.014

Power Analysis

  • Bloom, H. S. (1995). "Minimum Detectable Effects: A Simple Way to Report the Statistical Power of Experimental Designs." Evaluation Review, 19(5), 547-556. https://doi.org/10.1177/0193841X9501900504

  • Burlig, F., Preonas, L., & Woerman, M. (2020). "Panel Data and Experimental Design." Journal of Development Economics, 144, 102458. https://doi.org/10.1016/j.jdeveco.2020.102458

    Essential reference for power analysis in panel DiD designs. Discusses how serial correlation (ICC) affects power and provides formulas for panel data settings.

  • Djimeu, E. W., & Houndolo, D.-G. (2016). "Power Calculation for Causal Inference in Social Science: Sample Size and Minimum Detectable Effect Determination." Journal of Development Effectiveness, 8(4), 508-527. https://doi.org/10.1080/19439342.2016.1244555

General Causal Inference

  • Imbens, G. W., & Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press.

  • Cunningham, S. (2021). Causal Inference: The Mixtape. Yale University Press. https://mixtape.scunning.com/

License

MIT License
