---
title: "Six Ways to Evaluate a Policy using R: Comparative Case Studies of Proposition 99"
subtitle: "DiD, ITS, RDD, Synthetic Control, and CausalImpact on California's 1988 cigarette tax"
author: "Carlos Mendez"
date: "2026-05-15"
format:
  html:
    toc: true
    toc-depth: 3
    code-fold: true
    code-summary: "Show code"
    theme: darkly
    fig-width: 9
    fig-height: 5.5
    fig-dpi: 300
execute:
  warning: false
  message: false
---

## 1. Overview

How do you measure the causal effect of a policy when you cannot randomize who gets treated? In January 1989, California raised its cigarette tax by 25 cents per pack. The reform was called **Proposition 99**. Per-capita cigarette sales in California then fell from 116 packs in 1988 to 60 packs in 2000 --- almost a 50% drop. But the country as a whole was also smoking less. So the question this tutorial is built around is deceptively simple:

> **How much of California's drop was caused by Proposition 99, and how much would have happened anyway?**

This tutorial is inspired by the workshop [causalpolicy.nl](https://causalpolicy.nl/) by the ODISSEI Social Data Science team. We run **six method families on the same dataset** and place every estimate on a single forest plot. The disagreements are then visible at a glance.

| # | Method family | One-line idea |
|---|---|---|
| 1 | Naive pre-post | Compare California's mean before and after 1989. |
| 2 | Difference-in-Differences (DiD) | Subtract a control state's pre/post change from California's. |
| 3 | Interrupted Time Series (ITS) | Extrapolate California's *own* pre-trend forward. Two flavours: linear growth curve and auto-selected ARIMA. |
| 4 | Regression Discontinuity on time (RDD) | Fit a piecewise line with a level and slope break at the policy date. |
| 5 | Synthetic Control | Build a weighted blend of donor states that mimics California's pre-period. |
| 6 | CausalImpact | Fit a Bayesian time-series model that uses donor states as predictors. |

Every method shares the same underlying logic. It builds a **counterfactual** --- what California's smoking *would have looked like* without Proposition 99 --- and reports the gap between observed and counterfactual as the estimated effect. What changes from method to method is *how* the counterfactual is built.

The case study is famous. The original Synthetic Control paper by [Abadie, Diamond, and Hainmueller (2010)](https://www.aeaweb.org/articles?id=10.1257/jasa.2010.ap08746) used exactly this dataset. We replicate their estimate within rounding, then watch what happens when five other estimators are swapped in.

**The headline finding.** Five of the six methods agree on a 13--20 pack reduction per capita. One method (DiD against a single Nevada control) collapses to noise. One method (ITS with auto-selected ARIMA) flips sign entirely. The disagreement is the lesson.

If you want to go deeper on a specific method after this tour, two sister tutorials cover the same territory in much greater detail. [Difference-in-Differences for Policy Evaluation](https://carlos-mendez.org/post/r_did/) walks through staggered adoption, Callaway--Sant'Anna group-time ATTs, and HonestDiD sensitivity analysis. [Bayesian Spatial Synthetic Control](https://carlos-mendez.org/post/r_sc_bayes_spatial/) revisits Proposition 99 with a spatial Bayesian extension of the synthetic-control machinery.

**Learning objectives:**

- Understand why a within-unit pre-post comparison is biased --- and how each causal estimator tries to fix that bias.
- Build, fit, and interpret DiD, ITS (growth-curve and ARIMA), RDD-on-time, Synthetic Control (`tidysynth`), and CausalImpact models in R.
- Read a `synthetic_control()` pipeline end-to-end: predictors, donor weights, placebo permutations, balance tables.
- Compare six estimators on a single forest plot and explain *why* they disagree where they do.
- Apply **estimand discipline** --- name the causal quantity each method targets before quoting any number.

### How to read this tutorial

Each method section follows the same four-part rhythm:

1. **The idea.** One sentence on what the method does conceptually.
2. **The code.** A short, focused R block.
3. **The output.** The numbers printed by the model.
4. **What it means.** A plain-language interpretation that ties back to the case-study question.

If you are short on time, **read the bold one-liners** in each method section for a fast tour. Read the full prose when you need the details. The Cross-method comparison (§12) and Discussion (§13) put all seven estimates side-by-side and explain the pattern.

### The shared logic of every method

The diagram below makes the common skeleton explicit. Each method needs three ingredients: California's observed outcome, a counterfactual (constructed from a different data source by each method), and the gap between the two. The gap is the estimated effect.

```{mermaid}
flowchart LR
    OBS["California observed<br/>1989–2000<br/>(mean ≈ 60 packs)"]
    CF["Counterfactual<br/>What California would have looked like<br/>WITHOUT Proposition 99"]
    EFF["Effect =<br/>Observed − Counterfactual"]
    OBS --> EFF
    CF --> EFF

    SRC1["Method 1: California's pre-1989 mean"] -.-> CF
    SRC2["Method 2: Nevada's pre→post change"] -.-> CF
    SRC3["Method 3: California's own pre-trend extrapolated"] -.-> CF
    SRC4["Method 4: piecewise line around 1989"] -.-> CF
    SRC5["Method 5: weighted blend of donor states"] -.-> CF
    SRC6["Method 6: Bayesian time-series fit on donors"] -.-> CF

    style OBS fill:#d97757,stroke:#141413,color:#fff
    style CF fill:#6a9bcc,stroke:#141413,color:#fff
    style EFF fill:#00d4c8,stroke:#141413,color:#141413
```

Read the diagram from left to right. The orange box (California observed) is fixed --- every method sees the same data. The blue box (counterfactual) is the *construction*, and the six dashed arrows feeding it show how each method differs in its source of information. The teal box on the right is the universal output: a number measuring the gap. The whole rest of this tutorial is a guided tour of those six dashed arrows.

### Which method when?

The six methods are not interchangeable. Each one is appropriate for a different data situation. The decision tree below walks through three diagnostic questions and steers you to the matching family. Apply it whenever you face a new policy-evaluation problem.

```{mermaid}
flowchart TB
    Q1{"Do you have a credible<br/>donor pool of untreated units?<br/>(roughly 10+ similar units<br/>followed over the same window)"}
    Q1 -->|Yes| Q2{"Do you want frequentist<br/>placebo inference,<br/>or Bayesian credible<br/>intervals?"}
    Q1 -->|No, only one good control| DiD["Difference-in-Differences<br/>(this tutorial: §6)<br/><br/>Cost: parallel-trends assumption<br/>on a single comparison unit"]
    Q1 -->|No, only the treated unit| Q3{"Is the policy date the<br/>only structural break<br/>in the series?"}

    Q2 -->|Frequentist + tidy code| SCM["Synthetic Control<br/>(this tutorial: §10)<br/><br/>Cost: convex-combination<br/>of donors must match the<br/>treated pre-period"]
    Q2 -->|Bayesian + uncertainty bands| CI["CausalImpact<br/>(this tutorial: §11)<br/><br/>Cost: state-space prior;<br/>covariate-set choice<br/>affects the estimate"]

    Q3 -->|Yes, sharp jump at threshold| RDD["Regression Discontinuity on time<br/>(this tutorial: §9)<br/><br/>Cost: assumes no other shock<br/>coincides with the threshold"]
    Q3 -->|No, model the smooth pre-trend| ITS["Interrupted Time Series<br/>(this tutorial: §7 and §8)<br/><br/>Cost: pre-trend extrapolation<br/>must be specified correctly"]

    NAIVE["Naive pre-post<br/>(this tutorial: §5)<br/><br/>Use only as a baseline.<br/>Never as a causal estimate."] -.->|"baseline for everyone"| Q1

    style Q1 fill:#1f2b5e,stroke:#141413,color:#fff
    style Q2 fill:#1f2b5e,stroke:#141413,color:#fff
    style Q3 fill:#1f2b5e,stroke:#141413,color:#fff
    style DiD fill:#6a9bcc,stroke:#141413,color:#fff
    style SCM fill:#d97757,stroke:#141413,color:#fff
    style CI fill:#d97757,stroke:#141413,color:#fff
    style RDD fill:#6a9bcc,stroke:#141413,color:#fff
    style ITS fill:#6a9bcc,stroke:#141413,color:#fff
    style NAIVE fill:#7a8395,stroke:#141413,color:#fff
```

The orange terminal nodes (Synthetic Control, CausalImpact) are the most defensible families when a donor pool exists --- and they happen to be the methods that produce the consensus estimate later in this tutorial. The blue nodes are valid choices in their respective data situations but carry stronger identifying assumptions. The grey naive-pre-post node is the universal baseline that *everyone* should compute first --- never as the final answer, always as the bias yardstick.

### Key concepts at a glance

This post leans on a small vocabulary repeatedly. The rest of the tutorial assumes you can move between these terms quickly.

**1. Counterfactual.** The outcome a treated unit *would have shown* in the absence of treatment. It is the thing you cannot observe but must somehow construct in order to estimate a causal effect. *Example*: in this post, "California's cigarette sales in 1995 if Proposition 99 had not passed" is the counterfactual. Every method we cover builds one differently.

**2. Parallel trends.** The identifying assumption behind classical DiD: in the absence of treatment, the treated and control units would have moved in *parallel* over time. Differences in *levels* are allowed; differences in *trends* are not.

**3. Interrupted Time Series (ITS).** A class of methods that fits a model to the treated unit's *pre-period* data, extrapolates that model into the post-period as a counterfactual, and averages the residual gap. ITS does not need a comparison unit, but it pays for that in stronger modelling assumptions about the pre-trend.

**4. Regression Discontinuity on time.** A regression discontinuity design where the *running variable* is the calendar year and the *threshold* is the policy adoption date. Practically, it is a piecewise linear regression of the form `cigsale ~ year + post + year:post` that allows both a level jump and a slope change at the threshold.

**5. Donor pool.** The set of untreated units from which Synthetic Control draws weights to build a synthetic version of the treated unit. The data-driven weighting algorithm chooses how much of each donor to use.

**6. Posterior credible interval.** A Bayesian interval that has a 95% probability of containing the true parameter, *given* the data and the prior. It is the Bayesian counterpart to a frequentist 95% confidence interval, but with a far more natural interpretation.

**7. Average Treatment effect on the Treated (ATT) versus Average Treatment Effect (ATE).** The **ATT** is the effect of the policy *on the units that actually received it*. The **ATE** is the effect averaged over *every unit in the population*, treated or not. These two quantities are equal only when treatment effects are constant across units.

**8. Root Mean Squared Prediction Error and its post/pre ratio.** The Root Mean Squared Prediction Error (RMSPE) is the typical size of the gap between observed and predicted outcomes in a given period. Synthetic Control reports it separately for the pre-period (fit quality) and post-period (effect size). The **post/pre ratio** (the MSPE ratio) is a unitless measure of how much the post-period gap exceeds the pre-period gap.

**9. Placebo Fisher exact p-value.** A non-parametric p-value built by ranking the treated unit against a distribution of placebo "treatments" --- each computed by pretending an untreated unit had been treated instead. The p-value is the treated unit's rank divided by the total number of units.

**10. Bayesian Structural Time Series (BSTS).** A Bayesian model that decomposes a time series into interpretable additive components: a local-level trend, a regression on external predictors, and a noise term. After fitting on the pre-period, the model is projected forward and the posterior distribution over observed-minus-predicted gives the policy effect with a credible interval.

## 2. The potential outcomes framework: causal inference as a missing-data problem

Before diving into estimators, we need a vocabulary for what they are estimating. The cleanest one is the **potential outcomes framework**, due to Neyman (1923) and Rubin (1974). Its central insight is that *causal inference is a missing-data problem*. The data we wish we had is rarely the data we observe.

### 2.1 Two outcomes per unit, one observed

For each unit $i$ at time $t$, imagine two *potential* outcomes:

- $Y_{it}(1)$ --- cigarette sales in state $i$ at year $t$ **with** Proposition 99 in force.
- $Y_{it}(0)$ --- cigarette sales in state $i$ at year $t$ **without** Proposition 99 in force.

Let $D_{it} \in \{0, 1\}$ be the treatment indicator. Here $D_{it} = 1$ for California from 1989 onward, and $D_{it} = 0$ everywhere else (every other state, plus California up to and including 1988). The outcome we *observe* is one of the two potential outcomes, never both:

$$Y_{it} \,=\, D_{it}\, Y_{it}(1) \,+\, (1 - D_{it})\, Y_{it}(0).$$

In words: if California in 1995 was treated, we observe $Y_{1995}(1)$ --- *not* the counterfactual $Y_{1995}(0)$, which is what California's smoking *would have been* in 1995 had Proposition 99 never passed. That counterfactual is the missing data.

### 2.2 The fundamental problem of causal inference

Make the missing data concrete. The table below shows what is observed (✓), what is undefined because the state was never treated (---), and what is missing-and-must-be-imputed (**?**) for a handful of rows.

| State | Year | $D_{it}$ | $Y_{it}(0)$ | $Y_{it}(1)$ | Observed |
|---|---:|---:|---:|---:|---:|
| California | 1988 | 0 | 90.1 ✓ | ? | 90.1 |
| California | 1989 | 1 | **?** | 82.4 ✓ | 82.4 |
| California | 1995 | 1 | **?** | 64.4 ✓ | 64.4 |
| California | 2000 | 1 | **?** | 41.6 ✓ | 41.6 |
| Nevada | 1988 | 0 | 134.4 ✓ | --- | 134.4 |
| Nevada | 1995 | 0 | 113.0 ✓ | --- | 113.0 |
| Utah | 1988 | 0 | 64.7 ✓ | --- | 64.7 |
| Utah | 1995 | 0 | 55.0 ✓ | --- | 55.0 |

This is what Holland (1986) called the **fundamental problem of causal inference**: for any treated unit at any time, we observe at most one of the two potential outcomes, and the other one is missing. For California after 1989 the missing column is $Y_{it}(0)$ --- every "**?**" in the third column of the table. *Every method in this tutorial is a way to fill in those question marks.*

### 2.3 Three estimands: ITE, ATE, ATT

With both potential outcomes defined, the natural causal contrasts follow.

**Individual treatment effect (ITE)** for unit $i$ at time $t$:

$$\tau_{it} = Y_{it}(1) - Y_{it}(0).$$

In words: how much $i$'s outcome at $t$ would shift *because of* the policy. The ITE is the gold standard, but it is *never* directly observable for any single $(i, t)$. We only ever see one of the two potential outcomes.

**Average treatment effect (ATE)** over a population:

$$\text{ATE} = \mathbb{E}\big[Y_{it}(1) - Y_{it}(0)\big].$$

In words: how much smoking *would have changed in the average state-year* if all states had been treated. The ATE is identified by a randomised experiment --- but Proposition 99 was not randomised, and most states would never adopt it. The ATE is *not* what we are after here.

**Average treatment effect on the treated (ATT)**, restricted to units that actually received the treatment:

$$\text{ATT} = \mathbb{E}\big[Y_{it}(1) - Y_{it}(0) \,\big|\, D_{it} = 1\big].$$

In words: how much smoking changed *in California, in the post-1989 years* because of Proposition 99. This is what every causal method in this tutorial targets. It is the right question to ask when only one unit is treated and the policy is essentially a one-shot event.

For Proposition 99, the ATT averaged over 1989--2000 expands to:

$$\text{ATT}_{\text{CA, post}} = \frac{1}{T_{\text{post}}} \sum_{t > t^*} \Big[Y_{1t}(1) - Y_{1t}(0)\Big],$$

where unit $i = 1$ denotes California, $t^* = 1988$ is the last pre-period year, and $T_{\text{post}} = 12$ is the number of post-period years (1989--2000). The first term inside the brackets --- $Y_{1t}(1)$ --- is *observed*. The second term --- $Y_{1t}(0)$ --- is *missing*. The whole tutorial is a tour of methods that estimate the missing term using different data sources.

### 2.4 Each method is a way to impute the missing $Y(0)$

The shared logic Mermaid diagram in §1 already showed this graphically. Here is the same idea in equation form, one row per method.

| Method | Estimator of the missing $Y_{1t}(0)$ |
|---|---|
| Naive pre-post | $\widehat{Y_{1t}(0)} = \overline{Y}_{1, \text{pre}}$ --- California's own pre-period mean. |
| Difference-in-Differences | $\widehat{Y_{1t}(0)} = \overline{Y}_{1, \text{pre}} + \big(\overline{Y}_{0, \text{post}} - \overline{Y}_{0, \text{pre}}\big)$ --- add Nevada's pre-to-post change. |
| Interrupted Time Series (growth) | $\widehat{Y_{1t}(0)} = \hat\alpha + \hat\beta\, t$ --- extrapolate California's own pre-period linear fit. |
| Interrupted Time Series (ARIMA) | $\widehat{Y_{1t}(0)} = $ forecast from an ARIMA model fitted on 1970--1988. |
| Regression Discontinuity on time | $\widehat{Y_{1t}(0)} = \hat\alpha + \hat\beta\, (t - t^*)$ --- extrapolate California's pre-period piecewise fit. |
| Synthetic Control | $\widehat{Y_{1t}(0)} = \sum_{i \in \text{donors}} w_i^* Y_{it}$ --- weighted blend of donor states. |
| CausalImpact | $\widehat{Y_{1t}(0)} = \mu_t + \beta^\top x_t$ --- Bayesian structural time-series fit on donor data. |

Read this table as the syllabus for the rest of the tutorial. Each subsequent section takes one row and shows the R code that builds the corresponding $\widehat{Y_{1t}(0)}$, then subtracts it from the observed California series to recover the ATT.

The two big design choices that distinguish good causal methods from bad ones are visible right here: **(a)** *what data does the method use to build $\widehat{Y_{1t}(0)}$* (California's own past, one neighbouring state, a weighted blend of many states, or a Bayesian model on the whole donor pool), and **(b)** *what identifying assumption does that data source require* (no national trend, parallel trends with the neighbour, correctly-specified pre-trend, or convexity-of-donor-combinations). The cross-method comparison at the end of the tutorial is fundamentally a comparison of how robust each $\widehat{Y_{1t}(0)}$ construction is to that source's identifying assumption being violated.

## 3. Setup and packages

We use `pak::pkg_install()` to install the **exact** versions of every package that was used when this post was published. Pinning versions is the project's reproducibility hook: a reader who renders this notebook on a fresh machine months from now should get the same numbers, even if upstream packages have changed default behaviour or deprecated functions. A global `set.seed(42)` then fixes the random-forest imputation and the CausalImpact MCMC sampler so MCMC draws are bit-reproducible too.

```{r}
#| label: setup-packages

# pak is the only bootstrap dependency. Everything else gets installed
# at a specific version, so the notebook is reproducible on a fresh
# machine.
if (!requireNamespace("pak", quietly = TRUE)) {
  install.packages("pak", repos = "https://cloud.r-project.org")
}

# Install the EXACT versions used when this post was published.
# On a warm machine where the right versions are already installed,
# pak::pkg_install short-circuits and this call returns in under a
# second.
pak::pkg_install(c(
  "tidyverse@2.0.0",      # data manipulation + ggplot2
  "sandwich@3.1.1",       # HAC variance estimator
  "lmtest@0.9.40",        # coeftest
  "tidysynth@0.2.1",      # synthetic control (tidy API)
  "fpp3@1.0.3",           # forecasting (tsibble, fable, ARIMA)
  "mice@3.17.0",          # multiple imputation
  "ranger@0.17.0",        # backend for mice method = "rf"
  "CausalImpact@1.3.0",   # Bayesian structural time series
  "broom@1.0.8",          # tidy model output
  "glue@1.8.0",           # string interpolation
  "forcats@1.0.0"         # factor reordering (used in §10.2 V-matrix chart)
))

# Attach the packages we use directly.
library(tidyverse); library(sandwich); library(lmtest)
library(tidysynth); library(fpp3);     library(mice)
library(ranger);    library(CausalImpact)
library(broom);     library(glue);     library(forcats)

# Fix the global random seed so every run reproduces the same MICE
# imputation and the same CausalImpact MCMC sample.
set.seed(42)
```

## 4. Data: download and inspect

We download a pre-prepared `proposition99.rds` straight from this project's GitHub repository so the document is self-contained on a fresh machine. The download is cached locally so re-renders skip the network fetch.

```{r}
#| label: data-download

# Pull the prepared dataset from this project's GitHub raw URL.
# The cache file is written to the current working directory.
DATA_URL  <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_causalpolicy_workshop/proposition99.rds"
CACHE_RDS <- "proposition99.rds"

if (!file.exists(CACHE_RDS)) {
  # First run: pull the file from GitHub and write to disk.
  download.file(DATA_URL, destfile = CACHE_RDS, mode = "wb")
}

# Load the RDS file and coerce to a tibble for nicer printing.
prop99 <- read_rds(CACHE_RDS) |> as_tibble()

cat("Rows:", nrow(prop99), " Cols:", ncol(prop99), "\n")
cat("Columns:", paste(names(prop99), collapse = ", "), "\n")
cat("States:", length(unique(prop99$state)),
    " Years:", min(prop99$year), "-", max(prop99$year), "\n\n")
head(prop99)
```

The panel is 39 states $\times$ 31 years for 1,209 observations in total. The treated unit is California, the intervention year is January 1989 (so the last full pre-period year is 1988), and the outcome is `cigsale` --- per-capita cigarette pack sales. Of the four covariates, `cigsale` and `retprice` (the retail price) are fully observed, while `lnincome` is missing 195 rows (16.1%), `age15to24` is missing 390 (32.3%), and `beer` is missing 663 (54.8%). The covariate gaps matter for CausalImpact (§11), where we will fill them with random-forest imputation; the other five methods either ignore covariates entirely or do not need them.

A quick descriptive comparison of California's pre vs post means confirms the puzzle that motivates the rest of the tutorial.

```{r}
#| label: data-california

# Subset to California and add a Pre/Post factor based on the policy date.
prop99_cali <- prop99 |>
  filter(state == "California") |>
  mutate(prepost = factor(year > 1988, labels = c("Pre", "Post")))

# Group-level descriptives: count, mean, and SD of cigsale per period.
prop99_cali |>
  group_by(prepost) |>
  summarize(n            = n(),
            mean_cigsale = mean(cigsale),
            sd_cigsale   = sd(cigsale),
            .groups = "drop")
```

California's average per-capita cigarette sales fell from 116.0 packs (1970--1988) to 60.4 packs (1989--2000) --- a within-state drop of 55.6 packs, or 47.9% of the pre-period mean. That is the *raw* before/after change. The rest of the tutorial is about how much of that 55.6-pack drop we can credibly attribute to Proposition 99 rather than to the broader American secular decline in smoking.

> **About Proposition 99.** California voters passed Proposition 99 --- formally the Tobacco Tax and Health Protection Act --- on the November 1988 ballot with 58% of the vote. It raised the state cigarette tax by 25 cents per pack starting January 1, 1989, and earmarked the revenue for anti-smoking education, health services, and tobacco-related research. The initiative was championed by the American Cancer Society, the American Lung Association, and the American Medical Association, against opposition financed by the tobacco industry. The case became the canonical Synthetic Control application after Abadie, Diamond, and Hainmueller (2010) used it to introduce the method --- which is why we revisit it here with six estimators rather than just one.

Before doing any modelling, it helps to see all 39 series at once.

```{r}
#| label: fig-raw-series
#| fig-cap: "Per-capita cigarette sales for all 39 states 1970–2000, with California highlighted in orange and a dashed vertical line at the 1989 policy threshold."

# Flag each row as either the treated unit or one of the 38 donor states.
eda_data <- prop99 |>
  mutate(unit_type = if_else(state == "California",
                             "California (treated)", "Donor state"))

# One line per state-year, with California highlighted in warm orange and
# a dashed vertical line at the 1989 policy threshold.
ggplot(eda_data, aes(x = year, y = cigsale, group = state,
                     color = unit_type, linewidth = unit_type,
                     alpha = unit_type)) +
  geom_line() +
  geom_vline(xintercept = 1988.5, color = "#d97757",
             linetype = "dashed", linewidth = 0.7) +
  scale_color_manual(values = c("California (treated)" = "#d97757",
                                "Donor state" = "#6a9bcc")) +
  scale_linewidth_manual(values = c("California (treated)" = 1.2,
                                    "Donor state" = 0.45)) +
  scale_alpha_manual(values = c("California (treated)" = 1.0,
                                "Donor state" = 0.45)) +
  labs(title = "Per-capita cigarette sales, 1970-2000",
       subtitle = "California in orange; 38 donor states in blue",
       x = "Year", y = "Cigarette sales (packs per capita)",
       color = NULL) +
  guides(linewidth = "none", alpha = "none")
```

California (orange) sits inside the donor cloud throughout the 1970s and 1980s, then visibly separates downward after the dashed Proposition 99 line. The pre-1988 trajectory is already slightly below the donor median, but it is not anomalous; the sharp post-1988 separation is. Visually, this is the signal every causal estimator is trying to quantify.

## 5. Method 1 --- Naive pre-post comparison

**The idea.** Compare California's mean cigarette sales before 1989 with its mean after 1989. Call the difference the "effect".

**R tooling.** Base R `lm()` for the OLS; [`sandwich::vcovHAC()`](https://cran.r-project.org/package=sandwich) for the heteroskedasticity-and-autocorrelation-consistent variance estimator; [`lmtest::coeftest()`](https://cran.r-project.org/package=lmtest) to retest the coefficients with that variance matrix.

**The equation.**

$$\hat\tau_{\text{naive}} = \overline{Y}_{1, \text{post}} - \overline{Y}_{1, \text{pre}}.$$

In words: the naive estimate is the difference between California's observed post-period mean and California's observed pre-period mean. Mapping to the potential-outcomes framework from §2, this corresponds to imputing $\widehat{Y_{1t}(0)} = \overline{Y}_{1, \text{pre}}$ --- i.e., assuming California's counterfactual smoking would have been frozen at the pre-period average.

**Why this is wrong but still useful.** The implicit counterfactual "California's pre-period level continues unchanged" is almost certainly wrong, because smoking was declining nationwide. But the estimate is so cheap to compute that it makes a useful baseline. The five later methods will each try to fix what is broken here.

We follow the workshop's narrow 1984--1993 window for direct comparability with the rest of the workshop. Using a longer window (e.g., 1970--2000) would change the numbers but not the qualitative point.

```{r}
#| label: fit-naive

# OLS of California's cigsale on a Pre/Post dummy, restricted to the
# workshop's 1984-1993 window.
fit_prepost <- lm(cigsale ~ prepost,
                  data = prop99_cali |> filter(year > 1983, year < 1994))

# Replace the default OLS standard errors with HAC (heteroskedasticity-
# and-autocorrelation-consistent) errors, which short time series need.
coeftest(fit_prepost, vcov. = vcovHAC)
```

**Reading the output.** California's mean over 1984--1988 was 98.98 packs/capita. The `prepostPost` coefficient says the 1989--1993 mean is about 27 packs *lower*. The HAC robust standard error is roughly 5.3 ($p < 0.001$). The HAC correction comes from `sandwich::vcovHAC` and accounts for the heteroskedasticity and autocorrelation that short time series typically exhibit. A classical OLS standard error would be wildly overconfident here.

**The estimand here is purely descriptive.** This is a within-state difference of means, *not* a causal estimate. Any nationwide secular decline in smoking gets silently bundled into the coefficient. That bundling is exactly what the next five methods try to undo.

**Common pitfall.** Confusing the within-state pre-post difference with a causal effect. Anything that shifted the entire country between the two windows --- anti-smoking campaigns, federal tobacco settlements, rising health awareness --- gets attributed entirely to Proposition 99.

**Recap.** Naive pre-post says $-27.0$ packs, but it has no counterfactual at all --- only California's own past. Hold that number in mind; it will set the upper bound for what every other method estimates.

## 6. Method 2 --- Difference-in-Differences (California vs Nevada)

**The idea.** Pick one control state (Nevada). Compute its pre-to-post change. Subtract that from California's pre-to-post change. Whatever is left over is "what the policy did".

**R tooling.** Base R `lm()` with a `state * prepost` interaction; [`sandwich`](https://cran.r-project.org/package=sandwich) + [`lmtest`](https://cran.r-project.org/package=lmtest) for HAC-robust standard errors. For modern multi-period Difference-in-Differences with staggered adoption see the [`did`](https://cran.r-project.org/package=did) and [`fixest`](https://cran.r-project.org/package=fixest) packages, covered in the companion [r_did tutorial](https://carlos-mendez.org/post/r_did/).

**The identifying assumption.** California and Nevada would have moved on *parallel paths* without the policy. Differences in levels are fine; differences in trends are not. The estimand becomes a proper **Average Treatment effect on the Treated** (ATT) for California.

The formal DiD identity is

$$\hat{\tau}_{\text{DiD}} = \big(\bar{Y}_{\text{CA, post}} - \bar{Y}_{\text{CA, pre}}\big) - \big(\bar{Y}_{\text{NV, post}} - \bar{Y}_{\text{NV, pre}}\big).$$

In words: DiD takes California's change and subtracts Nevada's change. If both states would have evolved in parallel without the policy, the only thing that can drive a *difference* in their changes is the policy itself.

The four ingredients of the DiD calculation are easier to see as a 2×2 grid. Each cell holds a group mean; the two within-state changes are the row differences; the DiD estimate is the difference *of* those differences.

```{mermaid}
flowchart TB
    subgraph "California"
        CA_pre["Pre (1984–88) mean<br/>= 99.0"] --> CA_d["Δ California =<br/>72.0 − 99.0 = −27.0"]
        CA_post["Post (1989–93) mean<br/>= 72.0"] --> CA_d
    end
    subgraph "Nevada (control)"
        NV_pre["Pre (1984–88) mean<br/>= 143.1"] --> NV_d["Δ Nevada =<br/>121.8 − 143.1 = −21.3"]
        NV_post["Post (1989–93) mean<br/>= 121.8"] --> NV_d
    end
    CA_d --> DD["DiD ATT =<br/>(−27.0) − (−21.3) = −5.7"]
    NV_d --> DD

    style CA_pre fill:#d97757,stroke:#141413,color:#fff
    style CA_post fill:#d97757,stroke:#141413,color:#fff
    style NV_pre fill:#6a9bcc,stroke:#141413,color:#fff
    style NV_post fill:#6a9bcc,stroke:#141413,color:#fff
    style DD fill:#00d4c8,stroke:#141413,color:#141413
```

The arithmetic is literally what the regression below computes. In `cigsale ~ state * prepost`, the interaction coefficient `stateCalifornia:prepostPost` *is* the DiD estimate.

```{r}
#| label: fit-did

# Keep only California and Nevada in the 1984-1993 window and add the
# Pre/Post factor. Make Nevada the reference level so that the
# stateCalifornia interaction lands on California's extra change.
prop99_did <- prop99 |>
  filter(state %in% c("California", "Nevada"),
         year > 1983, year < 1994) |>
  mutate(prepost = factor(year > 1988, labels = c("Pre", "Post")),
         state   = factor(state, levels = c("Nevada", "California")))

# Two-way interacted regression: state main effect + Pre/Post main effect
# + (state x Pre/Post) interaction. The interaction is the DiD estimate.
fit_did <- lm(cigsale ~ state * prepost, data = prop99_did)

# HAC-robust standard errors for the four coefficients.
coeftest(fit_did, vcov. = vcovHAC)
```

**Reading the output.** The interaction coefficient `stateCalifornia:prepostPost` is $-5.68$ packs (HAC SE 5.39, $p = 0.31$). That is *dramatically* smaller than the naive $-27.02$, and statistically indistinguishable from zero. Why? Because the `prepostPost` main effect is also large: $-21.34$ packs. Nevada's own cigarette sales fell by 21.3 packs between 1984--1988 and 1989--1993. When DiD subtracts that Nevada change from California's change, almost all of California's drop is absorbed.

The picture below makes the problem obvious.

```{r}
#| label: fig-did-trends
#| fig-cap: "California vs Nevada raw series 1970–2000, both showing downward post-1988 trajectories. Nevada's parallel decline absorbs most of California's drop in the DiD subtraction."

# Plot California and Nevada side-by-side on the full 1970-2000 window
# to visualise the pre-trend the DiD specification assumes.
ggplot(prop99 |> filter(state %in% c("California", "Nevada")),
       aes(x = year, y = cigsale, color = state, linewidth = state)) +
  geom_line() +
  geom_vline(xintercept = 1988.5, color = "#d97757",
             linetype = "dashed", linewidth = 0.7) +
  scale_color_manual(values = c("California" = "#d97757",
                                "Nevada"     = "#00d4c8")) +
  scale_linewidth_manual(values = c("California" = 1.2, "Nevada" = 1.0)) +
  labs(title = "DiD inputs: California vs Nevada",
       subtitle = "Pre-trend visual check before fitting DiD",
       x = "Year", y = "Cigarette sales (packs per capita)",
       color = NULL) +
  guides(linewidth = "none")
```

This is the textbook DiD pitfall. A single control unit that itself is shifting in the same direction makes the contrast collapse. Nevada is geographically and culturally adjacent to California. It inherits many of the same secular forces: rising health awareness, federal tobacco settlements, retail-price spillovers. So it is a poor "what would California have done?" control.

**Common pitfall.** Picking the *one* "most similar" control by hand. If your single control is subject to the same secular forces as the treated unit --- geographic neighbours, policy spillovers, regional macro shocks --- the contrast collapses and DiD silently reports zero.

**Recap.** DiD vs Nevada says $-5.7$ packs and we cannot reject zero. The lesson is *not* that DiD is broken --- it is that DiD with a single similar control unit is fragile. Synthetic Control in §10 is the principled response: instead of one control state, blend many states into a weighted "synthetic California".

## 7. Method 3a --- Interrupted Time Series via pre-period growth curve

**The idea.** Stop borrowing from a comparison unit. Instead, build the counterfactual from California's *own* pre-period dynamics. Fit a model on 1970--1988, extrapolate it into 1989--2000, and call the gap between the extrapolation and the observed data the effect.

**R tooling.** Base R `lm()` for the pre-period linear fit; `predict()` for the post-period extrapolation. No specialised package needed for the simplest growth-curve variant. The [`tsibble`](https://cran.r-project.org/package=tsibble) class (loaded via [`fpp3`](https://fpp3.otexts.com/)) provides a tidy panel index that the next ITS variant relies on.

**The equation.** Fit a linear trend on the pre-period only:

$$Y_{1t} = \alpha + \beta\, t + \varepsilon_t, \qquad t \le t^* = 1988.$$

Then *extrapolate* the fitted line into the post-period as the counterfactual:

$$\widehat{Y_{1t}(0)} = \hat\alpha + \hat\beta\, t, \qquad t > t^*.$$

Finally, average the gap between observed and counterfactual over the post-period:

$$\widehat{\text{ATT}}_{\text{ITS-growth}} = \frac{1}{T_{\text{post}}} \sum_{t > t^*} \Big[Y_{1t} - (\hat\alpha + \hat\beta\, t)\Big].$$

In words: a single straight line, fit on 1970--1988 cigarette sales in California, becomes the counterfactual for 1989--2000. The policy effect is the *average* of the per-year residuals between what was actually observed and what the extrapolated line predicted. The slope $\hat\beta$ captures whatever secular trend California was already on; only deviations *from* that trend after 1989 are attributed to Proposition 99.

**Why it differs from naive pre-post.** Naive pre-post assumes "no change". ITS allows a non-zero pre-trend. If California was already declining, the ITS counterfactual continues that decline; only the *extra* drop after 1989 gets attributed to the policy.

```{r}
#| label: fit-its-growth

# California-only time series with a Pre/Post factor and a centred year
# index (year0 = 0 at the first post-period year). The tsibble class is
# required by the fpp3 forecasting tools used later.
prop99_ts <- prop99 |>
  filter(state == "California") |>
  select(year, cigsale) |>
  mutate(prepost = factor(year > 1988, labels = c("Pre", "Post"))) |>
  as_tsibble(index = year) |>
  mutate(year0 = year - 1989)

# Fit a linear pre-period trend (cigsale on year, 1970-1988 only).
fit_growth <- lm(cigsale ~ year, data = prop99_ts |> filter(prepost == "Pre"))

# Print the intercept and slope with their standard errors.
summary(fit_growth)$coefficients
```

The pre-period (1970--1988) linear trend is $-1.78$ packs/year ($p < 10^{-5}$, $R^2 = 0.735$) --- so smoking was already declining about 1.8 packs per capita per year in California well before Proposition 99. To estimate the policy effect we extrapolate that line forward to 2000 and average the gap between observed and predicted:

```{r}
#| label: its-growth-gap

# Subset to the 1989-2000 rows and extrapolate the fitted line forward.
post_df <- prop99_ts |> filter(prepost == "Post")
pred_growth <- predict(fit_growth, newdata = as_tibble(post_df))

# ATT estimate = average per-year gap between observed and extrapolation.
its_growth_estimate <- mean(post_df$cigsale - pred_growth)
its_growth_estimate
```

**Reading the output.** The ITS-growth-curve estimate is $-28.28$ packs/capita. That is essentially identical to the naive pre-post $-27.02$. Why? Because both methods only use within-California information. Neither borrows from a comparison unit. So neither can separate "California-specific effect" from "national secular decline".

The coincidence is suggestive but not reassuring. Both methods can be biased the same way if California's pre-trend was *understating* the speed of the secular decline.

**Common pitfall.** Assuming the linear pre-trend is the right *shape*. If the true secular decline is accelerating or saturating, a linear extrapolation either understates or overstates what would have happened --- and the policy effect inherits the bias.

**Recap.** ITS-growth says $-28.3$ packs. Adding a linear pre-trend changed almost nothing relative to the naive baseline, because the trend was modest. The next ITS variant uses a more flexible time-series model --- and we will see why "more flexible" can backfire.

## 8. Method 3b --- Interrupted Time Series via auto-selected ARIMA forecast

**The idea.** Replace the straight line with a flexible time-series model. Let the data decide the model's complexity through an information criterion (AICc). Forecast forward as the counterfactual.

**R tooling.** The [`fpp3`](https://fpp3.otexts.com/) meta-package (Hyndman & Athanasopoulos 2021) loads [`fable::ARIMA()`](https://fable.tidyverts.org/reference/ARIMA.html) for the model fit, `forecast()` for the post-period projection, and `report()` for the diagnostic printout. Companion textbook *Forecasting: Principles and Practice* is the canonical reference.

**The equation.** A general ARIMA$(p, d, q)$ model writes the $d$-th differenced series as an autoregressive-moving-average process. Using the lag operator $L$ (so $L\, Y_t = Y_{t-1}$):

$$\Phi(L)\, (1 - L)^d\, Y_{1t} \, = \, \Theta(L)\, \varepsilon_t, \qquad \varepsilon_t \sim \mathcal{N}(0, \sigma^2),$$

where $\Phi(L) = 1 - \phi_1 L - \cdots - \phi_p L^p$ collects the $p$ autoregressive coefficients and $\Theta(L) = 1 + \theta_1 L + \cdots + \theta_q L^q$ collects the $q$ moving-average coefficients. The `fpp3::ARIMA(..., ic = "aicc")` call searches over $(p, d, q)$ and picks the combination that minimises the corrected Akaike Information Criterion on the pre-period. For California's 1970--1988 series, AICc picks $(p, d, q) = (1, 2, 0)$: one autoregressive lag and *two* rounds of differencing (which is what bends the counterfactual so aggressively in the figure below).

Once the model is fit, the post-period counterfactual is the model's $h$-step forecast and the ATT is the average gap, just as in §7:

$$\widehat{Y_{1t}(0)} = \hat Y_{1t \mid t^*}, \qquad \widehat{\text{ATT}}_{\text{ITS-ARIMA}} = \frac{1}{T_{\text{post}}} \sum_{t > t^*} \Big[Y_{1t} - \hat Y_{1t \mid t^*}\Big].$$

In words: same recipe as the growth-curve version --- fit on pre-period, project forward, average the gap --- but the "fit on pre-period" step now uses an autoregressive-integrated-moving-average model instead of a straight line. The values $p$, $d$, $q$ control how flexible the model is allowed to be.

**What ARIMA(p, d, q) means in plain English.** `p` is the number of past values the model uses (autoregression). `d` is the number of times the series is differenced before fitting (to handle trends). `q` is the number of past forecast errors used (moving average). Lower AICc = "better fit traded off against complexity".

```{r}
#| label: fit-its-arima

# Fit an ARIMA model on the 1970-1988 California series. ic = "aicc"
# tells fable to search over (p, d, q) and pick the AICc minimiser.
fit_arima <- prop99_ts |>
  filter(prepost == "Pre") |>
  model(timeseries = ARIMA(cigsale, ic = "aicc"))

# Print the chosen orders, coefficients, and information criteria.
report(fit_arima)
```

`ARIMA(1, 2, 0)` was selected: one autoregressive lag and *two* rounds of differencing. The double-differencing means the model is tracking the *acceleration* of California's late-1980s drop, not just its level or slope. We then forecast 12 years out and average the gap.

```{r}
#| label: its-arima-gap

# Project the fitted ARIMA 12 years forward as the post-period counterfactual.
fcasts <- forecast(fit_arima, h = "12 years")

# ATT estimate = average per-year gap between observed and ARIMA forecast.
ce_arima <- post_df$cigsale - fcasts$.mean
mean(ce_arima)
```

**Reading the output.** The ARIMA-based ITS estimate is $+4.55$ packs. That is *positive* --- it would imply Proposition 99 *increased* California's smoking. That is plainly the wrong answer. The visual diagnostic shows why:

```{r}
#| label: fig-its-arima
#| fig-cap: "ITS counterfactual from ARIMA(1,2,0) model with 95% forecast band. The blue dashed line bends below the observed orange series, which is why the gap flips sign."

# Build a long table for the ITS-ARIMA figure: observed pre, observed
# post, and the ARIMA point forecast plus its 95% interval.
fc_int  <- hilo(fcasts$cigsale, 95)
its_plot_df <- bind_rows(
  prop99_ts |> filter(prepost == "Pre") |>
    as_tibble() |>
    transmute(year, value = cigsale, series = "Observed (pre)",
              lower = NA_real_, upper = NA_real_),
  tibble(year   = post_df$year,
         value  = post_df$cigsale,
         series = "Observed (post)",
         lower  = NA_real_, upper = NA_real_),
  tibble(year   = post_df$year,
         value  = fcasts$.mean,
         series = "ARIMA counterfactual",
         lower  = fc_int$lower,
         upper  = fc_int$upper)
)

ggplot(its_plot_df, aes(x = year, y = value, color = series)) +
  geom_ribbon(data = its_plot_df |> filter(series == "ARIMA counterfactual"),
              aes(ymin = lower, ymax = upper),
              fill = "#6a9bcc", alpha = 0.2, color = NA) +
  geom_line(linewidth = 1.0) +
  geom_vline(xintercept = 1988.5, color = "#d97757",
             linetype = "dashed", linewidth = 0.7) +
  scale_color_manual(values = c("Observed (pre)"      = "grey30",
                                "Observed (post)"     = "#d97757",
                                "ARIMA counterfactual" = "#6a9bcc")) +
  labs(title = "ITS via ARIMA: observed vs pre-period counterfactual",
       subtitle = "California cigarette sales, 1970-2000",
       x = "Year", y = "Cigarette sales (packs per capita)",
       color = NULL)
```

The dashed blue line is the ARIMA counterfactual. It sits *below* the observed orange series throughout the post period. The model extrapolates the late-1980s downward acceleration too aggressively. It predicts California should have hit roughly 50 packs by 2000 if the pre-period momentum had continued. Since California actually only hit 60 packs, the model concludes Proposition 99 "raised" smoking by about 5 packs relative to that doomsday counterfactual.

**The pitfall in one sentence.** AICc minimises *in-sample* fit, but in-sample fit can come from features (here, second-order momentum) that do not persist *out-of-sample*.

**Common pitfall.** Trusting an information-criterion-selected model on a short pre-period. AICc rewards in-sample fit. With 19 pre-period observations, it can latch onto late-pre-period momentum that does not persist out-of-sample, producing a counterfactual that bends through (or past) the observed post-period values.

**Recap.** ITS-ARIMA says $+4.55$ packs and is the headline-grabbing outlier. The lesson is not "ARIMA is bad" --- it is that **single-model ITS is fragile**. Always pair an ITS estimate against a comparison-unit method (Synthetic Control, CausalImpact, or a credibly-matched DiD) before drawing conclusions.

## 9. Method 4 --- Regression Discontinuity on time (segmented regression)

**The idea.** Use *calendar time* as the running variable. Fit a piecewise linear regression that allows two breaks at 1989: a level jump and a slope change. The level jump is the immediate "policy shock"; the slope change is how the trajectory bends afterwards.

**R tooling.** Base R `lm()` with a piecewise specification; [`sandwich`](https://cran.r-project.org/package=sandwich) + [`lmtest`](https://cran.r-project.org/package=lmtest) for HAC-robust standard errors. For classical sharp RDD on a continuous running variable (e.g. test scores, income), use [`rdrobust`](https://cran.r-project.org/package=rdrobust) (Calonico, Cattaneo & Titiunik) instead.

**The equation.** Re-centre time at the threshold by defining $\tilde t = t - t^* - 1$ (so $\tilde t = 0$ for the first post-period year). Then fit the segmented regression

$$Y_{1t} = \beta_0 + \beta_1\, \tilde t + \beta_2\, \mathbf{1}[\tilde t \ge 0] + \beta_3\, \tilde t \cdot \mathbf{1}[\tilde t \ge 0] + \varepsilon_t.$$

Each coefficient has a precise causal reading.

- $\beta_0$ is the pre-period intercept at $\tilde t = 0$ (California's fitted level just before the threshold).
- $\beta_1$ is the *pre-period* slope.
- $\beta_2$ is the **level break** at the threshold --- the immediate jump in cigarette sales at $\tilde t = 0$. This is the headline RDD effect.
- $\beta_3$ is the **change in slope** after the threshold (the post-period slope is $\beta_1 + \beta_3$).

The piecewise counterfactual continues the pre-period line ($\beta_0 + \beta_1\, \tilde t$) into the post-period, so the total deviation of observed from counterfactual at time $\tilde t > 0$ is $\beta_2 + \beta_3\, \tilde t$. In other words, the policy shifts the series *down* by $\beta_2$ packs immediately, then the gap widens (or narrows) at rate $\beta_3$ per year.

In words: the calendar year is recentred so that the threshold sits at zero, then we fit two straight lines --- one for $\tilde t < 0$, one for $\tilde t \ge 0$ --- that are allowed to differ in both intercept and slope. The intercept gap $\beta_2$ is the "discontinuity" you can see at the dashed orange line in the figure below.

**A naming heads-up.** The workshop labels this specification "RDD". It is RDD with time as the running variable, not the classical sharp RDD you would use for a means-tested benefit at an income cutoff. With time as the running variable, the math reduces to *segmented regression*.

```{r}
#| label: fit-rdd

# Piecewise OLS: pre-period slope, level break at threshold, slope change
# after threshold. year0 = year - 1989 (already constructed earlier).
fit_rdd <- lm(cigsale ~ year0 + prepost + year0:prepost,
              data = as_tibble(prop99_ts))

# HAC-robust standard errors for the four coefficients.
coeftest(fit_rdd, vcov. = vcovHAC)
```

**Reading the output.** Three coefficients matter.

1. **Pre-period slope `year0` = $-1.78$ packs/year.** Matches the ITS-growth fit; sanity check passed.
2. **Level break `prepostPost` = $-20.06$ packs** (HAC SE 5.59, $p = 0.001$). California's sales drop by about 20 packs *immediately* at the 1989 threshold.
3. **Slope change `year0:prepostPost` = $-1.49$ packs/year** (HAC SE 0.40, $p < 0.001$). The post-period decline accelerates by an extra 1.5 packs/year *on top of* the pre-period 1.8 packs/year.

Combining the level break and the slope change, by 2000 (11 years after the threshold) the cumulative deviation from the extrapolated pre-trend is roughly $-20 - 11 \times 1.49 \approx -36$ packs. The piecewise fit is excellent ($R^2 = 0.973$):

```{r}
#| label: fig-rdd
#| fig-cap: "RDD on time: piecewise pre/post linear fit with level and slope breaks at 1989. The blue pre-line and orange post-line are visibly disconnected at the threshold."

# Predict from the fitted RDD model on every California observation,
# then plot the pre- and post-period segments separately.
rdd_estimate <- as.numeric(coef(fit_rdd)["prepostPost"])

rdd_pred <- as_tibble(prop99_ts) |>
  mutate(fit = predict(fit_rdd, newdata = as_tibble(prop99_ts)))

ggplot() +
  geom_point(data = as_tibble(prop99_ts),
             aes(x = year, y = cigsale),
             color = "grey30", size = 1.8, alpha = 0.85) +
  geom_line(data = rdd_pred |> filter(prepost == "Pre"),
            aes(x = year, y = fit),
            color = "#6a9bcc", linewidth = 1.1) +
  geom_line(data = rdd_pred |> filter(prepost == "Post"),
            aes(x = year, y = fit),
            color = "#d97757", linewidth = 1.1) +
  geom_vline(xintercept = 1988.5, color = "#d97757",
             linetype = "dashed", linewidth = 0.7) +
  annotate("text", x = 1989.7, y = 130,
           label = sprintf("Level jump\n%.1f packs", rdd_estimate),
           hjust = 0, vjust = 0,
           color = "#d97757", fontface = "bold", size = 3.5) +
  labs(title = "RDD on time (segmented regression)",
       subtitle = "Pre-period trend in blue, post-period trend in orange",
       x = "Year", y = "Cigarette sales (packs per capita)")
```

The blue pre-1988 line and the orange post-1989 line both fit California's points almost perfectly, with a clear discontinuity at the threshold.

**Caveat.** RDD on time inherits the same pre-trend mis-specification risk as ITS. If California's *underlying* trajectory was already changing curvature in the late 1980s for non-policy reasons --- say, the 1988 Surgeon General's report on nicotine addiction --- the level break attributed to Proposition 99 will absorb that change too.

**Common pitfall.** Mistaking a coincident shock at the threshold for the policy effect. With time as the running variable, *any* event that happens to land in the same year as the policy --- a related federal regulation, a recession, a media campaign --- is absorbed into the level break.

**Recap.** Regression Discontinuity on time reports a $-20.1$ pack level break with a tight standard error. It is the first of three methods to land in the credible $-13$ to $-20$ "consensus" range, alongside Synthetic Control and CausalImpact.

## 10. Method 5 --- Synthetic Control

**The idea.** Stop using one control state. Instead, build a *weighted combination* of donor states that matches California's pre-period as closely as possible on a set of predictors. The weighted combination is "synthetic California". The gap between observed California and synthetic California is the estimated effect.

**R tooling.** [`tidysynth`](https://cran.r-project.org/package=tidysynth) by Eric Dunford ([GitHub](https://github.com/edunford/tidysynth)) wraps the original Abadie--Diamond--Hainmueller optimisation in a tidyverse-friendly pipeline. The older [`Synth`](https://cran.r-project.org/package=Synth) package by Hainmueller is the historical reference. For Bayesian and spatial extensions see the [r_sc_bayes_spatial tutorial](https://carlos-mendez.org/post/r_sc_bayes_spatial/).

**The equation, in plain English first.** The optimisation picks a *recipe* --- a convex combination of donor states --- that mimics California's pre-1988 trajectory on a chosen set of predictors. That recipe is then frozen and used to project the no-policy counterfactual into the post-period. Two simple constraints make the result interpretable: every donor weight is non-negative, and the weights sum to 1. So the synthetic series cannot extrapolate outside the range of the donors --- it is always a "blend", never an "extension".

Formally, let $X_1$ be the vector of $k$ pre-period predictors for the treated unit (California), and let $X_0$ be the $k \times J$ matrix holding the same predictors for the $J = 38$ donor states. The Synthetic Control estimator chooses donor weights $w$ to minimise the (V-weighted) discrepancy between treated and synthetic on the predictors:

$$w^* \, = \, \arg\min_{w \in \mathcal{W}} \, \big(X_1 - X_0 w\big)^\top V \big(X_1 - X_0 w\big),$$

subject to

$$\mathcal{W} = \big\{w \in \mathbb{R}^J \,:\, w_j \ge 0 \,\, \forall j, \,\, \textstyle\sum_{j=1}^J w_j = 1\big\}.$$

In words: find the donor weights $w$ that make synthetic California's predictor profile as close as possible to real California's, where "close" is measured by the V-weighted quadratic distance. The diagonal matrix $V$ holds the *predictor* importance weights --- the optimiser can care more about pre-period cigarette sales than about, say, beer consumption (we will inspect $V$ in §10.2).

Once $w^*$ is solved, the synthetic California outcome at any year $t$ is

$$\widehat{Y_{1t}(0)} = \sum_{j=1}^J w_j^* \, Y_{jt},$$

and the average treatment effect on the treated over 1989--2000 is just the mean post-period gap between observed California and that synthetic counterfactual:

$$\widehat{\text{ATT}}_{\text{SCM}} = \frac{1}{T_{\text{post}}} \sum_{t > t^*} \Big[Y_{1t} - \sum_{j=1}^J w_j^* \, Y_{jt}\Big].$$

**Why it works where Difference-in-Differences failed.** Difference-in-Differences against Nevada needed parallel pre-trends with *one* neighbour. Synthetic Control needs parallel pre-trends with a *data-driven blend* of many neighbours. The optimisation does the matching, so the analyst no longer has to pick "the right" control state by hand.

**The pipeline.** The `tidysynth` package by Eric Dunford wraps the Abadie--Diamond--Hainmueller optimisation into a tidyverse-style pipeline with four explicit stages.

```{mermaid}
flowchart LR
    A["1. synthetic_control()<br/>declare treated unit<br/>and intervention time"] --> B["2. generate_predictor()<br/>define matching variables<br/>(one call per time window)"]
    B --> C["3. generate_weights()<br/>optimise donor weights<br/>(quadratic programming)"]
    C --> D["4. generate_control()<br/>build synthetic California<br/>and post-period gap series"]
    D --> E["5. plot_/grab_ helpers<br/>trends, weights,<br/>placebos, MSPE ratio,<br/>Fisher exact p-value"]

    style A fill:#6a9bcc,stroke:#141413,color:#fff
    style B fill:#6a9bcc,stroke:#141413,color:#fff
    style C fill:#6a9bcc,stroke:#141413,color:#fff
    style D fill:#d97757,stroke:#141413,color:#fff
    style E fill:#00d4c8,stroke:#141413,color:#141413
```

Stages 1--4 produce the estimate. Stage 5 is a battery of inspection helpers --- `plot_trends()`, `plot_differences()`, `plot_weights()`, `plot_placebos()`, `plot_mspe_ratio()`, `grab_unit_weights()`, `grab_predictor_weights()`, `grab_balance_table()`, `grab_significance()` --- that turn the fitted object into figures and tables for diagnostics and inference. We use all of them below.

### A roadmap for this section

Synthetic Control is the deepest method in this tutorial, so this section is the longest. To keep you oriented, here is what each subsection does:

| Subsection | What you will see | Why it matters |
|---|---|---|
| 10.1 | Build `prop99_syn` with the four-stage pipeline | Defines the treated unit, the donor pool, and the predictors |
| 10.2 | Donor weights (W) and predictor weights (V) | Tells you *which donors* and *which predictors* did the matching work |
| 10.3 | The point estimate and the trends plot | The headline ATT and the visual comparison observed vs synthetic |
| 10.4 | Predictor balance table | Confirms the matching worked --- California vs synthetic California vs donor average |
| 10.5 | `plot_differences()` | Isolates the year-by-year treatment-effect curve |
| 10.6 | Placebo permutation test | Inference: where does California fall vs every "what if a donor had been treated?" simulation |
| 10.7 | Mean Squared Prediction Error ratio and Fisher exact $p$-value | A sharper one-number inference statistic |
| 10.8 | Inspecting the nested tidysynth object | The whole optimisation is introspectable from R |

Read top to bottom for the full walk-through, or jump to a subsection if you only need one diagnostic.

### 10.1 Fit the synthetic-control pipeline

```{r}
#| label: sc-fit

prop99_syn <- prop99 |>
  # 1. Declare the panel structure: outcome, unit, time, treated unit
  #    ("California"), and the last full pre-period year (1988).
  #    generate_placebos = TRUE also fits the model treating each donor
  #    state as treated, for the permutation test in 10.6.
  synthetic_control(
    outcome  = cigsale, unit = state, time = year,
    i_unit   = "California", i_time = 1988,
    generate_placebos = TRUE
  ) |>
  # 2. Predictors averaged over the full pre-period (1980-1988).
  generate_predictor(
    time_window = 1980:1988,
    lnincome    = mean(lnincome, na.rm = TRUE),
    retprice    = mean(retprice, na.rm = TRUE),
    age15to24   = mean(age15to24, na.rm = TRUE)
  ) |>
  # 2b. beer is sparser, so use a narrower window where data is densest.
  generate_predictor(time_window = 1984:1988,
                     beer = mean(beer, na.rm = TRUE)) |>
  # 2c. Three "lagged outcomes" - cigsale at three pre-period dates.
  #     These pin synthetic California's pre-period trajectory.
  generate_predictor(time_window = 1975, cigsale_1975 = cigsale) |>
  generate_predictor(time_window = 1980, cigsale_1980 = cigsale) |>
  generate_predictor(time_window = 1988, cigsale_1988 = cigsale) |>
  # 3. Solve the constrained QP for donor weights w*. The three IPOP
  #    parameters are tuning knobs for the interior-point optimiser:
  #    margin_ipop (convergence margin), sigf_ipop (significant figures),
  #    bound_ipop (numerical bound). These values match the tidysynth
  #    README example; defaults are usually fine.
  generate_weights(optimization_window = 1970:1988,
                   margin_ipop = .02,
                   sigf_ipop   = 7,
                   bound_ipop  = 6) |>
  # 4. Compute the synthetic California series from w* and donor outcomes.
  generate_control()
```

**On the `*_ipop` parameters.** They tune the [IPOP](https://cran.r-project.org/package=kernlab) interior-point optimiser that `tidysynth` calls under the hood. Most users can leave them at defaults; we expose them here because the canonical tidysynth README example sets them explicitly, and we want this notebook to be bit-comparable to that reference.

**Predictor choices.** Seven predictors are passed in. Three are pre-period covariate averages over the full pre-period (`lnincome`, `retprice`, `age15to24` over 1980--1988). One uses a narrower window where data is densest (`beer` over 1984--1988). Three are *lagged outcomes* --- cigarette sales themselves at 1975, 1980, and 1988. The lagged outcomes are the most important trick: anchoring the synthetic control on the treated unit's own pre-period *outcome levels* at multiple time points forces the synthetic series to track California's pre-1988 trajectory closely.

### 10.2 The donor weights and the predictor weights

The optimisation produces two weight vectors that drive the entire fit. Both are extractable as tidy tables.

```{r}
#| label: sc-weights

# Donor states (W)
grab_unit_weights(prop99_syn) |> arrange(desc(weight)) |> head(8)

# Matching variables (V)
grab_predictor_weights(prop99_syn) |> arrange(desc(weight))
```

Two things to notice.

1. **Five states absorb essentially 100% of the donor weight.** Utah (≈ 34%), Nevada (≈ 24%), Montana (≈ 21%), Colorado (≈ 15%), Connecticut (≈ 6%). Every other state gets effectively zero. California is matched mostly to other Western/sunbelt states with similar age structure and cigarette price levels, plus Connecticut as a smoking-rate counterweight from the east.

2. **The two earliest cigsale levels dominate the V matrix.** `cigsale_1975` and `cigsale_1980` together get about 88% of the predictor weight. The four behavioural and demographic covariates get less than 9% combined. The optimiser has effectively decided: "the best way to predict California's cigarette sales is using *other states' cigarette sales*."

For a one-line visual of both weight vectors, tidysynth ships a `plot_weights()` helper. The same fitted object is passed in and we get a faceted ggplot showing donor unit weights on the left and predictor (V matrix) weights on the right.

```{r}
#| label: fig-sc-weights
#| fig-cap: "tidysynth plot_weights output: faceted bar chart with donor unit weights on the left and predictor V-matrix weights on the right."
#| fig-height: 6

plot_weights(prop99_syn)
```

The left facet recovers the five-state recipe from the table above; the right facet shows the heavy concentration on the two lagged-outcome predictors. Both panels are produced by the same one-line call to `plot_weights(prop99_syn)`.

#### A closer look at the V matrix

The combined `plot_weights()` view is convenient, but the V matrix deserves a stand-alone chart because it answers a different question than the donor weights. Donor weights say *which states* mimic California; the V matrix says *which variables* the optimiser used to decide what "mimics" means.

```{r}
#| label: fig-sc-vmatrix
#| fig-cap: "Stand-alone bar chart of the V matrix: cigsale_1975 and cigsale_1980 dominate, while behavioural covariates lnincome, age15to24 and beer get nearly zero weight."

# Build a stand-alone bar chart of the V matrix from the tidy
# grab_predictor_weights() output.
predw_df <- grab_predictor_weights(prop99_syn) |>
  mutate(variable = forcats::fct_reorder(variable, weight))

ggplot(predw_df, aes(x = weight, y = variable)) +
  geom_col(fill = "#6a9bcc") +
  geom_text(aes(label = sprintf("%.3f", weight)), hjust = -0.12) +
  expand_limits(x = max(predw_df$weight) * 1.20) +
  labs(x = "V-matrix weight (predictor importance)", y = NULL)
```

Two readings of the same picture, one practical and one cautionary.

- *Practical reading.* Two lagged outcomes (`cigsale_1975` and `cigsale_1980`) carry about 88% of the matching information. The remaining 12% is split mostly between retail cigarette price and the third lagged outcome. The optimiser has decided that California's pre-period cigarette sales --- *at multiple time points* --- are the best fingerprint to match.
- *Cautionary reading.* The V matrix is **not a causal ranking**. It tells you which variables were *useful for matching the treated unit's pre-period*, not which variables *cause* the outcome. A predictor can have zero V-weight here and still be substantively important for cigarette consumption --- it just was not the most efficient lever for getting the pre-period RMSPE small.

**Common pitfall.** Treating the V matrix as a list of causal drivers. It is a list of *good pre-period predictors for one specific unit*, not a structural model of smoking. The right place to look for *causal* importance is the post-period gap series in §10.5, not the V matrix.

### 10.3 The estimate

```{r}
#| label: sc-estimate

# grab_synthetic_control() returns a tidy tibble with observed (real_y)
# and synthetic (synth_y) cigsale for every year. We restrict to the
# post-period and compute the per-year gap.
sc_post <- grab_synthetic_control(prop99_syn) |>
  filter(time_unit > 1988) |>
  mutate(dif = real_y - synth_y)

# Average the per-year gap to recover the ATT.
sc_estimate <- mean(sc_post$dif)
sc_estimate
```

The Synthetic Control ATT is about $-18.85$ packs/capita averaged over 1989--2000. This is the workshop's primary causal estimate and within rounding of the canonical Abadie et al. (2010) result.

```{r}
#| label: fig-sc-trends
#| fig-cap: "Synthetic Control: California (observed) vs synthetic California (weighted donor combination). The pre-period fit is excellent; a gap opens immediately after 1989."

plot_trends(prop99_syn)
```

The pre-period fit is excellent --- the synthetic and observed series are nearly indistinguishable through 1988. A substantial gap opens immediately after 1989, widening to roughly 30 packs by 2000.

### 10.4 Predictor balance: did the matching work?

`grab_balance_table()` shows California, synthetic California, and the unweighted donor average side-by-side on every predictor.

```{r}
#| label: sc-balance

grab_balance_table(prop99_syn)
```

Read the rightmost two columns against the leftmost. On every variable, *synthetic California* is far closer to California than the unweighted donor average is. The most dramatic improvement is on the lagged outcomes: `cigsale_1988` is about 90 for California vs about 91 for the synthetic --- a near-perfect match --- while the unweighted donor average is about 114. That gap of ~24 packs is exactly the bias the naive pre-post method silently absorbed.

### 10.5 Visualising the post-period gap with `plot_differences()`

`plot_trends()` showed *both* observed and synthetic California on one canvas. The companion helper `plot_differences()` plots just the *gap*: $Y_{1t} - \widehat{Y_{1t}(0)}$, year by year. This isolates the treatment-effect curve in its cleanest form.

```{r}
#| label: fig-sc-differences
#| fig-cap: "tidysynth plot_differences: per-year gap between observed California and synthetic California, isolating the treatment-effect curve."

plot_differences(prop99_syn)
```

Read the line as the *effect of Proposition 99 on California in year $t$*. The pre-period values hover near zero (the matching worked), the line drops sharply after 1989, and it stays negative --- steadily widening --- throughout the post-period. The 1989--2000 mean of this series is exactly the $-18.85$ packs ATT reported above.

### 10.6 Inference via placebo permutation

A "standard error" computed as cross-year SD divided by $\sqrt{N}$ is *not* a real sampling-distribution-based standard error. The proper Synthetic Control uncertainty quantification is a permutation test.

**The recipe.** Refit the synthetic-control model treating *each donor state* as if *it* had been the treated unit. Compute the post-period gap for each placebo. Compare California's gap trajectory to those placebo trajectories. If California's gap is extreme relative to the placebos, the policy probably did something.

```{r}
#| label: fig-sc-placebos
#| fig-cap: "tidysynth plot_placebos: gap series for California in warm orange, overlaid on grey gap series for each donor state refit as if treated. Donors with poor pre-period fit are pruned by default."

# tidysynth's built-in one-liner. Returns a ggplot showing the gap
# series for California (highlighted) on top of the gap series for
# every donor refit as if treated.
plot_placebos(prop99_syn)
```

The orange line is California; the grey lines are the donor placebos. By default, `plot_placebos()` *prunes* placebos whose pre-period mean squared prediction error (MSPE) exceeds twice California's --- those donors fit their own pre-period so badly that comparing their post-period gap to California's would be misleading. After pruning, California's post-period gap sits visibly below every retained placebo, which is the visual signature of a "real" treatment effect.

The unpruned variant keeps every donor for full transparency:

```{r}
#| label: fig-sc-placebos-unpruned
#| fig-cap: "Same plot as above but with every donor retained --- including badly pre-fit ones --- to show the full pool of placebo trajectories."

plot_placebos(prop99_syn, prune = FALSE)
```

With pruning off, the grey cloud is messier and a few badly-fit donors swing wildly --- but California's post-period descent still ends up at the bottom of the bundle. The qualitative conclusion does not depend on the pruning rule.

### 10.7 The Mean Squared Prediction Error ratio and a Fisher exact p-value

A sharper version of the same test is the **MSPE ratio** --- the ratio of post-period to pre-period mean squared prediction error. If a unit has a tight pre-period fit *and* a large post-period gap, the ratio is large. California's number is striking:

```{r}
#| label: sc-significance

# grab_significance() returns one row per unit (treated + every placebo)
# with pre_mspe, post_mspe, the post/pre ratio, the unit's rank in that
# ratio, and the Fisher-style p-value (rank / n_units).
grab_significance(prop99_syn) |> arrange(desc(mspe_ratio)) |> head(5)
```

California's MSPE ratio is about 124 --- more than two and a half times higher than the next-highest unit (Georgia at about 47). California ranks **1st out of 39 units**. The Fisher exact $p$-value is rank divided by total units, so $1/39 \approx 0.026$. Under the null hypothesis that Proposition 99 had no effect, the probability of seeing a unit this extreme purely by chance is about 2.6%.

```{r}
#| label: fig-sc-mspe-ratio
#| fig-cap: "MSPE-ratio bar chart with California at the top of the ranking. Its post-period gap is unusually large relative to its pre-period fit."
#| fig-height: 7

plot_mspe_ratio(prop99_syn)
```

The orange/highlighted bar at the top is California; every other bar is a placebo donor. The gap between California and Georgia (the second-place state) is enormous. That gap is the visual signature of "a real treatment effect that the donor pool does not naturally replicate".

### 10.8 Inspecting the nested tidysynth object

`prop99_syn` is not a plain data frame --- it is a *nested tibble* with one row per unit (treated unit + every donor refit as a placebo) and list-columns that hold every intermediate output of the optimisation. Printing the object directly shows the structure.

```{r}
#| label: sc-nested

# Show the first five rows so the list-column structure is visible.
print(prop99_syn, n = 5)

# Use grab_synthetic_control(placebo = TRUE) for a tidy long table with
# observed (real_y) and synthetic (synth_y) outcomes side-by-side for
# every unit and year. This is the canonical way to flatten the nested
# tibble for downstream summaries.
prop99_syn |>
  grab_synthetic_control(placebo = TRUE) |>
  head(8)
```

The columns capture, in order: unit name, placebo indicator, type, the outcome series, the predictor matrix, the synthetic-control series, the donor weights, the predictor weights, the raw input data, run metadata, and the optimiser loss. Each list-column can be flattened with `tidyr::unnest()` for custom downstream work.

This is the whole point of the nested-tibble design: every step of the optimisation is *introspectable from R*, with no need to dig into S4 slots or `attr()` blobs.

### Recap of section 10

After eight subsections it helps to gather everything in one place.

| Question | Answer |
|---|---|
| What does Synthetic Control estimate? | The ATT on California, 1989--2000 |
| What is the point estimate? | **$-18.85$ packs/capita per year** |
| What is "synthetic California"? | A convex combination of five states: Utah ≈ 34%, Nevada ≈ 24%, Montana ≈ 21%, Colorado ≈ 15%, Connecticut ≈ 6% |
| What predictors did the matching? | Mostly two lagged outcomes --- `cigsale_1975` and `cigsale_1980` (≈ 88% of V-weight combined) |
| How is the matching quality? | Excellent --- `cigsale_1988` is ~90 (California) vs ~91 (synthetic), against an unweighted donor average of ~114 |
| What is the inference statistic? | Fisher exact $p = 0.026$ (California ranks 1st out of 39 on the MSPE ratio) |
| What is the design-time pitfall? | Don't read the V matrix as a list of causal drivers --- it is a list of *good pre-period predictors* |

Synthetic Control is the workshop's headline causal estimate, and the placebo/MSPE-ratio diagnostics in §10.6--§10.7 both confirm that California's post-1989 trajectory is unusual relative to what other states experienced in the same window. In §11 (CausalImpact) we hand the same donor information to a Bayesian model and ask whether a *credible interval* (a direct probability statement about the effect) tells the same story.

## 11. Method 6 --- CausalImpact

**The idea.** Fit a **Bayesian structural time-series (BSTS)** model on the pre-period. Use *other states' cigarette sales* (and optionally covariates) as predictors. Project the fitted model forward as the counterfactual. The posterior over (observed − projected) gives a credible interval for the policy effect.

**R tooling.** [`CausalImpact`](https://google.github.io/CausalImpact/) by Brodersen et al. (Google), built on top of [`bsts`](https://cran.r-project.org/package=bsts) (Bayesian Structural Time Series). Missing-covariate imputation here uses [`mice`](https://cran.r-project.org/package=mice) with random-forest backend.

**The model in two pieces.** The BSTS counterfactual is

$$y_{1t} = \mu_t + \beta^\top x_t + \varepsilon_t, \quad t \le t^*$$

where $\mu_t$ is a local-level trend, $x_t$ are the control-series regressors (other states' `cigsale`, plus optional covariates), and $t^*$ is the intervention date. In words: California's outcome is *a slowly-evolving trend* **plus** *a linear combination of donor-state series* **plus** *a random error*.

The two ingredients each play a distinct role.

```{mermaid}
flowchart TB
    subgraph "BSTS counterfactual ŷ₁ₜ"
        TREND["μₜ — local-level trend<br/>(absorbs dynamics no control can explain)"]
        REG["β·xₜ — regression on donor cigsale + covariates<br/>(borrows from donor pool)"]
        ERR["εₜ — random error"]
    end
    TREND --> Y["ŷ₁ₜ"]
    REG --> Y
    ERR --> Y
    Y --> CMP["Observed y₁ₜ − ŷ₁ₜ<br/>= policy effect (with credible band)"]

    style TREND fill:#6a9bcc,stroke:#141413,color:#fff
    style REG fill:#00d4c8,stroke:#141413,color:#141413
    style ERR fill:#7a8395,stroke:#141413,color:#fff
    style Y fill:#1f2b5e,stroke:#141413,color:#fff
    style CMP fill:#d97757,stroke:#141413,color:#fff
```

The trend $\mu_t$ absorbs the dynamics that no control series can explain; the regression term $\beta^\top x_t$ borrows information from the donor pool. After the model is fit on $t \le t^*$, it is projected forward and the posterior over $y_{1t} - \hat{y}_{1t}$ gives the credible interval for the policy effect.

**Input format.** CausalImpact wants a *wide* dataset with the treated outcome in column 1 and every control series in the remaining columns. The covariate columns have missing values, so we fill them with random-forest multiple imputation from `mice`.

```{r}
#| label: ci-imputation

# Fill in missing covariate values with one round of random-forest
# multiple imputation (m = 1, method = "rf"). printFlag suppresses
# console chatter.
prop99_imputed <- prop99 |>
  mice(m = 1, method = "rf", printFlag = FALSE) |>
  complete() |>
  as_tibble()

# Pivot the long panel to wide format: one column per (variable, state)
# pair. CausalImpact requires the treated outcome in column 1, hence
# relocate(cigsale_California) and drop the year index.
prop99_wide <- prop99_imputed |>
  pivot_wider(names_from  = state,
              values_from = c(cigsale, lnincome, beer, age15to24, retprice)) |>
  relocate(cigsale_California) |>
  select(-year)

cat("Imputed rows:", nrow(prop99_imputed),
    " Remaining NAs:", sum(is.na(prop99_imputed)), "\n")
```

```{r}
#| label: ci-fit

# CausalImpact takes integer row indices, not years.
# Row 1 = 1970, row 19 = 1988 (last pre-period year).
# Row 20 = 1989, row 31 = 2000 (last observed year).
pre_idx  <- c(1, 19)
post_idx <- c(20, 31)

# Reset the seed so the BSTS MCMC draws are reproducible.
set.seed(42)

# Fit the Bayesian structural time-series model and print the summary
# (average + cumulative effect with credible intervals).
impact_full <- CausalImpact(prop99_wide, pre.period = pre_idx,
                            post.period = post_idx)
summary(impact_full)
```

**Reading the output.**

- **Average ATT:** about $-13$ packs/capita (posterior SD ~11), 95% credible interval roughly $[-32, +6]$.
- **Cumulative effect:** about $-154$ packs over 12 years (95% CI roughly $[-383, +68]$), or about 16% of what would have been expected absent the policy.
- **Posterior probability of any causal effect:** about 92%.

If we drop the covariates and use only other states' cigarette sales as controls, the point estimate strengthens (around $-21$ packs) and the posterior probability rises. The covariates absorb some of the variation the simpler model was attributing to Proposition 99 --- which can be read as either "added robustness" or "watered-down signal" depending on how much you trust the imputed beer-and-income covariates.

```{r}
#| label: fig-ci
#| fig-cap: "CausalImpact two-panel: pointwise observed vs Bayesian counterfactual, and cumulative effect over time."
#| fig-height: 7

plot(impact_full)
```

The top panel shows the pointwise picture: observed California opens a steady gap below the Bayesian counterfactual starting in 1989, with a 95% credible band that widens as we forecast further from the training window. The bottom panel cumulates that gap over time. By 2000 the cumulative effect is roughly $-150$ packs/capita with a credible interval that includes zero only at the very upper edge.

**Common pitfall.** Imputing missing covariates without thinking about the imputation model. The random-forest fill we use here is a single-imputation shortcut for tutorial speed. With multiple imputation ($m > 1$) or a different model, the estimate can move by 1--3 packs --- and worse, an imputation that uses California itself to fill donor covariates would build an artificial post-period correlation that biases the result toward zero.

**Recap.** CausalImpact lands at $-13$ to $-21$ packs depending on whether covariates are included, with a 92--97% posterior probability of a non-zero effect. It is the only method here that delivers a *credible* interval (a direct probability statement about the parameter), not a frequentist confidence band.

## 12. Cross-method comparison

We collect every method's point estimate, an approximate ±1.96·SE interval for visual comparison, *and* a `principled_inference` string that records each method's recommended uncertainty quantification. The two columns differ for Synthetic Control (where the right inference is a Fisher exact p-value, not a confidence interval) and for CausalImpact (where the right interval is Bayesian, not frequentist).

```{r}
#| label: cross-method-table

# Build a tidy tibble that holds, for each estimator, its name, the
# estimand it targets, the point estimate, an approximate standard error
# (for the back-of-envelope forest plot), and a text string describing
# the method's PRINCIPLED uncertainty quantification (used in the table
# that follows the forest plot).
results_tbl <- tibble(
  method = c("Naive pre-post", "DiD (CA vs Nevada)", "ITS (growth curve)",
             "ITS (ARIMA)", "RDD on time", "Synthetic Control", "CausalImpact"),
  estimand = c("Descriptive (biased)", "ATT (CA, 1989-1993)",
               "Mean post-period gap", "Mean post-period gap",
               "Level jump at 1989", "ATT (CA, 1989-2000)",
               "ATT (CA, 1989-2000)"),
  estimate  = c(-27.02, -5.68, -28.28, 4.55, -20.06, -18.85, -12.82),
  std_error = c(5.30, 5.39, 1.72, 2.34, 5.59, 1.84, 9.60),
  principled_inference = c(
    "HAC 95% CI: [-37.4, -16.6]",
    "HAC 95% CI: [-16.3, +4.9]",
    "Linear-trend 95% prediction interval (no closed-form ATT SE)",
    "ARIMA 95% forecast-band average: [-29.1, +38.2]",
    "HAC 95% CI: [-31.0, -9.1]",
    "Fisher exact p = 0.026 (MSPE-ratio rank 1/39)",
    "Posterior 95% CrI: [-31.9, +5.7]; P(effect != 0) = 92%"
  )
) |>
  # Add the back-of-envelope 95% interval used in the forest plot.
  mutate(ci_low  = estimate - 1.96 * std_error,
         ci_high = estimate + 1.96 * std_error)

results_tbl
```

The forest plot below uses the back-of-envelope `±1.96·SE` interval to fit every method onto a shared visual scale. The table that follows it is the more honest summary: it records each method's *recommended* uncertainty quantification, which differs in kind from method to method.

```{r}
#| label: fig-forest
#| fig-cap: "Forest plot of all seven estimators with back-of-envelope 95% intervals for the effect on per-capita cigarette sales."
#| fig-height: 6

# Order methods top-to-bottom in display, mark naive as the descriptive
# baseline, and plot point + interval + label.
forest_df <- results_tbl |>
  mutate(method = factor(method, levels = rev(results_tbl$method)),
         family = if_else(method == "Naive pre-post",
                          "Descriptive baseline", "Causal estimator"))

ggplot(forest_df, aes(x = estimate, y = method, color = family)) +
  geom_vline(xintercept = 0, color = "grey60", linewidth = 0.4) +
  geom_errorbar(aes(xmin = ci_low, xmax = ci_high),
                width = 0.18, linewidth = 0.8,
                orientation = "y") +
  geom_point(size = 3.4) +
  geom_text(aes(label = sprintf("%.1f", estimate)),
            vjust = -1.2, size = 3.2) +
  scale_color_manual(values = c("Descriptive baseline" = "grey60",
                                "Causal estimator"     = "#d97757")) +
  labs(title = "Six estimators of California's Proposition 99 effect",
       subtitle = "Effect on per-capita cigarette sales (packs).\n95% CIs use HAC, naive across-year SD, or CausalImpact credible bands.",
       x = "Effect on cigarette sales (negative = reduction)",
       y = NULL, color = NULL) +
  theme(legend.position = "bottom")
```

| Method | Estimand | Point estimate | Back-of-envelope ±1.96·SE | Principled inference |
|---|---|---:|---|---|
| Naive pre-post | Descriptive (biased) | $-27.0$ | [$-37.4$, $-16.6$] | HAC 95% CI: [$-37.4$, $-16.6$] |
| Difference-in-Differences (vs Nevada) | ATT on California, 1989--1993 | $-5.7$ | [$-16.3$, $+4.9$] | HAC 95% CI: [$-16.3$, $+4.9$] |
| Interrupted Time Series (growth curve) | Mean post-period gap | $-28.3$ | [$-31.7$, $-24.9$] | Linear-trend 95% prediction interval (no closed-form ATT standard error) |
| Interrupted Time Series (ARIMA) | Mean post-period gap | $+4.5$ | [$-0.0$, $+9.1$] | ARIMA 95% forecast-band average: [$-29.1$, $+38.2$] |
| Regression Discontinuity on time | Level jump at 1989 | $-20.1$ | [$-31.0$, $-9.1$] | HAC 95% CI: [$-31.0$, $-9.1$] |
| Synthetic Control | ATT on California, 1989--2000 | $-18.9$ | [$-22.5$, $-15.2$] | Fisher exact $p = 0.026$ (Mean Squared Prediction Error ratio rank 1 of 39) |
| CausalImpact | ATT on California, 1989--2000 | $-12.8$ | [$-31.6$, $+6.0$] | Posterior 95% credible interval: [$-31.9$, $+5.7$]; $P(\text{effect} \neq 0) = 92\%$ |

Three things are now visible that the forest plot alone cannot show:

1. For four of the seven rows (Naive, Difference-in-Differences, the linear-trend Interrupted Time Series, and Regression Discontinuity) the back-of-envelope interval and the principled interval are the same --- those are the methods where heteroskedasticity-and-autocorrelation-consistent standard errors are the natural inference.
2. **ARIMA's principled forecast band ($[-29.1, +38.2]$) is enormous** --- much wider than the back-of-envelope ±1.96·SE bar in the forest plot. The point estimate of $+4.5$ packs sits in the middle of an interval that easily crosses both $-29$ and $+38$, which is the *honest* way to report ARIMA-based Interrupted Time Series under model uncertainty.
3. **Synthetic Control's principled inference is not a confidence interval at all** --- it is a rank-based Fisher exact $p$-value of 0.026. Anyone reporting Synthetic Control should cite that $p$-value, not the back-of-envelope $\pm 1.96 \cdot \tfrac{\mathrm{SD}}{\sqrt{N}}$ band.

Three groupings jump off the page.

**Cluster 1 --- the causal consensus ($-13$ to $-20$ packs).** RDD on time ($-20.1$), Synthetic Control ($-18.9$), and CausalImpact full-covariate ($-12.8$) sit close together with overlapping intervals. All three build counterfactuals from principled donor-information machinery: a piecewise time model, a weighted donor blend, and a Bayesian structural time series. This is the headline range.

**Cluster 2 --- pre-trend extrapolation only (overshoots by ~50%).** Naive pre-post ($-27.0$) and ITS-growth-curve ($-28.3$) report roughly 50% larger effects. They use only within-California information. With no comparison unit to absorb the nationwide secular decline, the entire California drop gets attributed to Proposition 99.

**Cluster 3 --- the broken outliers in opposite directions.** DiD vs Nevada ($-5.7$, $p = 0.31$) collapses to noise because Nevada was falling in parallel. ITS-ARIMA ($+4.55$) flips sign because AICc picks a model that extrapolates short-run momentum out of sample. Each illustrates a textbook failure mode worth remembering.

## 13. Discussion

The point of running six estimators on the same data is not to find "the right answer". It is to learn *where* each estimator fails and *how* to read disagreement.

### Six counterfactuals at a glance

Each method's counterfactual is a one-sentence assumption. Lining them up makes the disagreement legible.

| Method | The counterfactual is… | Estimate |
|---|---|---:|
| Naive pre-post | California's pre-1989 level continues unchanged | $-27.0$ |
| DiD vs Nevada | California would have done what Nevada did | $-5.7$ |
| ITS growth-curve | California's straight-line pre-trend continues | $-28.3$ |
| ITS ARIMA | California's pre-trend continues via best-AICc model | $+4.5$ |
| RDD on time | California's pre-period piecewise fit continues | $-20.1$ |
| Synthetic Control | A weighted blend of donor states tracks California | $-18.9$ |
| CausalImpact | A Bayesian time-series model fit on donors projects forward | $-12.8$ |

### Three lessons

**1. The choice of counterfactual is *the* design decision.**
Every method computes effect $=$ observed $-$ counterfactual. The gap from $-5.7$ (DiD vs Nevada) to $-28.3$ (ITS-growth) is the *price* of making the wrong assumption about the missing counterfactual. The data are the same; the assumptions differ.

**2. Single comparisons are fragile; weighted combinations are robust.**
DiD against one neighbouring state collapses when that state is itself shifting. Synthetic Control's data-driven blending --- Utah 34%, Nevada 24%, Montana 21%, Colorado 15%, Connecticut 6%, everyone else ~0% --- produces a stable, interpretable estimate. CausalImpact does the same job through a Bayesian regression on all donors and lands in the same neighbourhood.

**3. Automated model selection is not your friend in ITS.**
AICc picked ARIMA(1, 2, 0) on California's 19-year pre-period. The implied counterfactual is *worse than the observed post-period*. No diagnostic statistic flagged the problem. Always pair a single-model ITS estimate against a comparison-unit method before drawing conclusions.

### A "so-what" for policymakers

If a state legislator asks "what did Proposition 99 do for California's smoking rates?", the honest answer is:

> Cigarette sales fell about 18 packs per capita per year more than they would have without the policy, with reasonable bounds of $-13$ to $-22$ packs. The cumulative effect over the first 12 years is roughly 150--250 fewer packs per Californian.

That headline survives every causally-defensible specification (RDD, Synthetic Control, both CausalImpact variants). It can be plugged directly into a back-of-envelope mortality or tax-revenue calculation.

## 14. Summary and next steps

**Method takeaway.** Five of the six causal estimators agree on a $-13$ to $-20$ pack reduction. The synthetic-control class (SCM, CausalImpact, RDD-on-time) clusters around $-18$ packs. The naive and single-unit methods either overshoot ($-27$ to $-28$) or collapse ($-5.7$, $+4.5$).

**Data takeaway.** California's pre-1988 cigarette sales were already declining at $-1.78$ packs/year. Any honest evaluation must separate the policy effect from that pre-existing trend. Synthetic California's pre-period fit (~90 vs ~91 in 1988) shows a five-state weighted blend can replicate the trajectory almost exactly.

**Inference takeaway.** Synthetic Control's Fisher exact $p$-value is 0.026 (California ranks 1st of 39 on the MSPE ratio). CausalImpact's posterior probability of a non-zero effect is about 92% (full covariates) or 97% (cigarette-only). The two strongest principled inference statements agree.

**Practical limitation.** No method here delivers a "true" causal effect with formal frequentist guarantees, because Proposition 99 was not randomized. Every estimate is conditional on an identifying assumption (parallel trends, pre-trend extrapolation, donor convexity, BSTS prior). The cross-method comparison is a *triangulation*, not a proof.

**Next steps.** For a deeper modern DiD treatment with staggered adoption, group-time ATTs, and HonestDiD sensitivity analysis, see [Difference-in-Differences for Policy Evaluation: A Tutorial using R](https://carlos-mendez.org/post/r_did/). For a Bayesian extension that lets the donor weights vary across space --- also fit on this same Proposition 99 dataset --- see [Bayesian Spatial Synthetic Control: California's Proposition 99 in R](https://carlos-mendez.org/post/r_sc_bayes_spatial/). For the original workshop with PDF lecture slides, see [causalpolicy.nl](https://causalpolicy.nl/).

## 15. Exercises

1. **Sensitivity to the comparison window.** Re-run the DiD and naive pre-post estimates on the full 1970--2000 window instead of the workshop's 1984--1993 window. Do the estimates get closer to the synthetic-control consensus, or further away? Why?

2. **Pick a different ITS model.** Refit the ITS section using `ARIMA(1, 1, 0)` (one autoregressive lag, one round of differencing) instead of the AICc-selected `ARIMA(1, 2, 0)`. Does the post-period counterfactual still bend below the observed series? What does that imply for the choice between AIC, AICc, and BIC in policy evaluation?

3. **Different intervention year.** Pretend the intervention happened in 1985 instead of 1989 (a placebo). Re-run Synthetic Control with `i_time = 1984`. The post-period gap should be near zero if the method is working --- is it? What does a non-zero "placebo effect" tell you about the method's identification assumptions?

4. **Probe the V matrix.** Print `grab_predictor_weights(prop99_syn)` for the fitted model. Two predictors (`cigsale_1975` and `cigsale_1980`) together get 88% of the weight. Re-fit *without* the three lagged outcomes (drop the three `generate_predictor(time_window = 19xx, cigsale_19xx = cigsale)` calls). Does the synthetic California still match the pre-period as well? What does that tell you about the role of lagged outcomes in Synthetic Control?

## 16. References

1. [Abadie, A., Diamond, A., & Hainmueller, J. (2010). Synthetic control methods for comparative case studies: Estimating the effect of California's Tobacco Control Program. *Journal of the American Statistical Association*, 105(490), 493--505.](https://www.aeaweb.org/articles?id=10.1257/jasa.2010.ap08746)

2. [Abadie, A. (2021). Using synthetic controls: Feasibility, data requirements, and methodological aspects. *Journal of Economic Literature*, 59(2), 391--425.](https://www.aeaweb.org/articles?id=10.1257/jel.20191450)

3. [Brodersen, K. H., Gallusser, F., Koehler, J., Remy, N., & Scott, S. L. (2015). Inferring causal impact using Bayesian structural time-series models. *Annals of Applied Statistics*, 9, 247--274.](https://research.google.com/pubs/pub41854.html)

4. [Bernal, J. L., Cummins, S., & Gasparrini, A. (2017). Interrupted time series regression for the evaluation of public health interventions: A tutorial. *International Journal of Epidemiology*, 46(1), 348--355.](https://academic.oup.com/ije/article/46/1/348/2622842)

5. [Hyndman, R. J., & Athanasopoulos, G. (2021). *Forecasting: Principles and Practice* (3rd ed.). OTexts.](https://otexts.com/fpp3/)

6. [ODISSEI Social Data Science team. (2024). *Workshop on Causal Effects of Policy Interventions*. CC-BY-4.0.](https://causalpolicy.nl/)

7. [Dunford, E. (2024). `tidysynth` --- A tidy implementation of the synthetic control method in R. GitHub repository.](https://github.com/edunford/tidysynth)

8. [`CausalImpact` --- An R package for causal inference using Bayesian structural time-series models.](https://google.github.io/CausalImpact/)

9. [`fpp3` --- Forecasting: Principles and Practice (3rd edition) data and R package.](https://cran.r-project.org/package=fpp3)

10. [Brodersen, K. H. *Inferring the effect of an event using CausalImpact*. YouTube talk.](https://youtu.be/GTgZfCltMm8) --- a 50-minute walk-through of the CausalImpact intuition, motivating examples, and the Bayesian structural time-series model from the package's lead author.