--- name: statsmodels description: >- Statistical modeling: OLS/WLS/GLS, GLM (logit, probit, Poisson), time series (ARIMA, VAR), mixed effects, diagnostics. Formula API. Use for regressions without fixed effects, GLMs, or time series. For FE/DiD use pyfixest; panel/IV use linearmodels. metadata: audience: research-coders domain: python-library library-version: "0.14.6" skill-last-updated: "2026-03-27" --- # statsmodels Skill statsmodels general-purpose statistical modeling library for Python. Covers OLS/WLS/GLS, GLM (logit, probit, Poisson, negative binomial), discrete choice models, time series (ARIMA, SARIMAX, VAR), mixed effects (MixedLM), robust regression, hypothesis tests, and comprehensive diagnostics. Supports R-style formula API. Use when fitting regressions without fixed effects, running GLMs or logit/probit, analyzing time series, or using formula syntax. For fixed effects or DiD, use pyfixest; for panel/IV/system models, use linearmodels. Comprehensive skill for statistical modeling with statsmodels. Use decision trees below to find the right guidance, then load detailed references. ## What is statsmodels? statsmodels is the general-purpose **statistical modeling** library for Python: - **Two APIs**: Formula API (`smf.ols("y ~ x1 + x2", data=df)`) for R-style modeling, and array API (`sm.OLS(y, X)`) for programmatic control - **Broad model coverage**: OLS, WLS, GLS, GLM (all families), logit, probit, multinomial, count models, zero-inflated models, quantile regression, robust regression - **Time series**: ARIMA, SARIMAX, VAR, exponential smoothing, state space models, unit root tests - **Diagnostics**: Heteroskedasticity tests, normality tests, specification tests, VIF, influence measures, residual analysis - **Hypothesis testing**: t-tests, F-tests, Wald tests, likelihood ratio tests, multiple comparison corrections ## How to Use This Skill ### Reference File Structure | File | Purpose | When to Read | |------|---------|--------------| | `quickstart.md` | Installation, formula vs array API, first model | Starting with statsmodels | | `linear-models.md` | OLS, WLS, GLS, robust regression, quantile regression | Fitting linear models | | `glm-discrete.md` | GLM families, logit/probit, count models, zero-inflated | Non-linear models, binary/count outcomes | | `time-series.md` | ARIMA, SARIMAX, VAR, exponential smoothing, unit root tests | Analyzing temporal data | | `diagnostics.md` | Heteroskedasticity, normality, VIF, influence, residuals | Checking model assumptions | | `hypothesis-testing.md` | t-tests, F-tests, Wald tests, multiple comparisons | Testing coefficients and comparing models | | `gotchas.md` | Constant term, convergence, predict pitfalls, pyfixest boundary | Debugging issues | ### Reading Order 1. **New to statsmodels?** Start with `quickstart.md` then `linear-models.md` 2. **Need GLM or logit/probit?** Read `quickstart.md` then `glm-discrete.md` 3. **Time series analysis?** Read `quickstart.md` then `time-series.md` 4. **Checking model assumptions?** Read `diagnostics.md` 5. **Coming from R?** Read `quickstart.md` (formula API mirrors R syntax) ## Related Skills - **pyfixest**: Use instead of statsmodels when your model needs absorbed fixed effects, IV with FE, or difference-in-differences. pyfixest is faster for FE models; statsmodels is broader for everything else - **linearmodels**: Use for panel data models (FE, RE, between, first difference, Fama-MacBeth), IV/GMM without FE (2SLS, LIML, GMM), system estimation (SUR, 3SLS), and asset pricing. Built on top of statsmodels; extends it for structured data - **svy**: Use for survey-weighted regression and estimation with complex survey designs. **Important:** statsmodels WLS is NOT equivalent to survey-weighted regression — WLS handles heteroscedastic errors but does not account for stratification, clustering, or finite population corrections. If your data comes from a complex probability survey (NHANES, ACS PUMS, CPS, ECLS-K, etc.), load the `svy` skill instead - **data-scientist**: Provides methodology guidance (when to use which model, assumption checking protocol, interpretation). Load alongside statsmodels for the "why"; statsmodels provides the "how" - **polars**: Data manipulation before modeling. statsmodels accepts pandas DataFrames; convert with `df.to_pandas()` if using Polars - **plotnine**: Publication-quality visualization of model results and diagnostics ## Quick Decision Trees ### "I need to fit a regression model" ``` What kind of regression? ├─ Linear (continuous outcome) │ ├─ Basic OLS → ./references/linear-models.md │ ├─ Weighted least squares → ./references/linear-models.md │ │ (⚠ WLS ≠ survey-weighted regression — for complex surveys, use `svy` skill) │ ├─ Correlated errors (GLS) → ./references/linear-models.md │ ├─ Robust to outliers (M-estimator) → ./references/linear-models.md │ └─ Quantile regression → ./references/linear-models.md ├─ Binary outcome (0/1) │ ├─ Logit → ./references/glm-discrete.md │ └─ Probit → ./references/glm-discrete.md ├─ Count outcome (0, 1, 2, ...) │ ├─ Poisson → ./references/glm-discrete.md │ ├─ Negative binomial → ./references/glm-discrete.md │ └─ Zero-inflated → ./references/glm-discrete.md ├─ Multinomial (3+ categories) │ └─ Multinomial logit → ./references/glm-discrete.md ├─ GLM (custom family/link) │ └─ GLM framework → ./references/glm-discrete.md └─ Need fixed effects? └─ Use pyfixest instead (faster FE absorption) ``` ### "I need to analyze time series" ``` What time series task? ├─ Forecast a single series │ ├─ ARIMA / SARIMAX → ./references/time-series.md │ └─ Exponential smoothing → ./references/time-series.md ├─ Multiple interrelated series │ └─ VAR / VECM → ./references/time-series.md ├─ Test for stationarity │ ├─ ADF test → ./references/time-series.md │ └─ KPSS test → ./references/time-series.md ├─ Examine autocorrelation │ └─ ACF / PACF → ./references/time-series.md └─ Structural time series └─ Unobserved components → ./references/time-series.md ``` ### "I need to check model assumptions" ``` What assumption to check? ├─ Heteroskedasticity → ./references/diagnostics.md │ ├─ Breusch-Pagan test │ └─ White test ├─ Normality of residuals → ./references/diagnostics.md │ ├─ Jarque-Bera test │ └─ Shapiro-Wilk test ├─ Specification / functional form → ./references/diagnostics.md │ └─ RESET test ├─ Multicollinearity → ./references/diagnostics.md │ ├─ VIF │ └─ Condition number ├─ Influential observations → ./references/diagnostics.md │ ├─ Cook's distance │ └─ Leverage / DFFITS ├─ Serial correlation → ./references/diagnostics.md │ └─ Durbin-Watson / Breusch-Godfrey └─ All of the above → ./references/diagnostics.md ``` ### "I need to test hypotheses" ``` What kind of test? ├─ Single coefficient significance → ./references/hypothesis-testing.md ├─ Joint significance (F-test) → ./references/hypothesis-testing.md ├─ Linear restrictions (Wald) → ./references/hypothesis-testing.md ├─ Compare nested models (LR test) → ./references/hypothesis-testing.md ├─ Multiple comparisons correction → ./references/hypothesis-testing.md └─ Chi-squared test → ./references/hypothesis-testing.md ``` ### "Something isn't working" ``` Common issues? ├─ Missing constant / intercept → ./references/gotchas.md ├─ Convergence warnings → ./references/gotchas.md ├─ predict() errors → ./references/gotchas.md ├─ Formula parsing issues → ./references/gotchas.md ├─ summary() formatting → ./references/gotchas.md ├─ statsmodels vs pyfixest → ./references/gotchas.md └─ General troubleshooting → ./references/gotchas.md ``` ## File-First Execution in Research Workflows **Important:** In data research pipelines (see `CLAUDE.md`), statsmodels analyses are executed through **script files**, not interactively. This ensures auditability and reproducibility. **The pattern:** 1. Write model code to `scripts/stage8_analysis/{step}_{model-name}.py` 2. Execute via Bash with automatic output capture wrapper script 3. Validation results get automatically embedded in scripts as comments 4. If failed, create versioned copy for fixes Closely read `agent_reference/SCRIPT_EXECUTION_REFERENCE.md` for the mandatory file-first execution protocol covering complete code file writing, output capture, and file versioning rules. **See:** - `agent_reference/SCRIPT_EXECUTION_REFERENCE.md` — Script execution protocol and format with validation The examples below show statsmodels syntax. In research workflows, wrap them in scripts following the file-first pattern. --- ## Quick Reference ### Essential Imports ```python import statsmodels.api as sm # Array API import statsmodels.formula.api as smf # Formula API (R-style) ``` ### Core Operations | Operation | Code | |-----------|------| | OLS (formula) | `smf.ols("y ~ x1 + x2", data=df).fit()` | | OLS (array) | `sm.OLS(y, sm.add_constant(X)).fit()` | | Logit | `smf.logit("y ~ x1 + x2", data=df).fit()` | | Probit | `smf.probit("y ~ x1 + x2", data=df).fit()` | | Poisson | `smf.poisson("y ~ x1 + x2", data=df).fit()` | | GLM (custom) | `smf.glm("y ~ x1", data=df, family=sm.families.Binomial()).fit()` | | WLS | `smf.wls("y ~ x1", data=df, weights=w).fit()` | | Robust (HC1) | `fit = smf.ols(...).fit(cov_type='HC1')` | | ARIMA | `sm.tsa.ARIMA(y, order=(p,d,q)).fit()` | | Summary | `results.summary()` | | Predict | `results.predict(new_data)` | | Confidence intervals | `results.conf_int(alpha=0.05)` | | Marginal effects | `results.get_margeff(at='overall')` | | VIF | `from statsmodels.stats.outliers_influence import variance_inflation_factor` | | Breusch-Pagan | `sm.stats.diagnostic.het_breuschpagan(resid, exog)` | ### Formula Syntax ```python # Additive terms "y ~ x1 + x2 + x3" # Interaction (with main effects) "y ~ x1 * x2" # equivalent to x1 + x2 + x1:x2 # Interaction only (no main effects) "y ~ x1 : x2" # Categorical variable "y ~ C(region)" # treatment coding (default) "y ~ C(region, Treatment(reference='West'))" # explicit reference # Suppress intercept "y ~ x1 + x2 - 1" # Polynomial "y ~ x1 + I(x1**2)" # I() protects Python operators ``` ## Topic Index | Topic | Reference File | |-------|---------------| | Installation | `./references/quickstart.md` | | Formula vs array API | `./references/quickstart.md` | | Reading summary output | `./references/quickstart.md` | | Comparison to pyfixest | `./references/quickstart.md` | | OLS regression | `./references/linear-models.md` | | Weighted least squares | `./references/linear-models.md` | | GLS | `./references/linear-models.md` | | Robust regression (RLM) | `./references/linear-models.md` | | Quantile regression | `./references/linear-models.md` | | Interactions and polynomials | `./references/linear-models.md` | | GLM framework | `./references/glm-discrete.md` | | Logit / probit | `./references/glm-discrete.md` | | Multinomial logit | `./references/glm-discrete.md` | | Poisson / negative binomial | `./references/glm-discrete.md` | | Zero-inflated models | `./references/glm-discrete.md` | | Marginal effects | `./references/glm-discrete.md` | | Exposure / offset | `./references/glm-discrete.md` | | ARIMA / SARIMAX | `./references/time-series.md` | | VAR / VECM | `./references/time-series.md` | | Exponential smoothing | `./references/time-series.md` | | Unit root tests | `./references/time-series.md` | | ACF / PACF | `./references/time-series.md` | | Forecasting | `./references/time-series.md` | | State space models | `./references/time-series.md` | | Heteroskedasticity tests | `./references/diagnostics.md` | | Normality tests | `./references/diagnostics.md` | | Specification tests (RESET) | `./references/diagnostics.md` | | VIF / multicollinearity | `./references/diagnostics.md` | | Influence measures | `./references/diagnostics.md` | | Residual analysis | `./references/diagnostics.md` | | Durbin-Watson | `./references/diagnostics.md` | | t-tests and F-tests | `./references/hypothesis-testing.md` | | Wald tests | `./references/hypothesis-testing.md` | | Likelihood ratio tests | `./references/hypothesis-testing.md` | | Multiple comparison corrections | `./references/hypothesis-testing.md` | | Comparing nested models | `./references/hypothesis-testing.md` | | Serial correlation tests | `./references/diagnostics.md` | | Diagnostic checklist | `./references/diagnostics.md` | | Chi-squared tests | `./references/hypothesis-testing.md` | | Joint significance tests | `./references/hypothesis-testing.md` | | Ordered logit / probit | `./references/glm-discrete.md` | | Mixed effects (MixedLM) | `./references/linear-models.md` | | Constant term pitfall | `./references/gotchas.md` | | Convergence warnings | `./references/gotchas.md` | | predict() issues | `./references/gotchas.md` | | Formula parsing (patsy) | `./references/gotchas.md` | | summary() vs summary2() | `./references/gotchas.md` | | NaN / missing data | `./references/gotchas.md` | | DataFrame index issues | `./references/gotchas.md` | | statsmodels vs pyfixest | `./references/gotchas.md` | ## Citation When this library is used as a primary analytical tool, include in the report's Software & Tools references: > Seabold, S. & Perktold, J. (2010). "Statsmodels: Econometric and Statistical Modeling with Python." *Proceedings of the 9th Python in Science Conference*. **Cite when:** statsmodels is used for GLM estimation, time series modeling, or statistical hypothesis testing central to the analysis. **Do not cite when:** Only used for post-estimation diagnostics supporting another library's primary estimation. For method-specific citations (e.g., individual estimators or techniques), consult the reference files in this skill and `agent_reference/CITATION_REFERENCE.md`.