# Model comparison with the AIC

[mRNA count dataset download](https://s3.amazonaws.com/bebi103.caltech.edu/data/singer_transcript_counts.csv)

[Spindle length dataset download](https://s3.amazonaws.com/bebi103.caltech.edu/data/good_invitro_droplet_data.csv)

<hr>

In [1]:
#| code-fold: true

# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade polars iqplot bebi103 watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------

In [2]:
import warnings

import polars as pl
import numpy as np
import numba
import scipy.optimize
import scipy.special
import scipy.stats as st

import bebi103

import iqplot

import bokeh.io
import bokeh.layouts
bokeh.io.output_notebook()

<hr>

We have previously introduced the Akaike information criterion. Here, we will demonstrate its use in model comparison and the mechanics of how to calculate it.

As a reminder, for a set of parameters $\theta$ with MLE $\theta^*$ and a model with log-likelihood $\ell(\theta;\text{data})$, the AIC is given by

\begin{align}
\text{AIC} = -2\ell(\theta^*;\text{data}) + 2p,
\end{align}

where $p$ is the number of free parameters in a model. The Akaike weight of model $i$ in a collection of models is

\begin{align}
w_i = \frac{\mathrm{e}^{-(\text{AIC}_i - \text{AIC}_\mathrm{max})/2}}{\sum_j\mathrm{e}^{-(\text{AIC}_j - \text{AIC}_\mathrm{max})/2}}.
\end{align}
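As a rough illustration of the two formulas above, the following sketch computes AICs and Akaike weights with NumPy. The function names `aic` and `akaike_weights` are hypothetical helpers, not part of any library, and the log-likelihood values are made up. Note one assumption: the sketch subtracts the smallest AIC rather than $\text{AIC}_\mathrm{max}$. The subtracted constant cancels between numerator and denominator, so the weights are unchanged, and using the minimum keeps every exponent nonpositive.

```python
import numpy as np


def aic(log_like, p):
    """AIC from a log-likelihood evaluated at the MLE and a parameter count p."""
    return -2 * log_like + 2 * p


def akaike_weights(aics):
    """Akaike weights for a collection of AIC values (hypothetical helper)."""
    aics = np.asarray(aics, dtype=float)

    # Any constant may be subtracted; the minimum avoids overflow in exp
    delta = aics - aics.min()
    unnormalized = np.exp(-delta / 2)

    return unnormalized / unnormalized.sum()


# Made-up log-likelihoods for two models with 2 and 5 parameters
aics = [aic(-100.0, 2), aic(-95.0, 5)]
weights = akaike_weights(aics)
```

The weights sum to one by construction, and the model with the lower AIC receives the larger weight.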

To begin, we will use the AIC to compare a single Negative Binomial model to a mixture of two Negative Binomials for smFISH data from Singer, et al.

## AIC for mRNA counts

Let us now compare the single Negative Binomial to the mixture model for mRNA counts. We again need our functions for computing the MLE and computing the log-likelihood from previous lessons.

In [3]:
def log_like_iid_nbinom(params, n):
    """Log likelihood for i.i.d. NBinom measurements, parametrized
    by alpha, b=1/beta."""
    alpha, b = params

    if alpha <= 0 or b <= 0:
        return -np.inf

    return np.sum(st.nbinom.logpmf(n, alpha, 1/(1+b)))


def mle_iid_nbinom(n):
    """Perform maximum likelihood estimates for parameters for i.i.d.
    NBinom measurements, parametrized by alpha, b=1/beta"""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        res = scipy.optimize.minimize(
            fun=lambda params, n: -log_like_iid_nbinom(params, n),
            x0=np.array([3, 3]),
            args=(n,),
            method='Powell'
        )

    if res.success:
        return res.x
    else:
        raise RuntimeError('Convergence failed with message', res.message)
        

def initial_guess_mix(n, w_guess):
    """Generate initial guess for mixture model."""
    n_low = n[n < np.percentile(n, 100*w_guess)]
    n_high = n[n >= np.percentile(n, 100*w_guess)]
    
    alpha1, b1 = mle_iid_nbinom(n_low)
    alpha2, b2 = mle_iid_nbinom(n_high)
    
    return alpha1, b1, alpha2, b2


def log_like_mix(alpha1, b1, alpha2, b2, w, n):
    """Log-likelihood of binary Negative Binomial mixture model."""
    # Enforce bounds on the mixture weight w
    if w < 0 or w > 1:
        return -np.inf
    
    # Physical bounds on parameters
    if alpha1 < 0 or alpha2 < 0 or b1 < 0 or b2 < 0:
        return -np.inf

    logx1 = st.nbinom.logpmf(n, alpha1, 1/(1+b1))
    logx2 = st.nbinom.logpmf(n, alpha2, 1/(1+b2))

    # Multipliers for log-sum-exp
    lse_coeffs = np.tile([w, 1-w], [len(n), 1]).transpose()

    # log-likelihood for each measurement
    log_likes = scipy.special.logsumexp(np.vstack([logx1, logx2]), axis=0, b=lse_coeffs)
    
    return np.sum(log_likes)


def mle_mix(n, w_guess):
    """Obtain MLE estimate for parameters for binary mixture 
    of Negative Binomials."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        res = scipy.optimize.minimize(
            fun=lambda params, n: -log_like_mix(*params, n),
            x0=[*initial_guess_mix(n, w_guess), w_guess],
            args=(n,),
            method='Powell',
            tol=1e-6,
        )

    if res.success:
        return res.x
    else:
        raise RuntimeError('Convergence failed with message', res.message)            
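Two details of these functions can be checked in isolation. The sketch below uses illustrative parameter values that are not fitted to any data. First, it confirms that passing `1/(1+b)` as SciPy's success probability gives a Negative Binomial with mean $\alpha b$. Second, it compares the weighted log-sum-exp used in `log_like_mix` with a direct evaluation of $\ln\left(w\,f_1 + (1-w)\,f_2\right)$.

```python
import numpy as np
import scipy.special
import scipy.stats as st

# Illustrative parameter values (not fitted to any data)
alpha, b = 2.0, 10.0

# SciPy's nbinom takes (n, p); with p = 1/(1+b), the mean is alpha * b
mean_scipy = st.nbinom.mean(alpha, 1 / (1 + b))

# Mixture log-PMF for a few counts, computed two ways
n = np.array([0, 5, 50])
w = 0.3
logx1 = st.nbinom.logpmf(n, 2.0, 1 / (1 + 10.0))
logx2 = st.nbinom.logpmf(n, 5.0, 1 / (1 + 30.0))

# Direct evaluation: fine here, but can underflow for very small PMFs
direct = np.log(w * np.exp(logx1) + (1 - w) * np.exp(logx2))

# Weighted log-sum-exp, as in log_like_mix
lse_coeffs = np.tile([w, 1 - w], [len(n), 1]).transpose()
stable = scipy.special.logsumexp(np.vstack([logx1, logx2]), axis=0, b=lse_coeffs)
```

The two results agree for these values. The log-sum-exp form is the safer choice when individual PMF values become small enough to underflow.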

Now we can load in the data and compute the MLEs for each of the four genes.

In [4]:
# Load in data
df = pl.read_csv(
    os.path.join(data_path, "singer_transcript_counts.csv"), comment_prefix="#"
)

df_mle = pl.DataFrame(
    schema=[("gene", str)]
    + [(param, float) for param in ["alpha", "b", "alpha1", "b1", "alpha2", "b2", "w"]]
)

for gene in df.schema:
    n = df[gene].to_numpy()

    # Single Negative Binomial MLE
    alpha, b = mle_iid_nbinom(n)

    # Mixture model MLE
    alpha1, b1, alpha2, b2, w = mle_mix(n, 0.2)

    # Store results in data frame
    df_mle = pl.concat(
        (
            df_mle,
            pl.DataFrame(
                data=[[gene, alpha, b, alpha1, b1, alpha2, b2, w]],
                schema=df_mle.schema,
                orient="row",
            ),
        )
    )

# Take a look
df_mle

| gene | alpha | b | alpha1 | b1 | alpha2 | b2 | w |
|---|---|---|---|---|---|---|---|
| str | f64 | f64 | f64 | f64 | f64 | f64 | f64 |
| "Rex1" | 1.634562 | 84.680915 | 3.497009 | 4.104916 | 5.089625 | 31.810375 | 0.160422 |
| "Rest" | 4.530335 | 16.543054 | 2.786601 | 12.395701 | 6.683424 | 11.953265 | 0.108772 |
| "Nanog" | 1.263097 | 69.347842 | 0.834832 | 66.535947 | 4.127488 | 28.133048 | 0.466636 |
| "Prdm14" | 0.552886 | 8.200636 | 2.385858 | 4.747279 | 0.558672 | 4.872751 | 0.210606 |


For each of the two models, we can compute the log likelihood evaluated at the MLEs for the parameters.

In [5]:
# Define functions taking Polars structs for computing log likelihoods
def pl_log_like_iid_nbinom(s):
    return log_like_iid_nbinom((s["alpha"], s["b"]), df[s["gene"]].to_numpy())


def pl_log_like_mix(s):
    return log_like_mix(
        s["alpha1"], s["b1"], s["alpha2"], s["b2"], s["w"], df[s["gene"]].to_numpy()
    )


# Apply the functions
df_mle = df_mle.with_columns(
    # Single negative binomial
    pl.struct(["alpha", "b", "gene"])
    .map_elements(pl_log_like_iid_nbinom, return_dtype=float)
    .alias("log_like_single"),

    # Mixture model
    pl.struct(["alpha1", "b1", "alpha2", "b2", "w", "gene"])
    .map_elements(pl_log_like_mix, return_dtype=float)
    .alias("log_like_mix"),
)

# Take a look
df_mle

| gene | alpha | b | alpha1 | b1 | alpha2 | b2 | w | log_like_single | log_like_mix |
|---|---|---|---|---|---|---|---|---|---|
| str | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 |
| "Rex1" | 1.634562 | 84.680915 | 3.497009 | 4.104916 | 5.089625 | 31.810375 | 0.160422 | -1638.678482 | -1590.353743 |
| "Rest" | 4.530335 | 16.543054 | 2.786601 | 12.395701 | 6.683424 | 11.953265 | 0.108772 | -1376.748398 | -1372.108896 |
| "Nanog" | 1.263097 | 69.347842 | 0.834832 | 66.535947 | 4.127488 | 28.133048 | 0.466636 | -1524.928918 | -1512.444558 |
| "Prdm14" | 0.552886 | 8.200636 | 2.385858 | 4.747279 | 0.558672 | 4.872751 | 0.210606 | -713.091587 | -712.702876 |


We can already see a very large difference between the two models' log likelihoods evaluated at the MLE for Rex1, but not much difference for Prdm14. The mixture model has $p = 5$ parameters, while the single Negative Binomial model has $p = 2$. With these numbers, we can compute the AIC and then also the Akaike weights.
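As a worked example for a single gene, take the displayed (rounded) log likelihoods for Rex1:

\begin{align}
\text{AIC}_\mathrm{single} &= -2(-1638.678482) + 2\cdot 2 \approx 3281.357,\\
\text{AIC}_\mathrm{mix} &= -2(-1590.353743) + 2\cdot 5 \approx 3190.707.
\end{align}

The mixture model pays a penalty of 6 for its three additional parameters, but its log likelihood is higher by about 48.3, so its AIC is lower by about 90.6.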

In [6]:
df_mle = df_mle.with_columns(
    (-2 * (pl.col('log_like_single') - 2)).alias('AIC_single'),
    (-2 * (pl.col('log_like_mix') - 5)).alias('AIC_mix'),
)

# Take a look
df_mle

| gene | alpha | b | alpha1 | b1 | alpha2 | b2 | w | log_like_single | log_like_mix | AIC_single | AIC_mix |
|---|---|---|---|---|---|---|---|---|---|---|---|
| str | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 |
| "Rex1" | 1.634562 | 84.680915 | 3.497009 | 4.104916 | 5.089625 | 31.810375 | 0.160422 | -1638.678482 | -1590.353743 | 3281.356963 | 3190.707487 |
| "Rest" | 4.530335 | 16.543054 | 2.786601 | 12.395701 | 6.683424 | 11.953265 | 0.108772 | -1376.748398 | -1372.108896 | 2757.496797 | 2754.217792 |
| "Nanog" | 1.263097 | 69.347842 | 0.834832 | 66.535947 | 4.127488 | 28.133048 | 0.466636 | -1524.928918 | -1512.444558 | 3053.857837 | 3034.889116 |
| "Prdm14" | 0.552886 | 8.200636 | 2.385858 | 4.747279 | 0.558672 | 4.872751 | 0.210606 | -713.091587 | -712.702876 | 1430.183173 | 1435.405751 |


Finally, we can compute the Akaike weight for the model with a single Negative Binomial (the weight for the mixture model is $1-w_\mathrm{single}$).
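With only two models, the weight formula reduces to a logistic function of the AIC difference. Dividing the numerator and denominator by the numerator (the common constant cancels) gives

\begin{align}
w_\mathrm{single} = \frac{\mathrm{e}^{-\text{AIC}_\mathrm{single}/2}}{\mathrm{e}^{-\text{AIC}_\mathrm{single}/2} + \mathrm{e}^{-\text{AIC}_\mathrm{mix}/2}} = \frac{1}{1 + \mathrm{e}^{(\text{AIC}_\mathrm{single} - \text{AIC}_\mathrm{mix})/2}}.
\end{align}

Using the AIC values above for Prdm14, $\text{AIC}_\mathrm{single} - \text{AIC}_\mathrm{mix} \approx -5.22$, so $w_\mathrm{single} \approx 1/(1 + \mathrm{e}^{-2.61}) \approx 0.93$.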

In [7]:
df_mle = df_mle.with_columns(
    max_AIC := pl.max_horizontal(pl.col('AIC_single', 'AIC_mix')).alias('max_AIC'),
    num := (-(pl.col('AIC_single') - max_AIC) / 2).exp().alias('num'),
    (num / (num + (-(pl.col('AIC_mix') - max_AIC) / 2).exp())).alias('w_single')
).select(
    pl.exclude('max_AIC', 'num')
)

# Look at Akaike weights
df_mle[['gene', 'w_single']]

| gene | w_single |
|---|---|
| str | f64 |
| "Rex1" | 2.0688e-20 |
| "Rest" | 0.162533 |
| "Nanog" | 7.6e-05 |
| "Prdm14" | 0.931585 |


Looking at the Akaike weight for the mixture (1 - `w_single`), it is clear that the mixture model is strongly preferred for Rex1 and Nanog. There is no strong preference for Rest, where the mixture weight is about 0.84, and the single Negative Binomial model is preferred for Prdm14. Reminding ourselves of the ECDFs, this makes sense.

In [8]:
genes = ["Nanog", "Prdm14", "Rest", "Rex1"]

plots = [
    iqplot.ecdf(
        data=df[gene].to_numpy(),
        q=gene,
        x_axis_label="mRNA count",
        title=gene,
        frame_height=150,
        frame_width=200,
    )
    for gene in genes
]

bokeh.io.show(
    bokeh.layouts.column(bokeh.layouts.row(*plots[:2]), bokeh.layouts.row(*plots[2:]))
)

Rex1 is clearly bimodal, and Nanog appears to have a second inflection point where the ECDF reaches a value of about 0.4, which is consistent with the MLE for the mixture weight ($w \approx 0.47$). Rest and Prdm14 both appear to be unimodal, agreeing with what we saw in the AIC analysis.

Note that this underscores something we've been stressing all along. You should do good exploratory data analysis first, and the EDA often tells much of the story!

### Caveat

Remember, though, that we did *not* take into account that the measurements of the four genes were done in the same cells. We modeled that when we presented the mixture models at the beginning of this lesson. The analysis of a more complicated model with MLE proved to be out of reach due to computational difficulty. So, we should not draw strong conclusions about what the relative quality of the mixture and single Negative Binomial models means in this context. We will address these kinds of modeling issues in the sequel of this course.

## AIC for the spindle model

We can do a similar analysis for the two competing models for mitotic spindle size. We need our functions from earlier.

In [9]:
def theor_spindle_length(gamma, phi, d):
    """Compute spindle length using mathematical model"""
    return gamma * d / np.cbrt(1 + (gamma * d / phi)**3)


def log_likelihood(params, d, ell):
    """Log likelihood of spindle length model."""
    gamma, phi, sigma = params

    if gamma <= 0 or gamma > 1 or phi <= 0:
        return -np.inf

    mu = theor_spindle_length(gamma, phi, d)
    return np.sum(st.norm.logpdf(ell, mu, sigma))


def spindle_mle(d, ell):
    """Compute MLE for parameters in spindle length model."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        res = scipy.optimize.minimize(
            fun=lambda params, d, ell: -log_likelihood(params, d, ell),
            x0=np.array([0.5, 35, 5]),
            args=(d, ell),
            method='Powell'
        )

    if res.success:
        return res.x
    else:
        raise RuntimeError('Convergence failed with message', res.message)        
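To build intuition for the second model, it helps to look at the two limits of `theor_spindle_length`. The sketch below copies the function from the cell above and evaluates it at illustrative parameter values, which are assumptions chosen for this example and not fitted values. For small droplets the spindle length is approximately $\gamma d$, and for large droplets it saturates at $\phi$, which is the constant-length behavior of the first model.

```python
import numpy as np


def theor_spindle_length(gamma, phi, d):
    """Spindle length model, copied from the lesson's definition."""
    return gamma * d / np.cbrt(1 + (gamma * d / phi)**3)


# Illustrative parameter values (assumed for this sketch, not fitted)
gamma, phi = 0.8, 38.0

# Small droplets: length grows approximately linearly, as gamma * d
small_d = 1.0
small_length = theor_spindle_length(gamma, phi, small_d)

# Large droplets: length saturates at phi
large_d = 1e4
large_length = theor_spindle_length(gamma, phi, large_d)
```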

We can now perform MLE to get the parameters for each model and store the results.

In [10]:
df = pl.read_csv(os.path.join(data_path, "good_invitro_droplet_data.csv"), comment_prefix="#")

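# Model 1: location and scale estimated directly from the data.
# Note: Polars' std() defaults to ddof=1 (sample standard deviation); the
# strict MLE uses ddof=0, a small difference when there are many measurements.
mle_1 = df.select(
    pl.col("Spindle Length (um)").mean().alias('phi_1'), 
    pl.col("Spindle Length (um)").std().alias('sigma_1')
)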

mle_2 = pl.DataFrame(
    data=spindle_mle(
        df["Droplet Diameter (um)"].to_numpy(), 
        df["Spindle Length (um)"].to_numpy()
    ).reshape((1, 3)),
    orient='row',
    schema=['gamma', 'phi_2', 'sigma_2']
)
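For i.i.d. Normal measurements, the MLE for the location parameter is the sample mean and the MLE for the scale parameter is the standard deviation computed with a $1/n$ normalization (`ddof=0`). The sketch below uses synthetic data, generated only for illustration and unrelated to the spindle measurements, to show how little the log likelihood changes when the $1/(n-1)$ version (`ddof=1`) is used instead.

```python
import numpy as np
import scipy.stats as st

# Synthetic data for illustration only (not the spindle measurements)
rng = np.random.default_rng(3252)
x = rng.normal(loc=35.0, scale=4.0, size=50)

mu_mle = np.mean(x)
sigma_mle = np.std(x, ddof=0)     # MLE: divide by n
sigma_sample = np.std(x, ddof=1)  # sample std: divide by n - 1

log_like_mle = st.norm.logpdf(x, mu_mle, sigma_mle).sum()
log_like_sample = st.norm.logpdf(x, mu_mle, sigma_sample).sum()

# Difference is small and shrinks as the number of data points grows
diff = log_like_mle - log_like_sample
```

The difference is always nonnegative, since the `ddof=0` estimate maximizes the likelihood, and it is roughly $1/(4n)$ for large $n$.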

Next, we can compute the log likelihood evaluated at the MLE.

In [11]:
log_like_1 = st.norm.logpdf(
    df["Spindle Length (um)"], 
    mle_1["phi_1"].item(), 
    mle_1["sigma_1"].item()
).sum()

log_like_2 = log_likelihood(
    mle_2.to_numpy().flatten(),
    df["Droplet Diameter (um)"],
    df["Spindle Length (um)"],
)

# Take a look
log_like_1, log_like_2

(np.float64(-1999.5179249272933), np.float64(-1837.1589821363168))

The log likelihood for model 2, with spindle size depending on droplet diameter, is much greater than for model 1. And now we can compute the AIC, noting that there are two parameters for model 1 and three for model 2.

In [12]:
AIC_1 = -2 * (log_like_1 - 2)
AIC_2 = -2 * (log_like_2 - 3)

# Look at the AICs
AIC_1, AIC_2

(np.float64(4003.0358498545866), np.float64(3680.3179642726336))

There is a massive disparity in the AICs, so we know that model 2 is strongly preferred. Nonetheless, we can compute the Akaike weight for model 1 to compare.

In [13]:
AIC_max = max(AIC_1, AIC_2)
numerator = np.exp(-(AIC_1 - AIC_max)/2)
denominator = numerator + np.exp(-(AIC_2 - AIC_max)/2)
w_1 = numerator / denominator

# Check the Akaike weight for model 1
w_1

np.float64(8.369539052514859e-71)

Model 1 is completely out of the question, with a tiny Akaike weight!
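When AIC differences are this large, it can be more convenient to work with the logarithm of the Akaike weights, since a sufficiently large difference would make the weight underflow to zero in floating point. The following is a minimal sketch of one way to do that, using `scipy.special.logsumexp` and the two AIC values reported above. The variable names are chosen for this example.

```python
import numpy as np
import scipy.special

# AIC values reported above for models 1 and 2
aics = np.array([4003.0358498545866, 3680.3179642726336])

# log w_i = -AIC_i / 2 - log(sum_j exp(-AIC_j / 2))
log_weights = -aics / 2 - scipy.special.logsumexp(-aics / 2)

# Convert to base 10 for easier reading
log10_w_1 = log_weights[0] / np.log(10)
```

Here `log10_w_1` is the base-10 logarithm of the Akaike weight for model 1, which can be compared directly with the exponent of the weight computed above.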

## Computing environment

In [14]:
%load_ext watermark
%watermark -v -p numpy,polars,scipy,bokeh,iqplot,bebi103,jupyterlab

Python implementation: CPython
Python version       : 3.13.5
IPython version      : 9.4.0

numpy     : 2.2.6
polars    : 1.31.0
scipy     : 1.16.0
bokeh     : 3.7.3
iqplot    : 0.3.7
bebi103   : 0.1.28
jupyterlab: 4.4.5

