Correlated Features#

This notebook is part of the PyMC port of the Statistical Rethinking 2023 lecture series by Richard McElreath.

Video - Lecture 14 - Correlated Features#

# Ignore warnings
import warnings

import arviz as az
import numpy as np
import pandas as pd
import pymc as pm
import statsmodels.formula.api as smf
import utils as utils
import xarray as xr

from matplotlib import pyplot as plt
from matplotlib import style
from scipy import stats as stats

warnings.filterwarnings("ignore")

# Set matplotlib style
STYLE = "statistical-rethinking-2023.mplstyle"
style.use(STYLE)

Bangladesh Fertility Revisited#

utils.draw_causal_graph(
    edge_list=[
        ("A", "C"),
        ("K", "C"),
        ("U", "C"),
        ("D", "C"),
        ("D", "U"),
        ("U", "K"),
        ("A", "K"),
        ("D", "K"),
    ]
)
../_images/d2608eeb7524c6ec6d1275f831987c746beb960c009a0f984b13887d52634329.svg
  • Estimand 1: Contraceptive use \(C\) in each district \(D\)

  • Estimand 2: The effect of "urbanity" \(U\)

  • Estimand 3: The effects of number of children \(K\) and age \(A\)

Districts Are Clusters#

Rural/Urban Breakdown#

Below we show contraceptive use for the rural group (which includes only the district-level intercept \(\alpha_{D[i]}\)) and the urban group (which includes an additional offset \(\beta_{D[i]}\) for urban areas). These plots were generated in Lecture 13 - Multilevel Adventures.

utils.display_image("fertility_posterior_means_rural_urban.png", width=1000)
../_images/c04c90f91b88082f5d184c444693f85d8367ee305b5a072718ea159b8e848b95.png

The plot shows that

  • On average, rural populations are less likely to use contraception than those in urban areas

    • the dashed line in the top plot is below the dashed line in the bottom plot

  • There are fewer urban samples, so the uncertainty intervals for urban areas are wider

    • the blue error bars associated with urban populations are larger than those for rural areas, where more surveys were collected

Urbanization Is a Correlated Feature#

Below we plot the probability of contraceptive use in urban areas \(p(C | U=1)\) against that in rural areas \(p(C|U=0)\). These plots were generated in Lecture 13 - Multilevel Adventures.

utils.display_image("fertility_p_C_rural_urban.png")
../_images/d747080f4d2af2ce9ee69cc874f813fc38b69c637532df15de2d52ed16986fc1.png

The plot shows that contraceptive use is correlated across the urban/rural observations (cc>0.7).

🤔#

  • If we know a district's rural contraceptive use rate, this correlation lets us make a better guess about its urban rate (the sketch below makes this concrete)

  • We haven't let the Golem (i.e., our statistical model) exploit this correlation between features, so we're wasting information

The focus of this lecture is building statistical models that can capture the correlated information among features.

  • Partial pooling across features makes inference more efficient
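To make the "better guess" intuition concrete, here is a small sketch with made-up numbers (not the Bangladesh data): for a bivariate Gaussian, observing one correlated component shrinks our uncertainty about the other by a factor of \(\sqrt{1 - \rho^2}\).

# Sketch with made-up numbers: if rural and urban use-rates (on the logit scale)
# are correlated with rho = 0.7, observing the rural value shrinks the conditional
# std of the urban value by sqrt(1 - rho^2), a standard bivariate-Gaussian result
rho = 0.7
sigma_urban = 1.0  # illustrative prior std for the urban parameter

conditional_std = sigma_urban * np.sqrt(1 - rho**2)
print(f"urban std before observing rural: {sigma_urban:.2f}")
print(f"urban std after observing rural:  {conditional_std:.2f}")  # ~0.71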

Estimating Correlations and Partial Pooling Demo

McElreath goes on to show a demo of Bayesian updating in a model with correlated features. It's excellent, and I recommend revisiting it multiple times; but I'm too lazy to implement it (and not clever enough to do it without animations, which I'm trying to avoid). It's been added to the TODO list.

Adding Correlated Features#

  • One prior distribution per cluster/subgroup \(\rightarrow a_j \sim \text{Normal}(\bar a , \sigma)\); enables partial pooling

  • One feature: a 1-dimensional prior

    • \(a_j \sim \text{Normal}(\bar a, \sigma)\)

  • N features: an N-dimensional distribution

    • We use a joint distribution over the features, rather than independent distributions (as we did in previous lectures)

    • \(a_{j, 1..N} \sim \text{MVNormal}(A, \Sigma)\)

    • e.g., \([\alpha_j, \beta_j] \sim \text{MVNormal}([\bar \alpha, \bar \beta], \Sigma_{\alpha, \beta})\)

  • The hard part: learning the associations/correlations among features

    • "Estimating means is easy"

    • "Estimating standard deviations is hard"

    • "Estimating correlations is very hard"

    • That said, it's usually a good idea to try to learn the correlations

      • If you don't have enough samples/signal, you just fall back to the prior

Previous Model – Uncorrelated Urban Features#

utils.draw_causal_graph(
    edge_list=[
        ("A", "C"),
        ("K", "C"),
        ("U", "C"),
        ("D", "C"),
        ("D", "U"),
        ("U", "K"),
        ("A", "K"),
        ("D", "K"),
    ],
    node_props={"U": {"color": "red"}, "D": {"color": "red"}, "C": {"color": "red"}},
    edge_props={
        ("U", "K"): {"color": "red"},
        ("U", "C"): {"color": "red"},
        ("K", "C"): {"color": "red"},
    },
)
../_images/d41c36842da03db7638aa281bfadc3a46fd823fdcf2ecbd7e376da53a84feb30.svg

This is the non-centered implementation, more or less copy-pasted from the previous lecture.

FERTILITY = utils.load_data("bangladesh")

USES_CONTRACEPTION = FERTILITY["use.contraception"].values.astype(int)
DISTRICT_ID, _ = pd.factorize(FERTILITY.district)
DISTRICT = np.arange(1, 62).astype(int)  # note: district 54 has no data, so we create its coordinate by hand
URBAN_ID, URBAN = pd.factorize(FERTILITY.urban, sort=True)

with pm.Model(coords={"district": DISTRICT}) as uncorrelated_model:

    # Mutable data
    urban = pm.Data("urban", URBAN_ID)

    # Priors -- priors for $\alpha$ and $\beta$ are separate, independent Normal distributions
    # District offset
    alpha_bar = pm.Normal("alpha_bar", 0, 1)  # the average district
    sigma = pm.Exponential("sigma", 1)  # variation amongst districts

    # Uncentered parameterization
    z_alpha = pm.Normal("z_alpha", 0, 1, dims="district")
    alpha = pm.Deterministic("alpha", alpha_bar + z_alpha * sigma, dims="district")

    # District / urban interaction
    beta_bar = pm.Normal("beta_bar", 0, 1)  # the average urban effect
    tau = pm.Exponential("tau", 1)  # variation amongst urban

    # Uncentered parameterization
    z_beta = pm.Normal("z_beta", 0, 1, dims="district")
    beta = pm.Deterministic("beta", beta_bar + z_beta * tau, dims="district")

    # Record p(contraceptive)
    p_C = pm.Deterministic("p_C", pm.math.invlogit(alpha + beta))
    p_C_urban = pm.Deterministic("p_C_urban", pm.math.invlogit(alpha + beta))
    p_C_rural = pm.Deterministic("p_C_rural", pm.math.invlogit(alpha))

    # Likelihood
    p = pm.math.invlogit(alpha[DISTRICT_ID] + beta[DISTRICT_ID] * urban)
    C = pm.Bernoulli("C", p=p, observed=USES_CONTRACEPTION)

    uncorrelated_inference = pm.sample(target_accept=0.95)
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha_bar, sigma, z_alpha, beta_bar, tau, z_beta]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 10 seconds.

Demonstrating That Feature Priors Are Independent in the Uncorrelated Model#

Below we sample from the prior and show that the \(\alpha\) and \(\beta\) parameters for all districts are aligned with the cardinal axes, indicating no correlation.

from scipy.linalg import sqrtm


def plot_2d_gaussian_ci(R, mean=[0, 0], std=[1, 1], ci_prob=0.89, **plot_kwargs):
    """
    Plot the `ci_prob`% confidence interval for a 2D Gaussian defined
    by correlation matrix `R`, center `mean`, and standard deviations
    `std` along the marginals.
    """
    # Create a circle with radius set by the requested probability mass
    angles = np.linspace(0, 2 * np.pi, 100)
    radius = stats.norm.ppf(ci_prob)
    circle = radius * np.vstack([np.cos(angles), np.sin(angles)])

    # Warp the circle using the covariance matrix implied by R and std
    D = np.diag(std)
    cov = D @ R @ D
    ellipse = sqrtm(cov) @ circle
    plt.plot(ellipse[0] + mean[0], ellipse[1] + mean[1], linewidth=3, **plot_kwargs)
with uncorrelated_model:
    prior_pred = pm.sample_prior_predictive()

districts = [1, 10, 30, 61]
n_districts = len(districts)
fig, axs = plt.subplots(n_districts, 1, figsize=(6, n_districts * 4))
for ii, district in enumerate(districts):
    district_prior = prior_pred.prior.sel(district=district)
    plt.sca(axs[ii])
    az.plot_dist(
        district_prior["alpha"],
        district_prior["beta"],
        ax=axs[ii],
        contourf_kwargs={"cmap": "Reds"},
    )
    plt.xlim([-4, 4])
    plt.ylim([-4, 4])
    plt.xlabel("$\\alpha$")
    plt.ylabel("$\\beta$")
    plt.title(f"District {district} Prior")
Sampling: [C, alpha_bar, beta_bar, sigma, tau, z_alpha, z_beta]
../_images/0dfb290c0632efc27c95710bc7ed37d31b9190ef820407281c73fc15d71d638a.png

Above we can see the priors provided by the uncorrelated model. We can tell they are uncorrelated because

  • the principal axes of the distributions are aligned with the axes of the parameter values

    • i.e., there is no "tilt" in the PDF contours (verified numerically below)
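As a quick numeric check (a sketch using the prior samples drawn above; district 1 is chosen arbitrarily), the correlation between the \(\alpha\) and \(\beta\) prior samples should be approximately zero:

# Sketch: alpha and beta prior samples should be (nearly) uncorrelated in this model
alpha_samples = prior_pred.prior["alpha"].sel(district=1).values.flatten()
beta_samples = prior_pred.prior["beta"].sel(district=1).values.flatten()
print(np.corrcoef(alpha_samples, beta_samples)[0, 1])  # expect a value near 0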

Multivariate Normal Prior#

\[\begin{split} \begin{align*} [\alpha_j, \beta_j] &\sim \text{MVNormal}([\bar \alpha, \bar \beta], R, [\sigma, \tau]) \\ [\alpha_j, \beta_j] &:= \text{features of district } j \\ R &:= \text{correlation matrix among the features} \\ [\sigma, \tau] &:= \text{(independent) standard deviations for each feature} \\ \end{align*} \end{split}\]

Samples from the Multivariate Prior#

np.random.seed(123)
sigma = 1.0
tau = 2.0
R = np.array([[1, 0.8], [0.8, 1]])  # correlation matrix


stds = np.array([sigma, tau])
D = np.diag(stds)
Sigma = D @ R @ D  # convert correlation matrix to covariance matrix
mvnorm = stats.multivariate_normal(cov=Sigma)


def mvn_norm_pdf(xs, ys):
    return mvnorm.pdf(np.vstack([xs, ys]).T)


RESOLUTION = 100
xs = ys = np.linspace(-4, 4, RESOLUTION)
utils.plot_2d_function(xs, ys, mvn_norm_pdf, cmap="Reds")

n_samples = 50
samples = mvnorm.rvs(n_samples)
utils.plot_scatter(samples[:, 0], samples[:, 1], color="darkred", alpha=1, label="samples")
plt.xlabel("$\\alpha$")
plt.ylabel("$\\beta$")
plt.xlim([-4, 4])
plt.ylim([-4, 4])
plt.legend()

print(f"Standard deviations\n{stds}")
print(f"Correlation Matrix\n{R}")
print(f"Covariance Matrix\n{Sigma}")
print(f"Cholesky Matrix\n{np.linalg.cholesky(Sigma)}")
Standard deviations
[1. 2.]
Correlation Matrix
[[1.  0.8]
 [0.8 1. ]]
Covariance Matrix
[[1.  1.6]
 [1.6 4. ]]
Cholesky Matrix
[[1.  0. ]
 [1.6 1.2]]
../_images/1692ac6a3f65d929cb36fadb59f748e85a924c9777a9e8007f2726cd566e13b8.png

Model with Correlated Urban Features#

\[\begin{split} \begin{align*} C_i &\sim \text{Bernoulli}(p_i) \\ \text{logit}(p_i) &= \alpha_{D[i]} + \beta_{D[i]} U_i \\ [\alpha_j, \beta_j] &\sim \text{MVNormal}([\bar \alpha, \bar \beta], R, [\sigma, \tau]) \\ \sigma &\sim \text{Exponential}(1) \\ \tau &\sim \text{Exponential}(1) \\ R &\sim \text{LKJCorr}(4) \end{align*} \end{split}\]

On the LKJCorr Prior#

  • Named after Lewandowski, Kurowicka, and Joe, the authors of the original paper on the distribution

  • A prior over correlation matrices

  • Takes a shape parameter \(\eta>0\)

    • \(\eta=1\) gives a uniform distribution over correlation matrices

    • as \(\eta \rightarrow \infty\), R approaches the identity matrix

  • Varying \(\eta\) changes the distribution over correlation matrices, as shown below

etas = [1, 4, 10, 20, 100]
for ii, eta in enumerate(etas):
    rv = pm.LKJCorr.dist(n=2, eta=eta)
    samples = pm.draw(rv, 1000)
    az.plot_dist(samples, color=f"C{ii}", label=f"LKJCorr$(\\eta={eta})$")
plt.legend()
plt.xlabel("correlation")
plt.ylabel("density")
plt.title("LKJCorr distribution of correlations");
../_images/cbd2998d0eada611f7e6ffe4d4a5e995b9cd363ab3685af21e19194940f2eaaa.png

Correlation Matrices Sampled from LKJCorr#

from scipy.linalg import sqrtm


def sample_correlation_matrix_LKJ(dim=2, eta=4, n_samples=1):

    # LKJCorr returns correlations as upper-triangular matrix entries
    rv = pm.LKJCorr.dist(n=dim, eta=eta)
    samples = pm.draw(rv, n_samples)

    # Convert upper-triangular entries to full correlation matrices
    # (there's likely a way to vectorize this, but meh, it's a demo)
    corrs = []
    upper_idx = np.triu_indices(n=dim, k=1)
    for sample in samples:
        corr = np.eye(dim) * 0.5  # half the diagonal; doubled when adding the transpose
        corr[upper_idx[0], upper_idx[1]] = sample
        corr += corr.T
        corrs.append(corr)
    return np.array(corrs)


def sample_LKJCorr_prior(eta=1, n_samples=20, ax=None):
    plt.sca(ax)
    for _ in range(n_samples):

        R = sample_correlation_matrix_LKJ(eta=eta)

        # Sample a random standard deviation along each dimension
        std = stats.expon(0.1).rvs(2)  # exponential (loc=0.1 keeps the stds away from 0)
        # std = np.random.rand(2) * 10  # uniform
        # std = np.abs(np.random.randn(2))  # half-normal

        plot_2d_gaussian_ci(R, std=std, color="C0", alpha=0.25)

    plt.xlim([-2, 2])
    plt.ylim([-2, 2])
    plt.xlabel("$\\alpha$")
    plt.ylabel("$\\beta$")
    plt.title(f"Random Correlations Sampled\nfrom LKJCorr({eta}) prior");
etas = [1, 4, 10, 1000]
n_etas = len(etas)
_, axs = plt.subplots(n_etas, 1, figsize=(6, 6 * n_etas))
for ii, eta in enumerate(etas):
    sample_LKJCorr_prior(eta=eta, ax=axs[ii])
../_images/64867a0633348143145ae07dd35135c445b5a97898a86256333373a9b9615b55.png

We can see that as \(\eta\) grows, the principal axes of the sampled correlation matrices become more aligned with the axes of the parameter space, giving samples with more diagonal covariance.

Model with Correlated Features#

A Few Caveats ⚠️#

  • We jump directly to the implementation that uses non-centered priors, skipping the centered version (a rough sketch of a centered version follows this list). For details on non-centered implementations, see the PyMC documentation.

  • We use LKJCholeskyCov instead of LKJCorr for better numerical stability and performance. For details, see https://discourse.pymc.io/t/uses-of-lkjcholeskycov-and-lkjcorr/3629/3

  • We don't need to implement the Chol_to_Corr function from the lecture, because we can extract the correlations directly from the LKJCholeskyCov distribution
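For reference, a centered version of the joint prior could look roughly like the sketch below. This is illustrative only; it mirrors the hyperpriors of the model that follows, and is not the implementation used in the lecture or below.

# Sketch (illustrative, not used below): centered parameterization of the joint prior.
# District features [alpha_j, beta_j] are drawn directly from the MVNormal rather
# than being built from standardized z-scores.
with pm.Model(coords={"district": DISTRICT}) as centered_sketch:
    chol, corr, stds = pm.LKJCholeskyCov(
        "Rho", eta=4, n=2, sd_dist=pm.Exponential.dist(1, size=2)
    )
    ab_bar = pm.Normal("ab_bar", 0, 1, shape=2)  # [alpha_bar, beta_bar]
    ab = pm.MvNormal("ab", mu=ab_bar, chol=chol, shape=(61, 2))
    alpha, beta = ab[:, 0], ab[:, 1]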

N_CORRELATED_FEATURES = 2
N_DISTRICTS = 61
ETA = 4

with pm.Model(coords={"district": DISTRICT}) as correlated_model:

    # Mutable data for posterior predictions
    urban = pm.Data("urban", URBAN_ID)

    # Priors
    # Feature correlation
    chol, corr, stds = pm.LKJCholeskyCov(
        "Rho",
        eta=ETA,
        n=N_CORRELATED_FEATURES,
        sd_dist=pm.Exponential.dist(1, size=N_CORRELATED_FEATURES),  # variation for params
    )

    # Uncentered parameterization for each variable
    z = pm.Normal("z", 0, 1, shape=(N_DISTRICTS, N_CORRELATED_FEATURES))
    v = pm.Deterministic("v", chol.dot(z.T).T)

    alpha_bar = pm.Normal("alpha_bar", 0, 1)  # the average district
    alpha = pm.Deterministic("alpha", alpha_bar + v[:, 0])

    beta_bar = pm.Normal("beta_bar", 0, 1)  # the average urban effect
    beta = pm.Deterministic("beta", beta_bar + v[:, 1])

    # Record values for reporting
    pm.Deterministic("feature_cov", chol.dot(chol.T))
    pm.Deterministic("feature_corr", corr)
    pm.Deterministic("feature_std", stds)
    pm.Deterministic("p_C", pm.math.invlogit(alpha + beta), dims="district")
    pm.Deterministic("p_C_urban", pm.math.invlogit(alpha + beta), dims="district")
    pm.Deterministic("p_C_rural", pm.math.invlogit(alpha), dims="district")

    # Likelihood
    p = pm.math.invlogit(alpha[DISTRICT_ID] + beta[DISTRICT_ID] * urban)
    C = pm.Bernoulli("C", p=p, observed=USES_CONTRACEPTION)

    correlated_inference = pm.sample(target_accept=0.95)
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [Rho, z, alpha_bar, beta_bar]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 11 seconds.

Analyzing Feature Correlations#

correlated_posterior_mean = correlated_inference.posterior.mean(dim=("chain", "draw"))
print(f"Feature Covariance\n", correlated_posterior_mean["feature_cov"].values)
print(f"Feature Correlation\n", correlated_posterior_mean["feature_corr"].values)
print(f"Feature Std\n", correlated_posterior_mean["feature_std"].values)
Feature Covariance
 [[ 0.32032104 -0.24587394]
 [-0.24587394  0.60162453]]
Feature Correlation
 [[ 1.         -0.55103506]
 [-0.55103506  1.        ]]
Feature Std
 [0.55748164 0.74948169]
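As a sanity check, the sketch below rebuilds the covariance from the posterior-mean standard deviations and correlations via \(\text{diag}(\sigma) R \, \text{diag}(\sigma)\). Note the match is only approximate, since the posterior mean of a product isn't the product of posterior means.

# Sketch: reconstruct the covariance from posterior-mean stds and correlations;
# only approximately equal to the Feature Covariance printed above
D = np.diag(correlated_posterior_mean["feature_std"].values)
R = correlated_posterior_mean["feature_corr"].values
print(D @ R @ D)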

Comparison with the Prior#

with correlated_model:
    # Add prior samples
    correlated_prior_predictive = pm.sample_prior_predictive()
Sampling: [C, Rho, alpha_bar, beta_bar, z]
def plot_posterior_mean_alpha_beta(posterior_mean, title=None):
    alphas = posterior_mean["alpha"].values
    betas = posterior_mean["beta"].values
    alpha_mean = posterior_mean["alpha"].mean().values
    beta_mean = posterior_mean["beta"].mean().values

    # Correlated feature params
    if "feature_corr" in posterior_mean:
        posterior_R = posterior_mean["feature_corr"].values
        posterior_std = posterior_mean["feature_std"].values

    # Independent feature params
    else:
        cc = np.corrcoef(alphas, betas)[0][1]
        posterior_R = np.array([[1, cc], [cc, 1]])
        posterior_std = np.array([posterior_mean["sigma"], posterior_mean["tau"]])

    plot_2d_gaussian_ci(R=posterior_R, mean=(alpha_mean, beta_mean), std=posterior_std, color="C0")

    utils.plot_scatter(
        xs=posterior_mean["alpha"],
        ys=posterior_mean["beta"],
        color="k",
        label="district param means",
    )

    plt.axvline(0, color="gray", linestyle="--")
    plt.axhline(0, color="gray", linestyle="--")
    plt.xlim([-2, 1])
    plt.ylim([-2, 2])
    plt.xlabel("$\\alpha$")
    plt.ylabel("$\\beta$")
    plt.legend()
    plt.title(title)


_, axs = plt.subplots(1, 2, figsize=(10, 5))

# Plot correlation distribution for posterior and prior
plt.sca(axs[0])
plot_kwargs = {"linewidth": 3}
az.plot_dist(
    correlated_inference.posterior["feature_corr"][:, :, 0, 1],
    label="posterior",
    plot_kwargs=plot_kwargs,
    ax=axs[0],
)
az.plot_dist(
    correlated_prior_predictive.prior["feature_corr"][:, :, 0, 1],
    color="k",
    label="prior",
    plot_kwargs=plot_kwargs,
    ax=axs[0],
)

plt.axvline(0, color="gray", linestyle="--")
plt.xlim([-1, 1])
plt.xlabel("correlation between $\\alpha$ and $\\beta$")
plt.ylabel("density")
plt.legend()

plt.sca(axs[1])

plot_posterior_mean_alpha_beta(correlated_posterior_mean)
plt.suptitle("Correlated Features", fontsize=18);
../_images/8aedf05830795d6e81553baa1d1d11019821f0b67521c03cd2517908936fa7bd.png
uncorrelated_posterior_mean = uncorrelated_inference.posterior.mean(dim=("chain", "draw"))

_, axs = plt.subplots(1, 2, figsize=(10, 5))

plt.sca(axs[0])
plot_posterior_mean_alpha_beta(uncorrelated_posterior_mean, title="uncorrelated features")

plt.sca(axs[1])
plot_posterior_mean_alpha_beta(correlated_posterior_mean, title="correlated features")
../_images/4047abc09cca4b1d5d4ccec86068f7056427f1d85b03d5ce5b2fc2a78d310549.png

Compared with the correlated-features model, the model with uncorrelated features exhibits a much weaker negative correlation between features (i.e., the red ellipse on the left "tilts downward" less).
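A sketch quantifying that tilt: the correlation between the posterior-mean district \(\alpha\)s and \(\beta\)s in each model (a summary of the point estimates, not the full posterior):

# Sketch: compare the implied alpha-beta correlation across the two models
for name, pmean in [
    ("uncorrelated", uncorrelated_posterior_mean),
    ("correlated", correlated_posterior_mean),
]:
    cc = np.corrcoef(pmean["alpha"].values, pmean["beta"].values)[0, 1]
    print(f"{name}: corr(alpha, beta) = {cc:.2f}")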

Comparing the Models on the Outcome Scale#

def plot_p_C_district_rural_urban(posterior_mean, title=None):
    for ii, (label, p_C) in enumerate(zip(["rural", "urban"], ["p_C_rural", "p_C_urban"])):
        color = f"C{ii}"
        plot_data = posterior_mean[p_C]
        utils.plot_scatter(DISTRICT, plot_data, label=label, color=color, alpha=0.8)

        # Add quantile indicators
        ql, median, qu = np.quantile(plot_data, [0.025, 0.5, 0.975])
        error = np.array((median - ql, qu - median))[:, None]
        plt.errorbar(
            x=0,
            y=median,
            yerr=error,
            color=color,
            capsize=5,
            linewidth=2,
            label=f"95%IPR:({ql:0.2f}, {qu:0.2f}))",
        )
        plt.axhline(median, color=color)

    plt.axhline(0.5, color="gray", linestyle="--", label="p=0.5")
    plt.legend(loc="upper right")
    plt.ylim([0, 1])
    plt.xlabel("district")
    plt.ylabel("prob. contraceptive use")
    plt.title(title)


_, axs = plt.subplots(2, 1, figsize=(10, 8))

plt.sca(axs[0])
plot_p_C_district_rural_urban(correlated_posterior_mean, title="correlated features")

plt.sca(axs[1])
plot_p_C_district_rural_urban(uncorrelated_posterior_mean, title="uncorrelated features")
../_images/8fce93ddfbb95f39fddd00c6dd4b63793d30ecb4a5235ecf84f83afe19f5052b.png
  • Modeling the feature correlations lets the model capture urban variance differently for urban areas. The effect is subtle, but the correlated-features model (top) exhibits a smaller inner-percentile range (IPR) (0.45-0.65) than the uncorrelated-features model (0.41-0.65).

  • ⚠️ This is actually the opposite of what's reported in the lecture. My intuition is that the posterior variability should be smaller because we're sharing information across features, thereby reducing our uncertainty. But I could easily be doing or interpreting something incorrectly here.
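The quoted IPRs can be recomputed directly from the posterior means, as in this sketch, which mirrors the quantile computation in the plotting function above:

# Sketch: recompute the 95% IPR of urban p(C) across districts for both models
for name, pmean in [
    ("correlated", correlated_posterior_mean),
    ("uncorrelated", uncorrelated_posterior_mean),
]:
    ql, qu = np.quantile(pmean["p_C_urban"].values, [0.025, 0.975])
    print(f"{name}: 95% IPR = ({ql:.2f}, {qu:.2f})")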

Looking at the Outcome Space#

def plot_pC_rural_urban_xy(posterior_mean, label, color):
    utils.plot_scatter(
        xs=posterior_mean["p_C_rural"],
        ys=posterior_mean["p_C_urban"],
        label=label,
        color=color,
        alpha=0.8,
    )

    plt.axhline(0.5, color="gray", linestyle="--")
    plt.axvline(0.5, color="gray", linestyle="--")
    plt.xlabel("$p(C)$ (rural)")
    plt.ylabel("$p(C)$ (urban)")
    plt.legend()
    plt.axis("square")


def plot_pC_xy_movement(posterior_mean_a, posterior_mean_b):
    xs_a = posterior_mean_a["p_C_rural"]
    xs_b = posterior_mean_b["p_C_rural"]
    ys_a = posterior_mean_a["p_C_urban"]
    ys_b = posterior_mean_b["p_C_urban"]

    for ii, (xa, xb, ya, yb) in enumerate(zip(xs_a, xs_b, ys_a, ys_b)):
        label = "$p(C) $ change" if not ii else None
        plt.plot((xa, xb), (ya, yb), linewidth=0.5, color="C0", label=label)
    plt.legend(loc="upper left")


plot_pC_rural_urban_xy(uncorrelated_posterior_mean, label="uncorrelated features", color="k")
plot_pC_rural_urban_xy(correlated_posterior_mean, label="correlated features", color="C0")
plot_pC_xy_movement(uncorrelated_posterior_mean, correlated_posterior_mean)
../_images/efa42ce1b755c3b3f51f17cdfc51045340cca441318cecd968bbc7334076e8c7.png

Including the Feature Correlations#

  • The points move because information flowing across features provides better estimates of \(p(C)\)

  • There is a negative relationship between rural contraceptive use and the rural-urban difference in use (checked in the sketch below)

    • When a district's rural contraceptive use is high, the difference is smaller (i.e., urban areas will also have high contraceptive use)

    • When rural contraceptive use is low, urban areas will differ more (i.e., urban areas still tend to have higher contraceptive use)
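A small sketch checking that negative relationship in the posterior means:

# Sketch: the correlation between rural p(C) and the urban-rural gap should be negative
rural = correlated_posterior_mean["p_C_rural"].values
urban = correlated_posterior_mean["p_C_urban"].values
print(np.corrcoef(rural, urban - rural)[0, 1])  # expect a value < 0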

Review: Correlated Varying Effects#

Learning a prior over the correlations among features provides the following benefits:

  1. Partial pooling across features – memory; shared information improves efficiency

  2. Learned correlations – often the focus of the research question

💡 Varying effects can be correlated, and can still be learned without a prior that includes correlation structure. However, explicitly including the correlations enables partial pooling, and is thus more efficient, and it provides more explicit information about those correlations.

Bonus: Non-centered (aka Transformed) Priors#

Inconvenient Posteriors#

  • Inefficient MCMC caused by steep curvature in the negative log-likelihood

  • Hamiltonian Monte Carlo has trouble exploring steep surfaces

    • leads to "divergent" transitions

    • "crashing through the walls of the skate park"

    • detected when the Hamiltonian changes dramatically between the start and end of a proposal

  • Transforming the priors into a "smoother skate park" helps

Example: The Devil's Funnel Prior#

\[\begin{split} \begin{align*} v &\sim \text{Normal}(0, \sigma_v) \\ x &\sim \text{Normal}(0, \exp({v})) \end{align*} \end{split}\]

As \(\sigma_v\) increases, a nasty trough forms in the prior probability surface that is difficult for Hamiltonian dynamics to sample; i.e., the skate park is too steep and narrow.

I'm too lazy to write the fancy HMC animations that McElreath uses to visualize each divergent path. However, we can still verify that the number of divergences increases as we increase the \(\sigma\) of the v prior in the Devil's Funnel.

Centered-prior Model#

from functools import partial

xs = vs = np.linspace(-4, 4, 100)


def prior_devils_funnel(xs, vs, sigma_v=0.5):

    log_prior_v = stats.norm(0, sigma_v).logpdf(vs)
    log_prior_x = stats.norm(0, np.exp(vs)).logpdf(xs)

    log_prior = log_prior_v + log_prior_x
    return np.exp(log_prior - log_prior.max())


# Loop over sigma_v, and show that divergences increase
# as the depth and narrowness of the trough increases
# for low values of v
sigma_vs = [0.5, 1, 3]
n_sigma_vs = len(sigma_vs)
fig, axs = plt.subplots(3, 1, figsize=(4, 4 * n_sigma_vs))

for ii, sigma_v in enumerate(sigma_vs):
    with pm.Model() as centered_devils_funnel:
        v = pm.Normal("v", 0, sigma_v)
        x = pm.Normal("x", 0, pm.math.exp(v))
        centered_devils_funnel_inference = pm.sample()
        n_divergences = centered_devils_funnel_inference.sample_stats.diverging.sum().values

    prior = partial(prior_devils_funnel, sigma_v=sigma_v)
    plt.sca(axs[ii])
    utils.plot_2d_function(xs, vs, prior, colors="gray", ax=axs[ii])
    plt.xlabel("x")
    plt.ylabel("v")
    plt.title(f"$v \sim Normal(0, {sigma_v})$\n# Divergences: {n_divergences}")
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [v, x]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1 seconds.
There were 9 divergences after tuning. Increase `target_accept` or reparameterize.
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [v, x]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1 seconds.
There were 405 divergences after tuning. Increase `target_accept` or reparameterize.
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details
The effective sample size per chain is smaller than 100 for some parameters.  A higher number is needed for reliable rhat and ess computation. See https://arxiv.org/abs/1903.08008 for details
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [v, x]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1 seconds.
There were 622 divergences after tuning. Increase `target_accept` or reparameterize.
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details
The effective sample size per chain is smaller than 100 for some parameters.  A higher number is needed for reliable rhat and ess computation. See https://arxiv.org/abs/1903.08008 for details
../_images/e6e741850165e057875de1291f9ecd61a9bc296a535834bba49d227e5b41d663.png

What to Do?#

  • Use smaller step sizes: these handle the steepness better, but take longer to explore the posterior

  • Reparameterize (transform) the priors to make the surface smoother

Non-centered Priors#

We add an auxiliary variable \(z\) that has a smooth probability surface. We then sample that auxiliary variable and transform it to obtain the distribution of the target variable. For the Devil's Funnel prior:

\[\begin{split} \begin{align*} v &\sim \text{Normal}(0, \sigma_v) \\ z &\sim \text{Normal}(0, 1) \\ x &= z \exp(v) \end{align*} \end{split}\]
sigma_vs = [0.5, 1, 3]
n_sigma_vs = len(sigma_vs)
fig, axs = plt.subplots(3, 1, figsize=(4, 4 * n_sigma_vs))


def prior_noncentered_devils_funnel(xs, vs, sigma_v=0.5):
    """Reparameterize into a smoother auxiliary variable space"""
    log_prior_v = stats.norm(0, sigma_v).logpdf(vs)
    log_prior_z = stats.norm(0, 1).logpdf(xs)
    log_prior = log_prior_v + log_prior_z
    return np.exp(log_prior - log_prior.max())


for ii, sigma_v in enumerate(sigma_vs):
    with pm.Model() as noncentered_devils_funnel:
        v = pm.Normal("v", 0, sigma_v)
        z = pm.Normal("z", 0, 1)
        # Record x for reporting
        x = pm.Deterministic("x", z * pm.math.exp(v))
        noncentered_devils_funnel_inference = pm.sample()
        n_divergences = noncentered_devils_funnel_inference.sample_stats.diverging.sum().values

    prior = partial(prior_noncentered_devils_funnel, sigma_v=sigma_v)
    plt.sca(axs[ii])
    utils.plot_2d_function(xs, vs, prior, colors="gray", ax=axs[ii])
    plt.xlabel("z")
    plt.ylabel("v")
    plt.title(f"$v \sim$ Normal(0, {sigma_v})\n# Divergences: {n_divergences}")
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [v, z]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1 seconds.
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [v, z]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1 seconds.
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [v, z]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1 seconds.
../_images/2dc86cf4134ce187498087acf689bc98c7d9b297c6989e9354cd78dead9cddd1.png
  • By reparameterizing, we sample from multi-dimensional Normal distributions, which are smoother paraboloids in log space.

  • We can see that the number of divergences is consistently reduced for all values of the prior variance.

Checking HMC Diagnostics#

az.summary(noncentered_devils_funnel_inference)["r_hat"]
v    1.0
x    1.0
z    1.0
Name: r_hat, dtype: float64
az.plot_trace(noncentered_devils_funnel_inference);
../_images/40e4529e53fcc4fbed4270a108166a429012a76e39b285d813efcd28b39784c1.png

Diagnostics look good 👍

  • Rhats = 1

  • The \(v\) and \(z\) chains exhibit "fuzzy caterpillar" behavior

Cholesky Factors and Correlation Matrices#

  • The Cholesky factor \(\bf L\) provides an efficient way to encode a correlation matrix \(\Omega\) (it requires fewer floating-point numbers than the full correlation matrix)

  • \(\Omega = \bf LL^T\)

  • \(\bf L\) is a lower-triangular matrix

  • We can use the Cholesky factor \(\bf L\) of \(\Omega\) to sample data with correlations \(\Omega\) and standard deviations \(\sigma_i\) as follows:

\[ X_\Omega \sim \text{diag}(\sigma_i)\bf{LZ}, \]

where \(\bf Z\) is a matrix of z-scores sampled from a standard normal distribution

Demo: Reconstructing the Correlation Matrix from the Cholesky Factor#

orig_corr = np.array([[1, 0.6], [0.6, 1]])
print("Ground-truth Correlation Matrix\n", orig_corr)

L = np.linalg.cholesky(orig_corr)
print("\nCholesky Matrix of correlation matrix, L\n", L)
print("\nReconstructed correlation matrix from L\n", L @ L.T)
Ground-truth Correlation Matrix
 [[1.  0.6]
 [0.6 1. ]]

Cholesky Matrix of correlation matrix, L
 [[1.  0. ]
 [0.6 0.8]]

Reconstructed correlation matrix from L
 [[1.  0.6]
 [0.6 1. ]]

Demo: Sampling Correlated Data from z-scores and the Cholesky Factor#

np.random.seed(12345)
N = 100_000
orig_sigmas = [2, 0.5]

# Matrix of z-score samples
Z = stats.norm().rvs(size=(2, N))
print("Raw z-score samples are indpendent\n", np.corrcoef(Z[0], Z[1]))

# Transform Z-scores using the Cholesky factorization of the correlation matrix
sLZ = np.diag(orig_sigmas) @ L @ Z

corr_sLZ = np.corrcoef(sLZ[0], sLZ[1])
print("\nCholesky-transformed z-scores have original correlation encoded\n", corr_sLZ)
assert np.isclose(orig_corr[0, 1], corr_sLZ[0, 1], atol=0.01)

std_sLZ = sLZ.std(axis=1)
print("\nOriginal std devs also encoded in transformed z-scores:\n", std_sLZ)
assert np.isclose(orig_sigmas[0], std_sLZ[0], atol=0.01)
assert np.isclose(orig_sigmas[1], std_sLZ[1], atol=0.01)
Raw z-score samples are independent
 [[1.         0.00293945]
 [0.00293945 1.        ]]

Cholesky-transformed z-scores have original correlation encoded
 [[1.         0.60268456]
 [0.60268456 1.        ]]

Original std devs also encoded in transformed z-scores:
 [2.00212936 0.50024976]
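Going the other direction (a sketch), we can recover the standard deviations and the correlation matrix from the empirical covariance of the transformed samples:

# Sketch: recover stds and the correlation matrix from the empirical covariance of sLZ
emp_cov = np.cov(sLZ)
emp_stds = np.sqrt(np.diag(emp_cov))
emp_corr = emp_cov / np.outer(emp_stds, emp_stds)
print("Recovered stds\n", emp_stds)
print("Recovered correlation matrix\n", emp_corr)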

When to Use Transformed Priors#

  • It depends on the context (see the sketch after this list)

  • Centered

    • lots of data in each cluster/subgroup

    • likelihood-dominant scenarios

  • Non-centered

    • many clusters/subgroups, some with sparse data

    • prior-dominant scenarios
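A minimal side-by-side sketch of the two parameterizations for a generic hierarchical prior (the hyperpriors and `n_groups` here are illustrative assumptions, chosen to mirror the models above):

# Sketch: the same hierarchical prior, centered vs. non-centered
n_groups = 10

# Centered: sample group effects directly; tends to work well when each
# cluster has plenty of data (likelihood-dominant)
with pm.Model() as centered_hierarchy:
    a_bar = pm.Normal("a_bar", 0, 1)
    sigma = pm.Exponential("sigma", 1)
    a = pm.Normal("a", a_bar, sigma, shape=n_groups)

# Non-centered: sample standardized offsets, then shift and scale; tends to
# work well with many sparse clusters (prior-dominant)
with pm.Model() as noncentered_hierarchy:
    a_bar = pm.Normal("a_bar", 0, 1)
    sigma = pm.Exponential("sigma", 1)
    z = pm.Normal("z", 0, 1, shape=n_groups)
    a = pm.Deterministic("a", a_bar + z * sigma)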

Authors

  • Ported to PyMC by Dustin Stansbury (2024)

  • Based on Statistical Rethinking (2023) lectures by Richard McElreath

%load_ext watermark
%watermark -n -u -v -iv -w -p pytensor,aeppl,xarray
Last updated: Tue Dec 17 2024

Python implementation: CPython
Python version       : 3.12.5
IPython version      : 8.27.0

pytensor: 2.26.4
aeppl   : not installed
xarray  : 2024.7.0

pymc       : 5.19.1
scipy      : 1.14.1
xarray     : 2024.7.0
pandas     : 2.2.2
numpy      : 1.26.4
matplotlib : 3.9.2
statsmodels: 0.14.2
arviz      : 0.19.0

Watermark: 2.5.0

License Notice#

All the notebooks in this example gallery are provided under the MIT License, which allows modification and redistribution for any use, provided the copyright and license notices are preserved.

Citing PyMC Examples#

To cite this notebook, use the DOI provided by Zenodo for the pymc-examples repository.

Important

Many notebooks are adapted from other sources: blogs, books, etc. In such cases you should cite the original source as well.

Also remember to cite the relevant libraries used by your code.

Here is a citation template in bibtex:

@incollection{citekey,
  author    = "<notebook authors, see above>",
  title     = "<notebook title>",
  editor    = "PyMC Team",
  booktitle = "PyMC examples",
  doi       = "10.5281/zenodo.5654871"
}

Once rendered, it may look like: