建模事件#

此笔记本是 PyMC 端口的一部分,该端口移植自 Richard McElreath 的 Statistical Rethinking 2023 系列讲座。

视频 - 第 09 讲 - 建模事件# 第 09 讲 - 建模事件

# Ignore warnings
import warnings

import arviz as az
import numpy as np
import pandas as pd
import pymc as pm
import statsmodels.formula.api as smf
import utils as utils
import xarray as xr

from matplotlib import pyplot as plt
from matplotlib import style
from scipy import stats as stats

warnings.filterwarnings("ignore")

# Set matplotlib style
STYLE = "statistical-rethinking-2023.mplstyle"
style.use(STYLE)

加州大学伯克利分校录取#

数据集#

  • 4562 名研究生申请者

  • 按以下因素分层:

    • 院系

    • 性别

目标是识别招生官的性别歧视

ADMISSIONS = utils.load_data("UCBadmit")
ADMISSIONS
院系 申请者性别 录取 拒绝 申请数
1 A 男性 512 313 825
2 A 女性 89 19 108
3 B 男性 353 207 560
4 B 女性 17 8 25
5 C 男性 120 205 325
6 C 女性 202 391 593
7 D 男性 138 279 417
8 D 女性 131 244 375
9 E 男性 53 138 191
10 E 女性 94 299 393
11 F 男性 22 351 373
12 F 女性 24 317 341

建模事件#

  • 事件:离散、无序的结果

  • 观测是计数/聚合

  • 未知数是事件发生的概率 \(p\),或这些概率的几率 \(\frac{p}{1-p}\)

  • 我们分层的全部事物都会相互作用——在现实生活中通常不是独立的

  • 通常处理 \(p = \log \left (\frac{p}{1-p} \right)\) 的对数几率

招生:画猫头鹰 🦉#

  1. 估计量

  2. 科学模型

  3. 统计模型

  4. 分析

1. 估计量#

哪条路径定义了“歧视”#

直接歧视(因果效应)#

utils.draw_causal_graph(
    edge_list=[("G", "D"), ("G", "A"), ("D", "A")],
    node_props={
        "G": {"label": "Gender, G", "color": "red"},
        "D": {"label": "Department, D"},
        "A": {"label": "Admission rate, A", "color": "red"},
    },
    edge_props={("G", "A"): {"color": "red"}},
    graph_direction="LR",
)
../_images/36481e2592bbf6a5057aa0e742f873f7ec7751037be4a25d61ac4718e313d0ae.svg
  • 又称“制度性歧视”

  • 裁判对特定群体有偏见

  • 通常是我们有兴趣识别是否存在的类型

  • 需要强因果假设

在此,院系 D 是一个中介——这是社会科学中常见的结构,其中分类地位(例如,性别)影响某些中介背景(例如,职业),两者都会影响目标结果(工资)。示例

  • 工资歧视

  • 招聘

  • 科学奖项

间接歧视#

utils.draw_causal_graph(
    edge_list=[("G", "D"), ("G", "A"), ("D", "A")],
    node_props={
        "G": {"label": "Gender, G", "color": "red"},
        "D": {"label": "Department, D"},
        "A": {"label": "Admission rate, A", "color": "red"},
    },
    edge_props={("G", "D"): {"color": "red"}, ("D", "A"): {"color": "red"}},
    graph_direction="LR",
)
../_images/1e4536009a97476089b1ae85736f832899d4fbe78d41aa172c087b3543e6a5f1.svg
  • 又称“结构性歧视”

  • 例如,性别影响一个人的兴趣,从而影响他们将申请的院系

  • 影响总体录取率

  • 需要强因果假设

总歧视#

utils.draw_causal_graph(
    edge_list=[("G", "D"), ("G", "A"), ("D", "A")],
    node_props={
        "G": {"label": "Gender, G", "color": "red"},
        "D": {"label": "Department, D"},
        "A": {"label": "Admission rate, A", "color": "red"},
    },
    edge_props={
        ("G", "D"): {"color": "red"},
        ("G", "A"): {"color": "red"},
        ("D", "A"): {"color": "red"},
    },
    graph_direction="LR",
)
../_images/9c7a23caa46f1b7648693b51fb9b7f4716a3c5310ac53daac5110781eb8d7bd8.svg
  • 又称“经验性歧视”

  • 通过直接和间接途径

  • 需要温和的假设

估计量 & 估计器#

  • 每个 不同的估计量都需要不同的估计器

  • 通常我们 可以估计 的东西往往不是我们 想要估计 的东西

  • 例如,总歧视可能更容易估计,但可操作性较差

未观测到的混淆因素#

utils.draw_causal_graph(
    edge_list=[("G", "D"), ("G", "A"), ("D", "A"), ("U", "D"), ("U", "A")],
    node_props={"U": {"style": "dashed"}, "unobserved": {"style": "dashed"}},
    graph_direction="LR",
)
../_images/7fe9702ac0f014ba3f018166f8bc709d597ee153e0631977f57ab656b057fca6.svg

总是可能在中介和某些未观测到的混淆因素之间也存在混淆。我们现在将忽略这些。

2. 科学模型:#

模拟过程#

下面是审查/录取过程的生成模型

np.random.seed(123)
n_samples = 1000

GENDER = stats.bernoulli.rvs(p=0.5, size=n_samples)

# Groundtruth parameters
# Gender 1 tends to apply to department 1
P_DEPARTMENT = np.where(GENDER == 0, 0.3, 0.8)

# Acceptance rates matrices -- Department x Gender
UNBIASED_ACCEPTANCE_RATES = np.array([[0.1, 0.1], [0.3, 0.3]])  # No *direct* gender bias

# Biased acceptance:
# - dept 0 accepts gender 0 at 50% of unbiased rate
# - dept 1 accepts gender 0 at 67% of unbiased rate
BIASED_ACCEPTANCE_RATES = np.array([[0.05, 0.1], [0.2, 0.3]])  # *direct* gender bias present

DEPARTMENT = stats.bernoulli.rvs(p=P_DEPARTMENT, size=n_samples).astype(int)


def simulate_admissions_data(bias_type):
    """Simulate admissions data using the global params above"""
    acceptance_rate = (
        BIASED_ACCEPTANCE_RATES if bias_type == "biased" else UNBIASED_ACCEPTANCE_RATES
    )
    acceptance = stats.bernoulli.rvs(p=acceptance_rate[DEPARTMENT, GENDER])

    return (
        pd.DataFrame(
            np.vstack([GENDER, DEPARTMENT, acceptance]).T,
            columns=["gender", "department", "accepted"],
        ),
        acceptance_rate,
    )


for bias_type in ["unbiased", "biased"]:

    fake_admissions, acceptance_rate = simulate_admissions_data(bias_type)

    gender_acceptance_counts = fake_admissions.groupby(["gender", "accepted"]).count()["department"]
    gender_acceptance_counts.name = None

    gender_department_counts = fake_admissions.groupby(["gender", "department"]).count()["accepted"]
    gender_department_counts.name = None

    observed_acceptance_rates = fake_admissions.groupby("gender").mean()["accepted"]
    observed_acceptance_rates.name = None

    print()
    print("-" * 30)
    print(bias_type.upper())
    print("-" * 30)
    print(f"Department Acceptance rate:\n{acceptance_rate}")
    print(f"\nGender-Department Frequency:\n{gender_department_counts}")
    print(f"\nGender-Acceptance Frequency:\n{gender_acceptance_counts}")
    print(f"\nOverall Acceptance Rates:\n{observed_acceptance_rates}")
------------------------------
UNBIASED
------------------------------
Department Acceptance rate:
[[0.1 0.1]
 [0.3 0.3]]

Gender-Department Frequency:
gender  department
0       0             335
        1             173
1       0             106
        1             386
dtype: int64

Gender-Acceptance Frequency:
gender  accepted
0       0           437
        1            71
1       0           359
        1           133
dtype: int64

Overall Acceptance Rates:
gender
0    0.139764
1    0.270325
dtype: float64

------------------------------
BIASED
------------------------------
Department Acceptance rate:
[[0.05 0.1 ]
 [0.2  0.3 ]]

Gender-Department Frequency:
gender  department
0       0             335
        1             173
1       0             106
        1             386
dtype: int64

Gender-Acceptance Frequency:
gender  accepted
0       0           465
        1            43
1       0           370
        1           122
dtype: int64

Overall Acceptance Rates:
gender
0    0.084646
1    0.247967
dtype: float64

模拟数据属性#

两种情景#

  • 性别 0 倾向于申请院系 0

  • 性别 1 倾向于申请院系 1

无偏情景:#

  • 由于院系 0 的录取率较低,并且性别 0 倾向于申请院系 0,因此性别 0 的总体拒绝率低于性别 1

  • 由于院系 1 的录取率较高,并且性别 1 倾向于申请院系 1,因此性别 1 的总体录取率高于性别 0

  • 即使在没有(直接)性别歧视的情况下,仍然存在基于性别倾向于申请不同院系的间接歧视,以及每个院系录取学生的不均等可能性。

有偏情景#

  • 总体录取率较低(由于性别 0 在两个院系的录取率均有所降低)

  • 在存在实际院系偏见的情景中,由于间接效应,我们看到与无偏情况类似的总体歧视模式

3. 统计模型#

建模事件#

  • 我们观察到 事件的计数

  • 我们估计 概率——或者,更确切地说,是事件发生的对数几率

线性模型#

期望值是参数的线性(加性)组合

\[\begin{split} \begin{align*} Y_i &\sim \mathcal{N}(\mu_i, \sigma) \\ \mu_i &= \alpha + \beta_X X_i + ... \end{align*} \end{split}\]

b.c. 正态分布是无界的,因此线性模型的期望值也是无界的。

事件模型#

离散事件要么发生,取值 1,要么不发生,取值 0。这 事件的期望值 设置了界限。即界限在区间 \((0, 1)\)

逻辑回归的先验#

logit 链接函数是一种苛刻的变换

  • Logit 压缩参数分布

  • \(x > +4 \rightarrow \) 事件基本上总是发生

  • \(x < -4 \rightarrow\) 事件基本上从不发生

n_samples = 10000
fig, axs = plt.subplots(3, 2, figsize=(8, 8))
for ii, std in enumerate([10, 1.5, 1]):

    # Alpha prior distribution
    alphas = stats.norm(0, std).rvs(n_samples)
    plt.sca(axs[ii][0])
    az.plot_dist(alphas, color="C1")
    plt.xlim([-30, 30])
    plt.ylabel("density")
    if ii == 2:
        plt.xlabel("alpha")

    # Resulting event probability distribution
    ps = utils.invlogit(alphas)
    plt.sca(axs[ii][1])
    az.plot_dist(ps, color="C0")
    plt.xlim([0, 1])
    plt.ylabel("density")
    plt.title(f"$\\alpha \sim \mathcal{{N}}(0, {std})$")
    if ii == 2:
        plt.xlabel("p(event)")
../_images/89d0718aba6a7e06f1a8c1c8c4887c77c6e88c11846aa5859e6c7ae34c94f611.png

先验预测模拟#

# Demonstrating the effect of alpha / beta on p(x)
from functools import partial


def gaussian_2d_pdf(xs, ys, mean=(0, 0), covariance=np.eye(2)):
    return np.array(stats.multivariate_normal(mean, covariance).pdf(np.vstack([xs, ys]).T))


pdf = partial(gaussian_2d_pdf)

xs = ys = np.linspace(-3, 3, 100)
# utils.plot_2d_function(xs, ys, pdf, cmap='gray_r')


def plot_logistic_prior_predictive(n_samples=20, prior_std=0.5):

    _, axs = plt.subplots(1, 2, figsize=(10, 4))
    plt.sca(axs[0])
    min_x, max_x = -prior_std, prior_std
    xs = np.linspace(min_x, max_x, 100)
    ys = xs

    pdf = partial(gaussian_2d_pdf, covariance=np.array([[prior_std**2, 0], [0, prior_std**2]]))
    utils.plot_2d_function(xs, ys, pdf, cmap="gray_r")

    plt.axis("square")
    plt.xlim([min_x, max_x])
    plt.ylim([min_x, max_x])

    # Sample some parameters from prior
    βs = stats.norm.rvs(0, prior_std, size=n_samples)
    αs = stats.norm.rvs(0, prior_std, size=n_samples)

    for b, a in zip(βs, αs):
        plt.scatter(a, b)

    plt.xlabel("$\\alpha$")
    plt.ylabel("$\\beta$")
    plt.title(f"Samples from prior with std={prior_std}")

    # Resulting sigmoid functions
    plt.sca(axs[1])
    min_x, max_x = -3, 3
    xs = np.linspace(min_x, max_x, 100)
    for a, b in zip(αs, βs):
        logit_p = a + xs * b
        p = utils.invlogit(logit_p)
        plt.plot(xs, p)

    plt.xlabel("x")
    plt.ylabel("p = invlogit(x)")
    plt.xlim([-3, 3])
    plt.ylim([-0.05, 1.05])
    plt.axhline(0.5, linestyle="--", color="k")
    plt.title(f"Resulting Logistic Models")
plot_logistic_prior_predictive(prior_std=0.5)
../_images/acd14733c244c8b5e43958bc9c24b112559f2db5baa06bb008ec0d89b99ed116.png
plot_logistic_prior_predictive(prior_std=1.0)
../_images/98fbf3ff89e2c9255cb73c7c6cfc376dcb45e43038b88a1db2931b64be25f1d6.png
plot_logistic_prior_predictive(prior_std=10)
../_images/15f7fccb522dcca48bf0a4082e8a0fbfbeb3e42dbfe7de9dd406636c103a62de.png

招生统计模型#

同样,估计器将取决于估计量

总因果效应#

utils.draw_causal_graph(
    edge_list=[("G", "D"), ("G", "A"), ("D", "A")],
    node_props={
        "G": {"color": "red"},
        "A": {"color": "red"},
    },
    edge_props={
        ("G", "A"): {"color": "red"},
        ("D", "A"): {"color": "red"},
        ("G", "D"): {"color": "red"},
    },
    graph_direction="LR",
)
../_images/75c2b72eec68ed3513e34cd78749c8d664e23266b22fe92d3fb83676b0c40b33.svg

仅按性别分层。不要按院系分层,因为它是我们不想阻止的管道(中介)

\[\begin{split} \begin{align*} A_i &\sim \text{Bernoulli}(p=p_i) \\ \text{logit}(p_i) &= \alpha[G_i] \\ \alpha &= [\alpha_0, \alpha_1] \\ \alpha_j &\sim \text{Normal}(0, 1) \end{align*} \end{split}\]

直接因果效应#

utils.draw_causal_graph(
    edge_list=[("G", "D"), ("G", "A"), ("D", "A")],
    node_props={
        "G": {"color": "red"},
        "A": {"color": "red"},
    },
    edge_props={("G", "A"): {"color": "red"}},
    graph_direction="LR",
)
../_images/c24d42205a8c20380c913fa67ffafc5cd530c08b829324849d696ce9454db55d.svg

按性别和院系分层 以阻止管道

\[\begin{split} \begin{align*} A_i &\sim \text{Bernoulli}(p_i) \\ \text{logit}(p_i) &= \alpha[G_i, D_i] \\ \alpha &= \left[{\begin{array}{cc} \alpha_{0,0}, \alpha_{0,1} \\ \alpha_{1,0}, \alpha_{1,1} \end{array}}\right] \\ \alpha_{j,k} &\sim \text{Normal}(0, 1) \end{align*} \end{split}\]

拟合总因果效应模型#

def fit_total_effect_model_admissions(data):
    """Fit total effect model, stratifying by gender.

    Note
    ----
    We use the Binomial regression, so simulated observation-level
    data from above must be pre-aggregated. (see the function
    `aggregate_admissions_data` below). We could have
    used a Bernoulli likelihood (Logistic regression) for the simulated
    data, but given that we only have aggregates for the real
    UC Berkeley data we choose this more general formulation.
    """

    # Set up observations / coords
    n_admitted = data["admit"].values
    n_applications = data["applications"].values

    gender_coded, gender = pd.factorize(data["applicant.gender"].values)

    with pm.Model(coords={"gender": gender}) as total_effect_model:

        # Mutable data for running any gender-based counterfactuals
        gender_coded = pm.Data("gender", gender_coded, dims="obs_ids")

        alpha = pm.Normal("alpha", 0, 1, dims="gender")

        # Record the inverse linked probabilities for reporting
        pm.Deterministic("p_accept", pm.math.invlogit(alpha), dims="gender")

        p = pm.math.invlogit(alpha[gender_coded])

        likelihood = pm.Binomial(
            "accepted", n=n_applications, p=p, observed=n_admitted, dims="obs_ids"
        )
        total_effect_inference = pm.sample()

    return total_effect_model, total_effect_inference


def aggregate_admissions_data(raw_admissions_data):
    """Aggregate simulated data from `simulate_admissions_data`, which
    is in long format to short format, by so that it can be modeled
    using Binomial likelihood. We also recode column names to have the
    same fields as the UC Berkeley dataset so that we can use the
    same model-fitting functions.
    """

    # Aggregate applications, admissions, and rejections. Rename
    # columns to have the  same fields as UC Berkeley dataset
    applications = raw_admissions_data.groupby(["gender", "department"]).count().reset_index()
    applications.rename(columns={"accepted": "applications"}, inplace=True)

    admits = raw_admissions_data.groupby(["gender", "department"]).sum().reset_index()
    admits.rename(columns={"accepted": "admit"}, inplace=True)

    data = applications.merge(admits)
    data.loc[:, "reject"] = data.applications - data.admit

    # Code gender & department. For easier comparison to lecture,
    # we use 1-indexed gender/department indicators like McElreath
    data.loc[:, "applicant.gender"] = data.gender.apply(lambda x: 2 if x else 1)
    data.loc[:, "dept"] = data.department.apply(lambda x: 2 if x else 1)

    return data[["applicant.gender", "dept", "applications", "admit", "reject"]]


def plot_admissions_model_posterior(inference, figsize=(6, 2)):
    _, ax = plt.subplots(figsize=figsize)

    # Plot alphas
    az.plot_forest(inference, var_names=["alpha"], combined=True, hdi_prob=0.89, ax=ax)

    # Plot inver-linked acceptance probabilities
    _, ax = plt.subplots(figsize=figsize)
    az.plot_forest(inference, var_names=["p_accept"], combined=True, hdi_prob=0.89, ax=ax)

    return az.summary(inference)

在有偏模拟招生数据上拟合总效应模型#

SIMULATED_BIASED_ADMISSIONS, _ = simulate_admissions_data("biased")
SIMULATED_BIASED_ADMISSIONS = aggregate_admissions_data(SIMULATED_BIASED_ADMISSIONS)
SIMULATED_BIASED_ADMISSIONS
申请者性别 院系 申请数 录取 拒绝
0 1 1 335 13 322
1 1 2 173 34 139
2 2 1 106 11 95
3 2 2 386 108 278
total_effect_model_simulated_biased, total_effect_inference_simulated_biased = (
    fit_total_effect_model_admissions(SIMULATED_BIASED_ADMISSIONS)
)
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1 seconds.
plot_admissions_model_posterior(total_effect_inference_simulated_biased, (6, 1.5))
均值 标准差 HDI_3% HDI_97% MCSE_均值 MCSE_标准差 ESS_批量 ESS_尾部 R_hat
alpha[1] -2.240 0.151 -2.531 -1.958 0.002 0.002 3876.0 2626.0 1.0
alpha[2] -1.133 0.106 -1.335 -0.939 0.002 0.001 4108.0 3100.0 1.0
p_accept[1] 0.097 0.013 0.071 0.121 0.000 0.000 3876.0 2626.0 1.0
p_accept[2] 0.244 0.019 0.208 0.281 0.000 0.000 4108.0 3100.0 1.0
../_images/3c9cb7ead48aa4b6c88d0daa4b7b9c2679778c121ee9a9976dfe7c24fcccc2d3.png ../_images/a9e2d4919a9a882f6fd19fb886850176e6693db3b446d604c19209db646fdd06.png

在无偏模拟招生数据上拟合总效应模型#

SIMULATED_UNBIASED_ADMISSIONS, _ = simulate_admissions_data("unbiased")
SIMULATED_UNBIASED_ADMISSIONS = aggregate_admissions_data(SIMULATED_UNBIASED_ADMISSIONS)
SIMULATED_UNBIASED_ADMISSIONS
申请者性别 院系 申请数 录取 拒绝
0 1 1 335 33 302
1 1 2 173 60 113
2 2 1 106 8 98
3 2 2 386 111 275
total_effect_model_simulated_unbiased, total_effect_inference_simulated_unbiased = (
    fit_total_effect_model_admissions(SIMULATED_UNBIASED_ADMISSIONS)
)
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1 seconds.
plot_admissions_model_posterior(total_effect_inference_simulated_unbiased, (6, 1.5))
均值 标准差 HDI_3% HDI_97% MCSE_均值 MCSE_标准差 ESS_批量 ESS_尾部 R_hat
alpha[1] -1.481 0.112 -1.694 -1.264 0.002 0.001 3700.0 2457.0 1.0
alpha[2] -1.132 0.107 -1.327 -0.937 0.002 0.001 4389.0 2864.0 1.0
p_accept[1] 0.186 0.017 0.150 0.215 0.000 0.000 3700.0 2457.0 1.0
p_accept[2] 0.244 0.020 0.210 0.281 0.000 0.000 4389.0 2864.0 1.0
../_images/279bf2f460bf6f9e76966003da9469b5cbbae6ae7193e0e9898eba1964f6ca89.png ../_images/1007e213a0dd112dfb22b6a9f0dabdf83e0613e63ef5aadefc05ce1b5b843b83.png

拟合直接因果效应模型#

def fit_direct_effect_model_admissions(data):
    """Fit total effect model, stratifying by gender.

    Note
    ----
    We use the Binomial likelihood, so simulated observation-level data from
    above must be pre-aggregated. (see the function `aggregate_admissions_data`
    below). We could have used a Bernoulli likelihood for the simulated data,
    but given that we only have aggregates for the real UC Berkeley data we
    choose this more general formulation.
    """

    # Set up observations / coords
    n_admitted = data["admit"].values
    n_applications = data["applications"].values

    department_coded, department = pd.factorize(data["dept"].values)
    gender_coded, gender = pd.factorize(data["applicant.gender"].values)

    with pm.Model(coords={"gender": gender, "department": department}) as direct_effect_model:

        # Mutable data for running any gender-based or department-based counterfactuals
        gender_coded = pm.Data("gender", gender_coded, dims="obs_ids")
        department_coded = pm.Data("department", department_coded, dims="obs_ids")
        n_applications = pm.Data("n_applications", n_applications, dims="obs_ids")

        alpha = pm.Normal("alpha", 0, 1, dims=["department", "gender"])

        # Record inverse linked probabilities for reporting
        pm.Deterministic("p_accept", pm.math.invlogit(alpha), dims=["department", "gender"])

        p = pm.math.invlogit(alpha[department_coded, gender_coded])

        likelihood = pm.Binomial(
            "accepted", n=n_applications, p=p, observed=n_admitted, dims="obs_ids"
        )
        direct_effect_inference = pm.sample()

    return direct_effect_model, direct_effect_inference

将直接效应模型拟合到有偏模拟招生数据#

direct_effect_model_simulated_biased, direct_effect_inference_simulated_biased = (
    fit_direct_effect_model_admissions(SIMULATED_BIASED_ADMISSIONS)
)
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1 seconds.
plot_admissions_model_posterior(direct_effect_inference_simulated_biased)
均值 标准差 HDI_3% HDI_97% MCSE_均值 MCSE_标准差 ESS_批量 ESS_尾部 R_hat
alpha[1, 1] -3.021 0.256 -3.505 -2.563 0.003 0.002 7099.0 2936.0 1.0
alpha[1, 2] -1.997 0.278 -2.516 -1.488 0.003 0.003 6422.0 3414.0 1.0
alpha[2, 1] -1.371 0.188 -1.733 -1.034 0.002 0.002 7865.0 3270.0 1.0
alpha[2, 2] -0.938 0.114 -1.145 -0.722 0.001 0.001 7740.0 3283.0 1.0
p_accept[1, 1] 0.048 0.011 0.029 0.072 0.000 0.000 7099.0 2936.0 1.0
p_accept[1, 2] 0.123 0.029 0.067 0.176 0.000 0.000 6422.0 3414.0 1.0
p_accept[2, 1] 0.204 0.030 0.148 0.259 0.000 0.000 7865.0 3270.0 1.0
p_accept[2, 2] 0.282 0.023 0.239 0.324 0.000 0.000 7740.0 3283.0 1.0
../_images/b21b5a7e2f065c4faa42c90a6e7c859df726682fa82d2260da98370c366b4c4c.png ../_images/02be9cbdb6d5e5144e4aee07063258d7e5ef015d6c27e941288c6c27f9652c1b.png

为了比较,这里是基本事实的有偏录取率,我们能够基本恢复

# Department x Gender
BIASED_ACCEPTANCE_RATES
array([[0.05, 0.1 ],
       [0.2 , 0.3 ]])

将直接效应模型拟合到无偏模拟招生数据#

direct_effect_model_simulated_unbiased, direct_effect_inference_simulated_unbiased = (
    fit_direct_effect_model_admissions(SIMULATED_UNBIASED_ADMISSIONS)
)
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1 seconds.
plot_admissions_model_posterior(direct_effect_inference_simulated_unbiased)
均值 标准差 HDI_3% HDI_97% MCSE_均值 MCSE_标准差 ESS_批量 ESS_尾部 R_hat
alpha[1, 1] -2.155 0.175 -2.460 -1.805 0.002 0.002 5264.0 3200.0 1.0
alpha[1, 2] -2.276 0.313 -2.901 -1.722 0.004 0.003 5684.0 2986.0 1.0
alpha[2, 1] -0.618 0.161 -0.919 -0.322 0.002 0.001 7013.0 3158.0 1.0
alpha[2, 2] -0.898 0.114 -1.114 -0.684 0.001 0.001 6859.0 3242.0 1.0
p_accept[1, 1] 0.105 0.016 0.076 0.136 0.000 0.000 5264.0 3200.0 1.0
p_accept[1, 2] 0.096 0.027 0.050 0.148 0.000 0.000 5684.0 2986.0 1.0
p_accept[2, 1] 0.351 0.036 0.284 0.419 0.000 0.000 7013.0 3158.0 1.0
p_accept[2, 2] 0.290 0.023 0.243 0.331 0.000 0.000 6859.0 3242.0 1.0
../_images/08221fe4839e60ee04c4fbdea91b44b4d10a1325e94b0b762d30595727cdddce.png ../_images/0c2eee666100ee3e46206bc624b42d8e8765d61b955f8660e736c325af76e62d.png

为了比较,这里是基本事实的无偏录取率,我们能够恢复

# Department x Gender
UNBIASED_ACCEPTANCE_RATES
array([[0.1, 0.1],
       [0.3, 0.3]])

4. 分析加州大学伯克利分校的招生数据#

计数数据集回顾#

  • 我们将使用这些计数数据来建模性别/院系录取率的对数几率

  • 请注意,我们将使用二项回归,它等效于伯努利回归,但对聚合计数进行运算,而不是对单个二元试验进行运算

    • 上面的示例实际上是作为二项回归实现的,以便我们可以重用代码并演示一般的分析模式

    • 无论哪种方式,您都将使用这两种方法获得相同的推断

ADMISSIONS
院系 申请者性别 录取 拒绝 申请数
1 A 男性 512 313 825
2 A 女性 89 19 108
3 B 男性 353 207 560
4 B 女性 17 8 25
5 C 男性 120 205 325
6 C 女性 202 391 593
7 D 男性 138 279 417
8 D 女性 131 244 375
9 E 男性 53 138 191
10 E 女性 94 299 393
11 F 男性 22 351 373
12 F 女性 24 317 341

将总效应模型拟合到加州大学伯克利分校的招生数据#

不要忘记查看诊断结果,我们在这里将跳过

total_effect_model, total_effect_inference = fit_total_effect_model_admissions(ADMISSIONS)
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1 seconds.
plot_admissions_model_posterior(total_effect_inference, figsize=(6, 1.5))
均值 标准差 HDI_3% HDI_97% MCSE_均值 MCSE_标准差 ESS_批量 ESS_尾部 R_hat
alpha[男性] -0.221 0.038 -0.294 -0.151 0.001 0.000 4202.0 2936.0 1.0
alpha[女性] -0.828 0.050 -0.917 -0.732 0.001 0.001 3905.0 2831.0 1.0
p_accept[男性] 0.445 0.009 0.427 0.462 0.000 0.000 4202.0 2936.0 1.0
p_accept[女性] 0.304 0.011 0.286 0.325 0.000 0.000 3905.0 2831.0 1.0
../_images/c27c438f4095bd50af410c6e9c3c954924e9b43ed36cbdd568fa2338d6bc1ce1.png ../_images/26007db02f8220015cd49c95c590e9aa040216e29d6ae0878528c1dd5d59a0f8.png

总因果效应#

# Total Causal Effect
female_p_accept = total_effect_inference.posterior["p_accept"].sel(gender="female")
male_p_accept = total_effect_inference.posterior["p_accept"].sel(gender="male")
contrast = male_p_accept - female_p_accept

plt.subplots(figsize=(8, 3))
az.plot_dist(contrast)
plt.axvline(0, linestyle="--", color="black", label="No difference")
plt.xlabel("gender contrast (acceptance probability)")
plt.ylabel("density")
plt.title(
    "Total Causal Effect (Experienced Discrimination):\npositive values indicate male advantage"
)
plt.legend();
../_images/99b458765f16558672fb5499b0da78adf9a777177c864fb7375054534b147f88.png

将直接效应模型拟合到加州大学伯克利分校的招生数据#

direct_effect_model, direct_effect_inference = fit_direct_effect_model_admissions(ADMISSIONS)
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1 seconds.
plot_admissions_model_posterior(direct_effect_inference, figsize=(6, 3))
均值 标准差 HDI_3% HDI_97% MCSE_均值 MCSE_标准差 ESS_批量 ESS_尾部 R_hat
alpha[A, 男性] 0.490 0.072 0.353 0.620 0.001 0.001 5850.0 3067.0 1.0
alpha[A, 女性] 1.474 0.243 1.029 1.937 0.003 0.002 5579.0 2964.0 1.0
alpha[B, 男性] 0.531 0.087 0.377 0.703 0.001 0.001 5106.0 2554.0 1.0
alpha[B, 女性] 0.657 0.395 -0.075 1.387 0.005 0.004 6701.0 3242.0 1.0
alpha[C, 男性] -0.532 0.115 -0.743 -0.310 0.001 0.001 5940.0 3091.0 1.0
alpha[C, 女性] -0.660 0.086 -0.831 -0.506 0.001 0.001 5786.0 3051.0 1.0
alpha[D, 男性] -0.698 0.102 -0.892 -0.513 0.001 0.001 6055.0 3004.0 1.0
alpha[D, 女性] -0.618 0.108 -0.828 -0.425 0.001 0.001 6215.0 2943.0 1.0
alpha[E, 男性] -0.935 0.158 -1.239 -0.645 0.002 0.002 5672.0 2865.0 1.0
alpha[E, 女性] -1.146 0.119 -1.364 -0.928 0.002 0.001 5891.0 2814.0 1.0
alpha[F, 男性] -2.667 0.205 -3.057 -2.291 0.003 0.002 5224.0 2612.0 1.0
alpha[F, 女性] -2.499 0.195 -2.863 -2.131 0.003 0.002 6109.0 3036.0 1.0
p_accept[A, 男性] 0.620 0.017 0.587 0.650 0.000 0.000 5850.0 3067.0 1.0
p_accept[A, 女性] 0.811 0.037 0.742 0.878 0.000 0.000 5579.0 2964.0 1.0
p_accept[B, 男性] 0.629 0.020 0.595 0.670 0.000 0.000 5106.0 2554.0 1.0
p_accept[B, 女性] 0.653 0.086 0.488 0.805 0.001 0.001 6701.0 3242.0 1.0
p_accept[C, 男性] 0.370 0.027 0.322 0.423 0.000 0.000 5940.0 3091.0 1.0
p_accept[C, 女性] 0.341 0.019 0.303 0.376 0.000 0.000 5786.0 3051.0 1.0
p_accept[D, 男性] 0.333 0.023 0.291 0.374 0.000 0.000 6055.0 3004.0 1.0
p_accept[D, 女性] 0.351 0.025 0.304 0.395 0.000 0.000 6215.0 2943.0 1.0
p_accept[E, 男性] 0.283 0.032 0.222 0.341 0.000 0.000 5672.0 2865.0 1.0
p_accept[E, 女性] 0.242 0.022 0.204 0.283 0.000 0.000 5891.0 2814.0 1.0
p_accept[F, 男性] 0.066 0.012 0.043 0.089 0.000 0.000 5224.0 2612.0 1.0
p_accept[F, 女性] 0.077 0.014 0.052 0.103 0.000 0.000 6109.0 3036.0 1.0
../_images/5c5f9ff73e139d0eac6e0b45d4a70630054460c6b58ee7283f2105658a31feb6.png ../_images/3010d4b8d85dd77d24a3a3ee76719dffc9f8178a7c2ff8c7dd82daf042b3d746.png
# Fancier plot of department/gender acceptance probability distributions
_, ax = plt.subplots(figsize=(10, 5))
for ii, dept in enumerate(ADMISSIONS.dept.unique()):
    color = f"C{ii}"
    for gend in ADMISSIONS["applicant.gender"].unique():
        label = f"{dept}:{gend}"
        linestyle = "-." if gend == "female" else "-"
        post = direct_effect_inference.posterior["p_accept"].sel(department=dept, gender=gend)
        az.plot_dist(post, label=label, color=color, plot_kwargs={"linestyle": linestyle})
plt.xlabel("acceptance probability")
plt.legend(title="Deptartment:Gender");
../_images/415ead0f5eb46bcfc4e325e994a2b3c7037697f738f32d9ea967b25cf401601a.png

直接因果效应#

# Direct Causal Effect
_, ax = plt.subplots(figsize=(8, 4))

for ii, dept in enumerate(ADMISSIONS.dept.unique()):
    color = f"C{ii}"
    label = f"{dept}"

    female_p_accept = direct_effect_inference.posterior["p_accept"].sel(
        gender="female", department=dept
    )
    male_p_accept = direct_effect_inference.posterior["p_accept"].sel(
        gender="male", department=dept
    )

    contrast = male_p_accept - female_p_accept
    az.plot_dist(contrast, color=color, label=label)

plt.xlabel("gender contrast (acceptance probability)")
plt.ylabel("density")
plt.title(
    "Direct Causal Effect (Institutional Discrimination):\npositive values indicate male advantage"
)
plt.axvline(0, linestyle="--", color="black", label="No difference")
plt.legend(title="Department");
../_images/6eabdde5bbcbad977e684de962381eabfad76709e45c1684722af59be8e99a79.png

平均直接效应#

  • 性别的因果效应 在各院系之间平均 (边缘化)

  • 取决于每个院系的申请分布

  • 这很容易作为模拟来完成

后分层#

  • 为目标人群重新加权估计值

  • 允许我们将从一所大学拟合的模型应用于估计另一所具有不同院系分布的大学的因果影响

  • 在这里,我们使用经验分布来重新加权估计值

# Use the empirical distribution of departments -- we'd update this for a different university
applications_per_dept = ADMISSIONS.groupby("dept").sum()["applications"].values
total_applications = applications_per_dept.sum()
department_weight = applications_per_dept / total_applications

female_alpha = direct_effect_inference.posterior.sel(gender="female")["alpha"]
male_alpha = direct_effect_inference.posterior.sel(gender="male")["alpha"]

weighted_female_alpha = female_alpha * department_weight
weighed_male_alpha = male_alpha * department_weight

contrast = weighed_male_alpha - weighted_female_alpha

_, ax = plt.subplots(figsize=(8, 4))
az.plot_dist(contrast)

plt.axvline(0, linestyle="--", color="k", label="No Difference")
plt.xlabel("causal effect of perceived gender")
plt.ylabel("density")
plt.title("Average Direct Causal of Gender (perception),\npositive values indicate male advantage")
plt.xlim([-0.3, 0.3])
plt.legend();
../_images/bf5a715fc85760603d78e9013f8d45cfa7f6feaad8f4fd889e829847579c1444.png

为了验证平均过程,我们可以查看来自后验的 p_accept 样本的对比,这提供了类似的结果。但是,查看后验显然不适用于对样本外大学进行预测。

direct_posterior = direct_effect_inference.posterior

# Select by gender; note: p_accept already has link function applied
female_posterior = direct_posterior.sel(gender="female")[
    "p_accept"
]  # shape: (chain x draw x department)
male_posterior = direct_posterior.sel(gender="male")[
    "p_accept"
]  # shape: (chain x draw x department)
contrast = male_posterior - female_posterior  # shape: (chain x draw x department)

_, ax = plt.subplots(figsize=(8, 4))
# marginalize / collapse the contrast across all departments
az.plot_dist(contrast)
plt.axvline(0, linestyle="--", color="k", label="No Difference")
plt.xlim([-0.3, 0.3])
plt.xlabel("causal effect of perceived gender")
plt.ylabel("density")
plt.title("Posterior contrast of $p_{accept}$,\npositive values indicate male advantage")
plt.legend();
../_images/0761bc15a4341cc20369f2c055b92d42af311819f68a4990413aeba9e94ccca1.png

歧视?#

很难说

  • 巨大的结构性影响

  • 申请人的分布可能是由于歧视造成的

  • 混淆因素很可能存在

奖励:生存分析#

  • 计数通常建模为事件发生时间(即指数分布或伽玛分布)——目标是估计事件发生率

  • 很棘手,因为忽略 删失数据 可能会导致推断错误

    • 左删失:我们不知道计时器何时开始

    • 右删失:观察期在事件发生之前结束

示例:奥斯汀猫咪收养#

  • 2 万只猫

  • 事件:被收养 (1) 或未被收养 (0)

  • 删失机制

    • 在被收养前死亡

    • 逃跑

    • 尚未被收养

目标:确定黑色猫咪的收养率是否低于非黑色猫咪。

CATS = utils.load_data("AustinCats")
CATS.head()
ID 事件发生天数 外出日期 外出事件 进入日期 进入事件 品种 颜色 收养年龄
0 A730601 1 2016 年 7 月 8 日 上午 09:00:00 转移 2016 年 7 月 7 日 下午 12:11:00 流浪 家猫短毛混血 蓝色虎斑 7
1 A679549 25 2014 年 6 月 16 日 下午 01:54:00 转移 2014 年 5 月 22 日 下午 03:43:00 流浪 家猫短毛混血 黑色/白色 1
2 A683656 4 2014 年 7 月 17 日 下午 04:57:00 收养 2014 年 7 月 13 日 下午 01:20:00 流浪 雪鞋猫混血 山猫色重点色 2
3 A709749 41 2015 年 9 月 22 日 下午 12:49:00 转移 2015 年 8 月 12 日 下午 06:29:00 流浪 家猫短毛混血 玳瑁色 12
4 A733551 9 2016 年 9 月 1 日 上午 12:00:00 转移 2016 年 8 月 23 日 下午 02:35:00 流浪 家猫短毛混血 棕色虎斑/白色 1

建模结果变量:days_to_event#

建模事件发生时间的两种常用分布

指数分布:

  • 两者中较简单的一种(单参数)

  • 假设恒定速率

  • 在具有相同平均速率的非负连续分布集中,熵最大分布。

  • 假设在事件发生(机器故障)之前发生一件事情(例如,零件故障)

伽玛分布

  • 两者中更复杂的一种(两个参数)

  • 在具有相同均值和平均对数的分布集中,熵最大分布。

  • 假设在事件发生(机器故障)之前发生多件事情(例如,零件故障)

_, axs = plt.subplots(1, 2, figsize=(10, 4))

days = np.linspace(0, 100, 100)

plt.sca(axs[0])
for scale in [20, 50]:
    expon = stats.expon(loc=0.01, scale=scale)
    utils.plot_line(days, expon.pdf(days), label=f"$\\lambda$=1/{scale}")
plt.xlabel("days")
plt.ylabel("density")
plt.legend()
plt.title("Exponential Distribution")


plt.sca(axs[1])
N = 8
alpha = 10
for scale in [2, 5]:
    gamma = stats.gamma(a=10, loc=alpha, scale=scale)
    utils.plot_line(days, gamma.pdf(days), label=f"$\\alpha$={alpha}, $\\beta$={scale}")
plt.xlabel("days")
plt.ylabel("density")
plt.legend()
plt.title("Gamma Distribution");
../_images/16b63aef4e140181238b6520c95502c2655a139a9dd425bdde257647b2e35651.png

删失和非删失观测#

  • 观测数据使用累积分布;即,事件在时间 x 之前发生的概率

  • 未观测到的(删失)数据则需要 CDF 的互补;即,事件尚未发生的概率

xs = np.linspace(0, 5, 100)
_, axs = plt.subplots(1, 2, figsize=(10, 5))
plt.sca(axs[0])
cdfs = {}
for lambda_ in [0.5, 1]:
    cdfs[lambda_] = ys = 1 - np.exp(-lambda_ * xs)
    utils.plot_line(xs, ys, label=f"$\\lambda={lambda_}$")

plt.xlabel("x")
plt.ylabel("$1 - \\exp(-x)$")
plt.legend()
plt.title("CDF\np(event happened before or at time x| $\\lambda$)", fontsize=12)

plt.sca(axs[1])
for lambda_ in [0.5, 1]:
    ys = 1 - cdfs[lambda_]
    utils.plot_line(xs, ys, label=f"$\\lambda={lambda_}$")

plt.xlabel("x")
plt.ylabel("$\\exp(-x)$")
plt.legend()
plt.title("CCDF\np(event hasn't happened before or at time x| $\\lambda$)", fontsize=12);
../_images/884b739928a0c012c75e4266f451f73ed32799562971b0b7c9c837d9cee7c562.png

统计模型#

\[\begin{split} \begin{align*} D_i | A_i &= 1 \sim \text{Exponential}(\lambda_i) \\ D_i | A_i &= 0 \sim \text{ExponentialCCDF}(\lambda_i) \\ \lambda_i &= \frac{1}{\mu_i} \\ \log \mu_i &= \alpha_{\text{猫咪颜色}[i]} \\ \alpha_{黑色, 其他} &\sim \text{Exponential}(\gamma) \end{align*} \end{split}\]
  • \(D_i | A_i = 1\) - 观测到的收养

  • \(D_i | A_i = 1\) - 尚未观测到的收养

  • \(\alpha_{\text{猫咪颜色}[i]}\) 每种猫咪颜色的对数平均收养时间

  • \(\log \mu_i\) – 链接函数确保 \(\alpha\) 为正数

  • \(\lambda_i = \frac{1}{\mu_i}\) 简化了估计平均 收养时间

\(\alpha\) 找到合理的超参数#

我们需要确定指数先验均值参数 \(\gamma\) 的合理数据。为此,我们将查看收养时间的经验分布

# determine reasonable prior for alpha
plt.subplots(figsize=(6, 3))
CATS[CATS.out_event == "Adoption"].days_to_event.hist(bins=100)
plt.xlim([0, 500]);
../_images/f4d7f8c47f46052747ae4143792e9d8ffd89fced4f2a1ffd8e4a55d3f0942e89.png

使用上面的经验直方图,我们看到大部分概率质量在 0 到 200 之间,因此让我们使用 50 作为预期等待时间。

CAT_COLOR_ID, CAT_COLOR = pd.factorize(CATS.color.apply(lambda x: "Other" if x != "Black" else x))
ADOPTED_ID, ADOPTED = pd.factorize(
    CATS.out_event.apply(lambda x: "Other" if x != "Adoption" else "Adopted")
)
DAYS_TO_ADOPTION = CATS.days_to_event.values.astype(float)
LAMBDA = 50

with pm.Model(coords={"cat_color": CAT_COLOR}) as adoption_model:

    # Censoring
    right_censoring = DAYS_TO_ADOPTION.copy()
    right_censoring[ADOPTED_ID == 0] = DAYS_TO_ADOPTION.max()

    # Priors
    gamma = 1 / LAMBDA
    alpha = pm.Exponential("alpha", gamma, dims="cat_color")

    # Likelihood
    log_adoption_rate = 1 / alpha[CAT_COLOR_ID]
    pm.Censored(
        "adopted",
        pm.Exponential.dist(lam=log_adoption_rate),
        lower=None,
        upper=right_censoring,
        observed=DAYS_TO_ADOPTION,
    )
    adoption_inference = pm.sample()
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 14 seconds.

可怜的黑色小猫 🐈‍⬛#

看来黑色猫咪确实需要更长的时间才能被收养。

后验摘要#

_, ax = plt.subplots(figsize=(4, 2))
az.plot_forest(adoption_inference, var_names=["alpha"], combined=True, hdi_prob=0.89, ax=ax)
plt.xlabel("days to adoption");
../_images/7463a9a3cc421cbb443b994720e1bed8cee6605deb849895b6701f3fae3130d8.png

后验分布#

for ii, cat_color in enumerate(["Black", "Other"]):
    color = "black" if cat_color == "Black" else "C0"
    posterior_alpha = adoption_inference.posterior.sel(cat_color=cat_color)["alpha"]
    az.plot_dist(posterior_alpha, color=color, label=cat_color)
plt.xlabel("waiting time")
plt.ylabel("density")
plt.legend();
../_images/c782345a63e71e4be9174b60c54c5ac3b0c49bb59391b20356de0530fa35b5e1.png
days_until_adoption_ = np.linspace(1, 100)
n_posterior_samples = 25

for cat_color in ["Black", "Other"]:
    color = "black" if cat_color == "Black" else "C0"
    posterior = adoption_inference.posterior.sel(cat_color=cat_color)
    alpha_samples = posterior.alpha.sel(chain=0)[:n_posterior_samples].values
    for ii, alpha in enumerate(alpha_samples):
        label = cat_color if ii == 0 else None
        plt.plot(
            days_until_adoption_,
            np.exp(-days_until_adoption_ / alpha),
            color=color,
            alpha=0.25,
            label=label,
        )

plt.xlabel("days until adoption")
plt.ylabel("fraction")
plt.legend();
../_images/df73d1638b5890cef1285abab12e483f79231d7598ea9d3ee59ef39348747888.png

作者#

  • 移植到 PyMC:Dustin Stansbury (2024)

  • 基于 Richard McElreath 的 Statistical Rethinking (2023) 讲座

%load_ext watermark
%watermark -n -u -v -iv -w -p pytensor,aeppl,xarray
Last updated: Tue Dec 17 2024

Python implementation: CPython
Python version       : 3.12.5
IPython version      : 8.27.0

pytensor: 2.26.4
aeppl   : not installed
xarray  : 2024.7.0

statsmodels: 0.14.2
numpy      : 1.26.4
pandas     : 2.2.2
matplotlib : 3.9.2
scipy      : 1.14.1
xarray     : 2024.7.0
pymc       : 5.19.1
arviz      : 0.19.0

Watermark: 2.5.0

许可声明#

此示例库中的所有笔记本均根据 MIT 许可证 提供,该许可证允许修改和再分发用于任何用途,前提是保留版权和许可声明。

引用 PyMC 示例#

要引用此笔记本,请使用 Zenodo 为 pymc-examples 存储库提供的 DOI。

重要提示

许多笔记本改编自其他来源:博客、书籍……在这种情况下,您也应该引用原始来源。

另请记住引用您的代码使用的相关库。

这是一个 bibtex 中的引用模板

@incollection{citekey,
  author    = "<notebook authors, see above>",
  title     = "<notebook title>",
  editor    = "PyMC Team",
  booktitle = "PyMC examples",
  doi       = "10.5281/zenodo.5654871"
}

渲染后可能看起来像