审查数据模型#

from copy import copy

import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pymc as pm
import seaborn as sns

from numpy.random import default_rng

%config InlineBackend.figure_format = 'retina'
rng = default_rng(1234)
az.style.use("arviz-darkgrid")

关于贝叶斯生存分析的示例笔记本触及了审查数据的点。审查是缺失数据问题的一种形式，其中大于某个阈值的观测值被裁剪到该阈值，或者小于某个阈值的观测值被裁剪到该阈值，或者两者都进行裁剪。这些分别称为右审查、左审查和区间审查。在本示例笔记本中，我们考虑区间审查。

审查数据出现在许多建模问题中。两个常见的例子是

生存分析： 当研究某种医疗 treatment 对生存时间的影响时，不可能延长研究直到所有受试者都死亡。在研究结束时，为许多患者收集的唯一数据是他们在接受 treatment 后的一段时间 \(T\) 内仍然活着：实际上，他们的真实生存时间大于 \(T\)。
传感器饱和： 传感器可能具有有限的范围，上限和下限将只是传感器可以报告的最高和最低值。例如，许多水银温度计仅报告非常狭窄的温度范围。

本示例笔记本介绍了在 PyMC3 中处理审查数据的两种不同方法

一个插补审查模型，它将审查数据表示为参数，并为所有审查值生成合理的可能值。由于这种插补，该模型能够生成可能的一组人为编造的值，这些值本来会被审查。每个审查元素都引入一个随机变量。
一个未插补审查模型，其中审查数据被积分出来，并且仅通过对数似然来考虑。这种方法更充分地处理大量审查数据，并且收敛速度更快。

为了建立基线，我们将与未审查数据的未审查模型进行比较。

# Produce normally distributed samples
size = 500
true_mu = 13.0
true_sigma = 5.0
samples = rng.normal(true_mu, true_sigma, size)

# Set censoring limits
low = 3.0
high = 16.0


def censor(x, low, high):
    x = copy(x)
    x[x <= low] = low
    x[x >= high] = high
    return x


# Censor samples
censored = censor(samples, low, high)

# Visualize uncensored and censored data
_, ax = plt.subplots(figsize=(10, 3))
edges = np.linspace(-5, 35, 30)
ax.hist(samples, bins=edges, density=True, histtype="stepfilled", alpha=0.2, label="Uncensored")
ax.hist(censored, bins=edges, density=True, histtype="stepfilled", alpha=0.2, label="Censored")
[ax.axvline(x=x, c="k", ls="--") for x in [low, high]]
ax.legend();

../_images/5d33add778ae781898173c0065c76cd136bc93722573ac843c9f89db0a03f7c2.png

未审查模型#

def uncensored_model(data):
    with pm.Model() as model:
        mu = pm.Normal("mu", mu=((high - low) / 2) + low, sigma=(high - low))
        sigma = pm.HalfNormal("sigma", sigma=(high - low) / 2.0)
        observed = pm.Normal("observed", mu=mu, sigma=sigma, observed=data)
    return model

我们应该预测，在未审查数据上运行未审查模型，我们将获得均值和方差的合理估计。

uncensored_model_1 = uncensored_model(samples)
with uncensored_model_1:
    idata = pm.sample()

az.plot_posterior(idata, ref_val=[true_mu, true_sigma], round_to=3);

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [mu, sigma]

100.00% [8000/8000 00:03<00:00 Sampling 4 chains, 0 divergences]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 4 seconds.

../_images/8691c1e31edbedbb5a0e5e4aa7f5364043f254dcb1aae85490ac4509244432f6.png

这正是我们发现的。

然而，问题在于，在审查数据上下文中，我们无法访问真实值。如果我们在审查数据上使用相同的未审查模型，我们预计我们的参数估计将会有偏差。如果我们计算均值和标准差的点估计，那么我们可以看到我们可能低估了此特定数据集和审查边界的均值和标准差。

print(f"mean={np.mean(censored):.2f}; std={np.std(censored):.2f}")

mean=12.32; std=3.76

uncensored_model_2 = uncensored_model(censored)
with uncensored_model_2:
    idata = pm.sample()

az.plot_posterior(idata, ref_val=[true_mu, true_sigma], round_to=3);

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [mu, sigma]

100.00% [8000/8000 00:03<00:00 Sampling 4 chains, 0 divergences]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 4 seconds.

../_images/0d343ef387a38347fd570208b0137c0401559c61706d5045ac8577a63c2fc4d6.png

上面的图证实了这一点。