Using ModelBuilder class for deploying PyMC models#

Motivation#

Many users face difficulty in deploying their PyMC models to production because deploying/saving/loading user-created models is not well standardized. One reason is that there is no direct way to save or load a model in PyMC as there is in scikit-learn or TensorFlow. The new ModelBuilder class is aimed at improving this workflow by providing a scikit-learn-inspired API to wrap your PyMC models.

The new ModelBuilder class allows users to use methods to fit(), predict(), save(), and load(). Users can create any model they want, inherit from the ModelBuilder class, and use the predefined methods.
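
A minimal sketch of that workflow is shown below, using the LinearModel subclass we define later in this notebook; X, y, and X_new are placeholder datasets and the file name is just an example.

# sketch of the scikit-learn-like workflow provided by ModelBuilder
model = LinearModel()                                   # instantiate the user-defined subclass
idata = model.fit(X, y)                                 # sample the posterior; returns an InferenceData object
model.save("linear_model_v1.nc")                        # persist the fitted model as a NetCDF file

model_loaded = LinearModel.load("linear_model_v1.nc")   # classmethod, no instantiation needed
predictions = model_loaded.predict(X_new)               # posterior predictive means for new data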

Let's go through the workflow from start to finish, beginning with a simple linear regression PyMC model as it is usually written. Of course, this model is just a placeholder for your own model.

from typing import Dict, List, Optional, Tuple, Union

import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm
import xarray as xr

from numpy.random import RandomState

%config InlineBackend.figure_format = 'retina'
RANDOM_SEED = 8927

rng = np.random.default_rng(RANDOM_SEED)
az.style.use("arviz-darkgrid")
# Generate data
x = np.linspace(start=0, stop=1, num=100)
y = 0.3 * x + 0.5 + rng.normal(0, 1, len(x))

Standard syntax#

Usually a PyMC model will have this form:

with pm.Model() as model:
    # priors
    a = pm.Normal("a", mu=0, sigma=1)
    b = pm.Normal("b", mu=0, sigma=1)
    eps = pm.HalfNormal("eps", 1.0)

    # observed data
    y_model = pm.Normal("y_model", mu=a + b * x, sigma=eps, observed=y)

    # Fitting
    idata = pm.sample()
    idata.extend(pm.sample_prior_predictive())

    # posterior predict
    idata.extend(pm.sample_posterior_predictive(idata))
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [a, b, eps]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1 seconds.
Sampling: [a, b, eps, y_model]
Sampling: [y_model]

How would we deploy this model? Save the fitted model, load it on an instance, and predict? Not that simple.

ModelBuilder was built for this purpose. It is currently part of the pymc-experimental package (imported below as pymc_extras), which we can install with pip install pymc-experimental. As the name implies, this feature is still experimental and subject to change.

Model builder class#

Let's import the ModelBuilder class.

from pymc_extras.model_builder import ModelBuilder

To define our desired model we inherit from the ModelBuilder class. There are a couple of methods we need to define.

class LinearModel(ModelBuilder):
    # Give the model a name
    _model_type = "LinearModel"

    # And a version
    version = "0.1"

    def build_model(self, X: pd.DataFrame, y: pd.Series, **kwargs):
        """
        build_model creates the PyMC model

        Prior hyperparameters are read from self.model_config
        (e.g. a_mu_prior, a_sigma_prior, b_mu_prior).

        Parameters:
        X : pd.DataFrame
            The input data that is going to be used in the model. This should be a DataFrame
            containing the features (predictors) for the model. For efficiency reasons, it should
            only contain the necessary data columns, not the entire available dataset, as this
            will be encoded into the data used to recreate the model.

        y : pd.Series
            The target data for the model. This should be a Series representing the output
            or dependent variable for the model.

        kwargs : dict
            Additional keyword arguments that may be used for model configuration.
        """
        # Check the type of X and y and adjust access accordingly
        X_values = X["input"].values
        y_values = y.values if isinstance(y, pd.Series) else y
        self._generate_and_preprocess_model_data(X_values, y_values)

        with pm.Model(coords=self.model_coords) as self.model:
            # Create mutable data containers
            x_data = pm.MutableData("x_data", X_values)
            y_data = pm.MutableData("y_data", y_values)

            # prior parameters
            a_mu_prior = self.model_config.get("a_mu_prior", 0.0)
            a_sigma_prior = self.model_config.get("a_sigma_prior", 1.0)
            b_mu_prior = self.model_config.get("b_mu_prior", 0.0)
            b_sigma_prior = self.model_config.get("b_sigma_prior", 1.0)
            eps_prior = self.model_config.get("eps_prior", 1.0)

            # priors
            a = pm.Normal("a", mu=a_mu_prior, sigma=a_sigma_prior)
            b = pm.Normal("b", mu=b_mu_prior, sigma=b_sigma_prior)
            eps = pm.HalfNormal("eps", eps_prior)

            obs = pm.Normal("y", mu=a + b * x_data, sigma=eps, shape=x_data.shape, observed=y_data)

    def _data_setter(
        self, X: Union[pd.DataFrame, np.ndarray], y: Union[pd.Series, np.ndarray] = None
    ):
        if isinstance(X, pd.DataFrame):
            x_values = X["input"].values
        else:
            # Assuming "input" is the first column
            x_values = X[:, 0]

        with self.model:
            pm.set_data({"x_data": x_values})
            if y is not None:
                pm.set_data({"y_data": y.values if isinstance(y, pd.Series) else y})

    @staticmethod
    def get_default_model_config() -> Dict:
        """
        Returns a class default config dict for model builder if no model_config is provided on class initialization.
        The model config dict is generally used to specify the prior values we want to build the model with.
        It supports more complex data structures like lists, dictionaries, etc.
        It will be passed to the class instance on initialization, in case the user doesn't provide any model_config of their own.
        """
        model_config: Dict = {
            "a_mu_prior": 0.0,
            "a_sigma_prior": 1.0,
            "b_mu_prior": 0.0,
            "b_sigma_prior": 1.0,
            "eps_prior": 1.0,
        }
        return model_config

    @staticmethod
    def get_default_sampler_config() -> Dict:
        """
        Returns a class default sampler dict for model builder if no sampler_config is provided on class initialization.
        The sampler config dict is used to send parameters to the sampler .
        It will be used during fitting in case the user doesn't provide any sampler_config of their own.
        """
        sampler_config: Dict = {
            "draws": 1_000,
            "tune": 1_000,
            "chains": 3,
            "target_accept": 0.95,
        }
        return sampler_config

    @property
    def output_var(self):
        return "y"

    @property
    def _serializable_model_config(self) -> Dict[str, Union[int, float, Dict]]:
        """
        _serializable_model_config is a property that returns a dictionary with all the model parameters that we want to save.
        as some of the data structures are not json serializable, we need to convert them to json serializable objects.
        Some models will need them, others can just define them to return the model_config.
        """
        return self.model_config

    def _save_input_params(self, idata) -> None:
        """
        Saves any additional model parameters (other than the dataset) to the idata object.

        These parameters are stored within `idata.attrs` using keys that correspond to the parameter names.
        If you don't need to store any extra parameters, you can leave this method unimplemented.

        Example:
            For saving customer IDs provided as an 'customer_ids' input to the model:
            self.customer_ids = customer_ids.values #this line is done outside of the function, preferably at the initialization of the model object.
            idata.attrs["customer_ids"] = json.dumps(self.customer_ids.tolist())  # Convert numpy array to a JSON-serializable list.
        """
        pass

    def _generate_and_preprocess_model_data(
        self, X: Union[pd.DataFrame, pd.Series], y: Union[pd.Series, np.ndarray]
    ) -> None:
        """
        Depending on the model, we might need to preprocess the data before fitting the model.
        all required preprocessing and conditional assignments should be defined here.
        """
        self.model_coords = None  # in our case we're not using coords, but if we were, we would define them here, or later on in the function, if extracting them from the data.
        # as we don't do any data preprocessing, we just assign the data given by the user. Note that it's a very basic model,
        # and usually we would need to do some preprocessing, or generate the coords from the data.
        self.X = X
        self.y = y

Now we can create the LinearModel object. The first step we need to take care of is data generation:

X = pd.DataFrame(data=np.linspace(start=0, stop=1, num=100), columns=["input"])
y = 0.3 * x + 0.5
y = y + np.random.normal(0, 1, len(x))

model = LinearModel()

After making the object of class LinearModel, we can fit the model using the .fit() method.

Fitting to data#

The fit() method takes one argument, the data on which we need to fit the model. The metadata is saved in the InferenceData object, where the trace is also stored. These are the fields that are saved:

  • id : This is a unique ID given to a model based on model_config, sampler_config, version, and model_type. Users can use it to check if the model matches another model they have defined.

  • model_type : Model type tells us what kind of model it is. In this case it outputs LinearModel.

  • version : In case you want to improve on models, you can keep track of models by their version. As the version changes, the unique hash in the id also changes.

  • sampler_config : It stores the values of the sampler configuration set by the user for this particular model.

  • model_config : It stores the values of the model configuration set by the user for this particular model.

idata = model.fit(X, y)
/var/home/fonnesbeck/repos/pymc-examples/.pixi/envs/default/lib/python3.12/site-packages/pymc/data.py:321: FutureWarning: MutableData is deprecated. All Data variables are now mutable. Use Data instead.
  warnings.warn(
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [a, b, eps]

Sampling 3 chains for 1_000 tune and 1_000 draw iterations (3_000 + 3_000 draws total) took 1 seconds.
We recommend running at least 4 chains for robust computation of convergence diagnostics
Sampling: [a, b, eps, y]
Sampling: [y]
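
These fields can be inspected directly on the returned InferenceData object, for example (a quick sketch, assuming the attribute keys match the names listed above):

# the metadata described above is stored in idata.attrs
print(idata.attrs["id"])
print(idata.attrs["model_type"])
print(idata.attrs["version"])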

Save model to file#

After fitting the model, we might want to save it so the model can be shared as a file and used again by others. With save() and load(), we can quickly call methods for the respective tasks with the following syntax.

fname = "linear_model_v1.nc"
model.save(fname)

This saves a file at the given path and name: a NetCDF .nc file that stores the inference data of the model.
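
Because the saved artifact is a regular NetCDF file, it can also be opened directly with ArviZ if you only want to inspect the stored inference data, independently of ModelBuilder:

# open the saved NetCDF file with ArviZ to inspect its groups and attributes
saved_idata = az.from_netcdf(fname)
saved_idata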

Load a model#

Now if we wanted to deploy this model, or just have someone else use it to predict on data, they would need two things:

  1. the LinearModel class (probably in a .py file)

  2. the linear_model_v1.nc file

With these, you can easily load a fitted model in a different environment (e.g. production):

model_2 = LinearModel.load(fname)
/var/home/fonnesbeck/repos/pymc-examples/.pixi/envs/default/lib/python3.12/site-packages/arviz/data/inference_data.py:157: UserWarning: fit_data group is not defined in the InferenceData scheme
  warnings.warn(
/var/home/fonnesbeck/repos/pymc-examples/.pixi/envs/default/lib/python3.12/site-packages/pymc/data.py:321: FutureWarning: MutableData is deprecated. All Data variables are now mutable. Use Data instead.
  warnings.warn(

Note that load() is a classmethod; we do not need to instantiate a LinearModel object first.

type(model_2)
__main__.LinearModel
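
As a quick sanity check, the metadata of the loaded model should match what we fitted above. The sketch below assumes the loaded inference data is exposed as model_2.idata, which stores the same attributes described earlier:

# the unique id should be identical for the original and the reloaded model
assert model_2.idata.attrs["id"] == idata.attrs["id"]
model_2.version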

Prediction#

Next we might want to predict on new data. The predict() method allows users to do posterior predictive sampling on new data with a fitted model.

Our first task is to create the data on which we need to predict.

x_pred = np.random.uniform(low=1, high=2, size=10)
prediction_data = pd.DataFrame({"input": x_pred})
type(prediction_data["input"].values)
numpy.ndarray

ModelBuilder provides two methods for prediction:

  1. point estimates (the mean) with predict()

  2. full posterior prediction (samples) with predict_posterior()

pred_mean = model_2.predict(prediction_data)
# samples
pred_samples = model_2.predict_posterior(prediction_data)
Sampling: [y]

Sampling: [y]
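
pred_mean should contain one point estimate per row of prediction_data, while pred_samples additionally carries the posterior predictive samples. A quick shape check (a sketch; the exact return types may differ between versions):

# one mean prediction per input row; the posterior predictive additionally has a sample dimension
print(np.asarray(pred_mean).shape)
print(np.asarray(pred_samples).shape)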

After using predict(), we can plot our data and see graphically how well our LinearModel did.

fig, ax = plt.subplots(figsize=(7, 7))
posterior = az.extract(idata, num_samples=20)
x_plot = xr.DataArray(np.linspace(1, 2, 100))
y_plot = posterior["b"] * x_plot + posterior["a"]
Line2 = ax.plot(x_plot, y_plot.transpose(), color="C1")
Line1 = ax.plot(x_pred, pred_mean, "x")
ax.set(title="Posterior predictive regression lines", xlabel="x", ylabel="y")
ax.legend(
    handles=[Line1[0], Line2[0]], labels=["predicted average", "inferred regression line"], loc=0
);
[Figure: posterior predictive regression lines, showing the inferred regression lines and the predicted averages for the new data]
%load_ext watermark
%watermark -n -u -v -iv -w -p pymc_experimental
Last updated: Sat Dec 14 2024

Python implementation: CPython
Python version       : 3.12.5
IPython version      : 8.27.0

pymc_experimental: 0.1.2

pymc_extras: 0.2.0
matplotlib : 3.9.2
arviz      : 0.19.0
xarray     : 2024.7.0
pandas     : 2.2.2
numpy      : 1.26.4
pymc       : 5.19.1

Watermark: 2.5.0

Authors#

  • Authored by Shashank Kirtania and Thomas Wiecki in 2023.

  • Modified and updated by Michał Raczycki in August 2023.

License notice#

All the notebooks in this example gallery are provided under the MIT License which allows modification, and redistribution for any use as long as the copyright and license notices are preserved.

Citing PyMC examples#

To cite this notebook, use the DOI provided by Zenodo for the pymc-examples repository.

Important

Many notebooks are adapted from other sources: blogs, books… In such cases you should cite the original source as well.

Also remember to cite the relevant libraries used by your code.

Here is a citation template in bibtex:

@incollection{citekey,
  author    = "<notebook authors, see above>",
  title     = "<notebook title>",
  editor    = "PyMC Team",
  booktitle = "PyMC examples",
  doi       = "10.5281/zenodo.5654871"
}

which, once rendered, could look like: