[PATE] A Walkthrough of the Student-Side Implementation [Differential Privacy]

Posted at 2024-11-02

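I got stuck in my research, so I am going back over the entire student-side implementation of PATE.

Writing it up as I go should also deepen my understanding.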

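Setup

Only binary classification is supported
Training uses CatBoost
Some operations specific to CTR datasets are included
Labels are selected by argmax
Everything else is omitted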

Walkthrough

The imports and the command-line argument parsing (args) are omitted.

from lib.utils import (
    sample_one_vectorized,
    weighted_acc,
    optimal_pred,
    get_eps,
    get_eps_data_independent,
)

# Data
print("==> Preparing data..")
# omitted
else:
    continue_var = ["I" + str(i) for i in range(1, 14)]
    cat_features = ["C" + str(i) for i in range(1, 27)]
    trainset_path = os.path.join(args.data_path, args.dataset + "_train.csv")
    testset_path = os.path.join(args.data_path, args.dataset + "_test.csv")
    train = pd.read_csv(trainset_path)
    test = pd.read_csv(testset_path)

    y_train_original = train[["Label"]]
    x_train_original = train.drop(["Label"], axis=1)
    y_test = test[["Label"]]
    x_test = test.drop(["Label"], axis=1)

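This is the data-loading part. Every object is a pd.DataFrame.

y_train_original: ground-truth labels of the training data
x_train_original: features of the training data
y_test: ground-truth labels of the test data
x_test: features of the test data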

np.random.seed(args.seed)

x_test, x_val, y_test, y_val = train_test_split(
    x_test, y_test, test_size=0.20, stratify=y_test, random_state=256
)

student_indices = np.random.permutation(len(x_train_original))

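train_test_split: splits a validation set (20%) off from the test data
- Splitting validation data off from the training data would be difficult, because the training data has to stay consistent with what the teachers used
stratify=y_test: splits so that both parts keep the same label distribution as y_test
student_indices: a random permutation of the training-data row indices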

# Aggregate votes
print("==> Aggregating votes..")


def get_one_teacher_vote(prob, mode):
    if mode == "argmax":
        votes = np.zeros_like(prob)
        votes[np.arange(len(prob)), prob.argmax(1)] = 1
    elif mode == "sample":
        samples = sample_one_vectorized(prob)
        votes = np.zeros_like(prob)
        votes[np.arange(len(prob)), samples] = 1
    else:
        raise NotImplementedError
    return votes

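When mode == "argmax":
- For each row of the predicted probability distribution prob, store 1 at the index of the largest value
- In other words, a one-hot encoding of the teacher's predicted class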
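As a small illustration of the indexing trick used in the argmax branch, here is a toy run on made-up probabilities (the array values are hypothetical, not taken from the real teachers):

```python
import numpy as np

# Toy predicted probabilities from one teacher: 3 data points, 2 classes.
prob = np.array([[0.9, 0.1],
                 [0.3, 0.7],
                 [0.5, 0.5]])

votes = np.zeros_like(prob)
# Integer-array indexing: for row i, set column prob.argmax(1)[i] to 1.
votes[np.arange(len(prob)), prob.argmax(1)] = 1
print(votes)
```

Note that on an exact tie, as in the last row, argmax returns the first index, so the vote goes to class 0.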

teacher_votes = []
teacher_votes_test = []
for i in range(args.n_teachers):
    with open(
        args.model_path + "teacher_{0}/checkpoint_stats.pkl".format(i), "rb"
    ) as f:
        checkpoint = pickle.load(f)
    teacher_prob = checkpoint["train_original_prob"]
    teacher_prob_test = checkpoint["test_prob"]
    teacher_votes.append(get_one_teacher_vote(teacher_prob, args.tally_method))
    teacher_votes_test.append(
        get_one_teacher_vote(teacher_prob_test, args.tally_method)
    )

agg_votes = sum(teacher_votes)
agg_votes_test = sum(teacher_votes_test)

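The training data is partitioned across the teachers, so the votes are processed one teacher at a time.
Note the following:

teacher_prob: teacher i's predicted label probabilities, computed from the features of the entire training data
- np.ndarray of shape (n_samples, num_classes)
agg_votes: how many teachers predicted that data point i belongs to label j
- np.ndarray of shape (n_samples, num_classes)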
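The aggregation itself is just an element-wise sum of the one-hot vote matrices. A minimal sketch with three hypothetical teachers and four data points:

```python
import numpy as np

# Hypothetical one-hot votes from 3 teachers on 4 data points (2 classes).
teacher_votes = [
    np.array([[1, 0], [0, 1], [1, 0], [0, 1]]),
    np.array([[1, 0], [1, 0], [1, 0], [0, 1]]),
    np.array([[0, 1], [0, 1], [1, 0], [0, 1]]),
]

# Python's built-in sum adds the arrays element-wise.
agg_votes = sum(teacher_votes)
print(agg_votes)
```

Each row of agg_votes sums to the number of teachers, since every teacher casts exactly one vote per data point.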

# eps = get_eps(agg_votes, args.mechanism, args.n_samples, args.result_noise, args.selection_noise, args.noise_threshold)
# print('data dependent eps is {0}'.format(eps))

eps = get_eps_data_independent(args.n_samples, args.result_noise, 1e-4)
print("data independent eps is {0}".format(eps))

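This part computes ${\varepsilon}$. There are two versions, one that depends on the data and one that does not. Only the data-independent one is actually run here, since the data-dependent call is commented out.
- I would like to understand how the data-dependent version is computed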
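The body of get_eps_data_independent is not shown here, so the following is only a rough sketch of what a data-independent bound can look like. It assumes plain Gaussian noisy-max aggregation, where each answered query satisfies Rényi DP of order ${\alpha}$ with cost ${\alpha / \sigma^2}$, and it assumes that the second argument is the noise standard deviation ${\sigma}$ and the third is ${\delta}$. The function name and the grid of orders are hypothetical.

```python
import math


def gnmax_eps_data_independent_sketch(n_queries, sigma, delta, orders=None):
    """Data-independent (eps, delta) bound for n_queries GNMax answers (sketch)."""
    if orders is None:
        # Hypothetical grid of Renyi orders alpha in [1.1, 100.9].
        orders = [1 + x / 10.0 for x in range(1, 1000)]
    best = float("inf")
    for alpha in orders:
        # RDP composes additively: n_queries times the per-query cost alpha / sigma^2.
        rdp = n_queries * alpha / sigma ** 2
        # Standard conversion from RDP of order alpha to (eps, delta)-DP.
        eps = rdp + math.log(1.0 / delta) / (alpha - 1)
        best = min(best, eps)
    return best


print(gnmax_eps_data_independent_sketch(100, 40.0, 1e-4))
```

Under these assumptions the bound grows with the number of queries and shrinks as the noise scale grows, which is the trade-off that n_samples and result_noise control.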

class_weights = [y_train_original.mean(), 1 - y_train_original.mean()]
B = (
    y_train_original.mean() * class_weights[1]
    + (1 - y_train_original.mean()) * class_weights[0]
).item()

print("B is {0}".format(B))

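class_weights: the weight of each class. Class 0 gets the frequency of class 1 and class 1 gets the frequency of class 0, so in the binary case each weight is proportional to the inverse of the class's own frequency
B: ${2\mu _0 \mu _1}$, where ${\mu _k}$ is the frequency of class k
- Used later for normalization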
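A quick numeric check of these two quantities, using a hypothetical positive rate of 0.25 in place of y_train_original.mean():

```python
# Hypothetical class frequencies (CTR-style data is usually imbalanced).
mu_1 = 0.25        # P(Label == 1), stands in for y_train_original.mean()
mu_0 = 1 - mu_1    # P(Label == 0)

# Same construction as in the script: class 0 is weighted by mu_1, class 1 by mu_0.
class_weights = [mu_1, mu_0]
B = mu_1 * class_weights[1] + mu_0 * class_weights[0]

print(B, 2 * mu_0 * mu_1)  # both are 0.375

# The ratio of the weights equals the ratio of the inverse frequencies.
ratio_weights = class_weights[0] / class_weights[1]
ratio_inverse = (1 / mu_0) / (1 / mu_1)
```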

y_train_opt_pred = optimal_pred(agg_votes, class_weights)
advantage_train = (
    weighted_acc(y_train_original.to_numpy().squeeze(), y_train_opt_pred, class_weights)
) / B
y_test_opt_pred = optimal_pred(agg_votes_test, class_weights)
advantage_test = (
    weighted_acc(y_test.to_numpy().squeeze(), y_test_opt_pred, class_weights)
) / B

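weighted_acc: weighted accuracy
- The class weight of the true label, counted only for correctly predicted data points and averaged over all N points
- ${\text{wa} = \frac{1}{N}\sum_{i=1}^N \mathbb 1(y_i = \hat y_i)\cdot w_{y_i}}$
B: the normalization constant
- ${\text B = \sum_{k=1}^K p_{k}\cdot w_{k}}$
advantage_train: weighted accuracy of the predictions derived from the aggregated teacher votes against the true labels, over the whole training data, divided by B
- ${0\le \frac{\frac{1}{N}\sum_{i=1}^N \mathbb 1(y_i = \hat y_i)\cdot w_{y_i}}{\sum_{k=1}^K p_{k}\cdot w_{k}}\le 1}$
- The upper bound holds because the numerator equals B when every prediction is correct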
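The real weighted_acc lives in lib.utils and is not shown, so the function below is a hypothetical reimplementation that simply follows the formula above. It is meant to illustrate how the normalization by B behaves on a toy label vector.

```python
import numpy as np


def weighted_acc_sketch(y_true, y_pred, class_weights):
    """(1/N) * sum of w[y_i] over correctly predicted points (assumed definition)."""
    w = np.asarray(class_weights, dtype=float)
    correct = (y_true == y_pred)
    return float(np.mean(correct * w[y_true]))


# Toy labels with a positive rate of 0.25.
y_true = np.array([0, 0, 0, 1])
mu_1 = y_true.mean()
class_weights = [mu_1, 1 - mu_1]
B = 2 * mu_1 * (1 - mu_1)

adv_perfect = weighted_acc_sketch(y_true, y_true, class_weights) / B
adv_all_zero = weighted_acc_sketch(y_true, np.zeros_like(y_true), class_weights) / B
adv_all_one = weighted_acc_sketch(y_true, np.ones_like(y_true), class_weights) / B
print(adv_perfect, adv_all_zero, adv_all_one)
```

With this weighting, a perfect predictor scores 1 and a constant predictor scores 0.5 regardless of which class it always outputs, so the metric does not reward simply guessing the majority class.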

# need to reorder to align the prediction and indices
agg_votes = agg_votes[student_indices]

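Reorder agg_votes by student_indices, i.e. shuffle it so that row k corresponds to original row student_indices[k].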
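The bookkeeping can be checked on a toy array. This sketch uses np.random.default_rng instead of the global np.random.seed used in the script, and the vote values are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
agg_votes = np.array([[5, 0], [4, 1], [3, 2], [2, 3], [1, 4]])
student_indices = rng.permutation(len(agg_votes))

shuffled = agg_votes[student_indices]

# Row k of the shuffled array is the vote vector of original row student_indices[k].
aligned = all(
    (shuffled[k] == agg_votes[orig]).all()
    for k, orig in enumerate(student_indices)
)
print(student_indices, aligned)
```

This is why student_indices has to be kept around: it is the only way to map a position in the shuffled votes back to a row of x_train_original.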

def noisy_threshold_labels_custom(
    votes,
    mechanism,
    threshold,
    selection_noise_scale,
    result_noise_scale,
    mode,
    class_1_portion,
):
    def noise(scale, mechanism, shape):
        if scale == 0:
            return 0
        if mechanism.startswith("lnmax"):
            return np.random.laplace(0, scale, shape)
        elif mechanism.startswith("gnmax"):
            return np.random.normal(0, scale, shape)
        else:
            raise NotImplementedError
    if mechanism == "gnmax_conf":
        noisy_votes = votes + noise(selection_noise_scale, mechanism, votes.shape)
        over_t_mask = noisy_votes.max(axis=1) > threshold
        over_t_counts = votes[over_t_mask] + noise(
            result_noise_scale, mechanism, votes[over_t_mask].shape
        )
    else:
        noisy_votes = votes + noise(result_noise_scale, mechanism, votes.shape)
        over_t_mask = noisy_votes.max(axis=1) > float("-inf")
        over_t_counts = noisy_votes
    if mode == "argmax":
        over_t_labels = over_t_counts.argmax(axis=1)
    else:
        raise NotImplementedError

    return over_t_labels, over_t_mask

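This function takes the aggregated teacher votes and returns two things: the noisy labels of the data points whose noisy top vote count is confident enough, and a boolean array that is True at the positions of those confident data points.
lnmax: Laplace noise
gnmax: Gaussian noise
The following part,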

    if mechanism == "gnmax_conf":
        noisy_votes = votes + noise(selection_noise_scale, mechanism, votes.shape)
        over_t_mask = noisy_votes.max(axis=1) > threshold
        over_t_counts = votes[over_t_mask] + noise(
            result_noise_scale, mechanism, votes[over_t_mask].shape
        )

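implements the confident aggregation of the teacher ensemble: selection noise is added to the votes and only the rows whose noisy maximum exceeds the threshold are kept, then fresh result noise is added to the original votes of the kept rows.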
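To see the two-step behaviour in isolation, here is a stripped-down sketch of the gnmax_conf branch. The function name, the explicit rng argument, and the vote counts are hypothetical. The noise scales are chosen small relative to the vote gaps so that the outcome is effectively deterministic.

```python
import numpy as np


def confident_gnmax_sketch(votes, threshold, sigma_select, sigma_result, rng):
    # Step 1 (selection): the noisy maximum vote count must exceed the threshold.
    noisy = votes + rng.normal(0, sigma_select, votes.shape)
    mask = noisy.max(axis=1) > threshold
    # Step 2 (answer): fresh noise is added to the ORIGINAL votes of the kept rows.
    counts = votes[mask] + rng.normal(0, sigma_result, votes[mask].shape)
    return counts.argmax(axis=1), mask


rng = np.random.default_rng(0)
# 100 hypothetical teachers: strong consensus, near tie, strong consensus.
votes = np.array([[98.0, 2.0], [52.0, 48.0], [1.0, 99.0]])
labels, mask = confident_gnmax_sketch(
    votes, threshold=80, sigma_select=1.0, sigma_result=1.0, rng=rng
)
print(labels, mask)
```

The near-tie row is dropped at the selection step, and the returned labels array has one entry per kept row rather than one per input row.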

labels, threshold_mask = noisy_threshold_labels_custom(
    votes=agg_votes,
    mechanism=args.mechanism,
    threshold=args.noise_threshold,
    selection_noise_scale=args.selection_noise,
    result_noise_scale=args.result_noise,
    mode=args.selection_method,
    class_1_portion=y_train_original.mean().values[0],
)

threshold_indices = threshold_mask.nonzero()[0]
indices = student_indices[threshold_indices][: args.n_samples]
labels = labels[: args.n_samples]

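labels: labels after adding noise (in shuffled order), one per above-threshold data point
threshold_mask: boolean array that is True where the noisy maximum vote count exceeded the threshold
- That is, data points that the teachers are presumably labeling with reasonable confidence
threshold_indices: positions (in shuffled order) of those confident data points
indices: original row indices of the first n_samples confident data points in shuffled order
labels: after slicing, the n_samples noisy labels
- At first glance this looks like it should be labels[threshold_indices], but labels is computed from votes[over_t_mask] inside the function, so it is already restricted to the confident rows and lines up with threshold_indices as is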
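The alignment between labels and indices can be checked on a toy example. All arrays below are made up. The key point is that labels has one entry per True value in the mask, not one per data point.

```python
import numpy as np

student_indices = np.array([3, 0, 4, 1, 2])                   # shuffled original rows
threshold_mask = np.array([True, False, True, True, False])   # one flag per shuffled row
labels = np.array([1, 0, 1])                                  # one label per True flag
n_samples = 2

threshold_indices = threshold_mask.nonzero()[0]               # [0, 2, 3]
indices = student_indices[threshold_indices][:n_samples]      # original rows [3, 4]
labels_kept = labels[:n_samples]                              # [1, 0]

# Indexing the already-filtered labels with threshold_indices goes out of bounds.
try:
    labels[threshold_indices]
    indexing_failed = False
except IndexError:
    indexing_failed = True
print(indices, labels_kept, indexing_failed)
```

So labels[: args.n_samples] pairs the k-th kept label with the k-th kept row, which is the pairing the student training relies on.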

x_student = x_train_original.iloc[indices]
y_student_actual = y_train_original.iloc[indices]
y_student = labels

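x_student: original features of the n_samples above-threshold data points
y_student_actual: the true labels of those data points (used only for evaluation)
y_student: the n_samples noisy labels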

print("==> Training..")
checkpoint = {}
if args.algo == "catboost":
    cat_features = [col for col in train.columns if "C" in col]
    model = CatBoostClassifier(
        iterations=100,
        learning_rate=0.4,
        task_type="CPU",
        loss_function="Logloss",
        depth=8,
    )

    fit_model = model.fit(
        x_student,
        y_student,
        eval_set=(x_val, y_val),
        cat_features=cat_features,
        verbose=10,
    )

    y_test_prob = model.predict(
        x_test,
        prediction_type="Probability",
        ntree_end=model.get_best_iteration(),
        thread_count=-1,
        verbose=None,
    )

    y_val_prob = model.predict(
        x_val,
        prediction_type="Probability",
        ntree_start=0,
        ntree_end=model.get_best_iteration(),
        thread_count=-1,
        verbose=None,
    )

    y_student_prob = model.predict(
        x_student,
        prediction_type="Probability",
        ntree_start=0,
        ntree_end=model.get_best_iteration(),
        thread_count=-1,
        verbose=None,
    )

    y_train_original_prob = model.predict(
        x_train_original,
        prediction_type="Probability",
        ntree_start=0,
        ntree_end=model.get_best_iteration(),
        thread_count=-1,
        verbose=None,
    )
else:
    raise NotImplementedError

The code for algorithms other than CatBoost is omitted.

The student is trained on x_student with the noisy labels y_student.

print("==> Evaluating..")

checkpoint["log_loss"] = log_loss(y_test, y_test_prob[:, 1])

print("test log loss is {0}".format(checkpoint["log_loss"]))

# TODO add if clause for logistic regression
y_student_opt_pred = optimal_pred(y_student_prob, class_weights)
checkpoint["advantage_student"] = (
    weighted_acc(
        y_student_actual.to_numpy().squeeze(), y_student_opt_pred, class_weights
    )
) / B
print(
    "advantage student dataset after training is {0}".format(
        checkpoint["advantage_student"]
    )
)

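y_student_opt_pred: the predicted probabilities of the model trained on noisy labels are multiplied by the class weights, and the label with the largest weighted value is chosen
advantage_student: weighted accuracy of those predictions against the actual labels
- It averages the class weights over the correctly predicted points and divides by B
- The same is done for the training and test sets, so that part is omitted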
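Since optimal_pred also comes from lib.utils and is not shown, the function below is a hypothetical sketch that follows the description above: multiply the scores by the class weights and take the argmax. The input values are made up.

```python
import numpy as np


def optimal_pred_sketch(scores, class_weights):
    """Argmax over class-weighted scores (assumed behaviour of optimal_pred)."""
    w = np.asarray(class_weights, dtype=float)
    return np.argmax(scores * w, axis=1)


mu_1 = 0.25
class_weights = [mu_1, 1 - mu_1]

# Works on vote counts as well as on predicted probabilities.
scores = np.array([[2.0, 1.0],    # majority votes 0, weighting flips it to 1
                   [3.0, 0.0],
                   [0.7, 0.3],    # p1 = 0.3 > mu_1, so class 1
                   [0.8, 0.2]])   # p1 = 0.2 < mu_1, so class 0
pred = optimal_pred_sketch(scores, class_weights)
print(pred)
```

Under this assumed definition, the binary case amounts to predicting class 1 whenever its probability exceeds ${\mu _1}$ instead of 0.5, which is why these predictions can differ from the plain argmax.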

train_predictions = np.argmax(y_train_original_prob, axis=1)
train_labels = y_train_original.to_numpy().squeeze()
print(
    "train acc after training is {0}".format((train_predictions == train_labels).mean())
)
print("train prediction percent 0 is {0}".format(1 - np.mean(train_predictions)))

# the test set is handled the same way, so it is omitted

checkpoint["eps"] = eps

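train acc after training: plain accuracy, i.e. the class with the highest predicted probability is chosen and compared with the true label, without any weighting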

The code for saving the results is omitted.
