日本人用にトレーニングを追加した dlib の顔認証モデル

Last updated at 2024-05-10Posted at 2024-05-07

日本人用にトレーニングした dlib の顔認証モデル

ディープメトリックラーニングによる高品質の顔認証

dlib には優秀な顔認証モデルがあります。しかし、残念ながら人種に偏りがあり、特に北東アジア人、とりわけ日本人に関してはまったく役に立たないと言っても過言ではありません。ですが、それは仕方のないことだと思います。訓練に使用した顔認証用のデータセットに偏りがあったためです。

そこで私は日本人を中心とした多くのデータセットを収集し、ゼロから訓練を行いました。これには途方もない時間を要しましたが、ある程度は実用に耐えうる基準に達したと思います。日本人用に訓練したとは言っても欧米人の顔認証に関しても dlib のモデルに近い結果となってます。dlib の example で用意されているハリウッドのアクションヒーローの写真も dlib と同様に分類が可能です。

冒頭の画像は日本人の俳優の分類を私のモデルで行ったモノです。
この下の画像は dlib『dlib_face_recognition_resnet_model_v1.dat』モデルを使用したときの分類結果です。

ソースコード

/*
  Written by Taguchi.
    The original source code is in the examples folder of dlib. This source code has been 
    modified so that it can be tested by specifying "taguchi_face_recognition_resnet_model_v1.dat" and
    "dlib_face_recognition_resnet_model_v1.dat" with the command.

     Example 1) dnn_face_recognition_ex faces/JapaneseBaldActor.jpg taguchi_face_recognition_resnet_model_v1.dat
     Example 2) dnn_face_recognition_ex faces/JapaneseBaldActor.jpg dlib_face_recognition_resnet_model_v1.dat
*/

#include <dlib/dnn.h>
#include <dlib/gui_widgets.h>
#include <dlib/clustering.h>
#include <dlib/string.h>
#include <dlib/image_io.h>
#include <dlib/image_processing/frontal_face_detector.h>

using namespace dlib;
using namespace std;

// --------------------------------------------------------------

// The next bit of code defines a ResNet network.  It's basically copied
// and pasted from the dnn_imagenet_ex.cpp example, except we replaced the loss
// layer with loss_metric and made the network somewhat smaller.  Go read the introductory
// dlib DNN examples to learn what all this stuff means.
//
// Also, the dnn_metric_learning_on_images_ex.cpp example shows how to train this network.
// The dlib_face_recognition_resnet_model_v1 model used by this example was trained using
// essentially the code shown in dnn_metric_learning_on_images_ex.cpp except the
// mini-batches were made larger (35x15 instead of 5x5), the iterations without progress
// was set to 10000, and the training dataset consisted of about 3 million images instead of
// 55.  Also, the input layer was locked to images of size 150.
template <template <int, template<typename>class, int, typename> class block, int N, template<typename>class BN, typename SUBNET>
using residual = add_prev1<block<N, BN, 1, tag1<SUBNET>>>;

template <template <int, template<typename>class, int, typename> class block, int N, template<typename>class BN, typename SUBNET>
using residual_down = add_prev2<avg_pool<2, 2, 2, 2, skip1<tag2<block<N, BN, 2, tag1<SUBNET>>>>>>;

template <int N, template <typename> class BN, int stride, typename SUBNET>
using block = BN<con<N, 3, 3, 1, 1, relu<BN<con<N, 3, 3, stride, stride, SUBNET>>>>>;

template <int N, typename SUBNET> using ares = relu<residual<block, N, affine, SUBNET>>;
template <int N, typename SUBNET> using ares_down = relu<residual_down<block, N, affine, SUBNET>>;

template <typename SUBNET> using alevel0 = ares_down<256, SUBNET>;
template <typename SUBNET> using alevel1 = ares<256, ares<256, ares_down<256, SUBNET>>>;
template <typename SUBNET> using alevel2 = ares<128, ares<128, ares_down<128, SUBNET>>>;
template <typename SUBNET> using alevel3 = ares<64, ares<64, ares<64, ares_down<64, SUBNET>>>>;
template <typename SUBNET> using alevel4 = ares<32, ares<32, ares<32, SUBNET>>>;

using anet_type = loss_metric<fc_no_bias<128, avg_pool_everything<
    alevel0<
    alevel1<
    alevel2<
    alevel3<
    alevel4<
    max_pool<3, 3, 2, 2, relu<affine<con<32, 7, 7, 2, 2,
    input_rgb_image_sized<150>
    >>>>>>>>>>>>;

// --------------------------------------------------------------

std::vector<matrix<rgb_pixel>> jitter_image(
    const matrix<rgb_pixel>& img
);

// --------------------------------------------------------------

int main(int argc, char** argv) try
{
    if (argc != 3)
    {
        cout << "Run this example by invoking it like this: " << endl;
        cout << "   ./dnn_face_recognition_ex faces/bald_guys.jpg" << endl;
        cout << endl;
        cout << "You will also need to get the face landmarking model file as well as " << endl;
        cout << "the face recognition model file.  Download and then decompress these files from: " << endl;
        cout << "http://dlib.net/files/shape_predictor_5_face_landmarks.dat.bz2" << endl;
        cout << "http://dlib.net/files/dlib_face_recognition_resnet_model_v1.dat.bz2" << endl;
        cout << endl;
        return 1;
    }

    // The first thing we are going to do is load all our models.  First, since we need to
    // find faces in the image we will need a face detector:
    frontal_face_detector detector = get_frontal_face_detector();
    // We will also use a face landmarking model to align faces to a standard pose:  (see face_landmark_detection_ex.cpp for an introduction)
    shape_predictor sp;
    deserialize("shape_predictor_5_face_landmarks.dat") >> sp;
    // And finally we load the DNN responsible for face recognition.
    anet_type net;
    //deserialize("dlib_face_recognition_resnet_model_v1.dat") >> net;
    //deserialize("taguchi_face_recognition_resnet_model_v1.dat") >> net;
    deserialize(argv[2]) >> net;

    matrix<rgb_pixel> img;
    load_image(img, argv[1]);
    // Display the raw image on the screen
    image_window win(img);

    // Run the face detector on the image of our action heroes, and for each face extract a
    // copy that has been normalized to 150x150 pixels in size and appropriately rotated
    // and centered.
    std::vector<matrix<rgb_pixel>> faces;
    for (auto face : detector(img))
    {
        auto shape = sp(img, face);
        matrix<rgb_pixel> face_chip;
        extract_image_chip(img, get_face_chip_details(shape, 150, 0.25), face_chip);
        faces.push_back(move(face_chip));
        // Also put some boxes on the faces so we can see that the detector is finding
        // them.
        win.add_overlay(face);
    }

    if (faces.size() == 0)
    {
        cout << "No faces found in image!" << endl;
        return 1;
    }

    // This call asks the DNN to convert each face image in faces into a 128D vector.
    // In this 128D vector space, images from the same person will be close to each other
    // but vectors from different people will be far apart.  So we can use these vectors to
    // identify if a pair of images are from the same person or from different people.  
    std::vector<matrix<float, 0, 1>> face_descriptors = net(faces);


    // In particular, one simple thing we can do is face clustering.  This next bit of code
    // creates a graph of connected faces and then uses the Chinese whispers graph clustering
    // algorithm to identify how many people there are and which faces belong to whom.
    std::vector<sample_pair> edges;
    for (size_t i = 0; i < face_descriptors.size(); ++i)
    {
        for (size_t j = i; j < face_descriptors.size(); ++j)
        {
            // Faces are connected in the graph if they are close enough.  Here we check if
            // the distance between two face descriptors is less than 0.6, which is the
            // decision threshold the network was trained to use.  Although you can
            // certainly use any other threshold you find useful.
            if (length(face_descriptors[i] - face_descriptors[j]) < 0.6)
                edges.push_back(sample_pair(i, j));
        }
    }
    std::vector<unsigned long> labels;
    const auto num_clusters = chinese_whispers(edges, labels);
    // This will correctly indicate that there are 4 people in the image.
    cout << "number of people found in the image: " << num_clusters << endl;


    // Now let's display the face clustering results on the screen.  You will see that it
    // correctly grouped all the faces. 
    std::vector<image_window> win_clusters(num_clusters);
    for (size_t cluster_id = 0; cluster_id < num_clusters; ++cluster_id)
    {
        std::vector<matrix<rgb_pixel>> temp;
        for (size_t j = 0; j < labels.size(); ++j)
        {
            if (cluster_id == labels[j])
                temp.push_back(faces[j]);
        }
        win_clusters[cluster_id].set_title("face cluster " + cast_to_string(cluster_id));
        win_clusters[cluster_id].set_image(tile_images(temp));
    }




    // Finally, let's print one of the face descriptors to the screen.  
    cout << "face descriptor for one face: " << trans(face_descriptors[0]) << endl;

    // It should also be noted that face recognition accuracy can be improved if jittering
    // is used when creating face descriptors.  In particular, to get 99.38% on the LFW
    // benchmark you need to use the jitter_image() routine to compute the descriptors,
    // like so:
    matrix<float, 0, 1> face_descriptor = mean(mat(net(jitter_image(faces[0]))));
    cout << "jittered face descriptor for one face: " << trans(face_descriptor) << endl;
    // If you use the model without jittering, as we did when clustering the bald guys, it
    // gets an accuracy of 99.13% on the LFW benchmark.  So jittering makes the whole
    // procedure a little more accurate but makes face descriptor calculation slower.


    cout << "hit enter to terminate" << endl;
    cin.get();
}
catch (std::exception& e)
{
    cout << e.what() << endl;
}

// --------------------------------------------------------------------

std::vector<matrix<rgb_pixel>> jitter_image(
    const matrix<rgb_pixel>& img
)
{
    // All this function does is make 100 copies of img, all slightly jittered by being
    // zoomed, rotated, and translated a little bit differently. They are also randomly
    // mirrored left to right.
    thread_local dlib::rand rnd;

    std::vector<matrix<rgb_pixel>> crops;
    for (int i = 0; i < 100; ++i)
        crops.push_back(jitter_image(img, rnd));

    return crops;
}

LFWデターセットによる dlib との精度の比較

私はある程度の客観性を提示するために LFW デターセットによる精度をテストしました。
まずは dlib のモデル "dlib_face_recognition_resnet_model_v1.dat" でテストしました。
結果は 0.993833% でした。

dlib original version
C:\Users\Taguchi\Desktop\dlib_face_recognition_resnet_model_v1_lfw_test\build>main
dist thresh: 0.6
margin: 0.04
overall lfw accuracy: 0.993833
pos lfw accuracy: 0.994667
neg lfw accuracy: 0.993
fold mean: 0.995
fold mean: 0.991667
fold mean: 0.991667
fold mean: 0.99
fold mean: 0.996667
fold mean: 0.996667
fold mean: 0.99
fold mean: 0.995
fold mean: 0.996667
fold mean: 0.995
rscv.mean(): 0.993833
rscv.stddev(): 0.00272732
ERR accuracy: 0.993
ERR thresh: 0.595778

続いて私のモデル "taguchi_face_recognition_resnet_model_v1.dat" でテストしました。
結果は 0.9895% でした。

Taguchi model version
C:\Users\Taguchi\dlib_face_recognition_resnet_model_v1_lfw_test\build>main
dist thresh: 0.6
margin: 0.04
overall lfw accuracy: 0.9895
pos lfw accuracy: 0.984
neg lfw accuracy: 0.995
fold mean: 0.986667
fold mean: 0.99
fold mean: 0.986667
fold mean: 0.99
fold mean: 0.988333
fold mean: 0.993333
fold mean: 0.981667
fold mean: 0.993333
fold mean: 0.991667
fold mean: 0.993333
rscv.mean(): 0.9895
rscv.stddev(): 0.00377205
ERR accuracy: 0.989
ERR thresh: 0.614296

ご覧のように dlib と私のモデルでは 0.004333% dlib のほうが精度を上回っています。
この精度の差は誤差の範囲と言えるでしょう。
実際、北東アジア人の顔認証を無料で手に入れたいと考える人にとっては朗報です。

650万枚の顔データセットによる訓練

私の顔認証モデルの訓練には、16,000人以上の人物の顔を合計で650万枚以上使用しました。
このうち約47%が日本人(日本人以外のアジア人も若干含む)の顔データです。
この訓練用データセットはいくつかの公開されていたデータセットに加えて、インターネットから収集したものを加えました。
最も苦労したのは、いくつかのデータセットの中には重複した人物が含まれていることです。
この重複した人物を取り除くことに多くの時間を要しました。

顔認証モデルの入手

dlib のモデル 'dlib_face_recognition_resnet_model_v1.dat' および 'shape_predictor_5_face_landmarks.dat'は以下のリンクより取得してください。
※ 拡張子に'.bz2'が付いていますので、ダウンロード後に解凍してください

dlib.net/files

私のモデル 'taguchi_face_recognition_resnet_model_v1.dat' とソースコード 'dnn_face_recognition_ex.cpp' の入手は github を参照してください。
Taguchi models

コンパイル方法

dlib.net にコンパイル方法が記載されていますので参考にしてください。
How to compile dlib

実行

コンパイル後に出来たexeファイルと同じフォルダーに以下を配置してください。

faces フォルダーを配置して、その配下に JapaneseBaldActor.jpg, bald_guys.jpg を配置してください。
'dlib_face_recognition_resnet_model_v1.dat' を配置してください。
'shape_predictor_5_face_landmarks.dat' を配置してください。
'taguchi_face_recognition_resnet_model_v1.dat' を配置してください。

コマンドプロンプトより以下のコマンドを打ってください。

Taguchi model の例

> dnn_face_recognition_ex faces/JapaneseBaldActor.jpg taguchi_face_recognition_resnet_model_v1.dat

dlib model の例

> dnn_face_recognition_ex faces/JapaneseBaldActor.jpg dlib_face_recognition_resnet_model_v1.dat

制限事項

私のモデルには訓練用データが不足しているため、精度が著しく悪い場合があります。

アフリカ系のルーツを持つ人物
18歳未満の人物
マスクを付けている人物

上記の用途には向かないことがあります。
注意してご利用ください。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up