More than 5 years have passed since last update.

RaspberryPi4/3のCPUで爆速DeepLearning (レアデバイスもあるよ)

Last updated at 2020-02-15Posted at 2020-02-15

これは connpass - 2020/02/15(土) 第2回ディープラーニングガジェット品評会の発表資料です。プレゼン用に箇条書きが多いですがご容赦願います。なお、私の全成果物は Github と Qiita にフルコミットしてあります。詳しく知りたい方はAppendixをご覧下さい。この記事はプレゼンが終わったらその場でQiitaへ公開します。
https://deepgatget.connpass.com/event/156270/

1. PINTO

名古屋に住んでいます。しがないビルドジャンキーです。ありきたりじゃないのが死ぬほど好きです。 RaspberryPi はビルド専用端末です。

2. おことわり

撮影OK
SNS共有OK
アドバイスOK
飲み会のお誘い歓迎
プルリク歓迎
ディスりNG

3. 今回トライしたこと

RaspberryPi4 の CPUのみ でDeepLearning (EdgeTPU, NCS2使わない縛り)
公式公開モデルのMobileNetV2-SSDLite データセット変更
- MS-COCO (90class) -> Pascal-VOC (20class)
Tensorflow v2.1.0 の v1-API 対応 (セルフビルドしてインストールするだけ)
モデルの ８bit量子化 (Integer Quantization)
Tensorflow LiteのPythonAPIを改造
RaspberryPi4 を Raspbian Buster 64bit 化
RaspberryPi4 の オーバークロック (1.5GHz->1.75GHz)
- 2020/02/07のkernelアップデートで1.8GHz対応されていました
- https://www.raspberrypi.org/blog/a-new-raspbian-update/

4. 公式モデル MobileNetV2-SSDLite (MS-COCO) を Pascal-VOC でリトレーニング

大したことはしません。データセットを変更していちからトレーニングするだけです。クラス数が 90クラスから 20クラスへ減少することにより、 RaspberryPi4 の CPUによる後処理で 5ms ほどパフォーマンスが改善しました。手順は下記のQiita記事をご覧ください。
MobileNetV2-SSDLiteのPascal-VOCデータセットによる学習 [Docker版リメイク]
Dockerを使用してこんな感じの環境でガガガッと学習します。ヒュ〜、無駄にカッコイイ。

5. Tensorflow Lite

軽量・高速推論に最適化されたフレームワーク
モデルの量子化・実行
カスタムオペレータ対応
非対応のオペレータは Flex Delegate によりTensorflow側へオフロード可能
モデルのプルーニング
EdgeTPU
ARM64 に最適化

5-1. 普通のTensorflow Liteのワークフロー

Tensorflow v2.x.x 以降、標準では従来資産の .pb (GraphDef) ファイルを .tflite 形式に変換できません。でも、 Tensorflow v2.1.0 の最新のオペレーターに対応させたうえで .pb の変換も行いたいですよね。

5-2. 改造後のTensorflow Liteのワークフロー

ですので、「そうだ、京都へ行こう。」的なノリで、 Tensorflow v2.1.0 のワークフローを変更しました。 --config=v1 オプションを指定して Tensorflow をフルビルドするだけです。いつもめちゃくちゃな発想ばかりしていますので、そのうちGoogleの中の人に怒られそうです。
v1-API版 Ubuntu 18.04 x86_64用 Tensorflow v2.1.0インストーラ tensorflow-2.1.0-cp36-cp36m-linux_x86_64.whl

5-3. Tensorflow v2.1.0 for v1-API のWheelインストーラフルビルド

ビルドの下準備なども含めた全手順は コチラのREADME をご覧ください。
ワークフローを変更するためのビルドコマンドだけを抜き出したものは下記です。

v1-APIを有効にしたTensorflow_v2.1.0のフルビルドコマンド

$ sudo bazel build \
--config=opt \
--config=noaws \
--config=nohdfs \
--config=nonccl \
--config=v1 \
//tensorflow/tools/pip_package:build_pip_package

5-4. モデルの量子化

公式チュートリアルによると、 "モデルサイズが4倍に削減され、CPUパフォーマンスが3〜4倍に向上します。さらに、完全に量子化されたモデルは、整数演算特化のハードウェアアクセラレータで使用できます。" だそうです。 整数演算特化のハードウェアアクセラレータ は EdgeTPU のことですね。爆速なのは知っています。バナナもプルプルしちゃうほどです。
Tensorflow for Mobile/IoT - Post-training integer quantization チュートリアル

5-5. 量子化モデルの精度とレイテンシ

8bit量子化を行うことで、若干の精度劣化を伴う代わりにレイテンシは　２分の１　〜　４分の１　になります。

5-6. MobileNetV2-SSDLite の.pbファイルへのエクスポート (Post-Process有り)

Post-Process がモデルの末尾にくっついた .tflite ファイルを生成します。 Post-Process はカスタムオペレータです。

Export_model

$ python3 object_detection/export_tflite_ssd_graph.py \
    --pipeline_config_path=pipeline.config \
    --trained_checkpoint_prefix=model.ckpt-44548 \
    --output_directory=export \
    --add_postprocessing_op=True

5-7. MobileNetV2-SSDLite の .pb の8bit量子化 (Integer Quantization)

Integer_Quantization_from_.pb_file

import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np

def representative_dataset_gen():
    for data in raw_test_data.take(100):
        image = tf.image.resize(data['image'].numpy(), (300, 300))
        image = image[np.newaxis,:,:,:]
        yield [(image - 127.5) * 0.007843]

tf.compat.v1.enable_eager_execution()
raw_test_data, info = tfds.load(name="voc/2007", with_info=True, split="validation", 
                                data_dir="~/TFDS", download=True)
graph_def_file="tflite_graph_with_postprocess.pb"
input_arrays=["normalized_input_image_tensor"]
output_arrays=['TFLite_Detection_PostProcess','TFLite_Detection_PostProcess:1',
               'TFLite_Detection_PostProcess:2','TFLite_Detection_PostProcess:3']
input_tensor={"normalized_input_image_tensor":[1,300,300,3]}

###################################    ↓ ココ TF v2 では本来使用できないAPI
converter = tf.lite.TFLiteConverter.from_frozen_graph(graph_def_file, input_arrays,
                                                      output_arrays,input_tensor)
converter.allow_custom_ops=True
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset_gen
tflite_quant_model = converter.convert()
with open('./ssdlite_mobilenet_v2_voc_300_integer_quant_with_postprocess.tflite', 'wb') as w:
    w.write(tflite_quant_model)

6. Tensorflow Liteランタイムの改造

Tensorflow Lite の PythonAPI を改造してマルチスレッド推論に対応させます。

tensorflow/lite/python/interpreter.py

  def set_num_threads(self, i):
    return self._interpreter.SetNumThreads(i)

tensorflow/lite/python/interpreter_wrapper/interpreter_wrapper.cc

PyObject* InterpreterWrapper::SetNumThreads(int i) {
  interpreter_->SetNumThreads(i);
  Py_RETURN_NONE;
}

tensorflow/lite/python/interpreter_wrapper/interpreter_wrapper.h

  PyObject* SetNumThreads(int i);

いじょ。簡単ですね。

7. Raspbian Buster 64bit

Tensorflow Liteの公式チュートリアルやベンチマークはAndroidスマホのものばかりです。何故か？ 64bit ARMへの最適化があるからだと思います。てことで、Raspbianを64bit化します。後の動画で結果は明らかです。 sakaki-さんのリポジトリのREADMEどおりに導入するだけです。

8. CPU推論 - Object Detection

レッツ推論！！

pic.twitter.com/GnmVp2paLe
— Super PINTO (@PINTO03091) February 8, 2020

https://t.co/6ksV5OaryV
— Super PINTO (@PINTO03091) February 8, 2020

9. 深度推定+DeepLearningデバイス〈DepthAI〉

VPU MyriadX と RaspberryPi3 とステレオカメラを一体にしたとんでもデバイスを Luxonis の Brandonさん から試験用にご提供頂きました。 NCS2 と RaspberryPi3 がハードウェアレベルで一体になっているのと同じです。本日は展示デモを行います。量産化前だそうなので世界に10台しかないうちの1台です。
DepthAI　https://luxonis.com/depthai
DepthAI Documentation　https://docs.luxonis.com/

Brandonさんから頂いたDepthAIの性能です。 RaspberryPi3+MyriadX (NCS2同等) の本来のパフォーマンスです。深度も取れます。 USB2.0を経由せず直接バイパスしています。モデルにもよると思いますがサンプルPGは 25FPS です。#RaspberryPi #DepthAI #MyriadX https://t.co/QNPIWFxwbd
— Super PINTO (@PINTO03091) February 12, 2020

Cool !!!!

Calibrating the camera is extremely difficult and struggling. However, RaspberryPi3 + ObjectDetection + Depth estimation seems to be working properly.😄#DepthAI #RaspberryPi #MyriadX https://t.co/lwNcxu40mH
— Super PINTO (@PINTO03091) February 12, 2020

10. RaspberryPi4 + NCS2 のPINTO実装

去年作成した NCS2 x１本 + RaspberryPi4 の動画がコレですからね。やっぱりエッジアクセラレータは特化していて強いです。https://t.co/tEYBwyhTlu
— Super PINTO (@PINTO03091) February 8, 2020

11. 最後に

この界隈は技術の移り変わりがとても激しいです。ですので、新しいものが出てきたら次へ次へと思考シフトしていける柔軟なフットワークが必要だと考えています。そんな中、何故あえて数年前に提案されたモデル等を高速化するのか。最新の技術と準最新の技術を組み合わせると意外とかつては解決が難しかった課題が解決することもあると思っているからです。課題はコストだったのかもしれないし、パフォーマンスだったのかもしれない。でも、シンプルで長年洗練された技術は新しい技術で少し味付けするだけで、機材調達コストが大幅に下がったうえにパフォーマンスが著しく上がって極端に優位性が高くなったり、当時はとても実業務には適応できなかったようなものでもとたんに息を吹き替えしたりする。そういうものにロマンを感じます。そんな時代にあっても、私は一生、生み出す側ではなく利用する側でいるに違いないわけですけれども。。。

12. Appendix

RaspberryPi4/3用 Tensorflow フルビルドインストーラ
RaspberryPi4/3 armv7l,aarch64用 PythonAPI改造済みの Tensorflow Lite Wheelインストーラ

Tensorflow LiteのFlex Delegate対応済みバイナリ
<お手製> 量子化モデルZOO

RaspberryPi用 Bazel

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up