1. Introduction
In this article, I'd like to share the quantization workflow I've been working on for the past six months. It is a collection of know-how for converting Tensorflow checkpoints (.ckpt/.meta), Freeze_Graph (.pb), saved_model (.pb), Keras models (.h5), Tensorflow.js models, and PyTorch checkpoints (.pth) into quantized models for Tensorflow Lite. Since Tensorflow was upgraded from v1.x to v2.x, special steps are needed to absorb the differences between the versions, and I often feel there is a lack of material to get started with the conversion. Tensorflow, Tensorflow Lite, Keras, ONNX, PyTorch, and OpenVINO (OpenCV) are all used in combination.
I work on quantization of neural networks every day. Starting from lightweight models, I am mass-producing quantized models with the goal of fast inference without a GPU on edge devices such as the RaspberryPi4. As an example, I applied 8-bit integer quantization
to two models, and the result of multi-stage inference between them using only the RaspberryPi4 CPU is shown in the following video. Two quantized models, Object Detection (MobileNetV2-SSDLite dm=0.5) and Head Pose Estimation, are run in series.
If you find the video too small to watch, **[PINTO_model_zoo](https://github.com/PINTO0309/PINTO_model_zoo#sample3---head-pose-estimation-multi-stage-inference-with-multi-model)** has an enlarged sample GIF, which is best viewed over Wi-Fi or a wired connection. "Head Pose Estimation with RaspberryPi4 CPU only + Tensorflow Lite + 4 Threads went quite well. A smooth 13 FPS despite the two-stage inference. Please forgive me for testing it on the unsightly face of a middle-aged man due to my lack of imagination. Ah, it's been a while since I felt this sense of accomplishment 🤪" https://t.co/hIwxA8eAZC
— Super PINTO (@PINTO03091) April 27, 2020
2. Table of contents
1. Introduction
2. Table of contents
3. Environment
4. Procedure
4-1. Check the model's INPUT and OUTPUT names and types, and change the batch size and type
4-1-1. In the case of a Tensorflow checkpoint
4-1-2. In the case of Tensorflow Freeze_Graph
4-1-3. In the case of Tensorflow saved_model
4-1-4. In the case of Tensorflow/Keras .h5/.json
4-2. Various quantization procedures
4-2-1. Quantization from a Tensorflow checkpoint (.ckpt)
4-2-1-1. Generating .meta from .index and .data-00000-of-00001
4-2-1-2. Generate Freeze_Graph from checkpoint (.meta)
4-2-1-3. Generate a saved_model from Freeze_Graph
4-2-1-4. Weight Quantization from saved_model (weight-only quantization)
4-2-1-5. Integer Quantization from saved_model (8-bit integer quantization)
4-2-1-6. saved_model to Full Integer Quantization (all 8-bit integer quantization)
4-2-1-7. Float16 Quantization from saved_model (Float16 quantization)
4-2-1-8. Full Integer Quantization to EdgeTPU convert
4-2-2. Quantization from a Tensorflow checkpoint (.meta)
4-2-3. Quantization from Tensorflow Freeze_Graph (.pb)
4-2-4. Quantization from Tensorflow saved_model (.pb)
4-2-5. Quantization from Tensorflow/Keras (.h5/.json)
4-2-5-1. Weight Quantization from .h5/.json (weight quantization)
4-2-5-2. Generating the calibration data set
4-2-5-3. Integer Quantization from .h5/.json (8-bit integer quantization)
4-2-5-4. Full Integer Quantization from .h5/.json (all 8-bit integer quantization)
4-2-5-5. Float16 Quantization from .h5/.json (Float16 quantization)
4-2-5-6. Full Integer Quantization to EdgeTPU convert
4-2-6. Quantization from a model for Tensorflow.js
4-2-6-1. Advance preparation
4-2-6-2. Generating a saved_model from Tensorflow.js
4-2-6-3. Import saved_model generated by Tensorflow v2.x into Tensorflow v1.x and process the input shape
4-2-6-4. Installation of Tensorflow v2.2.0
4-2-6-5. Weight Quantization from saved_model (Weight-only quantization)
4-2-6-6. Integer Quantization from saved_model (8-bit integer quantization)
4-2-6-7. Full Integer Quantization from saved_model (All 8-bit integer quantization)
4-2-6-8. Float16 Quantization from saved_model (Float16 quantization)
4-2-6-9. Full Integer Quantization to EdgeTPU convert
4-2-7. Quantize the model generated by the TensorFlow Object Detection API
4-2-7-1. Generating a .pb file with Post-Process
4-2-7-2. Weight Quantization from Freeze_Graph (Weight-only quantization)
4-2-7-3. Integer Quantization from Freeze_Graph (8-bit integer quantization)
4-2-7-4. Full Integer Quantization from Freeze_Graph (All 8-bit integer quantization)
4-2-7-5. Float16 Quantization from Freeze_Graph (Float16 quantization)
4-2-7-6. Full Integer Quantization to EdgeTPU convert
4-2-8. Quantize models containing operations that are not supported by Tensorflow Lite but are supported by Tensorflow
4-2-8-1. Generate Mask-RCNN Inception V2 .pb file
4-2-8-2. Weight Quantization of Mask-RCNN Inception V2 (Weight-only quantization)
4-2-8-3. Float16 Quantization in Mask-RCNN Inception V2 (Float16 quantization)
4-2-8-4. Running a model with Flex Delegate (Tensorflow Select Ops) enabled
4-2-9. Quantization from a model for PyTorch
4-2-9-1. Advance preparation (PyTorch->ONNX)
4-2-9-2. ONNX->Keras conversion by onnx2keras
4-2-9-3. Weight Quantization from saved_model (Weight-only quantization)
4-2-9-4. Integer Quantization from saved_model (8-bit integer quantization)
4-2-9-5. Full Integer Quantization from saved_model (All 8-bit integer quantization)
4-2-9-6. Float16 Quantization from saved_model (Float16 quantization)
4-2-9-7. Full Integer Quantization to EdgeTPU convert
4-2-10. Quantization of MediaPipe's model BlazeFace(.tflite)
4-2-10-1. Build flatc and download schema.fbs
4-2-10-2. Download MediaPipe's BlazeFace model (.tflite)
4-2-10-3. Converting BlazeFace(.tflite) to saved_model(.pb)
4-2-10-4. Weight Quantization from saved_model (weight-only quantization)
4-2-10-5. Integer Quantization from saved_model (8-bit integer quantization)
4-2-10-6. Full Integer Quantization from saved_model (All 8-bit integer quantization)
4-2-10-7. Float16 Quantization from saved_model (Float16 quantization)
4-2-10-8. Full Integer Quantization to EdgeTPU convert
4-3. Performance benchmarks for the quantization model (.tflite)
4-3-1. Building the TFLite Model Benchmark Tool
4-3-2. Options for the TFLite Model Benchmark Tool
4-3-3. Benchmark example of a model that includes only standard Tensorflow Lite operations (No XNNPACK, 4 Threads)
4-3-4. Benchmark example of a model that includes only standard Tensorflow Lite operations (XNNPACK available, 4 Threads)
4-3-5. Benchmark examples of models with non-standard Tensorflow Lite operations (Flex enabled, no XNNPACK, 4 Threads)
4-3-6. Benchmark examples of models with non-standard Tensorflow Lite operations (Flex enabled, with XNNPACK, 4 Threads)
4-3-7. Execution log sample of Benchmark_Tool
3. Environment
- Tensorflow-GPU v1.15.2
- Tensorflow v2.1.0, v2.2.0 or tf-nightly
- Accelerated and Tuned Python API Tensorflow Lite
- PyTorch
- Caffe
- OpenVINO 2020.2
- OpenCV 4.2
- onnx2keras
- Netron
- RaspberryPi4 + Ubuntu aarch64
4. Procedure
4-1. Check the model's INPUT and OUTPUT names and types, and change the batch size and type
This is the hardest and most time-consuming first step. The amount of effort depends on the pattern of the conversion.
4-1-1. In the case of a Tensorflow checkpoint
This pattern applies when neither a Freeze_Graph nor a saved_model is provided. The quickest way to deal with it is to read or run the sample code for the inference test. Using Netron or Tensorflow's official visualization tools (Tensorboard or summarize_graph), it is not impossible to see the structure of the model, but visualization often fails because many operations required only for training remain in the file, or, even when visualization succeeds, the graph is so huge that it is difficult to find the INPUT and OUTPUT.
Now, as an example, let's check the INPUT/OUTPUT of White-box-Cartoonization, a model that converts live-action images into cartoon-style images. In order to cover the various elements comprehensively, I have deliberately selected a rather difficult model with many pitfalls this time. Note that you need to install Tensorflow v1.15.2 (the v1.x line) to work on this model. If you have already installed Tensorflow v2.x, you need to uninstall it temporarily and install the v1.x line again. Alternatively, you can work in a Docker environment with Tensorflow already installed without polluting your environment.
Just in case, let's check the checkpoint that is offered. Yes, for some reason only the .meta file has not been committed. This is the point where most people lose motivation drastically. But if you have started reading this article, you are not an ordinary person. Since this is not a particular problem for proceeding with the work, we will carry on as-is.
First of all, read cartoonize.py, the logic for the inference test. Then look at the contents of the cartoonize() method, which is called right after main() at the beginning of the program. With simple, beginner-friendly test code like this, it only takes a minute to get to the INPUT definition. Tensorflow's inputs are defined with placeholder. However, on closer inspection, the "name" attribute is not defined. Leaving it unnamed makes the subsequent conversion and verification work difficult, so give it a name of your own. Also, the tensor shape contains None. For quantization it is essential that the input resolution is fixed, so the input resolution is replaced with a fixed value. This time, I chose 720x720.
def cartoonize(load_folder, save_folder, model_path):
input_photo = tf.placeholder(tf.float32, [1, None, None, 3]) #<--- This is INPUT
network_out = network.unet_generator(input_photo)
final_out = guided_filter.guided_filter(input_photo, network_out, r=1, eps=5e-3)
all_vars = tf.trainable_variables()
gene_vars = [var for var in all_vars if 'generator' in var.name]
saver = tf.train.Saver(var_list=gene_vars)
def cartoonize(load_folder, save_folder, model_path):
input_photo = tf.placeholder(tf.float32, [1, 720, 720, 3], name='input') #<--- This is INPUT
network_out = network.unet_generator(input_photo)
final_out = guided_filter.guided_filter(input_photo, network_out, r=1, eps=5e-3)
all_vars = tf.trainable_variables()
gene_vars = [var for var in all_vars if 'generator' in var.name]
saver = tf.train.Saver(var_list=gene_vars)
The following are notes about placeholder in quantization.
1. N (batch size), H (height), W (width) and C (RGB channels) must all be fixed to integer values.
2. If the model is defined as NCHW (channels-first), it should be redefined or converted to NHWC (channels-last), as in the sketch after this list.
3. The type of the placeholder should be tf.float32 (most models define it as tf.uint8).
4. The tf.cast operation does not support quantization and should be removed if possible.
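For note 2, a minimal Tensorflow v1.x-style sketch of the NHWC/NCHW handling looks like the following (this is my own illustration, not code from White-box-Cartoonization; the transpose is only needed if the rest of the network was written channels-first).
### tensorflow-gpu==1.15.2 (illustration only)
import tensorflow as tf

# Expose an NHWC (channels-last) placeholder to the converter...
input_nhwc = tf.placeholder(tf.float32, [1, 720, 720, 3], name='input')
# ...and transpose to NCHW only if the downstream network expects channels-first input.
input_nchw = tf.transpose(input_nhwc, perm=[0, 3, 1, 2])  # shape becomes [1, 3, 720, 720]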
By the way, the reason for 720x720 this time is the image preprocessing logic shown below, which resizes the image so that its height and width do not go below 720 pixels. Depending on the model there may be a limit to the size that can be specified, so you have to read through the inference test logic or the training logic. White-box-Cartoonization raised an error and could not run inference when the resolution was smaller than 720x720. If you don't want to read the logic, experiment repeatedly to see how small the resolution can be made.
def resize_crop(image):
h, w, c = np.shape(image)
if min(h, w) > 720:
if h > w:
h, w = int(720*h/w), 720
else:
h, w = 720, int(720*w/h)
image = cv2.resize(image, (w, h),
interpolation=cv2.INTER_AREA)
h, w = (h//8)*8, (w//8)*8
image = image[:h, :w, :]
return image
Since many published models deal with images, the placeholder is often defined as a uint8 type that takes RGB values of 0-255, as shown below. In most cases, the step right after the placeholder converts the type to float32 with a tf.cast operation. As mentioned above, tf.cast causes an error when quantizing, so you should remove it at this point. When you actually run inference, the image data read by OpenCV or Pillow will be of type uint8, so your application code needs to cast it to Float32 just before handing the image to Tensorflow.
# Before: uint8 placeholder immediately followed by a cast
input_photo = tf.placeholder(tf.uint8, [1, 720, 720, 3], name='input')
casted_photo = tf.cast(input_photo, tf.float32)
# After: float32 placeholder with the tf.cast removed
input_photo = tf.placeholder(tf.float32, [1, 720, 720, 3], name='input')
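For reference, here is a minimal sketch of the corresponding cast done on the application side with OpenCV (my own example, not code from the original repository; 'test.jpg' is a hypothetical file name, and the -1 to 1 normalization matches the calibration logic used later in this article).
import cv2
import numpy as np

image = cv2.imread('test.jpg')                   # uint8, BGR, hypothetical file name
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)   # the model expects RGB
image = cv2.resize(image, (720, 720))            # match the fixed input resolution
image = image.astype(np.float32) / 127.5 - 1.0   # uint8 -> float32, normalized to [-1, 1]
image = image[np.newaxis, :, :, :]               # [1, 720, 720, 3]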
Now let's look up the name of the OUTPUT. The inference test logic of White-box-Cartoonization is so simple that the final OUTPUT is defined just below the INPUT definition line, in a variable conveniently named final_out. There are two ways to look up the name of an OUTPUT without using Netron, summarize_graph or Tensorboard.
1. Dig through the program and visually identify the end of the model structure
2. Simply do a test run and find the name of the final operation from a debug print
Since I love to cut corners, I took method 2.
def cartoonize(load_folder, save_folder, model_path):
input_photo = tf.placeholder(tf.float32, [1, 720, 720, 3], name='input')
network_out = network.unet_generator(input_photo)
final_out = guided_filter.guided_filter(input_photo, network_out, r=1, eps=5e-3) #<--- Here's an OUTPUT
print("input_photo.name =", input_photo.name) #<--- Added one line for debug printing of INPUT name.
print("input_photo.shape =", input_photo.shape) #<--- Added one more line for debug printing of INPUT shapes.
print("final_out.name =", final_out.name) #<--- Added one line for the debug print of OUTPUT name.
print("final_out.shape =", final_out.shape) #<--- Added one line for debug printing of OUTPUT shape.
all_vars = tf.trainable_variables()
gene_vars = [var for var in all_vars if 'generator' in var.name]
saver = tf.train.Saver(var_list=gene_vars)
When I ran the inference test program, the names and shapes of the INPUT and OUTPUT were printed by the debug output I added earlier. Apparently, the name of the OUTPUT is add_1:0. The name and shape of the placeholder I modified earlier are also correctly reflected.
This concludes the procedure for checking INPUT/OUTPUT names in a checkpoint model. How to generate .meta, Freeze_Graph and saved_model will be explained in the steps that follow.
4-1-2. In the case of Tensorflow Freeze_Graph
This is the pattern to use when the model is provided in Freeze_Graph (.pb) format. Identifying the INPUT/OUTPUT names is very easy in this case. As an example, let's check the Semantic Segmentation model Mobile-DeeplabV3-plus (MobileNetV2).
First, download the Freeze_Graph (.pb) file from the above repository. The only thing to note here is that models with special treatment such as ASPP will fail to quantize, so you should pick a model with as simple a structure as possible.
When using the command line to download materials from Google Drive, it is necessary to bypass the confirmation dialog, so you can download by executing the 3-line command as shown below.
$ curl -sc /tmp/cookie "https://drive.google.com/uc?export=download&id=1VF5yMz_tIkTOVfgmIgg7tPAJJEEcZ49B" > /dev/null
$ CODE="$(awk '/_warning_/ {print $NF}' /tmp/cookie)"
$ curl -Lb /tmp/cookie "https://drive.google.com/uc?export=download&confirm=${CODE}&id=1VF5yMz_tIkTOVfgmIgg7tPAJJEEcZ49B" -o deeplab_v3_plus_mnv2_decoder_256.pb
Check the structure of the downloaded model. There is a super useful site called Netron; access it first.
https://lutzroeder.github.io/netron/
"Open Model..." button and open the file deeplab_v3_plus_mnv2_decoder_256.pb
which was downloaded earlier. The names of INPUT/OUTPUT are immediately known. INPUT is Input
, shape and type is Float32 [?, 256, 256, 3]
, OUTPUT is ArgMax
, shape and type is Float32 [?, 256, 256, 3]
, OUTPUT is ArgMax
, and the shape and type is Float32 [?, 256, 256]
. At first glance, you may think that ExpandDims
is appropriate for the final OUTPUT, but in fact, it's not a problem if you select ArgMax
, which is almost the same as Semantic Segmentation's model. Also, the ?
part of the shape is synonymous with None
in variable batches. However, this is not a problem for the quantization operation without immobilization. It is automatically converted to 1
when performing the quantization of the subsequent work.
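As a supplement, if you prefer the command line to a GUI, a small script like the following (my own sketch for Tensorflow v1.15.2, not part of the original procedure) lists the Placeholder nodes and the nodes that nothing else consumes, which are the usual candidates for the INPUT and OUTPUT of a Freeze_Graph.
### tensorflow-gpu==1.15.2 (supplementary sketch)
import tensorflow as tf

graph_def = tf.compat.v1.GraphDef()
with tf.io.gfile.GFile('./deeplab_v3_plus_mnv2_decoder_256.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

print('--- Placeholder nodes (candidate INPUTs) ---')
for node in graph_def.node:
    if node.op == 'Placeholder':
        print(node.name)

# Nodes whose outputs are never consumed by another node are candidate OUTPUTs.
consumed = {inp.split(':')[0].lstrip('^') for node in graph_def.node for inp in node.input}
print('--- Unconsumed nodes (candidate OUTPUTs) ---')
for node in graph_def.node:
    if node.name not in consumed and node.op != 'Const':
        print(node.name, node.op)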
Now, the work up to this point has been too easy, hasn't it? Let's raise the difficulty a bit. Next, let's try the following Python implementation of the Tensorflow.js Posenet v1 model.
First, clone the repository to get the Freeze_Graph and issue the following command. The argument of --image_dir can be any folder path that contains image files of people. I created a folder called images and put 24 images of people in it.
This converts a Tensorflow.js model to a Tensorflow model. The following steps require Tensorflow v1.15.2.
$ sudo pip3 uninstall tensorboard-plugin-wit tb-nightly \
tf-estimator-nightly tensorflow-gpu \
tensorflow tf-nightly tensorflow_estimator
$ sudo pip3 install tensorflow-gpu==1.15.2
$ git clone https://github.com/rwightman/posenet-python.git
$ cd posenet-python
$ python3 image_demo.py \
--model 101 \
--image_dir ./images \
--output_dir ./output
When the process finishes, three kinds of files (a checkpoint, a .pb and a .pbtxt) will be created under the _models folder.
It's based on MobileNetV1, so the accuracy is not very good. As an aside, the repository linked at the top of this article already contains quantized models of the high-accuracy but slow Posenet v2, converted from the ResNet50 backbone.
The figure below shows the result of running the ResNet50-based Posenet v2 inference on the same image. It's hard to tell, but the accuracy seems to have improved slightly.
Now, let's visualize the .pb file with Netron as before. The Freeze_Graph (.pb) INPUT has name=image, type Float32 and shape [1, ?, ?, 3]. As it is, the quantization operation will fail. I will now explain how to convert the INPUT shape of a Freeze_Graph model whose H (height) and W (width) are 'None'. Note that this procedure can only be performed with Tensorflow v1.x and does not work equally well for all models.
This section describes a program to convert the shape of the INPUT of a Freeze_Graph. The general flow is as follows.
1. Define a placeholder with the shape to be set after conversion.
2. Load the Freeze_Graph that you already have at hand.
3. Import it, mapping its input to the placeholder defined in 1.
4. (Check the debug print to make sure the placeholder is captured correctly.)
5. Remove all unnecessary nodes with TransformGraph, a Tensorflow v1.x tool.
6. Write the processed Freeze_Graph out to a .pb file.
### tensorflow-gpu==1.15.2
import tensorflow as tf
from tensorflow.tools.graph_transforms import TransformGraph
with tf.compat.v1.Session() as sess:
# shape=[1, ?, ?, 3] -> shape=[1, 513, 513, 3]
# name='image' specifies the placeholder name of the converted model
inputs = tf.compat.v1.placeholder(tf.float32, shape=[1, 513, 513, 3], name='image')
with tf.io.gfile.GFile('./model-mobilenet_v1_101.pb', 'rb') as f:
graph_def = tf.compat.v1.GraphDef()
graph_def.ParseFromString(f.read())
# 'image:0' specifies the placeholder name of the model before conversion
tf.graph_util.import_graph_def(graph_def, input_map={'image:0': inputs}, name='')
print([n for n in tf.compat.v1.get_default_graph().as_graph_def().node if n.name == 'image'])
# Delete Placeholder "image" before conversion
# see: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/tools/graph_transforms
# TransformGraph(
# graph_def(),
# input_op_name,
# output_op_names,
# conversion options
# )
optimized_graph_def = TransformGraph(
tf.compat.v1.get_default_graph().as_graph_def(),
'image',
['heatmap','offset_2','displacement_fwd_2','displacement_bwd_2'],
['strip_unused_nodes(type=float, shape="1,513,513,3")'])
tf.io.write_graph(optimized_graph_def, './', 'model-mobilenet_v1_101_513.pb', as_text=False)
See the Graph Transform Tool documentation for the specification and usage of TransformGraph. Execute the INPUT shape conversion program you just created. The INPUT shape of the Freeze_Graph model-mobilenet_v1_101_513.pb generated by the program will be [1, 513, 513, 3]. If you want to convert it to a somewhat smaller shape, just change it to something like shape=[1, 257, 257, 3].
$ python3 replacement_of_input_placeholder_float32_mobilenet.py
Check the shape of the generated model-mobilenet_v1_101_513.pb
with Netron.
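If you prefer a quick programmatic check instead of Netron, a small sketch like the following (my own addition, Tensorflow v1.15.2) prints the placeholder shape of the converted file.
### tensorflow-gpu==1.15.2 (optional check)
import tensorflow as tf

graph_def = tf.compat.v1.GraphDef()
with tf.io.gfile.GFile('./model-mobilenet_v1_101_513.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

for node in graph_def.node:
    if node.op == 'Placeholder':
        dims = [d.size for d in node.attr['shape'].shape.dim]
        print(node.name, dims)  # expected: image [1, 513, 513, 3]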
This is the end of the procedure to find INPUT/OUTPUT names in a Freeze_Graph model and to convert a Freeze_Graph shape.
4-1-3. In the case of Tensorflow saved_model
If the model is provided in the saved_model (.pb) format, this is the pattern. Again, it is very easy to identify the name of the INPUT/OUTPUT. So far, however, I don't see many examples of pre-trained models being offered in this format. This time, let's check the names of INPUT and OUTPUT based on the following Head Pose Estimation.
First, please clone the repository.
$ git clone https://github.com/yinguobing/head-pose-estimation.git
$ cd head-pose-estimation/assets
If you display a saved_model with a very large number of operators with the Web version of Netron, a warning is displayed as shown below, and it may take more than 5 minutes to render.
In the case of the desktop (installed) version of Netron, it is displayed as shown below.
So, if you want to check the INPUT/OUTPUT of a saved_model easily, you can use the standard saved_model_cli command. Since there is a folder called pose_model under the assets folder with saved_model.pb placed inside it, run the analysis command against the pose_model folder directly under the assets folder.
$ saved_model_cli show --dir pose_model --all
MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:
signature_def['predict']:
The given SavedModel SignatureDef contains the following input(s):
inputs['image'] tensor_info:
dtype: DT_UINT8
shape: (-1, -1, -1, 3)
name: image_tensor:0
The given SavedModel SignatureDef contains the following output(s):
outputs['output'] tensor_info:
dtype: DT_FLOAT
shape: (-1, 136)
name: layer6/final_dense:0
Method name is: tensorflow/serving/predict
signature_def['serving_default']:
The given SavedModel SignatureDef contains the following input(s):
inputs['image'] tensor_info:
dtype: DT_UINT8
shape: (-1, -1, -1, 3)
name: image_tensor:0
The given SavedModel SignatureDef contains the following output(s):
outputs['output'] tensor_info:
dtype: DT_FLOAT
shape: (-1, 136)
name: layer6/final_dense:0
Method name is: tensorflow/serving/predict
Looking at the definition of signature_def['serving_default'], the INPUT is defined as image Uint8[-1, -1, -1, 3] and the OUTPUT as output Float32[-1, 136]. As in the previous example, N (batch size), H (height), and W (width) of the INPUT are set to -1, so the signature needs to be rewritten (-1 is synonymous with ? and None). Rather than my half-hearted explanation, the article Summary of SavedModel - Qiita - t_shimmura is very helpful, so please refer to it. In addition, as far as Head Pose Estimation itself is concerned, the training script and the script for exporting to saved_model are available in the following repository.
For saved_model, we have only described how to use the saved_model_cli
command to check the structure. I'll touch on this later in the process along with the checkpoint -> saved_model or Freeze_Graph -> saved_model conversion scripts.
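As a supplement, under Tensorflow v2.x you can also check the serving signature from Python. The following is a minimal sketch of my own that assumes the pose_model directory used above.
### tensorflow==2.x (supplementary sketch)
import tensorflow as tf

loaded = tf.saved_model.load('pose_model')
sig = loaded.signatures['serving_default']
print(sig.structured_input_signature)  # input names, dtypes and shapes
print(sig.structured_outputs)          # output names, dtypes and shapes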
4-1-4. In the case of Tensorflow/Keras .h5/.json
If the model is provided in Keras (.h5/.json) format, this is the pattern.
For example, to check the INPUT and OUTPUT names for the above repository, you will need the .json file saved after training, as shown below.
model = Model(inputs=xxxx,outputs=yyyy)
# model save
model_json = model.to_json()
open(model_path + 'model.json', 'w').write(model_json)
model.save_weights(model_path + 'weights.h5')
If you open the above model.json with Netron, you can visualize it as shown below.
However, Keras syntax is very simple, so it is often faster to look directly at the model definition in the program.
If you want to change the shape of the input tensor, the Q&A Keras - Transfer Learning - Change Input Tensor Shape is helpful. The easiest way seems to be to redefine an empty model that differs only in INPUT size and transfer just the weights to it.
inputs = Input((None, None, 3))
.....
model = Model(inputs=[inputs], outputs=[outputs])
model.compile(optimizer='adam', loss='mean_squared_error')
model.load_weights('my_model_name.h5')
inputs2 = Input((512, 512, 3))
....
model2 = Model(inputs=[inputs2], outputs=[outputs])
model2.compile(optimizer='adam', loss='mean_squared_error')
model2.set_weights(model.get_weights())
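To make the idea concrete, here is a self-contained sketch with a tiny hypothetical two-layer model (the layers, names and sizes are my own and are not taken from the referenced Q&A). The point is that the same architecture rebuilt with a fixed input size accepts the weights of the model built with an undefined input size.
import tensorflow as tf
from tensorflow.keras.layers import Input, Conv2D
from tensorflow.keras.models import Model

def build(input_shape):
    # Same architecture every time; only the Input shape differs.
    inputs = Input(input_shape)
    x = Conv2D(8, 3, padding='same', activation='relu')(inputs)
    outputs = Conv2D(3, 3, padding='same')(x)
    return Model(inputs=[inputs], outputs=[outputs])

model = build((None, None, 3))            # original model with undefined H/W
# model.load_weights('my_model_name.h5')  # load the trained weights here

model2 = build((512, 512, 3))             # same architecture with a fixed input size
model2.set_weights(model.get_weights())   # convolution weights do not depend on H/W, so this works
model2.save('my_model_512.h5')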
4-2. Various quantization procedures
In this section I will quantize various models following various patterns. To make the procedure applicable to as many situations as possible, please forgive some duplication of content between the sections, and the fact that some steps are not strictly necessary if you only want the shortest path to quantization.
4-2-1. Quantization from a Tensorflow checkpoint (.ckpt)
As an example, let's quantize White-box-Cartoonization, a model that converts live-action images into cartoon-style images. As described in 4-1-1. In the case of a Tensorflow checkpoint, the trained checkpoint of this model is published in a peculiar state where only .meta is missing. A checkpoint normally consists of three files, .index, .data-00000-of-00001 and .meta, but this one is missing one of them. I will explain the procedure starting from recreating .meta, but this is only because of the special circumstance that .meta does not exist; it is not necessary otherwise. Please take it as an example only. Of the steps described below, 4-2-1-2. Generate Freeze_Graph from checkpoint (.meta) is performed with Tensorflow v1.15.2 due to model constraints, and 4-2-1-3. Generate a saved_model from Freeze_Graph through 4-2-1-7. Float16 Quantization from saved_model (Float16 quantization) are performed with the latest Tensorflow v2.2.x or tf-nightly to support the latest operators and avoid bugs in Tensorflow itself.
4-2-1-1. Generating .meta from .index and .data-00000-of-00001
This step is not necessary if all three checkpoint files, .index, .data-00000-of-00001 and .meta, are provided. This section describes how to create .meta from .index and .data-00000-of-00001. I reuse the test code of White-box-Cartoonization as much as possible. The modified logic works as follows.
1. Create a folder export for temporarily outputting the checkpoint.
2. Build the model.
3. Debug-print the names and shapes of INPUT/OUTPUT.
4. Restore the checkpoint.
5. Immediately save it to the export folder.
def cartoonize(load_folder, save_folder, model_path):
input_photo = tf.placeholder(tf.float32, [1, None, None, 3])
network_out = network.unet_generator(input_photo)
final_out = guided_filter.guided_filter(input_photo, network_out, r=1, eps=5e-3)
all_vars = tf.trainable_variables()
gene_vars = [var for var in all_vars if 'generator' in var.name]
saver = tf.train.Saver(var_list=gene_vars)
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
sess.run(tf.global_variables_initializer())
saver.restore(sess, tf.train.latest_checkpoint(model_path))
def cartoonize(load_folder, save_folder, model_path):
import sys
import shutil
shutil.rmtree('./export', ignore_errors=True)
input_photo = tf.placeholder(tf.float32, [1, 720, 720, 3], name='input')
network_out = network.unet_generator(input_photo)
final_out = guided_filter.guided_filter(input_photo, network_out, r=1, eps=5e-3)
print("input_photo.name =", input_photo.name)
print("input_photo.shape =", input_photo.shape)
print("final_out.name =", final_out.name)
print("final_out.shape =", final_out.shape)
all_vars = tf.trainable_variables()
gene_vars = [var for var in all_vars if 'generator' in var.name]
saver = tf.train.Saver(var_list=gene_vars)
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
sess.run(tf.global_variables_initializer())
saver.restore(sess, tf.train.latest_checkpoint(model_path))
saver.save(sess, './export/model.ckpt')
sys.exit(0)
Let's run it.
$ python3 cartoonize.py
Three types of checkpoints, .index
.data-00000-of-00001
.meta
, have been successfully generated.
- Go to Table of contents -
4-2-1-2. Generate Freeze_Graph from checkpoint (.meta)
Now let's create a Freeze_Graph from a checkpoint, using the .meta file we just created. Here is a sample of the modified cartoonize method. It is also fine to cut out only the newly added part and run it as a separate .py file.
def cartoonize(load_folder, save_folder, model_path):
import sys
#import shutil
#shutil.rmtree('./export', ignore_errors=True)
#input_photo = tf.placeholder(tf.float32, [1, 720, 720, 3], name='input')
#network_out = network.unet_generator(input_photo)
#final_out = guided_filter.guided_filter(input_photo, network_out, r=1, eps=5e-3)
#print("input_photo.name =", input_photo.name)
#print("input_photo.shape =", input_photo.shape)
#print("final_out.name =", final_out.name)
#print("final_out.shape =", final_out.shape)
#all_vars = tf.trainable_variables()
#gene_vars = [var for var in all_vars if 'generator' in var.name]
#saver = tf.train.Saver(var_list=gene_vars)
#config = tf.ConfigProto()
#config.gpu_options.allow_growth = True
#sess = tf.Session(config=config)
#sess.run(tf.global_variables_initializer())
#saver.restore(sess, tf.train.latest_checkpoint(model_path))
#saver.save(sess, './export/model.ckpt')
#sys.exit(0)
graph = tf.get_default_graph()
sess = tf.Session()
saver = tf.train.import_meta_graph('./export/model.ckpt.meta')
saver.restore(sess, './export/model.ckpt')
tf.train.write_graph(sess.graph_def, './export', 'white_box_cartoonization_freeze_graph.pbtxt', as_text=True)
tf.train.write_graph(sess.graph_def, './export', 'white_box_cartoonization_freeze_graph.pb', as_text=False)
sys.exit(0)
Let's run it.
$ python3 cartoonize.py
The white_box_cartoonization_freeze_graph.pb
has been successfully generated.
Checking the structure with Netron, there doesn't seem to be a problem.
4-2-1-3. Generate a saved_model from Freeze_Graph
Generate a saved_model from the Freeze_Graph. The script below is written to work with both Tensorflow v1.x and Tensorflow v2.x. For input_name= and outputs=, specify the INPUT and OUTPUT names identified in 4-1-1. In the case of a Tensorflow checkpoint. It can probably be used for any model, as long as you know the INPUT and OUTPUT names and have the Freeze_Graph at hand.
import tensorflow as tf
import os
import shutil
from tensorflow.python import ops
def get_graph_def_from_file(graph_filepath):
tf.compat.v1.reset_default_graph()
with ops.Graph().as_default():
with tf.compat.v1.gfile.GFile(graph_filepath, 'rb') as f:
graph_def = tf.compat.v1.GraphDef()
graph_def.ParseFromString(f.read())
return graph_def
def convert_graph_def_to_saved_model(export_dir, graph_filepath, input_name, outputs):
graph_def = get_graph_def_from_file(graph_filepath)
with tf.compat.v1.Session(graph=tf.Graph()) as session:
tf.import_graph_def(graph_def, name='')
tf.compat.v1.saved_model.simple_save(
session,
export_dir,# change input_image to node.name if you know the name
inputs={input_name: session.graph.get_tensor_by_name('{}:0'.format(node.name))
for node in graph_def.node if node.op=='Placeholder'},
outputs={t.rstrip(":0"):session.graph.get_tensor_by_name(t) for t in outputs}
)
print('Graph converted to SavedModel!')
tf.compat.v1.enable_eager_execution()
input_name="input"
outputs = ['add_1:0']
shutil.rmtree('./saved_model', ignore_errors=True)
convert_graph_def_to_saved_model('./saved_model', './white_box_cartoonization_freeze_graph.pb', input_name, outputs)
"""
$ saved_model_cli show --dir saved_model --all
MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:
signature_def['serving_default']:
The given SavedModel SignatureDef contains the following input(s):
inputs['input'] tensor_info:
dtype: DT_FLOAT
shape: (1, 720, 720, 3)
name: input:0
The given SavedModel SignatureDef contains the following output(s):
outputs['add_1'] tensor_info:
dtype: DT_FLOAT
shape: (1, 720, 720, 3)
name: add_1:0
Method name is: tensorflow/serving/predict
"""
Let's run it.
$ python3 freeze_the_saved_model.py
It was successfully generated.
Let's check the structure of saved_model
. It seems to have worked out well.
$ saved_model_cli show --dir saved_model --all
MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:
signature_def['serving_default']:
The given SavedModel SignatureDef contains the following input(s):
inputs['input'] tensor_info:
dtype: DT_FLOAT
shape: (1, 720, 720, 3)
name: input:0
The given SavedModel SignatureDef contains the following output(s):
outputs['add_1'] tensor_info:
dtype: DT_FLOAT
shape: (1, 720, 720, 3)
name: add_1:0
Method name is: tensorflow/serving/predict
4-2-1-4. Weight Quantization from saved_model (weight-only quantization)
Finally, the main topic is quantization. Create a program that does Weight Quantization
from saved_model
and generates a .tflite
that can work with Tensorflow Lite.
import tensorflow as tf
tf.compat.v1.enable_eager_execution()
# Weight Quantization - Input/Output=float32
converter = tf.lite.TFLiteConverter.from_saved_model('./saved_model')
converter.experimental_new_converter = True #<--- Not necessary if you are using Tensorflow v2.2.x or later.
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
tflite_quant_model = converter.convert()
with open('./white_box_cartoonization_weight_quant.tflite', 'wb') as w:
w.write(tflite_quant_model)
print("Weight Quantization complete! - white_box_cartoonization_weight_quant.tflite")
Execute it.
$ python3 weight_quantization.py
It was generated without problems. The file size has been reduced to a quarter of the original Freeze_Graph. It is important to note that even though the file size is reduced to a quarter, this does not mean inference becomes four times faster. You should understand Weight Quantization as essentially just compressing the file size. It depends on the environment in which inference is performed, but if, for example, you want to improve performance when running inference on the RaspberryPi4 CPU, you need to perform 4-2-1-5. Integer Quantization from saved_model (8-bit integer quantization) in the next section.
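As an optional sanity check of my own (not part of the original procedure), you can confirm that the Weight Quantization model loads and runs, and that its input/output are still float32.
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path='./white_box_cartoonization_weight_quant.tflite')
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
print(input_details[0]['shape'], input_details[0]['dtype'])  # [1 720 720 3] float32

dummy = np.random.uniform(-1, 1, size=(1, 720, 720, 3)).astype(np.float32)
interpreter.set_tensor(input_details[0]['index'], dummy)
interpreter.invoke()
print(interpreter.get_tensor(output_details[0]['index']).shape)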
4-2-1-5. Integer Quantization from saved_model (8-bit integer quantization)
Create a program that performs Integer Quantization from the saved_model and generates a .tflite that runs with Tensorflow Lite. For Integer Quantization, you must provide calibration image data for the process of converting Float32 values to UInt8. If possible it is better to use the images used during training, but this time I used a dataset that is easy to prepare. If you call tfds.load(...), the dataset is downloaded automatically, since Google hosts training datasets in Tensorflow Datasets on the cloud. You only need to download it once, so I recommend changing download=True to download=False for the second and subsequent runs. The sample logic below automatically downloads the Pascal-VOC 2007 image dataset, but if you want to use another image dataset, you can find one in the left pane of the Tensorflow Datasets Catalog page. Most image datasets are available, but some must be downloaded manually for copyright reasons or because they contain sensitive images (e.g., datasets of face images). Among the various quantization methods, Integer Quantization shows the best performance when running inference on the RaspberryPi4 CPU alone. If you are interested, please see 3. TFLite Model Benchmark, which gives benchmark results of Integer Quantization models on Ubuntu 19.10 aarch64 on the RaspberryPi4. The benchmark results in Post-training quantization with TF2.0 Keras - nb.o's Diary - Nextremer_nb_o are also very helpful. Very importantly, since the number of operations that support quantization is increasing all the time, I recommend using the latest Tensorflow (Tensorflow v2.2.x / tf-nightly) for the Integer Quantization and Full Integer Quantization operations described below.
The processing flow of representative_dataset_gen() is as follows.
1. Convert the data acquired from Tensorflow Datasets to Numpy
2. Resize the image to the INPUT size of 720x720
3. Normalize the image data to the range -1 to 1
4. To match the INPUT shape [1, 720, 720, 3], add a batch-size dimension to the front of the [720, 720, 3] image data
5. Yield one image at a time
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
def representative_dataset_gen():
for data in raw_test_data.take(100):
image = data['image'].numpy()
image = tf.image.resize(image, (720, 720))
image = image / 127.5 - 1
image = image[np.newaxis,:,:,:]
yield [image]
tf.compat.v1.enable_eager_execution()
raw_test_data, info = tfds.load(name="voc/2007",
with_info=True,
split="validation",
data_dir="~/TFDS",
download=True)
# Integer Quantization - Input/Output=float32
converter = tf.lite.TFLiteConverter.from_saved_model('./saved_model')
converter.experimental_new_converter = True #<--- Not necessary if you are using Tensorflow v2.2.x or later.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS,tf.lite.OpsSet.SELECT_TF_OPS]
converter.representative_dataset = representative_dataset_gen
tflite_quant_model = converter.convert()
with open('./white_box_cartoonization_integer_quant.tflite', 'wb') as w:
w.write(tflite_quant_model)
print("Integer Quantization complete! - white_box_cartoonization_integer_quant.tflite")
Let's run it.
$ python3 integer_quantization.py
4-2-1-6. saved_model to Full Integer Quantization (all 8-bit integer quantization)
The notes and logic structure are almost the same as for Integer Quantization, so they are omitted. Now let's write a program to perform Full Integer Quantization.
※ Unfortunately, as of 05/05/2020, the Div operation in the White-box-Cartoonization model does not support Full Integer Quantization, so the script below will abort. Other models that do not include Div work fine, so I describe the logic on the assumption that it can be reused for other models.
The .tflite file generated by Full Integer Quantization is the file you will need when generating a model for the EdgeTPU. CPU inference performance on the RaspberryPi4 is exactly the same as the Integer Quantization model.
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
def representative_dataset_gen():
for data in raw_test_data.take(100):
image = data['image'].numpy()
image = tf.image.resize(image, (720, 720))
image = image / 127.5 - 1
image = image[np.newaxis,:,:,:]
yield [image]
tf.compat.v1.enable_eager_execution()
raw_test_data, info = tfds.load(name="voc/2007",
with_info=True,
split="validation",
data_dir="~/TFDS",
download=False)
# Full Integer Quantization - Input/Output=uint8
converter = tf.lite.TFLiteConverter.from_saved_model('./saved_model')
converter.experimental_new_converter = True #<--- Not necessary if you are using Tensorflow v2.2.x or later.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
converter.representative_dataset = representative_dataset_gen
tflite_quant_model = converter.convert()
with open('./white_box_cartoonization_full_integer_quant.tflite', 'wb') as w:
w.write(tflite_quant_model)
print("Full Integer Quantization complete! - white_box_cartoonization_full_integer_quant.tflite")
4-2-1-7. Float16 Quantization from saved_model (Float16 quantization)
Generate a Float16 quantized model, which is suited to running on GPUs (for example, with the TFLite GPU Delegate). The program is described below.
import tensorflow as tf
tf.compat.v1.enable_eager_execution()
# Float16 Quantization - Input/Output=float32
converter = tf.lite.TFLiteConverter.from_saved_model('./saved_model')
converter.experimental_new_converter = True #<--- Not necessary if you are using Tensorflow v2.2.x or later.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_quant_model = converter.convert()
with open('./white_box_cartoonization_float16_quant.tflite', 'wb') as w:
w.write(tflite_quant_model)
print("Float16 Quantization complete! - white_box_cartoonization_float16_quant.tflite")
Execute it.
$ python3 float16_quantization.py
It seems to have been generated safely.
4-2-1-8. Full Integer Quantization to EdgeTPU convert
This step can be taken if Full Integer Quantization succeeds. Compile the model for use with the Google Coral EdgeTPU. The compilation will abort if the model contains unsupported or poorly implemented operations; in my opinion, the compiler is still unstable.
The Edge TPU Compiler can be installed by following the Edge TPU Compiler documentation here.
$ edgetpu_compiler -s white_box_cartoonization_full_integer_quant.tflite
Incidentally, the latest compiler, 2.1.302470888, is said to support efficient inference across multiple Edge TPUs (model pipelining). Currently only a C++ API is provided, which is a burden for me as a Python user. And I've bought three of them...
@iwatake2222 was one of the first to implement pipelining, at this repository.
- Go to Table of contents -
4-2-2. Quantization from a Tensorflow checkpoint (.meta)
Same as 4-2-1-2. Generate Freeze_Graph from checkpoint (.meta) to 4-2-1-8. Full Integer Quantization to EdgeTPU convert. Change the image dataset used for calibration according to the characteristics of the model.
4-2-3. Quantization from Tensorflow Freeze_Graph (.pb)
Same as 4-2-1-3. Generate a saved_model from Freeze_Graph to 4-2-1-8. Full Integer Quantization to EdgeTPU convert. Change the image dataset used for calibration according to the characteristics of the model.
4-2-4. Quantization from Tensorflow saved_model (.pb)
Same as 4-2-1-4. Weight Quantization from saved_model (weight-only quantization) to 4-2-1-8. Full Integer Quantization to EdgeTPU convert. Change the image dataset used for calibration according to the characteristics of the model.
4-2-5. Quantization from Tensorflow/Keras (.h5/.json)
This is the pattern when the model is provided in Keras' .h5 and .json formats. This time we'll use Faster-Grad-CAM as an example. Older Tensorflow releases up to v1.x and v2.1 seem to have an OOM (Out of Memory) bug when quantizing Keras models, so it is recommended to install Tensorflow v2.2.0 or tf-nightly for this work.
Thankfully, all the material needed for quantization is committed in the repository, except for the image dataset needed for calibration.
The missing calibration dataset will be generated in a later step.
4-2-5-1. Weight Quantization from .h5/.json (weight quantization)
You can do Weight Quantization
immediately because the material is available. First, we clone the repository.
$ git clone https://github.com/shinmura0/Faster-Grad-CAM.git
$ cd Faster-Grad-CAM/model
The following is the program that executes Weight Quantization. Compared to the earlier quantization from a Tensorflow checkpoint, the differences are how the model and weights are loaded and which conversion method is used: TFLiteConverter can accept a Keras model object with its weights already loaded.
1. model = tf.keras.models.model_from_json(open('model.json').read())
2. model.load_weights('weights.h5')
3. tf.lite.TFLiteConverter.from_keras_model(model)
import tensorflow as tf
tf.compat.v1.enable_eager_execution()
# Weight Quantization - Input/Output=float32
# INPUT = input_1 (float32, 1 x 96 x 96 x 3)
# OUTPUT = block_16_expand_relu, global_average_pooling2d_1
model = tf.keras.models.model_from_json(open('model.json').read())
model.load_weights('weights.h5')
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.experimental_new_converter = True #<--- Not necessary if you are using Tensorflow v2.2.x or later.
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
tflite_quant_model = converter.convert()
with open('./weights_weight_quant.tflite', 'wb') as w:
w.write(tflite_quant_model)
print("Weight Quantization complete! - weights_weight_quant.tflite")
Let's run it.
$ python3 weight_quantization.py
It seems to have been generated safely.
- Go to Table of contents -
4-2-5-2. Generating the calibration data set
Clone the images for the calibration dataset from @karaage0703's repository. Faster-Grad-CAM's trained model was trained on this gu dataset. Handling a pile of loose image files is tedious, so in the next step I pack them into a single file.
$ git clone https://github.com/karaage0703/janken_dataset.git
$ cd janken_dataset/gu
Create a program that packs the images into a single Numpy binary (.npy) file.
from PIL import Image
import os, glob
import numpy as np
dataset = []
files = glob.glob("*.JPG")
for file in files:
image = Image.open(file)
image = image.convert("RGB")
data = np.asarray(image)
dataset.append(data)
dataset = np.array(dataset)
np.save("janken_dataset", dataset)
Now, let's give it a go.
$ python3 image_to_npy.py
It seems to have been generated without incident.
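As a quick check of my own, you can confirm that the packed file loads as expected.
import numpy as np

dataset = np.load('janken_dataset.npy', allow_pickle=True)
print(dataset.shape)  # (number_of_images, H, W, 3) if all images share the same size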
- Go to Table of contents -
4-2-5-3. Integer Quantization from .h5/.json (8-bit integer quantization)
The method for Integer Quantization is almost the same as before; the only difference is that the calibration dataset is supplied from the local .npy file generated above instead of from Tensorflow Datasets.
import tensorflow as tf
import numpy as np
def representative_dataset_gen():
raw_test_data = np.load('janken_dataset.npy')
for image in raw_test_data:
image = tf.image.resize(image, (96, 96))
image = image / 255
calibration_data = image[np.newaxis, :, :, :]
yield [calibration_data]
tf.compat.v1.enable_eager_execution()
# Integer Quantization - Input/Output=float32
# INPUT = input_1 (float32, 1 x 96 x 96 x 3)
# OUTPUT = block_16_expand_relu, global_average_pooling2d_1
model = tf.keras.models.model_from_json(open('model.json').read())
model.load_weights('weights.h5')
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.experimental_new_converter = True #<--- Not necessary if you are using Tensorflow v2.2.x or later.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset_gen
tflite_quant_model = converter.convert()
with open('./weights_integer_quant.tflite', 'wb') as w:
w.write(tflite_quant_model)
print("Integer Quantization complete! - weights_integer_quant.tflite")
Let's run it.
$ python3 integer_quantization.py
It seems to have been generated without incident.
- Go to Table of contents -
4-2-5-4. Full Integer Quantization from .h5/.json (all 8-bit integer quantization)
Write a program to do Full Integer Quantization
.
import tensorflow as tf
import numpy as np
def representative_dataset_gen():
raw_test_data = np.load('janken_dataset.npy')
for image in raw_test_data:
image = tf.image.resize(image, (96, 96))
image = image / 255
calibration_data = image[np.newaxis, :, :, :]
yield [calibration_data]
tf.compat.v1.enable_eager_execution()
# Full Integer Quantization - Input/Output=uint8
# INPUT = input_1 (float32, 1 x 96 x 96 x 3)
# OUTPUT = block_16_expand_relu, global_average_pooling2d_1
model = tf.keras.models.model_from_json(open('model.json').read())
model.load_weights('weights.h5')
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.experimental_new_converter = True #<--- Not necessary if you are using Tensorflow v2.2.x or later.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
converter.representative_dataset = representative_dataset_gen
tflite_quant_model = converter.convert()
with open('./weights_full_integer_quant.tflite', 'wb') as w:
w.write(tflite_quant_model)
print("Full Integer Quantization complete! - weights_full_integer_quant.tflite")
Let's run it.
$ python3 full_integer_quantization.py
It seems to have been generated without incident.
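One point worth checking (my own addition): unlike the Weight/Integer Quantization models above, a model converted with inference_input_type/inference_output_type = tf.uint8 expects uint8 input and returns uint8 output, which you can confirm with the interpreter.
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path='./weights_full_integer_quant.tflite')
interpreter.allocate_tensors()
print(interpreter.get_input_details()[0]['dtype'])   # <class 'numpy.uint8'>
print(interpreter.get_output_details()[0]['dtype'])  # <class 'numpy.uint8'>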
- Go to Table of contents -
4-2-5-5. Float16 Quantization from .h5/.json (Float16 quantization)
Write a program to do Float16 Quantization
.
import tensorflow as tf
tf.compat.v1.enable_eager_execution()
# Float16 Quantization - Input/Output=float32
# INPUT = input_1 (float32, 1 x 96 x 96 x 3)
# OUTPUT = block_16_expand_relu, global_average_pooling2d_1
model = tf.keras.models.model_from_json(open('model.json').read())
model.load_weights('weights.h5')
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.experimental_new_converter = True #<--- Not necessary if you are using Tensorflow v2.2.x or later.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_quant_model = converter.convert()
with open('./weights_float16_quant.tflite', 'wb') as w:
w.write(tflite_quant_model)
print("Float16 Quantization complete! - weights_float16_quant.tflite")
Let's run it.
$ python3 float16_quantization.py
It seems to have been generated safely.
- Go to Table of contents -
4-2-5-6. Full Integer Quantization to EdgeTPU convert
Use the Full Integer Quantization model to generate an EdgeTPU-compatible model.
$ edgetpu_compiler -s weights_full_integer_quant.tflite
Edge TPU Compiler version 2.1.302470888
Model compiled successfully in 359 ms.
Input model: weights_full_integer_quant.tflite
Input size: 1.00MiB
Output model: weights_full_integer_quant_edgetpu.tflite
Output size: 1.06MiB
On-chip memory used for caching model parameters: 1014.00KiB
On-chip memory remaining for caching model parameters: 6.78MiB
Off-chip memory used for streaming uncached model parameters: 0.00B
Number of Edge TPU subgraphs: 1
Total number of operations: 71
Operation log: weights_full_integer_quant_edgetpu.log
Model successfully compiled but not all operations are supported by the Edge TPU. A percentage of the model will instead run on the CPU, which is slower. If possible, consider updating your model to use only operations supported by the Edge TPU. For details, visit g.co/coral/model-reqs.
Number of operations that will run on Edge TPU: 68
Number of operations that will run on CPU: 3
Operator Count Status
DEPTHWISE_CONV_2D 17 Mapped to Edge TPU
DEQUANTIZE 2 Operation is working on an unsupported data type
MEAN 1 Mapped to Edge TPU
ADD 10 Mapped to Edge TPU
QUANTIZE 1 Operation is otherwise supported, but not mapped due to some unspecified limitation
PAD 5 Mapped to Edge TPU
CONV_2D 35 Mapped to Edge TPU
It seems to have been generated safely.
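To actually run the compiled model on a Coral Edge TPU, a minimal sketch like the following can be used (my own example; it assumes the tflite_runtime package and libedgetpu are installed as described in Coral's documentation).
import numpy as np
from tflite_runtime.interpreter import Interpreter, load_delegate

interpreter = Interpreter(
    model_path='weights_full_integer_quant_edgetpu.tflite',
    experimental_delegates=[load_delegate('libedgetpu.so.1')])
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
image = np.zeros(input_details[0]['shape'], dtype=input_details[0]['dtype'])  # dummy uint8 input
interpreter.set_tensor(input_details[0]['index'], image)
interpreter.invoke()
for out in interpreter.get_output_details():
    print(out['name'], interpreter.get_tensor(out['index']).shape)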
- Go to Table of contents -
4-2-6. Quantization from a model for Tensorflow.js
Now it's time to tackle a more challenging task: quantizing the Posenet V2 ResNet50 model for Tensorflow.js, which was recently released by Google. Neither the checkpoint nor the training code is published, so it is a tricky task. The repository below is the one to work from. You will need to switch between Tensorflow v2.1.0 and Tensorflow v1.15.2. If you don't want to go through the trouble, please use PINTO_model_zoo, where quantized models for all patterns are already committed.
- Go to Table of contents -
4-2-6-1. Advance preparation
$ sudo pip3 uninstall tensorboard-plugin-wit tb-nightly \
tf-estimator-nightly tensorflow-gpu \
tensorflow tf-nightly tensorflow_estimator
$ sudo pip3 install tensorflow==2.1.0
$ git clone https://github.com/patlevin/tfjs-to-tf.git
$ cd tfjs-to-tf
$ sudo pip3 install . --no-deps
$ cd ..
$ git clone https://github.com/atomicbits/posenet-python.git
$ cd posenet-python
$ mkdir -p output
4-2-6-2. Generating a saved_model from Tensorflow.js
Create a saved_model
by the following command. By specifying stride
, it is possible to generate a model that balances accuracy and speed. You must save one or more sample images in the folder path specified in the argument of --image_dir
in advance. Let's run it.
$ python3 image_demo.py \
--model resnet50 \
--stride 16 \
--image_dir ./images \
--output_dir ./output
Running with an output stride of 16 created a saved_model under tf_models/posenet/resnet50_float/stride16/.
Check the structure with Netron. The INPUT is sub_2, but its shape is Float32 [1, ?, ?, 3]. As it is, quantization will fail, so another adjustment is needed here.
※ Actually, you can fix the INPUT shape of sub_2 just by editing tfjs_models/posenet/resnet50_float/stride16/model-stride16.json by hand and re-converting. However, I will deliberately take the long way around in order to explain the special handling needed to import a saved_model generated by Tensorflow v2.x into Tensorflow v1.x and process it.
- Go to Table of contents -
4-2-6-3. Import saved_model generated by Tensorflow v2.x into Tensorflow v1.x and process the input shape
Import saved_model
into Tensorflow v1.15.2
and process the input shape. The only reason I use Tensorflow v1.15.2
is because I want to use the TransformGraph
tool.
$ sudo pip3 uninstall tensorboard-plugin-wit tb-nightly \
tf-estimator-nightly tensorflow-gpu \
tensorflow tf-nightly tensorflow_estimator
$ sudo pip3 install tensorflow==1.15.2
The following is a program to change the input shape of the saved_model. It changes the input shape from [1, ?, ?, 3] to [1, 513, 513, 3] while replacing the name sub_2 with the name image. If you want a smaller input shape, change 513 to 257, for example. The part that differs slightly from the programs introduced so far is the logic for reading the .pb file, which is changed in order to import a saved_model generated by Tensorflow v2.x into Tensorflow v1.x.
### tensorflow-gpu==1.15.2
import sys
import tensorflow as tf
from tensorflow.tools.graph_transforms import TransformGraph
from tensorflow.python.platform import gfile
from tensorflow.core.protobuf import saved_model_pb2
from tensorflow.python.util import compat
with tf.compat.v1.Session() as sess:
# shape=[1, ?, ?, 3] -> shape=[1, 513, 513, 3]
# name='image' specifies the placeholder name of the converted model
inputs = tf.compat.v1.placeholder(tf.float32, shape=[1, 513, 513, 3], name='image')
#inputs = tf.compat.v1.placeholder(tf.float32, shape=[1, 385, 385, 3], name='image')
#inputs = tf.compat.v1.placeholder(tf.float32, shape=[1, 321, 321, 3], name='image')
#inputs = tf.compat.v1.placeholder(tf.float32, shape=[1, 257, 257, 3], name='image')
#inputs = tf.compat.v1.placeholder(tf.float32, shape=[1, 225, 225, 3], name='image')
with gfile.FastGFile('_tf_models/posenet/resnet50_float/stride32/saved_model.pb', 'rb') as f:
data = compat.as_bytes(f.read())
sm = saved_model_pb2.SavedModel()
sm.ParseFromString(data)
if 1 != len(sm.meta_graphs):
print('More than one graph found. Not sure which to write')
sys.exit(1)
# 'image:0' specifies the placeholder name of the model before conversion
tf.graph_util.import_graph_def(sm.meta_graphs[0].graph_def, input_map={'sub_2:0': inputs}, name='')
print([n for n in tf.compat.v1.get_default_graph().as_graph_def().node if n.name == 'image'])
# Delete Placeholder "image" before conversion
# see: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/tools/graph_transforms
# TransformGraph(
# graph_def(),
# input_name,
# output_names,
# conversion options
# )
optimized_graph_def = TransformGraph(
tf.compat.v1.get_default_graph().as_graph_def(),
'image',
['float_heatmaps','float_short_offsets','resnet_v1_50/displacement_fwd_2/BiasAdd','resnet_v1_50/displacement_bwd_2/BiasAdd'],
['strip_unused_nodes(type=float, shape="1,513,513,3")'])
tf.io.write_graph(optimized_graph_def, './', 'posenet_resnet50_32_513.pb', as_text=False)
Let's run it.
$ python3 replacement_of_input_placeholder_float32_resnet.py
It seems to have been generated safely.
- Go to Table of contents -
4-2-6-4. Installation of Tensorflow v2.2.0
Change Tensorflow v1.15.2
to Tensorflow v2.2.0
before quantization.
$ sudo pip3 uninstall tensorboard-plugin-wit tb-nightly \
tf-estimator-nightly tensorflow-gpu \
tensorflow tf-nightly tensorflow_estimator
$ sudo pip3 install tensorflow==2.2.0
4-2-6-5. Weight Quantization from saved_model (Weight-only quantization)
The quantization procedure from here on is the same as the previous one. The program for quantization is described below.
### tensorflow==2.2.0
import tensorflow as tf
import numpy as np
# Weight Quantization - Input/Output=float32
converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_posenet_resnet50_16_225')
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
tflite_quant_model = converter.convert()
with open('posenet_resnet50_16_225_weight_quant.tflite', 'wb') as w:
w.write(tflite_quant_model)
print("Weight Quantization complete! - posenet_resnet50_16_225_weight_quant.tflite")
# Weight Quantization - Input/Output=float32
converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_posenet_resnet50_16_257')
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
tflite_quant_model = converter.convert()
with open('posenet_resnet50_16_257_weight_quant.tflite', 'wb') as w:
w.write(tflite_quant_model)
print("Weight Quantization complete! - posenet_resnet50_16_257_weight_quant.tflite")
# Weight Quantization - Input/Output=float32
converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_posenet_resnet50_16_321')
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
tflite_quant_model = converter.convert()
with open('posenet_resnet50_16_321_weight_quant.tflite', 'wb') as w:
w.write(tflite_quant_model)
print("Weight Quantization complete! - posenet_resnet50_16_321_weight_quant.tflite")
# Weight Quantization - Input/Output=float32
converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_posenet_resnet50_16_385')
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
tflite_quant_model = converter.convert()
with open('posenet_resnet50_16_385_weight_quant.tflite', 'wb') as w:
w.write(tflite_quant_model)
print("Weight Quantization complete! - posenet_resnet50_16_385_weight_quant.tflite")
# Weight Quantization - Input/Output=float32
converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_posenet_resnet50_16_513')
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
tflite_quant_model = converter.convert()
with open('posenet_resnet50_16_513_weight_quant.tflite', 'wb') as w:
w.write(tflite_quant_model)
print("Weight Quantization complete! - posenet_resnet50_16_513_weight_quant.tflite")
# Weight Quantization - Input/Output=float32
converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_posenet_resnet50_32_225')
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
tflite_quant_model = converter.convert()
with open('posenet_resnet50_32_225_weight_quant.tflite', 'wb') as w:
w.write(tflite_quant_model)
print("Weight Quantization complete! - posenet_resnet50_32_225_weight_quant.tflite")
# Weight Quantization - Input/Output=float32
converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_posenet_resnet50_32_257')
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
tflite_quant_model = converter.convert()
with open('posenet_resnet50_32_257_weight_quant.tflite', 'wb') as w:
w.write(tflite_quant_model)
print("Weight Quantization complete! - posenet_resnet50_32_257_weight_quant.tflite")
# Weight Quantization - Input/Output=float32
converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_posenet_resnet50_32_321')
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
tflite_quant_model = converter.convert()
with open('posenet_resnet50_32_321_weight_quant.tflite', 'wb') as w:
w.write(tflite_quant_model)
print("Weight Quantization complete! - posenet_resnet50_32_321_weight_quant.tflite")
# Weight Quantization - Input/Output=float32
converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_posenet_resnet50_32_385')
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
tflite_quant_model = converter.convert()
with open('posenet_resnet50_32_385_weight_quant.tflite', 'wb') as w:
w.write(tflite_quant_model)
print("Weight Quantization complete! - posenet_resnet50_32_385_weight_quant.tflite")
# Weight Quantization - Input/Output=float32
converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_posenet_resnet50_32_513')
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
tflite_quant_model = converter.convert()
with open('posenet_resnet50_32_513_weight_quant.tflite', 'wb') as w:
w.write(tflite_quant_model)
print("Weight Quantization complete! - posenet_resnet50_32_513_weight_quant.tflite")
4-2-6-6. Integer Quantization from saved_model (8-bit integer quantization)
The method of Integer Quantization is the same as before. The images used for the calibration data set are 100 images with only people in them.
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
from PIL import Image
import os
import glob
## Generating a calibration data set
def representative_dataset_gen():
folder = ["images"]
image_size = 225
raw_test_data = []
for name in folder:
dir = "./" + name
files = glob.glob(dir + "/*.jpg")
for file in files:
image = Image.open(file)
image = image.convert("RGB")
image = image.resize((image_size, image_size))
image = np.asarray(image).astype(np.float32)
image = image[np.newaxis,:,:,:]
raw_test_data.append(image)
for data in raw_test_data:
yield [data]
# Integer Quantization - Input/Output=float32
converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_posenet_resnet50_16_225')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset_gen
tflite_quant_model = converter.convert()
with open('posenet_resnet50_16_225_integer_quant.tflite', 'wb') as w:
w.write(tflite_quant_model)
print("Integer Quantization complete! - posenet_resnet50_16_225_integer_quant.tflite")
# # Integer Quantization - Input/Output=float32
# converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_posenet_resnet50_16_257')
# converter.optimizations = [tf.lite.Optimize.DEFAULT]
# converter.representative_dataset = representative_dataset_gen
# tflite_quant_model = converter.convert()
# with open('posenet_resnet50_16_257_integer_quant.tflite', 'wb') as w:
# w.write(tflite_quant_model)
# print("Integer Quantization complete! - posenet_resnet50_16_257_integer_quant.tflite")
# # Integer Quantization - Input/Output=float32
# converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_posenet_resnet50_16_321')
# converter.optimizations = [tf.lite.Optimize.DEFAULT]
# converter.representative_dataset = representative_dataset_gen
# tflite_quant_model = converter.convert()
# with open('posenet_resnet50_16_321_integer_quant.tflite', 'wb') as w:
# w.write(tflite_quant_model)
# print("Integer Quantization complete! - posenet_resnet50_16_321_integer_quant.tflite")
# # Integer Quantization - Input/Output=float32
# converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_posenet_resnet50_16_385')
# converter.optimizations = [tf.lite.Optimize.DEFAULT]
# converter.representative_dataset = representative_dataset_gen
# tflite_quant_model = converter.convert()
# with open('posenet_resnet50_16_385_integer_quant.tflite', 'wb') as w:
# w.write(tflite_quant_model)
# print("Integer Quantization complete! - posenet_resnet50_16_385_integer_quant.tflite")
# # Integer Quantization - Input/Output=float32
# converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_posenet_resnet50_16_513')
# converter.optimizations = [tf.lite.Optimize.DEFAULT]
# converter.representative_dataset = representative_dataset_gen
# tflite_quant_model = converter.convert()
# with open('posenet_resnet50_16_513_integer_quant.tflite', 'wb') as w:
# w.write(tflite_quant_model)
# print("Integer Quantization complete! - posenet_resnet50_16_513_integer_quant.tflite")
# Integer Quantization - Input/Output=float32
converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_posenet_resnet50_32_225')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset_gen
tflite_quant_model = converter.convert()
with open('posenet_resnet50_32_225_integer_quant.tflite', 'wb') as w:
w.write(tflite_quant_model)
print("Integer Quantization complete! - posenet_resnet50_32_225_integer_quant.tflite")
# # Integer Quantization - Input/Output=float32
# converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_posenet_resnet50_32_257')
# converter.optimizations = [tf.lite.Optimize.DEFAULT]
# converter.representative_dataset = representative_dataset_gen
# tflite_quant_model = converter.convert()
# with open('posenet_resnet50_32_257_integer_quant.tflite', 'wb') as w:
# w.write(tflite_quant_model)
# print("Integer Quantization complete! - posenet_resnet50_32_257_integer_quant.tflite")
# # Integer Quantization - Input/Output=float32
# converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_posenet_resnet50_32_321')
# converter.optimizations = [tf.lite.Optimize.DEFAULT]
# converter.representative_dataset = representative_dataset_gen
# tflite_quant_model = converter.convert()
# with open('posenet_resnet50_32_321_integer_quant.tflite', 'wb') as w:
# w.write(tflite_quant_model)
# print("Integer Quantization complete! - posenet_resnet50_32_321_integer_quant.tflite")
# # Integer Quantization - Input/Output=float32
# converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_posenet_resnet50_32_385')
# converter.optimizations = [tf.lite.Optimize.DEFAULT]
# converter.representative_dataset = representative_dataset_gen
# tflite_quant_model = converter.convert()
# with open('posenet_resnet50_32_385_integer_quant.tflite', 'wb') as w:
# w.write(tflite_quant_model)
# print("Integer Quantization complete! - posenet_resnet50_32_385_integer_quant.tflite")
# # Integer Quantization - Input/Output=float32
# converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_posenet_resnet50_32_513')
# converter.optimizations = [tf.lite.Optimize.DEFAULT]
# converter.representative_dataset = representative_dataset_gen
# tflite_quant_model = converter.convert()
# with open('posenet_resnet50_32_513_integer_quant.tflite', 'wb') as w:
# w.write(tflite_quant_model)
# print("Integer Quantization complete! - posenet_resnet50_32_513_integer_quant.tflite")
4-2-6-7. Full Integer Quantization from saved_model (All 8-bit integer quantization)
The method of Full Integer Quantization is the same as before. The images used for the calibration data set are 100 images with only people in them.
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
from PIL import Image
import os
import glob
## Generating a calibration data set
def representative_dataset_gen():
folder = ["images"]
image_size = 225
raw_test_data = []
for name in folder:
dir = "./" + name
files = glob.glob(dir + "/*.jpg")
for file in files:
image = Image.open(file)
image = image.convert("RGB")
image = image.resize((image_size, image_size))
image = np.asarray(image).astype(np.float32)
image = image[np.newaxis,:,:,:]
raw_test_data.append(image)
for data in raw_test_data:
yield [data]
# Integer Quantization - Input/Output=uint8
converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_posenet_resnet50_16_225')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
tflite_quant_model = converter.convert()
with open('posenet_resnet50_16_225_full_integer_quant.tflite', 'wb') as w:
w.write(tflite_quant_model)
print("Integer Quantization complete! - posenet_resnet50_16_225_full_integer_quant.tflite")
# # Integer Quantization - Input/Output=uint8
# converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_posenet_resnet50_16_257')
# converter.optimizations = [tf.lite.Optimize.DEFAULT]
# converter.representative_dataset = representative_dataset_gen
# converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# converter.inference_input_type = tf.uint8
# converter.inference_output_type = tf.uint8
# tflite_quant_model = converter.convert()
# with open('posenet_resnet50_16_257_full_integer_quant.tflite', 'wb') as w:
# w.write(tflite_quant_model)
# print("Integer Quantization complete! - posenet_resnet50_16_257_full_integer_quant.tflite")
# # Integer Quantization - Input/Output=uint8
# converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_posenet_resnet50_16_321')
# converter.optimizations = [tf.lite.Optimize.DEFAULT]
# converter.representative_dataset = representative_dataset_gen
# converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# converter.inference_input_type = tf.uint8
# converter.inference_output_type = tf.uint8
# tflite_quant_model = converter.convert()
# with open('posenet_resnet50_16_321_full_integer_quant.tflite', 'wb') as w:
# w.write(tflite_quant_model)
# print("Integer Quantization complete! - posenet_resnet50_16_321_full_integer_quant.tflite")
# # Integer Quantization - Input/Output=uint8
# converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_posenet_resnet50_16_385')
# converter.optimizations = [tf.lite.Optimize.DEFAULT]
# converter.representative_dataset = representative_dataset_gen
# converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# converter.inference_input_type = tf.uint8
# converter.inference_output_type = tf.uint8
# tflite_quant_model = converter.convert()
# with open('posenet_resnet50_16_385_full_integer_quant.tflite', 'wb') as w:
# w.write(tflite_quant_model)
# print("Integer Quantization complete! - posenet_resnet50_16_385_full_integer_quant.tflite")
# # Integer Quantization - Input/Output=uint8
# converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_posenet_resnet50_16_513')
# converter.optimizations = [tf.lite.Optimize.DEFAULT]
# converter.representative_dataset = representative_dataset_gen
# converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# converter.inference_input_type = tf.uint8
# converter.inference_output_type = tf.uint8
# tflite_quant_model = converter.convert()
# with open('posenet_resnet50_16_513_full_integer_quant.tflite', 'wb') as w:
# w.write(tflite_quant_model)
# print("Integer Quantization complete! - posenet_resnet50_16_513_full_integer_quant.tflite")
# Integer Quantization - Input/Output=uint8
converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_posenet_resnet50_32_225')
converter.experimental_new_converter = True
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
tflite_quant_model = converter.convert()
with open('posenet_resnet50_32_225_full_integer_quant.tflite', 'wb') as w:
w.write(tflite_quant_model)
print("Integer Quantization complete! - posenet_resnet50_32_225_full_integer_quant.tflite")
# # Integer Quantization - Input/Output=uint8
# converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_posenet_resnet50_32_257')
# converter.optimizations = [tf.lite.Optimize.DEFAULT]
# converter.representative_dataset = representative_dataset_gen
# converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# converter.inference_input_type = tf.uint8
# converter.inference_output_type = tf.uint8
# tflite_quant_model = converter.convert()
# with open('posenet_resnet50_32_257_full_integer_quant.tflite', 'wb') as w:
# w.write(tflite_quant_model)
# print("Integer Quantization complete! - posenet_resnet50_32_257_full_integer_quant.tflite")
# # Integer Quantization - Input/Output=uint8
# converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_posenet_resnet50_32_321')
# converter.optimizations = [tf.lite.Optimize.DEFAULT]
# converter.representative_dataset = representative_dataset_gen
# converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# converter.inference_input_type = tf.uint8
# converter.inference_output_type = tf.uint8
# tflite_quant_model = converter.convert()
# with open('posenet_resnet50_32_321_full_integer_quant.tflite', 'wb') as w:
# w.write(tflite_quant_model)
# print("Integer Quantization complete! - posenet_resnet50_32_321_full_integer_quant.tflite")
# # Integer Quantization - Input/Output=uint8
# converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_posenet_resnet50_32_385')
# converter.optimizations = [tf.lite.Optimize.DEFAULT]
# converter.representative_dataset = representative_dataset_gen
# converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# converter.inference_input_type = tf.uint8
# converter.inference_output_type = tf.uint8
# tflite_quant_model = converter.convert()
# with open('posenet_resnet50_32_385_full_integer_quant.tflite', 'wb') as w:
# w.write(tflite_quant_model)
# print("Integer Quantization complete! - posenet_resnet50_32_385_full_integer_quant.tflite")
# # Integer Quantization - Input/Output=uint8
# converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_posenet_resnet50_32_513')
# converter.optimizations = [tf.lite.Optimize.DEFAULT]
# converter.representative_dataset = representative_dataset_gen
# converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# converter.inference_input_type = tf.uint8
# converter.inference_output_type = tf.uint8
# tflite_quant_model = converter.convert()
# with open('posenet_resnet50_32_513_full_integer_quant.tflite', 'wb') as w:
# w.write(tflite_quant_model)
# print("Integer Quantization complete! - posenet_resnet50_32_513_full_integer_quant.tflite")
4-2-6-8. Float16 Quantization from saved_model (Float16 quantization)
The method of Float16 Quantization is the same as before.
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
from PIL import Image
import os
import glob
# Float16 Quantization - Input/Output=float32
converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_posenet_resnet50_16_225')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_quant_model = converter.convert()
with open('posenet_resnet50_16_225_float16_quant.tflite', 'wb') as w:
w.write(tflite_quant_model)
print("Integer Quantization complete! - posenet_resnet50_16_225_float16_quant.tflite")
# Integer Quantization - Input/Output=float32
converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_posenet_resnet50_16_257')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_quant_model = converter.convert()
with open('posenet_resnet50_16_257_float16_quant.tflite', 'wb') as w:
w.write(tflite_quant_model)
print("Integer Quantization complete! - posenet_resnet50_16_257_float16_quant.tflite")
# Integer Quantization - Input/Output=float32
converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_posenet_resnet50_16_321')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_quant_model = converter.convert()
with open('posenet_resnet50_16_321_float16_quant.tflite', 'wb') as w:
w.write(tflite_quant_model)
print("Integer Quantization complete! - posenet_resnet50_16_321_float16_quant.tflite")
# Integer Quantization - Input/Output=float32
converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_posenet_resnet50_16_385')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_quant_model = converter.convert()
with open('posenet_resnet50_16_385_float16_quant.tflite', 'wb') as w:
w.write(tflite_quant_model)
print("Integer Quantization complete! - posenet_resnet50_16_385_float16_quant.tflite")
# Integer Quantization - Input/Output=float32
converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_posenet_resnet50_16_513')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_quant_model = converter.convert()
with open('posenet_resnet50_16_513_float16_quant.tflite', 'wb') as w:
w.write(tflite_quant_model)
print("Integer Quantization complete! - posenet_resnet50_16_513_float16_quant.tflite")
# Integer Quantization - Input/Output=float32
converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_posenet_resnet50_32_225')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_quant_model = converter.convert()
with open('posenet_resnet50_32_225_float16_quant.tflite', 'wb') as w:
w.write(tflite_quant_model)
print("Integer Quantization complete! - posenet_resnet50_32_225_float16_quant.tflite")
# Integer Quantization - Input/Output=float32
converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_posenet_resnet50_32_257')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_quant_model = converter.convert()
with open('posenet_resnet50_32_257_float16_quant.tflite', 'wb') as w:
w.write(tflite_quant_model)
print("Integer Quantization complete! - posenet_resnet50_32_257_float16_quant.tflite")
# Integer Quantization - Input/Output=float32
converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_posenet_resnet50_32_321')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_quant_model = converter.convert()
with open('posenet_resnet50_32_321_float16_quant.tflite', 'wb') as w:
w.write(tflite_quant_model)
print("Integer Quantization complete! - posenet_resnet50_32_321_float16_quant.tflite")
# Integer Quantization - Input/Output=float32
converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_posenet_resnet50_32_385')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_quant_model = converter.convert()
with open('posenet_resnet50_32_385_float16_quant.tflite', 'wb') as w:
w.write(tflite_quant_model)
print("Integer Quantization complete! - posenet_resnet50_32_385_float16_quant.tflite")
# Integer Quantization - Input/Output=float32
converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_posenet_resnet50_32_513')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_quant_model = converter.convert()
with open('posenet_resnet50_32_513_float16_quant.tflite', 'wb') as w:
w.write(tflite_quant_model)
print("Integer Quantization complete! - posenet_resnet50_32_513_float16_quant.tflite")
4-2-6-9. Full Integer Quantization to EdgeTPU convert
The method of compiling for the EdgeTPU is the same as before.
$ edgetpu_compiler -s posenet_resnet50_16_225_full_integer_quant.tflite
$ edgetpu_compiler -s posenet_resnet50_16_257_full_integer_quant.tflite
$ edgetpu_compiler -s posenet_resnet50_16_321_full_integer_quant.tflite
$ edgetpu_compiler -s posenet_resnet50_16_385_full_integer_quant.tflite
$ edgetpu_compiler -s posenet_resnet50_16_513_full_integer_quant.tflite
$ edgetpu_compiler -s posenet_resnet50_32_225_full_integer_quant.tflite
$ edgetpu_compiler -s posenet_resnet50_32_257_full_integer_quant.tflite
$ edgetpu_compiler -s posenet_resnet50_32_321_full_integer_quant.tflite
$ edgetpu_compiler -s posenet_resnet50_32_385_full_integer_quant.tflite
$ edgetpu_compiler -s posenet_resnet50_32_513_full_integer_quant.tflite
4-2-7. Quantize the model generated by the TensorFlow Object Detection API
What is the Tensorflow Object Detection API? If you want to know more, please see the article on Object Detection Tools and the accompanying repository (an Object Detection API library for TensorFlow) by Karaage-san. They are very helpful.
The method of training with the Object Detection API is explained clearly in the article above, so I won't go into it here. Instead, I will explain how to quantize the generated model, with the Post-Process added to it to improve performance. Use Tensorflow v1.15.2.
Assume you have at hand a MobileNetV2-SSDLite checkpoint that was trained for 44,548 steps with the Object Detection API from a clone of https://github.com/tensorflow/models.git.
4-2-7-1. Generating a .pb file with Post-Process
Execute the following command to output a Freeze_Graph with post-processing added.
$ cd ${HOME}/models/research
$ export PYTHONPATH=`pwd`:`pwd`/slim:$PYTHONPATH
$ mkdir -p export
$ python3 object_detection/export_tflite_ssd_graph.py \
--pipeline_config_path=pipeline.config \
--trained_checkpoint_prefix=model.ckpt-44548 \
--output_directory=export \
--add_postprocessing_op=True
TFLite_Detection_PostProcess is a custom operation.
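For reference, here is a minimal sketch (not part of the original procedure) of how the four outputs of TFLite_Detection_PostProcess come back from the Tensorflow Lite Python interpreter, assuming the weight-quantized file produced in the next step and a random dummy input:
### tensorflow==2.2.0 (tf.lite.Interpreter is also available in v1.15.2)
import numpy as np
import tensorflow as tf
# Run the converted model once with a dummy image and print the four post-process outputs.
interpreter = tf.lite.Interpreter(model_path='ssdlite_mobilenet_v2_voc_300_weight_quant.tflite')
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
# Dummy normalized input in [-1, 1] with shape [1, 300, 300, 3]
dummy = np.random.uniform(-1.0, 1.0, size=input_details[0]['shape']).astype(np.float32)
interpreter.set_tensor(input_details[0]['index'], dummy)
interpreter.invoke()
# Typically: boxes [1, N, 4], class ids [1, N], scores [1, N], number of detections [1]
for detail in output_details:
    print(detail['name'], interpreter.get_tensor(detail['index']).shape)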
4-2-7-2. Weight Quantization from Freeze_Graph (Weight-only quantization)
I won't do anything special, but here are some points to consider when quantizing.
1. Using the Tensorflow v1.x API from_frozen_graph
2. For .pb files containing custom operations, converter.allow_custom_ops = True
import tensorflow as tf
tf.compat.v1.enable_eager_execution()
# Weight Quantization - Input/Output=float32
graph_def_file="tflite_graph_with_postprocess.pb"
input_arrays=["normalized_input_image_tensor"]
output_arrays=['TFLite_Detection_PostProcess','TFLite_Detection_PostProcess:1',
'TFLite_Detection_PostProcess:2','TFLite_Detection_PostProcess:3']
input_tensor={"normalized_input_image_tensor":[1,300,300,3]}
converter = tf.lite.TFLiteConverter.from_frozen_graph(graph_def_file, input_arrays,
output_arrays,input_tensor)
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
converter.allow_custom_ops = True
tflite_quant_model = converter.convert()
with open('./ssdlite_mobilenet_v2_voc_300_weight_quant.tflite', 'wb') as w:
w.write(tflite_quant_model)
print("Weight Quantization complete! - ssdlite_mobilenet_v2_voc_300_weight_quant.tflite")
Let's run it.
$ python3 weight_quantization.py
It seems to have been generated safely.
- Go to Table of contents -
4-2-7-3. Integer Quantization from Freeze_Graph (8-bit integer quantization)
It is almost identical to Weight Quantization.
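The script below reads the Pascal VOC 2007 validation split from ~/TFDS with download=False, so the data set has to be fetched once beforehand. A minimal sketch of that one-time download (the ~/TFDS data_dir is just my local convention):
### tensorflow_datasets
import tensorflow_datasets as tfds
# One-time download of the VOC 2007 validation split into ~/TFDS.
# After this finishes, the quantization scripts can keep using download=False.
tfds.load(name="voc/2007", split="validation", data_dir="~/TFDS", download=True)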
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
def representative_dataset_gen():
for data in raw_test_data.take(100):
image = data['image'].numpy()
image = tf.image.resize(image, (300, 300))
image = image[np.newaxis,:,:,:]
image = image - 127.5
image = image * 0.007843
yield [image]
tf.compat.v1.enable_eager_execution()
raw_test_data, info = tfds.load(name="voc/2007", with_info=True,
split="validation", data_dir="~/TFDS", download=False)
# Integer Quantization - Input/Output=float32
graph_def_file="tflite_graph_with_postprocess.pb"
input_arrays=["normalized_input_image_tensor"]
output_arrays=['TFLite_Detection_PostProcess','TFLite_Detection_PostProcess:1',
'TFLite_Detection_PostProcess:2','TFLite_Detection_PostProcess:3']
input_tensor={"normalized_input_image_tensor":[1,300,300,3]}
converter = tf.lite.TFLiteConverter.from_frozen_graph(graph_def_file, input_arrays,
output_arrays,input_tensor)
converter.allow_custom_ops=True
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset_gen
tflite_quant_model = converter.convert()
with open('./ssdlite_mobilenet_v2_voc_300_integer_quant_with_postprocess.tflite', 'wb') as w:
w.write(tflite_quant_model)
print("Integer Quantization complete! - ssdlite_mobilenet_v2_voc_300_integer_quant_with_postprocess.tflite")
4-2-7-4. Full Integer Quantization from Freeze_Graph (All 8-bit integer quantization)
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
def representative_dataset_gen():
for data in raw_test_data.take(100):
image = data['image'].numpy()
image = tf.image.resize(image, (300, 300))
image = image[np.newaxis,:,:,:]
image = image - 127.5
image = image * 0.007843
yield [image]
tf.compat.v1.enable_eager_execution()
raw_test_data, info = tfds.load(name="voc/2007", with_info=True,
split="validation", data_dir="~/TFDS", download=False)
# Full Integer Quantization - Input/Output=float32
graph_def_file="tflite_graph_with_postprocess.pb"
input_arrays=["normalized_input_image_tensor"]
output_arrays=['TFLite_Detection_PostProcess','TFLite_Detection_PostProcess:1',
'TFLite_Detection_PostProcess:2','TFLite_Detection_PostProcess:3']
input_tensor={"normalized_input_image_tensor":[1,300,300,3]}
converter = tf.lite.TFLiteConverter.from_frozen_graph(graph_def_file, input_arrays,
output_arrays,input_tensor)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.allow_custom_ops=True
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8,tf.lite.OpsSet.SELECT_TF_OPS]
converter.representative_dataset = representative_dataset_gen
tflite_quant_model = converter.convert()
with open('./ssdlite_mobilenet_v2_voc_300_full_integer_quant_with_postprocess.tflite', 'wb') as w:
w.write(tflite_quant_model)
print("Full Integer Quantization complete! - ssdlite_mobilenet_v2_voc_300_full_integer_quant_with_postprocess.tflite")
4-2-7-5. Float16 Quantization from Freeze_Graph (Float16 quantization)
import tensorflow as tf
tf.compat.v1.enable_eager_execution()
# Float16 Quantization - Input/Output=float32
graph_def_file="tflite_graph_with_postprocess.pb"
input_arrays=["normalized_input_image_tensor"]
output_arrays=['TFLite_Detection_PostProcess','TFLite_Detection_PostProcess:1',
'TFLite_Detection_PostProcess:2','TFLite_Detection_PostProcess:3']
input_tensor={"normalized_input_image_tensor":[1,300,300,3]}
converter = tf.lite.TFLiteConverter.from_frozen_graph(graph_def_file, input_arrays,
output_arrays,input_tensor)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
converter.allow_custom_ops = True
tflite_quant_model = converter.convert()
with open('./ssdlite_mobilenet_v2_voc_300_float16_quant.tflite', 'wb') as w:
w.write(tflite_quant_model)
print("Float16 Quantization complete! - ssdlite_mobilenet_v2_voc_300_float16_quant.tflite")
4-2-7-6. Full Integer Quantization to EdgeTPU convert
$ edgetpu_compiler -s ssdlite_mobilenet_v2_voc_300_full_integer_quant_with_postprocess.tflite
4-2-8. Quantize models containing operations that are not supported by Tensorflow Lite but are supported by Tensorflow
The operators implemented in Tensorflow Lite are not exactly the same as those in Tensorflow itself, and a significant number of them remain unimplemented. Unfortunately, this has made it difficult to convert many long-standing Tensorflow models to Tensorflow Lite in their entirety. However, with the Flex Delegate function implemented around the end of last year, it is now possible to offload processing to Tensorflow itself when an operator is not yet implemented in Tensorflow Lite. The implementation does not seem to be complete yet: some models cannot be Integer-Quantized, and there is only a C++ API, with no Python API. It is not widely known, but I think it is a convenient feature that can casually expand the range of usable models for engineers who can implement in C++.
Here is an example of Mask-RCNN Inception V2. Unfortunately, Mask-RCNN Inception V2 does not support Integer Quantization and Full Integer Quantization at this time. You also need to install Tensorflow v2.2.0 or tf-nightly to perform this procedure.
1. https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md#coco-trained-models
2. http://download.tensorflow.org/models/object_detection/mask_rcnn_inception_v2_coco_2018_01_28.tar.gz
4-2-8-1. Generate Mask-RCNN Inception V2 .pb file
I will proceed on the assumption that the Tensorflow Object Detection API is in place. First, download the Mask-RCNN Inception V2 checkpoint from the official website and convert it to Freeze_Graph using the Object Detection API script.
https://github.com/matterport/Mask_RCNN/issues/563
https://github.com/PINTO0309/TensorflowLite-flexdelegate
$ cd ~/Downloads
$ wget http://download.tensorflow.org/models/object_detection/mask_rcnn_inception_v2_coco_2018_01_28.tar.gz
$ tar -zxvf mask_rcnn_inception_v2_coco_2018_01_28.tar.gz
$ cd ${HOME}/models/research
$ export PYTHONPATH=`pwd`:`pwd`/slim:$PYTHONPATH
$ mkdir -p ${HOME}/Downloads/mask_rcnn_inception_v2_coco_2018_01_28/export
$ sudo pip3 uninstall tensorboard-plugin-wit tb-nightly \
tf-estimator-nightly tensorflow-gpu \
tensorflow tf-nightly tensorflow_estimator
$ sudo pip3 install tensorflow==2.2.0
$ python3 object_detection/export_inference_graph.py \
--input_type=image_tensor \
--pipeline_config_path=${HOME}/Downloads/mask_rcnn_inception_v2_coco_2018_01_28/pipeline.config \
--trained_checkpoint_prefix=${HOME}/Downloads/mask_rcnn_inception_v2_coco_2018_01_28/model.ckpt \
--output_directory=${HOME}/Downloads/mask_rcnn_inception_v2_coco_2018_01_28/test \
--input_shape=1,256,256,3 \
--write_inference_graph=True
$ python3 object_detection/export_inference_graph.py \
--input_type=image_tensor \
--pipeline_config_path=${HOME}/Downloads/mask_rcnn_inception_v2_coco_2018_01_28/pipeline.config \
--trained_checkpoint_prefix=${HOME}/Downloads/mask_rcnn_inception_v2_coco_2018_01_28/model.ckpt \
--output_directory=${HOME}/Downloads/mask_rcnn_inception_v2_coco_2018_01_28/test \
--input_shape=1,512,512,3 \
--write_inference_graph=True
4-2-8-2. Weight Quantization of Mask-RCNN Inception V2 (Weight-only quantization)
The points of this work are as follows.
1. Tensorflow v2.2.0 or higher is installed
2. tf.lite.OpsSet.SELECT_TF_OPS is specified in converter.target_spec.supported_ops
### tensorflow==2.2.0
import tensorflow as tf
# Weight Quantization - Input/Output=float32
converter = tf.lite.TFLiteConverter.from_saved_model('./saved_model')
converter.experimental_new_converter = True
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS,tf.lite.OpsSet.SELECT_TF_OPS]
tflite_quant_model = converter.convert()
with open('./mask_rcnn_inception_v2_coco_weight_quant.tflite', 'wb') as w:
w.write(tflite_quant_model)
print("Weight Quantization complete! - mask_rcnn_inception_v2_coco_weight_quant.tflite")
4-2-8-3. Float16 Quantization in Mask-RCNN Inception V2 (Float16 quantization)
The point of the work is the same as for Weight Quantization.
### tensorflow==2.2.0
import tensorflow as tf
# Float16 Quantization - Input/Output=float32
converter = tf.lite.TFLiteConverter.from_saved_model('./saved_model')
converter.experimental_new_converter = True
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS,tf.lite.OpsSet.SELECT_TF_OPS]
converter.target_spec.supported_types = [tf.float16]
#converter.representative_dataset = representative_dataset_gen # Not needed for Float16 Quantization, and representative_dataset_gen is not defined in this script
tflite_quant_model = converter.convert()
with open('./mask_rcnn_inception_v2_coco_float16_quant.tflite', 'wb') as w:
w.write(tflite_quant_model)
print("Float16 Quantization complete! - mask_rcnn_inception_v2_coco_float16_quant.tflite")
4-2-8-4. Running a model with Flex Delegate (Tensorflow Select Ops) enabled
Unfortunately, I don't have the skills to implement this in C++. Even so, I made an effort and got ENet to work, including operations that Tensorflow Lite does not support. The contents are a mess, but you can find the wreckage in the repository TensorflowLite-flexdelegate. You need to build Tensorflow Lite with the Flex feature enabled. Please see the above repository for more information.
4-2-9. Quantization from a model for PyTorch
Lately, I feel like there are more and more interesting models and projects implemented in PyTorch. Here's an example of how to convert a PyTorch model into a Tensorflow Lite quantized model: I convert the PyTorch model of 3D Multi-Person Pose Estimation into a quantized Tensorflow Lite model. The original story is 3D PoseEstimation (Multi-Person) by OpenVINO + Corei7 CPU only [14 FPS-18 FPS] - Qiita - PINTO. You need Tensorflow v2.2.0 to perform this work.
Let me make use of the models and converters published below. **[- Go to Table of contents -](#2-table-of-contents)**
3D PoseEstimation + OpenVINO + Corei7 CPU only + 720p USB Camera [inference speed roughly 18 FPS]. I can't beat yukihiko-chan, but this performance comes out with the CPU only at HD resolution. Recording and the UI display take a share of the performance. I think 3D modeling would be tough. https://t.co/rvZC00Olrl
— Super PINTO (@PINTO03091) March 21, 2020
4-2-9-1. Advance preparation (PyTorch->ONNX)
An overview of the quantization workflow from a PyTorch model is provided below.
1. Clone open_model_zoo
2. Download the public model using downloader.py from open_model_zoo
3. Convert the PyTorch model to an ONNX model using converter.py from open_model_zoo (the standard PyTorch feature torch.onnx._export(...) would probably work as well; see the sketch after this list)
4. Convert the ONNX model to a Keras model
5. Convert the Keras model to a saved_model
6. Quantize the saved_model
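As a reference for step 3, here is a minimal sketch of exporting a PyTorch model to ONNX with the standard API. The toy network and file name below are placeholders rather than the actual human-pose-estimation-3d-0001 model; the input name 'data' matches what the later onnx2keras step expects:
### torch
import torch
import torch.nn as nn
# Toy stand-in network; in practice this would be the 3D pose estimation model.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
model.eval()
# Dummy NCHW input; the 3D pose model itself expects 1x3x256x448.
dummy_input = torch.randn(1, 3, 256, 448)
torch.onnx.export(model, dummy_input, 'toy_model.onnx',
                  input_names=['data'], output_names=['features'])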
$ git clone https://github.com/opencv/open_model_zoo.git
$ cd open_model_zoo/tools/downloader
$ ./downloader.py --name human-pose-estimation-3d-0001
$ ./converter.py --name human-pose-estimation-3d-0001
A part of the structure of the ONNX model is shown in the figure below.
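If you prefer to inspect the structure in text form, a short snippet using the onnx Python package prints the graph along with its input and output names (the single input should be called 'data'):
### onnx
import onnx
# Load the converted model and dump a readable description of the graph.
onnx_model = onnx.load('human-pose-estimation-3d-0001.onnx')
print(onnx.helper.printable_graph(onnx_model.graph))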
- Go to Table of contents -
4-2-9-2. ONNX->Keras conversion by onnx2keras
First, install Tensorflow v2.2.0 and onnx2keras.
$ sudo pip3 uninstall tensorboard-plugin-wit tb-nightly \
tf-estimator-nightly tensorflow-gpu \
tensorflow tf-nightly tensorflow_estimator
$ sudo pip3 install tensorflow==2.2.0
$ sudo pip3 install onnx2keras
Next, here is the program to convert the ONNX model to a Keras model, and then to a saved_model.
import onnx
from onnx2keras import onnx_to_keras
import tensorflow as tf
import shutil
onnx_model = onnx.load('human-pose-estimation-3d-0001.onnx')
k_model = onnx_to_keras(onnx_model=onnx_model, input_names=['data'], change_ordering=True)
shutil.rmtree('saved_model', ignore_errors=True)
tf.saved_model.save(k_model, 'saved_model')
Let's run it.
$ python3 onnx_to_keras.py
It seems to have been generated safely. Once you've come this far, it's no different from the quantization procedure you've been working on so far.
- Go to Table of contents -
4-2-9-3. Weight Quantization from saved_model (Weight-only quantization)
### tensorflow==2.2.0
import tensorflow as tf
# Weight Quantization - Input/Output=float32
converter = tf.lite.TFLiteConverter.from_saved_model('saved_model')
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
#converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS,tf.lite.OpsSet.SELECT_TF_OPS]
tflite_quant_model = converter.convert()
with open('human_pose_estimation_3d_0001_256x448_weight_quant.tflite', 'wb') as w:
w.write(tflite_quant_model)
print("Weight Quantization complete! - human_pose_estimation_3d_0001_256x448_weight_quant.tflite")
4-2-9-4. Integer Quantization from saved_model (8-bit integer quantization)
The image data set for calibration is 100 images extracted from Pascal-VOC 2007, containing only images of people.
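The script below loads these images from calibration_data_img.npy. As a reference, here is a minimal sketch of how such a file could be produced from a local folder of person images (the ./images path is only an assumption about where the images are kept):
### numpy / Pillow
import glob
import numpy as np
from PIL import Image
raw_test_data = []
for path in sorted(glob.glob('./images/*.jpg')):
    # Keep the original resolution; resizing happens later inside representative_dataset_gen.
    image = Image.open(path).convert('RGB')
    raw_test_data.append(np.asarray(image).astype(np.uint8))
# The images have different sizes, so store them as an object array (hence allow_pickle=True when loading).
calib = np.empty(len(raw_test_data), dtype=object)
calib[:] = raw_test_data
np.save('calibration_data_img.npy', calib)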
### tensorflow==2.2.0
import tensorflow as tf
import numpy as np
def representative_dataset_gen():
for image in raw_test_data:
image = tf.image.resize(image, (256, 448))
image = image[np.newaxis,:,:,:]
image = image - 127.5
image = image * 0.007843
yield [image]
raw_test_data = np.load('calibration_data_img.npy', allow_pickle=True)
# Integer Quantization - Input/Output=float32
converter = tf.lite.TFLiteConverter.from_saved_model('saved_model')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset_gen
tflite_quant_model = converter.convert()
with open('human_pose_estimation_3d_0001_256x448_integer_quant.tflite', 'wb') as w:
w.write(tflite_quant_model)
print("Integer Quantization complete! - human_pose_estimation_3d_0001_256x448_integer_quant.tflite")
4-2-9-5. Full Integer Quantization from saved_model (All 8-bit integer quantization)
### tensorflow==2.2.0
import tensorflow as tf
import numpy as np
def representative_dataset_gen():
for image in raw_test_data:
image = tf.image.resize(image, (256, 448))
image = image[np.newaxis,:,:,:]
image = image - 127.5
image = image * 0.007843
yield [image]
raw_test_data = np.load('calibration_data_img.npy', allow_pickle=True)
# Full Integer Quantization - Input/Output=uint8
converter = tf.lite.TFLiteConverter.from_saved_model('saved_model')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
converter.representative_dataset = representative_dataset_gen
tflite_quant_model = converter.convert()
with open('human_pose_estimation_3d_0001_256x448_full_integer_quant.tflite', 'wb') as w:
w.write(tflite_quant_model)
print("Full Integer Quantization complete! - human_pose_estimation_3d_0001_256x448_full_integer_quant.tflite")
4-2-9-6. Float16 Quantization from saved_model (Float16 quantization)
### tensorflow==2.2.0
import tensorflow as tf
# Float16 Quantization - Input/Output=float32
converter = tf.lite.TFLiteConverter.from_saved_model('saved_model')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_quant_model = converter.convert()
with open('human_pose_estimation_3d_0001_256x448_float16_quant.tflite', 'wb') as w:
w.write(tflite_quant_model)
print("Float16 Quantization complete! - human_pose_estimation_3d_0001_256x448_float16_quant.tflite")
4-2-9-7. Full Integer Quantization to EdgeTPU convert
Compile to an EdgeTPU model using the Full Integer Quantization model.
$ edgetpu_compiler -s human_pose_estimation_3d_0001_256x448_full_integer_quant.tflite
The operator Elu, which appears early in the model, does not support quantization, resulting in a rather disappointing EdgeTPU model. Still, some of the early stages were successfully converted to run on the EdgeTPU.
- Go to Table of contents -
4-2-10. Quantization of MediaPipe's model BlazeFace(.tflite)
Here I quantize a model called BlazeFace, published by Google as part of the MediaPipe project. This is the most challenging pattern in the quantization workflow so far. To begin with, no checkpoint, Freeze_Graph, or saved_model is provided, only a .tflite. The conversion procedure is as follows (a quick check of the original .tflite interface is shown right after the list).
1. Build flatc
2. Download schema.fbs
3. Download the BlazeFace model face_detection_front.tflite
4. Parse the .tflite into a .json using flatc
5. Generate a network based on the model structure read from the .json in 4, while extracting the weights from the .tflite
6. Convert to a saved_model using the weights and network extracted in 5
7. Perform various types of quantization
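Before starting, it is worth confirming the interface of the downloaded .tflite with the Tensorflow Lite Python interpreter. According to the saved_model_cli output shown later in this section, the expected result is a 1x128x128x3 float32 input and the classificators / regressors outputs:
### tensorflow-gpu==1.15.2 (tf.lite.Interpreter is also available in v2.x)
import tensorflow as tf
# Quick sanity check of the input/output layout of the original .tflite.
interpreter = tf.lite.Interpreter(model_path='face_detection_front.tflite')
interpreter.allocate_tensors()
print(interpreter.get_input_details())   # expect shape [1, 128, 128, 3], dtype float32
print(interpreter.get_output_details())  # expect the 'classificators' and 'regressors' tensors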
4-2-10-1. Build flatc and download schema.fbs
$ cd ~
$ git clone https://github.com/google/flatbuffers.git
$ cd flatbuffers
$ cmake -G "Unix Makefiles"
-- The C compiler identification is GNU 7.5.0
-- The CXX compiler identification is GNU 7.5.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for strtof_l
-- Looking for strtof_l - found
-- Looking for strtoull_l
-- Looking for strtoull_l - found
-- `tests/monster_test.fbs`: add generation of C++ code with '--no-includes;--gen-compare'
-- `tests/monster_test.fbs`: add generation of binary (.bfbs) schema
-- `tests/namespace_test/namespace_test1.fbs`: add generation of C++ code with '--no-includes;--gen-compare'
-- `tests/namespace_test/namespace_test2.fbs`: add generation of C++ code with '--no-includes;--gen-compare'
-- `tests/union_vector/union_vector.fbs`: add generation of C++ code with '--no-includes;--gen-compare'
-- `tests/native_type_test.fbs`: add generation of C++ code with ''
-- `tests/arrays_test.fbs`: add generation of C++ code with '--scoped-enums;--gen-compare'
-- `tests/arrays_test.fbs`: add generation of binary (.bfbs) schema
-- `tests/monster_test.fbs`: add generation of C++ embedded binary schema code with '--no-includes;--gen-compare'
-- `tests/monster_extra.fbs`: add generation of C++ code with '--no-includes;--gen-compare'
-- `samples/monster.fbs`: add generation of C++ code with '--no-includes;--gen-compare'
-- `samples/monster.fbs`: add generation of binary (.bfbs) schema
Proceeding with version: 1.12.0.42
-- Configuring done
-- Generating done
-- Build files have been written to: /home/b920405/git/flatbuffers
$ make
Scanning dependencies of target flatc
[ 1%] Building CXX object CMakeFiles/flatc.dir/src/idl_parser.cpp.o
[ 2%] Building CXX object CMakeFiles/flatc.dir/src/idl_gen_text.cpp.o
[ 3%] Building CXX object CMakeFiles/flatc.dir/src/reflection.cpp.o
[ 4%] Building CXX object CMakeFiles/flatc.dir/src/util.cpp.o
[ 5%] Building CXX object CMakeFiles/flatc.dir/src/idl_gen_cpp.cpp.o
[ 7%] Building CXX object CMakeFiles/flatc.dir/src/idl_gen_csharp.cpp.o
[ 8%] Building CXX object CMakeFiles/flatc.dir/src/idl_gen_dart.cpp.o
[ 9%] Building CXX object CMakeFiles/flatc.dir/src/idl_gen_kotlin.cpp.o
[ 10%] Building CXX object CMakeFiles/flatc.dir/src/idl_gen_go.cpp.o
[ 11%] Building CXX object CMakeFiles/flatc.dir/src/idl_gen_java.cpp.o
[ 12%] Building CXX object CMakeFiles/flatc.dir/src/idl_gen_js_ts.cpp.o
[ 14%] Building CXX object CMakeFiles/flatc.dir/src/idl_gen_php.cpp.o
[ 15%] Building CXX object CMakeFiles/flatc.dir/src/idl_gen_python.cpp.o
[ 16%] Building CXX object CMakeFiles/flatc.dir/src/idl_gen_lobster.cpp.o
[ 17%] Building CXX object CMakeFiles/flatc.dir/src/idl_gen_lua.cpp.o
[ 18%] Building CXX object CMakeFiles/flatc.dir/src/idl_gen_rust.cpp.o
[ 20%] Building CXX object CMakeFiles/flatc.dir/src/idl_gen_fbs.cpp.o
[ 21%] Building CXX object CMakeFiles/flatc.dir/src/idl_gen_grpc.cpp.o
[ 22%] Building CXX object CMakeFiles/flatc.dir/src/idl_gen_json_schema.cpp.o
[ 23%] Building CXX object CMakeFiles/flatc.dir/src/idl_gen_swift.cpp.o
[ 24%] Building CXX object CMakeFiles/flatc.dir/src/flatc.cpp.o
[ 25%] Building CXX object CMakeFiles/flatc.dir/src/flatc_main.cpp.o
[ 27%] Building CXX object CMakeFiles/flatc.dir/src/code_generators.cpp.o
[ 28%] Building CXX object CMakeFiles/flatc.dir/grpc/src/compiler/cpp_generator.cc.o
[ 29%] Building CXX object CMakeFiles/flatc.dir/grpc/src/compiler/go_generator.cc.o
[ 30%] Building CXX object CMakeFiles/flatc.dir/grpc/src/compiler/java_generator.cc.o
[ 31%] Building CXX object CMakeFiles/flatc.dir/grpc/src/compiler/python_generator.cc.o
[ 32%] Building CXX object CMakeFiles/flatc.dir/grpc/src/compiler/swift_generator.cc.o
[ 34%] Linking CXX executable flatc
[ 34%] Built target flatc
Scanning dependencies of target flathash
[ 35%] Building CXX object CMakeFiles/flathash.dir/src/flathash.cpp.o
[ 36%] Linking CXX executable flathash
[ 36%] Built target flathash
Scanning dependencies of target flatbuffers
[ 37%] Building CXX object CMakeFiles/flatbuffers.dir/src/idl_parser.cpp.o
[ 38%] Building CXX object CMakeFiles/flatbuffers.dir/src/idl_gen_text.cpp.o
[ 40%] Building CXX object CMakeFiles/flatbuffers.dir/src/reflection.cpp.o
[ 41%] Building CXX object CMakeFiles/flatbuffers.dir/src/util.cpp.o
[ 42%] Linking CXX static library libflatbuffers.a
[ 42%] Built target flatbuffers
Scanning dependencies of target generated_code
[ 43%] Run generation: 'samples/monster.bfbs'
[ 44%] Run generation: 'tests/monster_test_generated.h'
[ 45%] Run generation: 'tests/monster_test.bfbs'
[ 47%] Run generation: 'tests/namespace_test/namespace_test1_generated.h'
[ 48%] Run generation: 'tests/namespace_test/namespace_test2_generated.h'
[ 49%] Run generation: 'tests/union_vector/union_vector_generated.h'
[ 50%] Run generation: 'tests/native_type_test_generated.h'
[ 51%] Run generation: 'tests/arrays_test_generated.h'
[ 52%] Run generation: 'tests/arrays_test.bfbs'
[ 54%] Run generation: 'tests/monster_test_bfbs_generated.h'
[ 55%] Run generation: 'tests/monster_extra_generated.h'
[ 56%] Run generation: 'samples/monster_generated.h'
[ 57%] All generated files were updated.
[ 57%] Built target generated_code
Scanning dependencies of target flatsamplebfbs
[ 58%] Building CXX object CMakeFiles/flatsamplebfbs.dir/src/idl_parser.cpp.o
[ 60%] Building CXX object CMakeFiles/flatsamplebfbs.dir/src/idl_gen_text.cpp.o
[ 61%] Building CXX object CMakeFiles/flatsamplebfbs.dir/src/reflection.cpp.o
[ 62%] Building CXX object CMakeFiles/flatsamplebfbs.dir/src/util.cpp.o
[ 63%] Building CXX object CMakeFiles/flatsamplebfbs.dir/samples/sample_bfbs.cpp.o
[ 64%] Linking CXX executable flatsamplebfbs
[ 65%] Built target flatsamplebfbs
Scanning dependencies of target flatsamplebinary
[ 67%] Building CXX object CMakeFiles/flatsamplebinary.dir/samples/sample_binary.cpp.o
[ 68%] Linking CXX executable flatsamplebinary
[ 69%] Built target flatsamplebinary
Scanning dependencies of target flattests
[ 70%] Building CXX object CMakeFiles/flattests.dir/src/idl_parser.cpp.o
[ 71%] Building CXX object CMakeFiles/flattests.dir/src/idl_gen_text.cpp.o
[ 72%] Building CXX object CMakeFiles/flattests.dir/src/reflection.cpp.o
[ 74%] Building CXX object CMakeFiles/flattests.dir/src/util.cpp.o
[ 75%] Building CXX object CMakeFiles/flattests.dir/src/idl_gen_fbs.cpp.o
[ 76%] Building CXX object CMakeFiles/flattests.dir/tests/test.cpp.o
[ 77%] Building CXX object CMakeFiles/flattests.dir/tests/test_assert.cpp.o
[ 78%] Building CXX object CMakeFiles/flattests.dir/tests/test_builder.cpp.o
[ 80%] Building CXX object CMakeFiles/flattests.dir/tests/native_type_test_impl.cpp.o
[ 81%] Building CXX object CMakeFiles/flattests.dir/src/code_generators.cpp.o
[ 82%] Linking CXX executable flattests
[ 91%] Built target flattests
Scanning dependencies of target flatsampletext
[ 92%] Building CXX object CMakeFiles/flatsampletext.dir/src/idl_parser.cpp.o
[ 94%] Building CXX object CMakeFiles/flatsampletext.dir/src/idl_gen_text.cpp.o
[ 95%] Building CXX object CMakeFiles/flatsampletext.dir/src/reflection.cpp.o
[ 96%] Building CXX object CMakeFiles/flatsampletext.dir/src/util.cpp.o
[ 97%] Building CXX object CMakeFiles/flatsampletext.dir/samples/sample_text.cpp.o
[ 98%] Linking CXX executable flatsampletext
[100%] Built target flatsampletext
$ cp flatc ~ && cd ~
$ wget https://raw.githubusercontent.com/tensorflow/tensorflow/master/tensorflow/lite/schema/schema.fbs
4-2-10-2. Download MediaPipe's BlazeFace model (.tflite)
$ wget https://github.com/google/mediapipe/raw/master/mediapipe/models/face_detection_front.tflite
4-2-10-3. Converting BlazeFace(.tflite) to saved_model(.pb)
### tensorflow-gpu==1.15.2
#!/usr/bin/env python
# coding: utf-8
import os
import numpy as np
import json
import tensorflow as tf
import shutil
from pathlib import Path
home = str(Path.home())
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
schema = "schema.fbs"
binary = home + "/flatc"
model_path = "face_detection_front.tflite"
output_pb_path = "face_detection_front.pb"
output_savedmodel_path = "saved_model"
model_json_path = "face_detection_front.json"
num_tensors = 176
output_node_names = ['classificators', 'regressors']
def gen_model_json():
if not os.path.exists(model_json_path):
cmd = (binary + " -t --strict-json --defaults-json -o . {schema} -- {input}".format(input=model_path, schema=schema))
print("output json command =", cmd)
os.system(cmd)
def parse_json():
j = json.load(open(model_json_path))
op_types = [v['builtin_code'] for v in j['operator_codes']]
# print('op types:', op_types)
ops = j['subgraphs'][0]['operators']
# print('num of ops:', len(ops))
return ops, op_types
def make_graph(ops, op_types, interpreter):
tensors = {}
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
# print(input_details)
for input_detail in input_details:
tensors[input_detail['index']] = tf.compat.v1.placeholder(
dtype=input_detail['dtype'],
shape=input_detail['shape'],
name=input_detail['name'])
for index, op in enumerate(ops):
print('op: ', op)
op_type = op_types[op['opcode_index']]
if op_type == 'CONV_2D':
input_tensor = tensors[op['inputs'][0]]
weights_detail = interpreter._get_tensor_details(op['inputs'][1])
bias_detail = interpreter._get_tensor_details(op['inputs'][2])
output_detail = interpreter._get_tensor_details(op['outputs'][0])
# print('weights_detail: ', weights_detail)
# print('bias_detail: ', bias_detail)
# print('output_detail: ', output_detail)
weights_array = interpreter.get_tensor(weights_detail['index'])
weights_array = np.transpose(weights_array, (1, 2, 3, 0))
bias_array = interpreter.get_tensor(bias_detail['index'])
weights = tf.Variable(weights_array, name=weights_detail['name'])
bias = tf.Variable(bias_array, name=bias_detail['name'])
options = op['builtin_options']
output_tensor = tf.nn.conv2d(
input_tensor,
weights,
strides=[1, options['stride_h'], options['stride_w'], 1],
padding=options['padding'],
dilations=[
1, options['dilation_h_factor'],
options['dilation_w_factor'], 1
],
name=output_detail['name'] + '/conv2d')
output_tensor = tf.add(
output_tensor, bias, name=output_detail['name'])
tensors[output_detail['index']] = output_tensor
elif op_type == 'DEPTHWISE_CONV_2D':
input_tensor = tensors[op['inputs'][0]]
weights_detail = interpreter._get_tensor_details(op['inputs'][1])
bias_detail = interpreter._get_tensor_details(op['inputs'][2])
output_detail = interpreter._get_tensor_details(op['outputs'][0])
# print('weights_detail: ', weights_detail)
# print('bias_detail: ', bias_detail)
# print('output_detail: ', output_detail)
weights_array = interpreter.get_tensor(weights_detail['index'])
weights_array = np.transpose(weights_array, (1, 2, 3, 0))
bias_array = interpreter.get_tensor(bias_detail['index'])
weights = tf.Variable(weights_array, name=weights_detail['name'])
bias = tf.Variable(bias_array, name=bias_detail['name'])
            options = op['builtin_options']
            output_tensor = tf.nn.depthwise_conv2d(
                input_tensor,
                weights,
                strides=[1, options['stride_h'], options['stride_w'], 1],
                padding=options['padding'],
                # dilations=[
                #     1, options['dilation_h_factor'],
                #     options['dilation_w_factor'], 1
                # ],
                name=output_detail['name'] + '/depthwise_conv2d')
            output_tensor = tf.add(
                output_tensor, bias, name=output_detail['name'])
            tensors[output_detail['index']] = output_tensor
        elif op_type == 'MAX_POOL_2D':
            input_tensor = tensors[op['inputs'][0]]
            output_detail = interpreter._get_tensor_details(op['outputs'][0])
            options = op['builtin_options']
            output_tensor = tf.nn.max_pool(
                input_tensor,
                ksize=[
                    1, options['filter_height'], options['filter_width'], 1
                ],
                strides=[1, options['stride_h'], options['stride_w'], 1],
                padding=options['padding'],
                name=output_detail['name'])
            tensors[output_detail['index']] = output_tensor
        elif op_type == 'PAD':
            input_tensor = tensors[op['inputs'][0]]
            output_detail = interpreter._get_tensor_details(op['outputs'][0])
            paddings_detail = interpreter._get_tensor_details(op['inputs'][1])
            # print('output_detail:', output_detail)
            # print('paddings_detail:', paddings_detail)
            paddings_array = interpreter.get_tensor(paddings_detail['index'])
            paddings = tf.Variable(
                paddings_array, name=paddings_detail['name'])
            output_tensor = tf.pad(
                input_tensor, paddings, name=output_detail['name'])
            tensors[output_detail['index']] = output_tensor
        elif op_type == 'RELU':
            output_detail = interpreter._get_tensor_details(op['outputs'][0])
            input_tensor = tensors[op['inputs'][0]]
            output_tensor = tf.nn.relu(
                input_tensor, name=output_detail['name'])
            tensors[output_detail['index']] = output_tensor
        elif op_type == 'RESHAPE':
            input_tensor = tensors[op['inputs'][0]]
            output_detail = interpreter._get_tensor_details(op['outputs'][0])
            options = op['builtin_options']
            output_tensor = tf.reshape(
                input_tensor, options['new_shape'], name=output_detail['name'])
            tensors[output_detail['index']] = output_tensor
        elif op_type == 'ADD':
            output_detail = interpreter._get_tensor_details(op['outputs'][0])
            input_tensor_0 = tensors[op['inputs'][0]]
            input_tensor_1 = tensors[op['inputs'][1]]
            output_tensor = tf.add(input_tensor_0, input_tensor_1, name=output_detail['name'])
            tensors[output_detail['index']] = output_tensor
        elif op_type == 'CONCATENATION':
            output_detail = interpreter._get_tensor_details(op['outputs'][0])
            input_tensor_0 = tensors[op['inputs'][0]]
            input_tensor_1 = tensors[op['inputs'][1]]
            options = op['builtin_options']
            output_tensor = tf.concat(
                [input_tensor_0, input_tensor_1],
                options['axis'],
                name=output_detail['name'])
            tensors[output_detail['index']] = output_tensor
        else:
            raise ValueError(op_type)


def main():
    tf.compat.v1.disable_eager_execution()

    gen_model_json()
    ops, op_types = parse_json()

    interpreter = tf.lite.Interpreter(model_path)
    interpreter.allocate_tensors()
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()
    print(input_details)
    print(output_details)
    for i in range(num_tensors):
        detail = interpreter._get_tensor_details(i)
        print(detail)

    make_graph(ops, op_types, interpreter)

    config = tf.compat.v1.ConfigProto()
    config.gpu_options.allow_growth = True
    graph = tf.compat.v1.get_default_graph()
    # writer = tf.summary.FileWriter(os.path.splitext(output_pb_path)[0])
    # writer.add_graph(graph)
    # writer.flush()
    # writer.close()
    with tf.compat.v1.Session(config=config, graph=graph) as sess:
        sess.run(tf.compat.v1.global_variables_initializer())
        graph_def = tf.compat.v1.graph_util.convert_variables_to_constants(
            sess=sess,
            input_graph_def=graph.as_graph_def(),
            output_node_names=output_node_names)
        with tf.io.gfile.GFile(output_pb_path, 'wb') as f:
            f.write(graph_def.SerializeToString())
        shutil.rmtree('saved_model', ignore_errors=True)
        tf.compat.v1.saved_model.simple_save(
            sess,
            output_savedmodel_path,
            inputs={'input': graph.get_tensor_by_name('input:0')},
            outputs={
                'classificators': graph.get_tensor_by_name('classificators:0'),
                'regressors': graph.get_tensor_by_name('regressors:0')
            })


if __name__ == '__main__':
    main()
"""
$ saved_model_cli show --dir saved_model --all
MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:
signature_def['serving_default']:
The given SavedModel SignatureDef contains the following input(s):
inputs['input'] tensor_info:
dtype: DT_FLOAT
shape: (1, 128, 128, 3)
name: input:0
The given SavedModel SignatureDef contains the following output(s):
outputs['classificators'] tensor_info:
dtype: DT_FLOAT
shape: (1, -1, 1)
name: classificators:0
outputs['regressors'] tensor_info:
dtype: DT_FLOAT
shape: (1, -1, 16)
name: regressors:0
Method name is: tensorflow/serving/predict
"""
$ python3 blazeface_tflite_to_pb.py
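If you want to make sure the conversion above actually produced a usable Freeze_Graph before moving on, a quick check is to parse the .pb and look up the input and output tensors by name. The file name below is only an assumption for illustration; use whatever path you set in output_pb_path inside blazeface_tflite_to_pb.py.
### Sanity check of the generated Freeze_Graph (file name is an assumption - match output_pb_path)
import tensorflow as tf

graph_def = tf.compat.v1.GraphDef()
with tf.io.gfile.GFile('face_detection_front_128.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
    tf.compat.v1.import_graph_def(graph_def, name='')

# The INPUT/OUTPUT names written by simple_save should also exist in the frozen graph
for name in ['input:0', 'classificators:0', 'regressors:0']:
    print(graph.get_tensor_by_name(name))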
4-2-10-4. Weight Quantization from saved_model (weight-only quantization)
### tensorflow==2.2.0
import tensorflow as tf
import numpy as np
# Weight Quantization - Input/Output=float32
converter = tf.lite.TFLiteConverter.from_saved_model('saved_model')
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
tflite_quant_model = converter.convert()
with open('face_detection_front_128_weight_quant.tflite', 'wb') as w:
    w.write(tflite_quant_model)
print("Weight Quantization complete! - face_detection_front_128_weight_quant.tflite")
$ python3 weight_quantization.py
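As a quick check that Weight Quantization finished correctly, the snippet below (a minimal sketch) prints the file size and confirms that the INPUT/OUTPUT tensors are still float32, since only the weights are quantized.
### Check the Weight Quantization result
import os
import tensorflow as tf

model_file = 'face_detection_front_128_weight_quant.tflite'
print('File size (MB):', os.path.getsize(model_file) / (1024 * 1024))

interpreter = tf.lite.Interpreter(model_path=model_file)
interpreter.allocate_tensors()
# Weight-only quantization keeps float32 INPUT/OUTPUT
print(interpreter.get_input_details()[0]['dtype'])
print(interpreter.get_output_details()[0]['dtype'])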
4-2-10-5. Integer Quantization from saved_model (8-bit integer quantization)
### tensorflow==2.2.0
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
from PIL import Image
import os
import glob
def representative_dataset_gen():
    for data in raw_test_data.take(100):
        image = data['image'].numpy()
        image = tf.image.resize(image, (128, 128))
        image = image[np.newaxis,:,:,:]
        image = image - 127.5
        image = image * 0.007843
        yield [image]
raw_test_data, info = tfds.load(name="the300w_lp", with_info=True, split="train", data_dir="~/TFDS", download=True)
# Integer Quantization - Input/Output=float32
converter = tf.lite.TFLiteConverter.from_saved_model('saved_model')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset_gen
tflite_quant_model = converter.convert()
with open('face_detection_front_128_integer_quant.tflite', 'wb') as w:
    w.write(tflite_quant_model)
print("Integer Quantization complete! - face_detection_front_128_integer_quant.tflite")
$ python3 integer_quantization.py
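The Integer Quantization model still has float32 INPUT/OUTPUT, so it can be fed normalized float images directly. The following is a minimal sketch (with a random dummy input instead of a real image) to confirm that the generated .tflite loads and runs with the tf.lite.Interpreter.
### Single dummy inference with the Integer Quantization model
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(
    model_path='face_detection_front_128_integer_quant.tflite')
interpreter.allocate_tensors()
input_detail = interpreter.get_input_details()[0]

# INPUT/OUTPUT remain float32; feed data normalized the same way as the calibration data
dummy_input = np.random.uniform(-1.0, 1.0, size=tuple(input_detail['shape'])).astype(np.float32)
interpreter.set_tensor(input_detail['index'], dummy_input)
interpreter.invoke()
for out in interpreter.get_output_details():
    print(out['name'], interpreter.get_tensor(out['index']).shape)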
4-2-10-6. Full Integer Quantization from saved_model (All 8-bit integer quantization)
### tensorflow==2.2.0
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
from PIL import Image
import os
import glob
def representative_dataset_gen():
    for data in raw_test_data.take(100):
        image = data['image'].numpy()
        image = tf.image.resize(image, (128, 128))
        image = image[np.newaxis,:,:,:]
        image = image - 127.5
        image = image * 0.007843
        yield [image]
raw_test_data, info = tfds.load(name="the300w_lp", with_info=True, split="train", data_dir="~/TFDS", download=False)
# Integer Quantization - Input/Output=uint8
converter = tf.lite.TFLiteConverter.from_saved_model('saved_model')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
tflite_quant_model = converter.convert()
with open('face_detection_front_128_full_integer_quant.tflite', 'wb') as w:
    w.write(tflite_quant_model)
print("Full Integer Quantization complete! - face_detection_front_128_full_integer_quant.tflite")
$ python3 full_integer_quantization.py
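Because inference_input_type and inference_output_type are set to tf.uint8 above, the Full Integer Quantization model expects uint8 tensors at the interface, and each output carries its own (scale, zero_point) for dequantization. The sketch below checks this with a random dummy input.
### Check the uint8 interface of the Full Integer Quantization model
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(
    model_path='face_detection_front_128_full_integer_quant.tflite')
interpreter.allocate_tensors()
input_detail = interpreter.get_input_details()[0]
print(input_detail['dtype'])         # expected: uint8
print(input_detail['quantization'])  # (scale, zero_point) of the input

dummy_input = np.random.randint(0, 256, size=tuple(input_detail['shape']), dtype=np.uint8)
interpreter.set_tensor(input_detail['index'], dummy_input)
interpreter.invoke()
for out in interpreter.get_output_details():
    scale, zero_point = out['quantization']
    dequantized = (interpreter.get_tensor(out['index']).astype(np.float32) - zero_point) * scale
    print(out['name'], dequantized.shape)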
4-2-10-7. Float16 Quantization from saved_model (Float16 quantization)
### tensorflow==2.2.0
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
from PIL import Image
import os
import glob
# Float16 Quantization - Input/Output=float32
converter = tf.lite.TFLiteConverter.from_saved_model('saved_model')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_quant_model = converter.convert()
with open('face_detection_front_128_float16_quant.tflite', 'wb') as w:
    w.write(tflite_quant_model)
print("Float16 Quantization complete! - face_detection_front_128_float16_quant.tflite")
$ python3 float16_quantization.py
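At this point one .tflite file has been generated per quantization method, so it is easy to compare how much each method reduces the model size. A minimal sketch:
### Compare the sizes of the generated .tflite files
import os

for f in ['face_detection_front_128_weight_quant.tflite',
          'face_detection_front_128_integer_quant.tflite',
          'face_detection_front_128_full_integer_quant.tflite',
          'face_detection_front_128_float16_quant.tflite']:
    if os.path.exists(f):
        print(f, round(os.path.getsize(f) / (1024 * 1024), 2), 'MB')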
4-2-10-8. Full Integer Quantization to EdgeTPU convert
$ edgetpu_compiler -s face_detection_front_128_full_integer_quant.tflite
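edgetpu_compiler should write a *_edgetpu.tflite file next to the input model. To actually run it you need a Coral Edge TPU with the libedgetpu runtime and the tflite_runtime package installed; the following minimal sketch (the file and library names follow the standard Coral setup, so treat them as assumptions for your environment) loads the compiled model with the EdgeTPU delegate and runs one dummy inference.
### Run the EdgeTPU-compiled model with the EdgeTPU delegate
import numpy as np
import tflite_runtime.interpreter as tflite

interpreter = tflite.Interpreter(
    model_path='face_detection_front_128_full_integer_quant_edgetpu.tflite',
    experimental_delegates=[tflite.load_delegate('libedgetpu.so.1')])
interpreter.allocate_tensors()
input_detail = interpreter.get_input_details()[0]

dummy_input = np.random.randint(0, 256, size=tuple(input_detail['shape']), dtype=np.uint8)
interpreter.set_tensor(input_detail['index'], dummy_input)
interpreter.invoke()
for out in interpreter.get_output_details():
    print(out['name'], interpreter.get_tensor(out['index']).shape)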
4-3. Performance benchmarks for the quantization model (.tflite)
When benchmarking the performance of a generated .tflite file, it is very tedious to write a dedicated validation program for every model's characteristics. So I build and use the Benchmark tool published in the official Tensorflow repository. It lets you adjust the number of threads used for inference and enable XNNPACK or the GPU Delegate via launch options, which makes it quite easy to benchmark against various environments. Here, I will explain how to build and use it.
https://github.com/PINTO0309/PINTO_model_zoo#3-tflite-model-benchmark
4-3-1. Building the TFLite Model Benchmark Tool
Below are the build steps for the three environments I can readily prepare at my fingertips: Ubuntu 18.04 x86_64 and RaspberryPi3/4 Raspbian/Debian Buster (armhf and aarch64). For any other environment, please adapt the procedure yourself.
$ sudo apt-get install python-future
## Bazel for Ubuntu18.04 x86_64 install
$ wget https://github.com/bazelbuild/bazel/releases/download/2.0.0/bazel-2.0.0-installer-linux-x86_64.sh
$ sudo chmod +x bazel-2.0.0-installer-linux-x86_64.sh
$ ./bazel-2.0.0-installer-linux-x86_64.sh
$ sudo apt-get install -y openjdk-8-jdk
## Bazel for RaspberryPi3/4 Raspbian/Debian Buster armhf install
$ wget https://github.com/PINTO0309/Bazel_bin/raw/master/2.0.0/Raspbian_Debian_Buster_armhf/openjdk-8-jdk/install.sh
$ ./install.sh
$ curl -sc /tmp/cookie \
"https://drive.google.com/uc?export=download&id=1LQUSal55R6fmawZS9zZuk6-5ZFOdUqRK" > /dev/null
$ CODE="$(awk '/_warning_/ {print $NF}' /tmp/cookie)"
$ curl -Lb /tmp/cookie \
"https://drive.google.com/uc?export=download&confirm=${CODE}&id=1LQUSal55R6fmawZS9zZuk6-5ZFOdUqRK" \
-o adoptopenjdk-8-hotspot_8u222-b10-2_armhf.deb
$ sudo apt-get install -y ./adoptopenjdk-8-hotspot_8u222-b10-2_armhf.deb
## Bazel for RaspberryPi3/4 Raspbian/Debian Buster aarch64 install
$ wget https://github.com/PINTO0309/Bazel_bin/raw/master/2.0.0/Raspbian_Debian_Buster_aarch64/openjdk-8-jdk/install.sh
$ ./install.sh
$ curl -sc /tmp/cookie \
"https://drive.google.com/uc?export=download&id=1VwLxzT3EOTbhSzwvRF2H4ChTQyTQBt3x" > /dev/null
$ CODE="$(awk '/_warning_/ {print $NF}' /tmp/cookie)"
$ curl -Lb /tmp/cookie \
"https://drive.google.com/uc?export=download&confirm=${CODE}&id=1VwLxzT3EOTbhSzwvRF2H4ChTQyTQBt3x" \
-o adoptopenjdk-8-hotspot_8u222-b10-2_arm64.deb
$ sudo apt-get install -y ./adoptopenjdk-8-hotspot_8u222-b10-2_arm64.deb
## Clone Tensorflow v2.1.0+
$ git clone --depth 1 https://github.com/tensorflow/tensorflow.git
$ cd tensorflow
## Build and run TFLite Model Benchmark Tool
## Flex Delegate disabled version, it only takes a very short time to build.
$ bazel build \
-c opt \
tensorflow/lite/tools/benchmark:benchmark_model
## Flex Delegate enabled version; the build takes a long time.
$ bazel build \
-c opt \
--config=noaws \
--config=nohdfs \
--config=nonccl \
tensorflow/lite/tools/benchmark:benchmark_model_plus_flex
4-3-2. Options for the TFLite Model Benchmark Tool
$ bazel run -c opt tensorflow/lite/tools/benchmark:benchmark_model -- --help
Flags:
--input_layer_value_files= string optional A map-like string representing value file. Each item is separated by ',', and the item value consists of input layer name and value file path separated by ':', e.g. input1:file_path1,input2:file_path2. If the input_name appears both in input_layer_value_range and input_layer_value_files, input_layer_value_range of the input_name will be ignored.
--use_xnnpack=false bool optional use XNNPack
--disable_nnapi_cpu=false bool optional Disable the NNAPI CPU device
--nnapi_accelerator_name= string optional the name of the nnapi accelerator to use (requires Android Q+)
--nnapi_execution_preference= string optional execution preference for nnapi delegate. Should be one of the following: fast_single_answer, sustained_speed, low_power, undefined
--use_nnapi=false bool optional use nnapi delegate api
--use_gpu=false bool optional use gpu
--max_delegated_partitions=0 int32 optional Max partitions to be delegated.
--profiling_output_csv_file= string optional File path to export profile data as CSV, if not set prints to stdout.
--max_profiling_buffer_entries=1024 int32 optional max profiling buffer entries
--enable_op_profiling=false bool optional enable op profiling
--require_full_delegation=false bool optional require delegate to run the entire graph
--allow_fp16=false bool optional allow fp16
--use_legacy_nnapi=false bool optional use legacy nnapi api
--num_runs=50 int32 optional expected number of runs, see also min_secs, max_secs
--input_layer_value_range= string optional A map-like string representing value range for *integer* input layers. Each item is separated by ':', and the item value consists of input layer name and integer-only range values (both low and high are inclusive) separated by ',', e.g. input1,1,2:input2,0,254
--input_layer_shape= string optional input layer shape
--input_layer= string optional input layer names
--graph= string optional graph file name
--warmup_min_secs=0.5 float optional minimum number of seconds to rerun for, potentially making the actual number of warm-up runs to be greater than warmup_runs
--warmup_runs=1 int32 optional minimum number of runs performed on initialization, to allow performance characteristics to settle, see also warmup_min_secs
--output_prefix= string optional benchmark output prefix
--benchmark_name= string optional benchmark name
--num_threads=1 int32 optional number of threads
--run_delay=-1 float optional delay between runs in seconds
--max_secs=150 float optional maximum number of seconds to rerun for, potentially making the actual number of runs to be less than num_runs. Note if --max-secs is exceeded in the middle of a run, the benchmark will continue to the end of the run but will not start the next run.
--min_secs=1 float optional minimum number of seconds to rerun for, potentially making the actual number of runs to be greater than num_runs
4-3-3. Benchmark example of a model that includes only standard Tensorflow Lite operations (No XNNPACK, 4 Threads)
$ bazel run -c opt tensorflow/lite/tools/benchmark:benchmark_model -- \
--graph=${HOME}/work/tensorflow/head_pose_estimator_integer_quant.tflite \
--num_threads=4 \
--warmup_runs=1 \
--enable_op_profiling=true
4-3-4. Benchmark example of a model that includes only standard Tensorflow Lite operations (XNNPACK available, 4 Threads)
$ bazel run -c opt tensorflow/lite/tools/benchmark:benchmark_model -- \
--graph=${HOME}/work/tensorflow/head_pose_estimator_integer_quant.tflite \
--num_threads=4 \
--warmup_runs=1 \
--use_xnnpack=true \
--enable_op_profiling=true
4-3-5. Benchmark examples of models with non-standard Tensorflow Lite operations (Flex enabled, no XNNPACK, 4 Threads)
$ bazel run \
-c opt \
--config=noaws \
--config=nohdfs \
--config=nonccl \
tensorflow/lite/tools/benchmark:benchmark_model_plus_flex -- \
--graph=${HOME}/git/tf-monodepth2/monodepth2_flexdelegate_weight_quant.tflite \
--num_threads=4 \
--warmup_runs=1 \
--enable_op_profiling=true
4-3-6. Benchmark examples of models with non-standard Tensorflow Lite operations (Flex enabled, with XNNPACK, 4 Threads)
$ bazel run \
-c opt \
--config=noaws \
--config=nohdfs \
--config=nonccl \
tensorflow/lite/tools/benchmark:benchmark_model_plus_flex -- \
--graph=${HOME}/git/tf-monodepth2/monodepth2_flexdelegate_weight_quant.tflite \
--num_threads=4 \
--warmup_runs=1 \
--use_xnnpack=true \
--enable_op_profiling=true
4-3-7. Execution log sample of Benchmark_Tool
STARTING!
Min num runs: [50]
Min runs duration (seconds): [1]
Max runs duration (seconds): [150]
Inter-run delay (seconds): [-1]
Num threads: [4]
Benchmark name: []
Output prefix: []
Min warmup runs: [1]
Min warmup runs duration (seconds): [0.5]
Graph: [/home/b920405/work/tensorflow/head_pose_estimator_integer_quant.tflite]
Input layers: []
Input shapes: []
Input value ranges: []
Input layer values files: []
Allow fp16 : [0]
Require full delegation : [0]
Enable op profiling: [1]
Max profiling buffer entries: [1024]
CSV File to export profiling data to: []
Max number of delegated partitions : [0]
Use gpu : [0]
Use xnnpack : [0]
Loaded model /home/b920405/work/tensorflow/head_pose_estimator_integer_quant.tflite
The input model file size (MB): 7.37157
Initialized session in 0.39ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=3 first=182671 curr=171990 min=171990 max=182671 avg=176216 std=4636
Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=174371 curr=183952 min=173778 max=203173 avg=181234 std=6641
Average inference timings in us: Warmup: 176216, Init: 390, Inference: 181234
Profiling Info for Benchmark Initialization:
============================== Run Order ==============================
[node type] [start] [first] [avg ms] [%] [cdf%] [mem KB] [times called] [Name]
AllocateTensors 0.000 0.058 0.058 100.000% 100.000% 0.000 1 AllocateTensors/0
============================== Top by Computation Time ==============================
[node type] [start] [first] [avg ms] [%] [cdf%] [mem KB] [times called] [Name]
AllocateTensors 0.000 0.058 0.058 100.000% 100.000% 0.000 1 AllocateTensors/0
Number of nodes executed: 1
============================== Summary by node type ==============================
[Node type] [count] [avg ms] [avg %] [cdf %] [mem KB] [times called]
AllocateTensors 1 0.058 100.000% 100.000% 0.000 1
Timings (microseconds): count=1 curr=58
Memory (bytes): count=0
1 nodes observed
Operator-wise Profiling Info for Regular Benchmark Runs:
============================== Run Order ==============================
[node type] [start] [first] [avg ms] [%] [cdf%] [mem KB] [times called] [Name]
QUANTIZE 0.000 0.164 0.166 0.092% 0.092% 0.000 1 [input_image_tensor_int8]:0
CONV_2D 0.166 9.293 9.710 5.358% 5.449% 0.000 1 [conv2d/Relu]:1
MAX_POOL_2D 9.876 0.523 0.547 0.302% 5.751% 0.000 1 [max_pooling2d/MaxPool]:2
CONV_2D 10.423 40.758 41.859 23.097% 28.848% 0.000 1 [conv2d_2/Relu]:3
CONV_2D 52.282 73.752 76.566 42.248% 71.095% 0.000 1 [conv2d_3/Relu]:4
MAX_POOL_2D 128.848 0.259 0.261 0.144% 71.240% 0.000 1 [max_pooling2d_2/MaxPool]:5
CONV_2D 129.109 15.460 16.203 8.940% 80.180% 0.000 1 [conv2d_4/Relu]:6
CONV_2D 145.312 13.194 13.908 7.674% 87.854% 0.000 1 [conv2d_5/Relu]:7
MAX_POOL_2D 159.220 0.043 0.046 0.026% 87.880% 0.000 1 [max_pooling2d_3/MaxPool]:8
CONV_2D 159.266 4.272 4.473 2.468% 90.348% 0.000 1 [conv2d_6/Relu]:9
CONV_2D 163.740 5.437 5.745 3.170% 93.518% 0.000 1 [conv2d_7/Relu]:10
MAX_POOL_2D 169.485 0.029 0.031 0.017% 93.535% 0.000 1 [max_pooling2d_4/MaxPool]:11
CONV_2D 169.516 4.356 4.558 2.515% 96.050% 0.000 1 [conv2d_8/Relu]:12
FULLY_CONNECTED 174.074 6.666 6.992 3.858% 99.908% 0.000 1 [dense/Relu]:13
FULLY_CONNECTED 181.066 0.160 0.167 0.092% 100.000% 0.000 1 [logits/BiasAdd_int8]:14
DEQUANTIZE 181.232 0.001 0.001 0.000% 100.000% 0.000 1 [logits/BiasAdd]:15
============================== Top by Computation Time ==============================
[node type] [start] [first] [avg ms] [%] [cdf%] [mem KB] [times called] [Name]
CONV_2D 52.282 73.752 76.566 42.248% 42.248% 0.000 1 [conv2d_3/Relu]:4
CONV_2D 10.423 40.758 41.859 23.097% 65.344% 0.000 1 [conv2d_2/Relu]:3
CONV_2D 129.109 15.460 16.203 8.940% 74.285% 0.000 1 [conv2d_4/Relu]:6
CONV_2D 145.312 13.194 13.908 7.674% 81.959% 0.000 1 [conv2d_5/Relu]:7
CONV_2D 0.166 9.293 9.710 5.358% 87.316% 0.000 1 [conv2d/Relu]:1
FULLY_CONNECTED 174.074 6.666 6.992 3.858% 91.174% 0.000 1 [dense/Relu]:13
CONV_2D 163.740 5.437 5.745 3.170% 94.344% 0.000 1 [conv2d_7/Relu]:10
CONV_2D 169.516 4.356 4.558 2.515% 96.859% 0.000 1 [conv2d_8/Relu]:12
CONV_2D 159.266 4.272 4.473 2.468% 99.327% 0.000 1 [conv2d_6/Relu]:9
MAX_POOL_2D 9.876 0.523 0.547 0.302% 99.629% 0.000 1 [max_pooling2d/MaxPool]:2
Number of nodes executed: 16
============================== Summary by node type ==============================
[Node type] [count] [avg ms] [avg %] [cdf %] [mem KB] [times called]
CONV_2D 8 173.016 95.471% 95.471% 0.000 8
FULLY_CONNECTED 2 7.157 3.949% 99.421% 0.000 2
MAX_POOL_2D 4 0.884 0.488% 99.908% 0.000 4
QUANTIZE 1 0.166 0.092% 100.000% 0.000 1
DEQUANTIZE 1 0.000 0.000% 100.000% 0.000 1
Timings (microseconds): count=50 first=174367 curr=183949 min=173776 max=203169 avg=181231 std=6640
Memory (bytes): count=0
16 nodes observed
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=0 overall=14.7656
5. Finally
I'm not good at English, so it was very hard for me to write this article in English. Please point out anything in the text that reads oddly.