nVIDIA GTC Japan 2016 report

Posted at 2016-10-06

Executive Summary

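The CEO attended GTC Japan for the first time in several years (GTC in Japan was presumably revived because of autonomous driving, given how many carmakers Japan has).
- Keynote speech by nVIDIA CEO Jen Hsun Huang
  - The AI revolution has arrived
    - GPUs underpin the AI revolution
  - AI is also an important key for robotics, and nVIDIA has formed a strategic partnership with FANUC.
    - Each robot carries an embedded GPU such as Jetson, with GPUs for learning placed centrally, so the robots keep getting smarter and raise productivity. This drives automation through hierarchical AI/robotics.
  - Feed data into a DNN and it outputs a model; this is the programming model of the new era
- Pascal is a GPU designed with the AI era in mind.
- TensorRT accelerates inference (not training) and supports deploying AI
- For automotive, an SoC called Xavier, with four times the performance of the current Parker, will be released
- nVIDIA provides DrivePX2, a platform for in-vehicle computers, along with its software platform
  - nVIDIA has its own self-driving car, BB8, and provides a model called PilotNet and a self-driving software platform built on it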

keynote by Jen Hsun Huang

background

AI is the keyword: Baidu's translation, autonomous driving, DeepMind. GPU deep learning made these possible!
A new era of computing: PC with internet -> mobile-cloud -> AI & IoT era (now)
The mobile-cloud evolution was driven by the iPhone and Amazon AWS. 2.5 billion mobile users.
The AI & IoT era is driven by deep learning on GPUs and hundreds of billions of devices.
Software writes software. Machines are learning reasons.
Because this is an AI revolution, the topic of today's GTC is the AI computing era with GPUs.

AI revolution

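In 2012, a CNN trained with GPU deep learning for ImageNet classification set off the GPU deep learning big bang.
After only several days of training, AlexNet beat existing code with decades of history behind it.
Deep learning brought the speech recognition error rate down to 6.3% (Microsoft Research).
The stage is now set for the AI revolution, and GPUs underpin it.
nVIDIA is a computing company. He also thanked the early adopters in Japan.
VR, AR, and AI are not science fiction.
GTC attendance has quadrupled in two years. Developers (people who use nVIDIA's SDKs) now number 400,000. AI developers have grown 25-fold.
The brain is a mass of many neurons. A GPU is likewise a mass of small processors.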

paradigm shift on programming model

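Until now, software developers wrote code and ran it. That will not go away, but from now on software writes software: you supply data, deep learning learns from it, and the resulting model is itself software. This is the new computing architecture.
AlexNet: 8 layers, 1.4 GFLOPS, 16% error rate. ResNet: 152 layers, 22.6 GFLOPS, 3.5% error rate. In other words, demand for more compute power is very high.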
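To make the AlexNet/ResNet comparison concrete, the short sketch below works out the ratios from the figures quoted in these notes. The dictionary layout and the variable names are assumptions made for illustration; only the numbers come from the keynote notes.

```python
# Figures as quoted in the notes (layers, GFLOPS, error rate in percent).
alexnet = {"layers": 8, "gflops": 1.4, "error_pct": 16.0}
resnet = {"layers": 152, "gflops": 22.6, "error_pct": 3.5}

# How much deeper and more compute-hungry ResNet is than AlexNet.
depth_ratio = resnet["layers"] / alexnet["layers"]
compute_ratio = resnet["gflops"] / alexnet["gflops"]

# How much the error rate shrank in exchange.
error_reduction = alexnet["error_pct"] / resnet["error_pct"]

print(f"depth x{depth_ratio:.1f}, compute x{compute_ratio:.1f}, "
      f"error reduced x{error_reduction:.1f}")
```

Roughly sixteen times the compute bought a bit more than a fourfold reduction in error, which is the basis for the claim that demand for compute keeps growing.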

pascal architecture & TensorRT to boost inference (deploying AI)

That is why Pascal was built: 65x in four years, designed for deep learning. GPU is not a GPU anymore.
Inference (the model) runs in the data center, and a GPU inference engine raises throughput.
Hence the Tesla P4 & P40: two GPU inference accelerators for the data center, 40x more energy efficient than a CPU.
For training: cuDNN and CUDA. For inference: TensorRT.

Amazing things with AI!

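Once a network learns an artist's style, it can produce works in that style in real time, even from the live GTC video feed, as was demonstrated on stage!
Is this using GPU inference? (the artist demo just shown)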

AI in Japan

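Japan has Preferred Networks, a wonderful company. Rakuten is using AI too. Mizuho uses deep learning for trading. ABEJA does in-store customer analytics.
Jetson TX1 has the exact same GPU, but for embedded use. Japan is the home of robots.
Jetson TX1 should also do well in IoT.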

Strategic partnership with FANUC

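FANUC will also use nVIDIA's platform, toward realizing the AI robotics manufacturing vision.
Machines will share what they have learned with each other, make predictions ahead of time, and operate cooperatively, which makes them more efficient.
FANUC AI brain on training GPUs -> inference GPUs, with every robot carrying a Jetson TX1 or similar, building a GPU setup that covers the operation of all machines.
Today robots can only handle fixed tasks, but GPUs and deep learning will let them respond flexibly.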

Auto pilot for automobile

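AI transportation is a 1,000 trillion yen industry.
Autonomous driving: DRIVEWORKS ALPHA 1, an operating system for autonomous driving.
DriveNet, OpenRoadNet, localization
PilotNet -> traffic prediction, path planning, action engine
An engine that estimates where the car can drive (safe places): the occupancy grid. Mirrors will become unnecessary.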
learning from human driving.

New Powerful SoC XAVIER for automobile

XAVIER: 8 custom ARMv8-A cores, 512-core Volta GPU, new vision accelerator, 2x 8K HDR video processors, ASIL C level safety.
Xavier matches the performance of 2x Parker + 2x discrete GPUs at 20 W, and can perform 20 trillion deep learning operations per second.

Accelerating the race to self-driving cars by Danny Shapiro, Sr. Director of Automotive

GPU computing power boosts innovation

Simulation can already reproduce the same conditions as a real crash.
More than 10 million cars with nVIDIA hardware are already on the road.

self-driving with DrivePX2

sense -> map | localize | perceive -> plan -> control
Baidu and TomTom are partners for AI self-driving cars.
Free space detection, 3D car detection.
DrivePX2 is equivalent to 150 MacBooks!
nVIDIA AI driving platform: DRIVEWORKS, AUTOCHAUFFEUR, and Xavier
AUTOCHAUFFEUR
2x Tegra Parker (12 CPU cores), 2x discrete GPUs (Pascal). 20 DL TOPS, 120 SPECint, 80 W. 12 simultaneous LVDS camera inputs.
AutoCruise
1x Tegra Parker SoC: 1.4 TFLOPS GPU, 6 cores, 10 W, 8 GB LPDDR4.

Software architecture for self-driving

DriveNet,OpenRoadNet,Localization -> Occupancy Grid <-> Traffic Prediction, Path Planning, Action Engine <- PilotNet
OpenRoadNet finds open road.
OCCUPANCY GRID: other vehicles, roads, lanes, road rules, free space, map objects.
PERCEPTION: 6 cameras providing a 360-degree surround view; all cameras: 2.3 MPix; front and rear: 6d HFOV; side: 100d HFOV
DETECTION: cars are recognized in three dimensions, using simultaneous input from the six cameras.
Elevation is also included in the occupancy grid.
DriveNet, HDMap, FreeSpace --> Current OccupancyGrid -> Prediction/Simulation -> Predicted OGs -> Safety Analysis -> SuperVisoryCtrl
Predicted OGs -> Temporal Consistency (how far does the prediction deviate from reality?) -> Current ObjList -> Current OG
Path proposals from: PilotNet (BB8), H.Driver (human driver), Path Plan (physics-based) => Safety Analysis.
From three candidate paths (the flexible BB8-style one, the human driver's, and the physics-based one), the system picks whichever is safest.
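The talk did not say how Safety Analysis scores the three path proposals, so the following is only a minimal sketch of the idea of picking the safest candidate. It assumes a predicted occupancy grid given as a set of occupied cells and a hypothetical risk measure (the fraction of waypoints that land on an occupied cell). `PathProposal`, `collision_risk`, and `safest` are names invented for this illustration, not part of DriveWorks.

```python
from dataclasses import dataclass


@dataclass
class PathProposal:
    source: str      # who proposed the path
    waypoints: list  # grid cells (x, y) the path passes through


def collision_risk(proposal, occupied):
    """Hypothetical risk: share of waypoints on cells predicted to be occupied."""
    hits = sum(1 for cell in proposal.waypoints if cell in occupied)
    return hits / len(proposal.waypoints)


def safest(proposals, occupied):
    """Return the proposal with the lowest risk against the predicted grid."""
    return min(proposals, key=lambda p: collision_risk(p, occupied))


# Predicted occupancy grid: another vehicle is expected on these cells.
predicted_occupied = {(2, 0), (3, 0)}

proposals = [
    PathProposal("PilotNet", [(0, 0), (1, 1), (2, 1), (3, 1)]),
    PathProposal("H.Driver", [(0, 0), (1, 0), (2, 0), (3, 1)]),
    PathProposal("Path Plan", [(0, 0), (1, 0), (2, 0), (3, 0)]),
]

chosen = safest(proposals, predicted_occupied)
print(chosen.source, collision_risk(chosen, predicted_occupied))
```

In this toy grid only the PilotNet proposal steers around the predicted obstacle, so it is the one selected.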

Coming innovation with Xavier

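SOLID GPU ROADMAP: 72 SGEMM/W (performance per watt). Volta will be available in 2018. For the Xavier generation the chip arrives next year, and the roadmap shows it can be delivered at low power consumption.
It is one architecture throughout: mobile GPU in Tegra, GeForce in PCs, Quadro in workstations, Tesla in supercomputers.
Parker: 256 Pascal CUDA cores, ARMv8 Denver 2 + 4x A57, coherent HMP. Denver computes roughly 50% faster than the A57.
It has a built-in Safety Engine, a microcontroller that can monitor the system. Xavier = 20 DL TOPS, 160 SPECint, 20 W. This is on the level of 2 Parker SoCs + 2 Pascal GPUs.
7 billion transistors. P100 has 15 billion, so about half. Built on 16 nm FinFET. The CPU is an ARM64 design, the next version after Denver.
ASIL C functional safety (perhaps this comes from the Safety Engine).
Parker had 256 Pascal CUDA cores and Xavier has 512, which doubles the core count. Compute efficiency doubles as well, so the result is four times the performance. Pascal should already have supported 16-bit floats, though; was that not the case in the Parker generation?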
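The "four times" claim is a product of two factors, and the same factor shows up again when comparing the AUTOCHAUFFEUR and Xavier figures quoted in these notes. The sketch below just spells out that arithmetic; the variable names are illustrative, and the per-core gain of 2 is taken from the notes rather than from a datasheet.

```python
# GPU cores as quoted in the notes.
parker_cores = 256
xavier_cores = 512
per_core_gain = 2.0  # from the notes: compute efficiency doubles

# Overall speedup = (core count ratio) x (per-core efficiency gain).
speedup = (xavier_cores / parker_cores) * per_core_gain

# Deep learning throughput per watt, using the quoted TOPS and wattage.
autochauffeur_tops, autochauffeur_watts = 20, 80  # 2x Parker + 2x Pascal GPU
xavier_tops, xavier_watts = 20, 20
efficiency_gain = (xavier_tops / xavier_watts) / (autochauffeur_tops / autochauffeur_watts)

print(f"speedup over Parker: x{speedup:.0f}")
print(f"TOPS per watt vs. AUTOCHAUFFEUR: x{efficiency_gain:.0f}")
```

Both ratios come out to 4: the same deep learning throughput as the two-SoC, two-GPU board at a quarter of the power.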

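Scalable inter-node parallelism with MPI

Inter-node parallelism has the problem that it does not scale.

About MPI

With MPI_Allreduce, every server ends up holding the result of the reduce.
All-Reduce is an overhead. It grows when the weight parameters have many elements, and it grows as the number of nodes increases.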
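As a rough illustration of what All-Reduce does and why its cost grows, here is a pure-Python simulation (no MPI involved). `allreduce_sum` mimics the semantics: every node ends with the same elementwise sum. `naive_traffic` counts elements moved under an assumed, deliberately naive scheme (gather everything to one root, then broadcast back). Real MPI libraries use smarter algorithms, so treat the formula as an assumption that only shows the trend.

```python
def allreduce_sum(per_node_values):
    """Simulate all-reduce(sum): each node receives the elementwise total."""
    total = [sum(column) for column in zip(*per_node_values)]
    return [list(total) for _ in per_node_values]


def naive_traffic(n_nodes, n_elements):
    """Elements sent if nodes reduce to one root and the root broadcasts back.

    Assumption for illustration only: (n_nodes - 1) sends to the root plus
    (n_nodes - 1) sends back, each carrying the full parameter vector.
    """
    return 2 * (n_nodes - 1) * n_elements


# Three nodes, each holding its own gradient for two parameters.
gradients = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
reduced = allreduce_sum(gradients)
print(reduced)  # every node now holds the same summed gradient

# Traffic grows with both the node count and the parameter count.
for nodes in (2, 4, 8):
    print(nodes, naive_traffic(nodes, n_elements=1000))
```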

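Working around the inter-node overhead

The basic ideas are to hide the aggregation time behind other GPU processing and to shorten the aggregation time itself.
(1) Hiding behind the backward pass: start the All-reduce for each layer as soon as that layer's backward pass finishes.
(2) Hiding behind the forward pass: split the update step and decide layer by layer when the forward pass can start.
Parallelization through subdivision: the aggregation is split into finer stages and executed that way: transfer from GPU to host memory, transfer between nodes, the Reduce operation, and transfer from CPU back to GPU memory.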
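The effect of idea (1) can be estimated with a small timing model. This is a sketch under stated assumptions, not the speaker's implementation: each layer has a backward time and an All-reduce time, the lists are in backward order (last layer first), compute and communication can run at the same time, and only one All-reduce is in flight at once. The function names are made up for this example.

```python
def serial_time(backward, comm):
    """All backward passes first, then all All-reduces, nothing overlapped."""
    return sum(backward) + sum(comm)


def overlapped_time(backward, comm):
    """Start each layer's All-reduce as soon as its backward pass is done.

    A layer's All-reduce waits for both its own gradients and the previous
    All-reduce (a single communication channel is assumed).
    """
    compute_done = 0.0
    comm_done = 0.0
    for b, c in zip(backward, comm):
        compute_done += b
        comm_done = max(comm_done, compute_done) + c
    return comm_done


# Three layers, listed in the order the backward pass visits them.
backward_ms = [3.0, 3.0, 3.0]
allreduce_ms = [2.0, 2.0, 2.0]

print(serial_time(backward_ms, allreduce_ms))      # no hiding
print(overlapped_time(backward_ms, allreduce_ms))  # per-layer hiding
```

With these numbers, only the last layer's All-reduce remains exposed; the others finish while later layers are still computing gradients.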

TensorRT

TensorRT is an engine for inference (not training). Given a trained neural network, it applies network layer fusion, concatenation layer removal, batch size tuning, etc.
DIGITS -> trained neural network model -> optimization engine -> plan -> execution engine
TensorRT @ Parallel Forall. 3D convolution is not supported yet.
GPU REST ENGINE
Replace Caffe with TensorRT.
Use the GPU for preprocessing. Run concurrently and use a large batch size. CUDA streams and threads do not have to correspond one-to-one.
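Layer fusion, mentioned above as one of TensorRT's optimizations, can be illustrated without TensorRT itself. The sketch below is plain Python and makes no claim about how TensorRT implements it: it only shows that a bias-add followed by ReLU, run as two passes over the data, gives the same result as one fused pass that touches each element once. The function names are invented for the example.

```python
def add_bias(xs, bias):
    """Layer 1: add a bias to every activation (one pass over the data)."""
    return [x + bias for x in xs]


def relu(xs):
    """Layer 2: clamp negatives to zero (a second pass over the data)."""
    return [max(0.0, x) for x in xs]


def fused_bias_relu(xs, bias):
    """Fused layer: both operations in a single pass, no intermediate list."""
    return [max(0.0, x + bias) for x in xs]


activations = [-2.0, -0.5, 0.0, 1.5]
unfused = relu(add_bias(activations, 0.5))
fused = fused_bias_relu(activations, 0.5)
print(unfused, fused)
```

On a GPU the benefit comes from launching fewer kernels and skipping the round trip of the intermediate result through memory; the output is unchanged.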

AZURE with GPU

Azure appears to have Tesla K80s with 12 GB. http://www.nvidia.co.jp/object/tesla-servers-jp.html … The K80 has 24 GB, yet although nvidia-smi reports a K80, only 12 GB is recognized. The K40 seems to be the one with 12 GB, though...
Microsoft's open source Computational Network Toolkit (CNTK)
Remote Desktop services, OpenGL/CL support
Because the cloud has GPU-enabled servers, CAD work can be rendered remotely and the result received locally (this is about Remote Desktop).
NV = Tesla M60: OpenGL & DirectX, NC = K80: CUDA & OpenCL http://gpu.zure.com , http://aka.as/tryazurehpc http://nvidia.com/grid

Deep learning applications for IoT by Andrew Cresci, General Manager, Industrial Sector

deep learning in industrial automation. automation, safety operation, etc....
The point was that deep learning has no domain experts: it learns from data.
Feature engineering vs. deep learning
No tuning is needed to identify characteristic patterns.
One algorithm classifies normal, abnormal, and edge cases.
Autoencoder: encoder -> sparse representation -> decoder -> reconstruction (of the input)
Using deep learning as an autoencoder does make sense...
Deep belief network (DBN): noise: stochastic restricted Boltzmann machine (RBM); deep autoencoder (DAE).
The autoencoder is pre-trained as a DBN and then fine-tuned as a DAE (deep autoencoder).
A DAE with 11 layers and a 14-dimension bottleneck achieved a 978.8% (sic) true positive detection rate with a 0.0% false alarm rate.
Deep learning outperforms hand-crafted work. It is general-purpose and scalable.
ViDi: industrial image analysis software for automated inspection & classification.
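The encoder -> bottleneck -> decoder -> reconstruction idea from the notes above can be shown with a toy example. This is a minimal sketch, not the DAE from the talk: the "autoencoder" is a single linear projection with hand-set weights (a real one learns them from normal data), the inputs are 2-D points, and the threshold of 0.5 is an arbitrary assumption. Inputs that the model cannot reconstruct well are flagged as abnormal.

```python
import math

# Hand-set weights for illustration only; normal data is assumed to lie
# close to the line x == y, so the bottleneck keeps just that direction.
DIRECTION = (1 / math.sqrt(2), 1 / math.sqrt(2))


def encode(point):
    """2-D input -> 1-D bottleneck code (projection onto DIRECTION)."""
    return point[0] * DIRECTION[0] + point[1] * DIRECTION[1]


def decode(code):
    """1-D code -> reconstructed 2-D point."""
    return (code * DIRECTION[0], code * DIRECTION[1])


def reconstruction_error(point):
    """Distance between the input and its reconstruction."""
    rx, ry = decode(encode(point))
    return math.hypot(point[0] - rx, point[1] - ry)


def is_abnormal(point, threshold=0.5):
    """Flag inputs the autoencoder fails to reconstruct (threshold assumed)."""
    return reconstruction_error(point) > threshold


for sample in [(1.0, 1.1), (2.0, 2.0), (1.0, 3.0)]:
    print(sample, round(reconstruction_error(sample), 3), is_abnormal(sample))
```

No hand-written rule describes what "abnormal" looks like here; the decision falls out of how well the input fits what the bottleneck can represent, which is the contrast with feature engineering drawn in the talk.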
