この記事について
Tensorflow AOT (Ahead-of-Time)のツールであるtfcompileの使い方のメモです。tfcompileの詳細については公式サイト(Using AOT compilation)に説明が記載されています。
tfcompileの使い方
tfcompileで使えるフラグ一覧は以下の通りです。
tfcompile performs ahead-of-time compilation of a TensorFlow graph,
resulting in an object file compiled for your target architecture, and a
header file that gives access to the functionality in the object file.
A typical invocation looks like this:
$ tfcompile --graph=mygraph.pb --config=myfile.pbtxt --cpp_class="mynamespace::MyComputation"
usage: ./tfcompile
Flags:
--graph="" string Input GraphDef file. If the file ends in '.pbtxt' it is expected to be in the human-readable proto text format, otherwise it is expected to be in the proto binary format.
--config="" string Input file containing Config proto. If the file ends in '.pbtxt' it is expected to be in the human-readable proto text format, otherwise it is expected to be in the proto binary format.
--dump_fetch_nodes=false bool If set, only flags related to fetches are processed, and the resulting fetch nodes will be dumped to stdout in a comma-separated list. Typically used to format arguments for other tools, e.g. freeze_graph.
--target_triple="x86_64-pc-linux" string Target platform, similar to the clang -target flag. The general format is <arch><sub>-<vendor>-<sys>-<abi>. http://clang.llvm.org/docs/CrossCompilation.html#target-triple.
--target_cpu="" string Target cpu, similar to the clang -mcpu flag. http://clang.llvm.org/docs/CrossCompilation.html#cpu-fpu-abi
--target_features="" string Target features, e.g. +avx2, +neon, etc.
--entry_point="entry" string Name of the generated function. If multiple generated object files will be linked into the same binary, each will need a unique entry point.
--cpp_class="" string Name of the generated C++ class, wrapping the generated function. The syntax of this flag is [[<optional_namespace>::],...]<class_name>. This mirrors the C++ syntax for referring to a class, where multiple namespaces may precede the class name, separated by double-colons. The class will be generated in the given namespace(s), or if no namespaces are given, within the global namespace.
--out_function_object="out_model.o" string Output object file containing the generated function for the TensorFlow model.
--out_header="out.h" string Output header file name.
--out_metadata_object="out_helper.o" string Output object file name containing optional metadata for the generated function.
--out_session_module="" string Output session module proto.
--gen_name_to_index=false bool Generate name-to-index data for Lookup{Arg,Result}Index methods.
--gen_program_shape=false bool Generate program shape data for the ProgramShape method.
--xla_generate_hlo_graph="" string HLO modules matching this regex will be dumped to a .dot file throughout various stages in compilation.
--xla_hlo_graph_addresses=false bool With xla_generate_hlo_graph, show addresses of HLO ops in graph dump.
--xla_hlo_graph_path="" string With xla_generate_hlo_graph, dump the graphs into this path.
--xla_hlo_dump_as_graphdef=false bool Dump HLO graphs as TensorFlow GraphDefs.
--xla_hlo_graph_sharding_color=false bool Assign colors based on sharding assignments when generating the HLO graphs.
--xla_hlo_tfgraph_device_scopes=false bool When generating TensorFlow HLO graphs, if the HLO instructions are assigned to a specific device, prefix the name scope with "devX" with X being the device ordinal.
--xla_log_hlo_text="" string HLO modules matching this regex will be dumped to LOG(INFO).
--xla_generate_hlo_text_to="" string Dump all HLO modules as text into the provided directory path.
--xla_enable_fast_math=true bool Enable unsafe fast-math optimizations in the compiler; this may produce faster code at the expense of some accuracy.
--xla_llvm_enable_alias_scope_metadata=true bool In LLVM-based backends, enable the emission of !alias.scope metadata in the generated IR.
--xla_llvm_enable_noalias_metadata=true bool In LLVM-based backends, enable the emission of !noalias metadata in the generated IR.
--xla_llvm_enable_invariant_load_metadata=true bool In LLVM-based backends, enable the emission of !invariant.load metadata in the generated IR.
--xla_llvm_disable_expensive_passes=false bool In LLVM-based backends, disable a custom set of expensive optimization passes.
--xla_backend_optimization_level=3 int32 Numerical optimization level for the XLA compiler backend.
--xla_disable_hlo_passes="" string Comma-separated list of hlo passes to be disabled. These names must exactly match the passes' names; no whitespace around commas.
--xla_embed_ir_in_executable=false bool Embed the compiler IR as a string in the executable.
--xla_dump_ir_to="" string Dump the compiler IR into this directory as individual files.
--xla_eliminate_hlo_implicit_broadcast=true bool Eliminate implicit broadcasts when lowering user computations to HLO instructions; use explicit broadcast instead.
--xla_cpu_multi_thread_eigen=true bool When generating calls to Eigen in the CPU backend, use multi-threaded Eigen mode.
--xla_gpu_cuda_data_dir="./cuda_sdk_lib" string If non-empty, speficies a local directory containing ptxas and nvvm libdevice files; otherwise we use those from runfile directories.
--xla_gpu_ftz=false bool If true, flush-to-zero semantics are enabled in the code generated for GPUs.
--xla_gpu_disable_multi_streaming=false bool If true, multi-streaming in the GPU backend is disabled.
--xla_gpu_max_kernel_unroll_factor=4 int32 Specify the maximum kernel unroll factor for the GPU backend.
--xla_dump_optimized_hlo_proto_to="" string Dump Hlo after all hlo passes are executed as proto binary into this directory.
--xla_dump_unoptimized_hlo_proto_to="" string Dump HLO before any hlo passes are executed as proto binary into this directory.
--xla_dump_per_pass_hlo_proto_to="" string Dump HLO after each pass as an HloProto in binary file format into this directory.
--xla_test_all_output_layouts=false bool Let ClientLibraryTestBase::ComputeAndCompare* test all permutations of output layouts. For example, with a 3D shape, all permutations of the set {0, 1, 2} are tried.
--xla_test_all_input_layouts=false bool Let ClientLibraryTestBase::ComputeAndCompare* test all permutations of *input* layouts. For example, for 2 input arguments with 2D shape and 4D shape, the computation will run 2! * 4! times for every possible layouts
--xla_hlo_profile=false bool Instrument the computation to collect per-HLO cycle counts
--xla_dump_computations_to="" string Dump computations that XLA executes into the provided directory path
--xla_dump_executions_to="" string Dump parameters and results of computations that XLA executes into the provided directory path
--xla_backend_extra_options="" string Extra options to pass to a backend; comma-separated list of 'key=val' strings (=val may be omitted); no whitespace around commas.
--xla_reduce_precision="" string Directions for adding reduce-precision operations. Format is 'LOCATION=E,M:OPS;NAMES' where LOCATION is the class of locations in which to insert the operations (e.g., 'OP_OUTPUTS'), E and M are the exponent and matissa bit counts respectively, and OPS and NAMES are comma-separated (no spaces) lists of the operation types and names to which to attach the reduce-precision operations. The NAMES string and its preceding ';' may be omitted. This option may be repeated to define multiple sets of added reduce-precision operations.
--xla_gpu_use_cudnn_batchnorm=false bool Allows the GPU backend to implement batchnorm HLOs using cudnn, rather than expanding them to a soup of HLOs.
--xla_cpu_use_mkl_dnn=false bool Generate calls to MKL-DNN in the CPU backend.
ARM版モジュールを出力する場合
--target_triple="armv7-linux-gnueabihf"
--target_triple="aarch64-linux-gnu"
補足
tfcompileのビルド
tfcompileは2018年5月現在ではソースコード提供となっており、利用するにはコンパイルが必要です。
tfcompileをビルドする際にはgoogleのビルドツールbazelを使って以下のコマンドでビルドします。
$bazel build --config=opt --config=//tensorflow/compiler/aot:tfcompile
tfcompileの生成場所
tfcompileのビルドに成功すると以下のディレクトリに生成されます。
(bazel-binはシンボリックリンクになっており、実態はホームディレクトリのキャッシュフォルダ(./cache)内に生成されます。)
$(build_directory)/bazel-bin/tensorflow/compiler/aot/tfcompile