How to use TensorFlow AOT (tfcompile)

Posted at 2018-05-06

About this article

This article is a memo on how to use tfcompile, the TensorFlow AOT (Ahead-of-Time) compilation tool. tfcompile itself is described in detail on the official site (Using AOT compilation).

How to use tfcompile

The full list of flags that tfcompile accepts is shown below.


tfcompile performs ahead-of-time compilation of a TensorFlow graph,
resulting in an object file compiled for your target architecture, and a
header file that gives access to the functionality in the object file.
A typical invocation looks like this:

   $ tfcompile --graph=mygraph.pb --config=myfile.pbtxt --cpp_class="mynamespace::MyComputation"

usage: ./tfcompile
Flags:
        --graph=""                              string  Input GraphDef file.  If the file ends in '.pbtxt' it is expected to be in the human-readable proto text format, otherwise it is expected to be in the proto binary format.
        --config=""                             string  Input file containing Config proto.  If the file ends in '.pbtxt' it is expected to be in the human-readable proto text format, otherwise it is expected to be in the proto binary format.
        --dump_fetch_nodes=false                bool    If set, only flags related to fetches are processed, and the resulting fetch nodes will be dumped to stdout in a comma-separated list.  Typically used to format arguments for other tools, e.g. freeze_graph.
        --target_triple="x86_64-pc-linux"       string  Target platform, similar to the clang -target flag.  The general format is <arch><sub>-<vendor>-<sys>-<abi>.  http://clang.llvm.org/docs/CrossCompilation.html#target-triple.
        --target_cpu=""                         string  Target cpu, similar to the clang -mcpu flag.  http://clang.llvm.org/docs/CrossCompilation.html#cpu-fpu-abi
        --target_features=""                    string  Target features, e.g. +avx2, +neon, etc.
        --entry_point="entry"                   string  Name of the generated function.  If multiple generated object files will be linked into the same binary, each will need a unique entry point.
        --cpp_class=""                          string  Name of the generated C++ class, wrapping the generated function.  The syntax of this flag is [[<optional_namespace>::],...]<class_name>.  This mirrors the C++ syntax for referring to a class, where multiple namespaces may precede the class name, separated by double-colons.  The class will be generated in the given namespace(s), or if no namespaces are given, within the global namespace.
        --out_function_object="out_model.o"     string  Output object file containing the generated function for the TensorFlow model.
        --out_header="out.h"                    string  Output header file name.
        --out_metadata_object="out_helper.o"    string  Output object file name containing optional metadata for the generated function.
        --out_session_module=""                 string  Output session module proto.
        --gen_name_to_index=false               bool    Generate name-to-index data for Lookup{Arg,Result}Index methods.
        --gen_program_shape=false               bool    Generate program shape data for the ProgramShape method.
        --xla_generate_hlo_graph=""             string  HLO modules matching this regex will be dumped to a .dot file throughout various stages in compilation.
        --xla_hlo_graph_addresses=false         bool    With xla_generate_hlo_graph, show addresses of HLO ops in graph dump.
        --xla_hlo_graph_path=""                 string  With xla_generate_hlo_graph, dump the graphs into this path.
        --xla_hlo_dump_as_graphdef=false        bool    Dump HLO graphs as TensorFlow GraphDefs.
        --xla_hlo_graph_sharding_color=false    bool    Assign colors based on sharding assignments when generating the HLO graphs.
        --xla_hlo_tfgraph_device_scopes=false   bool    When generating TensorFlow HLO graphs, if the HLO instructions are assigned to a specific device, prefix the name scope with "devX" with X being the device ordinal.
        --xla_log_hlo_text=""                   string  HLO modules matching this regex will be dumped to LOG(INFO).
        --xla_generate_hlo_text_to=""           string  Dump all HLO modules as text into the provided directory path.
        --xla_enable_fast_math=true             bool    Enable unsafe fast-math optimizations in the compiler; this may produce faster code at the expense of some accuracy.
        --xla_llvm_enable_alias_scope_metadata=true     bool    In LLVM-based backends, enable the emission of !alias.scope metadata in the generated IR.
        --xla_llvm_enable_noalias_metadata=true bool    In LLVM-based backends, enable the emission of !noalias metadata in the generated IR.
        --xla_llvm_enable_invariant_load_metadata=true  bool    In LLVM-based backends, enable the emission of !invariant.load metadata in the generated IR.
        --xla_llvm_disable_expensive_passes=false       bool    In LLVM-based backends, disable a custom set of expensive optimization passes.
        --xla_backend_optimization_level=3      int32   Numerical optimization level for the XLA compiler backend.
        --xla_disable_hlo_passes=""             string  Comma-separated list of hlo passes to be disabled. These names must exactly match the passes' names; no whitespace around commas.
        --xla_embed_ir_in_executable=false      bool    Embed the compiler IR as a string in the executable.
        --xla_dump_ir_to=""                     string  Dump the compiler IR into this directory as individual files.
        --xla_eliminate_hlo_implicit_broadcast=true     bool    Eliminate implicit broadcasts when lowering user computations to HLO instructions; use explicit broadcast instead.
        --xla_cpu_multi_thread_eigen=true       bool    When generating calls to Eigen in the CPU backend, use multi-threaded Eigen mode.
        --xla_gpu_cuda_data_dir="./cuda_sdk_lib"        string  If non-empty, specifies a local directory containing ptxas and nvvm libdevice files; otherwise we use those from runfile directories.
        --xla_gpu_ftz=false                     bool    If true, flush-to-zero semantics are enabled in the code generated for GPUs.
        --xla_gpu_disable_multi_streaming=false bool    If true, multi-streaming in the GPU backend is disabled.
        --xla_gpu_max_kernel_unroll_factor=4    int32   Specify the maximum kernel unroll factor for the GPU backend.
        --xla_dump_optimized_hlo_proto_to=""    string  Dump Hlo after all hlo passes are executed as proto binary into this directory.
        --xla_dump_unoptimized_hlo_proto_to=""  string  Dump HLO before any hlo passes are executed as proto binary into this directory.
        --xla_dump_per_pass_hlo_proto_to=""     string  Dump HLO after each pass as an HloProto in binary file format into this directory.
        --xla_test_all_output_layouts=false     bool    Let ClientLibraryTestBase::ComputeAndCompare* test all permutations of output layouts. For example, with a 3D shape, all permutations of the set {0, 1, 2} are tried.
        --xla_test_all_input_layouts=false      bool    Let ClientLibraryTestBase::ComputeAndCompare* test all permutations of *input* layouts. For example, for 2 input arguments with 2D shape and 4D shape, the computation will run 2! * 4! times for every possible layout.
        --xla_hlo_profile=false                 bool    Instrument the computation to collect per-HLO cycle counts
        --xla_dump_computations_to=""           string  Dump computations that XLA executes into the provided directory path
        --xla_dump_executions_to=""             string  Dump parameters and results of computations that XLA executes into the provided directory path
        --xla_backend_extra_options=""          string  Extra options to pass to a backend; comma-separated list of 'key=val' strings (=val may be omitted); no whitespace around commas.
        --xla_reduce_precision=""               string  Directions for adding reduce-precision operations. Format is 'LOCATION=E,M:OPS;NAMES' where LOCATION is the class of locations in which to insert the operations (e.g., 'OP_OUTPUTS'), E and M are the exponent and mantissa bit counts respectively, and OPS and NAMES are comma-separated (no spaces) lists of the operation types and names to which to attach the reduce-precision operations.  The NAMES string and its preceding ';' may be omitted.  This option may be repeated to define multiple sets of added reduce-precision operations.
        --xla_gpu_use_cudnn_batchnorm=false     bool    Allows the GPU backend to implement batchnorm HLOs using cudnn, rather than expanding them to a soup of HLOs.
        --xla_cpu_use_mkl_dnn=false             bool    Generate calls to MKL-DNN in the CPU backend.
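
The --config flag takes a Config proto (defined in tensorflow/compiler/tf2xla/tf2xla.proto) that lists the graph's feeds (inputs) and fetches (outputs). As a minimal sketch, a myfile.pbtxt for a graph with an input node x of shape [1, 4] and an output node y (the node names and shapes here are hypothetical) would look like this:

feed {
  id { node_name: "x" }
  shape {
    dim { size: 1 }
    dim { size: 4 }
  }
}
fetch {
  id { node_name: "y" }
}

The generated header wraps the compiled function in the class named by --cpp_class. Below is a minimal usage sketch, assuming the typical invocation above produced out.h containing mynamespace::MyComputation with a single float input and output (the arg0_data/result0_data accessors follow the naming scheme from the official tutorial):

#define EIGEN_USE_THREADS
#define EIGEN_USE_CUSTOM_THREAD_POOL

#include <iostream>
#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
#include "out.h"  // header generated by tfcompile (--out_header)

int main() {
  // The generated code calls into Eigen for some ops, so it needs a thread pool.
  Eigen::ThreadPool tp(2);
  Eigen::ThreadPoolDevice device(&tp, tp.NumThreads());

  mynamespace::MyComputation computation;
  computation.set_thread_pool(&device);

  // Fill the input buffer ([1, 4] floats, matching the config above).
  float* input = computation.arg0_data();
  for (int i = 0; i < 4; ++i) input[i] = 1.0f;

  computation.Run();

  std::cout << computation.result0_data()[0] << std::endl;
  return 0;
}

Note that both object files emitted by tfcompile (out_model.o and out_helper.o by default) are typically linked into the final binary.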

Outputting modules for ARM

--target_triple="armv7-linux-gnueabihf" 
--target_triple="aarch64-linux-gnu"

Notes

Building tfcompile

As of May 2018, tfcompile is provided as source code only, so you need to build it yourself before you can use it.
tfcompile is built with Google's build tool bazel, using the following command:

$ bazel build --config=opt //tensorflow/compiler/aot:tfcompile

Where the tfcompile binary is generated

When the build succeeds, tfcompile is generated in the directory below.
(bazel-bin is a symbolic link; the actual files are created in bazel's cache folder under your home directory (~/.cache).)

$(build_directory)/bazel-bin/tensorflow/compiler/aot/tfcompile
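
The built binary can be invoked directly through that symlink, for example mirroring the typical invocation shown earlier (file and class names are placeholders):

$ bazel-bin/tensorflow/compiler/aot/tfcompile --graph=mygraph.pb --config=myfile.pbtxt --cpp_class="mynamespace::MyComputation"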