More than 5 years have passed since last update.

(ソースコードメモ)IntelのDNNL

Last updated at 2020-01-27Posted at 2020-01-27

はじめに

Intelの深層学習用の数値計算ライブラリ(DNNL)の実装を見てみる。はじめに、実行エンジンから、JITコードを実行する流れを見る。なお、DNNLのJITコード生成部分(Xbyak)は、文末に記載した。
DNNLのコードは、光成さんが解説記事を書かれている。しかし、それ一つで網羅できるコードでもない。このため記載してみる。コメント等頂ければ幸いである。

実行エンジン

まず、実行エンジンの枠組みと、JITコードの生成実行の概要を説明する。

実行エンジンの枠組み

DNNLの実行エンジンは、primitive_impl_tを継承したオブジェクトを実行する。initメソッドで初期化を行い、executeメソッドで実行する。
なお、このクラスは、CPU/GPUとも共通(src/common)の処理である。

struct primitive_impl_t : public c_compatible {
    primitive_impl_t(const primitive_desc_t *pd) : pd_(pd->clone()) {}
    virtual ~primitive_impl_t() { delete pd_; }

    virtual status_t init() { return status::success; }
    engine_t *engine() const { return pd_->engine(); }
    const primitive_desc_t *pd() const { return pd_; }
    primitive_kind_t kind() const { return pd_->kind(); }
    virtual status_t execute(const exec_ctx_t &ctx) const = 0;

protected:
    const primitive_desc_t *pd_;
private:
    primitive_impl_t() = delete;
    DNNL_DISALLOW_COPY_AND_ASSIGN(primitive_impl_t);
};

JITの生成管理クラス

ここからCPU独自(src/cpu)の処理である。参考までに、GPUでもJITがあるが、OpenCLの実行なので意味が異なる。
さて、DNNLでは、primitive_impl_tクラスの処理の延長でJITが呼ばれる時がある。呼び出されるクラスは、Xbyak::CodeGeneratorを継承しているjit_generatorを継承する。jit_generatorクラスではpreamble、postableメソッドも定義されており、JIT処理のレジスタ退避と復帰を行う。(下記の畳み込みの例を参照)また、キャッシュサイズ取得等のユーティリティ関数も本クラスで定義している。

class jit_generator : public Xbyak::CodeGenerator {

// 中略

    void preamble() {
        if (xmm_to_preserve) {
            sub(rsp, xmm_to_preserve * xmm_len);
            for (size_t i = 0; i < xmm_to_preserve; ++i)
                movdqu(ptr[rsp + i * xmm_len],
                        Xbyak::Xmm(xmm_to_preserve_start + i));
        }
        for (size_t i = 0; i < num_abi_save_gpr_regs; ++i)
            push(Xbyak::Reg64(abi_save_gpr_regs[i]));
        if (mayiuse(avx512_common)) {
            mov(reg_EVEX_max_8b_offt, 2 * EVEX_max_8b_offt);
        }
    }

//　中略

    void postamble() {
        for (size_t i = 0; i < num_abi_save_gpr_regs; ++i)
            pop(Xbyak::Reg64(abi_save_gpr_regs[num_abi_save_gpr_regs - 1 - i]));
        if (xmm_to_preserve) {
            for (size_t i = 0; i < xmm_to_preserve; ++i)
                movdqu(Xbyak::Xmm(xmm_to_preserve_start + i),
                        ptr[rsp + i * xmm_len]);
            add(rsp, xmm_to_preserve * xmm_len);
        }
        uni_vzeroupper();
        ret();
    }

// 以下略

DNNLでのJITコード生成の概要

DNNLでのJITコードの生成と実行概要を、Convolution関数を例にして見てみる。前述の通り、DNNLのエンジンでは、primitive_impl_tを継承し、JITコードを生成し、実行する。大まかな構成は以下のとおりである。

jit_avx512_common_convolution_winograd_fwd_t
- primitive_impl_tを継承し、エンジンで実行するcpu_impl_listにつないでいる。
- init関数で、JITコード生成
  - init_conf@jit_avx512_common_conv_winograd_kernel_f32
    - jit_conv_winograd_conf_t(jcp)を初期化
    - _jit_avx512_common_convolution_winograd_t
      - _jit_avx512_common_conv_winograd_data_kernel_f32
        
        JITコード生成部(jit_generatorを継承している。)
        
        init_conf_common
        
        関数初期化(80行)
        
        gemm_loop_generate
        - GEMM演算コード生成(140行)
- execute関数でJITコード実行

JITコード生成は、以下の通りjit_generatorを継承したクラス_jit_avx512_common_conv_winograd_data_kernel_f32で、gemm_loop_generateを呼び出し行う。

struct _jit_avx512_common_conv_winograd_data_kernel_f32 : public jit_generator {
    _jit_avx512_common_conv_winograd_data_kernel_f32(
            jit_conv_winograd_conf_t ajcp)
        : jit_generator(nullptr, MAX_CODE_SIZE, false), jcp(ajcp) {
        //******************* First iter kernel ********************//
        this->gemm_loop_generate(true);
        gemm_loop_ker_first_iter
                = (decltype(gemm_loop_ker_first_iter))this->getCode();

        //************** Subsequent iterations kernel **************//
        if (jcp.dimK_nb_block > 1) {
            align();
            const Xbyak::uint8 *addr = getCurr();
            this->gemm_loop_generate(false);
            gemm_loop_ker = (decltype(gemm_loop_ker))addr;
        }
    }

JITコードは、gemm_loop_generateで生成する。ここでは、実質4つの関数が実行されコード生成される。このうちjit_generatorで定義されているレジスタの退避および復旧関数(preamble/postamble)である。このため、実質的な処理関数はinner_loopである。

void _jit_avx512_common_conv_winograd_data_kernel_f32::gemm_loop_generate(
        bool is_beta_zero) {
    // const int dimK_simd_block = jcp.dimK_reg_block;

    // for (int dimM_block =0; dimM_block < jcp.dimM_block; dimM_block++)
    //     for (int dimK_block = 0; dimK_block < jcp.dimK_block; dimK_block++)
    //         for (int dimK_reg_block= 0; dimK_reg_block < jcp.dimK_reg_block;
    //         dimK_reg_block++)
    //                 for (int tile =0; tile < jcp.dimN_reg_block; tile++)
    //                     C[dimM_block][tile] +=
    //                     A[dimM_block][dimK_block][dimK_reg_block] *
    //                     broadcast(B[dimK_block][tile][dimK_reg_block]);
    // 1) We do register blocking on A[dimM_block][dimK_block][dimK_reg_block],
    // so we load it before the loop on tile
    // 2) the loop on tile must be fully unrolled. Don't know about the one on
    // dimK_reg_block. I think it should be

// 中略

    /* Preamble */
    preamble();

    /* kernel */
    inner_loops();

    /* Postamble */
    postamble();
    ret();
}

さて、コード生成に続いて、JITコードの実行に移る。JITコードの実行は、primitive_impl_tを継承したjit_avx512_common_convolution_winograd_fwd_tのexecuteメソッドにより行う。

struct jit_avx512_common_convolution_winograd_fwd_t
    : _jit_avx512_common_convolution_winograd_t<true>,
      public primitive_impl_t {

//中略

    jit_avx512_common_convolution_winograd_fwd_t(const pd_t *apd)
        : _jit_avx512_common_convolution_winograd_t<true>(apd->jcp_)
        , primitive_impl_t(apd) {}

    ~jit_avx512_common_convolution_winograd_fwd_t() {};

    typedef typename prec_traits<data_type::f32>::type data_t;

    virtual status_t execute(const exec_ctx_t &ctx) const override {
        auto src = CTX_IN_MEM(const float *, DNNL_ARG_SRC);
        auto weights = CTX_IN_MEM(const float *, DNNL_ARG_WEIGHTS);
        auto bias = CTX_IN_MEM(const float *, DNNL_ARG_BIAS);
        auto dst = CTX_OUT_MEM(float *, DNNL_ARG_DST);
        this->_execute_data_W_S_G_D((float *)src, dst, (float *)weights,
                (float *)bias, ctx.get_scratchpad_grantor());
        return status::success;
    }

private:
    const pd_t *pd() const { return (const pd_t *)primitive_impl_t::pd(); }
};

この延長で、this->_execute_data_W_S_G_Dが呼ばれ、最終的に、gemm_loop_ker_first_iterやgemm_loop_kerの生成されたコードが呼び出される。

template <bool is_fwd>
void _jit_avx512_common_convolution_winograd_t<is_fwd>::_execute_data_W_S_G_D(
        float *inp_ptr, float *out_ptr, float *wei_ptr, float *bias_ptr,
        const memory_tracking::grantor_t &scratchpad) const {

//中略

    parallel_nd(jcp.dimN_nb_block, alpha, alpha, jcp.dimM_nb_block,
            jcp.dimN_block,
            [&](int N_blk1, int oj, int oi, int M_blk1, int N_blk2) {
                kernel_->gemm_loop_ker_first_iter(
                        (float *)&(M(N_blk1, M_blk1, oj, oi, N_blk2, 0, 0, 0)),
                        (const float *)&(U(M_blk1, oj, oi, 0, 0, 0, 0, 0)),
                        (const float *)&(
                                V(N_blk1, oj, oi, N_blk2, 0, 0, 0, 0)));
                for (int K_blk1 = 1; K_blk1 < jcp.dimK_nb_block; K_blk1++) {
                    kernel_->gemm_loop_ker((float *)&(M(N_blk1, M_blk1, oj, oi,
                                                   N_blk2, 0, 0, 0)),
                            (const float *)&(
                                    U(M_blk1, oj, oi, K_blk1, 0, 0, 0, 0)),
                            (const float *)&(V(
                                    N_blk1, oj, oi, N_blk2, K_blk1, 0, 0, 0)));
                }
            });

//以下略

その他

CPUのキャッシュサイズの考慮

CPU毎の最適化を行うためget_cache_sizeで、CPUのキャッシュサイズを取得している。L1, L2, L3のキャッシュサイズおよび、単体(true)かシステム全体(false)かを指定することができる。利用しているのは、

Convolution関数(L1, L2, L3)Winograd有り無し共利用
BatchNorm関数(L3のみ)

なお、コア当たりのキャッシュメモリが決め打ちでコーディングされているが、CascadeLake(略称CLK)に近い値を入れたのではないだろうか？

この関数は、歴史的経緯とは思うが、jit_generator.hppファイルにある

inline unsigned int get_cache_size(int level, bool per_core = true) {
    unsigned int l = level - 1;
    // Currently, if XByak is not able to fetch the cache topology
    // we default to 32KB of L1, 512KB of L2 and 1MB of L3 per core.
    if (cpu.getDataCacheLevels() == 0) {
        const int L1_cache_per_core = 32000;
        const int L2_cache_per_core = 512000;
        const int L3_cache_per_core = 1024000;
        int num_cores = per_core ? 1 : dnnl_get_max_threads();
        switch (l) {
            case (0): return L1_cache_per_core * num_cores;
            case (1): return L2_cache_per_core * num_cores;
            case (2): return L3_cache_per_core * num_cores;
            default: return 0;
        }
    }
    if (l < cpu.getDataCacheLevels()) {
        return cpu.getDataCacheSize(l)
                / (per_core ? cpu.getCoresSharingDataCache(l) : 1);
    } else
        return 0;
}

JIT化されているコード部分

以下のコードが、JIT化されている。Intelの資料参照のこと。例えば以下の部分は、JIT化されている。

Convolution層のWinogradアルゴリズムのGEMM演算部分(順方向)
- Winograd is only for 3x3; only the (special) GEMM part is JIT-ed, from Intel)
- SKX/KNL only
Convolution層の通常アルゴリズム(順方向)
- AVX512では、JIT化されている。(300行ぐらいある)

Intel MKLへの依存について

積和演算では、Intel MKLを選択することがかつてはできた。しかし、V1.0への移行時に依存するコードは、削除する方向で動いている。これは、DNNL内のGEMM関数がJIT化されたことに伴いIntel MKLのGEMMを使わなくなったためである。

テンソルのメモリ管理構造

テンソルのメモリ管理構造は、以下のようになっている。このmemory_desc_wrapperが、DNNLテンソルのメモリ管理で使う構造体である。

memory_desc_wrapper
- memory_desc_t
  - dnnl_memory_desc_t

最終的なテンソルの管理構造体は、dnnl_memory_desc_tである。

/// Memory descriptor. The description is based on a number of dimensions,
/// dimensions themselves, plus information about elements type and memory
/// format. Additionally, contains format-specific descriptions of the data
/// layout.
typedef struct {
    /// Number of dimensions
    int ndims;
    /// Dimensions in the following order:
    /// - CNN data tensors: mini-batch, channel, spatial
    ///   (<code>{N, C, [[D,] H,] W}</code>)
    /// - CNN weight tensors: group (optional), output channel, input channel,
    ///   spatial (<code>{[G,] O, I, [[D,] H,] W}</code>)
    /// - RNN data tensors: time, mini-batch, channels (<code>{T, N, C}</code>)
    ///   or layers, directions, states, mini-batch, channels (<code>{L, D, S, N, C}</code>)
    /// - RNN weight tensor: layers, directions, input channel, gates, output channels
    ///   (<code>{L, D, I, G, O}</code>).
    ///
    /// @note
    ///    The order of dimensions does not depend on the memory format, so
    ///    whether the data is laid out in #dnnl_nchw or #dnnl_nhwc
    ///    the dims for 4D CN data tensor would be <code>{N, C, H, W}</code>.
    dnnl_dims_t dims;

    /// Data type of the tensor elements.
    dnnl_data_type_t data_type;

    /// Size of the data including padding in each dimension.
    dnnl_dims_t padded_dims;

    /// Per-dimension offset from the padding to actual data, the top-level
    /// tensor with offsets applied must lie within the padding area.
    dnnl_dims_t padded_offsets;

    /// Offset from memory origin to the current block, non-zero only in
    /// a description of a memory sub-block.
    dnnl_dim_t offset0;

    /// Memory format kind.
    dnnl_format_kind_t format_kind;
    union {
        /// Description of the data layout for memory formats that use
        /// blocking.
        dnnl_blocking_desc_t blocking;
        /// Tensor of weights for integer 8bit winograd convolution.
        dnnl_wino_desc_t wino_desc;
        /// Tensor of packed weights for RNN.
        dnnl_rnn_packed_desc_t rnn_packed_desc;
        // ... other descriptions possible
    } format_desc;

    dnnl_memory_extra_desc_t extra;
} dnnl_memory_desc_t;

コードを読む前の前提知識

インストール方法

READMEをたどると、書いてある。手順は、以下の通りである。

git clone http://github.com/intel/mkl-dnn
cd mkl-dnn
mkdir -p build && cd build && cmake ..
make install

テストとしては、build配下の以下のコマンド等を動かす。

tests/gtests/test_convolution_backward_data_f32

Xbyak

DNNLの中で使われているJITコード生成器がXbyakである。複数種類のCPU等に対応したコードを動的に生成する場合等に活躍する。Xbyakの使い方は、おおもとのREADMEを参照のこと
JIT用コードの生成は、CodeGeneratorクラスの親クラスであるCodeArrayクラスのdb等のメソッドでtop_に配列として保存する。そして、getCodeメソッドで取り出す。以下にコードを示す。

	void db(int code)
	{
		if (size_ >= maxSize_) {
			if (type_ == AUTO_GROW) {
				growMemory();
			} else {
				throw Error(ERR_CODE_IS_TOO_BIG);
			}
		}
		top_[size_++] = static_cast<uint8>(code);
	}
	void db(const uint8 *code, size_t codeSize)
	{
		for (size_t i = 0; i < codeSize; i++) db(code[i]);
	}
	void db(uint64 code, size_t codeSize)
	{
		if (codeSize > 8) throw Error(ERR_BAD_PARAMETER);
		for (size_t i = 0; i < codeSize; i++) db(static_cast<uint8>(code >> (i * 8)));
	}
	void dw(uint32 code) { db(code, 2); }
	void dd(uint32 code) { db(code, 4); }
	void dq(uint64 code) { db(code, 8); }
	const uint8 *getCode() const { return top_; }

JIT化するコードは以下の記載が必要である。

ラベル関連 LとLabel
命令セット　addなど
レジスタ Zmmなど Zmmは512ビットレジスタ
メモリアクセス zwordなど zwordは、512ビットでアクセスするメモリ領域

データ転送を例にとると以下のように記載する。ここでは、C++のラムダ式の記述を使っている。

                auto load_A = [=](int reg_idx, int offset) {
                    for (int i = 0; i < inc_dimK_reg_block; i++)
                        vmovups(Zmm(reg_idx + i),
                                zword[reg_srcA + 64 * (offset + i)]);
                };

また、ラベルは、ジャンプ命令と連携して利用する。以下では、test命令の結果を使って、jnz命令で記載したラベルに飛ぶ。

            Label unaligned_store, end_store;
            test(reg_dstC, cpu_isa_traits<avx512_common>::vlen - 1);
            jnz(unaligned_store, T_NEAR);
            store_output(true);
            jmp(end_store, T_NEAR);
            L(unaligned_store);
            { store_output(false); }
            L(end_store);

ソースコードは、以下である。

src/cpu/xbyak
- xbyak.h 基本ファイル。コード生成(CodeGenerator)や、レジスタやラベル定義など
- xbyak_mnemonic.h 命令セットの定義

主な命令セット

命令名	対象	概要
vmovups	AVX512	レジスタメモリ間データ転送
vmovntps	AVX512	レジスタメモリ間データ転送(キャッシュ変更なし)
v4fmaddps	AVX512	単精度積和演算をまとめて行う(1度に4回)AVX512_4FMAの場合のみ
vfmadd231ps	AVX512	単精度積和演算をまとめて行う
vpxord	AVX512	ビットごとの排他的論理和
prefetcht0	汎用	プリフェッチ(キャッシュに書き込むか否か)

CPUの機能一覧

CPUの機能識別のためmayiuseという関数が定義されている。DNNLでは、以下のように機能の識別を行っている。

mayiuseの引数	Convのバージョン	Convの処理関数
sse41
avx
avx2
avx512_common		common系winograd
avx512_core	ver_avx512_core	core系winograd
avx512_core_vnni	ver_vnni	core系winograd
avx512_mic		common系winograd
avx512_mic_4ops	ver_4fma	common系winograd
avx512_core_bf16

なお、Convでは、デフォルトは、ver_fmaと設定される。それ以外は、上記のフラグになる。

include
- CPU機能の変数定義 dnnl_types.h
src/cpu
- CPU機能のチェック関数(mayiuse) cpu_isa_trais.h
- JITでの畳み込み(Conv)版数 jit_primitive_conf.hpp
Advanced Vector Extensions 512 (AVX-512) - x86

コンパイラの違い

GNUコンパイラと、Intelコンパイラの違いを、コード観点とコンパイルオプション観点で見てみる。はじめに、コード観点で見る。コンパイラごとに定義されている主な変数は以下の通りである。

__GNUC__
__INTEL_COMPILER
__clang__

畳み込み演算でコンパイラの違いを見ると、データの移動(ロードとストア)のみが異なる。以下に示すコードの、前者がインテルコンパイラであり、後者がGCCである。インテルのコンパイラの場合、キャッシュを触らない命令等を使っている。このため、キャッシュを考慮した最適化を行っていることがわかる。Intelの資料を見ると、C言語での宣言が命令セットに反映されることがわかる。特に、_mm512_stream_ps(CPU命令vmovntps)および_mm512_stream_pd(CPU命令vmovntpd)は、GCCから出力できない。(コンパイラソースコードのMachine Descriptorであるsse.mdおよびi386.mdを確認した)

void inline load_ps(float *dest, const float *src_mem) {
# ifdef __INTEL_COMPILER
    __m512 *Iv512 = (__m512 *)dest;
    Iv512[0] = _mm512_load_ps(src_mem);
# else
    PRAGMA_OMP_SIMD()
    for (int v = 0; v < simd_w; v++)
        dest[v] = src_mem[v];
# endif
}

次に、コンパイルオプション観点で見ると、静的バイナリにするなどの違いがある。cmake/platform.cmakeを参照のこと

    elseif("${CMAKE_CXX_COMPILER_ID}" STREQUAL "GNU")
        set(DEF_ARCH_OPT_FLAGS "-msse4.1")
        # suppress warning on assumptions made regarding overflow (#146)
        append(CMAKE_CCXX_NOWARN_FLAGS "-Wno-strict-overflow")
    elseif(CMAKE_CXX_COMPILER_ID STREQUAL "Intel")
        set(DEF_ARCH_OPT_FLAGS "-xSSE4.1")
        # workaround for Intel Compiler that produces error caused
        # by pragma omp simd collapse(..)
        append(CMAKE_CCXX_NOWARN_FLAGS "-diag-disable:13379")
        append(CMAKE_CCXX_NOWARN_FLAGS "-diag-disable:15552")
        # disable `was not vectorized: vectorization seems inefficient` remark
        append(CMAKE_CCXX_NOWARN_FLAGS "-diag-disable:15335")
        # disable: foo has been targeted for automatic cpu dispatch
        append(CMAKE_CCXX_NOWARN_FLAGS "-diag-disable:15009")
    endif()
endif()

if(UNIX OR MINGW)
    if(CMAKE_CXX_COMPILER_ID STREQUAL "Intel")
        # Link Intel libraries statically (except for iomp5)
        append(CMAKE_SHARED_LINKER_FLAGS "-liomp5 -static-intel")
        # Tell linker to not complain about missing static libraries
        append(CMAKE_SHARED_LINKER_FLAGS "-diag-disable:10237")
    endif()
endif()

GNU Compiler
- 個別定義箇所
Intel Compiler
- 個別定義箇所その1
- 個別定義箇所その2

参考資料

版数履歴

版数	リリース日	概要
2.0		開発中
1.2		開発中
1.1	2019/10/4
1.0	2019/7/13
0.21	2019/9/17	TensorFlowは、v0.21.2
0.20	2019/6/29	PyTorchは、v0.20.3
0.19	2019/5/15

使う資源については、IntelMKLDNN資料参照

参考資料

説明資料

DNNLの各種情報

DNNL Documentation
Primitives and their implementations　DNNLのJIT実装状況

ソースコードを読むために

MKL-DNNで学ぶIntel CPUの最適化手法(2019/4)
- Convolution Winogradでのキャッシュ最適化の説明など
Xbyakで始めるx86(IA-32)入門(2007/09)

その他

ソースコード

Deep Neural Network Library (DNNL)
- Releases
- include ヘッダーファイル
- src/common CPU/GPU共通コード
- src/cpu CPU側の処理コード
  - jit_generator.hpp
- src/ocl GPU側の処理コード

DNNL共通コード

src/common
- dnnl_thread.hpp
- dnnl_thread_parallel_nd.hpp
  - parallel_nd等の並列化関数定義
- c_types_map.hpp
  - primitive_desc_tの定義箇所
- engine.hpp
  - dnnlの処理をつかさどるエンジン
    - メモリ割り当て、処理エンジン生成、キャッシュの提供を行う。(memory allocation, primitive_desc_t creator, primitive cache)

Convolution周り

(文書)DNNL Convolution
ソースコード
- History for mkl-dnn/src/cpu/jit_avx512_common_convolution_winograd.cpp
- CPUのキャッシュメモリを考慮したコード(prefetchの考慮)
  - jit_avx512_common_conv_winograd_kernel_f32.cpp
  - jit_avx512_core_f32_wino_conv_4x3_kernel.cpp
(主な)パッチ
- cpu: conv: wino: rename jit_avx512_core_conv_winograd -> jit_avx512_core_fp32_wino_conv_4x3
  - jit_avx512_core_fp32_wino_conv_4x3_fwd_tに改名された。v0.16から
- cpu: conv: wino: winograd 2x3 for fp32 inference
  - jit_avx512_core_fp32_wino_conv_2x3_fwd_tが導入された。v0.16から
- cpu: conv: optimize winograd gemm for SKX, FWD and BWD_D
  - jit_avx512_core_convolution_winograd_fwd_tが導入された。v0.14から
- cpu: conv: avx512_common: adding Winograd kernels
  - Winograd(jit_avx512_common_convolution_winograd_fwd_t等)が、初めて導入された。v0.10から

CPU仕様書

Intel® Xeon® Platinum 9282 Processor FMAが2個
- (Forum)AVX512 slower than AVX2 with Intel MKL dgemm on Intel Gold 5118
Intel® 64 and IA-32 Architectures Software Developer Manuals
- 命令セットとそのコードについては、Opcode mapを参照のこと(Volume 2 Appendix A)
Intel® 64 and IA-32 Architectures Optimization Reference Manual

コンパイラ資料

Intel® C++ Compiler 19.1 Developer Guide and Reference
Submitted December 16, 2019
- _may_i_use_cpu_feature Intel DNNLには、mayiuseという関数があるが、それに近い。
Using the GNU Compiler Collection (GCC)
- 3.18.59 x86 Options

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up