
[ARM64] Speed Comparison of Mutex vs. LDAXR/STLXR [Mutual Exclusion]

Last updated at 2022-02-05. Posted at 2022-02-05.

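Introduction

On ARM64, mutual exclusion can reportedly be implemented with the LDAXR and STLXR machine instructions. I was curious how much the speed differs between using a Mutex or std::atomic and using these machine instructions directly, so I ran a comparison.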

Benchmark program

The code below is the benchmark program used in this article. It creates two threads, each of which adds the values 1 through 10000000 to sum.

#include <inttypes.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define N 10000000

uint64_t a[N + 1];  // a[0] stays 0; a[1..N] hold 1..N
uint64_t sum;       // shared accumulator, intentionally unprotected here

void *thread(void *arg) {
  (void)arg;
  for (int i = 0; i <= N; i++) {
    sum += a[i];  // unsynchronized read-modify-write (data race)
  }
  return NULL;
}

int main(void) {
  for (uint64_t i = 1; i <= N; i++) {
    a[i] = i;
  }

  pthread_t t[2];
  for (int i = 0; i < 2; i++) {
    pthread_create(&t[i], NULL, thread, NULL);
  }
  for (int i = 0; i < 2; i++) {
    pthread_join(t[i], NULL);
  }
  printf("%" PRIu64 "\n", sum);
  return 0;
}

Without mutual exclusion, the result varies from run to run, as shown below. The correct result is 100000010000000, and it was never obtained in any run.

hasegawa@ubuntu:~$ ./a.out
50722387386002
hasegawa@ubuntu:~$ ./a.out
50157295844611
hasegawa@ubuntu:~$ ./a.out
50047573967763
hasegawa@ubuntu:~$ ./a.out
50528040424272
hasegawa@ubuntu:~$ ./a.out
50307142328740
hasegawa@ubuntu:~$ ./a.out
50532118081473
hasegawa@ubuntu:~$ ./a.out
50464761737265
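
The correct value quoted above follows from the arithmetic series: each thread adds 1 + 2 + ... + N = N(N+1)/2, and two threads do so, giving N(N+1). The following is a minimal sketch of that calculation; the helper name expected_sum is hypothetical and does not appear in the original program.

```c
#include <stdint.h>

// Hypothetical helper (not part of the original program): the total that
// `threads` threads should produce when each one adds 1..n to the shared sum.
uint64_t expected_sum(uint64_t n, uint64_t threads) {
  // Each thread contributes the arithmetic series n * (n + 1) / 2.
  return threads * (n * (n + 1) / 2);
}
```

Calling expected_sum(10000000, 2) gives the reference value against which the printed results can be compared.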

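When multiple threads read and write the variable sum at the same time, updates are lost and the value becomes inconsistent, which causes this behavior. Preventing it requires mutual exclusion, for example with a Mutex.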
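
To make the lost update concrete, the sketch below replays one possible interleaving in a single thread. It assumes that sum += x is carried out as a separate load, add, and store, which is how such a statement is usually compiled; the function and variable names are illustrative and not taken from the benchmark.

```c
#include <stdint.h>

// Deterministic replay of a lost update. Two "threads" each want to add a
// value to sum, but both load sum before either one stores its result.
// All names here are illustrative and not taken from the original program.
uint64_t lost_update_demo(uint64_t start, uint64_t x1, uint64_t x2) {
  uint64_t sum = start;

  uint64_t reg1 = sum;  // thread 1 loads sum
  uint64_t reg2 = sum;  // thread 2 loads the same, soon stale, value
  reg1 += x1;           // thread 1 adds in its own register
  reg2 += x2;           // thread 2 adds in its own register
  sum = reg1;           // thread 1 stores its result
  sum = reg2;           // thread 2 stores and overwrites thread 1's update

  return sum;           // equals start + x2: the x1 addition is lost
}
```

With a start value of 0 and additions of 5 and 7, the function returns 7 rather than 12, which is the same kind of shortfall seen in the unstable outputs.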

Speed of pthread_mutex_lock/pthread_mutex_unlock

This section measures the speed when a Mutex is used for mutual exclusion. The code with the Mutex added is shown below.

#include <inttypes.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define N 10000000

uint64_t a[N + 1];  // a[0] stays 0; a[1..N] hold 1..N
uint64_t sum;       // shared accumulator, protected by mtx

pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

void *thread(void *arg) {
  (void)arg;
  for (uint64_t i = 0; i <= N; i++) {
    pthread_mutex_lock(&mtx);    // enter critical section
    sum += a[i];
    pthread_mutex_unlock(&mtx);  // leave critical section
  }
  return NULL;
}

int main(void) {
  for (uint64_t i = 1; i <= N; i++) {
    a[i] = i;
  }

  pthread_t t[2];
  for (int i = 0; i < 2; i++) {
    pthread_create(&t[i], NULL, thread, NULL);
  }
  for (int i = 0; i < 2; i++) {
    pthread_join(t[i], NULL);
  }
  printf("%" PRIu64 "\n", sum);
  return 0;
}

The Mutex protects the part of the function thread that updates the variable sum.

Measuring the run time 10 times with Bash's time command gave an average execution time of 624 ms.
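
Bash's time covers the whole process, including array initialization and thread creation. As an alternative that was not used in the article, a region can be timed from inside the program. The sketch below assumes C11's timespec_get is available (POSIX clock_gettime with CLOCK_MONOTONIC would be a common substitute); the helper names elapsed_ms and time_call_ms are hypothetical.

```c
#include <time.h>

// Milliseconds between two timespec readings (end minus start).
double elapsed_ms(struct timespec start, struct timespec end) {
  return (double)(end.tv_sec - start.tv_sec) * 1000.0 +
         (double)(end.tv_nsec - start.tv_nsec) / 1e6;
}

// Hypothetical helper: wall-clock time of a single call to fn(), in ms.
// timespec_get reads the calendar clock, so a clock adjustment during the
// call would distort the result.
double time_call_ms(void (*fn)(void)) {
  struct timespec start, end;
  timespec_get(&start, TIME_UTC);
  fn();
  timespec_get(&end, TIME_UTC);
  return elapsed_ms(start, end);
}
```

Wrapping only the thread creation and join in a function passed to time_call_ms would exclude the array setup from the measurement.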

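Speed of the LDAXR and STLXR instructions

LDAXR and STLXR are ARM64 machine instructions. Using machine instructions from C requires inline assembly. The code used for this measurement is shown below.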

#include <inttypes.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define N 10000000

uint64_t a[N + 1];  // a[0] stays 0; a[1..N] hold 1..N
uint64_t sum;       // shared accumulator, updated only through add()

// Atomically add i to sum with an exclusive load/store retry loop.
void add(uint64_t i) {
  uint64_t *p_sum = &sum;
  __asm__ volatile("1: ldaxr x0, [%[sum_addr]]\n"   // x0 = sum (exclusive)
                   "add x0, x0, %[i]\n"             // x0 += i
                   "stlxr w1, x0, [%[sum_addr]]\n"  // try store; w1 = status
                   "cbnz w1, 1b"                    // retry if store failed
                   : [sum_addr] "+r"(p_sum)
                   : [i] "r" (i)
                   : "x0", "x1", "cc", "memory");
}

void *thread(void *arg) {
  (void)arg;
  for (uint64_t i = 0; i <= N; i++) {
    add(a[i]);
  }
  return NULL;
}

int main(void) {
  for (uint64_t i = 1; i <= N; i++) {
    a[i] = i;
  }

  pthread_t t[2];
  for (int i = 0; i < 2; i++) {
    pthread_create(&t[i], NULL, thread, NULL);
  }
  for (int i = 0; i < 2; i++) {
    pthread_join(t[i], NULL);
  }
  printf("%" PRIu64 "\n", sum);
  return 0;
}

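The function add is the addition routine written with inline assembly, and it is where the mutual exclusion happens. It uses the syntax known as extended inline assembly. The colon-separated sections are, from top to bottom, the outputs, the inputs, and the clobbered registers.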
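
As a smaller illustration of the three colon-separated sections, the sketch below adds two integers with a single instruction. It is not part of the benchmark: the function name add_u64 and the operand names res, lhs, and rhs are made up for this example, and a plain C fallback is included so that the snippet also builds on non-ARM64 targets.

```c
#include <stdint.h>

// Illustrative only: a + b computed with one ARM64 add instruction.
uint64_t add_u64(uint64_t a, uint64_t b) {
#if defined(__aarch64__)
  uint64_t result;
  __asm__("add %[res], %[lhs], %[rhs]"
          : [res] "=r"(result)          // outputs: written by the asm
          : [lhs] "r"(a), [rhs] "r"(b)  // inputs: read by the asm
          : /* no clobbers: nothing else is modified */);
  return result;
#else
  return a + b;  // fallback for targets without this instruction
#endif
}
```

Here the compiler picks the registers for every operand, so the clobber list can stay empty; the benchmark's add names x0 and w1 directly and therefore has to list them as clobbered.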

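In the assembly, ldaxr first loads sum into register x0, and the function argument i is added to it. Next, stlxr stores the result back to sum and writes a status value to w1: 0 if the store succeeded, 1 if exclusive access to sum was lost between the ldaxr and the stlxr, for example because another thread wrote to it. The final cbnz tests that status and, if it is nonzero, branches back to the ldaxr line to repeat the operation.

Measuring the execution speed of this code with Bash's time command gave an average execution time of 248 ms.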
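
For comparison, the same load, add, and conditional-store pattern can be written portably with C11 <stdatomic.h>, the C counterpart of the std::atomic mentioned in the introduction. This variant was not measured in the article; the names atomic_sum, add_atomic, add_cas_loop, and read_sum are illustrative. On ARM64, compilers commonly lower these operations to an exclusive load/store loop or to a single atomic instruction, depending on the target flags, but that is an assumption about the toolchain rather than something the article verified.

```c
#include <stdatomic.h>
#include <stdint.h>

// Shared counter; a stand-in for the article's `sum` (name is illustrative).
static _Atomic uint64_t atomic_sum;

// Atomically add v in one call; the compiler chooses the instructions.
void add_atomic(uint64_t v) {
  atomic_fetch_add(&atomic_sum, v);
}

// The same update as an explicit retry loop, mirroring the assembly:
// load the old value, compute the new one, and retry if the store fails.
void add_cas_loop(uint64_t v) {
  uint64_t old = atomic_load(&atomic_sum);
  while (!atomic_compare_exchange_weak(&atomic_sum, &old, old + v)) {
    // On failure, `old` now holds the current value; try again with it.
  }
}

// Read the current total.
uint64_t read_sum(void) {
  return atomic_load(&atomic_sum);
}
```

Replacing the call to add in the benchmark's thread function with add_atomic would give a third variant to time alongside the Mutex and inline-assembly versions.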

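Summary

This article compared the speed of mutual exclusion using a Mutex with that of using machine instructions. For the benchmark program above, using the machine instructions (STLXR, LDAXR) was about 2.5 times faster (248 ms versus 624 ms).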
