Mysterious behavior of OpenCV's parallelization framework

OpenCV

Posted at 2020-12-03, last updated at 2020-12-03

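Introduction

This article is the day 4 entry of the OpenCV Advent Calendar 2020.

So, is everyone using cv::parallel_for_, OpenCV's general-purpose parallelization framework?
I hardly use it myself (lol).

Below I summarize the pitfalls I fell into when using this framework with Visual Studio.
The parallel backend is the default one, Concurrency.
Note that the behavior changes if you switch to a different backend.

The OpenCV versions I tested are 4.3 and 4.5.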

Parallelization configuration functions

This article covers the following three functions, which retrieve parallelization-related counts and IDs.

cv::getNumberOfCPUs()
cv::getNumThreads()
cv::getThreadNum()

cv::getNumberOfCPUs()

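cv::getNumberOfCPUs() returns the number of logical CPUs available to the process, i.e. the upper bound on the thread count.
The closest OpenMP counterpart is omp_get_num_procs(); omp_get_max_threads() returns the same value by default, but it follows omp_set_num_threads(), whereas this function ignores cv::setNumThreads().
On a 4-core, 8-thread machine it returns 8.
Even after cv::setNumThreads(1) tells OpenCV to use only one thread, it still returns 8.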

cv::getNumThreads()

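cv::getNumThreads() returns the configured maximum number of threads.
The name suggests an equivalent of OpenMP's omp_get_num_threads(), but the behavior differs.

In OpenMP, omp_get_num_threads() returns the number of threads in the current parallel team, so it returns 1 outside a parallel region.
This function, on the other hand, always returns the configured maximum number of threads.
After cv::setNumThreads(1) tells OpenCV to use only one thread, it always returns 1.

Already the behavior is not what you would expect.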

cv::getThreadNum()

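Has anyone actually used cv::getThreadNum()?
It returns the number (ID) of the calling thread while code is running in parallel.
Its OpenMP counterpart is omp_get_thread_num().
That is easy to confuse with omp_get_num_threads(), but remembering "number of threads" versus "thread number" should help.

Normally, on an 8-core machine running 8 threads, OpenMP's omp_get_thread_num() returns a value from 0 to 7.
It is a 0-based index.
Knowing this ID lets you allocate memory per thread (per core), so the function matters when you optimize memory allocation.
(Leaving aside the fact that ordinary people never bother with optimal memory allocation on multicore machines...)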
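As an illustration of why a dense 0-based thread ID is useful, here is a minimal OpenMP sketch of per-thread buffers. sumWithPerThreadSlots is a hypothetical name, and the _OPENMP guards are only there so that the sketch also builds (and runs single-threaded) when OpenMP is not enabled.

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Sums data using one accumulator slot per thread, indexed by the thread number.
long long sumWithPerThreadSlots(const std::vector<int>& data)
{
    int nthreads = 1;
#ifdef _OPENMP
    nthreads = omp_get_max_threads(); // upper bound on the team size used below
#endif
    std::vector<long long> partial(nthreads, 0);
    const int n = static_cast<int>(data.size());

#pragma omp parallel
    {
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num(); // 0 .. team size - 1, so always a valid slot
#endif
#pragma omp for
        for (int i = 0; i < n; ++i)
            partial[tid] += data[i];
    }

    // Combine the per-thread results after the parallel region.
    long long total = 0;
    for (long long p : partial)
        total += p;
    return total;
}
```

This pattern only works because the OpenMP thread number is guaranteed to be smaller than the team size. Adjacent slots may suffer from false sharing; padding is omitted to keep the sketch short.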

Now, to do the same thing with parallel_for_, let's try cv::getThreadNum().
Running over a 512x512 image with 8-way parallelism and printing each stripe's range together with the thread ID gives the following.

thread ID: 1 start: 0 end: 64
thread ID: 2 start: 64 end: 128
thread ID: 3 start: 128 end: 192
thread ID: 4 start: 192 end: 256
thread ID: 0 start: 448 end: 512
thread ID: 5 start: 256 end: 320
thread ID: 6 start: 384 end: 448
thread ID: 7 start: 320 end: 384

Well, that looks about as expected, but occasionally it behaves like this.

thread ID: 5 start: 0 end: 64
thread ID: 1 start: 64 end: 128
thread ID: 2 start: 128 end: 192
thread ID: 4 start: 192 end: 256
thread ID: 0 start: 448 end: 512
thread ID: 6 start: 320 end: 384
thread ID: 3 start: 256 end: 320
thread ID: 8 start: 384 end: 448

8... 8???
Yes. The IDs returned range over 0 to 8, not 0 to 7.
What is 8? Nine thread IDs for 8 threads: what happened?

Let's run it on a bigger machine, where cv::getNumThreads() reports 36 threads.

number of threads: 36
thread ID: 0 start: 498 end: 512
thread ID: 0 start: 484 end: 498
thread ID: 0 start: 469 end: 484
thread ID: 1 start: 0 end: 14
thread ID: 6 start: 28 end: 43
thread ID: 3 start: 43 end: 57
thread ID: 4 start: 57 end: 71
thread ID: 8 start: 71 end: 85
thread ID: 7 start: 85 end: 100
thread ID: 21 start: 100 end: 114
thread ID: 5 start: 114 end: 128
thread ID: 10 start: 128 end: 142
thread ID: 11 start: 142 end: 156
thread ID: 12 start: 156 end: 171
thread ID: 13 start: 171 end: 185
thread ID: 14 start: 185 end: 199
thread ID: 15 start: 199 end: 213
thread ID: 16 start: 213 end: 228
thread ID: 17 start: 228 end: 242
thread ID: 18 start: 242 end: 256
thread ID: 19 start: 256 end: 270
thread ID: 20 start: 270 end: 284
thread ID: 9 start: 284 end: 299
thread ID: 22 start: 299 end: 313
thread ID: 23 start: 313 end: 327
thread ID: 24 start: 327 end: 341
thread ID: 25 start: 341 end: 356
thread ID: 26 start: 356 end: 370
thread ID: 27 start: 370 end: 384
thread ID: 2 start: 14 end: 28
thread ID: 28 start: 384 end: 398
thread ID: 29 start: 398 end: 412
thread ID: 30 start: 412 end: 427
thread ID: 31 start: 427 end: 441
thread ID: 32 start: 441 end: 455
thread ID: 33 start: 455 end: 469

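0 appears three times, and 34 and 35 never show up...

In other words, code like the following can crash with an access violation as soon as an ID falls outside the buffer (the 8 above), and duplicated IDs (the three 0s) can make several stripes share one slot.

std::vector<int> buffer(cv::getNumThreads());
buffer[cv::getThreadNum()] = 0; // out of range whenever the ID >= cv::getNumThreads()

How can this be avoided?
I went as far as the reckless move of calling omp functions inside a region run by parallel_for_, but unsurprisingly that did not work properly.
The only option seems to be giving up on per-thread-ID allocation and deriving an index from Range range by a rule of your own (sad).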

So, what does the code actually look like? Our expedition team set out to... (you know the rest)

/** @brief Returns the index of the currently executed thread within the current parallel region. Always
returns 0 if called outside of parallel region.

@deprecated Current implementation doesn't corresponding to this documentation.

The exact meaning of the return value depends on the threading framework used by OpenCV library:
- `TBB` - Unsupported with current 4.1 TBB release. Maybe will be supported in future.
- `OpenMP` - The thread number, within the current team, of the calling thread.
- `Concurrency` - An ID for the virtual processor that the current context is executing on (0
  for master thread and unique number for others, but not necessary 1,2,3,...).
- `GCD` - System calling thread's ID. Never returns 0 inside parallel region.
- `C=` - The index of the current parallel task.
@sa setNumThreads, getNumThreads
 */
CV_EXPORTS_W int getThreadNum();

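Now, read it carefully.
It says @deprecated.
The function is marked as not recommended.
The expedition ends at the header, without even looking at the implementation...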

Concurrency - An ID for the virtual processor that the current context is executing on (0
for master thread and unique number for others, but not necessary 1,2,3,...).
This appears to be what explains the behavior above.
I wish someone had told me sooner...

Summary

With Visual Studio (the Concurrency backend), use cv::getThreadNum() with care.
If you build OpenCV with OpenMP as the parallel framework, it behaves as expected.

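Aside

The parallel_for_ framework is a general-purpose parallelization library, but these days a compiler without OpenMP support is the rare case.
Complex reductions and the like are easier to write with this framework, but in most cases OpenMP can do the job, and the OpenMP code is far cleaner.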
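To give a sense of how compact the OpenMP side can be for a simple case, here is a minimal sketch of a sum reduction. sumReduction is a hypothetical name; without OpenMP enabled the pragma is ignored and the loop simply runs serially.

```cpp
#include <vector>

// Sums all elements; OpenMP keeps a private copy of total per thread
// and combines the copies at the end of the loop.
long long sumReduction(const std::vector<int>& v)
{
    long long total = 0;
    const int n = static_cast<int>(v.size());
#pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; ++i)
        total += v[i];
    return total;
}
```

The same computation with parallel_for_ needs a body class or lambda plus explicit handling of the partial results.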

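Also, if you compile with this framework on an 8-core machine and run the binary on a 36-core machine, it properly runs 36-way parallel as a 36-core program.
The same holds for OpenMP: a binary compiled with OpenMP and handed to someone else also adapts to the difference in core count.

So I sometimes think that OpenMP is good enough...

Note that printing (printf and friends) from parallel code requires a critical section.
With parallel_for_, use cv::AutoLock and cv::Mutex.
With OpenMP, the following directive turns the scoped block into a critical section.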

#pragma omp critical
{
	cout << "hoge" << endl;
}

Code used

#include <opencv2/opencv.hpp>
#include <iostream>
#include <omp.h>

using namespace cv;
using namespace std;
class ParallelTestInvoker : public cv::ParallelLoopBody
{
private:
	const int thread_max;
	const Mat& src;
	cv::Mat& dest;
	mutable cv::Mutex mutex; // mutable so that the const operator() can lock it
public:
	ParallelTestInvoker(const Mat& _src, Mat& _dest)
		: thread_max(cv::getNumThreads()), src(_src), dest(_dest)
	{
		dest.create(src.size(), src.type());
		cout << "number of threads: " << thread_max << endl;
	}


	void operator()(const Range& range) const override
	{
		// With the Concurrency backend this ID is not guaranteed to lie in [0, thread_max).
		const int nthread = cv::getThreadNum();

		//if (nthread >= 8)
		{
			cv::AutoLock a(mutex); // serialize printing across threads
			cout << "thread ID: " << nthread << " start: " << range.start << " end: " << range.end << endl;
		}
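		for (int i = range.start; i != range.end; i++)
		{
			// copy row i pixel by pixel (single-channel CV_8U assumed)
			const uchar* s = src.ptr<uchar>(i);
			uchar* d = dest.ptr<uchar>(i);
			for (int x = 0; x < src.cols; x++) d[x] = s[x];
		}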
	}
	void run(const int iteration)
	{
		for (int i = 0; i < iteration; i++)
		{
			parallel_for_(Range(0, src.rows), *this, thread_max);
			cout << "" << endl;
		}
		cout << "================" << endl;
	}
};

void ParallelForTest(Mat& src, Mat& dest, const int iteration)
{
	ParallelTestInvoker ptest(src, dest);
	ptest.run(iteration);
}

void OMPTest(Mat& src, Mat& dest, const int iteration)
{
	dest.create(src.size(), src.type());
	int thread_max = omp_get_max_threads();
	cout << "omp_get_num_procs  | " << omp_get_num_procs() << endl;
	cout << "omp_get_max_threads| " << omp_get_max_threads() << endl;
	cout << "omp_get_num_threads| " << omp_get_num_threads() << endl;
	cout << "start parallel" << endl;
	cout << "omp_get_thread_num/omp_get_num_threads" << endl;
	for (int it = 0; it < iteration; it++)
	{
#pragma omp parallel for
		for (int i = 0; i < src.rows; i++)
		{
#pragma omp critical
			{
				cout << omp_get_thread_num()<<"/"<< omp_get_num_threads()<< endl;
			}
		}
	}
}

int main()
{

	Mat a(512, 512, CV_8U);
	Mat b;
	cout << "omp_test" << endl;
	OMPTest(a, b, 1);

	cout << "parallel_for__test" << endl;
	ParallelForTest(a, b, 1);
	
	return 0;
}
