More than 1 year has passed since last update.

[C++] set の lower_bound に気をつける

Posted at 2023-01-14

Motivation

C++ プログラミングにおいて、配列(std::vector)の lower_bound を使う機会はよくあり、その時は std::lower_bound を使用することになる。
ちなみに、lower_bound は「指定された要素以上の値が現れる最初の位置のイテレータを取得する」ものである。

実は集合(std::set)にも同様の関数があり、std::set::lower_bound が利用可能だ。
たまにしか使わないので、つい set のデータに対しても、vector 用の関数を呼んでしまうことが(私には)あるため、使用方法のおさらいと共に適切な関数を呼ぶための記憶づけとするため本記事を執筆した。set データに対する std::lower_bound の使用を控えるべき理由は、計算量の観点からである。 ¹

Experiment

vector/set それぞれのデータの中で、ランダムに指定した値以上の最小値を求める、という処理を複数回行い、その時間を計測した。

vector/`std::lower_bound` 使用

#include <stdio.h>
#include <stdlib.h>
#include <bits/stdc++.h>
using namespace std;
using namespace std::chrono;

int main(void){

    // データ数および試行回数
    int n = 1e5;

    // データ投入
	vector<int> v;
	for(int i = 0; i < n; i++) {
		v.push_back(i);
	}

	// 計測開始
	auto start = high_resolution_clock::now();

	for(int i = 0; i < n; i++) {
		int r = rand() % n;
		// vector データの中で、ランダム値以上の最小値を求める
		auto it = lower_bound(v.begin(), v.end(), r);
		// cout << "r:" << r << ", *it:" << *it << endl;// 表示は省略
	}

	// 計測終了
	auto stop = high_resolution_clock::now();
	auto duration = duration_cast<milliseconds>(stop - start);

	// 計測時間の表示
	cout << "It took " << duration.count() << " ms." << endl;	
}

set/`std::set::lower_bound` 使用

#include <stdio.h>
#include <stdlib.h>
#include <bits/stdc++.h>
using namespace std;
using namespace std::chrono;

int main(void){

	// データ数および試行回数
	int n = 1e5;

  // データ投入
	set<int> s;
	for(int i = 0; i < n; i++) {
		s.insert(i);
	}

	// 計測開始
	auto start = high_resolution_clock::now();

	for(int i = 0; i < n; i++) {
		int r = rand() % n;
		// set データの中で、ランダム値以上の最小値を求める
		auto it = s.lower_bound(r);
		// cout << "r:" << r << ", *it:" << *it << endl;
	}

	// 計測終了
	auto stop = high_resolution_clock::now();
	auto duration = duration_cast<milliseconds>(stop - start);

	// 計測時間の表示
	cout << "It took " << duration.count() << " ms." << endl;	
}

n=100,000 として、このプログラムを10回試行した平均値は以下のようになった。

vector	set
34.2 ms	28.9 ms

少しだけ set の方が早めだが、あまり大差はないという結果になった。

set/`std::lower_bound` 使用

// イテレータ部分のみ抜粋、その他は省略
auto it = lower_bound(s.begin(), s.end(), r);

次に set に対して、std::lower_bound によって同様の処理を行なった結果、計測時間は 106,486 ms となり、桁違いに遅かった。

set データに対する std::lower_bound が遅い理由

std::lower_bound はデータのある区間内に対して適用される関数であり、その区間はイテレータで指定する。

// 基本系
lower_bound(開始を指すイテレータ(含む), 終了を指すイテレータ(含まない), key);

// v の全てに対して
lower_bound(v.begin(), v.end(), key);
// v の最初から3つ目までの要素に対して
lower_bound(v.begin(), v.begin() + 3, key);

std::lower_bound に指定するのに使用できるイテレータは std::vector::iterator であり、これを含めイテレータにはいくつか種類があり、このイテレータは random-access iterator というタイプのものである。このタイプは、以下のような処理に対応している。これらは全て O(1) による操作である:

今見ている要素の n 個次の要素を見るようにする（it += n）
今見ている要素の n 個前の要素を見るようにする（it -= n）
n 個次の要素を見るイテレータを得る（it + n）
n 個前の要素を見るイテレータを得る（it - n）

vector データのイテレータ(v.begin() など)は std::vector::iterator を返すため、std::lower_bound への適用が正常に行われる。

一方で、set データが返すイテレータは std::set::iterator であり、これは bidirectional iterator というタイプのものである。そのためこれは std::lower_bound の範囲指定に使用できるイテレータの要件を満たさないため、時間がかかってしまうというものである。
具体的には、比較を行う処理時間は O(log(n)) であるが、イテレータの移動にコストがかかるとのことらしい(なぜなら、ランダムアクセスイテレータではないため、n 個先に移動するのに時間がかかるため)
直感的にも、set は木構造のためランダムアクセスできない、というのはイメージしやすいのではないだろうか。

参考にさせていただいた記事:

Conclusion

vector データには、std::lower_bound を
set データには、std::set::lower_bound を使用するようにしよう。

upper_bound の場合も然り。 ↩

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up