More than 5 years have passed since last update.

重み付きランダム抽出の実装

Posted at 2016-07-31

目的

重み付きの復元抽出・非復元抽出はほとんど実装例があまりなかったため、実装してみました。

出来ること・出来ないこと

出来ること

標本抽出の要素は総称型で、どんな要素でもWelcome
非復元・復元抽出のいずれにも対応

出来ないこと

今の実装では重みは小数第三位までしか対応していない。
- 改良したとしても、重みの粒度を細かくするとパフォーマンスが低下する。

実装

母集団を表すクラス（Population.java）

Population.java

public class Population<T> {
	private final ArrayList<T> population;
	private final Random rand = new Random();
	private static final BigDecimal ORDER_SHIFT = new BigDecimal(1000);

	private Population(ArrayList<T> population) {
		this.population = population;
	}

	public static <T> Population<T> of(Map<T, BigDecimal> pattern) {
		ArrayList<T> population = new ArrayList<>(1000);
		for(Map.Entry<T, BigDecimal> entry : pattern.entrySet()){
			for(int i = 0, j = entry.getValue().multiply(ORDER_SHIFT).intValueExact(); i < j; i++) {
				population.add(entry.getKey());
			}
		}

		Collections.shuffle(population);

		return new Population<T>(population);
	}

	// 非復元抽出
	public T samplingWithoutReplacement() {
		int index = rand.nextInt(population.size());
		T sample = population.get(index);
		population.removeAll(Collections.singleton(sample));
		/*
		 * 同義ですが、パフォーマンスが劣化したため不採用としました。
		 * （毎回ラムダ式を生成しているため当然か。。。）
		 * population.removeIf(element -> sample.equals(element))
		 */
		return sample;
	}

	// 復元抽出
	public T samplingWithReplacement() {
		int index = rand.nextInt(population.size());
		T sample = population.get(index);
		return sample;
	}
}

呼び出し側（Population.java）

PopulationMain.java

public class PopulationMain {
	public static void main(String[] args) {
		Map<String, BigDecimal> pattern = new HashMap<String, BigDecimal>() {
			{
				put("A", new BigDecimal("0.300"));
				put("B", new BigDecimal("0.200"));
				put("C", new BigDecimal("0.100"));
				put("D", new BigDecimal("0.100"));
				put("E", new BigDecimal("0.100"));
				put("F", new BigDecimal("0.050"));
				put("G", new BigDecimal("0.050"));
				put("H", new BigDecimal("0.050"));
				put("I", new BigDecimal("0.050"));
			}
		};

		long start = System.currentTimeMillis();
		Population<String> population = Population.of(pattern);

		List<String> list1 = new ArrayList<>();
		for (int i = 0; i < 1000000; i++) {
			list1.add(population.samplingWithReplacement());
		}
		list1.stream()
				.collect(Collectors.groupingBy(s -> s, Collectors.counting()))
				.forEach((k, v) -> System.out.println(k + ":" + v));

		List<String> list2 = new ArrayList<>();
		for (int i = 0; i < 100000; i++) {
			Population<String> tmpPopulation = Population.of(pattern);
			list2.add(tmpPopulation.samplingWithoutReplacement());
		}
		list2.stream()
				.collect(Collectors.groupingBy(s -> s, Collectors.counting()))
				.forEach((k, v) -> System.out.println(k + ":" + v));

	}
}

実行結果

復元抽出

100万回のサンプリングを行いました。

標本	重み	抽出数
A	0.300	299645
B	0.200	200083
C	0.100	100023
D	0.100	100277
E	0.100	99930
F	0.050	49842
G	0.050	50015
H	0.050	49981
I	0.050	50204

非復元抽出

ここまで書いておいてなんですが、非復元抽出を証明することって出来ますでしょうか。。。

所感

非復元抽出の証明こそ出来ていませんが、実現は出来ているはず。。。？
愚直な実装ですが、案外パフォーマンスも悪くなく、実用に堪えそうです。
母集団を表すコレクションはArrayListやLinkedList、EclipseCollectionsのFastListを試しましたが、ArrayListが一番高速でした。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up