LoginSignup
1
2

More than 5 years have passed since last update.

golang で reservoir sampling

Last updated at Posted at 2018-03-29

なにそれ

大量のテキストからランダムに行を抽出したい時のアレ。

詳細はこの辺を。
http://sucrose.hatenablog.com/entry/2014/01/11/004615

まぁ書き直しただけなんですけど。

コード

reservoir_sampling.go
package main

import (
    "bufio"
    "flag"
    "fmt"
    "log"
    "math/rand"
    "os"
    "time"
)

var (
    seed       = flag.Int64("s", time.Now().UnixNano(), "random seed")
    samplesNum = flag.Int("n", 1000, "samples num")
)

func main() {
    flag.Parse()

    rand.Seed(*seed)
    k := *samplesNum

    reservoir := make([]string, k)
    scanner := bufio.NewScanner(os.Stdin)

    for i := 0; i < k; i++ {
        if !scanner.Scan() {
            log.Printf("input line less than %d", k)
            break
        }
        reservoir[i] = scanner.Text()
    }

    n := k

    for scanner.Scan() {
        n++
        r := rand.Intn(n)  // [0, n)
        if r < k {
            reservoir[r] = scanner.Text()
        }
    }
    if err := scanner.Err(); err != nil {
        log.Fatal(err)
    }

    for _, s := range reservoir {
        if s != "" {
            fmt.Fprintln(os.Stdout, s)
        }
    }
}

ビルドして

bash
go build reservoir_sampling.go

実行

bash
$ wc -l data.csv
16357121 data.csv

$ time cat data.csv | ./reservoir_sampling -n 10000 | wc -l
10000

real  0m11.707s
user  0m2.691s
sys 0m10.424s

1600万件くらいの中から10000件抽出するのに10秒ちょっと。
手元のmacで python3 だと1分くらいかかったのでわりと満足。

1
2
0

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
1
2