Amazon Redshiftを試してみる #redshift

Amazon Redshiftを試してみます。

データの準備

毎度お馴染み。Amazon EMRのImpalaのテスト用に用意されているジェネレーターが生成した1GBx3個のファイルを利用します。
EC2上とかでもいいですが、今回はローカルのMacをクライアントにしたいので、ツールをローカルにダウンロード。

wget http://elasticmapreduce.s3.amazonaws.com/samples/impala/dbgen-1.0-jar-with-dependencies.jar

データを生成します。

java -cp dbgen-1.0-jar-with-dependencies.jar DBGen -p ./ -b 1 -c 1 -t 1

上記の場合、books 1G,customer 1G, transanction 1Gの計3GBのデータができます。
実行にはJavaが必要です。

データの用意・アップロード

S3に置いておくのがお作法のようなので、アップロードします。保存用のバケットを作成し、必要に応じてディレクトリなどを作成しておきます。

特にツールがなくても、ブラウザ経由でアップロードできます。

くどいようですが、何でAzureのBlobはアップロード機能をつけないのでしょうか？

アップロード完了しました。

諸事情によりtransactionのファイル名がtranzaction(z)となっています。

クライアントツールの準備

ご存知RedshiftはPostgreSQL互換なので、クライアントツールとしては、psqlが使えます。
私は、Mac環境なので、brewを使ってインストールしておきます。なお、Redshiftは8.x互換（最新は9.x)なので、8.xをインストールしておきます。

本当はclientツールだけでいいのですが、わからないので、PostgreSQLごとインストールしました。

brew install brew install postgresql8

psql --version
psql (PostgreSQL) 8.4.22

Redshiftクラスタの構築

Redshiftメニューを選択し、[Launch Cluster]ボタンをクリックし、クラスタの構築を進めます。
クラスタ名、DB名、ログインID,PWの設定などを行ないます。portはとりあえず標準のままにします。

クラスタのタイプを選択します。
ここでは最も低価格な（デフォルトの）構成を採用してみます。Redshiftの本領を発揮させるためにはノード数を増やした方がいいでしょう。

dc1.largeは100円換算で25円/時. 1ヶ月18,750円程度。競合？のAzure SQL Data Warehouseは71.4円/時。最低金額はRedshiftの方が安いですね。

その他、設定をします。とりあえず標準のまま進めます。

情報を確認し、[Launch Cluster]をクリックします。

作成準備ができたので[close]します。

5分程で、availableとなります。

クラスタ名をクリックして情報を確認します。
Endpointにサーバ名:ポート番号、セキュリティーグループ（Firewall）、DB名、ログイン情報などが表示されています。

Redshiftを使ってみる

Firewallの設定をする

クラスター完成後は、Firewallにより、どこからもアクセスできなくなっています。
Security Groupのリンクをクリックし、Security Group設定ページに移動し、5439ポートを開放します。

マイIPを選ぶことで今利用しているクライアントのIPを追加できるようです。

psql -h blueshift.cpcgfwtyukro.ap-northeast-1.redshift.amazonaws.com -p 5439 -U admin -d testdb
Password for user admin:
psql (8.4.22, server 8.0.2)
WARNING: psql version 8.4, server version 8.0.
         Some psql features might not work.
SSL connection (cipher: ECDHE-RSA-AES256-SHA, bits: 256)
Type "help" for help.

testdb=#
testdb=#

dbname=#というプロンプトがでればOKです。

PostgreSQLの最低管理コマンド

create tableやselectなどに加え、下記コマンドを知っていればとりあえず何とかなるでしょう。
コマンドは[\command(word)]という書式を取ります。

#show database
\l

#show tables
\d

#desc
\d table_name

#exit
\q

テーブルの作成

MySQLと違い、コンソールで#とするとエラーになりますね。取って実行して下さい。

#books
create table books(
    id bigint not null distkey sortkey,
    isbn varchar(128),
    category varchar(128),
    publish_date datetime,
    publisher varchar(128),
    price float
);

#customers
create table customers(
    id bigint not null distkey sortkey,
    name varchar(128),
    date_of_birth datetime,
    gender varchar(128),
    state varchar(128),
    email varchar(256),
    phone varchar(128)
);

#transactions
create table transactions(
    id bigint not null distkey sortkey,
    customer_id bigint,
    book_id bigint,
    quantity INT,
    transaction_date datetime
);

データロード

Redshiftプロンプトにおいて、copyコマンドを利用して行ないます。

copy tablename from 's3://bucket/path/to/data' credential 'ws_access_key_id=key;aws_secret_access_key=secret'

という書式です。access_keyとsecretは、「認証情報」メニューで確認・生成できます。

デフォルトで設定されているaccess_keyはsecretを表示できないということで、新しいaccess_keyを生成します。

#books
copy books from 's3://tamabucket/data/books' 
credentials 'aws_access_key_id=xxxxxxxxxx;aws_secret_access_key=yyyyyyyyyy' 
delimiter '|';

#customers
copy customers from 's3://tamabucket/data/customers' 
credentials 'aws_access_key_id=xxxxxxxxxx;aws_secret_access_key=yyyyyyyyyy' 
delimiter '|';

#transactions
copy transactions from 's3://tamabucket/data/transactionz' 
credentials 'aws_access_key_id=xxxxxxxxxx;aws_secret_access_key=yyyyyyyyyy' 
delimiter '|';

ロードは非常に速い印象でした。

クエリ

クエリの時間を表示させたいので、

\timing

として、計測機能をonにします。

クエリ1

select count(*) from books;

Time: 1323.417 ms (1500万件）

クエリ2

select * from books where isbn='2-90323-440-7';

Time: 3582.724 ms

クエリ3

select category,count(*) from books group by category;

Time: 3299.236 ms

クエリ4

select
    transactions.id,
    books.isbn,
    books.category,
    books.price,
    transactions.quantity,
    customers.id,
    customers.name
 from transactions 
 left join books on transactions.book_id = books.id
 left join customers on transactions.customer_id = customers.id
 limit 10;

Time: 6688.158 ms

クエリ5

select
    customers.name,
    sum(books.price) as sum
from (transactions left join books on (transactions.book_id = books.id)) left join customers on (transactions.customer_id = customers.id)
group by customers.name
order by customers.name desc
limit 10;

Time: 6184.344 ms

まとめ

クエリは、2回目移行はキャッシュが聞くので、１回目の速度を計測しました。ただ、数回クラスタを作成し、同じクエリを実行しても、場合により2~3倍程度実行時間が違うことがありました。
今回は合計で3GBと、ビックデータという意味では小さなデータでテストしましたが、だいたいの性質はわかりました。