More than 5 years have passed since last update.

[Oracle Cloud] Data Flow をクイックスタートしてみたその2 (SQL)

Last updated at 2020-04-19Posted at 2020-04-19

はじめに

前回の記事では、OCI Data Flow のクイックスタートその1 をやりました。今回は引き続きその2 をやっていきます。

Qiita 記事 Index

その1 Java
https://qiita.com/sugimount/items/d67f8aa02f0af89299de

その2 SQL
https://qiita.com/sugimount/items/1d8fe15ddbd58f0cdd76

その3 Python
https://qiita.com/sugimount/items/9a4c87493b569f174ba0

クイックスタート手順

以下のドキュメントに書かれている手順を参考にすすめていきます

その1 で作成した Parquet 形式のファイルには、インターネットに公開されている Airbnb の価格情報などが公開されているデータが格納されています。今回のクイックスタートでは、SQL を使ってデータを分析します。

アプリケーションの定義と実行

Data Flow のページで Create Application を押して、クイックスタート用のアプリケーションを作成します

以下のパラメータを入力します。ファイルURLには、チュートリアルとして用意されているSQLファイルを指定します。SQLファイルの中身は後程確認します。

言語 : SQL
名前 : Tutorial Example 2
ファイルURL : oci://oow_2019_dataflow_lab@bigdatadatasciencelarge/usercontent/oow_lab_2019_sparksql_report.sql
location : oci://<bucket>@<namespace>/optimized_listings
- oci://dataflow-logs@nryjxkqe0mhq/optimized_listings (環境に合わせて、bucket名やnamespaceを変更してください)

パラメータを入力したら、作成を押します

実行します

このままのパラメータで実行します

受入れ済

実行完了となったので、Application の STDOUT(標準出力)を確認します

以下の文字列が表示されます

Tempelhof - Schoneberg	1560	0.0	6000.0	96.5724358974359
Marzahn - Hellersdorf	141	0.0	220.0	56.50354609929078
Charlottenburg-Wilm.	1592	10.0	9000.0	114.27072864321607
Treptow - Kopenick	595	10.0	390.0	52.21680672268908
Mitte	4631	0.0	3000.0	69.69941697257612
Lichtenberg	688	10.0	5000.0	56.11337209302326
Pankow	3541	0.0	5000.0	67.82264896921774
Steglitz - Zehlendorf	437	8.0	500.0	56.327231121281464
Reinickendorf	247	10.0	880.0	48.26315789473684
Friedrichshain-Kreuzberg	5497	0.0	8600.0	60.82881571766418
Neukolln	3499	9.0	500.0	46.56730494426979
Spandau	124	12.0	400.0	58.233870967741936

SQLの中身

それでは、実際にクエリーとして投げられたSQLファイルを確認します。以下のインターネットで公開されているものを実行しています。

drop table if exists listings;
create external table listings(
    id bigint, listing_url string, scrape_id string,
    last_scraped string, name string, summary string,
    space string, description string, experiences_offered string,
    neighborhood_overview string, notes string, transit string,
    access string, interaction string, house_rules string,
    thumbnail_url string, medium_url string, picture_url string,
    xl_picture_url string, host_id string, host_url string,
    host_name string, host_since string, host_location string,
    host_about string, host_response_time string, host_response_rate string,
    host_acceptance_rate string, host_is_superhost string, host_thumbnail_url string,
    host_picture_url string, host_neighborhood string, host_listings_count string,
    host_total_listings_count string, host_verifications string, host_has_profile_pic string,
    host_identity_verified string, street string, neighborhood string,
    neighborhood_cleansed string, neighborhood_group_cleansed string, city string,
    state string, zipcode string, market string,
    smart_location string, country_code string, country string,
    latitude string, longitude string, is_location_exact string,
    property_type string, room_type string, accommodates string,
    bathrooms string, bedrooms string, beds string,
    bed_type string, amenities string, square_feet float,
    price float, weekly_price float, monthly_price float,
    security_deposit float, cleaning_fee float, guests_included string,
    extra_people string, minimum_nights string, maximum_nights string,
    calendar_updated string, has_availability string, availability_30 string,
    availability_60 string, availability_90 string, availability_365 string,
    calendar_last_scraped string, number_of_reviews string, first_review string,
    last_review string, review_scores_rating int, review_scores_accuracy int,
    review_scores_cleanliness int, review_scores_checkin int, review_scores_communication int,
    review_scores_location int, review_scores_value int, requires_license string,
    license string, jurisdiction_names string, instant_bookable string,
    is_business_travel_ready string, cancellation_policy string, require_guest_profile_picture string,
    require_guest_phone_verification string, calculated_host_listings_count string, reviews_per_month string
) stored as parquet location '${location}';

-- Do.
select
    neighborhood_group_cleansed,
    count(*) total,
    min(price) min_price,
    max(price) max_price,
    avg(price) avg_price
from listings
group by neighborhood_group_cleansed;

SQL を実行するときに、まずはじめに Table を定義します。create external table という構文で Data Flow の外部にあるデータソースを使って外部テーブルを作成します。

create external table listings

外部にあるデータソースは、次の構文が関係しています。Object Storage にある Parquet 形式のファイルが格納されているディレクトリを指定しています。

stored as parquet location '${location}';

'${location}' は、Data Flow のパラメータとして値を渡しています。

Object Storage の、optimized_listings を指定しています。Parquet ファイルは、optimized_listingsの配下にあるため外部テーブルとして参照されているようです。

外部テーブルを定義したあとに、Select 文を投げています。Select 文の結果が、アプリケーションの標準出力してコンソール上で確認できます。neighborhood_group_cleansedでグループ分けをして、どれくらいの数の宿泊施設があるのか、最大価格、最小価格はいくらなのか、というデータをSelectしています。

select
    neighborhood_group_cleansed,
    count(*) total,
    min(price) min_price,
    max(price) max_price,
    avg(price) avg_price
from listings
group by neighborhood_group_cleansed;

この結果、Data Flow アプリケーションの標準出力として、以下の文字が出力されているわけですね。

Tempelhof - Schoneberg	1560	0.0	6000.0	96.5724358974359
Marzahn - Hellersdorf	141	0.0	220.0	56.50354609929078
Charlottenburg-Wilm.	1592	10.0	9000.0	114.27072864321607
Treptow - Kopenick	595	10.0	390.0	52.21680672268908
Mitte	4631	0.0	3000.0	69.69941697257612
Lichtenberg	688	10.0	5000.0	56.11337209302326
Pankow	3541	0.0	5000.0	67.82264896921774
Steglitz - Zehlendorf	437	8.0	500.0	56.327231121281464
Reinickendorf	247	10.0	880.0	48.26315789473684
Friedrichshain-Kreuzberg	5497	0.0	8600.0	60.82881571766418
Neukolln	3499	9.0	500.0	46.56730494426979
Spandau	124	12.0	400.0	58.233870967741936

これらのSQLの構文は、Spark SQL の構文を調べると良いです。Spark SQL では、SQL2003 と、HiveQL という2つの構文が使えるようです。

参考URL

Spark SQL は、SQL2003 や、HiveQL を使うことが出来るらしい。
https://logmi.jp/tech/articles/321474

SQL 構文に関するドキュメント (AWS, Azure)
https://docs.aws.amazon.com/ja_jp/redshift/latest/dg/r_CREATE_EXTERNAL_TABLE.html
https://docs.microsoft.com/ja-jp/azure/databricks/spark/latest/spark-sql/language-manual/create-table

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up

[Oracle Cloud] Data Flow をクイックスタートしてみた その2 (SQL)