1

More than 5 years have passed since last update.

@imaifactory(今井雄太)

big data updates 2017/1/4

Last updated at 2017-01-04Posted at 2017-01-04

新年一発目のHadoop Weeklyをざーっと読んでのひととおりのメモ。
https://hadoopweekly.com/Hadoop-Weekly-198.html

Databricksの2016年ブログポストおさらい。Spark2がリリースされたこともあって、DataFrames、Datasets推しな話が多い。

Integrating Deep Learning Libraries with Apache Spark:

Tensorflowを使った画像解析をSparkで分散実行するという話。学習済みのモデルをExecutorに配って ~~モデルパラレル~~ データパラレルな推論を実行するJupyter Notebookのサンプルがついている。

A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets:

RDD, DataFrames, Datasetsのおさらい。まだDataFramesとDatasetsの明確な使い分けとかちゃんと追いきれていないけど、↓みたいなことができるのはうれしい。

val ds = spark.read.json("/databricks-public-datasets/data/iot/iot_devices.json").as[DeviceIoTData]

Java的な型付けもされているのでスペース効率もあがると。

Introducing GraphFrames

これまであまり触る機会がなかったグラフ処理だけど、DataFramesベースでクエリできるなら気軽に使えそう。

Structured Streaming In Apache Spark

Streamingもこんな感じにStructuredな世界になっているのです。


// Read data continuously from an S3 location
val inputDF = spark.readStream.json("s3://logs")
 
// Do operations using the standard DataFrame API and write to MySQL
inputDF.groupBy($"action", window($"time", "1 hour")).count()
       .writeStream.format("jdbc")
       .start("jdbc:mysql//...")

Hortonworksブログからもいくつか

10 QUESTIONS ON HORTONWORKS DATA CLOUD FOR AWS

Hortonworks Data Cloud。Hortonworks製のEMRやQuoble、みたいな感じかな。違いは、クラウドコントローラーと呼ばれるデプロイアーも自分でEC2上に構築するということ。

　SPARKSQL, RANGER, AND LLAP VIA SPARK THRIFT SERVER FOR BI SCENARIOS TO PROVIDE ROW, COLUMN LEVEL SECURITY, AND MASKING

タイトルが長くて、しかも全部大文字で読みづらくて何を言ってるかわかりづらいんだけど、SparkからLLAPを使いましょうっていう話。そこにRangerを入れるとSparkからのデータアクセスに対してカラムレベルのアクセス制御ができるよと。
これは実際にためしてみた。Zeppelin + Spark + Hive LLAP。

1

Register as a new user and use Qiita more conveniently

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up

1