
Trying out spark-sql with Spark installed on GCE. Yes, on Ubuntu.



Running spark-sql


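  • Sorry to have put GCE in the title, but none of this actually requires GCE.

  • That said, both the client and the host it connects to run Ubuntu.

  • Installing Spark was covered in the previous post, so it is skipped here.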


Starting spark-sql

1. Connect with gcloud and start spark-sql.

$ cd $SPARK_HOME

$ sudo ./bin/spark-sql

Spark assembly has been built with Hive, including Datanucleus jars on classpath

Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Unable to initialize logging using hive-log4j.properties, not found on CLASSPATH!
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/07/07 15:32:08 INFO SecurityManager: Changing view acls to: root,
15/07/07 15:32:08 INFO SecurityManager: Changing modify acls to: root,
15/07/07 15:32:08 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root, ); users with modify permissions: Set(root, )
15/07/07 15:32:08 INFO Slf4jLogger: Slf4jLogger started
15/07/07 15:32:08 INFO Remoting: Starting remoting
15/07/07 15:32:08 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@instance-1.c.custom-unison-00000.internal:58022]
15/07/07 15:32:08 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriver@instance-1.c.custom-unison-00000.internal:58022]
15/07/07 15:32:08 INFO Utils: Successfully started service 'sparkDriver' on port 58022.
15/07/07 15:32:08 INFO SparkEnv: Registering MapOutputTracker
15/07/07 15:32:08 INFO SparkEnv: Registering BlockManagerMaster
15/07/07 15:32:08 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20150707153208-86b9
15/07/07 15:32:08 INFO Utils: Successfully started service 'Connection manager for block manager' on port 39311.
15/07/07 15:32:08 INFO ConnectionManager: Bound socket to port 39311 with id = ConnectionManagerId(instance-1.c.custom-unison-00000.internal,00000)
15/07/07 15:32:08 INFO MemoryStore: MemoryStore started with capacity 265.1 MB
15/07/07 15:32:08 INFO BlockManagerMaster: Trying to register BlockManager
15/07/07 15:32:08 INFO BlockManagerMasterActor: Registering block manager instance-1.c.custom-unison-00000.internal:39311 with 265.1 MB RAM
15/07/07 15:32:08 INFO BlockManagerMaster: Registered BlockManager
15/07/07 15:32:08 INFO HttpFileServer: HTTP File server directory is /tmp/spark-a8e3eb25-7a87-4138-8ec5-f387b76c21b1
15/07/07 15:32:08 INFO HttpServer: Starting HTTP Server
15/07/07 15:32:09 INFO Utils: Successfully started service 'HTTP file server' on port 45607.
15/07/07 15:32:09 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/07/07 15:32:09 INFO SparkUI: Started SparkUI at http://instance-1.c.custom-unison-00000.internal:4040
15/07/07 15:32:09 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@instance-1.c.custom-unison-00000.internal:58022/user/HeartbeatReceiver
spark-sql>

spark-sql starts and shows the spark-sql> prompt.

2. Create a table.

Create a table using the example from the Spark SQL programming guide as is.

spark-sql> CREATE TABLE IF NOT EXISTS src (key INT, value STRING);

15/07/07 15:38:48 INFO ParseDriver: Parsing command: CREATE TABLE IF NOT EXISTS src (key INT, value STRING)

15/07/07 15:38:48 INFO ParseDriver: Parse Completed
15/07/07 15:38:48 INFO deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
15/07/07 15:38:48 INFO Driver: <PERFLOG method=Driver.run>
15/07/07 15:38:48 INFO Driver: <PERFLOG method=TimeToSubmit>
15/07/07 15:38:48 INFO Driver: <PERFLOG method=compile>
15/07/07 15:38:48 INFO Driver: <PERFLOG method=parse>
15/07/07 15:38:48 INFO ParseDriver: Parsing command: CREATE TABLE IF NOT EXISTS src (key INT, value STRING)
15/07/07 15:38:48 INFO ParseDriver: Parse Completed
15/07/07 15:38:48 INFO Driver: </PERFLOG method=parse start=1436283528297 end=1436283528297 duration=0>
15/07/07 15:38:48 INFO Driver: <PERFLOG method=semanticAnalyze>
15/07/07 15:38:48 INFO SemanticAnalyzer: Starting Semantic Analysis
15/07/07 15:38:48 INFO SemanticAnalyzer: Creating table src position=27
15/07/07 15:38:48 INFO HiveMetaStore: 0: get_table : db=default tbl=src
15/07/07 15:38:48 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_table : db=default tbl=src
15/07/07 15:38:48 INFO Driver: Semantic Analysis Completed
15/07/07 15:38:48 INFO Driver: </PERFLOG method=semanticAnalyze start=1436283528298 end=1436283528347 duration=49>
15/07/07 15:38:48 INFO Driver: Returning Hive schema: Schema(fieldSchemas:null, properties:null)
15/07/07 15:38:48 INFO Driver: </PERFLOG method=compile start=1436283528296 end=1436283528348 duration=52>
15/07/07 15:38:48 INFO Driver: <PERFLOG method=Driver.execute>
15/07/07 15:38:48 INFO Driver: Starting command: CREATE TABLE IF NOT EXISTS src (key INT, value STRING)
15/07/07 15:38:48 INFO Driver: </PERFLOG method=TimeToSubmit start=1436283528296 end=1436283528349 duration=53>
15/07/07 15:38:48 INFO Driver: <PERFLOG method=runTasks>
15/07/07 15:38:48 INFO Driver: <PERFLOG method=task.DDL.Stage-0>
15/07/07 15:38:48 INFO DDLTask: Default to LazySimpleSerDe for table src
15/07/07 15:38:48 INFO HiveMetaStore: 0: create_table: Table(tableName:src, dbName:default, owner:root, createTime:1436283528, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:key, type:int, comment:null), FieldSchema(name:value, type:string, comment:null)], location:null, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE, privileges:PrincipalPrivilegeSet(userPrivileges:null, groupPrivileges:null, rolePrivileges:null))
15/07/07 15:38:48 INFO audit: ugi=root ip=unknown-ip-addr cmd=create_table: Table(tableName:src, dbName:default, owner:root, createTime:1436283528, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:key, type:int, comment:null), FieldSchema(name:value, type:string, comment:null)], location:null, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE, privileges:PrincipalPrivilegeSet(userPrivileges:null, groupPrivileges:null, rolePrivileges:null))
15/07/07 15:38:48 INFO Driver: </PERFLOG method=task.DDL.Stage-0 start=1436283528349 end=1436283528460 duration=111>
15/07/07 15:38:48 INFO Driver: </PERFLOG method=runTasks start=1436283528349 end=1436283528460 duration=111>
15/07/07 15:38:48 INFO Driver: </PERFLOG method=Driver.execute start=1436283528348 end=1436283528460 duration=112>
OK
15/07/07 15:38:48 INFO Driver: OK
15/07/07 15:38:48 INFO Driver: <PERFLOG method=releaseLocks>
15/07/07 15:38:48 INFO Driver: </PERFLOG method=releaseLocks start=1436283528460 end=1436283528461 duration=1>
15/07/07 15:38:48 INFO Driver: </PERFLOG method=Driver.run start=1436283528296 end=1436283528461 duration=165>
15/07/07 15:38:48 INFO Driver: <PERFLOG method=releaseLocks>
15/07/07 15:38:48 INFO Driver: </PERFLOG method=releaseLocks start=1436283528461 end=1436283528461 duration=0>
Time taken: 0.207 seconds
15/07/07 15:38:48 INFO CliDriver: Time taken: 0.207 seconds
15/07/07 15:38:48 INFO Driver: <PERFLOG method=releaseLocks>
15/07/07 15:38:48 INFO Driver: </PERFLOG method=releaseLocks start=1436283528465 end=1436283528465 duration=0>

Check that the table was created.

spark-sql> show tables;

15/07/07 15:40:32 INFO ParseDriver: Parsing command: show tables

15/07/07 15:40:32 INFO ParseDriver: Parse Completed
15/07/07 15:40:32 INFO deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
15/07/07 15:40:32 INFO Driver: <PERFLOG method=Driver.run>
15/07/07 15:40:32 INFO Driver: <PERFLOG method=TimeToSubmit>
15/07/07 15:40:32 INFO Driver: <PERFLOG method=compile>
15/07/07 15:40:32 INFO Driver: <PERFLOG method=parse>
15/07/07 15:40:32 INFO ParseDriver: Parsing command: show tables
15/07/07 15:40:32 INFO ParseDriver: Parse Completed
15/07/07 15:40:32 INFO Driver: </PERFLOG method=parse start=1436283632094 end=1436283632095 duration=1>
15/07/07 15:40:32 INFO Driver: <PERFLOG method=semanticAnalyze>
15/07/07 15:40:32 INFO Driver: Semantic Analysis Completed
15/07/07 15:40:32 INFO Driver: </PERFLOG method=semanticAnalyze start=1436283632095 end=1436283632103 duration=8>
15/07/07 15:40:32 INFO ListSinkOperator: Initializing Self 0 OP
15/07/07 15:40:32 INFO ListSinkOperator: Operator 0 OP initialized
15/07/07 15:40:32 INFO ListSinkOperator: Initialization Done 0 OP
15/07/07 15:40:32 INFO Driver: Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:tab_name, type:string, comment:from deserializer)], properties:null)
15/07/07 15:40:32 INFO Driver: </PERFLOG method=compile start=1436283632094 end=1436283632104 duration=10>
15/07/07 15:40:32 INFO Driver: <PERFLOG method=Driver.execute>
15/07/07 15:40:32 INFO Driver: Starting command: show tables
15/07/07 15:40:32 INFO Driver: </PERFLOG method=TimeToSubmit start=1436283632093 end=1436283632105 duration=12>
15/07/07 15:40:32 INFO Driver: <PERFLOG method=runTasks>
15/07/07 15:40:32 INFO Driver: <PERFLOG method=task.DDL.Stage-0>
15/07/07 15:40:32 INFO HiveMetaStore: 0: get_database: default
15/07/07 15:40:32 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: default
15/07/07 15:40:32 INFO HiveMetaStore: 0: get_tables: db=default pat=.*
15/07/07 15:40:32 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_tables: db=default pat=.*
15/07/07 15:40:32 INFO Driver: </PERFLOG method=task.DDL.Stage-0 start=1436283632105 end=1436283632119 duration=14>
15/07/07 15:40:32 INFO Driver: </PERFLOG method=runTasks start=1436283632105 end=1436283632120 duration=15>
15/07/07 15:40:32 INFO Driver: </PERFLOG method=Driver.execute start=1436283632104 end=1436283632120 duration=16>
OK
15/07/07 15:40:32 INFO Driver: OK
15/07/07 15:40:32 INFO Driver: <PERFLOG method=releaseLocks>
15/07/07 15:40:32 INFO Driver: </PERFLOG method=releaseLocks start=1436283632120 end=1436283632120 duration=0>
15/07/07 15:40:32 INFO Driver: </PERFLOG method=Driver.run start=1436283632093 end=1436283632120 duration=27>
15/07/07 15:40:32 INFO deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
15/07/07 15:40:32 INFO FileInputFormat: Total input paths to process : 1
15/07/07 15:40:32 INFO Driver: <PERFLOG method=releaseLocks>
15/07/07 15:40:32 INFO Driver: </PERFLOG method=releaseLocks start=1436283632161 end=1436283632161 duration=0>
src
Time taken: 0.106 seconds
15/07/07 15:40:32 INFO CliDriver: Time taken: 0.106 seconds
15/07/07 15:40:32 INFO Driver: <PERFLOG method=releaseLocks>
15/07/07 15:40:32 INFO Driver: </PERFLOG method=releaseLocks start=1436283632169 end=1436283632170 duration=1>

The table [src] is indeed there.
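The two statements so far can also be run without the interactive prompt. The following is a minimal sketch, assuming the `-f` option that spark-sql inherits from the Hive CLI; the helper `write_setup_sql` and the file name `create_src.sql` are hypothetical names introduced only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: write the statements used so far into a SQL file.
write_setup_sql() {
    cat > "$1" <<'EOF'
CREATE TABLE IF NOT EXISTS src (key INT, value STRING);
SHOW TABLES;
EOF
}

sql_file="${TMPDIR:-/tmp}/create_src.sql"   # hypothetical file name
write_setup_sql "$sql_file"

# Run the file only when a Spark install is present under $SPARK_HOME.
# sudo mirrors the interactive session above; drop it if your setup does not need it.
if [ -x "${SPARK_HOME:-}/bin/spark-sql" ]; then
    cd "$SPARK_HOME" && sudo ./bin/spark-sql -f "$sql_file"
fi
```

Keeping the statements in a file makes the setup repeatable, and `IF NOT EXISTS` means running it a second time should not fail on the existing table.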

Next, load some data.

spark-sql> LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src;

15/07/07 15:42:17 INFO ParseDriver: Parsing command: LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src

15/07/07 15:42:17 INFO ParseDriver: Parse Completed
15/07/07 15:42:17 INFO deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
15/07/07 15:42:17 INFO Driver: <PERFLOG method=Driver.run>
15/07/07 15:42:17 INFO Driver: <PERFLOG method=TimeToSubmit>
15/07/07 15:42:17 INFO Driver: <PERFLOG method=compile>
15/07/07 15:42:17 INFO Driver: <PERFLOG method=parse>
15/07/07 15:42:17 INFO ParseDriver: Parsing command: LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src
15/07/07 15:42:17 INFO ParseDriver: Parse Completed
15/07/07 15:42:17 INFO Driver: </PERFLOG method=parse start=1436283737848 end=1436283737849 duration=1>
15/07/07 15:42:17 INFO Driver: <PERFLOG method=semanticAnalyze>
15/07/07 15:42:17 INFO HiveMetaStore: 0: get_table : db=default tbl=src
15/07/07 15:42:17 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_table : db=default tbl=src
15/07/07 15:42:17 INFO Driver: Semantic Analysis Completed
15/07/07 15:42:17 INFO Driver: </PERFLOG method=semanticAnalyze start=1436283737849 end=1436283737939 duration=90>
15/07/07 15:42:17 INFO Driver: Returning Hive schema: Schema(fieldSchemas:null, properties:null)
15/07/07 15:42:17 INFO Driver: </PERFLOG method=compile start=1436283737848 end=1436283737944 duration=96>
15/07/07 15:42:17 INFO Driver: <PERFLOG method=Driver.execute>
15/07/07 15:42:17 INFO Driver: Starting command: LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src
15/07/07 15:42:17 INFO Driver: </PERFLOG method=TimeToSubmit start=1436283737848 end=1436283737944 duration=96>
15/07/07 15:42:17 INFO Driver: <PERFLOG method=runTasks>
15/07/07 15:42:17 INFO Driver: <PERFLOG method=task.COPY.Stage-0>
Copying data from file:/usr/local/spark-1.1.0-bin-hadoop2.4/examples/src/main/resources/kv1.txt
15/07/07 15:42:17 INFO Task: Copying data from file:/usr/local/spark-1.1.0-bin-hadoop2.4/examples/src/main/resources/kv1.txt to file:/tmp/hive-root/hive_2015-07-07_15-42-17_848_6713777572196549102-1/-ext-10000
Copying file: file:/usr/local/spark-1.1.0-bin-hadoop2.4/examples/src/main/resources/kv1.txt
15/07/07 15:42:17 INFO Task: Copying file: file:/usr/local/spark-1.1.0-bin-hadoop2.4/examples/src/main/resources/kv1.txt
15/07/07 15:42:17 INFO Driver: </PERFLOG method=task.COPY.Stage-0 start=1436283737944 end=1436283737963 duration=19>
15/07/07 15:42:17 INFO Driver: <PERFLOG method=task.MOVE.Stage-1>
Loading data to table default.src
15/07/07 15:42:17 INFO Task: Loading data to table default.src from file:/tmp/hive-root/hive_2015-07-07_15-42-17_848_6713777572196549102-1/-ext-10000
15/07/07 15:42:17 INFO HiveMetaStore: 0: get_table : db=default tbl=src
15/07/07 15:42:17 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_table : db=default tbl=src
15/07/07 15:42:18 INFO HiveMetaStore: 0: get_table : db=default tbl=src
15/07/07 15:42:18 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_table : db=default tbl=src
15/07/07 15:42:18 INFO HiveMetaStore: 0: alter_table: db=default tbl=src newtbl=src
15/07/07 15:42:18 INFO audit: ugi=root ip=unknown-ip-addr cmd=alter_table: db=default tbl=src newtbl=src
15/07/07 15:42:18 INFO HiveMetaStore: 0: get_table : db=default tbl=src
15/07/07 15:42:18 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_table : db=default tbl=src
15/07/07 15:42:18 INFO Driver: </PERFLOG method=task.MOVE.Stage-1 start=1436283737963 end=1436283738097 duration=134>
15/07/07 15:42:18 INFO Driver: <PERFLOG method=task.STATS.Stage-2>
15/07/07 15:42:18 INFO StatsTask: Executing stats task
15/07/07 15:42:18 INFO HiveMetaStore: 0: get_table : db=default tbl=src
15/07/07 15:42:18 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_table : db=default tbl=src
15/07/07 15:42:18 INFO HiveMetaStore: 0: get_table : db=default tbl=src
15/07/07 15:42:18 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_table : db=default tbl=src
15/07/07 15:42:18 INFO HiveMetaStore: 0: alter_table: db=default tbl=src newtbl=src
15/07/07 15:42:18 INFO audit: ugi=root ip=unknown-ip-addr cmd=alter_table: db=default tbl=src newtbl=src
15/07/07 15:42:18 INFO HiveMetaStore: 0: get_table : db=default tbl=src
15/07/07 15:42:18 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_table : db=default tbl=src
Table default.src stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 5812, raw_data_size: 0]
15/07/07 15:42:18 INFO Task: Table default.src stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 5812, raw_data_size: 0]
15/07/07 15:42:18 INFO Driver: </PERFLOG method=task.STATS.Stage-2 start=1436283738097 end=1436283738204 duration=107>
15/07/07 15:42:18 INFO Driver: </PERFLOG method=runTasks start=1436283737944 end=1436283738204 duration=260>
15/07/07 15:42:18 INFO Driver: </PERFLOG method=Driver.execute start=1436283737944 end=1436283738204 duration=260>
OK
15/07/07 15:42:18 INFO Driver: OK
15/07/07 15:42:18 INFO Driver: <PERFLOG method=releaseLocks>
15/07/07 15:42:18 INFO Driver: </PERFLOG method=releaseLocks start=1436283738204 end=1436283738204 duration=0>
15/07/07 15:42:18 INFO Driver: </PERFLOG method=Driver.run start=1436283737848 end=1436283738205 duration=357>
15/07/07 15:42:18 INFO Driver: <PERFLOG method=releaseLocks>
15/07/07 15:42:18 INFO Driver: </PERFLOG method=releaseLocks start=1436283738205 end=1436283738205 duration=0>
Time taken: 0.394 seconds
15/07/07 15:42:18 INFO CliDriver: Time taken: 0.394 seconds
15/07/07 15:42:18 INFO Driver: <PERFLOG method=releaseLocks>
15/07/07 15:42:18 INFO Driver: </PERFLOG method=releaseLocks start=1436283738208 end=1436283738208 duration=0>

It reports OK.
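The table was created without a ROW FORMAT clause, so it uses LazySimpleSerDe with Hive's default field delimiter, Ctrl-A (octal 001), and kv1.txt is expected to use that same delimiter. The sketch below is one way to look at the file before or after loading it; `count_records` and `preview_records` are hypothetical helper names, and the path assumes the same Spark directory layout as in the log above.

```shell
#!/bin/sh
# Count the non-empty lines (one record per line) in a delimited text file.
count_records() {
    grep -c . "$1"
}

# Print the first N records with the Ctrl-A delimiter shown as a tab.
preview_records() {
    head -n "$2" "$1" | tr '\001' '\t'
}

# Assumed location of the sample data inside the Spark distribution.
data="${SPARK_HOME:-}/examples/src/main/resources/kv1.txt"
if [ -f "$data" ]; then
    count_records "$data"
    preview_records "$data" 3
fi
```

The record count printed here should agree with what `select count(*)` returns once the file is loaded.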

Now check that the data made it into the table.

spark-sql> select count(*) FROM src;

15/07/07 15:43:50 INFO ParseDriver: Parsing command: select count(*) FROM src

15/07/07 15:43:50 INFO ParseDriver: Parse Completed
15/07/07 15:43:50 INFO HiveMetaStore: 0: get_table : db=default tbl=src
15/07/07 15:43:50 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_table : db=default tbl=src
15/07/07 15:43:50 INFO deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
15/07/07 15:43:50 INFO MemoryStore: ensureFreeSpace(454358) called with curMem=0, maxMem=278019440
15/07/07 15:43:50 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 443.7 KB, free 264.7 MB)
15/07/07 15:43:50 INFO SparkContext: Starting job: collect at HiveContext.scala:415
15/07/07 15:43:50 INFO FileInputFormat: Total input paths to process : 1
15/07/07 15:43:50 INFO DAGScheduler: Registering RDD 18 (mapPartitions at Exchange.scala:86)
15/07/07 15:43:50 INFO DAGScheduler: Got job 0 (collect at HiveContext.scala:415) with 1 output partitions (allowLocal=false)
15/07/07 15:43:50 INFO DAGScheduler: Final stage: Stage 0(collect at HiveContext.scala:415)
15/07/07 15:43:50 INFO DAGScheduler: Parents of final stage: List(Stage 1)
15/07/07 15:43:50 INFO DAGScheduler: Missing parents: List(Stage 1)
15/07/07 15:43:50 INFO DAGScheduler: Submitting Stage 1 (MapPartitionsRDD[18] at mapPartitions at Exchange.scala:86), which has no missing parents
15/07/07 15:43:50 INFO MemoryStore: ensureFreeSpace(11024) called with curMem=454358, maxMem=278019440
15/07/07 15:43:50 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 10.8 KB, free 264.7 MB)
15/07/07 15:43:50 INFO DAGScheduler: Submitting 2 missing tasks from Stage 1 (MapPartitionsRDD[18] at mapPartitions at Exchange.scala:86)
15/07/07 15:43:50 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
15/07/07 15:43:51 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 0, localhost, PROCESS_LOCAL, 1182 bytes)
15/07/07 15:43:51 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 1182 bytes)
15/07/07 15:43:51 INFO Executor: Running task 0.0 in stage 1.0 (TID 0)
15/07/07 15:43:51 INFO Executor: Running task 1.0 in stage 1.0 (TID 1)
15/07/07 15:43:51 INFO HadoopRDD: Input split: file:/user/hive/warehouse/src/kv1.txt:2906+2906
15/07/07 15:43:51 INFO HadoopRDD: Input split: file:/user/hive/warehouse/src/kv1.txt:0+2906
15/07/07 15:43:51 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
15/07/07 15:43:51 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
15/07/07 15:43:51 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
15/07/07 15:43:51 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
15/07/07 15:43:51 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
15/07/07 15:43:51 INFO Executor: Finished task 1.0 in stage 1.0 (TID 1). 1895 bytes result sent to driver
15/07/07 15:43:51 INFO Executor: Finished task 0.0 in stage 1.0 (TID 0). 1895 bytes result sent to driver
15/07/07 15:43:51 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 0) in 401 ms on localhost (1/2)
15/07/07 15:43:51 INFO DAGScheduler: Stage 1 (mapPartitions at Exchange.scala:86) finished in 0.430 s
15/07/07 15:43:51 INFO DAGScheduler: looking for newly runnable stages
15/07/07 15:43:51 INFO DAGScheduler: running: Set()
15/07/07 15:43:51 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 1) in 413 ms on localhost (2/2)
15/07/07 15:43:51 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
15/07/07 15:43:51 INFO DAGScheduler: waiting: Set(Stage 0)
15/07/07 15:43:51 INFO DAGScheduler: failed: Set()
15/07/07 15:43:51 INFO DAGScheduler: Missing parents for Stage 0: List()
15/07/07 15:43:51 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[22] at map at HiveContext.scala:360), which is now runnable
15/07/07 15:43:51 INFO StatsReportListener: Finished stage: org.apache.spark.scheduler.StageInfo@1deaf84d
15/07/07 15:43:51 INFO MemoryStore: ensureFreeSpace(9792) called with curMem=465382, maxMem=278019440
15/07/07 15:43:51 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 9.6 KB, free 264.7 MB)
15/07/07 15:43:51 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (MappedRDD[22] at map at HiveContext.scala:360)
15/07/07 15:43:51 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15/07/07 15:43:51 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 2, localhost, PROCESS_LOCAL, 948 bytes)
15/07/07 15:43:51 INFO Executor: Running task 0.0 in stage 0.0 (TID 2)
15/07/07 15:43:51 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
15/07/07 15:43:51 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
15/07/07 15:43:51 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 20 ms
15/07/07 15:43:51 INFO StatsReportListener: task runtime:(count: 2, mean: 407.000000, stdev: 6.000000, max: 413.000000, min: 401.000000)
15/07/07 15:43:51 INFO StatsReportListener: 0% 5% 10% 25% 50% 75% 90% 95% 100%
15/07/07 15:43:51 INFO StatsReportListener: 401.0 ms 401.0 ms 401.0 ms 401.0 ms 413.0 ms 413.0 ms 413.0 ms 413.0 ms 413.0 ms
15/07/07 15:43:51 INFO StatsReportListener: shuffle bytes written:(count: 2, mean: 50.000000, stdev: 0.000000, max: 50.000000, min: 50.000000)
15/07/07 15:43:51 INFO StatsReportListener: 0% 5% 10% 25% 50% 75% 90% 95% 100%
15/07/07 15:43:51 INFO StatsReportListener: 50.0 B 50.0 B 50.0 B 50.0 B 50.0 B 50.0 B 50.0 B 50.0 B 50.0 B
15/07/07 15:43:51 INFO StatsReportListener: task result size:(count: 2, mean: 1895.000000, stdev: 0.000000, max: 1895.000000, min: 1895.000000)
15/07/07 15:43:51 INFO StatsReportListener: 0% 5% 10% 25% 50% 75% 90% 95% 100%
15/07/07 15:43:51 INFO StatsReportListener: 1895.0 B 1895.0 B 1895.0 B 1895.0 B 1895.0 B 1895.0 B 1895.0 B 1895.0 B 1895.0 B
15/07/07 15:43:51 INFO StatsReportListener: executor (non-fetch) time pct: (count: 2, mean: 76.642534, stdev: 1.081437, max: 77.723971, min: 75.561097)
15/07/07 15:43:51 INFO StatsReportListener: 0% 5% 10% 25% 50% 75% 90% 95% 100%
15/07/07 15:43:51 INFO StatsReportListener: 76 % 76 % 76 % 76 % 78 % 78 % 78 % 78 % 78 %
15/07/07 15:43:51 INFO StatsReportListener: other time pct: (count: 2, mean: 23.357466, stdev: 1.081437, max: 24.438903, min: 22.276029)
15/07/07 15:43:51 INFO StatsReportListener: 0% 5% 10% 25% 50% 75% 90% 95% 100%
15/07/07 15:43:51 INFO StatsReportListener: 22 % 22 % 22 % 22 % 24 % 24 % 24 % 24 % 24 %
15/07/07 15:43:51 INFO Executor: Finished task 0.0 in stage 0.0 (TID 2). 1076 bytes result sent to driver
15/07/07 15:43:51 INFO DAGScheduler: Stage 0 (collect at HiveContext.scala:415) finished in 0.114 s
15/07/07 15:43:51 INFO StatsReportListener: Finished stage: org.apache.spark.scheduler.StageInfo@5ad48e86
15/07/07 15:43:51 INFO SparkContext: Job finished: collect at HiveContext.scala:415, took 0.778304976 s
15/07/07 15:43:51 INFO StatsReportListener: task runtime:(count: 1, mean: 116.000000, stdev: 0.000000, max: 116.000000, min: 116.000000)
15/07/07 15:43:51 INFO StatsReportListener: 0% 5% 10% 25% 50% 75% 90% 95% 100%
15/07/07 15:43:51 INFO StatsReportListener: 116.0 ms 116.0 ms 116.0 ms 116.0 ms 116.0 ms 116.0 ms 116.0 ms 116.0 ms 116.0 ms
15/07/07 15:43:51 INFO StatsReportListener: fetch wait time:(count: 1, mean: 0.000000, stdev: 0.000000, max: 0.000000, min: 0.000000)
15/07/07 15:43:51 INFO StatsReportListener: 0% 5% 10% 25% 50% 75% 90% 95% 100%
15/07/07 15:43:51 INFO StatsReportListener: 0.0 ms 0.0 ms 0.0 ms 0.0 ms 0.0 ms 0.0 ms 0.0 ms 0.0 ms 0.0 ms
15/07/07 15:43:51 INFO StatsReportListener: remote bytes read:(count: 1, mean: 0.000000, stdev: 0.000000, max: 0.000000, min: 0.000000)
15/07/07 15:43:51 INFO StatsReportListener: 0% 5% 10% 25% 50% 75% 90% 95% 100%
15/07/07 15:43:51 INFO StatsReportListener: 0.0 B 0.0 B 0.0 B 0.0 B 0.0 B 0.0 B 0.0 B 0.0 B 0.0 B
500
Time taken: 1.14 seconds
15/07/07 15:43:51 INFO CliDriver: Time taken: 1.14 seconds
spark-sql> 15/07/07 15:43:51 INFO StatsReportListener: task result size:(count: 1, mean: 1076.000000, stdev: 0.000000, max: 1076.000000, min: 1076.000000)
15/07/07 15:43:51 INFO StatsReportListener: 0% 5% 10% 25% 50% 75% 90% 95% 100%
15/07/07 15:43:51 INFO StatsReportListener: 1076.0 B 1076.0 B 1076.0 B 1076.0 B 1076.0 B 1076.0 B 1076.0 B 1076.0 B 1076.0 B
15/07/07 15:43:51 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 2) in 116 ms on localhost (1/1)
15/07/07 15:43:51 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/07/07 15:43:51 INFO StatsReportListener: executor (non-fetch) time pct: (count: 1, mean: 94.827586, stdev: 0.000000, max: 94.827586, min: 94.827586)
15/07/07 15:43:51 INFO StatsReportListener: 0% 5% 10% 25% 50% 75% 90% 95% 100%
15/07/07 15:43:51 INFO StatsReportListener: 95 % 95 % 95 % 95 % 95 % 95 % 95 % 95 % 95 %
15/07/07 15:43:51 INFO StatsReportListener: fetch wait time pct: (count: 1, mean: 0.000000, stdev: 0.000000, max: 0.000000, min: 0.000000)
15/07/07 15:43:51 INFO StatsReportListener: 0% 5% 10% 25% 50% 75% 90% 95% 100%
15/07/07 15:43:51 INFO StatsReportListener: 0 % 0 % 0 % 0 % 0 % 0 % 0 % 0 % 0 %
15/07/07 15:43:51 INFO StatsReportListener: other time pct: (count: 1, mean: 5.172414, stdev: 0.000000, max: 5.172414, min: 5.172414)
15/07/07 15:43:51 INFO StatsReportListener: 0% 5% 10% 25% 50% 75% 90% 95% 100%
15/07/07 15:43:51 INFO StatsReportListener: 5 % 5 % 5 % 5 % 5 % 5 % 5 % 5 % 5 %

The logs are noisy and make it hard to spot, but the output contains


500

Time taken: 1.14 seconds


which shows that the table holds 500 records.
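One way to get at the result without the surrounding log lines is to discard stderr. This is a sketch under two assumptions: that the INFO lines and the "Time taken" message are written to stderr (Spark's default log4j profile targets System.err) while query results go to stdout, and that spark-sql accepts the `-e` option inherited from the Hive CLI. `stdout_only` is a hypothetical helper name.

```shell
#!/bin/sh
# Run a command with stderr discarded so that only stdout remains.
stdout_only() {
    "$@" 2>/dev/null
}

# Run the count query only when a Spark install is present under $SPARK_HOME.
# sudo mirrors the interactive session above; drop it if your setup does not need it.
if [ -x "${SPARK_HOME:-}/bin/spark-sql" ]; then
    cd "$SPARK_HOME" && stdout_only sudo ./bin/spark-sql -e 'select count(*) FROM src;'
fi
```

Discarding stderr also hides real errors, so it is better suited to a query that is already known to work than to debugging.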

It works just like Hive, and it is really fast...