

How to read Excel data in Spark (Spark1, RDD)

Posted at 2019-03-02

summary

1. Read a csv into an RDD

In [1]: import csv
In [2]: rdd  = sc.textFile('testdata1.csv')
In [3]: rdd2 = rdd.mapPartitions(lambda x: csv.reader(x))
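To see what the `mapPartitions` step does per partition, here is a minimal sketch that runs without Spark. It assumes that a partition behaves like an iterator over text lines; the sample rows and the names `partition`, `rows` and `naive` are made up for illustration.

```python
import csv

# Stand-in for one RDD partition: mapPartitions hands the function an
# iterator over the lines of that partition (the sample data is made up).
partition = iter([
    'id,name,comment',
    '1,alice,"hello, world"',
    '2,bob,plain',
])

# csv.reader accepts any iterable of strings and yields one list per row.
rows = list(csv.reader(partition))

# For comparison: a naive split breaks the quoted field containing a comma.
naive = '1,alice,"hello, world"'.split(',')

print(rows[1])  # ['1', 'alice', 'hello, world']
print(naive)    # ['1', 'alice', '"hello', ' world"']
```

Two caveats apply to this approach: the header line stays in the result as an ordinary row, and since `sc.textFile` splits records on newlines, a quoted field that contains a line break would probably not parse correctly.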

2. Read a csv into a pandas dataframe and move it to a spark dataframe

In [1]: import pandas as pd
In [2]: pdf=pd.read_csv('testdata1.csv')
In [3]: sdf = sqlContext.createDataFrame(pdf)

3. Read an excel file into a pandas dataframe and move it to a spark dataframe

In [1]: import pandas as pd
In [2]: pde=pd.ExcelFile('testdata1.xlsx')
In [3]: pds=pde.parse(pde.sheet_names[0], skiprows=1)
In [4]: sdf = sqlContext.createDataFrame(pds)

4. Read a csv into a spark dataframe (spark.read.csv, Spark 2.0 or later)

In [1]: df = spark.read.csv('testdata1.csv', header=True, inferSchema=True, mode="DROPMALFORMED", encoding='UTF-8')