Summary
1. Read a CSV file into an RDD
In [1]: import csv
In [2]: rdd = sc.textFile('testdata1.csv')
In [3]: rdd2 = rdd.mapPartitions(lambda x: csv.reader(x))
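At this point rdd2 still contains the header row and every field is a string. As a minimal sketch (assuming testdata1.csv has a header row and the shell provides a SparkSession), the header can be stripped and the remaining rows promoted to a DataFrame:
In [4]: header = rdd2.first()                      # first record is the header row
In [5]: rows = rdd2.filter(lambda r: r != header)  # keep only the data rows
In [6]: df1 = rows.toDF(header)                    # all columns come out as strings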
2. Read a CSV file into a pandas DataFrame and convert it to a Spark DataFrame
In [1]: import pandas as pd
In [2]: pdf = pd.read_csv('testdata1.csv')
In [3]: sdf = sqlContext.createDataFrame(pdf)
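On Spark 2.x the same conversion can go through the SparkSession instead of the SQLContext; a minimal sketch, assuming the pyspark shell exposes spark:
In [4]: sdf = spark.createDataFrame(pdf)   # column types are taken from the pandas dtypes
In [5]: sdf.printSchema()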
3. Read an Excel file into a pandas DataFrame and convert it to a Spark DataFrame
In [1]: import pandas as pd
In [2]: pde = pd.ExcelFile('testdata1.xlsx')
In [3]: pds = pde.parse(pde.sheet_names[0], skiprows=1)
In [4]: sdf = sqlContext.createDataFrame(pds)
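pd.read_excel is a shorter route to the same pandas DataFrame; a minimal sketch, assuming a pandas version where the keyword is sheet_name (older releases used sheetname):
In [5]: pds2 = pd.read_excel('testdata1.xlsx', sheet_name=0, skiprows=1)
In [6]: sdf2 = sqlContext.createDataFrame(pds2)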
4. Read a CSV file directly into a Spark DataFrame
In [1]: df = spark.read.csv('testdata1.csv', header=True, inferSchema=True, mode="DROPMALFORMED", encoding='UTF-8')
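To check what was actually loaded, the inferred schema and a few rows can be inspected directly:
In [2]: df.printSchema()   # column types inferred because inferSchema=True
In [3]: df.show(5)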