[R] When a tsv file will not load properly, fread() may be worth trying

Posted at 2018-01-06

In the Mercari Price Suggestion Challenge | Kaggle,
I had trouble reading the tsv (tab-separated) data.

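When I read the data with read.csv() or read.delim(), I got values such as "♥️\t2" and "⊙︿⊙\n155" even though I had passed sep="\t". Whenever a \t or \n came directly after an emoji, the line was not split there, so the columns of the data frame shifted and the result was a mess.
(I am on Windows 7, so reading a UTF-8 tsv on Windows 7 may be one of the causes.)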
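
One way the suspected encoding mismatch could produce exactly this symptom is sketched below. It assumes the Windows locale is CP932 (Shift_JIS), which this article does not state, and that the reader scans the file in that native encoding, pairing every lead byte with whatever byte follows it. `cp932_sees_tab()` is a hypothetical helper written for illustration; it is a simplified model, not R's actual parser.

```r
# Simplified model of a double-byte (CP932 / Shift_JIS) scan over UTF-8 bytes.
# Assumption: bytes 0x81-0x9F and 0xE0-0xFC start a two-byte character and
# the byte after them is consumed without being inspected.
cp932_sees_tab <- function(x) {
  bytes <- as.integer(charToRaw(enc2utf8(x)))
  is_lead <- function(b) (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC)
  i <- 1L
  while (i <= length(bytes)) {
    if (is_lead(bytes[i])) {
      i <- i + 2L                            # lead byte plus the byte after it
    } else {
      if (bytes[i] == 0x09) return(TRUE)     # tab seen as its own byte
      i <- i + 1L
    }
  }
  FALSE                                      # tab was absorbed or absent
}

# U+2665 U+FE0F is the heart followed by a variation selector:
# UTF-8 bytes E2 99 A5 EF B8 8F, then 09 (tab) and 32 ("2").
cp932_sees_tab("\u2665\ufe0f\t2")   # FALSE
cp932_sees_tab("plain\t2")          # TRUE
```

Under this model the last byte of the variation selector (0x8F) counts as a lead byte and absorbs the tab, which would be consistent with the "♥️\t2" value described above. Whether this is what actually happened on my machine is not verified.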

With fread(), on the other hand, every \t was recognized and the file was read in cleanly split on tabs.

So when a tsv does not load properly, fread() may be the better choice.

Code examples

read.csv()

df_train_csv <- read.csv("data/train.tsv", sep="\t", quote = "", na.strings="",
                  stringsAsFactors=FALSE, header=TRUE, encoding = 'UTF-8')

> df_train_csv[155,8]
[1] "Iridescent fishbowl slime : beautiful !! ***All Single Slimes are 2 OZ unless otherwise stated *** Special ** BUY 2 GET ONE FREE** (2 oz slimes )(◕‿◕)  Just leave a message after our purchase. Free slimes will be send randomly (2 oz size) ( slime, floam , fishbowl, random color)=^_^=  (╹◡╹) Activator ( tiny bottle enough for one slime) also will be provided for slime if it get too sticky  ** Items will be send through usps with tracking , once i shipped item it's out of my hands , please don't rate me poor because of post office fault !!** ⊙︿⊙\n155"
> df_train_csv[242,8]
[1] "100% authentic Wristlet Phone Case, zip around, MK signature Vanilla Color Retail Price [rm] Firm price Check out my closet ❤️\n242"
> df_train_csv[263040,2]
[1] "♥️ FREESHIP BUTT AND TUMMY SHAPER ♥️\t2"

The \t or \n immediately after an emoji is not recognized as a separator.

qiita01.PNG

As a result, the columns shift.
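
A quick way to check for this kind of shift without scrolling through the data is to count the fields on every line. The sketch below uses `count.fields()` from base R on a made-up three-line file (the file contents and variable names are illustrative, not the Kaggle data); the third line is deliberately missing one tab to imitate a separator that was not recognized.

```r
# Toy file standing in for train.tsv; the last line has one tab too few.
tmp <- tempfile(fileext = ".tsv")
writeLines(c("id\tname\tprice",
             "1\tcase\t10",
             "2\tshaper 2"), tmp)

# Number of tab-separated fields on each line of the file.
# quote = "" and comment.char = "" stop quotes and '#' inside the item
# text from changing the count.
n_fields <- count.fields(tmp, sep = "\t", quote = "", comment.char = "")

# Lines whose field count differs from the header line are suspects.
bad_lines <- which(n_fields != n_fields[1])
bad_lines
```

The positions in `bad_lines` are line numbers in the file, so the header counts as line 1.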

read.delim()

library(dplyr)
df_train_delim <- read.delim("data/train.tsv", sep='\t', na.strings="",
                    stringsAsFactors=FALSE, header=TRUE,
                    quote = "", encoding = 'UTF-8')

> df_train_delim[155,8]
[1] "Iridescent fishbowl slime : beautiful !! ***All Single Slimes are 2 OZ unless otherwise stated *** Special ** BUY 2 GET ONE FREE** (2 oz slimes )(◕‿◕)  Just leave a message after our purchase. Free slimes will be send randomly (2 oz size) ( slime, floam , fishbowl, random color)=^_^=  (╹◡╹) Activator ( tiny bottle enough for one slime) also will be provided for slime if it get too sticky  ** Items will be send through usps with tracking , once i shipped item it's out of my hands , please don't rate me poor because of post office fault !!** ⊙︿⊙\n155"
> df_train_delim[242,8]
[1] "100% authentic Wristlet Phone Case, zip around, MK signature Vanilla Color Retail Price [rm] Firm price Check out my closet ❤️\n242"
> df_train_delim[263040,2]
[1] "♥️ FREESHIP BUTT AND TUMMY SHAPER ♥️\t2"

The same thing happens here.

fread()

library(data.table)
df_train_fread <- fread('data/train.tsv', showProgress = FALSE, encoding = 'UTF-8',
                        data.table = FALSE)

> df_train_fread[155,8]
[1] "Iridescent fishbowl slime : beautiful !! ***All Single Slimes are 2 OZ unless otherwise stated *** Special ** BUY 2 GET ONE FREE** (2 oz slimes )(◕‿◕)  Just leave a message after our purchase. Free slimes will be send randomly (2 oz size) ( slime, floam , fishbowl, random color)=^_^=  (╹◡╹) Activator ( tiny bottle enough for one slime) also will be provided for slime if it get too sticky  ** Items will be send through usps with tracking , once i shipped item it's out of my hands , please don't rate me poor because of post office fault !!** ⊙︿⊙"
> df_train_fread[242,8]
[1] "100% authentic Wristlet Phone Case, zip around, MK signature Vanilla Color Retail Price [rm] Firm price Check out my closet ❤️"
> df_train_fread[263040,2]
[1] "♥️ FREESHIP BUTT AND TUMMY SHAPER ♥️"

This time it worked.

qiita02.PNG
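
Rather than spot-checking individual cells, the result of any of the readers can be scanned for leftover separators. The following is a minimal sketch in base R; `stray_separators()` and the two toy data frames are hypothetical names introduced here, with the "bad" frame imitating the misread value shown earlier.

```r
# For each column, report whether any character value still contains a
# literal tab or newline, which suggests a separator was not recognized.
stray_separators <- function(df) {
  vapply(df, function(col) {
    is.character(col) &&
      any(grepl("\t", col, fixed = TRUE) | grepl("\n", col, fixed = TRUE))
  }, logical(1))
}

# Toy frames: one imitating the misparse, one imitating a clean read.
toy_bad  <- data.frame(name = c("SHAPER \u2665\t2", "ok"),
                       price = c(NA, 10), stringsAsFactors = FALSE)
toy_good <- data.frame(name = c("SHAPER \u2665", "ok"),
                       price = c(2, 10), stringsAsFactors = FALSE)

stray_separators(toy_bad)
stray_separators(toy_good)
```

Applied to a real data frame, any column flagged `TRUE` is worth inspecting. Note that a legitimately quoted multi-line field would also be flagged, so this is only a hint.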
