While working on the Mercari Price Suggestion Challenge | Kaggle,
I had trouble reading the TSV (tab-separated) data.
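I first read the data with read.csv() and read.delim(). Even with sep="\t" specified, a \t or \n that came immediately after an emoji, as in "♥️\t2" or "⊙︿⊙\n155", was not treated as a delimiter. The \t or \n stayed inside the cell, so the columns of the data frame shifted and the result was a mess.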
(I am on Windows 7, so reading a UTF-8 TSV on Windows 7 may be part of the cause.)
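To check whether a loaded data frame shows this symptom, one option is to scan the character columns for cells that still contain a tab or a newline. The following is only a minimal sketch: the helper name find_unsplit_cells and the toy data frame are made up for illustration and are not part of the original analysis.

```r
# Hypothetical helper: list the row numbers, per character column,
# of cells that still contain a literal tab or newline.
find_unsplit_cells <- function(df) {
  chr_cols <- names(df)[vapply(df, is.character, logical(1))]
  hits <- lapply(chr_cols, function(col) {
    x <- df[[col]]
    # fixed = TRUE matches the raw characters without regex interpretation
    which(grepl("\t", x, fixed = TRUE) | grepl("\n", x, fixed = TRUE))
  })
  names(hits) <- chr_cols
  # Keep only the columns that have at least one suspicious cell
  hits[lengths(hits) > 0]
}

# Toy data (an assumption, not the Kaggle file): row 2 of `name` holds an unsplit tab
toy <- data.frame(name = c("ok", "shaper \u2665\t2"),
                  desc = c("fine", "also fine"),
                  stringsAsFactors = FALSE)
suspicious <- find_unsplit_cells(toy)
```

An empty list means no cell contains a leftover delimiter. Any entries point at the rows worth inspecting by hand.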
fread(), on the other hand, recognized every \t correctly and split the file cleanly on tabs.
So when a TSV file will not load properly, it may be worth trying fread().
Code examples
read.csv()
```r
# Read the TSV with base R; quote = "" turns off quote handling
df_train_csv <- read.csv("data/train.tsv", sep = "\t", quote = "", na.strings = "",
                         stringsAsFactors = FALSE, header = TRUE, encoding = "UTF-8")
> df_train_csv[155,8]
[1] "Iridescent fishbowl slime : beautiful !! ***All Single Slimes are 2 OZ unless otherwise stated *** Special ** BUY 2 GET ONE FREE** (2 oz slimes )(◕‿◕) Just leave a message after our purchase. Free slimes will be send randomly (2 oz size) ( slime, floam , fishbowl, random color)=^_^= (╹◡╹) Activator ( tiny bottle enough for one slime) also will be provided for slime if it get too sticky ** Items will be send through usps with tracking , once i shipped item it's out of my hands , please don't rate me poor because of post office fault !!** ⊙︿⊙\n155"
> df_train_csv[242,8]
[1] "100% authentic Wristlet Phone Case, zip around, MK signature Vanilla Color Retail Price [rm] Firm price Check out my closet ❤️\n242"
> df_train_csv[263040,2]
[1] "♥️ FREESHIP BUTT AND TUMMY SHAPER ♥️\t2"
```
The \t and \n immediately after an emoji are not recognized as delimiters.
As a result, the columns shift.
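A quick way to look for this kind of misalignment before loading the whole file is to count the fields on each line with count.fields() from base R. The sketch below uses a small temporary file with made-up rows, not the Kaggle data. On a setup that shows the problem, count.fields() may miscount in the same way as read.csv(), so treat it as a diagnostic rather than a fix.

```r
# Build a tiny tab-separated file for illustration (hypothetical data)
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("id\tname\tprice",
             "1\tslime #1\t10",
             "2\tphone case\t44"),
           tsv_path)

# Count tab-separated fields per line. Quote and comment handling are
# switched off so that quotes and '#' inside text count as ordinary characters.
n_fields <- count.fields(tsv_path, sep = "\t", quote = "", comment.char = "")

# Lines whose field count differs from the header are candidates for shifted columns
bad_lines <- which(n_fields != n_fields[1])
```

If bad_lines is empty, every line has the same number of fields as the header.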
read.delim()
```r
library(dplyr)  # loaded in the original session; read.delim() itself does not need it
df_train_delim <- read.delim("data/train.tsv", sep = "\t", na.strings = "",
                             stringsAsFactors = FALSE, header = TRUE,
                             quote = "", encoding = "UTF-8")
> df_train_delim[155,8]
[1] "Iridescent fishbowl slime : beautiful !! ***All Single Slimes are 2 OZ unless otherwise stated *** Special ** BUY 2 GET ONE FREE** (2 oz slimes )(◕‿◕) Just leave a message after our purchase. Free slimes will be send randomly (2 oz size) ( slime, floam , fishbowl, random color)=^_^= (╹◡╹) Activator ( tiny bottle enough for one slime) also will be provided for slime if it get too sticky ** Items will be send through usps with tracking , once i shipped item it's out of my hands , please don't rate me poor because of post office fault !!** ⊙︿⊙\n155"
> df_train_delim[242,8]
[1] "100% authentic Wristlet Phone Case, zip around, MK signature Vanilla Color Retail Price [rm] Firm price Check out my closet ❤️\n242"
> df_train_delim[263040,2]
[1] "♥️ FREESHIP BUTT AND TUMMY SHAPER ♥️\t2"
```
The result is the same as with read.csv().
fread()
```r
library(data.table)
# data.table = FALSE returns a plain data.frame
df_train_fread <- fread("data/train.tsv", showProgress = FALSE, encoding = "UTF-8",
                        data.table = FALSE)
> df_train_fread[155,8]
[1] "Iridescent fishbowl slime : beautiful !! ***All Single Slimes are 2 OZ unless otherwise stated *** Special ** BUY 2 GET ONE FREE** (2 oz slimes )(◕‿◕) Just leave a message after our purchase. Free slimes will be send randomly (2 oz size) ( slime, floam , fishbowl, random color)=^_^= (╹◡╹) Activator ( tiny bottle enough for one slime) also will be provided for slime if it get too sticky ** Items will be send through usps with tracking , once i shipped item it's out of my hands , please don't rate me poor because of post office fault !!** ⊙︿⊙"
> df_train_fread[242,8]
[1] "100% authentic Wristlet Phone Case, zip around, MK signature Vanilla Color Retail Price [rm] Firm price Check out my closet ❤️"
> df_train_fread[263040,2]
[1] "♥️ FREESHIP BUTT AND TUMMY SHAPER ♥️"
```
This time it worked.
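For a self-contained way to try fread() on text that contains symbols right before a delimiter, something like the following could be used. It is only a sketch under assumptions: the file is a small temporary TSV with made-up rows, and sep and quote are passed explicitly to mirror the base R calls, which the original fread() call did not do. Results may differ by platform and locale.

```r
library(data.table)

# Write a tiny UTF-8 TSV (hypothetical rows) where a heart symbol precedes a tab
tsv_path <- tempfile(fileext = ".tsv")
rows <- c("id\tname\tcategory",
          "1\t\u2665 TUMMY SHAPER \u2665\t2",
          "2\tphone case\t5")
con <- file(tsv_path, open = "wb")
writeLines(enc2utf8(rows), con, useBytes = TRUE)  # write the bytes as UTF-8, unconverted
close(con)

# Read with explicit tab separator and quote handling disabled
df_small <- fread(tsv_path, sep = "\t", quote = "", header = TRUE,
                  encoding = "UTF-8", showProgress = FALSE, data.table = FALSE)

# Sanity check: no cell in `name` should still contain a tab
leftover_tabs <- any(grepl("\t", df_small$name, fixed = TRUE))
```

If leftover_tabs is FALSE and df_small has three columns, the tabs after the symbols were split as expected.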