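Background
R, a statistical analysis environment, ships with more than 100 built-in datasets by default, so you do not have to prepare your own data when learning an analysis method.
However, ways to find out which datasets exist and what data type each one is stored as are not widely known, so I am noting one here.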
Code
Taking the $results element of the value returned by the data() function gives the list of datasets as a matrix that R can operate on.
Result
> class(data()$results)
[1] "matrix"
> nrow(data()$results)
[1] 102
This shows that 102 built-in datasets are available in this environment.
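To see what the matrix contains before subsetting it, a short sketch like the following may help. It assumes only base R. The variable name `res` is illustrative, and `package = "datasets"` restricts the listing to the datasets package, so the row count may differ from the 102 above depending on the R version. `is.matrix()` is used instead of comparing `class()` because, from R 4.0.0 on, `class()` of a matrix also includes "array".

```r
# Index of the datasets shipped in the 'datasets' package only
res <- data(package = "datasets")$results

# The matrix has one row per dataset and four character columns
print(colnames(res))   # "Package" "LibPath" "Item" "Title"
print(is.matrix(res))  # TRUE regardless of R version
print(nrow(res))       # the count depends on the R version
```

Only the "Item" and "Title" columns are used in the rest of this memo.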
Next, display the first 10 of the 102 entries.
> head(data()$results[,c("Item","Title")], n = 10)
      Item                     Title
 [1,] "AirPassengers"          "Monthly Airline Passenger Numbers 1949-1960"
 [2,] "BJsales"                "Sales Data with Leading Indicator"
 [3,] "BJsales.lead (BJsales)" "Sales Data with Leading Indicator"
 [4,] "BOD"                    "Biochemical Oxygen Demand"
 [5,] "CO2"                    "Carbon Dioxide Uptake in Grass Plants"
 [6,] "ChickWeight"            "Weight versus age of chicks on different diets"
 [7,] "DNase"                  "Elisa assay of DNase"
 [8,] "EuStockMarkets"         "Daily Closing Prices of Major European Stock Indices, 1991-1998"
 [9,] "Formaldehyde"           "Determination of Formaldehyde"
[10,] "HairEyeColor"           "Hair and Eye Color of Statistics Students"
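Next, an entry written like BJsales.lead (BJsales) cannot be evaluated as-is to view the dataset, so the parenthesized part is removed before the name is passed as the argument of the class() function.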
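# Strip the trailing "(source)" part, e.g. "BJsales.lead (BJsales)" -> "BJsales.lead "
Items <- gsub(pattern = "\\(.*\\)$", replacement = "", x = data()$results[, "Item"])
Titles <- data()$results[, "Title"]
# One empty slot per dataset to hold its class name
cArray <- array("", NROW(Items))
for (i in seq_along(Items)) {
  # Evaluate the name as code and keep the last element of its class vector
  eval(parse(text = paste('cArray[i] <- class(', Items[i], ')[length(class(', Items[i], '))]', sep = "")))
}
dataSets <- cbind(Items, cArray, Titles)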
The results are collected in a variable named dataSets.
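As an alternative to building code strings for eval(parse()), the same kind of table can probably be produced by looking each object up by name. This is only a sketch and not the method used above: the names `res`, `items`, `last_class` and `data_sets` are illustrative, it assumes the datasets package is attached (the default), and unlike the code above it also strips the space before the parenthesis so that names carry no trailing blank.

```r
# Index of the datasets shipped in the 'datasets' package
res <- data(package = "datasets")$results

# Drop a trailing " (source)" suffix together with the space before it
items <- sub("[[:space:]]*\\(.*\\)$", "", res[, "Item"])

# Look each object up by name and keep the last element of its class vector
env <- as.environment("package:datasets")
last_class <- vapply(items, function(nm) {
  cls <- class(get(nm, envir = env))
  cls[length(cls)]
}, character(1), USE.NAMES = FALSE)

# Collect the result as a data frame instead of a character matrix
data_sets <- data.frame(Item = items, Class = last_class,
                        Title = res[, "Title"], stringsAsFactors = FALSE)
print(head(data_sets[order(data_sets$Item), ], n = 10))
```

With a data frame, a call such as `table(data_sets$Class)` gives the number of datasets per class.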
First 10 rows sorted by Item name
> head(dataSets[order(Items),], n = 10)
      Items           cArray       Titles
 [1,] "ability.cov"   "list"       "Ability and Intelligence Tests"
 [2,] "airmiles"      "ts"         "Passenger Miles on Commercial US Airlines, 1937-1960"
 [3,] "AirPassengers" "ts"         "Monthly Airline Passenger Numbers 1949-1960"
 [4,] "airquality"    "data.frame" "New York Air Quality Measurements"
 [5,] "anscombe"      "data.frame" "Anscombe's Quartet of 'Identical' Simple Linear Regressions"
 [6,] "attenu"        "data.frame" "The Joyner-Boore Attenuation Data"
 [7,] "attitude"      "data.frame" "The Chatterjee-Price Attitude Data"
 [8,] "austres"       "ts"         "Quarterly Time Series of the Number of Australian Residents"
 [9,] "beaver1 "      "data.frame" "Body Temperature Series of Two Beavers"
[10,] "beaver2 "      "data.frame" "Body Temperature Series of Two Beavers"
First 10 rows sorted by class name
> head(dataSets[order(cArray),], n = 10)
      Items          cArray       Titles
 [1,] "iris3"        "array"      "Edgar Anderson's Iris Data"
 [2,] "state.abb "   "character"  "US State Facts and Figures"
 [3,] "state.name "  "character"  "US State Facts and Figures"
 [4,] "BOD"          "data.frame" "Biochemical Oxygen Demand"
 [5,] "CO2"          "data.frame" "Carbon Dioxide Uptake in Grass Plants"
 [6,] "ChickWeight"  "data.frame" "Weight versus age of chicks on different diets"
 [7,] "DNase"        "data.frame" "Elisa assay of DNase"
 [8,] "Formaldehyde" "data.frame" "Determination of Formaldehyde"
 [9,] "Indometh"     "data.frame" "Pharmacokinetics of Indomethacin"
[10,] "InsectSprays" "data.frame" "Effectiveness of Insect Sprays"
When you only want the time-series datasets
> dataSets[which(dataSets[,"cArray"]=="ts"),]
      Items            cArray Titles
 [1,] "AirPassengers"  "ts"   "Monthly Airline Passenger Numbers 1949-1960"
 [2,] "BJsales"        "ts"   "Sales Data with Leading Indicator"
 [3,] "BJsales.lead "  "ts"   "Sales Data with Leading Indicator"
 [4,] "JohnsonJohnson" "ts"   "Quarterly Earnings per Johnson & Johnson Share"
 [5,] "LakeHuron"      "ts"   "Level of Lake Huron 1875-1972"
 [6,] "Nile"           "ts"   "Flow of the River Nile"
 [7,] "Seatbelts"      "ts"   "Road Casualties in Great Britain 1969-84"
 [8,] "UKDriverDeaths" "ts"   "Road Casualties in Great Britain 1969-84"
 [9,] "UKgas"          "ts"   "UK Quarterly Gas Consumption"
[10,] "USAccDeaths"    "ts"   "Accidental Deaths in the US 1973-1978"
[11,] "WWWusage"       "ts"   "Internet Usage per Minute"
[12,] "airmiles"       "ts"   "Passenger Miles on Commercial US Airlines, 1937-1960"
[13,] "austres"        "ts"   "Quarterly Time Series of the Number of Australian Residents"
[14,] "co2"            "ts"   "Mauna Loa Atmospheric CO2 Concentration"
[15,] "discoveries"    "ts"   "Yearly Numbers of Important Discoveries"
[16,] "fdeaths "       "ts"   "Monthly Deaths from Lung Diseases in the UK"
[17,] "freeny.y "      "ts"   "Freeny's Revenue Data"
[18,] "ldeaths "       "ts"   "Monthly Deaths from Lung Diseases in the UK"
[19,] "lh"             "ts"   "Luteinizing Hormone in Blood Samples"
[20,] "lynx"           "ts"   "Annual Canadian Lynx trappings 1821-1934"
[21,] "mdeaths "       "ts"   "Monthly Deaths from Lung Diseases in the UK"
[22,] "nhtemp"         "ts"   "Average Yearly Temperatures in New Haven"
[23,] "nottem"         "ts"   "Average Monthly Temperatures at Nottingham, 1920-1939"
[24,] "presidents"     "ts"   "Quarterly Approval Ratings of US Presidents"
[25,] "sunspot.month"  "ts"   "Monthly Sunspot Data, 1749-1997"
[26,] "sunspot.year"   "ts"   "Yearly Sunspot Data, 1700-1988"
[27,] "sunspots"       "ts"   "Monthly Sunspot Numbers, 1749-1983"
[28,] "treering"       "ts"   "Yearly Treering Data, -6000-1979"
[29,] "uspop"          "ts"   "Populations Recorded by the US Census"
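The filter above compares only the last element of each class vector. A slightly different check, sketched below, tests whether "ts" appears anywhere in the class vector by using inherits(). This is an illustrative variant, not the method used above: the names `res`, `items`, `env`, `is_ts` and `ts_names` are hypothetical, the datasets package is assumed to be attached, and the result may include multivariate series whose last class element is something other than "ts", so the list can be longer than the one shown.

```r
# Index of the datasets shipped in the 'datasets' package
res <- data(package = "datasets")$results
items <- sub("[[:space:]]*\\(.*\\)$", "", res[, "Item"])
env <- as.environment("package:datasets")

# TRUE when "ts" is anywhere in the object's class vector
is_ts <- vapply(items, function(nm) inherits(get(nm, envir = env), "ts"),
                logical(1), USE.NAMES = FALSE)
ts_names <- items[is_ts]
print(ts_names)

# Inspect the time attributes of one of the listed series
print(start(AirPassengers))      # first observation: year 1949, month 1
print(frequency(AirPassengers))  # 12 observations per year
```

Any name in `ts_names` can be passed directly to functions such as plot() or window() that understand time-series objects.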
Reference URLs
Information on the 'datasets' package - RjpWiki
A roundup of datasets usable for statistical analysis, for people who want to learn statistics - ほくそ笑む
Execution environment
> sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=Japanese_Japan.932 LC_CTYPE=Japanese_Japan.932 LC_MONETARY=Japanese_Japan.932
[4] LC_NUMERIC=C LC_TIME=Japanese_Japan.932
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_3.0.0