注意点
2021/09/28時点に確認した結果です。最新情報は別途ご確認ください。
はじめに
以下のデータの内容をまとめたかった(力技です)
Azure Databricks Databricks ファイル システム (DBFS)にマウントされたさまざまなデータセットが含まれています。 これらのデータセットは、ドキュメント全体の例で使用されます。
対象
直下にREADME.mdが配置されているもののみREADME.mdも確認しました。
直下にないものは更にサブフォルダがあったりします。
結果
databricks-datasets内容
path | name | 直下にREADME |
---|---|---|
dbfs:/databricks-datasets/COVID/ | COVID/ | × |
dbfs:/databricks-datasets/README.md | README.md | - |
dbfs:/databricks-datasets/Rdatasets/ | Rdatasets/ | 〇 |
dbfs:/databricks-datasets/SPARK_README.md | SPARK_README.md | - |
dbfs:/databricks-datasets/adult/ | adult/ | 〇 |
dbfs:/databricks-datasets/airlines/ | airlines/ | 〇 |
dbfs:/databricks-datasets/amazon/ | amazon/ | 〇 |
dbfs:/databricks-datasets/asa/ | asa/ | × |
dbfs:/databricks-datasets/atlas_higgs/ | atlas_higgs/ | × |
dbfs:/databricks-datasets/bikeSharing/ | bikeSharing/ | 〇 |
dbfs:/databricks-datasets/cctvVideos/ | cctvVideos/ | × |
dbfs:/databricks-datasets/credit-card-fraud/ | credit-card-fraud/ | × |
dbfs:/databricks-datasets/cs100/ | cs100/ | × |
dbfs:/databricks-datasets/cs110x/ | cs110x/ | × |
dbfs:/databricks-datasets/cs190/ | cs190/ | × |
dbfs:/databricks-datasets/data.gov/ | data.gov/ | 〇 |
dbfs:/databricks-datasets/definitive-guide/ | definitive-guide/ | 〇 |
dbfs:/databricks-datasets/delta-sharing/ | delta-sharing/ | × |
dbfs:/databricks-datasets/flights/ | flights/ | 〇 |
dbfs:/databricks-datasets/flower_photos/ | flower_photos/ | 〇 |
dbfs:/databricks-datasets/flowers/ | flowers/ | 〇 |
dbfs:/databricks-datasets/genomics/ | genomics/ | × |
dbfs:/databricks-datasets/hail/ | hail/ | × |
dbfs:/databricks-datasets/iot/ | iot/ | × |
dbfs:/databricks-datasets/iot-stream/ | iot-stream/ | 〇 |
dbfs:/databricks-datasets/learning-spark/ | learning-spark/ | 〇 |
dbfs:/databricks-datasets/learning-spark-v2/ | learning-spark-v2/ | × |
dbfs:/databricks-datasets/lending-club-loan-stats/ | lending-club-loan-stats/ | × |
dbfs:/databricks-datasets/med-images/ | med-images/ | × |
dbfs:/databricks-datasets/mnist-digits/ | mnist-digits/ | 〇 |
dbfs:/databricks-datasets/news20.binary/ | news20.binary/ | 〇 |
dbfs:/databricks-datasets/nyctaxi/ | nyctaxi/ | × |
dbfs:/databricks-datasets/nyctaxi-with-zipcodes/ | nyctaxi-with-zipcodes/ | × |
dbfs:/databricks-datasets/online_retail/ | online_retail/ | × |
dbfs:/databricks-datasets/overlap-join/ | overlap-join/ | × |
dbfs:/databricks-datasets/power-plant/ | power-plant/ | 〇 |
dbfs:/databricks-datasets/retail-org/ | retail-org/ | 〇 |
dbfs:/databricks-datasets/rwe/ | rwe/ | × |
dbfs:/databricks-datasets/sai-summit-2019-sf/ | sai-summit-2019-sf/ | 〇 |
dbfs:/databricks-datasets/sample_logs/ | sample_logs/ | × |
dbfs:/databricks-datasets/samples/ | samples/ | × |
dbfs:/databricks-datasets/sfo_customer_survey/ | sfo_customer_survey/ | × |
dbfs:/databricks-datasets/sms_spam_collection/ | sms_spam_collection/ | 〇 |
dbfs:/databricks-datasets/songs/ | songs/ | 〇 |
dbfs:/databricks-datasets/structured-streaming/ | structured-streaming/ | × |
dbfs:/databricks-datasets/timeseries/ | timeseries/ | × |
dbfs:/databricks-datasets/tpch/ | tpch/ | 〇 |
dbfs:/databricks-datasets/weather/ | weather/ | × |
dbfs:/databricks-datasets/wiki/ | wiki/ | × |
dbfs:/databricks-datasets/wikipedia-datasets/ | wikipedia-datasets/ | × |
dbfs:/databricks-datasets/wine-quality/ | wine-quality/ | 〇 |
databricks-datasets/README.md
Databricks Hosted Datasets
The data contained within this directory is hosted for users to build
data pipelines using Apache Spark and Databricks.License
Unless otherwise noted (e.g. within the README for a given data set), the data
is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0),
which can be viewed at the following url:
http://creativecommons.org/licenses/by/4.0/legalcodeContributions and Requests
To request or contribute new datasets to this repository, please send an email
to: hosted-datasets@databricks.com.When making the request, include the README.md file you want to publish. Make
sure the file includes information about the source of the data, the license,
and how to get additional information. Please ensure the license for this
data allows it to be hosted by Databricks and consumed by the public.
こちらだけ訳をかけました。
Databricks Hosted Datasets
このディレクトリに含まれるデータは、ユーザーがApache SparkとDatabricksを使って
このディレクトリに含まれるデータは、Apache SparkとDatabricksを使用してデータパイプラインを構築するため>にホストされています。ライセンス
特に断りのない限り(例:特定のデータセットのREADME内)、データは
はCreative Commons Attribution 4.0 International (CC BY 4.0)でライセンスされています。
データは以下のURLで見ることができます。
http://creativecommons.org/licenses/by/4.0/legalcodeご意見・ご要望
このリポジトリへの新しいデータセットの寄稿を希望される方は,電子メール
hosted-datasets@databricks.com に送ってください。リクエストの際には、公開したいREADME.mdファイルを添付してください。このファイルには
このファイルには、データの出所、ライセンス、追加情報の入手方法などの情報が含まれていることを確認してくだ>さい。
追加情報の入手方法が含まれていることを確認してください。また、このデータのライセンスが
このデータのライセンスが、Databricksでホストされ、一般に消費されることを許可するものであることを確認して>ください。
databricks-datasets/SPARK_README.md
Apache Spark
Spark is a fast and general cluster computing system for Big Data. It provides
high-level APIs in Scala, Java, Python, and R, and an optimized engine that
supports general computation graphs for data analysis. It also supports a
rich set of higher-level tools including Spark SQL for SQL and DataFrames,
MLlib for machine learning, GraphX for graph processing,
and Spark Streaming for stream processing.Online Documentation
You can find the latest Spark documentation, including a programming
guide, on the project web page
and project wiki.
This README file only contains basic setup instructions.Building Spark
Spark is built using Apache Maven.
To build Spark and its example programs, run:build/mvn -DskipTests clean package
(You do not need to do this if you downloaded a pre-built package.)
More detailed documentation is available from the project site, at
"Building Spark".Interactive Scala Shell
The easiest way to start using Spark is through the Scala shell:
./bin/spark-shell
Try the following command, which should return 1000:
scala> sc.parallelize(1 to 1000).count()
Interactive Python Shell
Alternatively, if you prefer Python, you can use the Python shell:
./bin/pyspark
And run the following command, which should also return 1000:
sc.parallelize(range(1000)).count()
Example Programs
Spark also comes with several sample programs in the
examples
directory.
To run one of them, use./bin/run-example <class> [params]
. For example:./bin/run-example SparkPi
will run the Pi example locally.
You can set the MASTER environment variable when running examples to submit
examples to a cluster. This can be a mesos:// or spark:// URL,
"yarn" to run on YARN, and "local" to run
locally with one thread, or "local[N]" to run locally with N threads. You
can also use an abbreviated class name if the class is in theexamples
package. For instance:MASTER=spark://host:7077 ./bin/run-example SparkPi
Many of the example programs print usage help if no params are given.
Running Tests
Testing first requires building Spark. Once Spark is built, tests
can be run using:./dev/run-tests
Please see the guidance on how to
run tests for a module, or individual tests.A Note About Hadoop Versions
Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported
storage systems. Because the protocols have changed in different versions of
Hadoop, you must build Spark against the same version that your cluster runs.Please refer to the build documentation at
"Specifying the Hadoop Version"
for detailed guidance on building for a particular distribution of Hadoop, including
building for particular Hive and Hive Thriftserver distributions.Configuration
Please refer to the Configuration Guide
in the online documentation for an overview on how to configure Spark.
ここまでdatabricks-datasets
databricks-datasetsサブディレクトリ
Rdatasets/README.md
Rdatasets
is a collection of 747 datasets that were originally distributed alongside the >statistical software environmentR
and some of its add-on packages. The goal is to make >these data more broadly accessible for teaching and statistical software development.The list of available datasets (csv and docs) is available here:
For more information, please see the README file within the latest
data
subdirectoryVersions
- data-001 is from the git hash: aa0d6940a9
adult/README.md
Predict whether income exceeds $50K/yr based on census data. Also known as "Census Income" >dataset.
Source
Donor:
Ronny Kohavi and Barry Becker
Data Mining and Visualization
Silicon Graphics.
e-mail: ronnyk '@' live.com for questions.Data Set Information:
Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean >records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& >(HRSWK>0))
Prediction task is to determine whether a person makes over 50K a year.
Attribute Information
Label:
50K, <=50K
Attributes:
- age: continuous
- workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, >Without-pay, Never-worked
- fnlwgt: continuous
- education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, >9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool
- education-num: continuous
- marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, >Married-spouse-absent, Married-AF-spouse
- occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, >Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, >Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces
- relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried
- race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
- sex: Female, Male
- capital-gain: continuous
- capital-loss: continuous
- hours-per-week: continuous
- native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, >Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, >Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, >Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, >Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands
Citation
If you publish material based on databases obtained from this repository, then, in your >acknowledgements, please note the assistance you received by using this repository. This will >help others to obtain the same data sets and replicate your experiments. We suggest the >following pseudo-APA reference format for referring to this repository:
Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, >CA: University of California, School of Information and Computer Science.
airlines/README.md
Airline On-Time Statistics and Delay Causes
Background
The U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics (BTS) tracks >the on-time performance of domestic flights operated by large air carriers. Summary information >on the number of on-time, delayed, canceled and diverted flights appears in DOT's monthly Air >Travel Consumer Report, published about 30 days after the month's end, as well as in summary >tables posted on this website. BTS began collecting details on the causes of flight delays in >June 2003. Summary statistics and raw data are made available to the public at the time the Air >Travel Consumer Report is released.
FAQ Information is available at http://www.rita.dot.gov/bts/help_with_data/aviation/index.html
Data Source
http://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp
Usage Restrictions
The data is released under the Freedom of Information act. More information can be found at >http://www.fas.org/sgp/foia/citizen.html
amazon/README.md
Amazon Reviews datasets
Thedata20K
andtest4K
datasets were created by Professor Julian McAuley at the University of California San Diego with the permission for use in >thedatabricks-datasets
bucket by Databricks users.Source: Image-based recommendations on styles and substitutes.
J. McAuley, C. Targett, J. Shi, A. van den Hengel.
SIGIR, 2015.
bikeSharing/README.md
Bike Sharing Dataset
Hadi Fanaee-T
Laboratory of Artificial Intelligence and Decision Support (LIAAD), University of Porto
INESC Porto, Campus da FEUP
Rua Dr. Roberto Frias, 378
4200 - 465 Porto, PortugalBackground
Bike sharing systems are new generation of traditional bike rentals where whole process from >membership, rental and return
back has become automatic. Through these systems, user is able to easily rent a bike from a >particular position and return
back at another position. Currently, there are about over 500 bike-sharing programs around the >world which is composed of
over 500 thousands bicycles. Today, there exists great interest in these systems due to their >important role in traffic,
environmental and health issues.Apart from interesting real world applications of bike sharing systems, the characteristics of >data being generated by
these systems make them attractive for the research. Opposed to other transport services such >as bus or subway, the duration
of travel, departure and arrival position is explicitly recorded in these systems. This feature >turns bike sharing system into
a virtual sensor network that can be used for sensing mobility in the city. Hence, it is >expected that most of important
events in the city could be detected via monitoring these data.Dataset
Bike-sharing rental process is highly correlated to the environmental and seasonal settings. >For instance, weather conditions,
precipitation, day of week, season, hour of the day, etc. can affect the rental behaviors. The >core data set is related to
the two-year historical log corresponding to years 2011 and 2012 from Capital Bikeshare system, >Washington D.C., USA which is
publicly available in http://capitalbikeshare.com/system-data. We aggregated the data on two >hourly and daily basis and then
extracted and added the corresponding weather and seasonal information. Weather information are >extracted from http://www.freemeteo.com.Associated Tasks
Regression:
- Predication of bike rental count hourly or daily based on the environmental and >seasonal settings.
Event and Anomaly Detection:
- Count of rented bikes are also correlated to some events in the town which easily are >traceable via search engines.
For instance, query like "2012-10-30 washington d.c." in Google returns related results >to Hurricane Sandy. Some of the important events are
identified in [1]. Therefore the data can be used for validation of anomaly or event >detection algorithms as well.Files
- hour.csv : bike sharing counts aggregated on hourly basis. Records: 17379 hours
- day.csv : bike sharing counts aggregated on daily basis. Records: 731 days
Dataset characteristics
Both hour.csv and day.csv have the following fields, except hr which is not available in day.csv
- instant: record index
- dteday : date
- season : season (1:springer, 2:summer, 3:fall, 4:winter)
- yr : year (0: 2011, 1:2012)
- mnth : month ( 1 to 12)
- hr : hour (0 to 23)
- holiday : weather day is holiday or not (extracted from http://dchr.dc.gov/page/>holiday-schedule)
- weekday : day of the week
- workingday : if day is neither weekend nor holiday is 1, otherwise is 0.
- weathersit :
- 1: Clear, Few clouds, Partly cloudy, Partly cloudy
- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered >clouds
- 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
- temp : Normalized temperature in Celsius. The values are divided to 41 (max)
- atemp: Normalized feeling temperature in Celsius. The values are divided to 50 (max)
- hum: Normalized humidity. The values are divided to 100 (max)
- windspeed: Normalized wind speed. The values are divided to 67 (max)
- casual: count of casual users
- registered: count of registered users
- cnt: count of total rental bikes including both casual and registered
License
Use of this dataset in publications must be cited to the following publication:
[1] Fanaee-T, Hadi, and Gama, Joao, "Event labeling combining ensemble detectors and background >knowledge", Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg, >doi:10.1007/s13748-013-0040-3.
@article{
year={2013},
issn={2192-6352},
journal={Progress in Artificial Intelligence},
doi={10.1007/s13748-013-0040-3},
title={Event labeling combining ensemble detectors and background knowledge},
url={http://dx.doi.org/10.1007/s13748-013-0040-3},
publisher={Springer Berlin Heidelberg},
keywords={Event labeling; Event detection; Ensemble learning; Background knowledge},
author={Fanaee-T, Hadi and Gama, Joao},
pages={1-15}
}Contact
For further information about this dataset please contact Hadi Fanaee-T (hadi.fanaee@fe.up.pt)
##data.gov/README.md
Data.gov Datasets
This folder houses data that is copied from http://www.data.gov/. This >vast trove of data is published and maintained by the government of the United States.
We only provide a small subset of datasets that are published on the site and it's worth exploring http://www.data.gov/ itself if you want to find other data to work with!
definitive-guide/README.md
This folder contains all of the datasets used in The Definitive Guide.
The datasets are as follow.
Flight Data
This data comes from the United States Bureau of Transportation. Please see the website for >more information: https://www.rita.dot.gov/bts/help_with_data/aviation/index.html
Retail Data
Daqing Chen, Sai Liang Sain, and Kun Guo, Data mining for the online retail industry: A case >study of RFM model-based customer segmentation using data mining, Journal of Database Marketing >and Customer Strategy Management, Vol. 19, No. 3, pp. 197–208, 2012 (Published online before >print: 27 August 2012. doi: 10.1057/dbm.2012.17).
The data was downloaded from the UCI Machine Learning Repository. Please see this page for more >information: http://archive.ics.uci.edu/ml/datasets/Online+Retail
Bike Data
This data comes from the Bay Area Bike Share network. Please see this page for more infomation: >http://www.bayareabikeshare.com/open-data
Sensor Data (Heterogeneity Human Activity Recognition Dataset)
Allan Stisen, Henrik Blunck, Sourav Bhattacharya, Thor Siiger Prentow, Mikkel Baun Kjærgaard, >Anind Dey, Tobias Sonne, and Mads Møller Jensen "Smart Devices are Different: Assessing and >Mitigating Mobile Sensing Heterogeneities for Activity Recognition" In Proc. 13th ACM >Conference on Embedded Networked Sensor Systems (SenSys 2015), Seoul, Korea, 2015. [Web Link]
The data was downloaded from the UCI Machine Learning Repository. It is formally known as the Heterogeneity Human Activity Recognition Dataset. Please see this page for more information: https://archive.ics.uci.edu/ml/datasets/Heterogeneity+Activity+Recognition
flights/README.md
On-Time Performance Datasets
The source
airports
dataset can be found at [OpenFlights Airport, airline and route data]>(http://openflights.org/data.html).The
flights
, also known as thedeparturedelays
, dataset can be found at Airline On-Time Performance and Causes of Flight Delays: On_Time Data
flower_photos/README.md
Flowers
This data set was obtained from
https://www.tensorflow.org/datasets/catalog/tf_flowers
The source of the data is:
Author: "The TensorFlow Team",
Title: "Flowers",
Url: "http://download.tensorflow.org/example_images/flower_photos.tgz”Data Set Information
A large set of images of flowers.
License and/or Citation
All images in this archive are licensed under the Creative Commons By-Attribution License, >available at:
https://creativecommons.org/licenses/by/2.0/The photographers are listed below, thanks to all of them for making their work available, and >please be sure to credit them for any use as per the license.
See the full list of photos and photographers in LICENSE.txt.
Citation:
@ONLINE {tfflowers,
author = "The TensorFlow Team",
title = "Flowers",
month = "jan",
year = "2019",
url = "http://download.tensorflow.org/example_images/flower_photos.tgz" }
flowers/README.md
Flowers
This data set was obtained from
https://www.tensorflow.org/datasets/catalog/tf_flowers
The source of the data is:
Author: "The TensorFlow Team",
Title: "Flowers",
Url: "http://download.tensorflow.org/example_images/flower_photos.tgz”Data Set Information
A Delta table contains a large set of images of flowers. The ‘content’ column is a binary >column of the images, and the ‘label’ column is a string column of the labels. The ‘path’ >column the dbfs path of the image and the ‘size’ column contains the width and height of the >image.
License and/or Citation
All images in this archive are licensed under the Creative Commons By-Attribution License, >available at:
https://creativecommons.org/licenses/by/2.0/The photographers are listed below, thanks to all of them for making their work available, and >please be sure to credit them for any use as per the license.
(See the full list of photos and photographers in LICENSE.txt.)
Citation:
@ONLINE {tfflowers,
author = "The TensorFlow Team",
title = "Flowers",
month = "jan",
year = "2019",
url = "http://download.tensorflow.org/example_images/flower_photos.tgz" }
##iot-stream/README.md
IOT Device Data
This dataset was created by Databricks.
It contains fake generated data in json and csv formats.
e.g.
{"user_id": 12, "calories_burnt": 489.79998779296875, "num_steps": 9796, "miles_walked": 4.>8979997634887695, "time_stamp": "2018-07-24 03:54:00.893775", "device_id": 10}
Data Set Information
Schema for data-device:
[StructField(id,LongType,false), StructField(user_id,LongType,true), StructField(device_id,LongType,true), StructField(num_steps,LongType,true), StructField(miles_walked,FloatType,true), StructField(calories_burnt,FloatType,true), StructField(timestamp,StringType,true), StructField(value,StringType,true)]
Schema for data-user:
[StructField(userid,IntegerType,true), StructField(gender,StringType,true), StructField(age,IntegerType,true), StructField(height,IntegerType,true), StructField(weight,IntegerType,true), StructField(smoker,StringType,true), StructField(familyhistory,StringType,true), StructField(cholestlevs,StringType,true), StructField(bp,StringType,true), StructField(risk,IntegerType,true)]
License and/or Citation
Copyright (2018) Databricks, Inc.
This dataset is licensed under a Creative Commons Attribution 4.0 International Licensehttps://creativecommons.org/licenses/by/4.0/.
learning-spark/README.md
Learning Spark - Example Data From The Book
This dataset holds the files for examples in the Learning Spark book. These examples are used >throughout
the book.For more information, please see the
README from the Learning Spark github projectLicense
The files in the Learning Spark github project are licensed with the
MIT license as defined in https://github.com/databricks/learning-spark/blob/master/LICENSE.mdVersions
- data-001 is from the git hash: 13c39f22b1
##mnist-digits/README.md
MNIST handwritten digits dataset
Data Source
LibSVM Datasets
https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#mnist
Chih-Chung Chang and Chih-Jen Lin, LIBSVM : a library for support vector machines. ACM >Transactions on Intelligent Systems and Technology, 2:27:1--27:27, 2011.Original Data Set Source
Yann LeCun, L. Bottou, Y. Bengio, and P. Haffner.
Gradient-based learning applied to document recognition.
Proceedings of the IEEE, 86(11):2278-2324, November 1998.
MNIST database available at http://yann.lecun.com/exdb/mnist/
news20.binary/README.md
20 Newsgroups Dataset -- Binary Classification
This is a processed version of the 20 Newsgroup Dataset, saved in a parquet format.
Attribute Information
- newsgroup:string, Name of Newsgroup
- content:string, Document Content
- relatedToSci:integer, 1/0 binary indicator to determine if article belongs to a sci newsgroup >or not
List of Newsgroups:
- alt.atheism
- comp.graphics
- comp.os.ms-windows.misc
- comp.sys.ibm.pc.hardware
- comp.sys.mac.hardware
- comp.windows.x
- misc.forsale
- rec.autos
- rec.motorcycles
- rec.sport.baseball
- rec.sport.hockey
- sci.crypt
- sci.electronics
- sci.med
- sci.space
- soc.religion.christian
- talk.politics.guns
- talk.politics.mideast
- talk.politics.misc
- talk.religion.misc
Source
####Original Owner and Donor
Tom Mitchell
School of Computer Science
Carnegie Mellon University
tom.mitchell@cmu.eduDate Donated: September 9, 1999
Acknowledgements, Copyright Information, and Availability
You may use this material free of charge for any educational purpose, provided attribution is >given in any lectures or publications that make use of this material.
power-plant/README.md
Combined Cycle Power Plant Data Set
Power Plant Sensor Readings Data SetSource
http://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant
##Summary
The example data is provided by UCI at UCI Machine Learning Repository Combined Cycle Power >Plant Data Set
You can read the background on the UCI page, but in summary we have collected a number of >readings from sensors at a Gas Fired Power Plant (also called a Peaker Plant) and now we want >to use those sensor readings to predict how much power the plant will generate.Usage License
If you publish material based on databases obtained from this repository, then, in your >acknowledgements, please note the assistance you received by using this repository. This will >help others to obtain the same data sets and replicate your experiments. We suggest the >following reference format for referring to this repository:
Pınar Tüfekci, Prediction of full load electrical power output of a base load operated combined >cycle power plant using machine learning methods, International Journal of Electrical Power & >Energy Systems, Volume 60, September 2014, Pages 126-140, ISSN 0142-0615, Link, LinkHeysem Kaya, Pınar Tüfekci , Sadık Fikret Gürgen: Local and Global Learning Methods for Predicting Power of a Combined Gas & Steam Turbine, Proceedings of the International Conference on Emerging Trends in Computer and Electronics Engineering ICETCEE 2012, pp. 13-18 (Mar. 2012, Dubai)
retail-org/README.md
Synthetic Retail Dataset
This dataset is a collection of files representing different dimensions and facts for a retail >organization.Provenance
This dataset was generated by Databricks.Data Set Information
- Sales Orders: sales_orders/sales_orders.json records the customers' originating purchase >order.
- Purchase Orders: purchase_orders/purchase_orders.xml contains the raw materials that are >being purchased.
- Products: products/products.csv contains products that the company sells.
- Goods Receipt: goods_receipt/goods_receipt.parquet contains the arrival time of purchased >orders.
- Customers: customers/customers.csv contains those customers who are located in the US and >are buying the finished products.
- Suppliers: suppliers/suppliers.csv contains suppliers that provide raw materials in the >US.
- Sales Stream: sales_stream/sales_stream.json/ is a folder containing JSON files for >streaming purposes.
- Promotions: promotions/promotions.csv contains additional benefits on top of normal >purchases.
- Active Promotions: active_promotions/active_promotions.parquet shows how customers are >progressing towards becoming eligible for promotions.
- Loyalty Segment: loyalty_segment/loyalty_segment.csv contains segmented customer data to >appeal to all types of guests using targeted rewards and promotions.
License and/or Citation
Copyright (2020) Databricks, Inc. This dataset is licensed under a Creative Commons Attribution 4.0 International Licensehttps://creativecommons.org/licenses/by/4.0/
sai-summit-2019-sf/README.md
README
Introduction
Fire Calls-For-Service includes all fire units responses to calls. Each record includes the >call number, incident number, address, unit identifier, call type, and disposition. All >relevant time intervals are also included. Because this dataset is based on responses, and >since most calls involved multiple units, there are multiple records for each call number. >Addresses are associated with a block number, intersection or call box, not a specific address.
License
The data itself is available under an ODC Public Domain Dedication and License.
Additional Information
See https://data.sfgov.org/Public-Safety/Fire-Department-Calls-for-Service/nuek-vuh3
sms_spam_collection/README.md
SMS Spam Collection v. 1
The SMS Spam Collection v.1 is a public set of SMS labeled messages that have been collected >for mobile phone spam research. It has one collection composed by 5,574 English, real and >non-enconded messages, tagged according being legitimate (ham) or spam.
Composition
This corpus has been collected from free or free for research sources at the Internet:
- A collection of 425 SMS spam messages was manually extracted from the Grumbletext Web site. >This is a UK forum in which cell phone users make public claims about SMS spam messages, most >of them without reporting the very spam message received. The identification of the text of >spam messages in the claims is a very hard and time-consuming task, and it involved carefully >scanning hundreds of web pages. The Grumbletext Web site is: http://www.grumbletext.co.uk/.
- A subset of 3,375 SMS randomly chosen ham messages of the NUS SMS Corpus (NSC), which is a >dataset of about 10,000 legitimate messages collected for research at the Department of >Computer Science at the National University of Singapore. The messages largely originate from >Singaporeans and mostly from students attending the University. These messages were collected >from volunteers who were made aware that their contributions were going to be made publicly >available. The NUS SMS Corpus is avalaible at: http://www.comp.nus.edu.sg/~rpnlpir/downloads/>corpora/smsCorpus/.
- A list of 450 SMS ham messages collected from Caroline Tag's PhD Thesis available at http://>etheses.bham.ac.uk/253/1/Tagg09PhD.pdf.
Finally, we have incorporated the SMS Spam Corpus v.0.1 Big. It has 1,002 SMS ham messages and >322 spam messages and it is public available at: http://www.esp.uem.es/jmgomez/smsspamcorpus/.
songs/README.md
Sample of Million Song Dataset
Source
This data is a small subset of the Million Song Dataset.
The original data was contributed by The Echo Nest.
Prepared by T. Bertin-MahieuxAttribute Information
- artist_id:string
- artist_latitude:double
- artist_longitude:double
- artist_location:string
- artist_name:string
- duration:double
- end_of_fade_in:double
- key:int
- key_confidence:double
- loudness:double
- release:string
- song_hotnes:double
- song_id:string
- start_of_fade_out:double
- tempo:double
- time_signature:double
- time_signature_confidence:double
- title:string
- year:double
- partial_sequence:int
Citation
Using the dataset?
Please cite the following paper pdf [bib]>(http://labrosa.ee.columbia.edu/millionsong/sites/default/files/millionsong_ismir11_1.bib):
Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere.
The Million Song Dataset. In Proceedings of the 12th International Society
for Music Information Retrieval Conference (ISMIR 2011), 2011.Acknowledgements
The Million Song Dataset was created under a grant from the National Science Foundation, >project IIS-0713334. The original data was contributed by The Echo Nest, as part of an >NSF-sponsored GOALI collaboration. Subsequent donations from SecondHandSongs.com, musiXmatch.>com, and last.fm, as well as further donations from The Echo Nest, are gratefully acknowledged.
Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsors.
tpch/README.md
The data in this directory was generated to run the TPC-H benchmark.
The TPC Benchmark™H (TPC-H) is a decision support benchmark. It consists of a suite of business >oriented
ad-hoc queries and concurrent data modifications. The queries and the data populating the >database have
been chosen to have broad industry-wide relevance. This benchmark illustrates decision support >systems
that examine large volumes of data, execute queries with a high degree of complexity, and give >answers
to critical business questions.For more information, refer to the Transaction Processing Performance Council's
TPC-H pageVersions
- data-001 is a ~10GB TPC-H dataset and was generated by Parviz Deyhim parviz@databricks.com
wine-quality/README.md
Wine Quality Data Set
Two datasets related to red and white variants of the Portuguese "Vinho Verde" wine.
Provenance
This data set was obtained from http://archive.ics.uci.edu/ml/datasets/wine+quality.
The source of the data is:
Paulo Cortez, University of Minho, Guimarães, Portugal, http://www3.dsi.uminho.pt/pcortez
A. Cerdeira, F. Almeida, T. Matos and J. Reis, Viticulture Commission of the Vinho Verde Region>(CVRVV), Porto, Portugal. @2009License and/or Citation
Example:
This data set is licensed under the following license: See citations.Applicable citations:
Cortez, Paulo (2009). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, >CA: University of California, School of Information and Computer Science.P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.