Preparing a TPC-DS dataset at scale factor = 1000 (1 TB) (version for after the end of life of init scripts on DBFS)

Last updated at 2025-04-07, posted at 2025-01-30

Note: The code and other details are collected on GitHub. Please take a look.

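Introduction

I struggled quite a bit to generate the 1 TB TPC-DS dataset (scale factor = 1000),
so I have summarized the references that helped and the points where I got stuck.

This is only a brief summary, but I hope it is of some help.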

Procedure

1. Read the following article to get the overall picture

2. Clone the GitHub repository linked from the article in step 1 into Databricks

2-1. [Preparation] Set up ADLS and create a cluster

▽ Cluster configuration (one example)

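2-2. Run TPC-DS-Configure.py

Init scripts on DBFS have reached end of life, so here is a workaround.
Note: I did this in an environment with Unity Catalog disabled.
① Run the following code from the notebook to save the script to DBFS temporarily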

dbutils.fs.mkdirs("dbfs:/databricks/scripts")

dbutils.fs.put("/databricks/scripts/tpcds-install.sh","""
#!/bin/bash
sudo apt-get --assume-yes install gcc make flex bison byacc git

cd /usr/local/bin
git clone https://github.com/databricks/tpcds-kit.git
cd tpcds-kit/tools
make OS=LINUX""", True)

② Copy the file on DBFS to the /FileStore directory

dbutils.fs.cp("/databricks/scripts/tpcds-install.sh", "/FileStore/tpcds-install.sh")

③ Generate a download link:

displayHTML("<a href='/files/tpcds-install.sh' download>Download tpcds-install.sh</a>")

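④ Click the displayed link to download the file

⑤ Edit the downloaded .sh file with an editor such as Visual Studio so that it looks like the following.
If you use it as an init script without this modification, it fails with an error.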

#!/bin/bash

# Install the required packages
apt-get update
apt-get --assume-yes install gcc make flex bison byacc git

# Clone and build tpcds-kit
cd /usr/local/bin
git clone https://github.com/databricks/tpcds-kit.git
cd tpcds-kit/tools
make OS=LINUX

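⑥ Upload the script from your local machine to any location in the Databricks workspace

⑦ Register it as a cluster init script, using the path of the script uploaded to the workspace

⑧ Restart the cluster

▽ Reference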
https://learn.microsoft.com/ja-jp/azure/databricks/init-scripts/#cluster-scoped-init-script
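
The manual edit in step ⑤ can also be sketched in code. The following is a minimal Python sketch, assuming the script text is the one saved in step ①; `patch_init_script` is a hypothetical helper name, not something from the notebook. It drops `sudo`, adds `apt-get update` before the install line, and puts the shebang on the first line, which yields the same commands as the edited script in step ⑤ (without the comments). It does not try to establish which of these changes is the one that avoids the error.

```python
# Script text as saved in step ① (note the leading blank line before the shebang).
ORIGINAL_SCRIPT = """
#!/bin/bash
sudo apt-get --assume-yes install gcc make flex bison byacc git

cd /usr/local/bin
git clone https://github.com/databricks/tpcds-kit.git
cd tpcds-kit/tools
make OS=LINUX"""


def patch_init_script(original: str) -> str:
    """Return the install script with the step ⑤ changes applied (hypothetical helper)."""
    patched = []
    # strip() removes the leading blank line so the shebang becomes line 1.
    for line in original.strip().splitlines():
        stripped = line.strip()
        if stripped.startswith("sudo "):
            # Call apt-get directly, as in the edited script.
            stripped = stripped[len("sudo "):]
        if stripped.startswith("apt-get") and "install" in stripped:
            # Refresh the package index before installing.
            patched.append("apt-get update")
        patched.append(stripped)
    return "\n".join(patched) + "\n"


if __name__ == "__main__":
    print(patch_init_script(ORIGINAL_SCRIPT))
```

The returned text can be saved as a .sh file and uploaded to the workspace as in step ⑥.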

2-3. Run TPC-DS-GenerateData.py

Change the values as needed, for example scale factor = 1000.
The code I ran is below.

%scala
import com.databricks.spark.sql.perf.tpcds.TPCDSTables

// Set:
val scaleFactor = "1000" // scaleFactor defines the size of the dataset to generate (in GB).
val scaleFactoryInt = scaleFactor.toInt

val scaleName = if(scaleFactoryInt < 1000){
    f"${scaleFactoryInt}%03d" + "GB"
  } else {
    f"${scaleFactoryInt / 1000}%03d" + "TB"
  }

val fileFormat = "parquet" // valid spark file format like parquet, csv, json.
val rootDir = s"/mnt/datalake/raw/tpc-ds/source_files_${scaleName}_${fileFormat}"
val databaseName = "tpcds" + scaleName // name of database to create.

// Run:
val tables = new TPCDSTables(sqlContext,
    dsdgenDir = "/usr/local/bin/tpcds-kit/tools", // location of dsdgen 
    scaleFactor = scaleFactor,
    useDoubleForDecimal = false, // true to replace DecimalType with DoubleType 
    useStringForDate = false) // true to replace DateType with StringType

tables.genData(
    location = rootDir,
    format = fileFormat,
    overwrite = true, // overwrite the data that is already there
    partitionTables = true, // create the partitioned fact tables 
    clusterByPartitionColumns = true, // shuffle to get partitions coalesced into single files. 
    filterOutNullPartitionValues = true, // true to filter out the partition with NULL key value
    tableFilter = "", // "" means generate all tables
    numPartitions = 1000) // how many dsdgen partitions to run - number of input tasks.

// Create the specified database
sql(s"create database IF NOT EXISTS $databaseName")

// Create metastore tables in a specified database for your data.
// Once tables are created, the current database will be switched to the specified database.
tables.createExternalTables(rootDir, fileFormat, databaseName, overwrite = true, discoverPartitions = true)

// Or, if you want to create temporary tables
// tables.createTemporaryTables(location, fileFormat)

// For Cost-based optimizer (CBO) only, gather statistics on all columns:
tables.analyzeTables(databaseName, analyzeColumns = true)

It finished in 53 minutes.
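
To see which output path and database name the code above derives from the scale factor, the naming logic can be mirrored in plain Python. This is only a sketch for checking the names before running the job; `scale_name` and `derived_names` are hypothetical helper names, and the path prefix is the one used in my code above.

```python
def scale_name(scale_factor: int) -> str:
    """Mirror the Scala scaleName logic: zero-padded GB below 1000, TB otherwise."""
    if scale_factor < 1000:
        return f"{scale_factor:03d}GB"
    # Integer division, as in the Scala code (Int / Int).
    return f"{scale_factor // 1000:03d}TB"


def derived_names(scale_factor: int, file_format: str = "parquet") -> tuple[str, str]:
    """Return (rootDir, databaseName) as built in the notebook code."""
    name = scale_name(scale_factor)
    root_dir = f"/mnt/datalake/raw/tpc-ds/source_files_{name}_{file_format}"
    database_name = f"tpcds{name}"
    return root_dir, database_name


if __name__ == "__main__":
    print(derived_names(1000))
    # ('/mnt/datalake/raw/tpc-ds/source_files_001TB_parquet', 'tpcds001TB')
```

Because of the integer division, a scale factor such as 1500 also maps to `001TB`, so two different TB-scale runs can end up with the same directory and database name.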

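For some reason the following statement appears twice in the GitHub code, so I deleted one of them.
If you do not delete it, an error occurs.
I also added IF NOT EXISTS.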

// Create the specified database
sql(s"create database IF NOT EXISTS $databaseName")

2-4. Display the customer table to reconfirm that the run succeeded

2-5. Also check the contents of ADLS Gen2

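If you want no partitioning and want to adjust the size of each Parquet file

The steps so far cover the TPC-DS environment setup and dataset generation end to end,
but when I went on to use these datasets, I needed them without partitioning and with an adjusted size per Parquet file.

This is optional, so I have written it up in the following article.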
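
As a rough illustration of the sizing arithmetic only (not necessarily the method used in the linked article), the number of output files for a target file size is the table size divided by the target size, rounded up. The sketch below assumes hypothetical sizes; `files_for_target_size` is a name I made up for illustration. A count like this is what one would typically pass to something like Spark's `DataFrame.repartition` before writing, but actual Parquet file sizes also depend on compression and data distribution.

```python
def files_for_target_size(total_bytes: int, target_file_bytes: int) -> int:
    """Return how many files are needed so each is at most roughly target_file_bytes."""
    if total_bytes < 0 or target_file_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and target_file_bytes must be > 0")
    # Integer ceiling division; always write at least one file.
    return max(1, (total_bytes + target_file_bytes - 1) // target_file_bytes)


if __name__ == "__main__":
    GIB = 1024 ** 3
    MIB = 1024 ** 2
    # Hypothetical example: a 100 GiB table split into files of about 128 MiB.
    print(files_for_target_size(100 * GIB, 128 * MIB))  # 800
```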

Other reference links
