0
0

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?

DataFusionを使ってJSONをSQLで処理

Posted at

DuckDBを使ってJSONをSQLで処理」と同じ事を Apache DataFusion で試してみました。

DataFusion にはユーザー定義関数(UDF)の機能がありますが、ここでは UDF を用いずに標準的な機能を使用しています。

はじめに

DuckDB で使ったものと同じ JSON Lines ファイルを使用します。

items.jsonl
{"id":1,"name":"item1","price":1100,"attrs":{"code":"0001","category":"A1"},"variants":[{"color":"white", "size":"S"},{"color":"black","size":"F"}]}
{"id":2,"name":"item2","price":2300,"attrs":{"code":"0002"},"variants":[{"color":"red"},{"color":"blue"},{"color":"white"}]}
{"id":3,"name":"item3","price":360,"attrs":{"code":"0003"}}
{"id":4,"name":"item4","price":4700,"attrs":{"code":"0004","category":"A1"},"variants":[{"color":"green", "size":"S"}]}
{"id":5,"name":"item5","price":590,"attrs":{"code":"0005","category":"B2"},"variants":[{"color":"red"},{"color":"white","size":"F"},{"color":"white","size":"L"}]}
{"id":6,"name":"item6","price":6200,"attrs":{"code":"0006","category":"C3"},"variants":[{"color":"blue","size":"S"},{"color":"green","size":"L"}]}

このファイルを items テーブルとして取り込んで SQL を適用する処理例がこちらです。

src/main.rs
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    // 拡張子 .jsonl を処理するための設定
    let opts = NdJsonReadOptions::default().file_extension(".jsonl");
    // items.jsonl ファイルの内容を items テーブルとして取り込む
    ctx.register_json("items", "./items.jsonl", opts).await?;

    run_query(
        &ctx,
        "SELECT id, attrs.code, attrs, variants FROM items WHERE id < 4",
    )
    .await?;

    ...省略

    Ok(())
}

async fn run_query(ctx: &SessionContext, query: &str) -> datafusion::error::Result<()> {
    let df = ctx.sql(query).await?;
    // 結果出力
    df.show().await?;

    Ok(())
}

デフォルトでは JSON ファイルの拡張子が .json で無ければならないようなので、file_extension(".jsonl") で .jsonl を読み込めるようにしています。

また、Cargo.toml の dependencies はこのようになっています。

Cargo.toml
[dependencies]
datafusion = "46.0.1"
tokio = { version = "1.0", features = ["rt-multi-thread"] }

SQL処理

概ね DuckDB と同じような事が可能です。

(a) attrs, variants の出力

id が 4未満で次の値を抽出する SQL はこのようになります。

  • id
  • attrs の code
  • attrs
  • variants
run_query(
    &ctx,
    "SELECT id, attrs.code, attrs, variants FROM items WHERE id < 4",
)
.await?;
(a) 実行結果
+----+-------------------+----------------------------+-----------------------------------------------------------------------+
| id | items.attrs[code] | attrs                      | variants                                                              |
+----+-------------------+----------------------------+-----------------------------------------------------------------------+
| 1  | 0001              | {category: A1, code: 0001} | [{color: white, size: S}, {color: black, size: F}]                    |
| 2  | 0002              | {category: , code: 0002}   | [{color: red, size: }, {color: blue, size: }, {color: white, size: }] |
| 3  | 0003              | {category: , code: 0003}   |                                                                       |
+----+-------------------+----------------------------+-----------------------------------------------------------------------+

(b) attrs.category の条件指定

attrs の category が 'A1' のものだけを抽出する SQL です。

run_query(
    &ctx,
    "SELECT id, name, attrs.category FROM items WHERE attrs.category = 'A1'",
)
.await?;
(b) 実行結果
+----+-------+-----------------------+
| id | name  | items.attrs[category] |
+----+-------+-----------------------+
| 1  | item1 | A1                    |
| 4  | item4 | A1                    |
+----+-------+-----------------------+

(c) variants の条件指定

variants に color が 'white' を含むものを抽出します。

DataFusion でも unnest 関数が使えるので、サブクエリで DISTINCT する方法は可能です。

run_query(
    &ctx,
    "
    SELECT DISTINCT id, name
    FROM
        (SELECT id, name, unnest(variants) AS v FROM items)
    WHERE
        v.color = 'white'
    ",
)
.await?;

WITH も利用可能です。

run_query(
    &ctx,
    "
    WITH x AS (SELECT id, name, unnest(variants) AS v FROM items)
    SELECT DISTINCT id, name FROM x WHERE v.color = 'white' 
    ",
)
.await?;

また、次のように unnest(variants) から color の値だけを抽出する事も可能です。

run_query(
    &ctx,
    "
    SELECT DISTINCT id, name
    FROM
        (SELECT id, name, unnest(variants).color AS color FROM items)
    WHERE
        color = 'white'
    ",
)
.await?;
run_query(
    &ctx,
    "
    SELECT DISTINCT id, name
    FROM
        (SELECT id, name, unnest(variants)['color'] AS color FROM items)
    WHERE
        color = 'white'
    ",
)
.await?;
run_query(
    &ctx,
    "
    SELECT DISTINCT id, name
    FROM
        (SELECT id, name, get_field(unnest(variants), 'color') AS color FROM items)
    WHERE
        color = 'white'
    ",
)
.await?;
(c) 実行結果
+----+-------+
| id | name  |
+----+-------+
| 1  | item1 |
| 2  | item2 |
| 5  | item5 |
+----+-------+

なお、結果の順序は変動します。

0
0
0

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
0
0

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?