Counting the lines of a text file with Spark

Posted at 2024-03-27. Last updated at 2024-03-27.

On April 12, 2024, Shoeisha will publish Apache Spark徹底入門!

I'm walking through the book's sample notebooks. This one is Python/Chapter02/2-1 Line Count.

The repository of the translated notebooks is here.

The notebook is here.

spark.version

'3.5.0'

strings = spark.read.text("/databricks-datasets/learning-spark-v2/SPARK_README.md")
strings.show(10, truncate=False)

+------------------------------------------------------------------------------+
|value                                                                         |
+------------------------------------------------------------------------------+
|# Apache Spark                                                                |
|                                                                              |
|Spark is a fast and general cluster computing system for Big Data. It provides|
|high-level APIs in Scala, Java, Python, and R, and an optimized engine that   |
|supports general computation graphs for data analysis. It also supports a     |
|rich set of higher-level tools including Spark SQL for SQL and DataFrames,    |
|MLlib for machine learning, GraphX for graph processing,                      |
|and Spark Streaming for stream processing.                                    |
|                                                                              |
|<http://spark.apache.org/>                                                    |
+------------------------------------------------------------------------------+
only showing top 10 rows

strings.count()

filtered = strings.filter(strings.value.contains("Spark"))
filtered.count()
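
To make it clear what these two counts measure, here is a rough plain-Python cross-check that does not need Spark at all. The `lines` list is a hypothetical sample (the fourth line is made up for illustration), so the numbers are not those of the real README. The point is that `count()` counts rows, and `contains("Spark")` is a case-sensitive substring match that counts matching rows, not occurrences.

```python
# Hypothetical sample rows; not the actual contents of SPARK_README.md.
lines = [
    "# Apache Spark",
    "",
    "Spark is a fast and general cluster computing system for Big Data.",
    "spark-submit and Spark SQL both appear on this line.",
    "<http://spark.apache.org/>",
]

# Counterpart of strings.count(): one per row, blank rows included.
total = len(lines)

# Counterpart of strings.filter(strings.value.contains("Spark")).count():
# a row counts once if it contains "Spark" anywhere, with exact case.
matching = sum(1 for line in lines if "Spark" in line)

print(total, matching)
```

The last row only has lowercase "spark" in the URL, so it does not match. If you need a case-insensitive count in Spark, you would have to normalize the column first, for example by lowercasing it before the comparison.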

Getting Started with Databricks

Databricks free trial
