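We are publishing Apache Spark徹底入門 with Shoeisha on April 12, 2024!
This post walks through the book's sample notebooks. The one covered here is Python/Chapter02/2-1 Line Count.
The repository of translated notebooks is here.
The notebook is here.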
spark.version
'3.5.0'
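# Read the README as a DataFrame with a single "value" column, one row per line
strings = spark.read.text("/databricks-datasets/learning-spark-v2/SPARK_README.md")
# Show the first 10 rows without shortening long strings
strings.show(10, truncate=False)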
+------------------------------------------------------------------------------+
|value |
+------------------------------------------------------------------------------+
|# Apache Spark |
| |
|Spark is a fast and general cluster computing system for Big Data. It provides|
|high-level APIs in Scala, Java, Python, and R, and an optimized engine that |
|supports general computation graphs for data analysis. It also supports a |
|rich set of higher-level tools including Spark SQL for SQL and DataFrames, |
|MLlib for machine learning, GraphX for graph processing, |
|and Spark Streaming for stream processing. |
| |
|<http://spark.apache.org/> |
+------------------------------------------------------------------------------+
only showing top 10 rows
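The call above passes `truncate=False` because `show` otherwise shortens strings longer than 20 characters. The following plain-Python sketch roughly mimics that default. `truncate_cell` is a hypothetical helper written for this illustration, not part of PySpark, and it only approximates how Spark formats a cell.

```python
def truncate_cell(value: str, width: int = 20) -> str:
    """Shorten a cell roughly the way show() does with its default settings."""
    # A non-positive width or a short value leaves the text untouched
    if width <= 0 or len(value) <= width:
        return value
    # Very small widths have no room for an ellipsis
    if width < 4:
        return value[:width]
    # Otherwise keep width - 3 characters and append "..."
    return value[: width - 3] + "..."


line = "Spark is a fast and general cluster computing system for Big Data. It provides"
print(truncate_cell(line))              # Spark is a fast a...
print(truncate_cell("# Apache Spark"))  # # Apache Spark
```

With the default, most README lines would be cut to 20 characters, which is why the notebook turns truncation off.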
strings.count()
95
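# Keep only the rows whose text contains "Spark" (case-sensitive substring match)
filtered = strings.filter(strings.value.contains("Spark"))
# Count the matching rows
filtered.count()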
17
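For readers without a cluster at hand, the two counts above can be mirrored in plain Python. This is a minimal sketch, not the notebook's code. `count_lines` is a hypothetical helper, and the sample text is made up for illustration rather than taken from SPARK_README.md. It relies on `spark.read.text` producing one row per line and on `contains` being a case-sensitive substring match.

```python
from typing import Optional


def count_lines(text: str, keyword: Optional[str] = None) -> int:
    """Count the lines in text, optionally only those containing keyword."""
    lines = text.splitlines()
    if keyword is None:
        # Analogous to strings.count()
        return len(lines)
    # Analogous to strings.filter(strings.value.contains(keyword)).count()
    return sum(1 for line in lines if keyword in line)


# Made-up sample text (an assumption for this sketch, not the real README)
sample = "# Apache Spark\n\nSpark is a fast engine.\nIt supports SQL.\nsee spark docs\n"

total = count_lines(sample)
matching = count_lines(sample, "Spark")
print(total, matching)  # 5 2
```

The lowercase "spark" in the last sample line is not counted, which matches the case-sensitive behaviour of the filter in the notebook.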