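With an emphasis on the improvements and new features in Spark 2.0, authors Bill Chambers and Matei Zaharia break Spark topics into distinct sections, each with its own goals. You will explore the basic operations and common functions of Spark's structured APIs, as well as Structured Streaming, a new high-level API for building end-to-end streaming applications. Developers and system administrators will learn the fundamentals of monitoring, tuning, and debugging Spark, and explore machine learning techniques and deployment scenarios for MLlib, Spark's scalable machine learning library.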
Preface
Part I. Gentle Overview of Big Data and Spark
1. What Is Apache Spark
Apache Spark's Philosophy
Context: The Big Data Problem
History of Spark
The Present and Future of Spark
Running Spark
Downloading Spark Locally
Launching Spark's Interactive Consoles
Running Spark in the Cloud
Data Used in This Book
2. A Gentle Introduction to Spark
Spark's Basic Architecture
Spark Applications
Spark's Language APIs
Spark's APIs
Starting Spark
The SparkSession
DataFrames
Partitions
Transformations
Lazy Evaluation
Actions
Spark UI
An End-to-End Example
DataFrames and SQL
Conclusion
3. A Tour of Spark's Toolset
Running Production Applications
Datasets: Type-Safe Structured APIs
Structured Streaming
Machine Learning and Advanced Analytics
Lower-Level APIs
SparkR
Spark's Ecosystem and Packages
Conclusion
Part II. Structured APIs--DataFrames, SQL, and Datasets
4. Structured API Overview
DataFrames and Datasets
Schemas
Overview of Structured Spark Types
DataFrames Versus Datasets
Columns
Rows
Spark Types
Overview of Structured API Execution
Logical Planning
Physical Planning
Execution