High Performance Spark. Best Practices for Scaling and Optimizing Apache Spark. 2nd Edition Holden Karau, Adi Polak, Rachel Warren


Authors: Holden Karau, Adi Polak, Rachel Warren
Publisher: O'Reilly Media
Pages: 412
Available formats: ePub, Mobi


Apache Spark is amazing when everything clicks. But if you haven't seen the performance improvements you expected or still don't feel confident enough to use Spark in production, this practical book is for you. Authors Holden Karau, Adi Polak, and Rachel Warren walk you through the secrets of the Spark code base and demonstrate performance optimizations that will help your data pipelines run faster, scale to larger datasets, and avoid costly antipatterns.

Ideal for data engineers, software engineers, data scientists, and system administrators, the second edition of High Performance Spark presents new use cases, code examples, and best practices for Spark 4.x and beyond. This book gives you a fresh perspective on this continually evolving framework and shows you how to work around bumps on your Spark and PySpark journey.

With this book, you'll learn how to:

- Accelerate your ML workflows with integrations including PyTorch
- Handle key skew and take advantage of Spark's new dynamic partitioning
- Make your code reliable with scalable testing and validation techniques
- Make Spark high performance
- Deploy Spark on Kubernetes and similar environments
- Take advantage of GPU acceleration with RAPIDS and resource profiles
- Get your Spark jobs to run faster
- Use Spark to productionize exploratory data science projects
- Handle even larger datasets with Spark
- Gain faster insights by reducing pipeline running times


About the authors

Holden Karau is a software development engineer and is active in open source. She has worked on a variety of search, classification, and distributed systems problems at IBM, Alpine, Databricks, Google, Foursquare, and Amazon. She graduated from the University of Waterloo with a bachelor of mathematics degree in computer science. Other than software, she enjoys playing with fire and hula hoops, and welding.

Adi Polak is an experienced engineer, vice president of developer experience at Treeverse, and a member of many expert groups. She helps organize conferences such as Data + AI Summit by Databricks, Current by Confluent, and Scale by the Bay. She gained her machine learning experience conducting research for many Fortune 500 companies.

Other books by Holden Karau, Adi Polak, and Rachel Warren

- Spark. Rozproszone uczenie maszynowe na dużą skalę. Jak korzystać z MLlib, TensorFlow i PyTorch (Adi Polak)
- Scaling Python with Dask (Holden Karau, Mika Kimmins)
- Scaling Machine Learning with Spark (Adi Polak)
- Scaling Python with Ray (Holden Karau, Boris Lublinsky)
- Kubeflow for Machine Learning (Trevor Grant, Holden Karau, Boris Lublinsky)
- Fast Data Processing with Spark 2. Accelerate your data for rapid insight - Third Edition (Krishna Sankar, Holden Karau)


Book details

Ebook ISBN: 978-1-098-14581-1 (9781098145811)
Ebook release date: 2026-05-29 (the ebook release date is often the day the title went on sale and may differ from the print publication date)
Language: English
ePub file size: 6 MB
Mobi file size: 14.9 MB


Table of contents

Preface
- Second Edition Notes
- Supporting Books and Materials
- Conventions Used in This Book
- Using Code Examples
- O'Reilly Online Learning
- How to Contact Us
- General Acknowledgments
  - Second Edition
  - First Edition
- Personal Acknowledgments
  - Holden Karau
  - Adi Polak
1. Introduction to High Performance Spark
- What Is Spark and Why Performance Matters
- What You Can Expect to Get from This Book
- Spark Versions
- Why the Focus on Scala and Python?
  - To Be a Spark Expert You Have to Be Able to Read a Little Scala Anyway
  - The Spark Scala and Python APIs Are Easier to Use Than the Java APIs
  - Why Not Scala?
  - Learning Scala
- Conclusion
2. How Spark Works
- How Spark Fits into the Big Data Ecosystem
  - Cluster Managers
  - Catalog System
  - Multitool Pipelines
  - Spark Components
- Spark Model of Parallel Computing: RDDs
  - Lazy Evaluation
    - Performance and usability advantages of lazy evaluation
    - Lazy evaluation and fault tolerance
    - Lazy evaluation and debugging
  - In-Memory Persistence and Memory Management
  - Immutability and the RDD Interface
  - Types of RDDs
  - Functions on RDDs: Transformations Versus Actions
  - Wide Versus Narrow Dependencies
- Spark Job Scheduling
  - Resource Allocation
  - The Spark Application
  - Default Spark Scheduler and Concurrent Actions
- The Anatomy of a Spark Job
  - The DAG
  - Jobs
  - Stages
  - Tasks
- Spark Connect
- Conclusion
3. Upgrading Spark
- Finding What You Need to Change
  - Compile-Time Changes
  - Exceptions at Runtime
  - Differing Results
- Updating Your Code
  - Scala
  - Python
  - SQL
- Verifying Correctness and Performance
- Places Where Performance Can Get Worse
- Conclusion
4. What's New in Spark 4.2 Since 2.4
- The Bad News: GraphX
- The Good News
  - No Code Changes Required
  - Partition Pruning and Predicate Pushdown
  - Joins
  - Debugging
- Additional Upgrades Required (Not Code)
- Code and Logic Changes Needed
  - General
  - Python Specific
  - Scala Specific
- Prioritizing Upgrades
- Totally New Features
  - Spark Declarative Pipelines
  - SQL UDFs/UDTFs
- Looking Ahead to the Future
- Conclusion
5. DataFrames, Datasets, and Spark SQL
- Getting Started with the SparkSession (or HiveContext or SQLContext)
- Spark SQL Dependencies
- Basics of Schemas
  - Column Metadata
- DataFrame API
  - Transformations
    - Simple DataFrame transformations and SQL expressions
    - Specialized DataFrame transformations for missing and noisy data
    - Beyond row-by-row transformations
    - Aggregates, groupBy, and coGroup
    - Windowing
    - Sorting
  - Multi-DataFrame Transformations
  - Plain Old SQL Queries and Interacting with Metastore Tables (Hive, Iceberg, etc.)
- Data Representation in DataFrames and Datasets
  - Tungsten
- Datasets
  - Interoperability with RDDs, DataFrames, and Local Collections
  - Compile-Time Strong Typing
  - Easier Functional (RDD "like") Transformations
  - Relational Transformations
  - Multi-Dataset Relational Transformations
  - Grouped Operations on Datasets
- Extending with User-Defined Functions (UDFs), Aggregate Functions (UDAFs), and Expressions
  - Custom Expressions
  - Custom Encoders
  - User-Defined Table Functions
- Data Loading and Saving Functions
  - DataFrameWriter and DataFrameReader
  - Formats
    - JSON
    - JDBC
    - Parquet
    - Metastore tables
    - RDDs
    - Local collections
    - Additional formats and tables
    - Going beyond existing options
  - Save Modes
  - Merge
  - Partitions (Discovery and Writing)
- Query Optimizer
  - Logical and Physical Plans
  - Static Optimizer Rules You Should Understand (and Working Around Them)
  - Understanding Spark Adaptive Query Execution
  - Code Generation
  - Large Query Plans and Iterative Algorithms
- SparkSession Extensions and Custom Optimizers
- Debugging Spark SQL Queries
- JDBC/ODBC Server
- Conclusion
6. Joins (SQL and Core)
- Core Spark Joins
  - Choosing a Join Type
  - Choosing an Execution Plan
    - Speeding up joins by assigning a known partitioner
    - Speeding up joins using a broadcast hash join
    - Partial manual broadcast hash join
- Spark SQL Joins
  - DataFrame Joins
    - Self joins
    - Cross/cartesian joins
    - Complex join conditions
  - Concrete SQL Join Execution Operators
  - Dataset Joins
- Conclusion
7. Effective Transformations
- Narrow Versus Wide Transformations
  - Implications for Performance
  - Implications for Fault Tolerance
  - The Special Case of coalesce
- What Type of RDD Does Your Transformation Return?
- Minimizing Object Creation
  - Reusing Existing Objects
  - Using Smaller Data Structures
- Iterator-to-Iterator Transformations with mapPartitions
  - What Is an Iterator-to-Iterator Transformation?
  - Space and Time Advantages
  - An Example
- Set Operations
- Reducing Setup Overhead
  - Shared Variables
  - Broadcast Variables
  - Accumulators
- Reusing RDDs
  - Cases for Reuse
    - Iterative computations
    - Multiple actions on the same RDD
    - If the cost to compute each partition is very high
  - Deciding if Recompute Is Inexpensive Enough
  - Types of Reuse: Cache, Persist, Checkpoint, Shuffle Files
    - Persist and cache
    - Checkpointing
    - Checkpointing example
  - Alluxio (nee Tachyon)
  - LRU Caching
  - Shuffle Files
    - Out-of-disk-space error
  - Noisy Cluster Considerations
  - Interaction with Accumulators
- Conclusion
8. Working with Key/Value Data
- The Goldilocks Example
  - Goldilocks Version 0: Iterative Solution
  - How to Use PairRDDFunctions and OrderedRDDFunctions
- Actions on Key/Value Pairs
- What's So Dangerous About the groupByKey Function?
  - Goldilocks Version 1: groupByKey Solution
    - Why groupByKey fails
- Choosing an Aggregation Operation
  - Dictionary of Aggregation Operations with Performance Considerations
  - Preventing Out-Of-Memory Errors with Aggregation Operations
- Multiple RDD Operations
  - Co-Grouping
- Partitioners and Key/Value Data
  - Using the Spark Partitioner Object
  - Hash Partitioning
  - Range Partitioning
  - Custom Partitioning
  - Preserving Partitioning Information Across Transformations
    - Using narrow transformations that preserve partitioning
  - Leveraging Co-Located and Co-Partitioned RDDs
  - Dictionary of Mapping and Partitioning Functions PairRDDFunctions
- Dictionary of OrderedRDDOperations
  - Sorting by Two Keys with sortByKey
- Secondary Sort and repartitionAndSortWithinPartitions
  - Leveraging repartitionAndSortWithinPartitions for a Group by Key and Sort Values Function
  - How Not to Sort by Two Orderings
  - Goldilocks Version 2: Secondary Sort
    - Defining the custom partitioner
    - Filtering on each partition
    - Combine the elements associated with one key
    - Performance
  - A Different Approach to Goldilocks
    - Map to (cell value, column index) pairs
    - Sort and count values on each partition
    - Determine location of rank statistics on each partition
    - Filter for rank statistics
  - Goldilocks Version 3: Sort on Cell Values
- Straggler Detection and Unbalanced Data
  - Back to Goldilocks (Again)
  - Goldilocks Version 4: Reduce to Distinct on Each Partition
    - Aggregate to ((cell value, column index), count) on each partition
    - Sort and find rank statistics
    - Goldilocks postmortem
- Conclusion
9. Going Beyond Scala
- Beyond Scala Within the JVM
  - Custom Code Beyond Scala and Beyond the JVM
    - How PySpark works
    - PySpark RDDs
    - PySpark DataFrames and Datasets
    - Pandas API on Spark
    - User-defined functions in PySpark with and without Arrow
    - Accessing the backing Java objects and mixing Scala code
    - PySpark dependency management
    - Installing PySpark
  - How SparklyR Works
  - Spark on the Common Language Runtime (CLR): C# and Friends
- Calling Other Languages from Spark
  - Using Pipe and Friends
  - Java Native Interface
  - Java Native Access (JNA)
  - Project Panama
  - Underneath Everything Is Fortran
  - Getting to the GPU
- Going Beyond the JVM with Spark Accelerators
  - Databricks Photon
  - Apache Arrow DataFusion Comet
  - Project Gluten
    - Getting started
    - Velox + Gluten
    - ClickHouse + Gluten
    - Gluten's future
  - Spark RAPIDS
    - Setting up and configuring Spark RAPIDS
    - UDFs
    - Going beyond RAPIDS on the GPU
    - Project health
  - Application-Specific Integrated Circuits (ASICs)
- Managing Memory Outside of the JVM
- The Future (from ~2026)
- Conclusion
10. Spark: A Thoughtful Step into Generative AI
- So, What Is This Generative AI Thing?
  - How Generative AI Works: Training the Creator
  - A World of Applications: Generative AI at Work
  - How Is GenAI Related to Deep Learning?
- Can You Spark It?
  - Effortless TensorFlow Scaling with Spark's TensorFlowDistributor
    - The great connector
    - The data alchemist
    - The GPU commander
    - Your time to innovate
    - TensorFlowDistributor: Show me how!
  - Scaling PyTorch with Spark's TorchDistributor
  - Supercharging Distributed Training with DeepSpeed
    - Why DeepSpeed matters
    - Key features of DeepSpeedTorchDistributor
    - Using DeepspeedTorchDistributor
    - Benefits in practice
  - Wait, Your Processor Matters!
    - Distributed training with RAPIDS and TorchDistributor
    - Distributed inference with RAPIDS: Making predictions at scale
    - Why RAPIDS matters for Spark deep learning
- Inference in the Real World: Beyond Batch Prediction
  - Batch Prediction: Scaling Predictions with Spark
  - Model as a Service: Real-Time Predictions at Scale
  - Model in a Service: Embedding Inference into Workflows
  - A World of Inference Possibilities
  - Evaluation of Models
- Case Studies
  - Fight Health Insurance
  - Uber's Enhanced Agentic-RAG (EAg-RAG): When Chatbots Deliver Near-Human Precision
    - 1. Enriched document processing (offline)
    - 2. Agentic answer generation (near real time)
    - 3. Faster iteration via automated evaluation (LLM-as-a-Judge)
      - Why LLM-as-a-Judge works well for RAG systems
      - How does Spark fit in?
    - Optimizing recommendations
- The Model Dilemma: Hosting or Leveraging a Third-Party REST API
  - Layer 1: Use Case and Domain Alignment
  - Layer 2: Performance Requirements
  - Layer 3: Scalability Needs
  - Layer 4: Cost Considerations
  - Layer 5: Data Sensitivity and Security
  - Layer 6: Integration and Ecosystem Fit
  - Layer 7: Time-to-Market
  - The Decision
  - Alex's Reflection Moment
  - The Story of an Evolving Industry
- Conclusion: The Future of Generative AI with Spark
  - Spark's Role in the Generative AI Revolution
  - Generative AI and the Creative Industries
  - The Path Forward
  - Your Next Step
11. Testing, Validation, and Side-By-Side Runs
- Unit Testing
  - Factoring Your Code for Testability
  - Mocking RDDs
  - Core Spark Jobs (Testing with RDDs)
  - Testing Streaming
    - DStream streaming
    - Structured Streaming
  - Testing DataFrames
- Testing Codegen
- Getting Test Data
  - Generating Large Datasets
  - Sampling
- Property Checking with ScalaCheck
  - Computing RDD Difference
- Integration Testing
  - Choosing Your Integration Testing Environment
    - Production or staging clusters
    - Local mode
    - Kubernetes and Docker-based
- Verifying Performance
  - Spark Counters for Verifying Performance
  - Projects for Verifying Performance
- Validation (or Audits)
  - Data Validation
    - Input tables
    - Output tables and WAP
  - Accumulators and Built-In Counters
- Side-by-Side Runs
- Table-Level Assertions
- Conclusion
12. Spark Components and Packages
- GraphX
- Using Community Packages and Libraries
- Popular Packages
- Creating a Spark Package
- Building Your Own Spark or Using Packages
- Conclusion
A. Just Enough Iceberg and Friends
- Updates and Table Maintenance
- Partitioning, Indices, and Statistics
- Query Planning
- Conclusion
B. Spark Connect
- Supported API
- Client Languages
- Different Backends
- Conclusion
C. When Not to Use Spark
- Small Data and Small Tasks
- Real-Time Updates and Online Transaction Processing
- User-Facing Interactive Use Cases
- Long and Highly Variable Task Processing Time
- Untrusted and Multitenant Queries
- Your Job Has Side Effects per Record
- Huge Individual Records/Rows
- No Ops Support and No Budget for a Platform
- Highly Unreliable Backends
- Different Functions Versus Same Function on Different Data
- Frequently Accumulated Updates
- Global In-Order Processing Required
- When You Can, Now, Use Spark
- Conclusion
D. Advanced Task Scheduling: Gangs and Resource Profiles
- Gang Scheduling/Barrier Execution Mode
- Resource Profiles
- Conclusion
E. Spark Streaming
- Simplest Example
- Driver Considerations
- Operation Modes
  - Micro-Batching
  - Real-Time Mode
  - Async Progress Tracking
- Output Modes
- Time, Aggregations, Joins, and State
  - Aggregations
    - Windows
    - Watermarking
    - User/arbitrary state
  - Stream-Stream Joins
    - State store
    - Using foreachBatch: Dropping into batch semantics inside Streaming
- Reliability: YOLO-ish, At Least Once, At Most Once, and Exactly Once
  - Checkpointing
  - Replayable and Nonreplayable Receivers
  - Idempotent and Non-Idempotent Sinks
- True Real-Time and Reasonable Expectations
- Conclusion
F. The Spark Web UI: Debugging and Optimizing Your Jobs
- A Real-World Performance Problem
  - Accessing the Spark Web UI
  - Understanding the Spark Execution Hierarchy
    - Application
    - Jobs
    - Stages
    - Tasks
  - The Jobs Tab: Our First Look at the Problem
    - What we see
    - The Event Timeline
  - The Stages Tab: Finding the Bottleneck
    - Stage summary view
    - Stage detail view
      - DAG visualization
      - Event Timeline
      - Summary Metrics
      - Aggregated metrics by executor
  - The SQL Tab: Understanding Query Plans
    - Query details
    - Key Physical Plan operators
  - The Executors Tab: Resource Utilization
  - The Problem with the Native UI
  - Enter the DataFlint Tab
    - Installation
    - Identifying the problem immediately
    - The DataFlint interface
    - Visual query plans
  - The Fix: Actionable Recommendations
    - Why this works
    - The result
  - Other DataFlint Performance Alerts
  - The Spark History Server
Index