High Performance Spark. Best Practices for Scaling and Optimizing Apache Spark Holden Karau, Rachel Warren

Authors: Holden Karau, Rachel Warren
Publisher: O'Reilly Media
Pages: 358
Available formats: ePub, Mobi

Apache Spark is amazing when everything clicks. But if you haven’t seen the performance improvements you expected, or still don’t feel confident enough to use Spark in production, this practical book is for you. Authors Holden Karau and Rachel Warren demonstrate performance optimizations to help your Spark queries run faster and handle larger data sizes, while using fewer resources.

Ideal for software engineers, data engineers, developers, and system administrators working with large-scale data applications, this book describes techniques that can reduce data infrastructure costs and developer hours. Not only will you gain a more comprehensive understanding of Spark, you’ll also learn how to make it sing.

With this book, you’ll explore:

- How Spark SQL’s new interfaces improve performance over Spark’s RDD data structure
- The choice between data joins in Core Spark and Spark SQL
- Techniques for getting the most out of standard RDD transformations
- How to work around performance issues in Spark’s key/value pair paradigm
- Writing high-performance Spark code without Scala or the JVM
- How to test for functionality and performance when applying suggested improvements
- Using Spark MLlib and Spark ML machine learning libraries
- Spark’s Streaming components and external community packages

About the author

Holden Karau is a software development engineer who is active in open source. She has worked on a variety of search, classification, and distributed systems problems at IBM, Alpine, Databricks, Google, Foursquare, and Amazon. She graduated from the University of Waterloo with a Bachelor of Mathematics degree in computer science. Outside of software, she enjoys playing with fire and hula hoops, and welding.

Other books by Holden Karau and Rachel Warren

- Scaling Python with Dask (Holden Karau, Mika Kimmins)
- Scaling Python with Ray (Holden Karau, Boris Lublinsky)
- Kubeflow for Machine Learning (Trevor Grant, Holden Karau, Boris Lublinsky)
- Fast Data Processing with Spark 2. Accelerate your data for rapid insight - Third Edition (Krishna Sankar, Holden Karau)

Book details

Ebook ISBN: 978-14-919-4315-1, 9781491943151
Ebook release date: 2017-05-25 (this is often the date the title went on sale and may differ from the publication date of the print edition)
Publication language: English
ePub file size: 4.1 MB
Mobi file size: 10 MB

Table of contents

Preface
- First Edition Notes
- Supporting Books and Materials
- Conventions Used in This Book
- Using Code Examples
- O'Reilly Safari
- How to Contact the Authors
- How to Contact Us
- Acknowledgments
1. Introduction to High Performance Spark
- What Is Spark and Why Performance Matters
- What You Can Expect to Get from This Book
- Spark Versions
- Why Scala?
  - To Be a Spark Expert You Have to Learn a Little Scala Anyway
  - The Spark Scala API Is Easier to Use Than the Java API
  - Scala Is More Performant Than Python
  - Why Not Scala?
  - Learning Scala
- Conclusion
2. How Spark Works
- How Spark Fits into the Big Data Ecosystem
  - Spark Components
- Spark Model of Parallel Computing: RDDs
  - Lazy Evaluation
    - Performance and usability advantages of lazy evaluation
    - Lazy evaluation and fault tolerance
    - Lazy evaluation and debugging
  - In-Memory Persistence and Memory Management
  - Immutability and the RDD Interface
  - Types of RDDs
  - Functions on RDDs: Transformations Versus Actions
  - Wide Versus Narrow Dependencies
- Spark Job Scheduling
  - Resource Allocation Across Applications
  - The Spark Application
    - Default Spark Scheduler
- The Anatomy of a Spark Job
  - The DAG
  - Jobs
  - Stages
  - Tasks
- Conclusion
3. DataFrames, Datasets, and Spark SQL
- Getting Started with the SparkSession (or HiveContext or SQLContext)
- Spark SQL Dependencies
  - Managing Spark Dependencies
  - Avoiding Hive JARs
- Basics of Schemas
- DataFrame API
  - Transformations
    - Simple DataFrame transformations and SQL expressions
    - Specialized DataFrame transformations for missing and noisy data
    - Beyond row-by-row transformations
    - Aggregates and groupBy
    - Windowing
    - Sorting
  - Multi-DataFrame Transformations
    - Set-like operations
  - Plain Old SQL Queries and Interacting with Hive Data
- Data Representation in DataFrames and Datasets
  - Tungsten
- Data Loading and Saving Functions
  - DataFrameWriter and DataFrameReader
  - Formats
    - JSON
    - JDBC
    - Parquet
    - Hive tables
    - RDDs
    - Local collections
    - Additional formats
  - Save Modes
  - Partitions (Discovery and Writing)
- Datasets
  - Interoperability with RDDs, DataFrames, and Local Collections
  - Compile-Time Strong Typing
  - Easier Functional (RDD like) Transformations
  - Relational Transformations
  - Multi-Dataset Relational Transformations
  - Grouped Operations on Datasets
- Extending with User-Defined Functions and Aggregate Functions (UDFs, UDAFs)
- Query Optimizer
  - Logical and Physical Plans
  - Code Generation
  - Large Query Plans and Iterative Algorithms
- Debugging Spark SQL Queries
- JDBC/ODBC Server
- Conclusion
4. Joins (SQL and Core)
- Core Spark Joins
  - Choosing a Join Type
  - Choosing an Execution Plan
    - Speeding up joins by assigning a known partitioner
    - Speeding up joins using a broadcast hash join
    - Partial manual broadcast hash join
- Spark SQL Joins
  - DataFrame Joins
    - Self joins
    - Broadcast hash joins
  - Dataset Joins
- Conclusion
5. Effective Transformations
- Narrow Versus Wide Transformations
  - Implications for Performance
  - Implications for Fault Tolerance
  - The Special Case of coalesce
- What Type of RDD Does Your Transformation Return?
- Minimizing Object Creation
  - Reusing Existing Objects
  - Using Smaller Data Structures
- Iterator-to-Iterator Transformations with mapPartitions
  - What Is an Iterator-to-Iterator Transformation?
  - Space and Time Advantages
  - An Example
- Set Operations
- Reducing Setup Overhead
  - Shared Variables
  - Broadcast Variables
  - Accumulators
- Reusing RDDs
  - Cases for Reuse
    - Iterative computations
    - Multiple actions on the same RDD
    - If the cost to compute each partition is very high
  - Deciding if Recompute Is Inexpensive Enough
  - Types of Reuse: Cache, Persist, Checkpoint, Shuffle Files
    - Persist and cache
    - Checkpointing
    - Checkpointing example
  - Alluxio (née Tachyon)
  - LRU Caching
    - Shuffle files
  - Noisy Cluster Considerations
  - Interaction with Accumulators
- Conclusion
6. Working with Key/Value Data
- The Goldilocks Example
  - Goldilocks Version 0: Iterative Solution
  - How to Use PairRDDFunctions and OrderedRDDFunctions
- Actions on Key/Value Pairs
- What's So Dangerous About the groupByKey Function
  - Goldilocks Version 1: groupByKey Solution
    - Why GroupByKey fails
- Choosing an Aggregation Operation
  - Dictionary of Aggregation Operations with Performance Considerations
    - Preventing out-of-memory errors with aggregation operations
- Multiple RDD Operations
  - Co-Grouping
- Partitioners and Key/Value Data
  - Using the Spark Partitioner Object
  - Hash Partitioning
  - Range Partitioning
  - Custom Partitioning
  - Preserving Partitioning Information Across Transformations
    - Using narrow transformations that preserve partitioning
  - Leveraging Co-Located and Co-Partitioned RDDs
  - Dictionary of Mapping and Partitioning Functions PairRDDFunctions
- Dictionary of OrderedRDDOperations
  - Sorting by Two Keys with SortByKey
- Secondary Sort and repartitionAndSortWithinPartitions
  - Leveraging repartitionAndSortWithinPartitions for a Group by Key and Sort Values Function
  - How Not to Sort by Two Orderings
  - Goldilocks Version 2: Secondary Sort
    - Defining the custom partitioner
    - Filtering on each partition
    - Combine the elements associated with one key
    - Performance
  - A Different Approach to Goldilocks
    - Map to (cell value, column index) pairs
    - Sort and count values on each partition
    - Determine location of rank statistics on each partition
    - Filter for rank statistics
  - Goldilocks Version 3: Sort on Cell Values
- Straggler Detection and Unbalanced Data
  - Back to Goldilocks (Again)
  - Goldilocks Version 4: Reduce to Distinct on Each Partition
    - Aggregate to ((cell value, column index), count) on each partition
    - Sort and find rank statistics
    - Goldilocks postmortem
- Conclusion
7. Going Beyond Scala
- Beyond Scala within the JVM
- Beyond Scala, and Beyond the JVM
  - How PySpark Works
    - PySpark RDDs
    - PySpark DataFrames and Datasets
    - Accessing the backing Java objects and mixing Scala code
    - PySpark dependency management
    - Installing PySpark
  - How SparkR Works
  - Spark.jl (Julia Spark)
  - How Eclair JS Works
  - Spark on the Common Language Runtime (CLR): C# and Friends
- Calling Other Languages from Spark
  - Using Pipe and Friends
  - JNI
  - Java Native Access (JNA)
  - Underneath Everything Is FORTRAN
  - Getting to the GPU
- The Future
- Conclusion
8. Testing and Validation
- Unit Testing
  - General Spark Unit Testing
    - Factoring your code for testability
    - Regular Spark jobs (testing with RDDs)
    - Streaming
  - Mocking RDDs
    - Testing DataFrames
- Getting Test Data
  - Generating Large Datasets
  - Sampling
- Property Checking with ScalaCheck
  - Computing RDD Difference
- Integration Testing
  - Choosing Your Integration Testing Environment
    - Local mode
    - Docker-based
    - Yarn MiniCluster
- Verifying Performance
  - Spark Counters for Verifying Performance
  - Projects for Verifying Performance
- Job Validation
- Conclusion
9. Spark MLlib and ML
- Choosing Between Spark MLlib and Spark ML
- Working with MLlib
  - Getting Started with MLlib (Organization and Imports)
  - MLlib Feature Encoding and Data Preparation
    - Working with Spark vectors
    - Preparing textual data
    - Preparing data for supervised learning
  - Feature Scaling and Selection
  - MLlib Model Training
  - Predicting
  - Serving and Persistence
    - Saveable (internal format)
    - PMML
    - Custom
  - Model Evaluation
- Working with Spark ML
  - Spark ML Organization and Imports
  - Pipeline Stages
  - Explain Params
  - Data Encoding
  - Data Cleaning
  - Spark ML Models
  - Putting It All Together in a Pipeline
  - Training a Pipeline
  - Accessing Individual Stages
  - Data Persistence and Spark ML
    - Automated model selection (parameter search)
  - Extending Spark ML Pipelines with Your Own Algorithms
    - Custom transformers
    - Custom estimators
  - Model and Pipeline Persistence and Serving with Spark ML
- General Serving Considerations
- Conclusion
10. Spark Components and Packages
- Stream Processing with Spark
  - Sources and Sinks
    - Receivers
    - Repartitioning
  - Batch Intervals
  - Data Checkpoint Intervals
  - Considerations for DStreams
    - Output operations
  - Considerations for Structured Streaming
    - Data sources
    - Output operations
    - Custom sinks
    - Machine learning with Structured Streaming
    - Stream status and debugging
  - High Availability Mode (or Handling Driver Failure or Checkpointing)
- GraphX
- Using Community Packages and Libraries
  - Creating a Spark Package
- Conclusion
A. Tuning, Debugging, and Other Things Developers Like to Pretend Don't Exist
- Spark Tuning and Cluster Sizing
  - How to Adjust Spark Settings
  - How to Determine the Relevant Information About Your Cluster
- Basic Spark Core Settings: How Many Resources to Allocate to the Spark Application?
  - Calculating Executor and Driver Memory Overhead
  - How Large to Make the Spark Driver
  - A Few Large Executors or Many Small Executors?
    - Many small executors
    - Many large executors
  - Allocating Cluster Resources and Dynamic Allocation
    - Restrictions on dynamic allocation
  - Dividing the Space Within One Executor
  - Number and Size of Partitions
- Serialization Options
  - Kryo
    - Spark settings conclusion
- Some Additional Debugging Techniques
  - Out of Disk Space Errors
  - Logging
  - Configuring logging
  - Accessing logs
  - Attaching debuggers
  - Debugging in notebooks
  - Python debugging
  - Debugging conclusion
Index