Data is bigger, arrives faster, and comes in a variety of formats, and it all needs to be processed at scale for analytics or machine learning. But how can you process such varied workloads efficiently? Enter Apache Spark.

Updated to include Spark 3.0, this second edition shows data engineers and data scientists why structure and unification in Spark matters. Specifically, this book explains how to perform simple and complex data analytics and employ machine learning algorithms. Through step-by-step walk-throughs, code snippets, and notebooks, you'll be able to:

- Learn Python, SQL, Scala, or Java high-level Structured APIs
- Understand Spark operations and the SQL engine
- Inspect, tune, and debug Spark operations with Spark configurations and the Spark UI
- Connect to data sources: JSON, Parquet, CSV, Avro, ORC, Hive, S3, or Kafka
- Perform analytics on batch and streaming data using Structured Streaming
- Build reliable data pipelines with open source Delta Lake and Spark
- Develop machine learning pipelines with MLlib and productionize models using MLflow

A short PySpark sketch in this spirit follows the table of contents.

Table of Contents:

Foreword
Preface
   Who This Book Is For; How the Book Is Organized; How to Use the Code Examples; Software and Configuration Used; Conventions Used in This Book; Using Code Examples; O'Reilly Online Learning; How to Contact Us; Acknowledgments
1. Introduction to Apache Spark: A Unified Analytics Engine
   The Genesis of Spark; Big Data and Distributed Computing at Google; Hadoop at Yahoo!; Spark's Early Years at AMPLab; What Is Apache Spark?; Speed; Ease of Use; Modularity; Extensibility; Unified Analytics; Apache Spark Components as a Unified Stack; Spark SQL; Spark MLlib; Spark Structured Streaming; GraphX; Apache Spark's Distributed Execution; Spark driver; SparkSession; Cluster manager; Spark executor; Deployment modes; Distributed data and partitions; The Developer's Experience; Who Uses Spark, and for What?; Data science tasks; Data engineering tasks; Popular Spark use cases; Community Adoption and Expansion
2. Downloading Apache Spark and Getting Started
   Step 1: Downloading Apache Spark; Spark's Directories and Files; Step 2: Using the Scala or PySpark Shell; Using the Local Machine; Step 3: Understanding Spark Application Concepts; Spark Application and SparkSession; Spark Jobs; Spark Stages; Spark Tasks; Transformations, Actions, and Lazy Evaluation; Narrow and Wide Transformations; The Spark UI; Your First Standalone Application; Counting M&Ms for the Cookie Monster; Building Standalone Applications in Scala; Summary
3. Apache Spark's Structured APIs
   Spark: What's Underneath an RDD?; Structuring Spark; Key Merits and Benefits; The DataFrame API; Spark's Basic Data Types; Spark's Structured and Complex Data Types; Schemas and Creating DataFrames; Two ways to define a schema; Columns and Expressions; Rows; Common DataFrame Operations; Using DataFrameReader and DataFrameWriter; Saving a DataFrame as a Parquet file or SQL table; Transformations and actions; Projections and filters; Renaming, adding, and dropping columns; Aggregations; Other common DataFrame operations; End-to-End DataFrame Example; The Dataset API; Typed Objects, Untyped Objects, and Generic Rows; Creating Datasets; Scala: Case classes; Dataset Operations; End-to-End Dataset Example; DataFrames Versus Datasets; When to Use RDDs; Spark SQL and the Underlying Engine; The Catalyst Optimizer; Phase 1: Analysis; Phase 2: Logical optimization; Phase 3: Physical planning; Phase 4: Code generation; Summary
4. Spark SQL and DataFrames: Introduction to Built-in Data Sources
   Using Spark SQL in Spark Applications; Basic Query Examples; SQL Tables and Views; Managed Versus Unmanaged Tables; Creating SQL Databases and Tables; Creating a managed table; Creating an unmanaged table; Creating Views; Temporary views versus global temporary views; Viewing the Metadata; Caching SQL Tables; Reading Tables into DataFrames; Data Sources for DataFrames and SQL Tables; DataFrameReader; DataFrameWriter; Parquet; Reading Parquet files into a DataFrame; Reading Parquet files into a Spark SQL table; Writing DataFrames to Parquet files; Writing DataFrames to Spark SQL tables; JSON; Reading a JSON file into a DataFrame; Reading a JSON file into a Spark SQL table; Writing DataFrames to JSON files; JSON data source options; CSV; Reading a CSV file into a DataFrame; Reading a CSV file into a Spark SQL table; Writing DataFrames to CSV files; CSV data source options; Avro; Reading an Avro file into a DataFrame; Reading an Avro file into a Spark SQL table; Writing DataFrames to Avro files; Avro data source options; ORC; Reading an ORC file into a DataFrame; Reading an ORC file into a Spark SQL table; Writing DataFrames to ORC files; Images; Reading an image file into a DataFrame; Binary Files; Reading a binary file into a DataFrame; Summary
5. Spark SQL and DataFrames: Interacting with External Data Sources
   Spark SQL and Apache Hive; User-Defined Functions; Spark SQL UDFs; Evaluation order and null checking in Spark SQL; Speeding up and distributing PySpark UDFs with Pandas UDFs; Querying with the Spark SQL Shell, Beeline, and Tableau; Using the Spark SQL Shell; Create a table; Insert data into the table; Running a Spark SQL query; Working with Beeline; Start the Thrift server; Connect to the Thrift server via Beeline; Execute a Spark SQL query with Beeline; Stop the Thrift server; Working with Tableau; Start the Thrift server; Start Tableau; Stop the Thrift server; External Data Sources; JDBC and SQL Databases; The importance of partitioning; PostgreSQL; MySQL; Azure Cosmos DB; MS SQL Server; Other External Sources; Higher-Order Functions in DataFrames and Spark SQL; Option 1: Explode and Collect; Option 2: User-Defined Function; Built-in Functions for Complex Data Types; Higher-Order Functions; transform(); filter(); exists(); reduce(); Common DataFrames and Spark SQL Operations; Unions; Joins; Windowing; Modifications; Adding new columns; Dropping columns; Renaming columns; Pivoting; Summary
6. Spark SQL and Datasets
   Single API for Java and Scala; Scala Case Classes and JavaBeans for Datasets; Working with Datasets; Creating Sample Data; Transforming Sample Data; Higher-order functions and functional programming; Converting DataFrames to Datasets; Memory Management for Datasets and DataFrames; Dataset Encoders; Spark's Internal Format Versus Java Object Format; Serialization and Deserialization (SerDe); Costs of Using Datasets; Strategies to Mitigate Costs; Summary
7. Optimizing and Tuning Spark Applications
   Optimizing and Tuning Spark for Efficiency; Viewing and Setting Apache Spark Configurations; Scaling Spark for Large Workloads; Static versus dynamic resource allocation; Configuring Spark executors' memory and the shuffle service; Maximizing Spark parallelism; How partitions are created; Caching and Persistence of Data; DataFrame.cache(); DataFrame.persist(); When to Cache and Persist; When Not to Cache and Persist; A Family of Spark Joins; Broadcast Hash Join; When to use a broadcast hash join; Shuffle Sort Merge Join; Optimizing the shuffle sort merge join; When to use a shuffle sort merge join; Inspecting the Spark UI; Journey Through the Spark UI Tabs; Jobs and Stages; Executors; Storage; SQL; Environment; Debugging Spark applications; Summary
8. Structured Streaming
   Evolution of the Apache Spark Stream Processing Engine; The Advent of Micro-Batch Stream Processing; Lessons Learned from Spark Streaming (DStreams); The Philosophy of Structured Streaming; The Programming Model of Structured Streaming; The Fundamentals of a Structured Streaming Query; Five Steps to Define a Streaming Query; Step 1: Define input sources; Step 2: Transform data; Step 3: Define output sink and output mode; Step 4: Specify processing details; Step 5: Start the query; Putting it all together; Under the Hood of an Active Streaming Query; Recovering from Failures with Exactly-Once Guarantees; Monitoring an Active Query; Querying current status using StreamingQuery; Get current metrics using StreamingQuery; Get current status using StreamingQuery.status(); Publishing metrics using Dropwizard Metrics; Publishing metrics using custom StreamingQueryListeners; Streaming Data Sources and Sinks; Files; Reading from files; Writing to files; Apache Kafka; Reading from Kafka; Writing to Kafka; Custom Streaming Sources and Sinks; Writing to any storage system; Using foreachBatch(); Using foreach(); Reading from any storage system; Data Transformations; Incremental Execution and Streaming State; Stateless Transformations; Stateful Transformations; Distributed and fault-tolerant state management; Types of stateful operations; Stateful Streaming Aggregations; Aggregations Not Based on Time; Aggregations with Event-Time Windows; Handling late data with watermarks; Semantic guarantees with watermarks; Supported output modes; Streaming Joins; Stream-Static Joins; Stream-Stream Joins; Inner joins with optional watermarking; Outer joins with watermarking; Arbitrary Stateful Computations; Modeling Arbitrary Stateful Operations with mapGroupsWithState(); Using Timeouts to Manage Inactive Groups; Processing-time timeouts; Event-time timeouts; Generalization with flatMapGroupsWithState(); Performance Tuning; Summary
9. Building Reliable Data Lakes with Apache Spark
   The Importance of an Optimal Storage Solution; Databases; A Brief Introduction to Databases; Reading from and Writing to Databases Using Apache Spark; Limitations of Databases; Data Lakes; A Brief Introduction to Data Lakes; Reading from and Writing to Data Lakes using Apache Spark; Limitations of Data Lakes; Lakehouses: The Next Step in the Evolution of Storage Solutions; Apache Hudi; Apache Iceberg; Delta Lake; Building Lakehouses with Apache Spark and Delta Lake; Configuring Apache Spark with Delta Lake; Loading Data into a Delta Lake Table; Loading Data Streams into a Delta Lake Table; Enforcing Schema on Write to Prevent Data Corruption; Evolving Schemas to Accommodate Changing Data; Transforming Existing Data; Updating data to fix errors; Deleting user-related data; Upserting change data to a table using merge(); Deduplicating data while inserting using insert-only merge; Auditing Data Changes with Operation History; Querying Previous Snapshots of a Table with Time Travel; Summary
10. Machine Learning with MLlib
   What Is Machine Learning?; Supervised Learning; Unsupervised Learning; Why Spark for Machine Learning?; Designing Machine Learning Pipelines; Data Ingestion and Exploration; Creating Training and Test Data Sets; Preparing Features with Transformers; Understanding Linear Regression; Using Estimators to Build Models; Creating a Pipeline; One-hot encoding; Evaluating Models; RMSE; Interpreting the value of RMSE; R2; Saving and Loading Models; Hyperparameter Tuning; Tree-Based Models; Decision trees; Random forests; k-Fold Cross-Validation; Optimizing Pipelines; Summary
11. Managing, Deploying, and Scaling Machine Learning Pipelines with Apache Spark
   Model Management; MLflow; Tracking; Model Deployment Options with MLlib; Batch; Streaming; Model Export Patterns for Real-Time Inference; Leveraging Spark for Non-MLlib Models; Pandas UDFs; Spark for Distributed Hyperparameter Tuning; Joblib; Hyperopt; Summary
12. Epilogue: Apache Spark 3.0
   Spark Core and Spark SQL; Dynamic Partition Pruning; Adaptive Query Execution; The AQE framework; SQL Join Hints; Shuffle sort merge join (SMJ); Broadcast hash join (BHJ); Shuffle hash join (SHJ); Shuffle-and-replicate nested loop join (SNLJ); Catalog Plugin API and DataSourceV2; Accelerator-Aware Scheduler; Structured Streaming; PySpark, Pandas UDFs, and Pandas Function APIs; Redesigned Pandas UDFs with Python Type Hints; Iterator Support in Pandas UDFs; New Pandas Function APIs; Changed Functionality; Languages Supported and Deprecated; Changes to the DataFrame and Dataset APIs; DataFrame and SQL Explain Commands; Summary
Index
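To give a flavor of the high-level Structured APIs the book walks through, here is a minimal PySpark sketch. It is illustrative only: the input path and the column names ("city", "sales") are hypothetical, not taken from the book's examples.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start (or reuse) a local SparkSession
spark = (SparkSession.builder
         .appName("structured-api-sketch")
         .getOrCreate())

# Read a CSV file into a DataFrame, inferring column types from the data
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/path/to/sales.csv"))  # hypothetical input file

# Transformation: total sales per city, largest first
totals = (df.groupBy("city")
          .agg(F.sum("sales").alias("total_sales"))
          .orderBy(F.desc("total_sales")))

totals.show(10)  # action: triggers the computation

# Write the result as Parquet for downstream analytics
totals.write.mode("overwrite").parquet("/path/to/sales_by_city")

spark.stop()

The same pipeline could just as well be written with the Scala API or as a Spark SQL query; choosing between those surfaces is the kind of topic the Structured API chapters cover.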
