Apache Spark Training Content
1. Introduction
• Overview of Big Data and Apache Spark
• Spark Ecosystem and Components
• Use Cases and Advantages of Spark over Hadoop MapReduce
2. Setup
• Spark Installation (Standalone and Cluster Modes)
• Environment Configuration (Local, YARN, Databricks)
• Integration with Jupyter Notebooks and IDEs (example below)
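A minimal session-setup sketch, assuming PySpark was installed with pip into the notebook environment; on YARN or Databricks the platform normally supplies or configures the session instead:

```python
# Minimal local-mode session for a Jupyter sanity check.
# Assumes `pip install pyspark` in the notebook environment.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("training-setup-check")
    .master("local[*]")          # use all local cores
    .getOrCreate()
)

print(spark.version)             # quick check that the install works
spark.stop()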
3. PySpark
• Introduction to PySpark API
• RDDs vs DataFrames
• Basic Transformations and Actions (example below)
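A short sketch contrasting the RDD and DataFrame APIs; transformations are lazy, and only actions trigger execution:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext

# RDD API: transformations (map, filter) are lazy; the action (collect) runs the job.
rdd = sc.parallelize([1, 2, 3, 4, 5])
squares = rdd.map(lambda x: x * x).filter(lambda x: x > 4)
print(squares.collect())                     # [9, 16, 25]

# Same idea with the DataFrame API, which adds a schema and an optimizer.
df = spark.createDataFrame([(1,), (2,), (3,)], ["n"])
df.selectExpr("n * n AS n_squared").where("n_squared > 4").show()
```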
4. DataFrame & Spark SQL
• Creating DataFrames from various sources (CSV, JSON, Parquet, etc.)
• Schema Inference and Manual Schema Definition
• Spark SQL Queries and DataFrame API Interoperability (example below)
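A sketch of manual schema definition and SQL/DataFrame interoperability; the file path data/weather.csv and its columns are placeholders for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.master("local[*]").getOrCreate()

# A manual schema avoids the extra pass over the file that inference needs.
schema = StructType([
    StructField("city", StringType(), True),
    StructField("temp_c", DoubleType(), True),
])

# "data/weather.csv" is a placeholder path, not a file shipped with the course.
df = spark.read.csv("data/weather.csv", header=True, schema=schema)

# The DataFrame API and SQL are interchangeable views over the same plan.
df.createOrReplaceTempView("weather")
spark.sql("SELECT city, AVG(temp_c) AS avg_c FROM weather GROUP BY city").show()
```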
5. Row Operations
• Accessing and Manipulating Row Data
• Filtering, Mapping, and Aggregating Rows
• Row-level UDFs and Custom Transformations (example below)
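A sketch of row-level access; the sample names and scores are made up for illustration:

```python
from pyspark.sql import SparkSession, Row
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.createDataFrame([
    Row(name="Ann", score=82),
    Row(name="Bob", score=58),
])

# Filtering and aggregating rows with the DataFrame API.
df.filter(df.score >= 60).show()
df.agg(F.avg("score").alias("mean_score")).show()

# Row-level access: each record arrives as a Row when mapping over the RDD view.
labels = df.rdd.map(lambda r: (r["name"], "pass" if r["score"] >= 60 else "fail"))
print(labels.collect())
```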
6. Column Operations
• Working with Columns: Select, Rename, Add, Drop
• Column Expressions and Functions
• Built-in Functions and UDFs for Column Transformation (example below)
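A sketch of common column operations; the sample data is made up, and the UDF is shown only as the fallback for when no built-in function fits:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([("ann", 82), ("bob", 58)], ["name", "score"])

# Prefer built-in column functions; they stay inside the Catalyst optimizer.
out = (
    df.withColumn("name", F.initcap("name"))       # add/replace a column
      .withColumnRenamed("score", "points")        # rename
      .withColumn("bonus", F.col("points") * 0.1)  # column expression
      .drop("bonus")                               # drop
)
out.show()

# UDFs run row-at-a-time in Python, so they are slower than built-ins.
shout = F.udf(lambda s: s.upper() + "!", StringType())
df.select(shout("name").alias("loud_name")).show()
```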
7. DAG / Spark UI / Explain Plan
• Understanding Spark DAG (Directed Acyclic Graph)
• Spark Execution Plan and Lazy Evaluation
• Using Spark UI for Job Monitoring and Debugging
• Inspecting Query Plans with explain() and Its Output Modes (example below)
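A sketch showing lazy evaluation and the explain() modes; the Spark UI URL in the comment is the default for a local session:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.range(1_000_000).withColumn("bucket", F.col("id") % 10)

agg = df.groupBy("bucket").count()

# Nothing has executed yet (lazy evaluation); explain() shows the plan Spark *would* run.
agg.explain("formatted")   # also: "simple", "extended", "codegen", "cost"

# An action materializes the DAG; the job then appears in the Spark UI
# (http://localhost:4040 by default for a local session).
agg.collect()
```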
8. Performance Tuning
• Caching and Persistence
• Partitioning and Coalescing
• Broadcast Variables and Join Optimization
• Skew Handling and Shuffle Optimization
• Resource Configuration (Executor/Driver Memory & Cores); tuning example below
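A sketch of caching, a broadcast join, and partition control; the configuration value shown is illustrative, not a recommendation:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .master("local[*]")
    .config("spark.sql.shuffle.partitions", "8")   # illustrative shuffle parallelism
    .getOrCreate()
)

facts = spark.range(1_000_000).withColumn("key", F.col("id") % 100)
dims = spark.createDataFrame([(i, f"dim_{i}") for i in range(100)], ["key", "label"])

# Cache a DataFrame that is reused across several actions.
facts = facts.cache()

# Broadcast the small side so the large side is not shuffled.
joined = facts.join(F.broadcast(dims), "key")
joined.groupBy("label").count().collect()

# coalesce() reduces partition count without a full shuffle (e.g., before writing).
few = joined.coalesce(4)
print(few.rdd.getNumPartitions())
```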
9. Structured Streaming
• Basics of Structured Streaming in Spark
• Reading and Writing Streams (Kafka, File Source, Socket, etc.)
• Watermarking, Windowing, and Aggregations
• Fault Tolerance and Checkpointing (example below)
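A self-contained streaming sketch using the built-in rate source, so no Kafka broker or socket is needed; the checkpoint path is a placeholder:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

# The built-in "rate" source generates (timestamp, value) rows locally.
events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Windowed count with a watermark so old state can eventually be dropped.
counts = (
    events
    .withWatermark("timestamp", "30 seconds")
    .groupBy(F.window("timestamp", "10 seconds"))
    .count()
)

# Checkpointing is what gives fault tolerance with real sinks;
# "/tmp/stream-ckpt" is a placeholder path for the demo.
query = (
    counts.writeStream
    .outputMode("update")
    .format("console")
    .option("checkpointLocation", "/tmp/stream-ckpt")
    .start()
)
query.awaitTermination(30)   # run briefly for the demo
query.stop()
```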
10. Databricks Spark SQL and Delta Lake
• Introduction to Databricks Platform
• Collaborative Notebooks and Workspace Features
• Delta Lake Overview and Architecture
• ACID Transactions, Time Travel, and Schema Evolution
• Optimized Writes, Z-Ordering, and Delta Lake Performance Tuning (example below)
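A Delta Lake sketch; on Databricks the session and Delta support are built in, while the local setup shown assumes the open-source delta-spark package (2.0+ for OPTIMIZE ... ZORDER BY). The table path is a placeholder:

```python
from pyspark.sql import SparkSession

# On Databricks, `spark` already exists and Delta is built in; the local
# configuration below assumes `pip install delta-spark`.
from delta import configure_spark_with_delta_pip

builder = (
    SparkSession.builder.master("local[*]")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaSparkSessionCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

path = "/tmp/delta/events"      # placeholder table path

# Each write is an ACID transaction recorded in the Delta log.
spark.range(5).write.format("delta").mode("overwrite").save(path)
spark.range(5, 10).write.format("delta").mode("append").save(path)

# Time travel: read the table as of an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
print(v0.count())               # 5 rows: the state before the append

# Compaction plus Z-ordering (Databricks, or open-source Delta 2.0+).
spark.sql(f"OPTIMIZE delta.`{path}` ZORDER BY (id)")
```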