Data Engineering and Analytics with Apache Spark 3 and Python Online Course
This Data Engineering and Analytics with Apache Spark 3 and Python online course offers a comprehensive overview of PySpark and the Spark stack. It covers essential concepts such as Spark architecture, execution, transformations, and actions using the structured API. You will learn how to use Python, Java, and SQL within the Spark ecosystem, starting with setting up a Python environment for Spark. The course focuses on data collection, cleaning, and visualization techniques, including creating dashboards in Databricks. It also provides an in-depth review of RDDs and DataFrames, with hands-on challenges to reinforce the concepts.
Key Benefits
- Gain expertise in applying PySpark and SQL to effectively analyze and process large datasets.
- Master the Databricks interface, enabling seamless use of Spark for scalable data processing and analytics.
- Develop a strong understanding of Spark transformations and actions through the Resilient Distributed Dataset (RDD) API, and use them to optimize data processing workflows.
Target Audience
This course is tailored for Python developers looking to enhance their skills in data engineering and analytics using PySpark. It is also ideal for aspiring data engineering and analytics professionals, as well as data scientists and analysts interested in learning efficient analytical processing techniques that can be scaled across a big data cluster. Additionally, the course is suitable for data managers seeking a deeper understanding of data management within a cluster environment.
Learning Objectives
- Understand the Spark architecture and gain proficiency in performing transformations and actions using the structured API.
- Set up and configure your own local PySpark environment for efficient data processing.
- Learn to interpret and work with the Directed Acyclic Graph (DAG) to optimize Spark execution.
- Gain expertise in navigating and interpreting the Spark Web UI for better monitoring and management.
- Master the Resilient Distributed Dataset (RDD) API for handling distributed data operations.
- Develop skills to visualize and present data through graphs and dashboards on Databricks, enhancing your analytical capabilities.
Course Topics
The Data Engineering and Analytics with Apache Spark 3 and Python Online Course covers the following topics:
Module 1 - Introduction to Spark and Installation
- Overview of Spark Architecture and Unified Stack
- Installation of Java, Hadoop, Python, and PySpark
- Installation of Microsoft Build Tools and Jupyter Notebooks
- Installation steps for macOS: Java, Python, and PySpark
- Verifying Spark Installation on macOS
- Exploring the Spark Web UI
Module 2 - Spark Execution Concepts
- Introduction to Spark Applications and Sessions
- Understanding Spark Transformations and Actions (Parts 1 & 2)
- Visualizing the Directed Acyclic Graph (DAG)
Module 3 - RDD Crash Course
- Introduction to Resilient Distributed Datasets (RDDs)
- Data Preparation and Transformations: Distinct, Filter, Map, FlatMap, and SortByKey
- RDD Actions
- Challenges: Converting Fahrenheit to Centigrade and XYZ Research (Parts 1 & 2)
Module 4 - Structured API - Spark DataFrame
- Introduction to Structured APIs
- Preparing the Project Folder for DataFrames
- Understanding PySpark DataFrame, Schema, and DataTypes
- Reading and Writing DataFrames
- Working with Structured Operations and Performance Management
- Handling Missing or Bad Data, User-Defined Functions, and Aggregations
- Challenge Parts 1 & 2: Data Preparation, Removing Null Rows, and Writing a Partitioned DataFrame to Parquet
- Challenge Part 3: Aggregations, Grouping, and Analyzing Sales Data
Module 5 - Introduction to Spark SQL and Databricks
- Introduction to Databricks and Spark SQL
- Registering for Databricks and Creating Clusters and Notebooks
- Reading CSV Files into DataFrames and Creating Databases and Tables
- Inserting, Cleaning, and Analyzing Sales Data
- Creating a Dashboard to Visualize Sales Insights