Apache Spark 3 for Data Engineering and Analytics with Python
Video description
Master Python and PySpark 3.0.1 for Data Engineering / Analytics (Databricks)
About This Video
Apply PySpark and SQL concepts to analyze data
Understand the Databricks interface and use Spark on Databricks
Learn Spark transformations and actions using the RDD (Resilient Distributed Datasets) API
In Detail
Apache Spark 3 is an open-source distributed engine for querying and processing data. This course gives you a detailed understanding of PySpark and its stack, and guides you through the process of data analytics with Python and Spark. The author takes an interactive approach to explaining key concepts of PySpark such as the Spark architecture, Spark execution, transformations and actions using the structured API, and much more. You will be able to leverage the power of Python, Java, and SQL and put it to use in the Spark ecosystem.
You will start by getting a firm understanding of the Apache Spark architecture and how to set up a Python environment for Spark. You will then learn techniques for collecting, cleaning, and visualizing data by creating dashboards in Databricks, and how to use SQL to interact with DataFrames. The author also provides an in-depth review of RDDs and contrasts them with DataFrames.
Challenge problems are provided at intervals throughout the course so that you get a firm grasp of the concepts taught.
Who this course is for
This course is designed for Python developers who wish to learn how to use the language for data engineering and analytics with PySpark. It also suits aspiring data engineering and analytics professionals, data scientists and analysts who wish to learn an analytical processing strategy that can be deployed over a big data cluster, and data managers who want to gain a deeper understanding of managing data over a cluster.
Chapter 1 : Introduction to Spark and Installation
Introduction
The Spark Architecture
The Spark Unified Stack
Java Installation
Hadoop Installation
Python Installation
PySpark Installation
Install Microsoft Build Tools
MacOS - Java Installation
MacOS - Python Installation
MacOS - PySpark Installation
MacOS - Testing the Spark Installation
Install Jupyter Notebooks
The Spark Web UI
Section Summary
Chapter 2 : Spark Execution Concepts
Section Introduction
Spark Application and Session
Spark Transformations and Actions Part 1
Spark Transformations and Actions Part 2
DAG Visualisation
Chapter 3 : RDD Crash Course
Introduction to RDDs
Data Preparation
Distinct and Filter Transformations
Map and Flat Map Transformations
SortByKey Transformations
RDD Actions
Challenge - Convert Fahrenheit to Centigrade
Challenge - XYZ Research
Challenge - XYZ Research Part 1
Challenge - XYZ Research Part 2
Chapter 4 : Structured API - Spark DataFrame
Structured APIs Introduction
Preparing the Project Folder
PySpark DataFrame, Schema, and DataTypes
DataFrame Reader and Writer
Challenge Part 1 - Brief
Challenge Part 1 - Data Preparation
Working with Structured Operations
Managing Performance Errors
Reading a JSON File
Columns and Expressions
Filter and Where Conditions
Distinct Drop Duplicates Order By
Rows and Union
Adding, Renaming, and Dropping Columns
Working with Missing or Bad Data
Working with User-Defined Functions
Challenge Part 2 - Brief
Challenge Part 2 - Remove Null Row and Bad Records
Challenge Part 2 - Get the City and State
Challenge Part 2 - Rearrange the Schema
Challenge Part 2 - Write Partitioned DataFrame to Parquet
Aggregations
Aggregations - Setting Up Flight Summary Data
Aggregations - Count and Count Distinct
Aggregations - Min Max Sum SumDistinct AVG
Aggregations with Grouping
Challenge Part 3 - Brief
Challenge Part 3 - Prepare 2019 Data
Challenge Part 3 - Q1 Get the Best Sales Month
Challenge Part 3 - Q2 Get the City that Sold the Most Products
Challenge Part 3 - Q3 When to Advertise
Challenge Part 3 - Q4 Products Bought Together
Chapter 5 : Introduction to Spark SQL and Databricks
Introduction to Databricks
Spark SQL Introduction
Register Account on Databricks
Create a Databricks Cluster
Creating our First 2 Databricks Notebooks
Reading CSV Files into DataFrame
Creating a Database and Table
Inserting Records into a Table
Exposing Bad Records
Figuring out How to Remove Bad Records
Extract the City and State
Inserting Records to Final Sales Table
What was the Best Month in Sales?
Get the City that Sold the Most Products
Get the Right Time to Advertise
Get the Most Products Sold Together
Create a Dashboard
Summary