Data Cleansing using Python Online Course

Data preparation, or preprocessing, is a critical yet time-consuming step in machine learning projects. It involves transforming raw data into a format suitable for modeling, including converting non-numeric data into numbers and meeting algorithm-specific requirements. This course covers data imputation, advanced data cleansing techniques, and methods to prevent data leakage for accurate model evaluation. By the end, you'll master essential data cleaning and preprocessing skills for machine learning.
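As a rough illustration of the data leakage idea mentioned above, the sketch below standardizes one feature using statistics learned from the training split only, then contrasts that with the leaky shortcut of computing them on all rows. The dataset and the helper names (`train_test_split`, `fit_standardizer`, `standardize`) are hypothetical and written with the standard library only; they are not the course's own code.

```python
import random
import statistics

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_standardizer(values):
    """Learn the mean and standard deviation from the given values."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # guard against zero variance
    return mean, std

def standardize(values, mean, std):
    """Apply previously learned statistics to any split."""
    return [(v - mean) / std for v in values]

# Hypothetical single-feature dataset with one extreme value.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 100.0]
train, test = train_test_split(data)

# Leakage-free: statistics come from the training split alone.
mean, std = fit_standardizer(train)
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)

# Naive (leaky): statistics computed on all rows, including the test split.
leaky_mean, leaky_std = fit_standardizer(data)
```

In the leaky version, information about the held-out rows influences the transform, so an evaluation on those rows can look better than it should.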


Who is this Course for?

This course is intended for learners who plan to work as machine learning engineers on real-world applications. A strong foundation in Python, basic knowledge of machine learning, and some experience with machine learning libraries are recommended.


What you will learn

  • Prepare data to prevent data leakage.
  • Identify and resolve issues with messy data.
  • Choose appropriate feature selection methods for different data types.
  • Transform input variable probability distributions.
  • Remove irrelevant and redundant input variables.
  • Reduce dimensionality by projecting variables into lower-dimensional space.
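To make "messy data" a little more concrete, here is a minimal sketch of two common checks: finding columns that hold a single value and removing exact duplicate rows. The toy table and the function names are assumptions made for illustration, using plain Python lists rather than any particular library.

```python
def single_value_columns(rows):
    """Return indexes of columns in which every row holds the same value."""
    n_cols = len(rows[0])
    return [c for c in range(n_cols) if len({row[c] for row in rows}) == 1]

def drop_duplicate_rows(rows):
    """Remove exact duplicate rows, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def drop_columns(rows, columns):
    """Return rows without the given column indexes."""
    skip = set(columns)
    return [[v for c, v in enumerate(row) if c not in skip] for row in rows]

# Hypothetical toy table: column 1 is constant, and row 2 repeats row 0.
table = [
    [5.1, 1, "a"],
    [4.9, 1, "b"],
    [5.1, 1, "a"],
    [6.2, 1, "c"],
]

constant = single_value_columns(table)
cleaned = drop_columns(drop_duplicate_rows(table), constant)
```

A constant column carries no information for a model, and duplicate rows can distort evaluation when copies land in both the training and test splits.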


Course Table of Contents

Introduction

  • Course Introduction
  • Course Structure
  • Is this Course Right for You?

Foundations

  • Introducing Data Preparation
  • The Machine Learning Process
  • Data Preparation Defined
  • Choosing a Data Preparation Technique
  • What is Data in Machine Learning?
  • Raw Data
  • Machine Learning is Mostly Data Preparation
  • Common Data Preparation Tasks - Data Cleansing
  • Common Data Preparation Tasks - Feature Selection
  • Common Data Preparation Tasks - Data Transforms
  • Common Data Preparation Tasks - Feature Engineering
  • Common Data Preparation Tasks - Dimensionality Reduction
  • Data Leakage
  • Problem with Naïve Data Preparation
  • Case Study: Data Leakage: Train/Test Split Naïve Approach
  • Case Study: Data Leakage: Train/Test Split Correct Approach
  • Case Study: Data Leakage: K-Fold Naïve Approach
  • Case Study: Data Leakage: K-Fold Correct Approach

Data Cleansing

  • Data Cleansing Overview
  • Identify Columns That Contain a Single Value
  • Identify Columns with Few Values
  • Remove Columns with Low Variance
  • Identify and Remove Rows That Contain Duplicate Data
  • Defining Outliers
  • Remove Outliers - The Standard Deviation Approach
  • Remove Outliers - The IQR Approach
  • Automatic Outlier Detection
  • Mark Missing Values
  • Remove Rows with Missing Values
  • Statistical Imputation
  • Mean Value Imputation
  • Simple Imputer with Model Evaluation
  • Compare Different Statistical Imputation Strategies
  • K-Nearest Neighbors Imputation
  • KNNImputer and Model Evaluation
  • Iterative Imputation
  • IterativeImputer and Model Evaluation
  • IterativeImputer and Different Imputation Order

Feature Selection

  • Feature Selection Introduction
  • Feature Selection Defined
  • Statistics for Feature Selection
  • Loading a Categorical Dataset
  • Encode the Dataset for Modeling
  • Chi-Squared
  • Mutual Information
  • Modeling with Selected Categorical Features
  • Feature Selection with ANOVA on Numerical Input
  • Feature Selection with Mutual Information
  • Modeling with Selected Numerical Features
  • Tuning the Number of Selected Features
  • Select Features for Numerical Output
  • Linear Correlation with Correlation Statistics
  • Linear Correlation with Mutual Information
  • Baseline and Model Built Using Correlation
  • Model Built Using Mutual Information Features
  • Tuning the Number of Selected Features
  • Recursive Feature Elimination
  • RFE for Classification
  • RFE for Regression
  • RFE Hyperparameters
  • Feature Ranking for RFE
  • Feature Importance Scores Defined
  • Feature Importance Scores: Linear Regression
  • Feature Importance Scores: Logistic Regression and CART
  • Feature Importance Scores: Random Forests
  • Permutation Feature Importance
  • Feature Selection with Importance

Data Transforms

  • Scale Numerical Data
  • Diabetes Dataset for Scaling
  • MinMaxScaler Transform
  • StandardScaler Transform
  • Robust Scaling Data
  • Robust Scaler Applied to Dataset
  • Explore Robust Scaler Range
  • Nominal and Ordinal Variables
  • Ordinal Encoding
  • One-Hot Encoding Defined
  • One-Hot Encoding
  • Dummy Variable Encoding
  • Ordinal Encoder Transform on Breast Cancer Dataset
  • Make Distributions More Gaussian
  • Power Transform on Contrived Dataset
  • Power Transform on Sonar Dataset
  • Box-Cox on Sonar Dataset
  • Yeo-Johnson on Sonar Dataset
  • Polynomial Features
  • Effect of Polynomial Degrees
  • Advanced Transforms

Transforming Different Data Types

  • The ColumnTransformer
  • The ColumnTransformer on Abalone Dataset
  • Manually Transform Target Variable
  • Automatically Transform Target Variable
  • Challenge of Preparing New Data for a Model
  • Save Model and Data Scaler
  • Load and Apply Saved Scalers

Dimensionality Reduction

  • Curse of Dimensionality
  • Techniques for Dimensionality Reduction
  • Linear Discriminant Analysis
  • Linear Discriminant Analysis Demonstrated
  • Principal Component Analysis
