Dimensionality reduction using PCA
Implementing Principal Component Analysis in Python
- Understand what Principal Component Analysis (PCA) is
- When to use PCA
- Run PCA with the sklearn library to reduce the dimensionality of the wine dataset from Kaggle
What is Principal Component Analysis?
Principal Component Analysis (PCA) is a multivariate statistical technique that transforms a data table containing several variables, which may be inter-correlated, into a smaller dataset with fewer features that still contains most of the information in the original source.
Reducing the dimensionality of a dataset makes data exploration, visualization and model training easier and faster, because there are fewer variables to handle. However, the principal components should retain most of the variance of the original variables.
When to use PCA
PCA is a good fit when all three of the following hold:
- The number of variables needs to be reduced, but it is difficult to identify which ones can be dropped.
- Losing some interpretability, and possibly some accuracy in predicting the target, is acceptable.
- The resulting variables need to be independent of one another (principal components are uncorrelated by construction).
Dimensionality Reduction in the Wine Dataset
For this demonstration, the dataset is Wine Customer Segmentation from Kaggle. The data are the results of a chemical analysis of wines grown in the same region of Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wine.
Reading the data as a pandas DataFrame:
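A minimal sketch of this step. The file name `Wine.csv` and the target column name `Customer_Segment` are assumptions about the Kaggle download, so adjust them to your copy. When the file is not present, the sketch falls back to scikit-learn's bundled wine dataset, which describes the same kind of chemical analysis (13 constituents, three cultivars) and is used here only as a stand-in.

```python
from pathlib import Path

import pandas as pd
from sklearn.datasets import load_wine

# Assumed local path to the Kaggle CSV; change it to wherever the file was saved.
CSV_PATH = Path("Wine.csv")

if CSV_PATH.exists():
    df = pd.read_csv(CSV_PATH)
else:
    # Stand-in: scikit-learn's bundled wine data, with the target renamed
    # to the (assumed) Kaggle column name.
    df = load_wine(as_frame=True).frame.rename(columns={"target": "Customer_Segment"})

print(df.shape)
print(df.head())
```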
Removing outliers and standardizing numerical features
PCA is very sensitive to outliers, and extreme observations can add noise to the cluster definitions and lead to erroneous interpretations. It is therefore essential to identify and remove outliers, and to standardize the numerical features, so that the segmentation is sound and all continuous variables have a uniform impact.
Removing the outliers:
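One possible way to do this is the interquartile range (IQR) rule, sketched below. The rule and its 1.5 multiplier are assumptions for illustration and are not necessarily what produced the numbers quoted in this post. The sketch uses scikit-learn's bundled wine data as a stand-in for the Kaggle file.

```python
from sklearn.datasets import load_wine

# Stand-in for the Kaggle CSV: scikit-learn's bundled wine dataset.
df = load_wine(as_frame=True).frame.rename(columns={"target": "Customer_Segment"})
feature_cols = [c for c in df.columns if c != "Customer_Segment"]

# IQR rule (assumed): keep rows whose features all lie within
# [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
q1 = df[feature_cols].quantile(0.25)
q3 = df[feature_cols].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

inlier_mask = ((df[feature_cols] >= lower) & (df[feature_cols] <= upper)).all(axis=1)
df_clean = df[inlier_mask].reset_index(drop=True)

print(f"Rows before: {len(df)}, after: {len(df_clean)}")
```

A stricter or looser multiplier changes how many rows survive, so it is worth checking the row counts before moving on.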
It is important to stress that standardization or normalization must be fitted on the training set only; the fitted scaler is then used to transform the testing set.
Splitting the data into training and testing:
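A minimal sketch with `train_test_split`. The 80/20 split, the stratification by class and the `random_state` value are assumptions chosen for illustration, and the bundled scikit-learn wine data stands in for the Kaggle file.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Stand-in data: 13 numerical features (X) and the cultivar label (y).
X, y = load_wine(return_X_y=True, as_frame=True)

# Stratify so that each cultivar keeps roughly the same share in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print(X_train.shape, X_test.shape)
```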
Scaling both datasets using parameters learned from the numerical values of the training set:
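The sketch below uses `StandardScaler` (zero mean, unit variance). That choice is an assumption; `MinMaxScaler` follows the same fit-on-train, transform-on-test pattern. The split parameters and the stand-in dataset are also assumptions.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data and an assumed 80/20 stratified split.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # learn mean and std from training rows only
X_test_std = scaler.transform(X_test)        # reuse the training statistics, no refitting

print(X_train_std.mean(axis=0).round(2))  # close to 0 for every feature
print(X_train_std.std(axis=0).round(2))   # close to 1 for every feature
```

The testing set will not have exactly zero mean and unit variance after this step, which is expected: it is scaled with the training statistics.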
Reducing dimensionality using PCA
PCA can now be fitted on the training set using the sklearn library. The PCA class exposes attributes such as explained_variance_ and explained_variance_ratio_. The first gives the amount of variance explained by each principal component, while the second gives the proportion of the total variance explained by each one.
Looking at the cumulative explained variance ratio, only 3 components explain about 70% of the total variance of the features.
If the goal is to keep enough principal components to explain at least 95% of the total variance, 9 components are retained in the new dataset.
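A minimal sketch of fitting PCA and reading both attributes. It assumes the bundled scikit-learn wine data as a stand-in, an 80/20 stratified split and `StandardScaler`, so the printed values will not match the ones quoted in this post exactly.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data, assumed split and scaling.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
X_train_std = StandardScaler().fit_transform(X_train)

# With no n_components given, PCA keeps every component (13 here).
pca = PCA()
pca.fit(X_train_std)

print(pca.explained_variance_)        # variance captured by each component
print(pca.explained_variance_ratio_)  # the same values as shares of the total variance
```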
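The running total can be computed with a small helper like the one sketched here. The function name `cumulative_explained_variance` is hypothetical, and the data, split and scaler are the same assumptions as in the rest of this walkthrough, so the percentages printed may differ from the ones quoted above.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def cumulative_explained_variance(X_std):
    """Return the running total of explained variance ratios, one entry per component."""
    pca = PCA().fit(X_std)
    return np.cumsum(pca.explained_variance_ratio_)


# Stand-in data, assumed split and scaling.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
X_train_std = StandardScaler().fit_transform(X_train)

cumulative = cumulative_explained_variance(X_train_std)
for k, share in enumerate(cumulative, start=1):
    print(f"{k:2d} components: {share:.1%}")
```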
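In sklearn, passing a float between 0 and 1 as `n_components` asks PCA to keep the smallest number of components whose cumulative explained variance reaches that share. A sketch under the same assumptions as before (stand-in data, 80/20 stratified split, `StandardScaler`); the number of components it reports depends on the data and preprocessing, so it may differ from the figure above.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data, assumed split and scaling.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# Keep as many components as needed to explain at least 95% of the variance.
# svd_solver="full" is set explicitly because a fractional n_components needs it.
pca = PCA(n_components=0.95, svd_solver="full")
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)  # project the test rows onto the same components

print(pca.n_components_)
print(pca.explained_variance_ratio_.sum())
```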
However, the first two components can be sufficient to identify the three well-defined customer clusters in both the training and testing sets, as scatter plots of those two components show.
This post showed what PCA is, when it can be used to reduce dimensionality, and how to apply it to a standardized numerical dataset with no outliers in order to find clusters using the resulting principal components, while keeping an acceptable share of the variance of the original features.
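Charts of this kind can be drawn with matplotlib, as in the sketch below. The figure layout, labels and output file name are assumptions, and the bundled scikit-learn wine data (classes 0, 1 and 2) stands in for the Kaggle file, so the plots will not be identical to the original ones.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data, assumed split and scaling.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
scaler = StandardScaler().fit(X_train)

# Fit on the training set, then project both sets onto the first two components.
pca = PCA(n_components=2)
Z_train = pca.fit_transform(scaler.transform(X_train))
Z_test = pca.transform(scaler.transform(X_test))

fig, axes = plt.subplots(1, 2, figsize=(12, 5), sharex=True, sharey=True)
panels = [(Z_train, y_train, "Training set"), (Z_test, y_test, "Testing set")]
for ax, (Z, labels, title) in zip(axes, panels):
    for segment in np.unique(labels):
        mask = labels == segment
        ax.scatter(Z[mask, 0], Z[mask, 1], label=f"Segment {segment}")
    ax.set_title(title)
    ax.set_xlabel("Principal component 1")
    ax.set_ylabel("Principal component 2")
    ax.legend()

fig.savefig("pca_two_components.png")  # assumed output file name
```

In a notebook or interactive session, `plt.show()` can replace the `savefig` call.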
If this work was helpful to you, I will be glad to connect with you on Twitter or LinkedIn.