Feature Extraction
Let's assume we have ten independent variables. In feature extraction, we create ten "new" independent variables, where each newly created variable is a combination of our ten original variables. However, these new variables are created in a specific way and are ordered by how well they predict our dependent variable. Since the variables are now ranked by predictive power, we know which ones are most and least important, so we can readily drop the least significant ones. And because each new independent variable is a combination of our old ones, we still retain the fundamental attributes of the original variables.
Principal component analysis is a feature-extraction technique: it combines our input variables in a specific way so that we can drop the "least important" variables while still retaining the fundamental attributes of the original ones.
How does Principal Component Analysis (PCA) work?

Here, we transform five data points using Principal Component Analysis (PCA). The left graph shows our original data; the right graph shows the transformed data.
source: setosa.io
Principal Component Analysis (PCA) can be broken down into five steps:
- Standardization of the range of continuous initial variables
- Computation of the covariance matrix to identify correlations
- Computation of the eigenvectors and eigenvalues of the covariance matrix to identify the principal components
- Creation of a feature vector to decide which principal components to keep
- Recasting the data along the axes of the principal components
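Before walking through each step in detail, here is a minimal end-to-end sketch of the same idea using NumPy and scikit-learn. The toy data, the variable values, and the choice of two components are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Assumed toy data: 5 observations of 3 variables (illustration only)
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.2],
              [2.2, 2.9, 0.3],
              [1.9, 2.2, 0.8],
              [3.1, 3.0, 0.2]])

# Step 1: standardize so that every variable contributes equally
X_std = StandardScaler().fit_transform(X)

# Steps 2-5: PCA finds the principal components and recasts the
# data along their axes; here we keep only the top 2 of 3 components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

print(X_pca.shape)                    # (5, 2): reduced from 3 dimensions to 2
print(pca.explained_variance_ratio_)  # share of the variance each component keeps
```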
Step 1: Standardization
In this step, we bring all the variables to a common scale so that each one contributes equally to the analysis. If the variables have very different ranges, the ones with larger ranges will dominate those with smaller ranges. For instance, a variable that ranges between 0 and 1,000 will dominate a variable that ranges between 0 and 1, and this will lead to biased results. So we transform the data to comparable scales to prevent this problem.
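The most common way to do this is z-scoring: subtract each variable's mean and divide by its standard deviation. Here is a minimal sketch with NumPy, using assumed toy values chosen only to show the scale difference; the snippets in the remaining steps continue from this standardized array `X_std`.

```python
import numpy as np

# Assumed toy data: two variables on very different scales (illustration only)
X = np.array([[100.0, 0.2],
              [350.0, 0.5],
              [900.0, 0.9],
              [ 40.0, 0.1]])

# z-score: subtract each column's mean and divide by its standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # approximately 0 for every variable
print(X_std.std(axis=0))   # 1 for every variable
```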
Step 2: Covariance Matrix Computation
The main aim of this step is to understand how the variables of the input data set vary from the mean with respect to each other, that is, to determine whether there is any relationship between them. The covariance matrix is an n × n symmetric matrix (where n is the number of dimensions) whose entries are the covariances of all possible pairs of the initial variables; a short NumPy sketch at the end of this step shows the computation.

What do the covariance matrix entries tell us about the correlations between the variables?
It is actually the sign of the covariance that matters:
- If the sign is positive, the two variables increase or decrease together (correlated).
- If the sign is negative, one variable increases when the other decreases (inversely correlated).
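Continuing from the standardized array `X_std` in Step 1, the covariance matrix can be computed in one call with NumPy (a sketch, not the only way to do it):

```python
import numpy as np

# rowvar=False: each column of X_std is a variable, each row an observation
cov_matrix = np.cov(X_std, rowvar=False)

print(cov_matrix.shape)  # (n, n), where n is the number of variables
print(cov_matrix)        # positive off-diagonal entries: the pair moves together;
                         # negative entries: the pair moves in opposite directions
```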
Step 3: Computing the eigenvectors and eigenvalues of the covariance matrix
Eigenvectors and eigenvalues are linear algebra concepts that we compute from the covariance matrix to determine the principal components of the data. The principal components are the "new" variables formed as linear combinations, or mixtures, of the initial variables in our feature set.
Let's assume our data set is 2-dimensional, with two variables (x, y), and that the eigenvectors and eigenvalues of its covariance matrix are as follows:
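As an illustration, here is a sketch that takes an assumed 2 × 2 covariance matrix (the numbers are invented purely for demonstration) and computes its eigenvalues and eigenvectors with NumPy:

```python
import numpy as np

# Assumed covariance matrix for two variables (x, y) -- illustration only
cov_matrix = np.array([[0.6166, 0.6154],
                       [0.6154, 0.7166]])

# eigh is the right routine here because a covariance matrix is symmetric
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Sort from largest to smallest eigenvalue: the eigenvector with the largest
# eigenvalue points along the direction of maximum variance, i.e. the first
# principal component
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

print(eigenvalues)   # roughly [1.28, 0.05]: almost all the variance lies on one axis
print(eigenvectors)  # columns are the principal directions (unit eigenvectors)
```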

Step 4: Creation of a feature vector
In this step, we decide whether to keep all these components or discard those of lesser significance (i.e., those with low eigenvalues), and use the remaining ones to form a matrix of vectors that we call the feature vector.
This will reduce the dimensions of our feature set because if we choose to keep only p eigenvectors (components) out of n, the final data set will have only p dimensions.
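Continuing the sketch from Step 3, the feature vector is simply the matrix whose columns are the p eigenvectors we decide to keep; choosing p = 1 here is an assumption made only for illustration:

```python
# Share of the total variance explained by each component,
# which helps decide how many components (p) to keep
explained = eigenvalues / eigenvalues.sum()
print(explained)

# Keep only the eigenvectors with the largest eigenvalues
p = 1  # assumed choice for this illustration
feature_vector = eigenvectors[:, :p]

print(feature_vector.shape)  # (n, p): one column per retained principal component
```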
Step 5: Recast the data along the axes of the principal components
In this step, we use the feature vector formed from the eigenvectors of the covariance matrix to reorient the data from the original axes to those represented by the principal components (hence the name Principal Component Analysis). To do this, we multiply the transpose of the original data set by the transpose of the feature vector.
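Here is a sketch of this final projection, assuming `X_std` is the standardized data from Step 1 (observations in rows) and `feature_vector` holds the eigenvectors retained in Step 4:

```python
# FinalData^T = FeatureVector^T x StandardizedData^T, which for row-wise data
# is equivalent to the single matrix product below
X_projected = X_std @ feature_vector

print(X_projected.shape)  # (number of observations, p): the reduced data set
```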

Frequently Asked Questions
1). When should we use Principal component analysis (PCA)?
- When we want to reduce the number of variables but cannot identify which ones can be removed from consideration entirely
- When we want to ensure our variables are independent of one another
- When we are comfortable making our independent variables less interpretable
2). What are the limitations of Principal component analysis (PCA)?
- Even though principal components are linear combinations of the original variables, they are not easy to interpret.
- There is a trade-off between information loss and dimensionality reduction.
3). What type of data should be used for Principal component analysis (PCA)?
Principal component analysis (PCA) works best on a data set with three or more dimensions, because with more dimensions it becomes increasingly challenging to make interpretations from the resulting cloud of data.
Key Takeaways
If we have a lot of independent variables to handle, we use Principal component analysis (PCA) to reduce the dimensions of our feature set.
Principal component analysis is a technique that combines our input variables through linear combinations and mixtures, after which we can drop the "least significant" variables while still retaining the most valuable attributes of all of the variables.