Introduction
Decision Trees: A Decision Tree is a supervised machine learning algorithm that can be used for both classification and regression problems. A classification problem identifies the category or group to which an observation belongs. A Decision Tree uses a tree-like model of decisions: each internal node represents a "test" on an attribute, each branch represents an outcome of that test, and each leaf node represents a class label (the decision taken after evaluating all the features along the path).
Common Examples:
Let's draw a Decision Tree for calculating the XNOR table. The table is as follows:
INPUT1 (X1)   INPUT2 (X2)   OUTPUT (Y)
1             1             1
1             0             0
0             1             0
0             0             1
Here we have two input features, INPUT1 and INPUT2, and the target, OUTPUT.
Let's make the feature INPUT1 the root node and build a Decision Tree.
[Decision Tree for the XNOR table. Source: alliance.seas.uppnn.edu]
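The tree just described can also be written directly as nested conditionals, which is all a trained Decision Tree really is at prediction time. A minimal Python sketch (the function name `xnor_tree` is illustrative):

```python
def xnor_tree(x1, x2):
    """Hand-built Decision Tree for XNOR: root tests INPUT1, branches test INPUT2."""
    if x1 == 1:          # internal node: test on INPUT1
        if x2 == 1:      # branch X1 = 1: test on INPUT2
            return 1     # leaf: both inputs are 1
        return 0         # leaf: inputs differ
    else:
        if x2 == 1:
            return 0     # leaf: inputs differ
        return 1         # leaf: both inputs are 0

# The tree reproduces every row of the XNOR table
for x1, x2, y in [(1, 1, 1), (1, 0, 0), (0, 1, 0), (0, 0, 1)]:
    assert xnor_tree(x1, x2) == y
```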
Here we have split on one feature at each level, and at the leaves we have the final decisions. In actual practice, the data is not this simple: the complexity of the tree grows as the data tends towards more practical, real-world examples. Let's have a look at a similar kind of example:
Suppose we have to predict whether a student has been invited for an interview or not. We have three features (college tier, ISINTERN, and HASPROJECT), and based on these we have to make the Decision Tree.
COLLEGE       ISINTERN   HASPROJECT   GOT A CALL
TIER1/TIER2   YES        YES          YES
TIER1/TIER2   YES        NO           YES
TIER1/TIER2   NO         YES          YES
TIER1/TIER2   NO         NO           NO
TIER3         YES        NO           NO
TIER3         NO         YES          NO
TIER3         YES        YES          YES
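The original figure for this example is not reproduced here, but one tree consistent with every row of the table can be written as nested conditionals. This is a sketch whose split rules are read off the table itself, not taken from the source figure: tier-1/tier-2 students get a call if they have an internship or a project, while tier-3 students need both.

```python
def interview_tree(college, is_intern, has_project):
    """One Decision Tree consistent with the table: the root splits on college tier."""
    if college == "TIER1/TIER2":
        # Tier-1/2 students: an internship or a project is enough
        return "YES" if (is_intern == "YES" or has_project == "YES") else "NO"
    # Tier-3 students: need both an internship and a project
    return "YES" if (is_intern == "YES" and has_project == "YES") else "NO"

rows = [
    ("TIER1/TIER2", "YES", "YES", "YES"),
    ("TIER1/TIER2", "YES", "NO",  "YES"),
    ("TIER1/TIER2", "NO",  "YES", "YES"),
    ("TIER1/TIER2", "NO",  "NO",  "NO"),
    ("TIER3",       "YES", "NO",  "NO"),
    ("TIER3",       "NO",  "YES", "NO"),
    ("TIER3",       "YES", "YES", "YES"),
]
# The hand-built tree reproduces every label in the table
assert all(interview_tree(c, i, p) == y for c, i, p, y in rows)
```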
In a Machine Learning algorithm, we try to find a nearly accurate result, not the exact one. The task would be very complex if we trained the algorithm to get a precise result every time. Similarly, the best possible Decision Tree is very expensive and computationally infeasible to build. So we settle for a Decision Tree that gives decent accuracy and is much easier to build. According to the above Decision Tree, if a student is from a tier-3 college and has a fantastic internship (say, at Google) but doesn't have any project, they will not get an interview call, which is practically not sensible. Mistakes creep in when we encounter exceptions that don't match the data upon which the Decision Tree was built.
Mathematics behind Decision Tree:
Following are the steps needed to build a good Decision Tree:
1) We need to select the target attribute/feature from the given attributes. In the above example, the target attribute is “GOT A CALL,” which has two classes of values, either yes or no.
2) Calculate the information gain of the selected target attribute
Information Gain = -(P/(P+N)) log2(P/(P+N)) - (N/(P+N)) log2(N/(P+N))
Where P = number of the elements in class 1 of the target attribute.
N = number of the elements in class 2 of the target attribute.
In our case, the target attribute has two classes, “YES” and “NO”. Counting the rows of the table: P = 4 and N = 3.
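The formula in step 2 is easy to check numerically. A small sketch (the function name is illustrative; the guard treats 0·log2(0) as 0):

```python
from math import log2

def target_entropy(p, n):
    """Information gain of the target:
    -(P/(P+N))·log2(P/(P+N)) - (N/(P+N))·log2(N/(P+N))."""
    total = p + n
    entropy = 0.0
    for count in (p, n):
        if count:  # 0 * log2(0) is taken as 0
            q = count / total
            entropy -= q * log2(q)
    return entropy

# A perfectly balanced target is maximally uncertain; a pure one has zero entropy
print(target_entropy(2, 2))  # 1.0
print(target_entropy(4, 0))  # 0.0
```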
3) Now, for each of the remaining features/attributes:
- We calculate the entropy of each categorical value.
- We take the average information entropy for the current attribute.
The entropy of a given attribute is given by:
Entropy = Σ (from i = 1 to V) [ Information Gain of the i-th categorical value × Probability of the i-th categorical value ]
Here V is the number of categorical values of that attribute.
For example, the attribute "ISINTERN" has two categorical values, "YES" (4 rows) and "NO" (3 rows). Among the "YES" rows, 3 students got a call and 1 did not; among the "NO" rows, 1 got a call and 2 did not.
I.G of “YES” = -(3/4) log2(3/4) - (1/4) log2(1/4)
I.G of “NO” = -(1/3) log2(1/3) - (2/3) log2(2/3)
E1 = (4/7) × I.G of “YES”, and E2 = (3/7) × I.G of “NO”
The entropy of the attribute “ISINTERN” = E1 + E2
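The averaging in step 3 can be sketched generically: each categorical value contributes the entropy of the target within its subset of rows, weighted by the fraction of rows that subset covers. Function names here are illustrative; each subset is given as its (p, n) target counts:

```python
from math import log2

def subset_entropy(p, n):
    """Entropy of the target restricted to one categorical value's rows."""
    total = p + n
    h = 0.0
    for c in (p, n):
        if c:  # 0 * log2(0) is taken as 0
            q = c / total
            h -= q * log2(q)
    return h

def attribute_entropy(subsets):
    """Average information entropy of an attribute.

    `subsets` lists the (p, n) target counts for each categorical value."""
    grand_total = sum(p + n for p, n in subsets)
    return sum(((p + n) / grand_total) * subset_entropy(p, n)
               for p, n in subsets)

# An attribute that splits the data into pure subsets has zero entropy
print(attribute_entropy([(3, 0), (0, 2)]))  # 0.0
```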
4) We will now find the Gain of each attribute (except the target attribute).
Gain of an individual attribute = Information Gain of the target attribute - Entropy of that attribute.
5) The attribute having maximum Gain value will be our Root node. We will divide the root node into subsets that contain possible values for the best feature.
6) Recursively build new Decision Trees using the subsets of the dataset created in step 5, until all the records in a subset belong to the same class or there are no features left to split upon.
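The six steps above can be combined into a short recursive sketch in pure Python (all names are illustrative). One simplification: since the Information Gain of the target is the same for every attribute at a given node, choosing the attribute with maximum Gain (step 5) is the same as choosing the one with minimum average entropy.

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Step 2's formula, generalized to any number of classes."""
    total = len(labels)
    return -sum((c / total) * log2(c / total)
                for c in Counter(labels).values())

def build_tree(rows, target, attributes):
    labels = [row[target] for row in rows]
    # Step 6 stopping rules: pure node, or no attributes left to split on
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class

    def avg_info(attr):
        # Step 3: weighted average entropy of the subsets induced by attr
        values = set(row[attr] for row in rows)
        return sum((len(sub) / len(rows)) * entropy([r[target] for r in sub])
                   for v in values
                   for sub in [[r for r in rows if r[attr] == v]])

    # Steps 4-5: minimum average entropy = maximum Gain; ties go to the first attribute
    best = min(attributes, key=avg_info)
    return {
        "attribute": best,
        "branches": {
            v: build_tree([r for r in rows if r[best] == v], target,
                          [a for a in attributes if a != best])
            for v in set(row[best] for row in rows)
        },
    }

def predict(tree, row):
    """Walk the tree from the root to a leaf, following the row's attribute values."""
    while isinstance(tree, dict):
        tree = tree["branches"][row[tree["attribute"]]]
    return tree

# The XNOR table from earlier: every attribute has zero Gain at the root,
# yet two levels of splits still separate the classes perfectly.
xnor = [{"X1": a, "X2": b, "Y": int(a == b)} for a in (0, 1) for b in (0, 1)]
tree = build_tree(xnor, "Y", ["X1", "X2"])
assert all(predict(tree, row) == row["Y"] for row in xnor)
```

Note that real implementations add pruning and handle unseen attribute values; this sketch only mirrors the six steps listed above.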