1. What is Data Science?
  2. What is the difference between Supervised and Unsupervised Learning?
  3. You are given a data set consisting of variables with more than 30% missing values. How will you deal with them?
  4. For the given points, how will you calculate the Euclidean distance in Python?
  5. What is Dimensionality Reduction and its benefits?
  6. How should you maintain a deployed model?
  7. What is a recommender system?
  8. How do you find RMSE and MSE in a linear regression model?
  9. How can you select k for k-means?
  10. ‘People who bought this also bought…’ recommendations seen on Amazon are a result of which algorithm?
  11. How does an ROC curve work?
  12. What is a Selection Bias?
  13. Can you explain SVM Machine Learning algorithm in detail?
  14. What are the different kernel functions in SVM?
  15. How can you explain Entropy in a Decision Tree algorithm?
  16. What is Information Gain in Decision Tree algorithm?
  17. Can you describe pruning in a Decision Tree?
  18. What is Ensemble Learning?
  19. Can you define a Random Forest?
  20. How does a Random Forest work?
  21. What is Logistic Regression?
  22. What is Deep Learning?
  23. How can you differentiate between Machine Learning and Deep Learning?
  24. How can you explain reinforcement learning?
  25. What is Regularization? Why is it useful?
  26. What are Recommender Systems?
  27. What is ‘Naive’ in a Naive Bayes?
  28. Why is it called Naive?
  29. What are feature vectors?

Of all the newly emerged professions, data science has gained the maximum popularity due to the vast ocean of opportunities it offers.

Are you wondering how to become a Data Scientist, but don't know how to crack the interview? Here are 29 frequently asked Data Science interview questions and answers.

1. What is data science?

Data science is a mix of numerous algorithms, tools and machine learning principles that assist in uncovering hidden patterns in raw data.

2. What is the difference between Supervised and Unsupervised Learning?

Supervised Learning | Unsupervised Learning
Input data is labelled. | Input data is not labelled.
Uses a training data set. | Uses the input data set alone.
Used for prediction. | Used for analysis.
Enables classification and regression. | Enables clustering, density estimation and dimensionality reduction.

3. You are given a data set consisting of variables with more than 30% missing values. How will you deal with them?

A frequently asked data science interview question, this problem can be dealt with in two ways. If the data set is large, simply remove the rows with missing values and use the rest of the data.

If the data set is small, substitute the missing values with the mean of the rest of the data using a Pandas DataFrame in Python, e.g. compute the column means with df.mean() and fill the gaps with df.fillna(df.mean()).
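
The mean-imputation idea can be sketched in plain Python (the column values here are hypothetical):

```python
# A minimal sketch of mean imputation; the pandas one-liner
# df['col'] = df['col'].fillna(df['col'].mean()) does the same per column.

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

ages = [25, None, 31, None, 40]
print(impute_mean(ages))  # the two gaps become 32.0, the mean of 25, 31, 40
```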

4. For the given points, how will you calculate the Euclidean distance in Python?

Given that plot1 = [1, 3]
and plot2 = [2, 5],
the Euclidean distance can be calculated as:
euclidean_distance = sqrt((plot1[0] - plot2[0])**2 + (plot1[1] - plot2[1])**2)
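
As runnable Python:

```python
from math import sqrt

plot1 = [1, 3]
plot2 = [2, 5]

# Euclidean distance: square root of the sum of squared coordinate differences
euclidean_distance = sqrt((plot1[0] - plot2[0])**2 + (plot1[1] - plot2[1])**2)
print(euclidean_distance)  # sqrt(1 + 4) ≈ 2.236
```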

5. What is Dimensionality Reduction and its benefits?

The process of converting a data set with a vast number of dimensions into one with fewer dimensions/fields that conveys similar information concisely is called Dimensionality Reduction.
Its benefits are as follows;
• Helps in compressing data and reducing storage space.
• Reduces computation time.
• Removes redundant features, e.g. two fields that store the same length in different units (meters and inches).
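
One standard technique here is PCA; a pure-Python sketch on 2-D data (the points are made up) finds the direction of maximum variance and projects each point onto it, compressing two fields into one:

```python
from math import atan2, cos, sin

# Hypothetical 2-D data lying near a line: two fields that carry
# roughly the same information.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.0), (4.0, 8.1)]

n = len(points)
mx = sum(x for x, _ in points) / n
my = sum(y for _, y in points) / n
centered = [(x - mx, y - my) for x, y in points]

# Entries of the 2x2 covariance matrix
a = sum(x * x for x, _ in centered) / n
b = sum(x * y for x, y in centered) / n
c = sum(y * y for _, y in centered) / n

# Direction of the first principal component (eigenvector of the covariance)
theta = 0.5 * atan2(2 * b, a - c)
direction = (cos(theta), sin(theta))

# Each 2-D point becomes a single coordinate along that direction.
reduced = [x * direction[0] + y * direction[1] for x, y in centered]
print(len(reduced), "points, 1 coordinate each")
```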

6. How should you maintain a deployed model?

The deployed model can be maintained in the following manner;
• Monitor: Constant monitoring is required to determine the model's performance accurately.
• Evaluate: Evaluation data is required to determine whether new algorithms need introduction.
• Compare: Comparing is required to determine the best performing models.
• Rebuild: The best performing model is rebuilt based on the current data set.

7. What is a recommender system?

A recommender system predicts a user’s rating of a particular product based on their preferences. It can be differentiated into collaborative and content-based filtering.

8. How do you find RMSE and MSE in a linear regression model?

Mean Square Error (MSE) and Root Mean Square Error (RMSE) are the most common measures of accuracy for a linear regression model. MSE is the average of the squared differences between predicted and actual values: the sum of (predicted − actual)² over all observations, divided by n. RMSE is the square root of MSE.
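
Both measures are a few lines of Python (the actual/predicted values below are illustrative):

```python
from math import sqrt

def mse(actual, predicted):
    """Mean Square Error: average of squared prediction errors."""
    n = len(actual)
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n

actual = [3.0, 5.0, 7.0]
predicted = [2.5, 5.0, 8.0]

print(mse(actual, predicted))        # (0.25 + 0 + 1) / 3 ≈ 0.4167
print(sqrt(mse(actual, predicted)))  # RMSE is the square root of MSE
```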

9. How can you select k for k-means?

The elbow method can be used to select k for k-means clustering. The idea is to run k-means on the data set for a range of values of k and, for each run, compute the within-cluster sum of squares (WSS): the sum of the squared distances between each member of a cluster and its centroid.

Plotting WSS against k, the curve drops sharply at first and then flattens out; the 'elbow' where the decrease slows down is a good choice for k.
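
The quantity plotted for each k is the WSS; a minimal sketch (the cluster assignments and centroids below are hypothetical, as if returned by a k-means run with k=2):

```python
def wss(points, centroids, assignments):
    """Within-cluster sum of squares: squared distance of each point
    to the centroid of its assigned cluster, summed over all points."""
    total = 0.0
    for point, cluster in zip(points, assignments):
        cx, cy = centroids[cluster]
        total += (point[0] - cx) ** 2 + (point[1] - cy) ** 2
    return total

points = [(1, 1), (1, 2), (8, 8), (9, 8)]
centroids = [(1, 1.5), (8.5, 8)]   # hypothetical k-means output for k=2
assignments = [0, 0, 1, 1]

print(wss(points, centroids, assignments))  # 4 * 0.25 = 1.0
```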

10. ‘People who bought this also bought…’ recommendations seen on Amazon are a result of which algorithm?

This recommendation engine is achieved through collaborative filtering. Collaborative filtering models user behavior and purchase history in terms of selections, ratings and so on.

The engine then predicts a person's interests based on the preferences of other, similar users. In this algorithm, the features of the items themselves are not used.

For example, a sales page shows that a certain number of buyers purchase a tempered glass along with a phone. So the next time someone buys a phone, they might see the tempered glass as a recommendation as well.
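
A heavily simplified sketch of the 'bought together' idea is item-to-item co-occurrence counting (the purchase baskets here are made up; real collaborative filtering uses similarity over the full user-item matrix):

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase baskets: each inner list is one customer's order.
baskets = [
    ["phone", "tempered glass"],
    ["phone", "tempered glass", "charger"],
    ["phone", "tempered glass"],
    ["phone", "charger"],
]

# Count how often each pair of items is bought together.
co_bought = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        co_bought[(a, b)] += 1

# The item most often co-purchased with a phone:
with_phone = {pair: n for pair, n in co_bought.items() if "phone" in pair}
print(max(with_phone, key=with_phone.get))  # ('phone', 'tempered glass')
```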

11. How does an ROC curve work?

An ROC curve is a graphical representation of the contrast between true positive rates and false positive rates at various thresholds. It is often used as a proxy for the trade-off between sensitivity (true positive rate) and the false positive rate.
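
Each point on the curve comes from one threshold; a sketch of computing one such point (the labels and scores are made up):

```python
def roc_point(labels, scores, threshold):
    """True positive rate and false positive rate at one threshold.
    Sweeping the threshold traces out the ROC curve."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    pos = sum(labels)
    neg = len(labels) - pos
    return tp / pos, fp / neg

labels = [1, 1, 0, 1, 0]          # ground truth
scores = [0.9, 0.8, 0.7, 0.4, 0.2]  # hypothetical classifier scores

print(roc_point(labels, scores, 0.5))  # (TPR, FPR) at threshold 0.5
```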

12. What is a Selection Bias?

Selection bias occurs when the sample obtained is not representative of the population intended to be analyzed.

13. Can you explain SVM Machine Learning algorithm in detail?

Support Vector Machine (SVM) is a supervised machine learning algorithm. It can be used for both regression and classification.

For example, if you have 'n' features in your training data set, SVM plots each sample in n-dimensional space, with the value of each feature being the value of a particular coordinate. It then uses hyperplanes to separate the different classes, based on the provided kernel function.
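
A toy sketch of the hyperplane idea: once training has produced a separating hyperplane w·x + b = 0, a new point is classified by which side of it the point falls on (the weights below are hypothetical, standing in for what SVM training would produce):

```python
# Hypothetical learned hyperplane w.x + b = 0 in 2-D feature space.
w = [1.0, -1.0]   # hyperplane normal vector
b = 0.0           # bias term

def classify(x):
    """Class +1 on one side of the hyperplane, -1 on the other."""
    score = w[0] * x[0] + w[1] * x[1] + b
    return 1 if score >= 0 else -1

print(classify([3, 1]))  # 1  (on or below the line y = x)
print(classify([1, 3]))  # -1 (above the line y = x)
```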

14. What are the different kernel functions in SVM?

There are 4 types of kernels in SVM. They are as follows;

  • Linear Kernel
  • Polynomial Kernel
  • Radial basis Kernel
  • Sigmoid Kernel
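
The four kernels can be written as plain functions over two feature vectors; the parameter values (degree, gamma, alpha) below are illustrative defaults, not the only choices:

```python
from math import exp, tanh

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def linear_kernel(x, y):
    return dot(x, y)

def polynomial_kernel(x, y, degree=2, c=1.0):
    return (dot(x, y) + c) ** degree

def rbf_kernel(x, y, gamma=0.5):
    """Radial basis function kernel over the squared Euclidean distance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-gamma * sq_dist)

def sigmoid_kernel(x, y, alpha=0.1, c=0.0):
    return tanh(alpha * dot(x, y) + c)

x, y = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(x, y))       # 3 + 8 = 11.0
print(polynomial_kernel(x, y))   # (11 + 1)**2 = 144.0
```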

15. How can you explain Entropy in a Decision Tree algorithm?

A decision tree is constructed top-down from a root node and involves partitioning the data into homogeneous subsets. The core algorithm used to build it is called ID3, which uses Entropy and Information Gain to construct the tree.

ID3 uses the subsets to check the sample's homogeneity. If the sample is completely homogeneous, the entropy is 0; if it is split equally between two classes, the entropy is 1.
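
Entropy is just a short formula over the class proportions, which reproduces both of the values mentioned above:

```python
from math import log2

def entropy(class_probabilities):
    """Shannon entropy of a class distribution, in bits."""
    return sum(-p * log2(p) for p in class_probabilities if p > 0)

print(entropy([1.0, 0.0]))  # 0.0 — completely homogeneous sample
print(entropy([0.5, 0.5]))  # 1.0 — sample split equally between two classes
```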

16. What is Information Gain in a Decision Tree algorithm?

Used to build a decision tree, Information Gain is the decrease in entropy after the data set is split on an attribute. Constructing a decision tree is therefore about finding the attributes that return the highest information gain.
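
The 'decrease in entropy' can be computed directly: parent entropy minus the size-weighted entropy of the child subsets (the labels below are made up):

```python
from math import log2

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    counts = {c: labels.count(c) for c in set(labels)}
    return sum(-(k / n) * log2(k / n) for k in counts.values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child splits."""
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

# A split that separates the classes perfectly recovers all the entropy:
parent = ["yes", "yes", "no", "no"]
print(information_gain(parent, [["yes", "yes"], ["no", "no"]]))  # 1.0
```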

17. Can you describe pruning in Decision Tree?

The process of removing sub-nodes of a decision node is called pruning. It reduces the size of the tree and helps it generalize better by avoiding overfitting.

18. What is Ensemble Learning?

The art of combining a diverse set of learners (individual models) to improve the stability and predictive power of the model is called Ensemble Learning.

19. Can you define a Random Forest?

It is a machine learning method capable of performing both regression and classification tasks.

It is a type of ensemble learning method in which a group of weak models combine to form a powerful one. It can also be used for dimensionality reduction and for treating missing and outlier values.

20. How does a Random Forest work?

In a Random Forest we grow multiple trees. To classify a new object based on its attributes, each tree provides a classification, and the forest chooses the classification with the most votes over all the trees in the forest. In the case of regression, it takes the average of the outputs of the different trees.
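
The two aggregation rules are simple to sketch; the per-tree predictions below are hypothetical, as if produced by a trained forest:

```python
from collections import Counter

# Hypothetical predictions from five trees in a trained forest:
tree_votes = ["cat", "dog", "cat", "cat", "dog"]

# Classification: the forest returns the majority vote over the trees.
forest_class = Counter(tree_votes).most_common(1)[0][0]
print(forest_class)  # cat

# Regression: the forest averages the trees' numeric outputs.
tree_outputs = [2.0, 3.0, 2.5, 3.5]
print(sum(tree_outputs) / len(tree_outputs))  # 2.75
```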

21. What is Logistic Regression?

Often referred to as the Logit model, it is a technique for predicting a binary outcome from a linear combination of predictor variables.

For example, to predict whether or not a particular political leader will win an election, the outcome of the prediction is binary, 0/1 (lose/win), while the predictor variables would be the money spent on the campaign, the time spent campaigning, and so on.
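
The logit model passes the linear combination through the sigmoid function to get a probability of the positive class; the coefficients below are hypothetical, standing in for fitted values:

```python
from math import exp

def sigmoid(z):
    """Squashes any real number into (0, 1)."""
    return 1 / (1 + exp(-z))

def predict_probability(features, weights, bias):
    """Probability of the positive class from a linear combination."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

weights = [0.8, 1.2]   # e.g. campaign spend, time spent campaigning (scaled)
bias = -1.5
p_win = predict_probability([1.0, 1.0], weights, bias)
print(p_win)  # > 0.5 predicts a win (outcome 1), otherwise a loss (outcome 0)
```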

22. What is Deep Learning?

Deep learning is a sub-field of machine learning inspired by the structure and function of the brain, modelled as Artificial Neural Networks. Machine learning includes many algorithms, such as linear regression, SVM and so on; deep learning is an extension of neural networks.

In ordinary neural nets the number of hidden layers is small. In deep learning algorithms, a large number of hidden layers is used to better capture the input-output relationship.

23. What is the difference between Machine Learning and Deep Learning?

Machine learning is a field of computer science that enables computers to learn without being explicitly programmed. There are 3 categories of machine learning: Supervised, Unsupervised and Reinforcement learning.

Deep learning is a sub-field of machine learning concerned with algorithms inspired by the structure and function of Artificial Neural Networks.

24. How can you explain reinforcement learning?

Reinforcement learning is the process of learning what to do and how to map situations to actions. The learner is not told which action to take, but must discover which actions yield the maximum reward.

The goal of reinforcement learning is to maximize a numerical reward signal. Inspired by how human beings learn, it is based on a mechanism of reward and penalty.

25. What is Regularization? Why is it useful?

It is the process of adding a penalty term to a model in order to induce smoothness and prevent overfitting. The penalty is often a constant multiple of the norm of the weight vector, typically the L1 norm (Lasso) or the squared L2 norm (Ridge). The model then minimizes the sum of the loss function and the penalty over the training set.
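
A sketch of the two penalties added to a plain squared-error loss; lam (the regularization strength) is the tuning parameter, and the weights are hypothetical:

```python
def squared_error(actual, predicted):
    """Unregularized squared-error loss."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))

def l1_penalty(weights, lam):
    return lam * sum(abs(w) for w in weights)   # Lasso: L1 norm

def l2_penalty(weights, lam):
    return lam * sum(w ** 2 for w in weights)   # Ridge: squared L2 norm

weights = [0.5, -2.0, 1.0]
base_loss = squared_error([1.0, 2.0], [1.1, 1.8])

# Regularized objectives: loss plus the chosen penalty.
print(base_loss + l1_penalty(weights, lam=0.1))  # loss + 0.1 * 3.5
print(base_loss + l2_penalty(weights, lam=0.1))  # loss + 0.1 * 5.25
```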

26. What are Recommender Systems?

A recommender system is basically a sub-class of information filtering systems, meant to predict the preferences or ratings that a user would give to a product. It is mostly used for movies, research articles, music, etc.

27. What is ‘Naive’ in a Naive Bayes?

Naive Bayes is based on Bayes' Theorem, which describes the probability of an event based on prior knowledge of the conditions related to the event.

28. Why is it called Naive?

The algorithm is called 'naive' because of its assumptions, which may or may not hold: it assumes that all features are independent of one another given the class, which is rarely true in real data.

29. What are feature vectors?

It is an n-dimensional vector of numerical features, representing an object. In machine learning, feature vectors are used to represent numeric or symbolic characteristics (features) of an object in a mathematical format which is easy to analyze.

Apart from these basic interview questions, you can find additional reads below to help you crack your next big data science interview.

Also, here are 8 top companies with data scientist jobs for you to apply to: Fractal Analytics, Accenture, Amazon, Deloitte, Flipkart, Myntra, HCL Technologies, and Unilever.

Also Read: What Is Data Science And How To Become A Data Scientist
