Tip 1 : Practice previously asked interview questions as well as online test questions.
Tip 2 : Go through all the previous interview experiences from Codestudio and Leetcode.
Tip 3 : Build at least 2 good projects and know every bit of them.
Tip 1 : Include at least 2 good projects, described briefly with all the important points covered.
Tip 2 : Mention every relevant skill.
Tip 3 : Focus more on skills, projects, and experience.
This round had questions mainly from Big Data and Data Warehouse. Questions from Hadoop were also asked at the end of the interview.
What do you mean by Degenerate Dimension?
A degenerate dimension is a high-cardinality attribute stored in the fact table that has no content other than its natural key, yet is required as a dimension for analysis or drill-down purposes. Because this degenerate dimension is constructed from a fact table item and placed in the fact table itself, it is also known as a fact dimension. It helps to reduce duplicate data by keeping the high-cardinality dimension key in the fact table rather than in a separate dimension table.
What are the different types of data marts in the context of data warehousing?
Following are the different types of data marts in data warehousing :
1) Dependent Data Mart : A dependent data mart can be developed using data from operational sources, external sources, or both. It allows the organization's data to be accessed from a single, central data warehouse. Since all data is centralized, it can aid in the development of further data marts.
2) Independent Data Mart : This type of data mart does not need a central data warehouse. It is typically created for smaller groups within a company and has no connection to an Enterprise Data Warehouse or any other data warehouse. Each piece of information is self-contained and can be used independently, and the analysis can also be carried out independently. The drawback is that it becomes difficult to maintain a consistent, centralized data repository that numerous users can access.
3) Hybrid Data Mart : As the name implies, a hybrid data mart is used when a data warehouse contains inputs from multiple sources. It comes in handy when a user requires an ad hoc integration, and it can be used when an organization needs multiple database environments and quick implementation. It requires the least amount of data cleansing, and the data mart can accommodate large storage structures. A data mart is most effective when smaller, data-centric applications are employed.
Difference between Fact Table and Dimension Table
1) A fact table contains the measurements taken on the attributes of a dimension table, while a dimension table contains the attributes along which the fact table computes its metrics.
2) A fact table has fewer attributes than a dimension table, while a dimension table has more attributes than a fact table.
3) A fact table has more records than a dimension table, while a dimension table has fewer records than a fact table.
4) A fact table grows vertically (deep and narrow), while a dimension table is comparatively horizontal (wide and shallow).
5) The attributes of a fact table are in numerical and text format, while the attributes of a dimension table are mainly in text format.
6) A schema has fewer fact tables than dimension tables, i.e. the number of dimension tables is greater than the number of fact tables.
How to deploy a Big Data Model? Mention the key steps involved.
Deploying a model on a Big Data platform mainly involves three key steps :
1) Data Ingestion: This process involves collecting data from different sources like social media platforms, business
applications, log files, etc.
2) Data Storage: Once data extraction is complete, the challenge is to store this large volume of data, and this is where the Hadoop Distributed File System (HDFS) plays a vital role.
3) Data Processing: After storing the data in HDFS or HBase, the next task is to analyze and visualize this large amount of data using suitable algorithms. Once again, this task becomes more straightforward with tools such as Hadoop MapReduce, Apache Spark, and Pig (see the sketch after these steps).
After performing these essential steps, one can deploy a big data model successfully.
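As a sketch of the processing step, the snippet below uses PySpark (assuming a Spark installation; the HDFS paths and the "event_type" column are hypothetical placeholders) to read ingested data from HDFS, aggregate it, and write the result back :
# Minimal PySpark sketch of the data processing step.
# The HDFS paths and the "event_type" column are illustrative placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BigDataModelDemo").getOrCreate()

# Data ingested earlier is assumed to be stored in HDFS as JSON records.
events = spark.read.json("hdfs:///data/events.json")

# Simple processing: count the number of events of each type.
counts = events.groupBy("event_type").count()

# Persist the processed result back to HDFS for downstream use.
counts.write.mode("overwrite").parquet("hdfs:///output/event_counts")

spark.stop()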
Explain overfitting in Big Data. How can it be avoided?
Overfitting is a modeling error that occurs when a model is fitted too tightly to a limited data set, i.e. the modeling function follows the sample data too closely. Overfitting reduces the predictive power of such models: their ability to generalize decreases, so they perform poorly when applied to data outside the sample.
There are several methods to avoid overfitting; some of them are :
1) Cross-validation : A cross-validation method refers to dividing the data into multiple small test data sets, which can
be used to tune the model.
2) Early stopping : After a certain number of iterations, the generalizing capacity of the model starts to weaken; early stopping halts training before the model crosses that point and starts overfitting.
3) Regularization : This method penalizes all the parameters except the intercept so that the model generalizes the data instead of overfitting it (see the sketch below).
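To illustrate cross-validation and regularization together, here is a small scikit-learn sketch (the synthetic data set is purely illustrative; Ridge is just one of several possible regularized models) :
# Illustrative sketch: 5-fold cross-validation with an L2-regularized (Ridge) model.
# The synthetic data set below is a placeholder, not real data.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))                        # 200 samples, 10 features
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=200)

# alpha penalizes large coefficients; a larger alpha means stronger regularization.
model = Ridge(alpha=1.0)

# Cross-validation estimates how well the model generalizes beyond the sample data.
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print("Mean R^2 across folds :", scores.mean())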
Explain Hadoop. List the core components of Hadoop
Hadoop is a well-known Big Data framework utilized by many companies globally. A few successful Hadoop users are :
1) Uber
2) The Bank of Scotland
3) Netflix
4) The National Security Agency (NSA) of the United States
5) Twitter
The three core components of Hadoop are :
1) Hadoop YARN - It is a resource management unit of Hadoop.
2) Hadoop Distributed File System (HDFS) - It is the storage unit of Hadoop.
3) Hadoop MapReduce - It is the processing unit of Hadoop.
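To give a feel for the MapReduce component, below is the classic word-count example written as a Hadoop Streaming mapper and reducer in Python (a sketch only; the file names mapper.py and reducer.py are just conventions, and the job would be launched with the hadoop-streaming jar) :
# mapper.py -- reads lines from stdin and emits (word, 1) pairs.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py -- Hadoop Streaming sorts mapper output by key, so identical
# words arrive contiguously and can be summed with a running counter.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")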
Mention different Features of HDFS.
1) Fault Tolerance :
The Hadoop framework divides data into blocks and creates multiple copies of those blocks on several machines in the cluster. So, when any machine in the cluster fails, clients can still access their data from another machine that contains an exact copy of the data blocks.
2) High Availability :
In the HDFS environment, data is duplicated by creating copies of the blocks. So, whenever a user wants to access this data, even in the event of a failure, they can simply read it from another node, because duplicate copies of the blocks are already present on other nodes of the HDFS cluster.
3) High Reliability :
HDFS splits the data into blocks, and the Hadoop framework stores these blocks on the nodes in the cluster. It saves data by creating a replica of every block present in the cluster, thereby providing fault tolerance. By default, it creates 3 replicas of each block across the nodes, so the data is promptly available to users and the problem of data loss is avoided. Therefore, HDFS is very reliable.
4) Replication :
Replication resolves the problem of data loss in adverse conditions like device failure, crashing of nodes, etc. It manages the process of replication at frequent intervals of time. Thus, there is a low probability of a loss of user data.
5) Scalability :
HDFS stores the data across multiple nodes, so whenever demand increases, the cluster can be scaled by adding more nodes.
What are the three modes in which Hadoop can run?
1) Local Mode or Standalone Mode :
Hadoop, by default, is configured to run in a non-distributed mode as a single Java process. Instead of HDFS, this mode utilizes the local file system. This mode is helpful for debugging, and there isn't any requirement to configure core-site.xml, hdfs-site.xml, mapred-site.xml, masters & slaves. Stand-alone mode is ordinarily the quickest mode in Hadoop.
2) Pseudo-distributed Model :
In this mode, each daemon runs in a separate Java process. This mode requires custom configuration (core-site.xml, hdfs-site.xml, mapred-site.xml). HDFS is used for input and output. This mode of deployment is beneficial for testing and debugging purposes.
3) Fully Distributed Mode :
It is the production mode of Hadoop. Basically, one machine in the cluster is designated exclusively as the NameNode and another as the Resource Manager; these are the masters. The remaining nodes act as DataNodes and NodeManagers; these are the slaves. Configuration parameters and the environment need to be defined for the Hadoop daemons. This mode provides fully distributed computing capacity, security, fault tolerance, and scalability.
This round started with some questions from AWS and then the interviewer switched to Python and Data Analysis questions since I had some projects related to that.
Explain AWS Lambda.
AWS Lambda is an event-driven, serverless computing platform provided by Amazon as a part of Amazon Web Services. Therefore, you don’t need to worry about which AWS resources to launch or how you will manage them. Instead, you just put the code on Lambda, and it runs.
In AWS Lambda, the code is executed in response to events in AWS services, such as adding/deleting files in an S3 bucket, an HTTP request from Amazon API Gateway, etc. However, AWS Lambda can only be used to execute background tasks.
AWS Lambda function helps you to focus on your core product and business logic instead of managing operating system (OS) access control, OS patching, right-sizing, provisioning, scaling, etc.
AWS Lambda works as follows :
Step 1 : Upload your AWS Lambda code in any language supported by AWS Lambda. Java, Python, Go, and C# are some of the supported languages.
Step 2 : Choose the AWS service that will trigger the Lambda function; several AWS services (such as S3 and API Gateway) can act as triggers.
Step 3 : AWS Lambda stores both the code and the details of the event on which it should be triggered.
Step 4 : AWS Lambda executes the code when it is triggered by those AWS services.
Step 5 : AWS charges only when the AWS Lambda code executes, and not otherwise.
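As a sketch, a minimal Python handler reacting to an S3 upload event could look like the following (the function body is illustrative; the bucket and key are read from the standard S3 event structure) :
import json

def lambda_handler(event, context):
    # Triggered, for example, when a file is added to an S3 bucket.
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    print(f"New object {key} uploaded to bucket {bucket}")
    return {"statusCode": 200, "body": json.dumps("Processed successfully")}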
What do you understand by stopping and terminating an EC2 Instance?
Stopping an EC2 instance means shutting it down as you would normally shut down your personal computer. This does not delete any volumes attached to the instance, and the instance can be started again when needed.
On the other hand, terminating an instance is equivalent to deleting it. The EBS volumes flagged for deletion on termination (the root volume, by default) are deleted along with the instance, and it is not possible to restart the instance at a later point in time.
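Both operations can also be performed programmatically; a hedged boto3 sketch (the instance ID is a placeholder) :
import boto3

ec2 = boto3.client("ec2")

# Stopping: the instance is shut down but its EBS volumes are kept,
# so it can be started again later.
ec2.stop_instances(InstanceIds=["i-0123456789abcdef0"])

# Terminating: the instance is deleted; volumes flagged for deletion on
# termination (the root volume, by default) are removed with it.
ec2.terminate_instances(InstanceIds=["i-0123456789abcdef0"])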
What are the advantages of AWS IAM?
1) AWS IAM enables an administrator to provide granular level access to different users and groups.
2) Different users and user groups may need different levels of access to different resources created.
3) With IAM, you can create roles with specific access levels and assign the roles to users (see the sketch after this list).
4) It also allows you to provide access to resources to users and applications without creating IAM users for them, which is known as Federated Access.
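As an illustration of point 3, here is a hedged boto3 sketch that creates a role and attaches an AWS-managed policy (the role name and trust policy are placeholders) :
import json
import boto3

iam = boto3.client("iam")

# Placeholder trust policy allowing EC2 instances to assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "ec2.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(RoleName="DemoReadOnlyRole",
                AssumeRolePolicyDocument=json.dumps(trust_policy))

# Attach an AWS-managed read-only policy to the new role.
iam.attach_role_policy(RoleName="DemoReadOnlyRole",
                       PolicyArn="arn:aws:iam::aws:policy/ReadOnlyAccess")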
What is Lambda Function in Python?
A Lambda Function in Python programming is an anonymous function or a function having no name. It is a small and restricted function having no more than one line. Just like a normal function, a Lambda function can have multiple arguments with one expression.
In Python, lambda expressions (or lambda forms) are utilized to construct anonymous functions. To do so, you will use the lambda keyword (just as you use def to define normal functions). Every anonymous function you define in Python will have 3 essential parts :
i) The lambda keyword.
ii) The parameters (or bound variables), and
iii) The function body.
A lambda function can have any number of parameters, but the function body can only contain one expression. Moreover, a lambda is written in a single line of code and can also be invoked immediately. You will see all this in action in the upcoming examples.
EXAMPLE :
adder = lambda x, y: x + y
print(adder(5, 2))
OUTPUT : 7
Code Explanation :
Here, we define a variable adder that holds the lambda function itself; calling it returns the value of the expression.
1) The lambda keyword is used to define an anonymous function.
2) x and y are the parameters that we pass to the lambda function.
3) x + y is the body of the function, which adds the 2 parameters we passed. Notice that it is a single expression; you cannot write multiple statements in the body of a lambda function.
4) We call the function and print the returned value.
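Lambdas are most often passed inline to higher-order functions; two quick illustrative uses :
pairs = [(1, "b"), (2, "a"), (3, "c")]
print(sorted(pairs, key=lambda p: p[1]))       # sort by the second element of each pair
print(list(map(lambda x: x * x, [1, 2, 3])))   # [1, 4, 9]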
What is slicing in Python?
1) As the name suggests, ‘slicing’ means taking a part of a sequence.
2) The syntax for slicing is [start : stop : step].
3) start is the starting index from where to begin slicing a list or tuple.
4) stop is the ending index, i.e. where to stop (the element at this index is excluded).
5) step is the number of steps to jump.
6) The default value of start is 0, of stop is the number of items, and of step is 1.
7) Slicing can be done on strings, arrays, lists, and tuples.
EXAMPLE :
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(numbers[1::2])
#output : [2, 4, 6, 8, 10]
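A few more illustrative cases showing the defaults, a negative step, and slicing a string :
text = "codestudio"
print(text[:4])       # 'code'  (start defaults to 0)
print(text[::-1])     # 'oidutsedoc'  (a negative step reverses the string)
numbers = [1, 2, 3, 4, 5]
print(numbers[-3:])   # [3, 4, 5]  (the last three items)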
How is memory managed in Python?
1) Memory management in Python is handled by the Python Memory Manager. The memory allocated by the manager is in the form of a private heap space dedicated to Python. All Python objects are stored in this heap, and being private, it is inaccessible to the programmer. However, Python does provide some core API functions to work with the private heap space.
2) Additionally, Python has an in-built garbage collection to recycle the unused memory for the private heap space.
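A small sketch of the interfaces Python exposes for this (the exact reference counts printed will vary by interpreter) :
import gc
import sys

data = [1, 2, 3]
alias = data
# Reference counting: the list is referenced by data, by alias, and by the
# temporary reference that getrefcount itself creates.
print(sys.getrefcount(data))

# The gc module exposes the cyclic garbage collector.
print(gc.isenabled())   # True by default
print(gc.collect())     # force a collection pass; returns the number of unreachable objects found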
Name a few libraries in Python used for Data Analysis and Scientific computations.
1) NumPy: It is used for scientific computing and performing basic and advanced array operations. It offers many handy features for performing operations on n-dimensional arrays and matrices in Python. It helps to process arrays that store values of the same data type and makes performing math operations on arrays (and their vectorization) easier.
2) SciPy: This useful library includes modules for linear algebra, integration, optimization, and statistics. Its main functionality was built upon NumPy, so its arrays make use of this library.
3) Pandas: This is a library created to help developers work with “labeled” and “relational” data intuitively. It’s based on two main data structures: “Series” (one-dimensional, like a list of items) and “Data Frames” (two-dimensional, like a table with multiple columns).
4) SciKit: Scikits is a group of packages in the SciPy Stack that were created for specific functionalities – for example, image processing. Scikit-learn uses the math operations of SciPy to expose a concise interface to the most common machine learning algorithms.
5) Matplotlib: This is a standard data science library that helps to generate data visualizations such as two-dimensional diagrams and graphs (histograms, scatterplots, non-Cartesian coordinates graphs).
6) Seaborn: Seaborn is based on Matplotlib and serves as a useful Python machine learning tool for visualizing statistical models – heatmaps and other types of visualizations that summarize data and depict the overall distributions.
7) Plotly: This is a web-based tool for data visualization that offers many useful out-of-the-box graphics – you can find them on the Plot.ly website. The library works very well in interactive web applications.
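A tiny sketch tying a couple of these together (vectorized math with NumPy and a labeled table with Pandas; the numbers are arbitrary) :
import numpy as np
import pandas as pd

arr = np.array([1.0, 2.0, 3.0, 4.0])
print(arr.mean(), arr * 2)        # vectorized operations, no explicit loops

df = pd.DataFrame({"city": ["Pune", "Delhi"], "sales": [120, 95]})
print(df.describe())              # quick summary statistics of the numeric column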
What is nominal data and ordinal data? Explain with examples.
Nominal data is data with no fixed categorical order. For example, the continents of the world (Europe, Asia, North America, Africa, South America, Antarctica, Oceania).
Ordinal data is data with a fixed categorical order. For example, a customer satisfaction rating (very dissatisfied, dissatisfied, neutral, satisfied, very satisfied).
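The distinction can be made explicit in Pandas; a quick illustrative sketch using an ordered categorical for the satisfaction scale :
import pandas as pd

# Nominal : no inherent order among the categories.
continents = pd.Categorical(["Asia", "Europe", "Africa"])

# Ordinal : the categories carry a fixed order, so comparisons are meaningful.
satisfaction = pd.Categorical(
    ["satisfied", "neutral", "very satisfied"],
    categories=["very dissatisfied", "dissatisfied", "neutral",
                "satisfied", "very satisfied"],
    ordered=True,
)
print(satisfaction.min(), satisfaction.max())   # neutral, very satisfied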
This was a cultural-fitment round. The HR was very frank and asked standard questions. Then we discussed my role.
Tell me something about yourself?
Tip 1 : Prepare the points that you will speak about in your introduction prior to the interview.
Tip 2 : Talk about your current CGPA, achievements, and authenticated certifications.
Tip 3 : I spoke about my role in my current internship and what all I do there.
Why should we hire you ?
Tip 1 : The cross-questioning can get intense at times; think before you speak.
Tip 2 : Be open-minded and say what you are thinking; in these rounds I feel it is important to have an opinion.
Tip 3 : The context of the questions can switch, so pay attention to the details. It is okay to ask questions in this round, such as which projects the company is currently investing in, which team you would be mentoring, what the work environment is like, etc.
Tip 4 : Since everybody on the interview panel is from a tech background, you can expect some technical questions here too. There is no coding in most cases, but some discussion over design can certainly happen.
