Tip 1 : Make sure to solve previously asked interview questions as well as online test questions.
Tip 2 : Go through all the previous interview experiences on Codestudio and Leetcode.
Tip 1 : Have at least 2 good projects that you can explain briefly, covering all the important points.
Tip 2 : Mention every relevant skill.
Tip 3 : Focus more on skills, projects, and experience.
This round mainly revolved around concepts from Data Warehouse and Big Data.
What is a Slowly Changing Dimension (SCD)? What are the different types of SCDs?
A Slowly Changing Dimension (SCD) is a dimension that stores and manages both current and historical data over
time in a data warehouse. It is considered and implemented as one of the most critical ETL tasks in tracking the
history of dimension records.
There are three types of SCDs, and you can use Warehouse Builder to define, deploy, and load all three types.
Type 1 SCDs - Overwriting :
In a Type 1 SCD the new data overwrites the existing data. Thus the existing data is lost as it is not stored anywhere
else. This is the default type of dimension you create. You do not need to specify any additional information to create
a Type 1 SCD.
Type 2 SCDs - Creating another dimension record :
A Type 2 SCD retains the full history of values. When the value of a chosen attribute changes, the current record is
closed. A new record is created with the changed data values and this new record becomes the current record. Each
record contains the effective time and expiration time to identify the time period between which the record was active.
Type 3 SCDs - Creating a current value field :
A Type 3 SCD stores two versions of values for certain selected level attributes. Each record stores the previous
value and the current value of the selected attribute. When the value of any of the selected attributes changes, the
current value is stored as the old value and the new value becomes the current value.
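For illustration, here is a minimal SQL sketch of a Type 2 change, assuming a hypothetical dim_customer table with effective_date, expiration_date, and is_current columns (all names are illustrative, not tied to any specific tool):

-- Close the current record for the customer whose attribute (here, city) changed
UPDATE dim_customer
SET expiration_date = '2023-06-30', is_current = 0
WHERE customer_id = 101 AND is_current = 1;

-- Insert a new record carrying the changed value; it becomes the current record
INSERT INTO dim_customer (customer_id, customer_name, city, effective_date, expiration_date, is_current)
VALUES (101, 'John Smith', 'Boston', '2023-07-01', '9999-12-31', 1);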
What do you mean by Degenerate Dimension?
A high-cardinality attribute column in the fact table that has no content other than its natural key, yet is required as a dimension for analysis or drill-down purposes, is called a degenerate dimension. Because this degenerate dimension is constructed from a fact table item and placed in the fact table itself, it is also known as a fact dimension. It helps reduce duplicate data by placing the high-cardinality dimension key directly in the fact table.
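As a rough sketch (the table and column names here are hypothetical), an invoice number kept directly in a sales fact table is a typical degenerate dimension:

CREATE TABLE fact_sales (
    date_key       INT,            -- foreign key to a date dimension
    product_key    INT,            -- foreign key to a product dimension
    invoice_number VARCHAR(20),    -- degenerate dimension: only its natural key, no separate dimension table
    quantity       INT,
    sales_amount   DECIMAL(10, 2)
);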
Explain ETL Process in Data Warehouse
ETL is a process in Data Warehousing and it stands for Extract, Transform and Load. It is a process in which an ETL
tool extracts the data from various data source systems, transforms it in the staging area, and then finally, loads it into
the Data Warehouse system.
The different steps in the ETL process are :
1) Extraction:
The first step of the ETL process is extraction. In this step, data is extracted from various source systems, which can be in various formats like relational databases, NoSQL, XML, and flat files, into the staging area. It is important to extract the data from the various source systems and store it in the staging area first, rather than loading it directly into the data warehouse, because the extracted data is in various formats and can also be corrupted. Loading it directly into the data warehouse may damage it, and rollback would be much more difficult. Therefore, this is one of the most important steps of the ETL process.
2) Transformation:
The second step of the ETL process is transformation. In this step, a set of rules or functions is applied to the extracted data to convert it into a single standard format. It may involve the following processes/tasks (a small SQL sketch of the cleaning step appears after this answer):
i) Filtering – loading only certain attributes into the data warehouse.
ii) Cleaning – filling up the NULL values with some default values, mapping U.S.A, United States, and America into
USA, etc.
iii) Joining – joining multiple attributes into one.
iv) Splitting – splitting a single attribute into multiple attributes.
v) Sorting – sorting tuples on the basis of some attribute (generally key-attribute).
3) Loading:
The third and final step of the ETL process is loading. In this step, the transformed data is finally loaded into the data warehouse. Sometimes the data warehouse is updated by loading data very frequently, and sometimes it is done at longer but regular intervals. The rate and period of loading depend solely on the requirements and vary from system to system.
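As a small illustrative sketch of the cleaning task mentioned above (the staging table stg_customers and its columns are hypothetical), NULL values can be replaced with defaults and country-name variants standardized during transformation:

SELECT
    customer_id,
    COALESCE(phone, 'UNKNOWN') AS phone,                        -- fill NULL values with a default
    CASE WHEN country IN ('U.S.A', 'United States', 'America')
         THEN 'USA' ELSE country END AS country                 -- map spelling variants to one standard value
FROM stg_customers;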
What are the different types of data marts in the context of data warehousing?
Following are the different types of data marts in data warehousing :
1) Dependent Data Mart : A dependent data mart can be developed using data from operational sources, external sources, or both. It enables the organization's data to be accessed from a single central data warehouse. All data is centralized, which can aid in the development of further data marts.
2) Independent Data Mart : This data mart does not need a central data warehouse. It is typically established for smaller groups within a company and has no connection to an Enterprise Data Warehouse or any other data warehouse. Each piece of information is self-contained and can be used and analyzed independently. With this approach, it becomes critical to maintain a consistent and centralized data repository that numerous users can access.
3) Hybrid Data Mart : As the name implies, a hybrid data mart is used when a data warehouse takes inputs from multiple sources. It comes in handy when a user requires an ad hoc integration. This solution can be utilized if an organization needs multiple database environments and quick implementation. It requires the least amount of data purification, and the data mart can accommodate large storage structures. A data mart is most effective when smaller, data-centric applications are employed.
Difference between Fact Table and Dimension Table
1) A fact table contains the measures (metrics) defined on the attributes of a dimension table,
while a dimension table contains the attributes on which the fact table calculates those metrics.
2) A fact table has fewer attributes than a dimension table,
while a dimension table has more attributes than a fact table.
3) A fact table has more records than a dimension table,
while a dimension table has fewer records than a fact table.
4) A fact table grows vertically (deep, with many rows),
while a dimension table grows horizontally (wide, with many columns).
5) The attributes of a fact table are in numerical and text format,
while the attributes of a dimension table are in text format.
6) The number of fact tables in a schema is smaller than the number of dimension tables,
while the number of dimension tables in a schema is greater than the number of fact tables.
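A minimal sketch of this relationship, using hypothetical tables: the dimension table is wide and descriptive, while the fact table is narrow, numeric, and references the dimension through a key:

CREATE TABLE dim_product (
    product_key  INT PRIMARY KEY,      -- surrogate key
    product_name VARCHAR(100),         -- descriptive text attributes
    category     VARCHAR(50),
    brand        VARCHAR(50)
);

CREATE TABLE fact_orders (
    date_key     INT,                                        -- key to a date dimension
    product_key  INT REFERENCES dim_product(product_key),    -- key to the product dimension
    units_sold   INT,                                        -- numeric measures
    sales_amount DECIMAL(10, 2)
);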
How to deploy a Big Data Model? Mention the key steps involved
Deploying a model on a Big Data platform mainly involves three key steps:
1) Data Ingestion: This process involves collecting data from different sources like social media platforms, business applications, log files, etc.
2) Data Storage: When data extraction is completed, the challenge is to store this large volume of data in the database in which the Hadoop Distributed File system (HDFS) plays a vital role.
3) Data Processing: After storing the data in HDFS or HBase, the next task is to analyze and visualize these large amounts of data using specific algorithms for better data processing. And yet again, this task is more straightforward if we use Hadoop, Apache Spark, Pig, etc.
After performing these essential steps, one can deploy a big data model successfully.
Explain overfitting in Big Data. How can it be avoided?
Overfitting is a modeling error that occurs when a model is fitted too tightly to the data, i.e. when a modeling function is closely fitted to a limited data set. Due to overfitting, the predictive power of such models is reduced. This effect decreases the model's generalization ability, so it fails to generalize when applied outside the sample data.
There are several Methods to avoid Overfitting; some of them are :
1) Cross-validation : A cross-validation method refers to dividing the data into multiple small test data sets, which can be used to tune the model.
2) Early stopping : After a certain number of iterations, the generalizing capacity of the model weakens; in order to avoid that, a method called early stopping is used in order to avoid Overfitting before the model crosses that point.
3) Regularization : This method penalizes all the parameters except the intercept so that the model generalizes the data instead of overfitting it.
This round was primarily focused towards some core concepts in Database Management System and Hadoop. Towards the end of the interview, I was also asked some questions around ETL (Extract, Transform, and Load) and Data Warehouse.
Difference b/w OLAP and OLTP
OLAP stands for Online analytical processing and OLTP stands for Online transaction processing. The major
differences b/w OLTP and OLAP are :
1) OLAP : Consists of historical data from various databases.
OLTP : Consists of only current operational data.
2) OLAP : It is subject oriented. Used for Data Mining, Analytics, Decision making, etc.
OLTP : It is application oriented. Used for business tasks.
3) OLAP : It provides a multi-dimensional view of different business tasks.
OLTP : It reveals a snapshot of present business tasks.
4) OLAP : Relatively slow as the amount of data involved is large. Queries may take hours.
OLTP : Very fast as the queries operate on about 5% of the data.
5) OLAP : It only needs backups from time to time as compared to OLTP.
OLTP : The backup and recovery process is maintained religiously.
6) OLAP : A large amount of data is stored, typically in TBs or PBs.
OLTP : The size of the data is relatively small as the historical data is archived, for example in MBs or GBs.
How to delete Duplicates From a Table in SQL Server?
To delete the duplicate rows from the table in SQL Server, we should follow these steps:
1) Find the duplicate rows using the GROUP BY clause or the ROW_NUMBER() function.
2) Use the DELETE statement to remove the duplicate rows.
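For example, a sketch in SQL Server using ROW_NUMBER() (the employees table and its columns are hypothetical): rows within each duplicate group are numbered, and every row after the first is deleted.

WITH cte AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY first_name, last_name, email   -- columns that define a duplicate
               ORDER BY id
           ) AS rn
    FROM employees
)
DELETE FROM cte      -- deleting through the CTE removes the rows from the underlying table
WHERE rn > 1;        -- keep the first row of each group, delete the rest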
Difference b/w INNER JOIN and OUTER JOIN in SQL.
1) INNER JOIN : It returns the combined tuples from two or more tables where the join condition is satisfied.
OUTER JOIN : It returns the combined tuples from the specified table(s) even when the join condition fails.
2) INNER JOIN : Uses the clauses INNER JOIN and JOIN.
OUTER JOIN : Uses the clauses LEFT OUTER JOIN, RIGHT OUTER JOIN, FULL OUTER JOIN, etc.
3) INNER JOIN : When no attributes match, it returns nothing.
OUTER JOIN : It does not depend on the attributes matching; where no match is found, NULL is placed in the corresponding columns.
4) INNER JOIN : When the number of tuples is large, INNER JOIN works faster than OUTER JOIN.
OUTER JOIN : Generally, OUTER JOIN is slower than INNER JOIN, except in some special cases.
5) INNER JOIN : It is used when we want detailed information about a specific attribute.
OUTER JOIN : It is used when we want complete information.
6) INNER JOIN : The JOIN and INNER JOIN clauses work the same way.
OUTER JOIN : The FULL OUTER JOIN and FULL JOIN clauses work the same way.
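A small sketch of the difference, using hypothetical customers and orders tables:

-- INNER JOIN: returns only customers that have at least one matching order.
SELECT c.name, o.amount
FROM customers c
INNER JOIN orders o ON o.customer_id = c.id;

-- LEFT OUTER JOIN: returns every customer; where no order matches,
-- the columns from orders are filled with NULL.
SELECT c.name, o.amount
FROM customers c
LEFT OUTER JOIN orders o ON o.customer_id = c.id;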
Explain Hadoop. List the core components of Hadoop
Hadoop is a famous big data tool utilized by many companies globally. A few successful Hadoop users are:
Uber
The Bank of Scotland
Netflix
The National Security Agency (NSA) of the United States
Twitter
The three core components of Hadoop are:
1) Hadoop YARN - It is a resource management unit of Hadoop.
2) Hadoop Distributed File System (HDFS) - It is the storage unit of Hadoop.
3) Hadoop MapReduce - It is the processing unit of Hadoop.
Explain the Storage Unit In Hadoop (HDFS).
HDFS, the Hadoop Distributed File System, is the storage layer of Hadoop. The files in HDFS are split into block-sized parts called data blocks. These blocks are saved on the slave nodes in the cluster. By default, the size of a block is 128 MB, which can be configured as per our necessities. HDFS follows a master-slave architecture and contains two daemons: the NameNode and the DataNodes.
NameNode :
The NameNode is the master daemon that operates on the master node. It stores the filesystem metadata, that is, file names, data about the blocks of a file, block locations, permissions, etc. It manages the DataNodes.
DataNode :
The DataNodes are the slave daemons that operate on the slave nodes. They store the actual business data and serve the client read/write requests based on the NameNode's instructions. The DataNodes store the blocks of the files, while the NameNode stores metadata such as block locations and permissions.
State difference between ETL and OLAP (Online Analytical Processing) tools.
ETL tools: The data is extracted, transformed, and loaded into the data warehouse or data mart using ETL tools. Several transformations are necessary before the data is loaded into the target table in order to implement business logic. Examples: DataStage, Informatica, etc.
OLAP (Online Analytical Processing) tools: OLAP tools are designed to create reports from data warehouses and data marts for business analysis. They load data from the target tables into the OLAP repository and perform the required modifications to create a report. Examples: Business Objects, Cognos, etc.
State difference between ETL testing and manual testing.
ETL Testing :
1) The test is an automated process, which means that no special technical knowledge is needed aside from understanding the software.
2) It is extremely fast and systematic, and it delivers excellent results.
3) Databases and their counts are central to ETL testing.
4) Metadata is included and can easily be altered.
5) It is concerned with error handling, log summary, and load progress, which eases the developer's and maintainer's workload.
Manual Testing :
1) It requires technical expertise in SQL and Shell scripting since it is a manual process.
2) In addition to being time-consuming, it is highly prone to errors.
3) Manual testing focuses on the program's functionality.
4) It lacks metadata, and changes require more effort.
5) From a maintenance perspective, it requires maximum effort.
What do you know about the Bus Schema in ETL?
1) An important part of ETL is dimension identification, and this is largely done by the Bus Schema.
2) A BUS schema comprises a suite of conformed dimensions and uniform definitions and can be used for handling dimension identification across all businesses.
3) To put it another way, the bus schema identifies the common dimensions and facts across all the data marts of an organization, i.e. the conformed dimensions (dimensions with the same information/meaning when referred to from different fact tables).
4) Using the Bus schema, information is given in a standard format with precise dimensions in ETL.
This was a typical HR round with some standard Behavioral questions.
Tell me something about yourself?
Tip 1 : Prepare the points that you will speak in your introduction prior to the interview.
Tip 2 : Talk about your current CGPA, achievements, and authenticated certifications.
Tip 3 : I talked about my role in my current internship and everything I do there.
Why are you looking for a job change?
Tip : This is a common question for an experienced professional seeking a change. The easiest way to respond is to state that you are leaving your current job in order to advance your career. Make sure you don't criticize or speak poorly about the company where you currently work.
