Tip 1: Focus on Data Structures.
Tip 2: Focus on SQL and coding.
Tip 3: Focus on System Design.
Tip 4: Coding skills should be your top priority for the interview.
Tip 5: For System Design, practice designing a complete solution end to end.
Round 1: Preliminary Round (Screening Round): Telephonic Round
This round consisted of a detailed explanation of my previous projects: what I worked on with Mixpanel, Kafka, ETL concepts, Datahub Spark lineage, Spark, the data model I prepared during experimentation (A/B testing), and Presto architecture. The round went well and lasted 45 minutes (telephonic). They also asked why I want to work for Walmart.


You are given an array 'arr' of 'n' integers, where each element represents the height of a bar. Find how much rainwater can be trapped between the bars. The width of each bar is the same and is equal to 1.
Input: ‘n’ = 6, ‘arr’ = [3, 0, 0, 2, 0, 4].
Output: 10
Explanation: With heights [3, 0, 0, 2, 0, 4], the water trapped above each index is [0, 3, 3, 1, 3, 0], which sums to 10.

You don't need to print anything. It has already been taken care of. Just implement the given function.
Approach 1 (Brute Force): For every array element, find the highest bar on its left and the highest bar on its right, and take the smaller of the two heights. The difference between this smaller height and the height of the current element is the amount of water that can be stored above that element; summing this over all elements gives the answer. A sketch of this approach follows.
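A minimal Python sketch of this brute-force idea (O(n^2) time; the function name is my own):

def trapped_water(arr: list[int]) -> int:
    # For each bar: water above it = min(highest bar to its left, highest to its right) - its height.
    total = 0
    for i in range(len(arr)):
        left_max = max(arr[: i + 1])   # highest bar at or to the left of i
        right_max = max(arr[i:])       # highest bar at or to the right of i
        total += min(left_max, right_max) - arr[i]
    return total

print(trapped_water([3, 0, 0, 2, 0, 4]))  # prints 10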
I got a call from HR that my screening round was cleared and that I was shortlisted for a technical discussion. This round lasted about 1 hour 30 minutes and was taken by a Senior Data Engineer at Walmart.
This interview focused on medium-level data structures and algorithms questions, hard-level SQL questions, Python coding questions, Big Data concepts, Spark, Kubernetes, Airflow architecture questions, cloud computing concepts, SDLC, and Agile methodology (the Scrum framework, at a high level).
Some questions covered DevOps strategy (basic level), CI/CD pipelines, NoSQL databases, AWS services-based scenario questions, and medium-level data structure questions (arrays and stacks, linked lists and trees).
Two DSA questions were asked; I remember some of them.
SQL Interview Questions
1. The Employee table holds all employees, including their managers. Every employee has an Id, and there is also a column for the manager's Id. Write a SQL query that finds the managers with at least five direct reports.
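One way to write this, assuming the columns are named Id, Name, and ManagerId (a self-join counting direct reports per manager):

SELECT m.Name
FROM Employee e
JOIN Employee m ON e.ManagerId = m.Id
GROUP BY m.Id, m.Name
HAVING COUNT(*) >= 5;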
2. How do you find the nth highest salary for each department, with and without a window function?
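A sketch of both approaches, reusing the Employee columns from question 3 below and taking n = 2 as an example:

-- With a window function:
SELECT empDeptId, empSalary
FROM (
    SELECT empDeptId, empSalary,
           DENSE_RANK() OVER (PARTITION BY empDeptId ORDER BY empSalary DESC) AS rnk
    FROM Employee
) ranked
WHERE rnk = 2;

-- Without a window function: the nth highest salary has exactly n - 1 distinct salaries above it.
SELECT DISTINCT e1.empDeptId, e1.empSalary
FROM Employee e1
WHERE (SELECT COUNT(DISTINCT e2.empSalary)
       FROM Employee e2
       WHERE e2.empDeptId = e1.empDeptId
         AND e2.empSalary > e1.empSalary) = 1;  -- n - 1; here n = 2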
3. Given an Employee table with attributes empId, empSalary, and empDeptId, and a Department table with attributes deptId, deptName, and CourseOffered, I was asked to write a SQL query (on a notepad) to find the employee with the highest salary in each department using window functions. I used the DENSE_RANK window function and was then asked to explain why DENSE_RANK instead of the RANK function; the answer and query are given below.
Some questions on Spark optimisation and Hadoop concepts, such as:
i) How Airflow works on Kubernetes using the pod concept.
ii) How the Airflow scheduler works with the worker machines and the webserver.
iii) The difference between a Deployment and a StatefulSet in Kubernetes, and how Kubernetes manages fault tolerance.
iv) You have a Spark job that is taking longer than expected to complete. What steps would you take to identify and troubleshoot performance bottlenecks?
v) You have a Spark cluster with limited resources. How would you allocate resources and configure the cluster for optimal performance?
vi) He asked me to write code for uploading Parquet files to an S3 bucket using the boto3 library (as I had worked on AWS). I wrote it in Python with boto3 on a notepad; a sketch follows this list.
vii) How Airflow stores logs in an S3 bucket, and how Airflow's backend (metadata) database plays an essential role.
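A minimal boto3 sketch of the Parquet upload (the bucket, key, and file names here are placeholders, not the ones from the interview):

import boto3

def upload_parquet(local_path: str, bucket: str, key: str) -> None:
    # Upload a local Parquet file to S3; upload_file handles multipart uploads for large files.
    s3 = boto3.client("s3")
    s3.upload_file(local_path, bucket, key)

upload_parquet("events.parquet", "my-data-bucket", "raw/events/events.parquet")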
There were other questions on Spark optimisation, Kubernetes, Airflow, and Big Data concepts, along with a project-based explanation.
Coming back to SQL question 3 (the highest salary in each department): why use DENSE_RANK instead of the RANK function?
Using the DENSE_RANK window function instead of RANK in this scenario is a good choice when you want to handle cases where multiple employees within the same department have the same salary. The DENSE_RANK function assigns the same rank to identical salary values and then continues with the next consecutive rank, without leaving gaps.
WITH RankedEmployees AS (
    SELECT
        empId,
        empSalary,
        empDeptId,
        DENSE_RANK() OVER (PARTITION BY empDeptId ORDER BY empSalary DESC) AS salaryRank
    FROM Employee
)
SELECT
    empId,
    empSalary,
    empDeptId
FROM RankedEmployees
WHERE salaryRank = 1;
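To see the difference on a hypothetical tie: if the top two salaries in a department are both 90,000 and the next is 80,000, RANK assigns 1, 1, 3 while DENSE_RANK assigns 1, 1, 2. For picking the highest salary (salaryRank = 1) both return the same rows, but DENSE_RANK keeps the ranks consecutive, which matters for queries like the nth highest salary.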



This problem is a variation of the classic Coin Change problem: instead of finding the total number of possible solutions, we need to find the solution with the minimum number of coins.
The minimum number of coins for a value V can be computed using the recurrence below.
If V == 0: 0 coins are required.
If V > 0: minCoins(coins[0..m-1], V) = min{ 1 + minCoins(V - coins[i]) } over all 0 <= i <= m-1 with coins[i] <= V.
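A bottom-up Python sketch of this recurrence (a standard dynamic-programming formulation; returns -1 when V cannot be formed):

def min_coins(coins: list[int], V: int) -> int:
    INF = float("inf")
    dp = [0] + [INF] * V              # dp[v] = minimum coins needed to make value v
    for v in range(1, V + 1):
        for c in coins:
            if c <= v and dp[v - c] + 1 < dp[v]:
                dp[v] = dp[v - c] + 1
    return -1 if dp[V] == INF else dp[V]

print(min_coins([9, 6, 5, 1], 11))  # prints 2 (6 + 5)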
Given a linked list and a value x, partition the list around x such that all nodes with values less than x come before all nodes with values greater than or equal to x. If x is contained in the list, its nodes only need to come after the elements less than x; the partition element x can appear anywhere in the "right partition" and does not need to sit between the left and right partitions. (Learn)
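A stable-partition Python sketch (the node class and names are my own): build two sublists with dummy heads and splice them together.

class ListNode:
    def __init__(self, val, nxt=None):
        self.val = val
        self.next = nxt

def partition(head: ListNode, x: int) -> ListNode:
    less_head = less = ListNode(0)    # dummy head for nodes < x
    geq_head = geq = ListNode(0)      # dummy head for nodes >= x
    while head:
        if head.val < x:
            less.next = head
            less = less.next
        else:
            geq.next = head
            geq = geq.next
        head = head.next
    geq.next = None                   # terminate the right partition
    less.next = geq_head.next         # splice: all < x, then all >= x
    return less_head.next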
I got a call from HR that my first round was cleared and that I was shortlisted for the next technical discussion. This round lasted about 1 hour 45 minutes and was taken by a Staff Data Engineer at Walmart.
The interview started with system design. I was asked to design the Mixpanel system (an event-driven system) because I had used Mixpanel at Meesho. I opened draw.io and started sketching how Mixpanel works and how events are captured from different clients such as the Android app, the web app, and the iOS app.
During the system design discussion, some questions were asked:
i) How does the load balancer work in Mixpanel?
ii) How are requests handled? Suppose you open the Presto URL in Chrome: the request goes to DNS for IP address resolution, then to the load balancer, then to the target gateway, and finally to the Presto Coordinator. I was asked to explain each concept (the full answer is written out below).
iii) He asked me to write a custom API in Spring Boot, writing only the service and controller classes in Java.
iv) Some questions on Spark coding: he asked me to write code to read data from a Delta Lake (S3 bucket) and run an upsert command, updating rows that already exist based on the primary key and inserting rows that do not. I wrote the code using the DataFrame API; a sketch follows this list.
v) Questions on Spark optimisation such as skewed joins, broadcast joins, CBO, and repartition vs. coalesce.
vi) Questions on Spark Tungsten and the Catalyst optimiser.
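A minimal PySpark sketch of such an upsert using the Delta Lake MERGE API (the S3 paths and the "id" primary-key column are illustrative, not the ones from the interview):

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Incoming batch to upsert; path and schema are placeholders.
updates_df = spark.read.parquet("s3://my-bucket/incoming/")

# Existing Delta table stored on S3.
target = DeltaTable.forPath(spark, "s3://my-bucket/delta/events/")

# MERGE: update rows that match on the primary key, insert the rest.
(target.alias("t")
    .merge(updates_df.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())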
vii) Then the questions moved on to Java and advanced Java.
Questions on the Java Collections framework, such as interfaces, Map, LinkedList design, and garbage collection.
Java Coding Questions & OOP Concepts
i) He asked me to write Java code to trigger garbage collection (the GC collector thread).
ii) He asked me to explain the concept of multithreading, and then to write code for synchronisation using the synchronized keyword.
iii) Some questions on serialisation vs. deserialisation. (Learn)
iv) Explain the use case of the transient keyword in Java.
Questions on System Design Concepts & Synchronisation
i) What is a semaphore variable? How do you prevent deadlock in a system? (Learn)
ii) He asked me to complete the semaphore code to achieve synchronisation, so I wrote a semaphore in Java:
import java.util.LinkedList;
import java.util.Queue;

// Binary semaphore sketch: P() acquires or blocks the caller, V() releases or wakes a waiter.
class SemaphoreInterviewRoundTechnical {
    public enum Value { Zero, One }

    // Processes blocked on this semaphore, woken in FIFO order.
    private final Queue<Process> q = new LinkedList<>();
    private Value value = Value.One;

    // P (wait): take the semaphore if it is free, otherwise block the calling process.
    public void P(Process p) {
        if (value == Value.One) {
            value = Value.Zero;
        } else {
            q.add(p);
            p.sleep();
        }
    }

    // V (signal): hand the semaphore to a waiting process if any, otherwise mark it free.
    public void V() {
        if (q.isEmpty()) {
            value = Value.One;
        } else {
            Process p = q.remove();
            p.wakeup();
        }
    }
}

// Minimal stand-in for a process control block with block/unblock hooks.
class Process {
    public void sleep() { /* block this process */ }
    public void wakeup() { /* move this process back to the ready queue */ }
}
This is the code I submitted.
The last questions were general ones on ETL concepts & data warehouse concepts:
i) What is the difference between a snowflake schema and a star schema?
ii) How would you design a data warehouse from scratch given new requirements? I explained the Snowflake & Databricks setup that I built at Morgan Stanley from the beginning.
iii) Normalisation concepts, and what is SCD Type 2, with an example (a SQL sketch follows this list).
iv) A question on Presto: how to onboard a Delta Lake catalog to Presto.
v) He asked me about Agile. I explained Agile with the Scrum framework, covering sprints, the Jira board, and the iterative approach in detail, and why Agile is preferred over the waterfall model.
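For SCD Type 2, an illustrative SQL sketch on a hypothetical customer dimension (table and column names are my own): expire the current row, then insert the new version, so the full change history is preserved.

-- Close out the current version of the changed row.
UPDATE dim_customer
SET end_date = CURRENT_DATE,
    is_current = FALSE
WHERE customer_id = 42
  AND is_current = TRUE;

-- Insert the new version as the current row.
INSERT INTO dim_customer (customer_id, address, start_date, end_date, is_current)
VALUES (42, 'New Address', CURRENT_DATE, NULL, TRUE);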
Coming back to question ii, how requests are handled when you open the Presto URL in Chrome: the request goes to DNS for IP address resolution, then to the load balancer, then to the target gateway, and finally to the Presto Coordinator. Each step in detail:
DNS Resolution:
When you type the Presto URL in Chrome, the browser needs to resolve the domain name to an IP address.
The Domain Name System (DNS) is responsible for this resolution process. It translates human-readable domain names like "presto.example.com" into IP addresses like "192.0.2.1".
Your browser sends a DNS query to a DNS resolver, which may be provided by your ISP or a public DNS service like Google DNS or Cloudflare DNS.
The DNS resolver looks up the IP address associated with the domain name and returns it to the browser.
Load Balancer:
Once the browser has the IP address of the Presto server, it sends an HTTP request to that IP address.
In many modern web applications, especially those with high traffic or multiple server instances, there is often a load balancer in front of the servers.
The load balancer distributes incoming requests across multiple servers to ensure efficient resource utilization and improve reliability and scalability.
The load balancer forwards the request to one of the available Presto Coordinator nodes.
Target Gateway:
After the load balancer, the request may pass through a gateway that routes it to the Presto Coordinator, which acts as the entry point for queries into the Presto cluster.
The Coordinator is responsible for parsing SQL queries, planning query execution, and coordinating with other nodes in the cluster to execute the query.
The Coordinator also maintains metadata about the cluster, including information about available worker nodes and data distribution.
Presto Coordinator:
The Presto Coordinator processes the incoming query request.
It parses the SQL query, optimizes it, and generates a query plan.
The query plan may involve accessing data stored in various data sources, such as HDFS, S3, or a relational database.
The Coordinator coordinates the execution of the query across multiple Presto worker nodes.
It distributes tasks to the worker nodes and aggregates the results before returning them to the client.
In summary, when you open the Presto URL in Chrome, the request undergoes DNS resolution to find the IP address of the Presto server. The request then passes through a load balancer, which forwards it to one of the Presto Coordinator nodes. The Coordinator processes the query, plans its execution, and coordinates with worker nodes to execute the query in a distributed manner. Finally, the results are aggregated and returned to the client.
Round 4: Techno-Managerial Interview (Managerial Round): 1 hour 10 minutes
The interview started with my introduction, my expertise, and the tech skillset I had worked with. Most of the questions were based on data modeling, Databricks, Datahub, PySpark, and architecture design (ETL design).
One or two questions were asked on batch processing and stream processing using Spark.
He asked me to explain my Mixpanel project and how I created the data model on Delta tables so that lots of raw tables do not get created. I explained the complete platform I worked on at Meesho, such as how the data sources come in, and the complete data pipeline I set up on Databricks to take the silver (Mixpanel) data and run a multi-task job that creates aggregated tables based on business requirements.
He asked me what open-source projects I had worked on, so I explained Datahub and the Spark lineage build (which helps find the source and destination tables for a Spark application). For that, I explained how I created a Spark JAR with a Spark listener and the Spline package.
Question on Cost Optimisation:
✅ Can you share an example of a project you worked on that had a significant impact on your organization?
✅ How did you contribute to cost optimization initiatives while working with cloud technologies?
✅ Could you describe a specific cost optimization strategy you implemented in the cloud and its results?
I was asked how to capture event logs of what is happening on Databricks, including user activities such as who is creating clusters and who is running jobs. For that, I explained the open-source project I used: Overwatch (a Databricks open-source job).
Questions were asked on Spark monitoring and Spark performance management. I answered all of them in depth with practical examples.
Some questions on Jira and the different Scrum ceremonies, and how I would manage multiple tasks using Agile methodology.
Round 5: Director Round (Behavioral & Technical Round)
This interview was taken by a Director at Walmart and lasted about 45-60 minutes. I was asked to introduce myself. Then there was a discussion of the Meesho and Morgan Stanley projects I had worked on (I explained the Datahub Spark Lineage project and the Tenant project), along with my roles and responsibilities as a Data Engineer at Meesho. I was also asked to explain my research papers on a web crawler for ranking websites based on web traffic and page views, which I published at international IEEE and Springer conferences. Some of the questions related to Walmart's core principles and values and my inspirations. Then he asked questions about team management and leadership qualities, mainly situation-based ones such as "Tell me about a time when you faced a challenging situation at work and how you handled it." Finally, he went through my resume and asked technical questions on how Presto and Spark work (as both use a distributed architecture), Databricks, AWS, and Delta Lake concepts with data governance. Some questions that I remember are:
i) What is the Avro file format, and what is its significance in Delta tables? (Learn)
ii) The difference between the underlying architectures of Presto and Spark.
iii) Can Presto work with near-real-time data (a streaming data source)?
iv) How did you develop the Datahub lineage using open-source projects such as Spline and Datahub?
v) What do you think about data uncertainty?
I told him that I am a gold medalist of Uttarakhand state in B.Tech. He was very impressed with the answers I gave during the director's round.
