Algorithm
The Chandy-Lamport algorithm works as discussed in steps below:
1. Initiator process:
- One process acts as the initiator and decides to take a snapshot of the system.
- The initiator process records its own local state and sends a "marker" message to all other processes.
2. Marker messages:
- When a process receives a marker message for the first time, it records its own local state and marks the state of the incoming channel from the sender as empty.
- The process then sends marker messages to all other processes, except the one from which it received the marker.
3. Recording channel states
- After a process has recorded its local state and sent out marker messages, it starts recording the messages received on each incoming channel until it receives a marker message on that channel.
- The process stores these messages as part of the channel state for the snapshot.
4. Termination:
- The snapshot is complete when all processes have recorded their local states and the states of all incoming channels.
- The initiator process can collect the local states and channel states from all processes to construct the global snapshot.
Let’s look at a code implementation of this algorithm in python to understand how it works :
Python
from collections import defaultdict
import threading
class Process:
def __init__(self, pid):
self.pid = pid
self.local_state = None
self.channel_states = defaultdict(list)
self.markers_received = set()
self.snapshot = None
def send_message(self, dst, message):
dst.receive_message(self.pid, message)
def receive_message(self, src, message):
if self.snapshot is None:
self.channel_states[src].append(message)
else:
self.snapshot.channel_states[src].append(message)
def take_snapshot(self):
self.local_state = f"Local state of process {self.pid}"
self.snapshot = Snapshot(self.pid, self.local_state)
self.markers_received.add(self.pid)
for process in processes:
if process.pid != self.pid:
self.send_message(process, "marker")
while len(self.markers_received) < len(processes):
pass
self.snapshot.channel_states = self.channel_states.copy()
self.snapshot.markers_received = self.markers_received.copy()
return self.snapshot
class Snapshot:
def __init__(self, pid, local_state):
self.pid = pid
self.local_state = local_state
self.channel_states = {}
self.markers_received = set()
def chandy_lamport_snapshot(initiator_process):
snapshot = initiator_process.take_snapshot()
return snapshot
# Example usage
if __name__ == "__main__":
# Create processes
process1 = Process(1)
process2 = Process(2)
process3 = Process(3)
processes = [process1, process2, process3]
# Send some messages between processes
process1.send_message(process2, "Hello from process 1 to process 2")
process2.send_message(process3, "Hello from process 2 to process 3")
process3.send_message(process1, "Hello from process 3 to process 1")
# Initiate snapshot from process 1
snapshot = chandy_lamport_snapshot(process1)
# Print snapshot results
print(f"Snapshot initiated by process {snapshot.pid}")
print(f"Local state: {snapshot.local_state}")
print("Channel states:")
for src, messages in snapshot.channel_states.items():
print(f" From process {src}: {messages}")

You can also try this code with Online Python Compiler
Run Code
In this implementation, we define two classes: `Process` and `Snapshot`. The `Process` class represents a process in the distributed system, and it has methods for sending and receiving messages, as well as taking a snapshot. The `Snapshot` class represents a snapshot object that holds the local state and channel states of a process.
The `chandy_lamport_snapshot` function is the entry point for initiating a snapshot. It takes the initiator process as an argument and returns the snapshot object.
In the example usage section, we create three processes and send some messages between them. Then, we initiate a snapshot from process 1 using the `chandy_lamport_snapshot` function. Finally, we print the snapshot results, including the initiator process ID, its local state, and the channel states.
When you run this code, you will get below mentioned output:
Snapshot initiated by process 1
Local state: Local state of process 1
Channel states:
From process 3: ['Hello from process 3 to process 1']
This output shows the snapshot initiated by process 1, including its local state and the channel states captured during the snapshot process.
Need to take snapshots or recording the global state of the system
Taking a snapshot or recording the global state of a distributed system is essential for many reasons :
1. Debugging and error detection
- Snapshots allow developers to capture the state of the system at a particular point in time.
- By analyzing the captured state, developers can identify and diagnose issues, bugs, or inconsistencies in the system.
- Snapshots provide a valuable tool for understanding the behavior of the distributed system and troubleshooting problems.
2. Recovery and fault tolerance
- In case of failures or crashes, snapshots can be used to restore the system to a consistent state.
- By periodically taking snapshots, the system can recover from failures by rolling back to a previous consistent state.
- Snapshots enable the implementation of checkpoint and recovery mechanisms, enhancing the fault tolerance of the distributed system.
3. Monitoring and analysis
- Snapshots provide a way to monitor and analyze the state of the distributed system over time.
- By comparing snapshots taken at different points in time, administrators can observe changes, detect anomalies, and gain insights into the system's behavior.
- Snapshots can be used for performance analysis, resource utilization monitoring, and capacity planning.
4. Consistent global state
- In a distributed system, each process has its own local state, and the global state is the combination of all local states.
- Taking a snapshot ensures that the captured global state is consistent, meaning that it represents a valid state of the system at a particular point in time.
- Consistency is crucial for reasoning about the system's behavior and making accurate decisions based on the captured state.
5. Distributed algorithms and protocols
- Many distributed algorithms and protocols rely on the ability to capture a consistent global state.
- Snapshots are used in algorithms such as distributed deadlock detection, global predicate detection, and distributed garbage collection.
- These algorithms require a consistent view of the system's state to perform their tasks correctly.
Note: With the use of snapshots, distributed systems can achieve better reliability, fault tolerance, and observability. The Chandy-Lamport algorithm provides an efficient & reliable way to capture a consistent global state in a distributed environment without halting the entire system.
Frequently Asked Questions
Can the Chandy-Lamport algorithm handle network failures or message losses?
No, the algorithm assumes reliable communication channels and does not handle network failures or message losses during the snapshot process.
Does the Chandy-Lamport algorithm require a global clock or synchronized clocks?
No, the algorithm does not require a global clock or synchronized clocks. It relies on the causal ordering of events and the FIFO property of communication channels.
How does the Chandy-Lamport algorithm ensure a consistent global state?
By using marker messages and recording channel states, the algorithm captures a consistent global state where the local states of processes and the channel states are coordinated.
Conclusion
In this article, we have discussed the Chandy-Lamport algorithm, also known as the snapshot algorithm. We explained the assumptions made by the algorithm, its working principles, and the need for taking snapshots in distributed systems. The Chandy-Lamport algorithm provides a way to capture a consistent global state across multiple processes or nodes without halting the entire system. By recording local states and channel states, the algorithm enables debugging, error detection, recovery, and monitoring of distributed systems.
You can also check out our other blogs on Code360.