Introduction
Privacy is a very important factor for any organization to ensure its information is safe from malicious activities. So companies aim to protect their sensitive data without any disruptions in their applications. To achieve this, transparent encryption is used.
In this article, we will discuss ‘Transparent Encryption in HDFS’. We will also briefly discuss HDFS and transparent encryption. Moving forward, we will look into the key concepts and architecture of transparent encryption.
HDFS
HDFS stands for Hadoop Distributed File System. It is a file system that stores and manages huge amounts of data across a cluster of computers. HDFS plays an important role in the Apache Hadoop framework for processing big data.
Using the Hadoop Distributed File System, we can efficiently store and process large datasets and perform parallel processing, resulting in maximum utilization of the storage system. It provides multiple benefits such as data reliability, fault tolerance, data replication, high throughput, Hadoop integration, scalability, etc.
Apart from these amazing advantages, Hadoop also provides us with the functionality of Transparent data encryption, something that is in high demand.
‘Transparent Encryption,’ as the name suggests, refers to encrypting and decrypting data without disrupting the application or the user experience. It is generally used to improve security by enforcing authorization and minimizing complexity. This means that only authorized users can access certain information.
Let's understand transparent encryption through an example. Say you are working on a web application project where you need to handle sensitive user data. Your team decides to use transparent encryption and introduces functionality where user data is automatically encrypted before being stored in the database and decrypted when it is accessed.
Encryption → Converting plain text into ciphertext, i.e., encrypted data.
Decryption → Converting ciphertext into plain text.
Transparent Encryption in HDFS
Transparent Encryption in HDFS is used for protecting the sensitive information stored in Hadoop clusters. It refers to a mechanism that encrypts and decrypts data in HDFS without any manual intervention.
Transparent encryption in HDFS is also known as client-side encryption. This means that encryption and decryption occur on the client side.
Transparent Encryption fulfills the two main encryption requirements:
At-rest encryption → Encrypting data on persistent media, i.e., data that has been written to disk.
In-transit encryption → Encrypting data as it moves over the network.
There are four main layers of traditional data encryption:
Application Level: This is the most secure and most flexible layer of encryption. The application has full control over what is encrypted and can precisely reflect the requirements of the user.
Database-level: Database-level encryption is similar to application-level encryption, but there is a performance trade-off due to the additional processing involved at this level of encryption.
File-system-level: This level of encryption provides transparency and efficient performance. It has an easy deployment process, but the application-level policies are not used.
Disk-level encryption: Disk-level encryption protects the application only from physical theft and is quite inflexible but is easy to deploy.
Transparent encryption in HDFS fits between file-system-level and database-level encryption and provides advantages such as:
Efficient performance,
Prevention from malicious attacks,
Securing data both at rest and in transit.
Architecture
Let's discuss the key concepts and architecture of transparent encryption in HDFS.
Encryption zone
For performing transparent encryption in HDFS, a new concept is introduced known as an encryption zone. An encryption zone refers to a directory where the information is encrypted transparently while writing data and decrypted for reading the data.
There may be many encryption zones in a file system; when an encryption zone is created, it is associated with a single encryption zone key.
Keys are the cryptography elements that protect the data and enable encryption and decryption. These can be referred to as confidential codes for securing the data.
Each file in an encryption zone has a unique DEK (data encryption key). HDFS never handles DEKs directly; it only manages EDEKs (encrypted data encryption keys). When the client has to read or write data, it has the EDEK decrypted, and the resulting DEK is then used to encrypt or decrypt the file contents.
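The DEK/EDEK scheme described above is an instance of envelope encryption. The following is a minimal, hypothetical sketch of the idea using `openssl` (the filenames and the `masterkey` passphrase are illustrative placeholders, not part of HDFS):

```shell
# Conceptual sketch of envelope encryption (the DEK/EDEK scheme) with openssl.
# All filenames and the master passphrase are hypothetical.
cd "$(mktemp -d)"
echo "sensitive data" > plain.txt

# 1. Generate a random DEK and encrypt the file with it.
openssl rand -hex 32 > dek.key
openssl enc -aes-256-cbc -pbkdf2 -in plain.txt -out data.enc -pass file:dek.key

# 2. Encrypt the DEK under a master key to produce the EDEK; only the EDEK
#    (never the raw DEK) would be stored alongside the file metadata.
openssl enc -aes-256-cbc -pbkdf2 -in dek.key -out edek.bin -pass pass:masterkey

# 3. To read the file, first decrypt the EDEK back into the DEK...
openssl enc -d -aes-256-cbc -pbkdf2 -in edek.bin -out dek2.key -pass pass:masterkey

# 4. ...then use the recovered DEK to decrypt the data.
openssl enc -d -aes-256-cbc -pbkdf2 -in data.enc -out plain2.txt -pass file:dek2.key
cat plain2.txt
```

In HDFS the analogous step 3 is performed by the KMS, so the master key never leaves the key server.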
Nested Encryption Zones
A nested encryption zone refers to creating encryption zones within other zones. You can think of it as boxes within boxes. Below are some of the major points regarding nested encryption zones.
The outermost encryption zone encloses the other zones within it and provides an additional level of security. Nested encryption provides many advantages, such as guaranteeing encryption of everything inside and providing the flexibility of different keys.
Encryption of the outermost layer ensures that all the zones within it are also well encrypted and guarantees that everything is safe.
It also provides the flexibility of creating different keys for the nested zones. You can provide a unique key to each nested zone in a file system. Let's understand this through an example.
For example, let's say you have a directory named ‘NinjaDocs’, i.e., the outer encryption zone. Within it, there are two nested encryption zones: ‘NinjaData’ with key A and ‘NinjaReport’ with key B. You can assign a different key to each of them.
→ NinjaDocs: Root directory (outer encryption zone)
→ NinjaData: Nested zone with Key A
→ NinjaReport: Nested zone with Key B
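Under the assumption of a running HDFS cluster with a configured KMS, the layout above could be created with the standard `hadoop key` and `hdfs crypto` CLI commands (key names here follow the example and are hypothetical; note that a directory must be empty when it is made a zone):

```shell
# Create one key per zone in the example layout.
hadoop key create KeyOuter
hadoop key create KeyA
hadoop key create KeyB

# Create the outer zone first (the directory must be empty).
hadoop fs -mkdir /NinjaDocs
hdfs crypto -createZone -keyName KeyOuter -path /NinjaDocs

# Then create the nested zones inside it, each with its own key.
hadoop fs -mkdir /NinjaDocs/NinjaData /NinjaDocs/NinjaReport
hdfs crypto -createZone -keyName KeyA -path /NinjaDocs/NinjaData
hdfs crypto -createZone -keyName KeyB -path /NinjaDocs/NinjaReport
```

These commands require a cluster and a KMS, so they are shown here as a sketch rather than a runnable script.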
Key Management Server (KMS)
The key management server in Hadoop provides access to encryption keys in HDFS. It acts as a bridge between HDFS clients, the Hadoop cluster, and the external keystore. Hadoop KMS contains two main components:
Key server: The key server manages and stores the encryption keys and acts as the central component.
Client API: It consists of methods and functions for requesting, retrieving, or storing EDEKs (encrypted data encryption keys).
KMS provides the advantage of securely storing and managing encryption keys by integrating with external key storage systems. Below are some of the major functionalities of KMS:
It gives clients access to the stored encryption zone keys.
It acts as a central key management system.
When an encryption zone is created, KMS generates a unique encryption key for that zone.
It generates new EDEKs (encrypted data encryption keys), which are stored on the NameNode.
It retrieves an EDEK → decrypts it → and provides the decrypted key (the DEK) to the HDFS client.
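For HDFS clients and the NameNode to find the KMS, Hadoop is pointed at it through a key-provider property in `core-site.xml`. A minimal sketch is shown below; the hostname and port are placeholders for your deployment:

```xml
<!-- core-site.xml: points HDFS clients and the NameNode at the KMS.
     The host and port here are placeholders, not defaults you must use. -->
<property>
  <name>hadoop.security.key.provider.path</name>
  <value>kms://http@kms-host.example.com:9600/kms</value>
</property>
```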
Steps for Transparent Encryption in HDFS
Let's discuss how to set up transparent encryption in HDFS using two CLI commands: ‘hadoop key’ and ‘hdfs crypto’. Below are the general steps that can be followed for setting up encryption in HDFS.
Step 1: Creating a new encryption key
The command below creates an encryption key named ‘NinjaKey’.
hadoop key create NinjaKey
Step 2: Creating an encryption zone
The command below creates a directory named ‘NinjaZone’ in HDFS.
hadoop fs -mkdir /NinjaZone
Step 3: Assigning the encryption key to the encryption zone
The key ‘NinjaKey’ is associated with the encryption zone at the path ‘/NinjaZone’.
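The association is made with the `hdfs crypto -createZone` command, and `-listZones` can confirm the result (these commands assume a running cluster with a configured KMS):

```shell
# Associate NinjaKey with the (empty) directory /NinjaZone.
hdfs crypto -createZone -keyName NinjaKey -path /NinjaZone

# List all encryption zones and their keys to verify.
hdfs crypto -listZones

# From now on, files are encrypted on write and decrypted on read, transparently:
hadoop fs -put localfile.txt /NinjaZone/
hadoop fs -cat /NinjaZone/localfile.txt
```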
Frequently Asked Questions
What is Transparent Encryption in HDFS?
Transparent Encryption in HDFS is used for protecting the sensitive information stored in Hadoop clusters. It refers to a mechanism that encrypts and decrypts data in HDFS without any manual intervention. Transparent encryption in HDFS is also known as client-side encryption, meaning that encryption and decryption occur on the client side.
What is HDFS?
HDFS stands for Hadoop Distributed File System. It is a file system that stores and manages huge amounts of data across a cluster of computers. HDFS plays an important role in storing data in the Apache Hadoop framework for processing big data. Using the Hadoop Distributed File System, we can scale our datasets and perform parallel processing, resulting in maximum utilization of the storage system.
What is an encryption zone?
For performing transparent encryption in HDFS, a new abstraction is introduced known as an encryption zone. An encryption zone refers to a directory where the information is encrypted transparently while writing data and decrypted while reading data.
What is Key Management Server (KMS)?
The key management server in Hadoop provides access to encryption keys in HDFS. It acts as a bridge between HDFS clients, the Hadoop cluster, and the external keystore. KMS provides the advantage of securely storing and managing encryption keys by integrating with external key storage systems.
Conclusion
In this article, we have learned about Transparent Encryption in HDFS. We also briefly discussed HDFS and transparent encryption. Moving forward, we looked into the key concepts and architecture of transparent encryption and the steps for setting up an encryption zone. To learn more about Hadoop, you can refer to the articles below and take your preparation journey to the next level.
You can read more such descriptive articles on our platform, Coding Ninjas Studio. You will find straightforward explanations of almost every topic on the platform. So take your coding journey to the next level using Coding Ninjas.