Table of contents
1. Introduction
2. HDFS
3. Transparent Encryption
4. Transparent Encryption in HDFS
5. Architecture
5.1. Encryption zone
5.2. Nested Encryption Zones
5.3. Key Management Server (KMS)
6. Steps for Transparent Encryption in HDFS
6.1. Step 1: Creating a new encryption key
6.2. Step 2: Creating an encryption zone
6.3. Step 3: Assigning an encryption key to the encryption zone
6.4. Step 4: Setting the ownership
6.5. Step 5: Storing the file in the encryption zone and reading it
6.6. Step 6: Retrieving the encryption information
7. Frequently Asked Questions
7.1. What is Transparent Encryption in HDFS?
7.2. What is HDFS?
7.3. What is an encryption zone?
7.4. What is Key Management Server (KMS)?
8. Conclusion
Last Updated: Mar 27, 2024

Transparent Encryption in HDFS


Introduction

Privacy is a crucial concern for any organization that wants to keep its information safe from malicious activity. Companies therefore aim to protect their sensitive data without disrupting their applications, and transparent encryption is used to achieve this.


In this article, we will discuss transparent encryption in HDFS. We will also briefly cover HDFS and transparent encryption in general. Moving forward, we will look into the key concepts and architecture of transparent encryption.

HDFS

HDFS stands for Hadoop Distributed File System. It is a file system that stores and manages huge amounts of data across a cluster of machines. HDFS plays an important role in the Apache Hadoop framework for processing big data.

Using the Hadoop Distributed File System, we can efficiently store and process large datasets and perform parallel processing, resulting in maximum utilization of the storage system. It therefore provides multiple benefits such as data reliability, fault tolerance, data replication, high throughput, Hadoop integration, scalability, etc.

Apart from these advantages, Hadoop also provides us with the functionality of transparent data encryption, a feature that is in high demand.

Recommended read: Hadoop Distributed File System (HDFS)

Transparent Encryption

‘Transparent Encryption,’ as the name suggests, refers to encrypting and decrypting data without disrupting the application or the user experience. It is generally used to improve security by enforcing authorization and minimizing complexity. This means that only authorized users can access certain information.

Let's understand transparent encryption through an example. Say you are working on a web application that handles sensitive user data. Your team decides to use transparent encryption and introduces functionality where user data is automatically encrypted before it is stored in the database and decrypted whenever it is accessed.

Encryption → converting plain text into ciphertext, i.e., encrypted data.

Decryption → converting ciphertext back into plain text.

Transparent Encryption in HDFS

Transparent Encryption in HDFS is used for protecting the sensitive information stored in Hadoop clusters. It refers to the mechanism of encrypting and decrypting data in HDFS without any manual intervention.

Transparent encryption in HDFS is also known as client-side encryption. This means that encryption and decryption occur on the client side.
 

Transparent Encryption fulfills the two main encryption requirements:

  • At-rest encryption → encrypting data stored on persistent media, such as a disk, so that it remains protected after it has been written.
     
  • In-transit encryption → encrypting data while it moves over the network.
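
Because encryption and decryption happen on the client, data written into an encryption zone is already ciphertext both while it travels to the DataNodes and while it sits on disk. Hadoop also has separate, general wire-encryption settings for RPC and block data transfer. As a minimal sketch, assuming a running cluster with its configuration on the classpath, you could check those settings with:

hdfs getconf -confKey hadoop.rpc.protection
hdfs getconf -confKey dfs.encrypt.data.transfer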
     

There are four main layers of traditional data encryption:

  • Application-level:
    This is the most secure and most flexible layer of encryption. The application has full control over what is encrypted and can precisely reflect the needs of the user.
     
  • Database-level:
    Database-level encryption is similar to application-level encryption, but there is a performance trade-off due to the additional processing involved at this level of encryption.
     
  • File-system-level:
    This level of encryption provides transparency and efficient performance. It is easy to deploy, but it cannot enforce application-level policies.
     
  • Disk-level encryption:
    Disk-level encryption is easy to deploy but quite inflexible, and it only protects against physical theft of the storage media.
     

HDFS transparent encryption sits between database-level and file-system-level encryption and provides advantages such as:

  • Efficient performance,
     
  • Prevention from malicious attacks,
     
  • Securing data both at rest and in transit.

Architecture 

Let's discuss the key concepts and architecture of transparent encryption in HDFS.


Encryption zone

For performing transparent encryption in HDFS, a new abstraction is introduced, known as an encryption zone. An encryption zone is a special directory whose contents are transparently encrypted when data is written and transparently decrypted when data is read.

There can be many encryption zones, and each one is associated with a single encryption zone key at the time it is created.

Keys are the cryptographic elements that protect the data and enable encryption and decryption. You can think of them as confidential codes for securing the data.

Each file in an encryption zone has its own unique DEK (data encryption key). HDFS itself only handles EDEKs (encrypted data encryption keys), never the raw DEKs. When a client has to read or write a file, it first gets the file's EDEK decrypted (via the KMS) and then uses the resulting DEK to encrypt or decrypt the data.
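
To see what HDFS actually tracks, the hdfs crypto admin command can report a file's encryption metadata, such as the cipher suite and the file's EDEK. A minimal sketch, assuming a file already exists at the hypothetical path /NinjaZone/helloNinja (the same command appears again in Step 6 below):

hdfs crypto -getFileEncryptionInfo -path /NinjaZone/helloNinja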

Nested Encryption Zones

A nested encryption zone refers to creating encryption zones within other zones. You can think of it as boxes within boxes. Below are some of the major points regarding nested encryption zones.

  • The outermost encryption zone contains the other zones and provides a baseline level of security for everything inside it. Nested encryption provides many advantages, such as ensuring encryption throughout and providing flexibility with different keys.
     
  • Encryption of the outermost layer ensures that all the zones within it are also well encrypted and guarantees that everything is safe.
     
  • It also provides the flexibility of creating different keys for the nested zones. You can provide a unique key to each nested zone in a file system. Let's understand this through the example below.
     

Let's say you have a directory named ‘NinjaDocs’, i.e., the outermost encryption zone. Within it, there are further nested encryption zones named ‘NinjaData’ with key A and ‘NinjaReport’ with key B. Therefore, you can assign different keys to both of them.

→NinjaDocs: Root directory 
    →NinjaData: Nested Zone: Key A 
    →NinjaReport: Nested Zone: Key B
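
As a minimal sketch of how this layout could be set up (assuming a Hadoop version that supports nested encryption zones; ‘RootKey’ is just a hypothetical name for the outer zone's key, and each directory must be empty when its zone is created):

hadoop key create RootKey
hadoop key create KeyA
hadoop key create KeyB
hadoop fs -mkdir /NinjaDocs
hdfs crypto -createZone -keyName RootKey -path /NinjaDocs
hadoop fs -mkdir /NinjaDocs/NinjaData /NinjaDocs/NinjaReport
hdfs crypto -createZone -keyName KeyA -path /NinjaDocs/NinjaData
hdfs crypto -createZone -keyName KeyB -path /NinjaDocs/NinjaReport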

Key Management Server (KMS)

The key management server in Hadoop provides access to encryption keys in HDFS. It acts as a bridge between HDFS clients, the Hadoop cluster, and the external keystore. Hadoop KMS contains two main components:

  • Key server: The key server manages and stores the encryption keys and acts as the central component.
     
  • Client API: It consists of methods and functions for requesting, retrieving, or storing EDEKs (encrypted data encryption keys).

KMS provides the advantage of securely storing and managing encryption keys by integrating with external key storage systems. Below are some of the major functionalities of KMS.

  • Gives access to the stored encryption zone keys.
     
  • Acts as a central key management system.
     
  • Upon creating an encryption zone, a unique encryption key is generated for the particular zone by KMS.
     
  • Generating new EDEKs (encrypted data encryption keys), which are stored on the NameNode.
     
  • Decrypting EDEKs on request → the KMS decrypts the EDEK → and returns the resulting DEK to the HDFS client.
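
Clients locate the KMS through the key provider configured for the cluster (the hadoop.security.key.provider.path property in core-site.xml, usually a kms:// URI). As a minimal sketch, assuming a KMS running at the hypothetical host kms-host on port 9600 (the default in recent Hadoop releases), you could list the keys it manages with:

hadoop key list -metadata -provider kms://http@kms-host:9600/kms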

Steps for Transparent Encryption in HDFS 

Let's discuss how to set up transparent encryption in HDFS using the CLI commands ‘hdfs crypto’ and ‘hadoop key’. Below are the general steps that can be followed for setting up encryption in HDFS.

Step 1: Creating a new encryption key

The command below creates an encryption key named ‘NinjaKey’.

hadoop key create NinjaKey
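
To confirm that the key was created and to inspect its metadata (such as cipher and key length), you can optionally list the keys known to the configured key provider:

hadoop key list -metadata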

Step 2: Creating an encryption zone

The command below creates a directory named ‘NinjaZone’ in HDFS.

hadoop fs -mkdir /NinjaZone
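
Note that a directory can only be turned into an encryption zone while it is empty. Since we have just created it, listing it should show no contents:

hadoop fs -ls /NinjaZone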

Step 3: Assigning encryption key to the encryption zone

The key ‘NinjaKey’ is associated with the encryption zone at the path ‘/NinjaZone’.

hdfs crypto -createZone -keyName NinjaKey -path /NinjaZone
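
To verify that the zone was created with the intended key, an HDFS administrator can list all encryption zones along with their key names:

hdfs crypto -listZones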

Step 4: Setting the ownership

The command below transfers the ownership of the ‘/NinjaZone’ directory to the user ‘NinjaUser’.

hadoop fs -chown NinjaUser:NinjaUser /NinjaZone
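
You can confirm the new owner by listing the directory entry itself (the -d flag lists the directory rather than its contents):

hadoop fs -ls -d /NinjaZone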

Step 5: Storing the file in the encryption zone and reading it.

The command below is used for uploading the file (helloNinja) to the specified directory (NinjaZone).

hadoop fs -put helloNinja /NinjaZone 


The command below is used for displaying the content of the specified file (helloNinja) in the particular directory.

hadoop fs -cat /NinjaZone/helloNinja
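
The -cat command prints plain text because decryption happens transparently on the client. To convince yourself that the stored bytes are actually encrypted, the HDFS superuser can read the same file through the special /.reserved/raw prefix, which bypasses decryption and returns the raw ciphertext (a sketch, assuming superuser access):

hadoop fs -cat /.reserved/raw/NinjaZone/helloNinja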

Step 6: Retrieving the encryption information

The command below is used for accessing encryption information about the specified file (helloNinja)  in the particular directory (NinjaZone).

hdfs crypto -getFileEncryptionInfo -path /NinjaZone/helloNinja

Frequently Asked Questions

What is Transparent Encryption in HDFS?

Transparent Encryption in HDFS is used for protecting the sensitive information stored in Hadoop clusters. It refers to the mechanism of encrypting and decrypting data in HDFS without any manual intervention. Transparent encryption in HDFS is also known as client-side encryption, which means that data is encrypted and decrypted on the client side.

What is HDFS?

HDFS stands for Hadoop Distributed File System. It is a file system that stores and manages huge amounts of data across a cluster of machines. HDFS plays an important role in storing data in the Apache Hadoop framework for processing big data. Using the Hadoop Distributed File System, we can store large datasets and perform parallel processing, resulting in efficient utilization of the storage system.

What is an encryption zone?

For performing transparent encryption in HDFS, a new abstraction is introduced known as an encryption zone. An encryption zone refers to a directory where the information is encrypted transparently while writing data and decrypted for reading the data.

What is Key Management Server (KMS)?

The key management server in Hadoop provides access to encryption keys in HDFS. It acts as a bridge between HDFS clients, the Hadoop cluster, and the external keystore. KMS provides the advantage of securely storing and managing encryption keys by integrating with external key storage systems.

Conclusion

In this article, we have learned about Transparent Encryption in HDFS. We have also briefly discussed HDFS and transparent encryption in general. Moving forward, we have also looked into the key concepts and architecture of transparent encryption and how to access files within an encryption zone. To learn more about Hadoop, you can explore the related articles on our platform and take your preparation journey to the next level.
 


You can read more such descriptive articles on our platform, Coding Ninjas Studio. You will find straightforward explanations of almost every topic on the platform. So take your coding journey to the next level using Coding Ninjas.

Happy coding!
