Table of contents
1.
Introduction
2.
Hive
3.
Hive Partitioning 
4.
Static Partitioning
5.
Dynamic Partitioning in Hive
5.1.
Syntax
6.
How Dynamic Partition Works
6.1.
Step 1: Create a non-partitioned table
6.2.
Code
6.2.1.
Output
6.3.
Explanation
6.4.
Step 2: Load table
6.5.
Code
6.5.1.
Output
6.6.
Explanation
6.7.
Step 3: Create a Partitioned Table
6.8.
Code
6.8.1.
Output
6.9.
Explanation
6.10.
Step 4: Insert the Data
6.11.
Code
6.11.1.
Output
6.12.
Explanation
6.13.
Step 5: Display the table
6.14.
Code
6.14.1.
Output
6.15.
Explanation
7.
Advantages of Dynamic Partition
8.
Disadvantages of Dynamic Partition 
9.
Difference between Static and Dynamic Partition in Hive
10.
Frequently Asked Questions
10.1.
How do you handle data skew when using dynamic Partitioning?
10.2.
Can dynamic Partitioning be used with different file formats in Hive (e.g., Parquet, ORC)?
10.3.
Can dynamic Partitioning be combined with other features in Hive, such as bucketing or sorting?
10.4.
How does dynamic partitioning work in Hive?
10.5.
What are the advantages of using dynamic Partitioning over static Partitioning?
11.
Conclusion
Last Updated: Mar 27, 2024
Easy

Dynamic Partition in Hive

Author Vidhi Sareen
0 upvote
Career growth poll
Do you think IIT Guwahati certified course can help you in your career?

Introduction

Hey Ninjas!! Have you ever wondered how to restore a piece of data from a large data set? Partitioning in Hive can be used to do this. For example, you could partition a table of student marks according to the exam date to restore a student's data. Then, in this case, you can run a query instead of scanning the whole table. 

Dynamic Partition in Hive

We can use Dynamic Partitioning to restore or retrieve a specific subset of data from a large dataset. In this article, we will explore how Dynamic Partition in Hive works.

Hive

It is a data warehousing package that is built on top of Hadoop. It is primarily used in data analysis by data analysts. It runs SQL queries called HQL (Hive Query Language). The open-source data warehouse system helps analyze large data sets stored in many files. It is highly scalable. 

work flow

A user submits a query to a Hive. The query gets analyzed by Hive and converts this SQL query into Map Reduce Jobs. A megastore is attached to the Hive and provides data about the data. The Hadoop clusters receive and process the jobs. This Hadoop cluster is a combination of Master and worker nodes.The master node assigns tasks to the worker nodes, and upon job completion, the workers return the results to the system, which delivers the final result to the user. To explore more about Hive, check out the link.

Hive Partitioning 

Partitioning in Hive refers to dividing the table into small parts based on the values of a particular column, like name, date, course, city, etc. Partitioning allows effective data organization, and it improves query performance. Executing a query reduces the amount of scanned data.

example

Hadoop Distributed File System (HDFS) stores massive amounts of data in the petabyte range. It is challenging to achieve any query on this vast amount of data by Hadoop users. Hive narrows down the restraint for searching. We convert this SQL query into MapReduce Jobs and pass them to the Hadoop clusters. As a result, running MapReduce Jobs over such a vast table becomes very ineffective. There are two types of Hive Partitioning:

  • Static Partitioning
     
  • Dynamic Partitioning
     

This article will explore Dynamic Partitioning in Hive and discuss some related points.

Also see, Recursive Relationship in DBMS

Static Partitioning

It specifies each record's partition key and value while inserting data into the partitioned table. Static partitioning is commonly used when dealing with a small number of partitions. In static partitions, the individual files are loaded based on the partition. When partitioning data, we must ensure that specific data is placed accurately in the correct position. To use the Static partition, we should use the property as follows:

Mapred.mode = strict


We can use this in the hive-site.xml configuration file.

Dynamic Partitioning in Hive

Dynamic partition refers to a single insert to the partition table. There is no requirement to create the partition over the table explicitly. When using Dynamic Partitioning in Hive, multiple sub-directories are created. It provides more flexibility. It offers more flexibility. 

Syntax

To enable the dynamic partition, we use this Hive Command

Set hive.exec.dynamic.partition = true


We will enable dynamic Partitioning for our application.

Set hive.exec.dynamic.partition.mode = nonstrict


We will set the mode to non-strict to demonstrate how to perform dynamic Partitioning.

Dynamic Partitions are also called variable Partitioning. Variable Partitionary means nodes that are not pre-configured, dynamically configured during runtime based on the file size or partition. It ensures proper distribution of memory and RAM. In the dynamic partition, every individual row of data is read and assigned with the Map-reduce job. Dynamic Partitioning by disable by default in the Hive prevents accidental partitioning.

Also read - multiple granularity in dbms

How Dynamic Partition Works

Step 1: Create a non-partitioned table

First, we create a non-partitioned table that stores the data.

Code

Create table stud_report_card (
  id int, Student_name string, marks int, 
  subject string
) ROW FORMAT delimited fields terminated by “, 
” LINES TERMINATED BY ‘\n’ STORED AS TEXTFILE;

Output

output

Explanation

We initially created a non-partitioned table because the "LOAD" command cannot insert data into a partitioned table. Only the "INSERT" command inserts data into a partitioned table. When we run the "INSERT" command, the MapReduce process starts, and this method subsequently places the data in the proper partition of the table.

Step 2: Load table

Second, we have to load the table from a local system.

Code

Load data local inpath ‘/home/VIDHI/Hive/hive/details_student’ into the stud_report_card;

Output

output

Explanation

Data has been loaded into the system. We have used the ‘local’ keyword to load this data, which is present in the local system. If your data is in the HDFS, we can remove the keyword and execute our command.

NOTE: It's essential to set the property. If you do not use it, it will show you an error stating to enable the property. You can execute this code below to set the property before making a partitioned table.

Set hive.exec.dynamic.partition.mode = nonstrict;

Step 3: Create a Partitioned Table

Third, create a partitioned table where you want to insert the data with a dynamic partition.

Code

create table student_detail (
  id int, Student_name string, marks int
) partitioned by (subject string);

Output

output

Explanation

We have created a partition table named ‘student_details.’ We have defined a partition column called ‘subject.’ The values of partitions are added dynamically based on the data inserted.

Step 4: Insert the Data

Fourth, insert the data as per your requirement with the partition.

Code

Insert into student_details partition(subject) 
Select 
  id, 
  Student_name, 
  marks, 
  subject 
from 
  stud_report_card 
where 
  subject = ’Python’;

Output

output

Explanation

This command can insert data with a dynamic table partition over the column subject. We can insert data using the 'subject' column with a dynamic table partition.

Step 5: Display the table

Fifth, display the table after executing the query.

Code

Select 
  * 
from 
  student_detail;

Output

output

Explanation

This query will return all rows from the table. The '* Select' keyword helps display all rows. We observe that response time also gets improves. 

Advantages of Dynamic Partition

There are many advantages of Dynamic Partition in Hive:

  • Beneficial for loading large files into a table.
     
  • Data is read row-wise.
     
  • Partitions are created based on memory and RAM usage.
     
  • We observed a reduction in query processing time.
     
  • Load is distributed horizontally across the cluster.

Disadvantages of Dynamic Partition 

There are some disadvantages of Dynamic Partition in Hive:

  • The possibility of creating many small partitions is present.
     
  • Loading data with dynamic partitioning takes more time compared to static partitioning.
     
  • It can not perform altered operations.

Difference between Static and Dynamic Partition in Hive

Static Partitioning Dynamic Partitioning
We explicitly define partitions and their values during table creation or loading. Partitions are created automatically based on the value encountered during data insertion.
Manual creation of partition directories. Automatically created partition directories.
It is difficult to add new partition values. More flexible as compared to Static Partition.
A deep understanding of all possible partition values is required Advanced knowledge of partition values is optional. 
Having fewer partitions and less metadata overhead The high number of partitions and potential metadata overhead
Partition values are predetermined and fixed. The data itself derives the partition values.

Also see, Checkpoint in DBMS

Also read -  Aggregation in DBMS

Frequently Asked Questions

How do you handle data skew when using dynamic Partitioning?

We can handle skew data by using bucketing with dynamic Partitioning to distribute the data evenly across the partitions. We can even perform techniques like data sampling to improve query performance.

Can dynamic Partitioning be used with different file formats in Hive (e.g., Parquet, ORC)?

We can perform dynamic Partitioning with different file formats like Parquet and ORC( Optimized Row Columnar). Hive dynamic Partitioning is independent of the file format used for storing the data.

Can dynamic Partitioning be combined with other features in Hive, such as bucketing or sorting?

We can combine dynamic Partitioning with bucketing and sorting. We can use a dynamic partition with bucketing to distribute the data evenly. Sorting can increase the efficiency of the data within the partitions.

How does dynamic partitioning work in Hive?

Dynamic Partitioning in Hive automatically determines and creates partitions based on the values encountered during data insertion. There is no need for explicit partition specification in dynamic Partitioning.

What are the advantages of using dynamic Partitioning over static Partitioning?

The advantages of dynamic Partitioning are that they provide more flexibility and scalability and simplify the data-loading process. It can automatically create directories. No advanced knowledge of Partitioning is required.

Conclusion

This article has explored various aspects of Hive and Dynamic Partitioning. We have defined its overview in this article. Dynamic Partitioning in Hive is a powerful tool that provides flexibilityscalability, and simply the process of data loading. We have explained the concept of dynamic Partitioning in-depth and its working. We demonstrated its working, highlighting its automatic partition creation based on data values. 

Further, we examined the advantages and disadvantages of Dynamic Partition in Hive. Lastly, we compared static and dynamic Partitioning to understand their differences. This article provides a comprehensive overview of Hive partitioning and its dynamic variant.

To learn more about it, check out the link below.

Mining Big Data with Hive

What is Hadoop

Storing Big Data with Hbase

Apache Spark

Joins in dbms

You can find more informative articles or blogs on our platform. You can also practice more coding problems and prepare for interview questions from well-known companies on your platform, Coding Ninjas Studio.

Live masterclass