Do you think IIT Guwahati certified course can help you in your career?
No
Introduction
So what is external sort algorithm?
The external Sorting algorithm is used when a large amount of data is present. This data is so huge that it does not settle in the main memory(Random Access Memory -RAM). So, it has to stay on the hard drive. The hard drive is an external source of memory.
We use the hybrid merge sorting technique while working with external sorting. When the sorting phase starts, the chunks of data are read and sorted, then kept in the temporary file for a short period. When the merge phase begins, the system combines these files into one large file.
Basic Idea and Approach
The following Input is given:
in_file: Name of the input file. in.txt
out_file: Name of the output file, out.txt
run_sz: Size of a run that can fit in a RAM
numWays: Number of runs that we have to merge
Follow these steps to carry out the external sorting:
Read the input file so that, at most, run_sz’ elements are read simultaneously.
Use MergeSort to sort the run.
Now you should store the sorted array in a file.
Merge the sorted files.
Let us now focus on the implementation to understand what is external sort algorithm in a practical way.
Implementation in C++
// Below is the implementation of what is external sort algorithm and how that works.
#include <bits/stdc++.h>
using namespace std;
struct MinimumHeap {
// Mention the element we have to store.
int store;
// Index of the array is represented by i
int i;
};
// To swap two min heap nodes
void swap(MinimumHeap* x, MinimumHeap y);
// A class for Min Heap
class Miniheap {
// Pointer to an array of elements in the heap
MinimumHeap* harr;
// Size of min heap
int heap_sz;
public:
// A min
// Heap of a given size
Miniheap(MinimumHeap a[], int sz);
// To heapify a subtree with
// Root at the given index
void Miniheapify(int);
// To get the index of the left child
// of node at index i
int leftSide(int i) { return (2 * i + 1); }
// To get the index of the right child
// of node at index i
int rightSide(int i) { return (2 * i + 2); }
// To get the root
MinimumHeap getMin() { return harr[0]; }
// To replace the root with the new node
// x and heapify() new root
void replaceMin(MinimumHeap x)
{
harr[0] = x;
Miniheapify(0);
}
};
// Constructor: Builds a heap from
// a given array a[] of given sz
Miniheap::Miniheap(MinimumHeapa[], int sz)
{
heap_sz = sz;
harr = a; // It stores address of array
int i = (heap_sz - 1) / 2;
while (i >= 0) {
Miniheapify(i);
i--;
}
}
// A recursive method to heapify
// A subtree with root
// At a provided index.
void Miniheap::Miniheapify(int i)
{
int l = leftSide(i);
int r = rightSide(i);
int smallest = i;
if (l < heap_sz && harr[l].store < harr[i].store)
smallest = l;
if (r < heap_sz
&& harr[r].store< harr[smallest].store)
smallest = r;
if (smallest != i) {
swap(&harr[i], &harr[smallest]);
Miniheapify(smallest);
}
}
// A function to swap two elements
void swap(MinimumHeap* m, MinimumHeap* n)
{
MinimumHeap tp = *m;
*m = *n;
*n = tp;
}
// Merges two subarrays of arr[].
void mergeFunc(int arr[], int l, int m, int r)
{
int i, j, k;
int n1 = m - l + 1;
int n2 = r - m;
// Create temporary arrays
int L[n1], R[n2];
/* Now this code will copy data to temp arrays L[] and R[] */
for (i = 0; i < n1; i++)
L[i] = arr[l + i];
for (j = 0; j < n2; j++)
R[j] = arr[m + 1 + j];
// Initial index of first, second, and third subarray
i = 0;
j = 0;
k = l;
while (i < n1 && j < n2) {
if (L[i] <= R[j])
arr[k++] = L[i++];
else
arr[k++] = R[j++];
}
/* Copy the remaining elements of L[] */
while (i < n1)
arr[k++] = L[i++];
/* Copy the remaining elements of R[],
if there are any */
while (j < n2)
arr[k++] = R[j++];
}
void mergeSort(int arr[], int l, int r)
{
if (l < r) {
// This is the same as (l+r)/2, but avoids overflow for
// large l and h
int m = l + (r - l) / 2;
// Sort first and second halves
mergeSort(arr, l, m);
mergeSort(arr, m + 1, r);
mergeFunc(arr, l, m, r);
}
}
FILE* openFile(char* fileName, char* mode)
{
FILE* fp = fopen(fileName, mode);
if (fp == NULL) {
perror("We encountered an error while opening the file.\n");
exit(EXIT_FAILURE);
}
return fp;
}
// Names of files are set
// to be 1, 2, 3, ... k. It merges K files.
void Filemerge(char* output_file, int n, int k)
{
FILE* in[k];
for (int i = 0; i < k; i++) {
char fileName[2];
// Convert i to string
snprintf(fileName, szof(fileName), "%d", i);
// Open output files in read mode.
in[i] = openFile(fileName, "r");
}
// FINAL OUTPUT FILE
FILE* out = openFile(output_file, "w");
// Create a min heap with a k heap
// Nodes. Every heap node
// Has the first element of Scratch
// Output file
MinimumHeap* harr = new MinimumHeap[k];
int x;
for (x = 0;x < k; x++) {
// You should break if no output file is empty and
// index i will be equal to the number of input files
if (fscanf(in[i], "%d ", &harr[i].store) != 1)
break;
// Index of scratch output file
harr[i].i = i;
}
// Create the heap
Miniheap hp(harr, i);
int count = 0;
// Now, one by one, get the
// minimum element from min
// Heap and replace it with
// Next element.
// Run
while (count != i) {
// Get the minimum element
// And store it in the output file
MinimumHeap root = hp.getMin();
fprintf(out, "%d ", root.store);
// Find the next element
if (fscanf(in[root.i], "%d ", &root.store) != 1) {
root.store = INT_MAX;
count++;
}
// Replace root with next
// Element of input file
hp.replaceMin(root);
}
// Close input and output files
for (int i = 0; i < k; i++)
fclose(in[i]);
fclose(out);
}
// Using a merge-sort algorithm,
// Create the initial runs
// Divide them evenly among
// the output files
void createInitialRuns(char* input_file, int run_sz,
int numWays)
{
// For a big input file
FILE* in = openFile(input_file, "r");
// Output scratch files
FILE* out[numWays];
char fileName[2];
for (int i = 0; i < numWays; i++) {
// Convert i to string
snprintf(fileName, szof(fileName), "%d", i);
// Open output files in write mode.
out[i] = openFile(fileName, "w");
}
// Allocate a dynamic array large enough
// To accommodate runs of sz run_sz
int* arr = (int*)malloc(run_sz * szof(int));
bool more_input = true;
int nextOutputFile = 0;
int i;
while (more_input) {
// Write run_sz elements
for (i = 0; i < run_sz; i++) {
if (fscanf(in, "%d ", &arr[i]) != 1) {
more_input = false;
break;
}
}
// Use merge sort
mergeSort(arr, 0, i - 1);
for (int j = 0; j < i; j++)
fprintf(out[nextOutputFile], "%d ", arr[j]);
nextOutputFile++;
}
// Close input and output files
for (int i = 0; i < numWays; i++)
fclose(out[i]);
fclose(in);
}
void externalSort(char* input_file, char* output_file,
int numWays, int run_sz)
{
// Read the input file,
// Create the initial runs,
// Assign the runs to
// The scratch output files and merge
createInitialRuns(input_file, run_sz, numWays);
Filemerge(output_file, run_sz, numWays);
}
int main()
{
// The number of partitions.
int numWays = 10;
// Define The size of each partition
int run_sz = 1000;
char input_file[] = "in.txt";
char output_file[] = "out.txt";
FILE* in = openFile(input_file, "w");
srand(time(NULL));
//First of all, generate the input
for (int x = 0; x < numWays * run_sz; x++)
fprintf(in, "%d ", rand());
fclose(in);
externalSort(input_file, output_file, numWays,
run_sz);
return 0;
}
You can also try this code with Online C++ Compiler
The below picture visualizes what is external sort algorithm.
There is an essential point that we should take care of. The above code does not work well for the online compiler. This is because file creation permission is needed during the implementation. It works as expected when we run the same code on a system.
Time Complexity
O(N * log N).
The time taken in the case of merge sort is O(runs * run_sz * log run_sz), which is equal to O(N log run_sz).
The time complexity is O(N * log runs) to merge the sorted array.
The overall time complexity is O(N * log run_sz + N * log runs).
Carrying out the further calculations, log run_size + log runs = log run_size*runs = log N, the result time complexity will be O(N * log N).
O(run_sz) is the space occupied. run_sz is the space needed to store the array.
Frequently Asked Questions
What is the need for an external sort algorithm?
External Sorting is used when the input data is too large to apply traditional sorting algorithms.
How many parameters are given as input?
Four parameters are given the input file's input name, the output file's name, the run size, and the number of runs we have to merge.
What is the space complexity for external sorting?
O(run_sz) is the space occupied. Here run_sz is the space needed to store the array.
Why is this algorithm not compatible with online compilers?
This algorithm requires permission to create files. Therefore it does not work well with online compilers.
What is the time complexity for external sorting?
The time complexity will be O(N * log N).
Conclusion
This article has explained what is external sort algorithm. The external sort algorithm is used when a large amount of data is present. This data has to stay on the hard drive. We hope you got clarity on what is external sort algorithm.