Table of contents
1.
Introduction
2.
Basic Idea and Approach 
3.
Implementation in C++
3.1.
Input File
3.2.
Output File
4.
Time Complexity
5.
Space Complexity 
6.
Frequently Asked Questions 
6.1.
What is the need for an external sort algorithm? 
6.2.
How many parameters are given as input?
6.3.
What is the space complexity for external sorting?
6.4.
Why is this algorithm not compatible with online compilers?
6.5.
What is the time complexity for external sorting?
7.
Conclusion
Last Updated: Mar 27, 2024
Hard

What is external sort algorithm?

Author Geetika Dua
0 upvote
Career growth poll
Do you think IIT Guwahati certified course can help you in your career?

Introduction

So what is external sort algorithm?

The external Sorting algorithm is used when a large amount of data is present. This data is so huge that it does not settle in the main memory(Random Access Memory -RAM). So, it has to stay on the hard drive. The hard drive is an external source of memory. 
 

Introduction

We use the hybrid merge sorting technique while working with external sorting. When the sorting phase starts, the chunks of data are read and sorted, then kept in the temporary file for a short period. When the merge phase begins, the system combines these files into one large file.

Basic Idea and Approach 

The following Input is given: 

  • in_file: Name of the input file. in.txt
  • out_file: Name of the output file, out.txt
  • run_sz: Size of a run that can fit in a RAM
  • numWays: Number of runs that we have to merge

Follow these steps to carry out the external sorting:

  • Read the input file so that, at most, run_sz’ elements are read simultaneously. 
  • Use MergeSort to sort the run.
  • Now you should store the sorted array in a file.
  • Merge the sorted files.

Let us now focus on the implementation to understand what is external sort algorithm in a practical way.

Implementation in C++

// Below is the implementation of what is external sort algorithm and how that works.

#include <bits/stdc++.h>
using namespace std;
 
struct MinimumHeap {
 
    // Mention the element we have to store.
    int store;
 
    // Index of the array is represented by i
    int i;
};
// To swap two min heap nodes
void swap(MinimumHeap* x, MinimumHeap y);
 
// A class for Min Heap
class Miniheap {
 
    // Pointer to an array of elements in the heap
    MinimumHeap* harr;
 
    // Size of min heap
    int heap_sz;
 
public:
 
    // A min
    // Heap of a given size
    Miniheap(MinimumHeap a[], int sz);
 
    // To heapify a subtree with
    // Root at the given index
    void Miniheapify(int);
 
    // To get the index of the left child
    // of node at index i
    int leftSide(int i) { return (2 * i + 1); }
 
    // To get the index of the right child
    // of node at index i
    int rightSide(int i) { return (2 * i + 2); }
 
    // To get the root
    MinimumHeap getMin() { return harr[0]; }
 
    // To replace the root with the new node
    // x and heapify() new root
    void replaceMin(MinimumHeap x)
    {
        harr[0] = x;
        Miniheapify(0);
    }
};
 
// Constructor: Builds a heap from
// a given array a[] of given sz
Miniheap::Miniheap(MinimumHeapa[], int sz)
{
    heap_sz = sz;
    harr = a; // It stores address of array
    int i = (heap_sz - 1) / 2;
    while (i >= 0) {
        Miniheapify(i);
        i--;
    }
}
 
// A recursive method to heapify
// A subtree with root
// At a provided index.
void Miniheap::Miniheapify(int i)
{
    int l = leftSide(i);
    int r = rightSide(i);
    int smallest = i;
 
    if (l < heap_sz && harr[l].store < harr[i].store)
        smallest = l;
 
    if (r < heap_sz
        && harr[r].store< harr[smallest].store)
        smallest = r;
 
    if (smallest != i) {
        swap(&harr[i], &harr[smallest]);
        Miniheapify(smallest);
    }
}
 
// A function to swap two elements
void swap(MinimumHeap* m, MinimumHeap* n)
{
    MinimumHeap tp = *m;
    *m = *n;
    *n = tp;
}
 
// Merges two subarrays of arr[].
void mergeFunc(int arr[], int l, int m, int r)
{
    int i, j, k;
    int n1 = m - l + 1;
    int n2 = r - m;
 
    // Create temporary arrays 
    int L[n1], R[n2];
 
    /* Now this code will copy data to temp arrays L[] and R[] */
    for (i = 0; i < n1; i++)
        L[i] = arr[l + i];
    for (j = 0; j < n2; j++)
        R[j] = arr[m + 1 + j];
    // Initial index of first, second, and third subarray
    i = 0;
    j = 0;
    k = l;
    while (i < n1 && j < n2) {
        if (L[i] <= R[j])
            arr[k++] = L[i++];
        else
            arr[k++] = R[j++];
    }
 
    /* Copy the remaining elements of L[] */
    while (i < n1)
        arr[k++] = L[i++];
 
    /* Copy the remaining elements of R[],
        if there are any */
    while (j < n2)
       arr[k++] = R[j++];
}
 
 
void mergeSort(int arr[], int l, int r)
{
    if (l < r) {
        // This is the same as (l+r)/2, but avoids overflow for
        // large l and h
        int m = l + (r - l) / 2;
 
        // Sort first and second halves
        mergeSort(arr, l, m);
        mergeSort(arr, m + 1, r);
 
        mergeFunc(arr, l, m, r);
    }
}
 
FILE* openFile(char* fileName, char* mode)
{
    FILE* fp = fopen(fileName, mode);
    if (fp == NULL) {
        perror("We encountered an error while opening the file.\n");
        exit(EXIT_FAILURE);
    }
    return fp;
}
 
// Names of files are set
// to be 1, 2, 3, ... k. It merges K files.
void Filemerge(char* output_file, int n, int k)
{
    FILE* in[k];
    for (int i = 0; i < k; i++) {
        char fileName[2];
 
        // Convert i to string
        snprintf(fileName, szof(fileName), "%d", i);
 
        // Open output files in read mode.
        in[i] = openFile(fileName, "r");
    }
 
    // FINAL OUTPUT FILE
    FILE* out = openFile(output_file, "w");
 
    // Create a min heap with a k heap
    // Nodes. Every heap node
    // Has the first element of Scratch
    // Output file
    MinimumHeap* harr = new MinimumHeap[k];
    int x;
    for (x = 0;x < k; x++) {
        // You should break if no output file is empty and
        // index i will be equal to the number of input files
        if (fscanf(in[i], "%d ", &harr[i].store) != 1)
            break;
 
        // Index of scratch output file
        harr[i].i = i;
    }
    // Create the heap
    Miniheap hp(harr, i);
 
    int count = 0;
 
    // Now, one by one, get the
    // minimum element from min
    // Heap and replace it with
    // Next element.
    // Run 

    while (count != i) {
        // Get the minimum element
        // And store it in the output file
        MinimumHeap root = hp.getMin();
        fprintf(out, "%d ", root.store);
 
        // Find the next element 
        if (fscanf(in[root.i], "%d ", &root.store) != 1) {
            root.store = INT_MAX;
            count++;
        }
 
        // Replace root with next
        // Element of input file
        hp.replaceMin(root);
    }
 
    // Close input and output files
    for (int i = 0; i < k; i++)
        fclose(in[i]);
 
    fclose(out);
}
 
// Using a merge-sort algorithm,
// Create the initial runs
// Divide them evenly among
// the output files
void createInitialRuns(char* input_file, int run_sz,
                    int numWays)
{
    // For a big input file
    FILE* in = openFile(input_file, "r");
 
    // Output scratch files
    FILE* out[numWays];
    char fileName[2];
    for (int i = 0; i < numWays; i++) {
        // Convert i to string
        snprintf(fileName, szof(fileName), "%d", i);
 
        // Open output files in write mode.
        out[i] = openFile(fileName, "w");
    }
 
    // Allocate a dynamic array large enough
    // To accommodate runs of sz run_sz
    int* arr = (int*)malloc(run_sz * szof(int));
 
    bool more_input = true;
    int nextOutputFile = 0;
 
    int i;
    while (more_input) {
        // Write run_sz elements
    
        for (i = 0; i < run_sz; i++) {
            if (fscanf(in, "%d ", &arr[i]) != 1) {
                more_input = false;
                break;
            }
        }
 
        // Use merge sort
        mergeSort(arr, 0, i - 1);
        for (int j = 0; j < i; j++)
            fprintf(out[nextOutputFile], "%d ", arr[j]);
 
        nextOutputFile++;
    }
 
    // Close input and output files
    for (int i = 0; i < numWays; i++)
        fclose(out[i]);
 
    fclose(in);
}
 
void externalSort(char* input_file, char* output_file,
                int numWays, int run_sz)
{
    // Read the input file,
    // Create the initial runs,
    // Assign the runs to
    // The scratch output files and merge
    createInitialRuns(input_file, run_sz, numWays);
    Filemerge(output_file, run_sz, numWays);
}
int main()
{
    // The number of partitions.
    int numWays = 10;
 
    // Define The size of each partition
    int run_sz = 1000;
 
    char input_file[] = "in.txt";
    char output_file[] = "out.txt";
 
    FILE* in = openFile(input_file, "w");
 
    srand(time(NULL));
 
    //First of all, generate the  input
    for (int x = 0; x < numWays * run_sz; x++)
        fprintf(in, "%d ", rand());
 
    fclose(in);
 
    externalSort(input_file, output_file, numWays,
                run_sz);
 
    return 0;
}
You can also try this code with Online C++ Compiler
Run Code

Input File

Input

Output File

Output

The below picture visualizes what is external sort algorithm.

Working of algo

There is an essential point that we should take care of. The above code does not work well for the online compiler. This is because file creation permission is needed during the implementation. It works as expected when we run the same code on a system.

Time Complexity

O(N * log N). 

  • The time taken in the case of merge sort is O(runs * run_sz * log run_sz), which is equal to O(N log run_sz). 
  • The time complexity is O(N * log runs) to merge the sorted array. 
  • The overall time complexity is O(N * log run_sz + N * log runs). 
  • Carrying out the further calculations, log run_size + log runs = log run_size*runs = log N, the result time complexity will be O(N * log N).

Read More - Time Complexity of Sorting Algorithms

Space Complexity 

O(run_sz) is the space occupied. run_sz is the space needed to store the array.

Frequently Asked Questions 

What is the need for an external sort algorithm? 

External Sorting is used when the input data is too large to apply traditional sorting algorithms.

How many parameters are given as input?

Four parameters are given the input file's input name, the output file's name, the run size, and the number of runs we have to merge.

What is the space complexity for external sorting?

 O(run_sz) is the space occupied. Here run_sz is the space needed to store the array.

Why is this algorithm not compatible with online compilers?

This algorithm requires permission to create files. Therefore it does not work well with online compilers.

What is the time complexity for external sorting?

The time complexity will be O(N * log N).

Conclusion

This article has explained what is external sort algorithm. The external sort algorithm is used when a large amount of data is present. This data has to stay on the hard drive. We hope you got clarity on what is external sort algorithm.

Recommended Problem - Merge K Sorted Arrays

Check our coding-related articles on the links:


Refer to our guided paths on Coding Ninjas Studio to learn more about DSA, Competitive Programming, JavaScript, System Design, etc. Enroll in our courses and refer to the mock test and problems available. Take a look at the interview experiences and interview bundle for placement preparations.

Happy Coding!

Live masterclass