Code360 powered by Coding Ninjas X Naukri.com. Code360 powered by Coding Ninjas X Naukri.com
Table of contents
1.
Introduction
2.
Problem Statement
2.1.
Example 
3.
Solution Approach
3.1.
Approach 1: Naive Approach
3.2.
Python Implementation 
3.2.1.
Input
3.2.2.
Output
3.3.
Complexities
3.3.1.
Time Complexity
3.3.2.
Space Complexity
3.4.
Approach 2: KMP Pattern Searching algorithm
3.5.
Python Implementation
3.5.1.
Input
3.5.2.
Output
3.6.
Complexities
3.6.1.
Time Complexity
3.6.2.
Space Complexity
3.7.
Approach 3: Rabin Karp Algorithm
3.8.
Python Implementation
3.8.1.
Input
3.8.2.
Output
3.9.
Complexity
3.9.1.
Time Complexity
3.9.2.
Space Complexity
4.
FAQs
5.
Key Takeaways
Last Updated: Mar 27, 2024

Inappropriate Words Filter

Author Teesha Goyal
0 upvote
Master Python: Predicting weather forecasts
Speaker
Ashwin Goyal
Product Manager @

Introduction

Inappropriate words are wrong, indecent words that should not be used in the workplace. This article deals with filtering inappropriate words in the given text and returns 'passed' if the text does not contain any inappropriate word as a substring.

Problem Statement

You are given a set of inappropriate words as an input and then check whether the input text contains any of those words or not. 
The first line consists of an integer n, the number of inappropriate words to be checked, followed by n lines, each containing one word. The following line contains an integer m which is the number of lines of text to be checked. The following m lines include the text.
The output gives two space-separated integers, the inappropriate word's line, and position. If the text has no inappropriate words, the output is 'Passed'.

Example 

Input

4
dear
dream
baby
lovingly
5
I was feeling sleepy, so I went to bed early
I did my toothbrush, and the next thing I
remember, is being a spaceship, 
which I was dreaming about floating in
It was a good dream and a great sleep.


Output

4 13


Explanation

Here, at line 4, position 13, the substring dream occurs, which was passed as an inappropriate word earlier hence the output 4 13.

Get the tech career you deserve, faster!
Connect with our expert counsellors to understand how to hack your way to success
User rating 4.7/5
1:1 doubt support
95% placement record
Akash Pal
Senior Software Engineer
326% Hike After Job Bootcamp
Himanshu Gusain
Programmer Analyst
32 LPA After Job Bootcamp
After Job
Bootcamp

Solution Approach

Approach 1: Naive Approach

The brute force solution idea is very straightforward, and we will discuss that here. The idea is to traverse first for each line in the text provided and then search for each word from the inappropriate word if it is present as the substring in the text or not. 

If the word is found, then the line and position are printed, and if not, then "Passed" is printed.

The steps of implementation are:

  • Take the inputs of n, n number of inappropriate words in words_list, m, m lines of text in the text list.
  • Traverse over each line or each element from the text list 
  • For each word from the words_list, check for its presence in the text line.
  • Call search function to check whether the word is present or not in the text line.
  • The search function will take two arguments - pattern and text, declare m as the length of pattern and n as the length of text.
  • Now implement a for loop to slide the word over the text line; this loop will run from 0 to n-m+1 
  • For each index in the outer loop, run another while loop to traverse the pattern length and search for the pattern in the text.
  • Return the index value if the word is found or return -1 if not found
  • Check whether the return value is not equal to -1 and print the index+1 and return value of the search function. 
  • If the return value is -1, then print 'Passed.'

Python Implementation 

def search(pat, text):
    pl = len(pat)
    tl = len(text)
   
    #loop to slide pat over the text one by one
    for i in range(tl-pl+1):
        j = 0
       
        #check for the pattern at the current index i
        while(j<pl):
            if pat[j] != text[i+j] :
                break
            j += 1
           
        #return the i if complete pattern is found in text at i
        if j == pl:
            return(i)
           
    #return -1 if not found
    return -1
   
def main():
    #creating an empty list of inappropriate words and text
    words_list = []  
    text = []
   
    #Taking the value of n as input
    n = int(input())
   
    #taking input of n words
    for i in range(n):
        words_list.append(input())
       
    #Taking the value of m
    m = int(input())
   
    #Taking input of text
    for i in range(m):
        text.append(input())
   
    #loop for each line in text
    for index, text_line in enumerate(text):
       
        #loop for each pattern in words_list
        for pat in words_list:
            tmp = search(pat, text_line)
            if tmp != -1:
               
                #printing the line and position
                print(index+1, tmp+1, sep = ' ')
                return
       
    # if no inappropriate word is found, print "Passed."
    print("Passed")

#execute main
main()


Input

4
dear
dream
Baby
lovingly
5
I was feeling sleepy, so I went to bed early
I did my toothbrush, and the next thing I
remember, is being in a spaceship, 
which I was dreaming about floating in
It was a good dream and a great sleep.


Output

4 13

Complexities

Time Complexity

O(m*n*(tl-pl+1)*pl), where m is the number of lines, n is the number of words, tl is the length of a line in the search function, and pl is the length of a word in the search function.

Reason: In the search function, we have two loops, so the complexity becomes O((tl-pl+1)*pl), then in the main function, first of all, we are traversing for the number of lines and then for the number of words, so that gives the overall complexity of O(m*n*(tl-pl+1)*pl)

Space Complexity

O(n+m), where n is the number of words and m is the number lines in the text

Reason: To store the n number of words, we have created the words_list, and to keep the m lines of text, we have created the text list. Both the list depends on the value of n and m.

Approach 2: KMP Pattern Searching algorithm

We have seen that the brute force approach is not good enough for large values of m and n. KMP is the abbreviation for Knuth Morris Pratt. It makes use of the degenerating property, which reduces the algorithm's time complexity and optimizes the program. The algorithm takes into consideration the fact that while looking for a pattern, we also look at part of the characters in the following window, so this algorithm uses this information to avoid checking the windows which we already know are not going to match.

The implementation of this algorithm has two parts- the first is the creation of a temporary list for each word, and the second is the search part.

The Steps of implementation for preprocessing part:

  • For each word in the words_list, do the following.
  • Create a list of lengths same as the length of the word with all elements as 0 (say, temp)
  • The value at the first index is always 0
  • Loop over the word with i and initialize j =0; if word[i] is equal to the word[j], increment both i and j and store the value of j to the temp[i]
  • If word[i] is not equal to the word[j] and j is not equal to 0, then change j to the value of temp[j-1] and continue the process
  • Else If j is 0 then assign temp[i] to 0 and increment i.
     

This implementation will create a temporary list that gives the length of the suffix, which is also a prefix for each position in the word.

The steps of implementation for the search part:

  • Assign the length of the word to m and the size of the line of text to n 
  • Initialize both i and j with 0 to traverse the text line and word respectively and then loop over i while i is less than n
  • Check if word[j] is equal to text[i], if equal then increment both i and j
  • Now, if j is equal to m, then the word is found, and we return the index and exit the function
  • Now, if word[j] is not equal to text[i] and i is still less than n, then we reassign j the value of temp[j-1] if j is not equal to 0. Otherwise, i is incremented.

Python Implementation

def create_temp(pat, temp):
    m = len(pat)
    j, i = 0, 1
 
    #loop to traverse over the text line
    while i < m:
        #if the value is equal increment both i and j and stores the value at j in temp[i]
        if pat[i]== pat[j]:
            j += 1
            temp[i] = j
            i += 1
        else:
           
            if j == 0:
                temp[i] = 0
                i += 1

            else:
                j = temp[j-1]
               

def KMPsearch(pat, text,temp):
    m = len(pat)
    n = len(text)
    #Initialize i and j with 0 to traverse the text and pat respectively
    j, i = 0, 0

    #loop over to traverse the text
    while i < n:
        if pat[j] == text[i]:
            i += 1
            j += 1

        #check if pattern is found or not and return index if found
        if j == m:
            return i-j
 
        elif pat[j] != text[i] and i < n:

            if j != 0:
                j = temp[j-1]
            else:
                i += 1

    return -1

def main():
    #creating an empty list of inappropriate words and text
    words_list = []  
    text = []
   
    #Taking the Inputs
    n = int(input())

    for i in range(n):
        words_list.append(input())

    m = int(input())

    for i in range(m):
        text.append(input())

    #To create temporary list for each word
    temp_list =[]
    for pat in words_list:
        temp = [0] * len(pat)
        create_temp(pat, temp)
        temp_list.append(temp)

    # Loop over each line in the text
    for index, text_line in enumerate(text):

        #loop over each word to check for its presence in the text line
        for index2, pat in enumerate(words_list):
            temp = temp_list[index2]
            res = KMPsearch(pat,text_line,temp)

            if res != -1:
                print(index+1, res+1, sep = ' ')
                return
   
    print("Passed")

#execute main
main()


Input

4
dear
dream
Baby
lovingly
5
I was feeling sleepy, so I went to bed early
I did my toothbrush, and the next thing I
remember, is being in a spaceship, 
which I was dreaming about floating in
It was a good dream and a great sleep.


Output

4 13

Complexities

Time Complexity

O(n*m*l), where n is the number of words and m is the number of lines, and l is the length of the longest line of text.

Reason: We have to traverse for each line and then for each word and then traverse the line of text; hence there are three loops; therefore, the complexity of this algorithm is O(n*m*l)

Space Complexity

O(n+m+n), where n is the number of words and m is the number lines in the text

Reason: To store the n number of words, we have created the words_list, and to keep the m lines of text, we have created the text list. Also, the temporary list is created, so another n. Both the list depends on the value of n and m.

Approach 3: Rabin Karp Algorithm

The Rabin Karp approach to pattern search is based on the hash function notion. The hash value of the pattern is compared to the hash value of the substrings of the text to be checked. For all the substrings whose hash value matches the pattern, individual characters are traversed and checked.

The steps of implementation are:

  • Take the input for the number of words n, n number of words, the number of lines in text m, and m lines of text.
  • Loop over the text line to take into account each line and then loop over the words_list to consider each word
  • Take the value of d, the number of characters equal to 256, the value of q, any prime number (say101)
  • Calculate the value of h 
  • calculate the hash value of the pattern as well as the first window 
  • Slide over the text and check if the hash value of the window matches the hash value of the pattern. If the value matches, then check the individual characters to see if the pattern matches the window
  • If the pattern matches the window, return the index and exit loop.
  • Otherwise, calculate the hash value of the next window by adding the leading character and removing the trailing character.
  • Repeat the step until a pattern matches or all the windows are traversed

Python Implementation

def RKsearch(pat, text):
    d = 256  #number of characters in alphabet generally 256
    q = 101  #any prime number which will control integer overflow
    m = len(pat)
    n = len(text)
    i, j = 0, 0  # to traverse text and pattern
    phash = 0    # to store the hash value for pattern
    thash = 0    # to store the hash value for txt
    h = 1        # initialize

    #calculate the value of hash h
    for i in range(m-1):
        h = (h*d)%q

    #calculate the hash value of pattern
    for i in range(m):
        phash = (d*phash + ord(pat[i]))%q

    #calculate the hash value of first window
    for i in range(m):
        thash = (d*thash + ord(text[i]))%q

    #Loop over to traverse the text
    for i in range(n-m+1):

        #first check if the hash values of the current window match with the hash value of the pattern
        if thash == phash:
            #now check for the individual characters
            for j in range(m):
                if text[i+j] != pat[j]:
                    break
                else: j+=1

                if j == m:
                    return(i)   #if the pattern is found return index

        #now we calculate the hash of next window
        if i < n-m:
            thash = (d*(thash-ord(text[i])*h) + ord(text[i+m]))%q   #add the leading character and remove the trailing character

        #check if the current hash is negative
        if thash < 0:
            thash += q #use modular ability to change it to positive

    return -1


def main():
    #creating an empty list of inappropriate words and text
    words_list = []  
    text = []
   
    #Taking the Inputs
    n = int(input())

    for i in range(n):
        words_list.append(input())

    m = int(input())

    for i in range(m):
        text.append(input())

    #loop for each line in text
    for index, text_line in enumerate(text):
       
        #loop for each pattern in words_list
        for pat in words_list:
            tmp = RKsearch(pat, text_line)
            if tmp != -1:
                #printing the line and position
                print(index+1, tmp+1, sep = ' ')
                return
       
    # if no inappropriate word is found, print "Passed."
    print("Passed")


#execute main
main()


Input

4
dear
dream
Baby
lovingly
5
I was feeling sleepy, so I went to bed early
I did my toothbrush, and the next thing I
remember, is being in a spaceship, 
which I was dreaming about floating in
It was a good dream and a great sleep.


Output

4 13

Complexity

Time Complexity

O(n*m*lw*lt), where n is the number of words, m is the number of lines in the text, lw is the length of the longest word, and lt is the length of the longest line of text.

Reason: We have to traverse for each line and then for each word, so that gives the n*m complexity. Now each line and word, we have to traverse the text to slide the window and get substrings, and then when the hash values match, the word is crossed to check for individual characters.

Space Complexity

O(n+m), where n is the number of words and m is the number lines in the text

Reason: To store the n number of words, we have created the words_list, and to keep the m lines of text, we have created the text list. Both the list depends on the value of n and m.

FAQs

  1. What is hashing?
    The process of converting a given key into another value is known as hashing. Here we have converted the string into numeric value to reduce the computation cost and reduce the run time.
     
  2. What is String matching algorithms?
    It is a type of algorithm where we find one, several, or all occurrences of a pattern in the given string. Knuth Morris Pratt and Rabin Karp are examples of such algorithms.
     
  3. How is the hash value of the next window calculated?
    To calculate the hash value of the next window, the hash of the trailing character is removed, and the hash of the leading character is added to the current hash value. 
     
  4. What are spurious hits in Rabin-Karp?
    Spurious hits in Rabin Karp occur when the hash values of the pattern and the window match, but the actual window does not match the pattern.
     
  5. What are the worst case and best case complexity of the Rabin Karp algorithm?
    The best-case complexity of the Rabin Karp algorithm is O(m+n), and the worst-case complexity is O(nm).

Key Takeaways

In this article, we learned how to solve the problem of "Inappropriate Words filtering". The brute force approach is not suitable for larger values, so we use optimized algorithms. The KMP and Rabin Karp are the two algorithms that optimize the problem and give a better and more efficient solution.

Are you planning to ace the interviews of reputed product-based companies like Amazon, Google, Microsoft, and more? 
Attempt our Online Mock Test Series on Coding Ninjas Studio now!

Happy Coding!

Previous article
The number of times array B appears as a subarray in A
Next article
Decode Morse Code
Live masterclass