UTF-8 Validation

Easy
0/40
Average time to solve is 15m
profile
Contributed by
3 upvotes
Asked in companies
AmazonApple

Problem statement

A character in UTF-8 can be from 1 to 4 bytes long, subjected to the following rules:

1. For a 1-byte character, the first bit is a 0, followed by its Unicode code.

2. For n-bytes characters, the first n-bits are all one’s, and the n + 1 bit is 0, followed by n - 1 bytes, with the most significant 2 bits being 10.

You are given an array 'DATA' of integers representing the data, Your task is to find out whether the given array is a valid UTF-8 encoding.

Note:

The given input contains only integers, and only the least significant 8 bits of each integer is used to store the data. This means each integer represents only 1 byte of data.
Example:
Given data = [196, 128, 1], which represents the octet sequence: 11000101 10000010 00000001.

Return true because this is a valid UTF-8 sequence for a 2-byte character followed by a 1-byte character.
Detailed explanation ( Input/output format, Notes, Images )

Input format :

The first line of input contains an integer T denoting the number of test cases.

The first line of each test case contains a single integer 'N', where ‘N’ is the number of elements of the array.

The second line of each test case contains 'N' space-separated integers, denoting the array/list 'DATA'.

Output format :

For each test case, return whether the array denotes a correct UTF-8 sequence or not.

The output of each test case will be printed in a separate line.

Note:

You don’t need to print anything, it has been already taken care of. You just need to implement the given function.

Constraints:

1 <= T <= 5
1 <= N <= 4
1 <= DATA[ i ] < 256

Where ‘DATA[ i ]’ is array element at index ‘i’.

Time limit: 1 sec
Sample Input 1 :
2
3
232 136 4
3
196 128 1
Sample Output 1 :
false
true
Explanation of the Sample Output 1:
In test case 1,The array which represented the octet sequence: 11101000 10001100 00000100.

The first 3 bits are all ones, and the 4th bit is 0 means it is a 3-byte character.

The next byte is a continuation byte that starts with 10, and that’s correct, but the second continuation byte does not start with 10, so it is invalid and hence we return false.

In test case 2, The array which represented the octet sequence: which represents the octet sequence: 11000100 10000000 00000001.

It is a valid UTF-8 encoding for a 2-byte character followed by a 1-byte character, therefore we return true.
Sample Input 2 :
2
1 
1
1
255
Sample Output 2 :
true
false
Hint

Can you check for validity by determining the types of the bytes?

Approaches (1)
Using Bitwise

The most important observation is for ‘N’ byte UTF 8 sequence looks like this for different ‘N’ according to the UTF-8 rules: 

 

For ‘N’ = 1, it should be 0_ _ _ _ _ _ _

For ‘N’ = 2, it should be 1 1 0 _ _ _ _ _ 1 0 _ _ _ _ _ _ 

For ‘N’ = 3, it should be 1 1 1 0 _ _ _ _ 1 0 _ _ _ _ _ _  1 0 _ _ _ _ _ _

For ‘N’ = 4, it should be 1 1 1 1 0 _ _ _  1 0 _ _ _ _ _ _  1 0 _ _ _ _ _ _  

1 0 _ _ _ _ _ _  

 

As long as every byte in the array is of the right type, it is a valid UTF-8 encoding.    

 

The steps are as follows:

 

  • Start from index 0, determine each byte’s type and check its validity.
  • There are five kinds of valid byte type: 0**, 10**, 110**,1110**, and 11110**
  • Give them type numbers 0, 1, 2, 3, 4, which are the index of the first 0 from left.
  • So, the index of the first ‘0’ determines the byte type because that could only make a UTF-8 of 1 byte according to the rules.
  • If a byte belongs to one of them: if it is type 0, continue if it is type 2 or 3 or 4, check whether the following 1, 2, and 3 byte(s) are of type 1 or not.
  • If not, return false; else, if a byte is type 1 or not of valid type, return false.
  • If we found a valid set of UTF-8 sequences as described above, we will return true.
Time Complexity

O(N), Where N is the size of the given array.

 

Since we have to traverse the array completely. Therefore, the time complexity will be O(N).

Space Complexity

O(1).

 

Since we are not using extra space to keep track of the elements that are present.

Code Solution
(100% EXP penalty)
UTF-8 Validation
Full screen
Console