UTF-8 Validation

Given data = [196, 128, 1], which represents the octet sequence: 11000101 10000010 00000001. Return true because this is a valid UTF-8 sequence for a 2-byte character followed by a 1-byte character.

The first line of input contains an integer T denoting the number of test cases. The first line of each test case contains a single integer 'N', where ‘N’ is the number of elements of the array. The second line of each test case contains 'N' space-separated integers, denoting the array/list 'DATA'.

In test case 1,The array which represented the octet sequence: 11101000 10001100 00000100. The first 3 bits are all ones, and the 4th bit is 0 means it is a 3-byte character. The next byte is a continuation byte that starts with 10, and that’s correct, but the second continuation byte does not start with 10, so it is invalid and hence we return false. In test case 2, The array which represented the octet sequence: which represents the octet sequence: 11000100 10000000 00000001. It is a valid UTF-8 encoding for a 2-byte character followed by a 1-byte character, therefore we return true.

Using Bitwise

The most important observation is for ‘N’ byte UTF 8 sequence looks like this for different ‘N’ according to the UTF-8 rules:

For ‘N’ = 1, it should be 0_ _ _ _ _ _ _

For ‘N’ = 2, it should be 1 1 0 _ _ _ _ _ 1 0 _ _ _ _ _ _

For ‘N’ = 3, it should be 1 1 1 0 _ _ _ _ 1 0 _ _ _ _ _ _ 1 0 _ _ _ _ _ _

For ‘N’ = 4, it should be 1 1 1 1 0 _ _ _ 1 0 _ _ _ _ _ _ 1 0 _ _ _ _ _ _

1 0 _ _ _ _ _ _

As long as every byte in the array is of the right type, it is a valid UTF-8 encoding.

The steps are as follows:

Start from index 0, determine each byte’s type and check its validity.
There are five kinds of valid byte type: 0**, 10**, 110**,1110**, and 11110**
Give them type numbers 0, 1, 2, 3, 4, which are the index of the first 0 from left.
So, the index of the first ‘0’ determines the byte type because that could only make a UTF-8 of 1 byte according to the rules.
If a byte belongs to one of them: if it is type 0, continue if it is type 2 or 3 or 4, check whether the following 1, 2, and 3 byte(s) are of type 1 or not.
If not, return false; else, if a byte is type 1 or not of valid type, return false.
If we found a valid set of UTF-8 sequences as described above, we will return true.

Time Complexity

O(N), Where N is the size of the given array.

Since we have to traverse the array completely. Therefore, the time complexity will be O(N).

Space Complexity

O(1).

Since we are not using extra space to keep track of the elements that are present.

Problem statement

Sample Input 1 :

Sample Output 1 :

Explanation of the Sample Output 1:

Sample Input 2 :

Sample Output 2 :