A character in UTF-8 can be from 1 to 4 bytes long, subject to the following rules:
1. For a 1-byte character, the first bit is 0, followed by its Unicode code point.
2. For an n-byte character, the first n bits of the first byte are all 1's and the (n + 1)-th bit is 0, followed by n - 1 bytes whose most significant 2 bits are 10.
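As an illustration of the rules above (a sketch only; the helper name is my own, not part of the problem), the number of leading 1 bits in a byte tells you its type:

```python
# Count consecutive 1 bits from the most significant bit of an 8-bit value.
# 0 leading ones  -> a 1-byte character
# 2..4 leading ones -> the leading byte of a 2- to 4-byte character
# exactly 1 leading one -> a continuation byte (can never start a character)
def leading_ones(byte):
    count = 0
    mask = 0x80  # start at the most significant bit: 10000000
    while mask and (byte & mask):
        count += 1
        mask >>= 1
    return count

print(leading_ones(0b00000001))  # 0 -> 1-byte character
print(leading_ones(0b11000100))  # 2 -> leading byte of a 2-byte character
print(leading_ones(0b10000000))  # 1 -> continuation byte
```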
You are given an array 'DATA' of integers representing the data. Your task is to find out whether the given array is a valid UTF-8 encoding.
Note:
The given input contains only integers, and only the least significant 8 bits of each integer are used to store the data. This means each integer represents only 1 byte of data.
Example:
Given data = [196, 128, 1], which represents the octet sequence: 11000100 10000000 00000001.
Return true because this is a valid UTF-8 sequence for a 2-byte character followed by a 1-byte character.
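The octets in the example can be double-checked with a quick sketch (plain Python string formatting, not part of the required solution):

```python
# Print each integer of the example as an 8-bit binary string.
for value in [196, 128, 1]:
    print(format(value, '08b'))
# 11000100
# 10000000
# 00000001
```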
Input format :
The first line of input contains an integer T denoting the number of test cases.
The first line of each test case contains a single integer 'N', where 'N' is the number of elements of the array.
The second line of each test case contains 'N' space-separated integers, denoting the array/list 'DATA'.
Output format :
For each test case, return whether the array denotes a valid UTF-8 sequence or not.
The output of each test case will be printed on a separate line.
Note:
You don’t need to print anything; it has already been taken care of. You just need to implement the given function.
Constraints:
1 <= T <= 5
1 <= N <= 4
1 <= DATA[ i ] < 256
Where ‘DATA[ i ]’ is array element at index ‘i’.
Time limit: 1 sec
Sample Input 1 :
2
3
232 136 4
3
196 128 1
Sample Output 1 :
false
true
In test case 1, the array represents the octet sequence: 11101000 10001000 00000100.
The first 3 bits are all 1's and the 4th bit is 0, which means it is a 3-byte character.
The next byte is a continuation byte that starts with 10, which is correct, but the second continuation byte does not start with 10, so the sequence is invalid and hence we return false.
In test case 2, the array represents the octet sequence: 11000100 10000000 00000001.
It is a valid UTF-8 encoding for a 2-byte character followed by a 1-byte character, therefore we return true.
Sample Input 2 :
2
1
1
1
255
Sample Output 2 :
true
false
Can you check for validity by determining the types of the bytes?
The most important observation is that an 'N'-byte UTF-8 sequence looks like this for different values of 'N', according to the UTF-8 rules:
For ‘N’ = 1, it should be 0_ _ _ _ _ _ _
For ‘N’ = 2, it should be 1 1 0 _ _ _ _ _ 1 0 _ _ _ _ _ _
For ‘N’ = 3, it should be 1 1 1 0 _ _ _ _ 1 0 _ _ _ _ _ _ 1 0 _ _ _ _ _ _
For ‘N’ = 4, it should be 1 1 1 1 0 _ _ _ 1 0 _ _ _ _ _ _ 1 0 _ _ _ _ _ _ 1 0 _ _ _ _ _ _
As long as every byte in the array is of the right type, it is a valid UTF-8 encoding.
The steps are as follows:
1. Iterate over the array from left to right.
2. For the current byte, count the number of leading 1 bits to determine its type.
3. If the count is 0, it is a valid 1-byte character; move to the next byte.
4. If the count is 1 or greater than 4, the byte cannot start a character, so return false.
5. Otherwise, the count 'K' is the length of the character; the next 'K' - 1 bytes must all start with 10. If any of them does not (or fewer than 'K' - 1 bytes remain), return false; otherwise skip past them.
6. If the end of the array is reached without a violation, return true.
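The approach above can be sketched in Python (a sketch only; the function and variable names are my own, not the problem's required signature):

```python
# Validate a list of integers as a UTF-8 byte sequence by checking
# that every byte is of the right type.
def check_utf8(data):
    i = 0
    n = len(data)
    while i < n:
        byte = data[i] & 0xFF  # only the least significant 8 bits matter
        # Count leading 1 bits to determine the byte's type.
        ones = 0
        mask = 0x80
        while mask and (byte & mask):
            ones += 1
            mask >>= 1
        if ones == 0:              # 1-byte character: 0_______
            i += 1
            continue
        if ones == 1 or ones > 4:  # lone continuation byte or over-long prefix
            return False
        if i + ones > n:           # not enough bytes left for this character
            return False
        # The next ones - 1 bytes must all start with 10.
        for j in range(i + 1, i + ones):
            if (data[j] & 0xC0) != 0x80:
                return False
        i += ones
    return True

print(check_utf8([196, 128, 1]))   # True
print(check_utf8([232, 136, 4]))   # False
print(check_utf8([255]))           # False
print(check_utf8([1]))             # True
```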
O(N), where 'N' is the size of the given array.
Since we traverse the array completely, the time complexity will be O(N).
O(1).
Since we use only a constant amount of extra space, the space complexity will be O(1).