Last Updated: 13 Mar, 2021

# UTF-8 Validation

Easy

## Problem statement

#### Note:

``````The given input contains only integers, and only the least significant 8 bits of each integer is used to store the data. This means each integer represents only 1 byte of data.
``````
##### Example:
``````Given data = [196, 128, 1], which represents the octet sequence: 11000101 10000010 00000001.

Return true because this is a valid UTF-8 sequence for a 2-byte character followed by a 1-byte character.
``````

#### Input format :

``````The first line of input contains an integer T denoting the number of test cases.

The first line of each test case contains a single integer 'N', where ‘N’ is the number of elements of the array.

The second line of each test case contains 'N' space-separated integers, denoting the array/list 'DATA'.
``````

#### Output format :

``````For each test case, return whether the array denotes a correct UTF-8 sequence or not.

The output of each test case will be printed in a separate line.
``````

#### Note:

``````You don’t need to print anything, it has been already taken care of. You just need to implement the given function.
``````

#### Constraints:

``````1 <= T <= 5
1 <= N <= 4
1 <= DATA[ i ] < 256

Where ‘DATA[ i ]’ is array element at index ‘i’.

Time limit: 1 sec
``````

## Approaches

### 01 Approach

The most important observation is for ‘N’ byte UTF 8 sequence looks like this for different ‘N’ according to the UTF-8 rules:

For ‘N’ = 1, it should be 0_ _ _ _ _ _ _

For ‘N’ = 2, it should be 1 1 0 _ _ _ _ _ 1 0 _ _ _ _ _ _

For ‘N’ = 3, it should be 1 1 1 0 _ _ _ _ 1 0 _ _ _ _ _ _  1 0 _ _ _ _ _ _

For ‘N’ = 4, it should be 1 1 1 1 0 _ _ _  1 0 _ _ _ _ _ _  1 0 _ _ _ _ _ _

1 0 _ _ _ _ _ _

As long as every byte in the array is of the right type, it is a valid UTF-8 encoding.

The steps are as follows:

• Start from index 0, determine each byte’s type and check its validity.
• There are five kinds of valid byte type: 0**, 10**, 110**,1110**, and 11110**
• Give them type numbers 0, 1, 2, 3, 4, which are the index of the first 0 from left.
• So, the index of the first ‘0’ determines the byte type because that could only make a UTF-8 of 1 byte according to the rules.
• If a byte belongs to one of them: if it is type 0, continue if it is type 2 or 3 or 4, check whether the following 1, 2, and 3 byte(s) are of type 1 or not.
• If not, return false; else, if a byte is type 1 or not of valid type, return false.
• If we found a valid set of UTF-8 sequences as described above, we will return true.