Table of contents
1. Introduction
2. Multibyte Characters in Ruby
3. Encoding Method in Ruby
3.1. Changing an Encoding
3.1.1. force_encoding() method
3.1.2. encode() method
3.2. Encoding Compatibility Check
4. Frequently Asked Questions
4.1. What are Objects in Ruby?
4.2. What is a block in Ruby?
4.3. What is Multibyte Character in Ruby?
4.4. What is a Global Variable in Ruby?
4.5. What are Global Functions in Ruby?
5. Conclusion
Last Updated: Mar 27, 2024

String Encodings and Multibyte Characters in Ruby

Author Rajat Agrawal

Introduction 

Ruby is a general-purpose, dynamic, reflective, object-oriented programming language. Everything in Ruby is an object. Ruby's development aimed to create a user interface between human programmers and the underlying computational machinery.

The string is one of the most common data types in programming languages.

A string is a sequence of characters, which may be letters, digits, or symbols. In contrast to some other programming languages, Ruby's strings are mutable objects: they can be modified in place rather than replaced entirely.

Let’s learn about string encodings and multibyte characters in Ruby.

Multibyte Characters in Ruby

In a multibyte encoding such as UTF-8, each character is represented by a sequence of one or more bytes. A character whose sequence is longer than one byte is called a multibyte character; the whole sequence still counts as a single character.

Support for multibyte characters was introduced in Ruby 1.9.

In Ruby 1.9, the String class was rewritten to accommodate multibyte characters. This is one of the major changes in Ruby 1.9, but it is not very noticeable, because code that uses multibyte strings usually just works. It is still important to understand how it works.

When a string contains multibyte characters, the number of bytes and the number of characters differ. The bytesize method, new in Ruby 1.9, returns the number of bytes, whereas the length and size methods return the number of characters.

Let’s understand the concept of Multibyte characters with the help of an example.

# -*- coding: utf-8 -*- # Specify Unicode UTF-8 characters

# Here str is a string literal containing a multibyte multiplication character
str = "2×2=4"

# The string contains 6 bytes which encode 5 characters
puts str.length 
puts str.bytesize 


Output:

5
6

In the above example, you can see that the string contains 6 bytes which encode 5 characters.

5: Characters: '2' '×' '2' '=' '4'

6: Bytes (hex): 32 c3 97 32 3d 34
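
The breakdown above can be reproduced with the chars and bytes methods. This is a minimal sketch; the multiplication sign is written as the escape \u00D7 (an assumption made here so the snippet does not depend on the file's source encoding), and the variable names are illustrative only.

```ruby
# "\u00D7" is the multiplication sign, the same character as in str above.
str = "2\u00D72=4"

chars     = str.chars                                # one element per character
hex_bytes = str.bytes.map { |b| format("%02x", b) }  # one element per byte

puts chars.length         # 5
puts hex_bytes.join(" ")  # 32 c3 97 32 3d 34

# Keep only the characters that need more than one byte in UTF-8.
multibyte = chars.select { |ch| ch.bytesize > 1 }
puts multibyte.length     # 1
```

Only the multiplication sign occupies two bytes (c3 97); every other character in this string is a single byte.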

Note: In the above code, the first line is a magic comment that declares the source encoding as UTF-8. In Ruby 1.9 the default source encoding is US-ASCII, so without this comment the interpreter rejects the string literal with an "invalid multibyte char" error. From Ruby 2.0 onward the default source encoding is UTF-8, so the comment is optional.

When characters are encoded with different numbers of bytes, a character index no longer maps directly to a byte offset. In the above code, the second character of str starts at the second byte, but the third character starts at the fourth byte. Random access to a character inside a multibyte string is therefore not guaranteed to be fast: when the [ ] operator is used to access a character or substring, the Ruby implementation may have to scan the string sequentially to find the requested character index. As a rule, prefer sequential algorithms for string processing: avoid calling the [ ] operator repeatedly and use the each_char iterator where possible.
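
As a rough illustration of sequential processing, the sketch below walks the string once with each_char instead of indexing into it. It also records where each character starts; note that these offsets are zero-based, while the paragraph above counts bytes from one. The variable names are illustrative only.

```ruby
# "\u00D7" is the multiplication sign used in the earlier example.
str = "2\u00D72=4"

# One sequential pass: count the ASCII digits.
digits = 0
str.each_char { |ch| digits += 1 if ch.match?(/\d/) }
puts digits           # 3

# Another pass: record the zero-based byte offset where each character starts.
offsets = []
offset = 0
str.each_char do |ch|
  offsets << offset
  offset += ch.bytesize
end
puts offsets.inspect  # [0, 1, 3, 4, 5]
```

The jump from offset 1 to offset 3 is the two-byte multiplication sign.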


Encoding Method in Ruby

In Ruby 1.9 and later, the String class has an encoding method that returns the string's encoding as an Encoding object.

Let’s understand the encoding method with the help of an example.

# -*- coding: utf-8 -*-
str1 = "5×5=25" # Note multibyte multiplication character
puts str1.encoding

str2 = "3+3=6" # All characters are in the ASCII subset of UTF-8
puts str2.encoding 


Output:

UTF-8
UTF-8

In the above example, the first string str1 contains a multibyte multiplication character, and in the second string str2, all the characters are in the ASCII subset of UTF-8. Both strings report UTF-8 as their encoding.

Note: A string literal normally takes the source encoding of the file in which it appears. This holds even when the literal contains only 7-bit ASCII characters, which is why str2 above reports UTF-8 rather than US-ASCII. One exception is the \u escape: a string literal that contains \u escapes has the encoding UTF-8 even if the source encoding is different.
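
A quick way to check what encoding a literal actually gets in your Ruby version is to compare it with __ENCODING__, which returns the source encoding of the current file. This is a small sketch with illustrative variable names, not an exhaustive description of the rules.

```ruby
ascii_only  = "3+3=6"         # plain ASCII literal
with_escape = "5\u00D75=25"   # literal containing a \u escape

# The ASCII-only literal carries the file's source encoding.
puts ascii_only.encoding == __ENCODING__  # true

# A literal with a \u escape is UTF-8.
puts with_escape.encoding                 # UTF-8

# ascii_only? inspects the content, independently of the encoding label.
puts ascii_only.ascii_only?               # true
puts with_escape.ascii_only?              # false
```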

Changing an Encoding

We can change the encoding of a string with two instance methods of the String class:

1.) force_encoding()

2.) encode()

force_encoding() method

The force_encoding() method sets the Encoding of a string to a new Encoding without changing the internal byte representation of the string. It modifies the string in place and returns it.

Let’s understand this with an example.

# -*- coding: utf-8 -*-

str1 = "Coding" # All characters are in the ASCII subset of UTF-8
str2 = "Ninjas" # All characters are in the ASCII subset of UTF-8

# Relabel str1 as ISO-8859-1; its bytes are not changed
str1.force_encoding(Encoding::ISO_8859_1)

puts str1.encoding
puts str2.encoding


Output:

ISO-8859-1
UTF-8

In the above example, both strings str1 and str2 initially have the encoding UTF-8. Calling force_encoding() then relabels str1 as ISO-8859-1; its bytes stay the same, and str2 is unaffected.
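
With an ASCII-only string the effect of force_encoding() is hard to see, so here is a sketch that uses a string with a multibyte character. The bytes stay identical, but they are interpreted differently afterwards. The .dup calls are a precaution (an assumption about your setup) in case string literals are frozen.

```ruby
# "\u00D7" is the multiplication sign: 2 bytes in UTF-8.
str = "2\u00D72=4".dup
bytes_before  = str.bytes
length_before = str.length                 # 5 characters when read as UTF-8

str.force_encoding(Encoding::ISO_8859_1)   # relabel only, no conversion

puts str.bytes == bytes_before             # true: the bytes are untouched
puts str.length                            # 6: each byte is now one character

# Relabeling with an encoding the bytes do not fit gives an invalid string.
broken = "\xC3".dup.force_encoding(Encoding::UTF_8)
puts broken.valid_encoding?                # false
```

Because nothing is converted, force_encoding() is best suited to cases where the bytes are already correct and only the label is wrong.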

encode() method

The encode() method transcodes a string: it returns a new string whose bytes have been converted to represent the same characters in the target encoding, and whose associated encoding is set to that target. The original string is left unchanged.

Let’s understand this with an example.

# -*- coding: utf-8 -*-

str1 = "Coding" # All characters are in the ASCII subset of UTF-8
str2 = "Ninjas" # All characters are in the ASCII subset of UTF-8

# Transcode str1 from UTF-8 to ISO-8859-1; the result is a new string
str3 = str1.encode(Encoding::ISO_8859_1)

puts str1.encoding
puts str2.encoding
puts str3.encoding


Output:

UTF-8
UTF-8
ISO-8859-1

In the above example, both strings str1 and str2 initially have the encoding UTF-8. We transcoded str1 to ISO-8859-1 with the encode() method and stored the result in a third variable, str3. The encodings of str1 and str2 stay the same, and only str3 has the new encoding.
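
To see that encode() really rewrites the bytes, the sketch below transcodes a string containing a multibyte character and then tries a character that ISO-8859-1 cannot represent. The variable names and the choice of U+2603 (a snowman symbol) are illustrative assumptions.

```ruby
utf8   = "2\u00D72=4"                       # multiplication sign, 2 bytes in UTF-8
latin1 = utf8.encode(Encoding::ISO_8859_1)  # new string, converted bytes

puts utf8.bytesize    # 6
puts latin1.bytesize  # 5: the sign is the single byte 0xD7 in ISO-8859-1
puts latin1.length    # 5: same number of characters as before
puts utf8.encoding    # UTF-8: the receiver is unchanged

# A character with no ISO-8859-1 equivalent raises an error by default.
snowman = "\u2603"
begin
  snowman.encode(Encoding::ISO_8859_1)
rescue Encoding::UndefinedConversionError => e
  puts "cannot convert: #{e.class}"
end

# The undef and replace options substitute a fallback instead of raising.
replaced = snowman.encode(Encoding::ISO_8859_1, undef: :replace, replace: "?")
puts replaced         # ?
```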

Encoding Compatibility Check

Two strings must have compatible encodings in order to perform certain string operations, such as concatenation and pattern matching.

The class method Encoding.compatible? tells you whether two strings have compatible encodings. If they do, it returns the Encoding that the result of combining them would have; if they do not, it returns nil. Compatibility depends on the content as well as on the encoding label: a string that contains only ASCII characters is compatible with any string in an ASCII-compatible encoding.

Let’s understand this with an example.

Example1:

# -*- coding: utf-8 -*-
str1 = "\xa1"
str2  = "\xa1\xa1"

#Changing the encoding of str1
str1.force_encoding("iso-8859-1")

#Changing the encoding of str2
str2.force_encoding("euc-jp")

# Checking compatibility (p prints nil explicitly; puts would print an empty line)
p Encoding.compatible?(str1, str2)


Output:

nil

In the above example, force_encoding() labels str1 as ISO-8859-1 and str2 as EUC-JP. Both strings contain non-ASCII bytes and their encodings differ, so str1 is incompatible with str2 and the result is nil.

Example2:

# -*- coding: utf-8 -*-
str1 = "Coding"
str2  = "Ninjas"

#Checking Compatibility
puts Encoding.compatible?(str1,str2)


Output:

UTF-8

In the above example, we can see that the encoding of both strings str1 and str2 is the same, i.e., UTF-8. Therefore, str1 is compatible with str2, and the output is UTF-8.
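
The sketch below shows how compatibility plays out in concatenation. It assumes three illustrative strings: one UTF-8 string with a non-ASCII character, the same text transcoded to ISO-8859-1, and an ASCII-only string labeled ISO-8859-1.

```ruby
utf8_text   = "caf\u00E9"                                      # UTF-8, non-ASCII
latin1_text = "caf\u00E9".encode(Encoding::ISO_8859_1)         # ISO-8859-1, non-ASCII
ascii_text  = "cafe".dup.force_encoding(Encoding::ISO_8859_1)  # ISO-8859-1, ASCII only

# Different labels, but ascii_text holds only ASCII, so the pair is compatible.
puts Encoding.compatible?(utf8_text, ascii_text) == Encoding::UTF_8  # true
combined = utf8_text + ascii_text
puts combined.encoding                                               # UTF-8

# Both strings hold non-ASCII bytes in different encodings: incompatible.
p Encoding.compatible?(utf8_text, latin1_text)                       # nil
begin
  utf8_text + latin1_text
rescue Encoding::CompatibilityError => e
  puts "cannot concatenate: #{e.message}"
end
```

When two strings are incompatible, transcoding one of them with encode() before combining them is usually the fix.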


Frequently Asked Questions

What are Objects in Ruby?

In Ruby, everything is an object. All objects have a unique identification; they can also maintain a state and exhibit behaviour in response to messages. Usually, these messages are sent out via method calls. A string is an example of a Ruby object.

What is a block in Ruby?

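In Ruby, a block is a chunk of code, written between do...end or curly braces, that is passed to a method call. The method can run the block with yield and pass it values, much like arguments to a method, so a block can perform any calculation or manipulation that a method body can. Unlike a method, a block has no name and does not belong to any object; it exists only as part of the call it is attached to.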
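
As a small sketch tied to this article's topic, the hypothetical method each_char_size below yields every character and its byte count to the caller's block.

```ruby
# each_char_size is a made-up helper for illustration, not part of Ruby.
def each_char_size(str)
  str.each_char { |ch| yield ch, ch.bytesize }
end

sizes = []
# The block in curly braces receives the two yielded values.
each_char_size("2\u00D72") { |_ch, size| sizes << size }
puts sizes.inspect  # [1, 2, 1]
```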

What is Multibyte Character in Ruby?

In a multibyte encoding such as UTF-8, each character is represented by a sequence of one or more bytes. A character whose sequence is longer than one byte is called a multibyte character; the whole sequence still counts as a single character.

What is a Global Variable in Ruby?

A global variable is a variable with global scope: it can be read and assigned from anywhere in the program. In Ruby, global variable names begin with $.

What are Global Functions in Ruby?

Kernel-specified methods and any methods defined at the top-level, outside of any classes, are Global Functions. Global functions are defined as private methods of the Object class.
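
A minimal sketch of a top-level method, assuming the code is run as an ordinary script file (interactive shells may behave differently). The name byte_report is made up for illustration.

```ruby
# Defined outside any class, so it can be called from anywhere without a receiver.
def byte_report(str)
  "#{str.length} chars, #{str.bytesize} bytes"
end

report = byte_report("2\u00D72=4")
puts report                                        # 5 chars, 6 bytes

# Top-level methods become private methods of Object.
puts Object.private_method_defined?(:byte_report)  # true
```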

Conclusion

In this article, we have extensively discussed String Encodings and Multibyte characters in Ruby with the help of code examples.

If you want to learn more, check out our articles on Object Marshalling in Ruby, Tainting Objects in Ruby, Copying Objects In Ruby, How To Invoke Global Functions In Ruby?, and Object References in Ruby.
 


Happy Coding!
