Unicode is a computing standard that aims to provide a common encoding and representation for the characters and symbols used in the world's written languages.
Basically, computers can understand and communicate only with numbers. We may see text on our screens, but inside the computer circuits everything is encoded as numbers in binary form, with each letter or symbol represented by a number. The mapping of letters and symbols to numbers is done via a character set: a predefined list of characters and their assigned numbers recognized by the computer's hardware and software.
One of the most widely adopted character sets, ASCII, uses the numbers 0 through 127 to represent all English characters as well as special control characters. European ISO character sets are similar to ASCII, but contain additional characters for European languages.
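As a concrete illustration, the letter-to-number mapping of a character set can be inspected from any programming language; the sketch below uses Python's built-in `ord` and `chr` functions (the specific numbers are defined by the ASCII standard):

```python
# ASCII assigns each English letter, digit and control character
# a number from 0 to 127.
print(ord("A"))   # 65 - uppercase A
print(ord("a"))   # 97 - lowercase a
print(ord("\n"))  # 10 - the "line feed" control character
print(chr(66))    # B  - the reverse mapping, from number back to character
```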
Until recently, compatibility issues among computer systems using different character sets were very common. A typical example is an FTP server whose file names use a different character set from that of the user's computer running an FTP client application. The server may be using an East Asian character set (e.g. Japanese) while the user's computer uses a European one (e.g. Central European). The server's file listing then appears unreadable on the user's screen, making no sense at all, because the two sides assign conflicting letters and symbols to the same numbers.
Similar problems existed with web pages written in languages whose character sets were not automatically recognized by web browsers: users had to tell the browser which encoding to use in order to render the pages properly.
These decades-long incompatibility problems led to the development and introduction of the Unicode Standard. It changed all that by defining a unique number for each and every character and symbol used around the world, regardless of computer, operating system, application or language.
It has been adopted by all modern operating systems and software providers (including 2BrightSparks and its SyncBack applications) and now allows data to move across many different platforms, devices and applications without corruption and without the need for translation tables. It also allows user interfaces to be displayed in multiple languages on the same device, and documents to be typed in a word processor using more than one script, since the device is now capable of displaying multiple languages.
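To illustrate the "one unique number per character" idea, the short Python sketch below prints the Unicode code point of characters from several different scripts (the values shown are defined by the Unicode Standard; the sample characters are arbitrary):

```python
# Every character has exactly one Unicode code point, regardless of
# platform, application or language.
for ch in ["A", "é", "あ", "中"]:
    # U+XXXX is the conventional hexadecimal notation for a code point.
    print(ch, "U+%04X" % ord(ch))
# Prints: A U+0041, é U+00E9, あ U+3042, 中 U+4E2D
```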
SyncBack applications support Unicode, allowing filenames to be preserved correctly between source and destination during file transfers, provided that the source and destination locations also support Unicode. Unicode also allows SyncBack users to switch the application's user interface between different languages instantly on any modern Windows installation, without installing separate language packs.
Unicode may have solved many problems, but as good as it is, it can still cause issues for users. An example of this is file name uniqueness. A character such as é can be encoded either as a single code point (U+00E9) or as the letter e followed by a combining acute accent (U+0301). The two forms look identical, but the underlying byte representation of each one is different. This is called Unicode equivalence, and if two filenames differ only in this way then Windows treats them as different files, so both can exist in the same directory simultaneously. This means two separate files might be created with the same visual filename (i.e. the filenames look identical on screen) but with different byte representations (i.e. the filenames are not actually the same).
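This equivalence effect can be reproduced with Python's standard `unicodedata` module; a minimal sketch (the name "café" is just an illustrative example):

```python
import unicodedata

composed = "caf\u00e9"     # "café" with é as a single code point (NFC form)
decomposed = "cafe\u0301"  # "café" as e + combining acute accent (NFD form)

print(composed, decomposed)            # both render as "café"
print(composed == decomposed)          # False - the byte sequences differ
print(len(composed), len(decomposed))  # 4 vs 5 code points

# Normalizing both strings to the same form makes them compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

This is why deduplicating or comparing filenames by their visual appearance alone is unreliable: robust tools normalize names to a single form before comparing them.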
This may cause problems for certain applications and/or workflows. As far as SyncBack is concerned, those files are treated as different, so it is possible to back up and restore both, as long as the target storage location can accept them.